
Three fundamental flaws of SIMD


Marcus
Aug 12, 2021, 1:08:41 PM

I posted this article on my blog:

https://www.bitsnbites.eu/three-fundamental-flaws-of-simd

...basically stating the obvious (IMO), without putting too much
emphasis on how some alternatives solve some of the issues etc.

There were quite a few replies on reddit [1] and Hacker News [2],
and judging by the comments it seems that many software developers are
actually very fond of packed SIMD. :-/

/Marcus


[1]
https://www.reddit.com/r/programming/comments/p0yn45/three_fundamental_flaws_of_simd

[2] https://news.ycombinator.com/item?id=28114934

Thomas Koenig
Aug 12, 2021, 1:27:37 PM

Marcus <m.de...@this.bitsnbites.eu> schrieb:
> I posted this article on my blog:
>
> https://www.bitsnbites.eu/three-fundamental-flaws-of-simd
>
> ...basically stating the obvious (IMO), without putting too much
> emphasis on how some alternatives solve some of the issues etc.
>
> There were quite a few replies on reddit [1] and hackernesws [2],
> and judging by the comments it seems that many software developers are
> actually very fond of packed SIMD. :-/

People are familiar with SIMD, and they apparently are not willing
to entertain the thought that there could be something better.

Plus, somebody who has some skill at wringing something useful
out of SIMD, which may be a source of pride and even monetary
compensation, will shy away from thinking too hard about anything
that may be better.

(When I learned about SIMD, coming from a vector computer background,
I certainly was disappointed by its limitations.)

Stefan Monnier
Aug 12, 2021, 2:36:41 PM

Marcus [2021-08-12 19:08:37] wrote:
> I posted this article on my blog:
> https://www.bitsnbites.eu/three-fundamental-flaws-of-simd
> ...basically stating the obvious (IMO), without putting too much
> emphasis on how some alternatives solve some of the issues etc.
> There were quite a few replies on reddit [1] and hackernesws [2],
> and judging by the comments it seems that many software developers are
> actually very fond of packed SIMD. :-/

Most software developers are basically alien to computer architecture
as an engineering discipline.

The ISA is imposed on them from outside because it is decided by the
machines they can use/buy. From that point of view, they can only
understand your article as "don't use the SIMD primitives provided by
your hardware" ;-)


Stefan


BGB
Aug 12, 2021, 3:33:55 PM

And, from the hardware design front, SIMD is comparatively cheap and easy
to implement, more so if it can be done in a way which mostly reuses
parts of the CPU core that one already has lying around for non-SIMD
operations.

Doing vectors by using more advanced forms of pipelining is not exactly
ideal in some cases, particularly if it would require increasing the
effective number of pipeline stages or the number of register file ports.


Though, I will admit that SIMD isn't necessarily the best option in an
ISA design elegance sense...

George Neuner
Aug 12, 2021, 4:33:16 PM

On Thu, 12 Aug 2021 17:27:35 -0000 (UTC), Thomas Koenig
<tko...@netcologne.de> wrote:

>(When I learned about SIMD, coming from a vector computer background,
>I certainly was disappointed by its limitations.)

Coming from Connection Machine I was even /more/ disappointed.
POD vectors are nice, but vectors of pointers are a lot nicer.

MitchAlsup
Aug 12, 2021, 5:09:22 PM

On Thursday, August 12, 2021 at 12:27:37 PM UTC-5, Thomas Koenig wrote:
> Marcus <m.de...@this.bitsnbites.eu> schrieb:
> > I posted this article on my blog:
> >
> > https://www.bitsnbites.eu/three-fundamental-flaws-of-simd
> >
> > ...basically stating the obvious (IMO), without putting too much
> > emphasis on how some alternatives solve some of the issues etc.
> >
> > There were quite a few replies on reddit [1] and hackernesws [2],
> > and judging by the comments it seems that many software developers are
> > actually very fond of packed SIMD. :-/
<
> People are familiar with SIMD, and they apparently are not willing
> to entertain the thought that there could be something better.
<
My personal guess is that people simply write code in HLLs and
let the compiler do its thing--not really caring about the SIMDness
of the code.
<
Compiler writers are aware and conscious of SIMDness, caught between
the ever-growing SIMD complexity and the screwy things they have to
do to get good code spit out the other end.
<
Operating system people are pretty ignorant of it, except for the size of the
stuff they have to save/restore around context switches, and when they
get inside various system library functions that the compiler spat out in
SIMD form.

Michael S
Aug 12, 2021, 6:59:08 PM

You could spend your time better than writing it.
I could spend my time better than reading it.

MitchAlsup
Aug 12, 2021, 8:05:37 PM

On Thursday, August 12, 2021 at 12:08:41 PM UTC-5, Marcus wrote:
> I posted this article on my blog:
>
> https://www.bitsnbites.eu/three-fundamental-flaws-of-simd

The fundamental flaw in SIMD is that:

a) one should be able to encode a vectorizable loop once and have it
run at the maximum performance of each future-generation machine.
One should never have to spit out different instructions just because
the SIMD registers were widened.

b) one should be able to encode a SIMD instruction such that it performs
as wide as the implementation has resources (or as narrow) using the
same OpCodes.
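
As a purely illustrative sketch of what a) and b) ask for (assuming ARM's
SVE ACLE and <arm_sve.h>, which the text above does not mention), here is a
vector add written once; the same object code runs at whatever register
width an implementation provides, and the governing predicate also covers
the tail:

#include <arm_sve.h>
#include <stdint.h>

/* c[i] = a[i] + b[i]; no register width is baked into the instructions. */
void vec_add(float *c, const float *a, const float *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {   /* svcntw() = lanes per register */
        svbool_t pg = svwhilelt_b32(i, n);        /* active lanes, incl. the tail */
        svfloat32_t va = svld1(pg, &a[i]);
        svfloat32_t vb = svld1(pg, &b[i]);
        svst1(pg, &c[i], svadd_x(pg, va, vb));
    }
}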

Quadibloc
Aug 12, 2021, 11:20:24 PM

On Thursday, August 12, 2021 at 11:08:41 AM UTC-6, Marcus wrote:

> and judging by the comments it seems that many software developers are
> actually very fond of packed SIMD. :-/

Well, if it's all you can get, it's better than nothing.

On my web page, there's a discussion of an imaginary computer architecture
which began as a simple example of how computers work, but which then
grew to include every feature that existed on some historical computer
somewhere.

So it included both packed SIMD, like MMX and its successors, and
unpacked SIMD like the Cray-1, and illustrated how the two differed;
for example, there are two different ways to provide hardware assist
for FFT, and for some reason one was more suitable to packed SIMD
and the other to unpacked SIMD, or at least so I thought when I
prepared the page.

http://www.quadibloc.com/arch/ar0102.htm

et seq. ...

John Savard

Ivan Godard
Aug 12, 2021, 11:24:31 PM

Fairly described as "ISA Stockholm Syndrome"?

Marcus
Aug 13, 2021, 2:53:04 AM

Funny thing is that I come from a software developer background, and I
have hated those aspects mentioned in the article for as long as I can
remember.

When I first learned about how Cray-1 worked (some four years ago I
think) I was like "Wow! It must be a joy to program that thing".

Then I created my own vector ISA and wrote some demos, and I was like
"Wow! It's really a joy to program this thing".

I naively thought that others would have similar feelings.

/Marcus

Marcus
Aug 13, 2021, 4:29:19 AM

Feel free to skip reading it :-)

I think that some people need to read it, though, since apparently it's
far from obvious to most people. I wrote it since I found myself
explaining the same things over and over in different forums, and it's
easier to have a text to refer to.

/Marcus

Marcus
Aug 13, 2021, 4:35:18 AM

Yep!

I also added:

c) Packed SIMD is just as sensitive to data hazards as scalar code is,
so you either need proper OoO HW, or you need to unroll all SIMD loops
in SW. And since /some/ implementations are in-order (at least for the
SIMD part), the compiler /always/ unrolls loops -> code size bloat ->
I$ penalty.

d) Tail handling for data sets that are not multiples of the SIMD
width...

/Marcus

Terje Mathisen
Aug 13, 2021, 4:54:54 AM

This has been done well in C#: they have vector operations that will
turn into optimal SIMD instructions during the final JIT/AOT stage. This
way they can optimize for the local CPU, and update the compiler for
future platforms.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Anton Ertl
Aug 13, 2021, 5:05:43 AM

Marcus <m.de...@this.bitsnbites.eu> writes:
>I posted this article on my blog:
>
>https://www.bitsnbites.eu/three-fundamental-flaws-of-simd

Flaw 1 is addressed with SVE. Interestingly, SVE has a flaw that
becomes apparent in the recently announced CPUs: processes cannot
migrate between cores with different register width, so they choose
128 bit wide registers for both small and large cores. How could SVE
have been modified to avoid this flaw? Does RVV have it?

Flaw 2 starts out with a false claim. There is no requirement that
execution units are as wide as the registers, and AMD has delivered a
lot of CPUs where the execution units were half as wide as the
registers: Palomino and K8 implemented SSE (and, for K8, SSE2) with
64-bit-wide functional units; Jaguar and later cats, Bulldozer and its
descendants, and Zen 1 implemented AVX-256 with 128-bit-wide
functional units. Could be a solution for ARM's SVE flaw.

In-order CPUs don't play a role for the AMD64 architecture, and
compilers certainly don't unroll AVX code for that (there is no
AVX-capable in-order CPU). Compilers unroll loops in order to reduce
loop overhead, and for additional optimization options.

Flaw 3: In the comments section, you reveal that tail handling is a
problem for auto-vectorization. My take is that auto-vectorization is
a flawed concept, mainly because it is unreliable: you rub the
compiler the wrong way, and it will stop auto-vectorizing the loop
without giving any warning. Manual vectorization is a better
approach, and when done right, tail handling is no problem. See
Sections 2.2-2.4 of

http://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Michael S
Aug 13, 2021, 6:56:54 AM

On Friday, August 13, 2021 at 12:05:43 PM UTC+3, Anton Ertl wrote:
> Marcus <m.de...@this.bitsnbites.eu> writes:
> >I posted this article on my blog:
> >
> >https://www.bitsnbites.eu/three-fundamental-flaws-of-simd
> Flaw 1 is addressed with SVE. Interestingly, SVE has a flaw that
> becomes apparent in the recently announced CPUs: processes cannot
> migrate between cores with different register width, so they choose
> 128 bit wide registers for both small and large cores. How could SVE
> have been modified to avoid this flaw? Does RVV have it?
>
> Flaw 2 starts out with a false claim. There is no requirement that
> execution units are as wide as the registers, and AMD has delivered a
> lot of CPUs where the execution units were half as wide as the
> registers: Palomino and K8 implemented SSE (and, for K8, SSE2) with
> 64-bit wide functional units, Jaguar and later cats, Bulldozer and its
> descendents, and Zen 1 implemented AVX-256 with 128-bit-wide
> functional units. Could be a solution for ARMs SVE flaw.
>

Intel had 64-bit FP EUs for SSE/SSE2/3 in their main CPU line from the introduction of SSE up until Merom,
i.e. ~8 years later.
On the Atom side, single-precision EUs were 128-bit, but double precision was half-width for many generations.
Maybe it still is; I didn't check.

I am a fan of a >1 ratio between register width and EU width myself.
But the reasonable ratios nowadays, i.e.
in the presence of caches,
wide superscalar cores,
several cores sharing at least part of the on-chip cache,
power consumption as the main bottleneck,
no fewer than 16 software-visible VRs,
and other things I forgot,
are 1, 2 or 4.
A CRAY-like ratio today is a horrible idea.
IIRC, the Alpha Tarantula proposal argued for 8, but its cache subsystem was very different from how it's done today
and power consumption was low on its list of priorities.

> In-order CPUs don't play a role for the AMD64 architecture, and
> compilers certainly don't unroll AVX code for that (there is no
> AVX-capable in-order CPU). Compilers unroll loops in order to reduce
> loop overhead, and for additional optimization options.
>

One reason to unroll, even on OoO, is a true (RaW) dependency through an
accumulator, which is very typical in inner-product loops.
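
As a small sketch of that point (the function name and the unroll factor of
four are mine, purely for illustration): splitting the accumulator gives the
core several independent dependency chains to keep in flight, and only the
final reduction serializes. Note that this reassociates the FP sum, which a
compiler will not do on its own without -ffast-math or similar.

#include <stddef.h>

float dot(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {   /* four independent RaW chains */
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                  /* leftover elements */
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}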

> Flaw 3: In the comments section, you reveal that tail handling is a
> problem for auto-vectorization. My take is that auto-vectorization is
> a flawed concept, mainly because it is unreliable: you rub the
> compiler the wrong way, and it will stop auto-vectorizing the loop
> without giving any warning. Manual vectorization is a better
> approach, and when done right, tail handling is no problem.

Very true.

Marcus
Aug 13, 2021, 7:47:39 AM

On 2021-08-13 10:54, Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Thursday, August 12, 2021 at 12:08:41 PM UTC-5, Marcus wrote:
>>> I posted this article on my blog:
>>>
>>> https://www.bitsnbites.eu/three-fundamental-flaws-of-simd
>>
>> The fundamental flaw in SIMD is
>>
>> a) one should be able to encode a vectorizable loop once and have it
>> run at the maximum performance of each future generation machine.
>> One should never have to spit out different instructions just because
>> the width of the SIMD registers was widened.
>>
>> b) one should be able to encode a SIMD instruction such that it performs
>> as wide as the implementation has resources (or as narrow) using the
>> same OpCodes.
>
> This has been done well in C#, they have vector operations that will
> turn into optimal SIMD instructions during the final JIT/AOT stage, this
> way they can optimize for the local CPU, and update the compiler for
> future platforms.

...which probably means that they will be able to map well to other
kinds of vector architectures too.

Even if a compiler / language can hide the details of the underlying
vector architecture, packed SIMD still suffers from code bloat though:
Essentially you need to describe things like pipelining and tail
handling in code rather than having the HW take care of it, which hurts
code density /and/ increases front end load.

>
> Terje
>

Marcus
Aug 13, 2021, 7:51:14 AM

On 2021-08-13 05:24, Ivan Godard wrote:

[snip]

> Fairly described as "ISA Stockholm Syndrome"?

I'll have to remember that! :-D

/Marcus

BGB
Aug 13, 2021, 11:59:42 AM

On 8/13/2021 3:11 AM, Anton Ertl wrote:
> Marcus <m.de...@this.bitsnbites.eu> writes:
>> I posted this article on my blog:
>>
>> https://www.bitsnbites.eu/three-fundamental-flaws-of-simd
>
> Flaw 1 is addressed with SVE. Interestingly, SVE has a flaw that
> becomes apparent in the recently announced CPUs: processes cannot
> migrate between cores with different register width, so they choose
> 128 bit wide registers for both small and large cores. How could SVE
> have been modified to avoid this flaw? Does RVV have it?
>

IME, moving to vectors larger than 128 bits doesn't tend to gain much
over 128 bits. As the register size gets larger, there is less that can
utilize it effectively.

It is like integer size:
16 bits: can be used heavily, may or may not be sufficient;
32 bits: used heavily, usually sufficient;
64 bits: used occasionally, often useful;
128 bits: used rarely, sometimes useful;
256 bits: novelty size...

For SIMD vectors, there is roughly a factor of 4 relation:
64 bits: can be used heavily, frequently insufficient;
128 bits: used heavily, usually sufficient;
256 bits: used occasionally, sometimes useful;
512 bits: rarely useful.

Once the SIMD vector exceeds the length of the data one would likely
express using vectors, its utility drops off sharply.


The march to endlessly bigger SIMD vectors would make sense if each step
up gave a linear improvement, but makes less sense if it is in
diminishing returns territory.


So, this is part of why my current policy for BJX2 is doing 64 and 128
bit "vectors" and calling it good enough.

Not going to go 256 bits, at least not on a 3-wide core.


Similar tricks to what I used for 128-bit vectors could be used to allow
256-bit vectors on a 6-wide / SMT capable core, since most of the
mechanisms and plumbing would already be in place. Such a vector would
effectively use groups of 4 registers in parallel (and ganging the
memory ports from both SMT threads to allow 256-bit load/store).

So, if I had a GSVY, it would implicitly depend on the
existence of WEX-6W...


But, then one might ask: OK, so why not then just issue a 128-bit SIMD
operation on Lane1A and Lane1B?...

And, I could respond: What exactly do you think it is that these 256-bit
operations would be doing?...

For related reasons, one can't have 128-bit operations on a core that
doesn't already support WEX-3W.


> Flaw 2 starts out with a false claim. There is no requirement that
> execution units are as wide as the registers, and AMD has delivered a
> lot of CPUs where the execution units were half as wide as the
> registers: Palomino and K8 implemented SSE (and, for K8, SSE2) with
> 64-bit wide functional units, Jaguar and later cats, Bulldozer and its
> descendents, and Zen 1 implemented AVX-256 with 128-bit-wide
> functional units. Could be a solution for ARMs SVE flaw.
>

FWIW: Despite having 128-bit SIMD, my BJX2 core doesn't actually
(currently) have any units which work natively with 128-bit data.

The register file still mostly operates as 64 bits, and many 128 bit
operations are effectively performed by running two 64-bit operations in
parallel. In many cases, it is simply expanding a single SIMD
instruction over multiple lanes (effectively, a bundle of virtual
operations).

In the ALUX extensions, in a few cases, the ALUs effectively combine
"Voltron style" by feeding bits between each other.

The 128-bit shift operations are effectively two 64-bit funnel shift
operations "in disguise", ...


Well, also having 64-bit units which can occasionally combine for
128-bit operations is a lot less wasteful than having a 128-bit unit
where 99% of the time its capabilities go unused.

And, if one's data doesn't fit nicely into a 128-bit vector... they can
often use two different 64-bit vector ops in parallel.



Packed integer SIMD isn't done by adding more ALUs, but rather by
splitting up the sub-units within a carry-select adder:
If you select the results where each 16-bit element had a carry-in of
zero, a packed-word ADD magically appears.
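
The same lane-isolation idea can be sketched in software (SWAR) form; the
hardware version selects between precomputed carry-in-0/carry-in-1 results
rather than masking, but the principle of stopping carries at 16-bit lane
boundaries is the one described above:

#include <stdint.h>

/* Add four packed 16-bit lanes inside a 64-bit word, with no carry
   propagating from one lane into the next. */
uint64_t padd16x4(uint64_t a, uint64_t b)
{
    uint64_t lo = (a & 0x7FFF7FFF7FFF7FFFull) + (b & 0x7FFF7FFF7FFF7FFFull);
    return lo ^ ((a ^ b) & 0x8000800080008000ull);   /* patch up each lane's top bit */
}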


For FPU SIMD, there is only one FPU, so SIMD operations internally just
do 4 operations in a row (by sequentially feeding the values into the
FADD or FMUL units and capturing the result out the other end).


I did it the way I did not necessarily because it was "best", but
because I could do it cheaply...

There are ways it could be made better/faster/..., but they would add
resource cost.


Similarly, "going narrower" and presenting vector operations in terms of
scalar operations on individual elements would either:
Slow everything down by essentially turning it back into scalar code;
Require adding a large number of execute lanes and register ports.

Neither of these options is desirable.


> In-order CPUs don't play a role for the AMD64 architecture, and
> compilers certainly don't unroll AVX code for that (there is no
> AVX-capable in-order CPU). Compilers unroll loops in order to reduce
> loop overhead, and for additional optimization options.
>

Yep.


> Flaw 3: In the comments section, you reveal that tail handling is a
> problem for auto-vectorization. My take is that auto-vectorization is
> a flawed concept, mainly because it is unreliable: you rub the
> compiler the wrong way, and it will stop auto-vectorizing the loop
> without giving any warning. Manual vectorization is a better
> approach, and when done right, tail handling is no problem. See
> Sections 2.2-2.4 of
>
> http://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf
>

Generally agreed.
I am actively against autovectorization on BJX2.
To what extent vectors are usable, they are in the form of explicit
language extensions.

I have done things in a way which I feel doesn't suck nearly as
bad as the "xmmintrin.h" system, and also allows a subset of the GCC
vector extensions.

My approach also tries to be much less of an "ugly wart" on the C
language, so does things in ways that I feel are more consistent with
traditional C semantics (can use native operators, cast conversions, ...).


Though, this does mean some amount of expansion to the typesystem, but
as I feel it, expanding out the C typesystem and numeric tower is much
less evil than something like auto-vectorization, which implies
interpreting C code as if it were something very unlike what was
actually written (and for the compiler to essentially perform magic
tricks with the semantics), or "xmmintrin.h" which is more like thinly
veiled x86 ASM shoehorned into C's existing syntax.

Granted, it does also mean that, unless written to do so, existing
portable C code will not make any use of these sorts of vector extensions.

...

MitchAlsup
Aug 13, 2021, 12:20:13 PM

You would enjoy programming in My 66000 ISA.

MitchAlsup
Aug 13, 2021, 12:29:35 PM

I am afraid I have to disagree with you here:: The march to endlessly bigger
SIMD vectors is because the root of the necessity was never addressed
correctly.
<
For example, all of the SIMD instructions in x86-64 are "performed"
by the addition of exactly 2 My 66000 instructions, and many more
than the x86-64 has are possible to express. ISA explosions should be
proactively prevented, not embraced.
>
>
> So, this is part of why my current policy for BJX2 is doing 64 and 128
> bit "vectors" and calling it good enough.
>
> Not going to go 256 bits, at least not on a 3-wide core.
<
Not going to 256-bits wide EVER.
All of these wide things are performed using CARRY instruction.

Thomas Koenig
Aug 13, 2021, 12:44:10 PM

BGB <cr8...@gmail.com> schrieb:
> On 8/13/2021 3:11 AM, Anton Ertl wrote:
>> Marcus <m.de...@this.bitsnbites.eu> writes:
>>> I posted this article on my blog:
>>>
>>> https://www.bitsnbites.eu/three-fundamental-flaws-of-simd
>>
>> Flaw 1 is addressed with SVE. Interestingly, SVE has a flaw that
>> becomes apparent in the recently announced CPUs: processes cannot
>> migrate between cores with different register width, so they choose
>> 128 bit wide registers for both small and large cores. How could SVE
>> have been modified to avoid this flaw? Does RVV have it?
>>
>
> IME, moving to vectors larger than 128 bits doesn't tend to gain much
> over 128 bits. As the register size gets larger, there is less that can
> utilize it effectively.

I grant you that there is a decreasing return, which currently
levels off at 256 bits - AVX2 is being used a lot in video codecs.

AVX512 is a fiasco, but mainly because Intel overspent its heat
budget and has to clock down the CPU a _lot_ to use it, destroying
most if not all of the advantage of using it.


> Generally agreed.
> I am actively against autovectorization on BJX2.
> To what extent vectors are usable, they are in the form of explicit
> language extensions.

Count me out for using your architecture, then.

Explicit language extensions

- lock in the user to a specific architecture and compiler
- expose architecture details which should not be visible
- make code hard to write, read and thus maintain

[...]

> My approach also tries to be much less of an "ugly wart" on the C
> language, so does things in ways that I feel are more consistent with
> traditional C semantics (can use native operators, cast conversions, ...).

There are other programming languages than C. What would you
propose for Fortran, for example?

Thomas Koenig
Aug 13, 2021, 12:45:04 PM

MitchAlsup <Mitch...@aol.com> schrieb:

> You would enjoy programming in My 66000 ISA.

So would I.

Any chance of this happening in the forseeable future?

BGB
Aug 13, 2021, 1:45:21 PM

The traditional solution to this is to use ifdefs.

One already has to do this if they want the same code to use SSE and
NEON in a semi-effective manner.

Or use the common subset that exists with GCC's
"__attribute__((vector_size(N)))" system.

Otherwise, it is like saying that no one can use inline ASM, or that one
will refuse to use a compiler which supports using inline ASM.


One can just write traditional scalar code, and have it perform as such.
Its performance may suck in comparison, but it isn't precluded.


IME, auto vectorization only sometimes helps on traditional targets, and
frequently (if the optimizer is overzealous) can turn into a wrecking
ball on the performance front (performing worse, sometimes
significantly, than its non vectorized counterpart).


If it were up to me though, there would be explicit attributes to tell
the compiler what it should or should not vectorize.

Or, maybe, if some standardized way were defined to specify vector types
in C (preferably more concise than GCC's notation).

Say, for example, if the compiler allowed:
float[[vector(4)]] vec; //define a 4-element floating point vector.
Or:
[[vector(4)]] float vec; //basically equivalent.


> [...]
>
>> My approach also tries to be much less of an "ugly wart" on the C
>> language, so does things in ways that I feel are more consistent with
>> traditional C semantics (can use native operators, cast conversions, ...).
>
> There are other programming languages than C. What would you
> propose for Fortran, for example?
>

Dunno, not enough overlap between use-cases.

BJX2 is not intended for supercomputers or scientific computing; rather,
I was intending it more for robot and machine control tasks. Originally
this was partly to address some annoyances I had with using ARM based
controllers for these tasks.


But, this part is an uphill battle, though despite its much slower
clock-speeds, it fares relatively well at some of the tasks I intended
it for, though sadly some tasks are seriously hindered by the available
memory bandwidth.

BGB
Aug 13, 2021, 2:26:09 PM

The ISA explosion can be contained.

x86 and ARM just sorta did a particularly bad job at it, as they sort of
awkwardly hacked it onto an ISA design where it didn't really fit, so
pretty much the entire ISA needs to be duplicated several times over.

Well, and also NEON integrating format conversions within its operations
being kinda absurd.


>>
>>
>> So, this is part of why my current policy for BJX2 is doing 64 and 128
>> bit "vectors" and calling it good enough.
>>
>> Not going to go 256 bits, at least not on a 3-wide core.
> <
> Not going to 256-bits wide EVER.

That is also an option.

On a core which could support it, something like:
PADDYF R24, R36, R44
Would be functionally equivalent to:
PADDXF R26, R38, R46 | PADDXF R24, R36, R44

And, one could just define the latter form as the one to use, or add
256-bit vectors as a kind of syntactic sugar in the assembler.
Something like CARRY doesn't really map onto how things are done in
BJX2, where I prefer to avoid contextual encodings.

Granted, prefixes like Jumbo or Op64 could be considered as turning the
following instruction into a contextual encoding, but this can be
sidestepped if the Jumbo or Op64 prefix is seen as composing a new
widened encoding.

Similarly, given they are generally required to be adjacent, this sort of
avoids "spooky action at a distance" behaviors.


Sadly, this sorta breaks down with the WEX-6W "interleave" semantics,
but trying to define a 6-wide core in terms of two overlapping 3-wide
pipelines introduces a bit of hair. Off-hand, there wasn't a good
alternative that was less awful, aside from limiting the use of Jumbo
prefixes to match their use in the 3-wide profile.

This is how I defined the 5-wide profile to behave, which loses some
potential performance, with the trade-off of "less hair" as it allows
ignoring the use of pipeline interleaving.

Whether or not a 6W core is viable is still "yet to be seen".

Another option could be to support SMT without supporting 6W operation.
In this case, the two pipelines would be independent, and the second
pipeline simply go unused if not running in SMT. However, this could
sidestep some of the issues with trying to support a unified 12R/6W
register file (by essentially just running two 6R/3W register files in
parallel with a little bit of additional plumbing trickery).

Thomas Koenig
Aug 13, 2021, 3:13:02 PM

I think that's sort of what 'make code hard to write, read and
thus maintain' means.

> One already has to do this if they want the same code to use SSE and
> NEON in a semi-effective manner.
>
> Or use the common subset that exists with GCC's
> "__attribute__((vector_size(N)))" system.

See above...

>
> Otherwise, it is like saying that no one can use inline ASM, or that one
> will refuse to use a compiler which supports using inline ASM.

No. Inline asm has its place, for example in system headers,
to access features which are not otherwise accessible.

However, if I would have to jump through hoops like the above
for a new architecture...

> One can just write traditional scalar code, and have it perform as such.
> Its performance may suck in comparison, but it isn't precluded.

Sure.

However, even with the sorry state of auto-vectorization, it often
generates better code than pure scalar code (and compilers are
indeed getting better at this). You are saying you do not want
this, and want to force the user to write your specialized version
if decent (even if non-optimal) performance is required.

Your architecture, your choice. Just count me out.

> IME, auto vectorization only sometimes helps on traditional targets, and
> frequently (if the optimizer is overzealous) can turn into a wrecking
> ball on the performance front (performing worse, sometimes
> significantly, than its non vectorized counterpart).

Rarely, and if it actually turns out to be a problem, you can turn
it off.

One problem, of course, is C's over-dependence on pointers, which
makes a lot of vectorization options impossible because the compiler
cannot tell that there are no aliased loads and stores.
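
A minimal illustration of that point (plain C99, nothing architecture
specific): promising the compiler that the pointers cannot alias is often
all it takes for the loop to become vectorizable at all.

/* Without 'restrict' the compiler must assume c may overlap a or b, and
   either refuses to vectorize or emits a runtime overlap check. */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}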

> If it were up to me though, there would be explicit attributes to tell
> the compiler what it should or should not vectorize.

There is the OpenMP SIMD directive, for example.

> Or, maybe, if some standardized way were defined to specify vector types
> in C (preferably more concise than GCC's notation).
>
> Say, for example, if the compiler allowed:
> float[[vector(4)]] vec; //define a 4-element floating point vector.
> Or:
> [[vector(4)]] float vec; //basically equivalent.

Unless this becomes part of a language standard, this is even worse.

>
>> [...]
>>
>>> My approach also tries to be much less of an "ugly wart" on the C
>>> language, so does things in ways that I feel are more consistent with
>>> traditional C semantics (can use native operators, cast conversions, ...).
>>
>> There are other programming languages than C. What would you
>> propose for Fortran, for example?
>>
>
> Dunno, not enough overlap between use-cases.
>
> BJX2 is not intended for supercomputers or scientific computing

> rather
> I was more intending it for robot and machine control tasks.

I use Fortran because it is a nice general-purpose language (and
because I know it well), not especially because it is the language
of supercomputers.

Of course, if your target is the embedded market for C, then you
are likely not even using a hosted implementation, right? In that
case, Fortran with its big libraries is probably the wrong language
for you.

Terje Mathisen
Aug 13, 2021, 4:35:09 PM

Not really an issue as long as you have a way to do a masked store, i.e.
you use the tail-end count to initialize the write mask after doing the
final full iteration the normal way. You usually have to make two loop
copies though: one using regular stores for the full blocks, plus a final
one which can handle either 0 to N-1 or 1 to N elements.
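
A rough sketch of that structure using AVX-512 intrinsics (my choice purely
for illustration, since it has architectural write masks; the function and
the mask arithmetic are not from the post): full 16-wide blocks use plain
stores, and the 0..15 leftover elements go through one masked load/store.

#include <immintrin.h>

void add_f32(float *c, const float *a, const float *b, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16) {                    /* full blocks */
        __m512 v = _mm512_add_ps(_mm512_loadu_ps(&a[i]),
                                 _mm512_loadu_ps(&b[i]));
        _mm512_storeu_ps(&c[i], v);
    }
    int rem = n - i;                                  /* tail: 0..15 elements */
    if (rem) {
        __mmask16 k = (__mmask16)((1u << rem) - 1);   /* low 'rem' lanes active */
        __m512 v = _mm512_add_ps(_mm512_maskz_loadu_ps(k, &a[i]),
                                 _mm512_maskz_loadu_ps(k, &b[i]));
        _mm512_mask_storeu_ps(&c[i], k, v);
    }
}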

Mill is far nicer here, as is Mitch's VVM.

MitchAlsup
Aug 13, 2021, 4:46:45 PM

On Friday, August 13, 2021 at 1:26:09 PM UTC-5, BGB wrote:
> On 8/13/2021 11:29 AM, MitchAlsup wrote:
> > On Friday, August 13, 2021 at 10:59:42 AM UTC-5, BGB wrote:
> >> On 8/13/2021 3:11 AM, Anton Ertl wrote:

> > I am afraid I have to disagree with you here:: The march to endlessly bigger
> > SIMD vectors is because the root of the necessity was never addressed
> > correctly.
> > <
> > For example, all of the SIMD instructions in x86-64 are "performed"
> > by the addition of exactly 2 My 66000 instructions, and may more
> > the x86-64 are possible to express. ISA explosions shold be proactively
> > prevented not embraced.
<
> The ISA explosion can be contained.
>
> x86 and ARM just sorta did a particularly bad job at it,
<
I am going to nominate you for the understatement of the year award.
<
> as they sort of
> awkwardly hacked it onto an ISA design where it didn't really fit, so
> pretty much the entire ISA needs to be duplicated several times over.
>
> Well, and also NEON integrating format conversions within its operations
> being kinda absurd.
<snip>
> > <
> > All of these wide things are performed using CARRY instruction.
> Something like CARRY doesn't really map onto how things are done in
<
> BJX2, where I prefer to avoid contextual encodings.
>
> Granted, prefixes like Jumbo or Op64 could be considered as turning the
> following instruction into a contextual encoding, but this can be
> sidestepped in the Jumbo or Op64 prefix is seen as composing a new
> widened encoding.
>
> Similarly, given they are generally required to be adjacent sort of
> avoids "spooky action at a distance" behaviors.
>
>
> Sadly, this sorta breaks down with the WEX-6W "interleave" semantics,
> but trying to define a 6-wide core in terms of two overlapping 3-wide
> pipelines introduces a bit of hair.
<
Wow another candidate for the understatement of the year.
You are in rare form today.....

MitchAlsup
Aug 13, 2021, 4:55:28 PM

On Friday, August 13, 2021 at 2:13:02 PM UTC-5, Thomas Koenig wrote:
> BGB <cr8...@gmail.com> schrieb:
> > On 8/13/2021 11:44 AM, Thomas Koenig wrote:
> >> BGB <cr8...@gmail.com> schrieb:
<snip>
> >
> > Or use the common subset that exists with GCC's
> > "__attribute__((vector_size(N)))" system.
> See above...
> >
> > Otherwise, it is like saying that no one can use inline ASM, or that one
> > will refuse to use a compiler which supports using inline ASM.
> No. Inline asm has its place, for example in system headers,
> to access features which are not otherwise accessible.
>
> However, if I would have to jump through hoops like the above
> for a new architecture...
<
Anyone designing a new architecture at this point should provide the
means where::
>
for( i = 0; i < MAX; i++ )
    a[i] = b[i];
>
simply runs at the maximum rate this hardware can run it, and it is doubtful
that any future implementation will require a different means to run it as
fast as the machine can perform the desired semantics.
<
> > One can just write traditional scalar code, and have it perform as such.
> > Its performance may suck in comparison, but it isn't precluded.
> Sure.
>
> However, even with the sorry state of auto-vectorization, it often
> generates better code than pure scalar code (and compilers are
> indeed getting better at this). You are saying you do not want
> this, and want to force the user to write your specialized version
> if decent (even if non-optimal) performance is required.
>
> Your architecture, your choice. Just count me out.
<
> > IME, auto vectorization only sometimes helps on traditional targets, and
> > frequently (if the optimizer is overzealous) can turn into a wrecking
> > ball on the performance front (performing worse, sometimes
> > significantly, than its non vectorized counterpart).
<
> Rarely, and if it actually turns out to be a problem, you can turn
> it off.
>
> One problem, of course, is C's over-dependence on pointers, which
> make a lot of vectorization options impossible because the compiler
> cannot tell that there is no aliased load and stores.
<
C's precarious definition of "what a pointer can point at" is no problem
when using the vector facility of My 66000. The compiler has to perform
zero (none, zero, zilch) analysis to convert a loop from scalar form into
vector form.
<
> > If it were up to me though, there would be explicit attributes to tell
> > the compiler what it should or should not vectorize.
<
Any small loop should vectorize--even those with loop carried dependencies.
<
> There is the OpenMP SIMD directive, for example.
> > Or, maybe, if some standardized way were defined to specify vector types
> > in C (preferably more concise than GCC's notation).
> >
> > Say, for example, if the compiler allowed:
> > float[[vector(4)]] vec; //define a 4-element floating point vector.
> > Or:
> > [[vector(4)]] float vec; //basically equivalent.
> Unless this becomes part of a language standard, this is even worse.
<
These should all be wrapped into some kind of typedef so one can type::
<
double_VEC4 variable;
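
For what it's worth, the GCC/Clang vector extension already supports exactly
that kind of wrapping; a small sketch (the typedef name just follows the
example above, and the element-wise add is lowered to whatever SIMD the
target has):

typedef double double_VEC4 __attribute__((vector_size(32)));   /* 4 x double */

double_VEC4 add4(double_VEC4 x, double_VEC4 y)
{
    return x + y;   /* element-wise add, no intrinsics needed */
}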

Quadibloc
Aug 13, 2021, 5:17:13 PM

On Thursday, August 12, 2021 at 9:24:31 PM UTC-6, Ivan Godard wrote:

> Fairly described as "ISA Stockholm Syndrome"?

That's a good one!

But it is not that, being tormented, I learned to love my tormentor. I
merely began to illustrate things as they are, went further afield, and
did not even try to imagine how things might be different or
better.

Certainly, even in Concertina II, where I try to design something
new around high performance, I don't attempt to approach the
kind of originality and creativity you are manifesting. Instead, I
just use what has been used before in a slightly different
combination.

John Savard

Quadibloc
Aug 13, 2021, 5:21:54 PM

On Friday, August 13, 2021 at 3:05:43 AM UTC-6, Anton Ertl wrote:
> My take is that auto-vectorization is
> a flawed concept, mainly because it is unreliable: you rub the
> compiler the wrong way, and it will stop auto-vectorizing the loop
> without giving any warning.

A very good point, and I agree.

However, I would add that I still see auto-vectorization
as a useful idea.

For two reasons:

1) It can supply additional performance for code that
one didn't realize one could manually vectorize.

2) There is the issue of code being transportable to
computer systems which don't have a vector feature.

However, I do think that this flaw does mean that vector
systems, even if they have auto-vectorization, should have
compilers that let one easily write explicit vector code.
And by "easily", I don't mean by putting in subroutine calls.

John Savard

Quadibloc
Aug 13, 2021, 5:31:34 PM

On Friday, August 13, 2021 at 10:44:10 AM UTC-6, Thomas Koenig wrote:
> BGB <cr8...@gmail.com> schrieb:

> > I am actively against autovectorization on BJX2.
> > To what extent vectors are usable, they are in the form of explicit
> > language extensions.

> Count me out for using your architecture, then.

> Explicit language extensions

> - lock in the user to a specific architecture and compiler
> - expose architecture details which should not be visible
> - make code hard to write, read and thus maintain

> [...]
> > My approach also tries to be much less of an "ugly wart" on the C
> > language, so does things in ways that I feel are more consistent with
> > traditional C semantics (can use native operators, cast conversions, ...).

> There are other programming languages than C. What would you
> propose for Fortran, for example?

There is an obvious solution.

Extend Fortran so that:

DOUBLE PRECISION A(1000), B(1000), C(1000)
...
A = B+C

is included in the language for *both* vector computers and purely
scalar computers, with the latter case generating much the same
code as


DOUBLE PRECISION A(1000), B(1000), C(1000)
...

DO 29007 ITEMP=1,1000
A(ITEMP) = B(ITEMP) + C(ITEMP)
29007 CONTINUE

After all, APL ran on many computers without vector hardware.

It isn't a language _extension_ if it's part of the language
standard for all machines on which the language is implemented.
The language construct, of course, is only applicable to cases that
don't include the kinds of things that make autovectorization fail.

John Savard

BGB
Aug 13, 2021, 6:18:06 PM

Quick look...

Oddly enough, if Fortran defines things this way, it doesn't actually
need any SIMD extensions since the language already contains the
necessary semantics.


C does not, which is why one kinda needs explicit vectors; the
assumptions the compiler has to make and the hoops it has to jump through
are also part of what makes implicit auto-vectorization "evil".

Code can suffer bizarre performance impacts and/or crashes for the sake of
stuff which the compiler tried to be clever and vectorize but probably
shouldn't have (such as assuming that things are 16B aligned without
sufficient reason, generating big evil masses of code for functions
which are only ever called for small N, ...).

For example, if you have a function which is typically only ever called
with 3 or 4 elements, but is called frequently, and the compiler
"cleverly" tries to turn this into a giant ugly mess that will only ever
be useful if called to operate on a much bigger array, this is not ideal.

Say:
vec_t DotProduct(vec_t *a, vec_t *b, int n)
{
    vec_t f;
    int i;

    f = 0;
    for (i = 0; i < n; i++)
        f += a[i] * b[i];
    return (f);
}

Which is typically used like:
vec3_t iva;
vec3_t ivb;
float l;
...
l=DotProduct(iva, ivb, 3);

Where somewhere in a header:
typedef float vec_t;
typedef float vec3_t[3];


One can be like, "Please no, oh great compiler, please refrain from
turning DotProduct into some gigantic and horrible mess of SIMD
instructions!".


If you are lucky, it is early-out...
If you are unlucky, the program crashes because the compiler used MOVAPS
or similar without actually checking that the pointers were correctly
aligned.

...

BGB
Aug 13, 2021, 6:32:55 PM

On 8/13/2021 4:21 PM, Quadibloc wrote:
> On Friday, August 13, 2021 at 3:05:43 AM UTC-6, Anton Ertl wrote:
>> My take is that auto-vectorization is
>> a flawed concept, mainly because it is unreliable: you rub the
>> compiler the wrong way, and it will stop auto-vectorizing the loop
>> without giving any warning.
>
> A very good point, and I agree.
>
> However, I would add that I still see auto-vectorization
> as a useful idea.
>
> For two reasons:
>
> 1) It can supply additional performance for code that
> one didn't realize one could manually vectorize.
>
> 2) There is the issue of code being transportable to
> computer systems which don't have a vector feature.
>

Yeah, in an ideal world, the vectors should not depend on a specific
SIMD ISA.

This is a big failing of the Intel/MSVC "mmintrin.h" / "xmmintrin.h"
system, as it was basically a very thin wrapper over MMX and SSE.
Also making the ability to use these dependent on targeting a CPU that
has these instructions, has the requisite compiler flags, ... sucks
really hard.


> However, I do think that this flaw does mean that vector
> systems, even if they have auto-vectorization, should have
> compilers that let one easily write explicit vector code.
> And by "easily", I don't mean by putting in subroutine calls.
>

In both the BGBCC and GCC extensions, one can typedef the vector, and
then later be like:
vec4f_t a, b, c;
a = (vec4f_t) { 1, 2, 3, 4 };
b = (vec4f_t) { 5, 6, 7, 8 };
c = a + b;

In many of these cases, one can even use the same
typedef for both, e.g.:
typedef float vec4f_t __attribute__((vector_size(16)));

And, the compiler will helpfully gloss over SSE vs NEON vs "just faking
it" vs ...

Though, admittedly things aren't strictly 1:1 between BGBCC and GCC's
extensions here.

BGB
Aug 14, 2021, 12:04:33 AM

When declared with typedef, it is a lot nicer to use than most other
traditional options (eg, functions and "float *" pointers and similar).


>>
>> Otherwise, it is like saying that no one can use inline ASM, or that one
>> will refuse to use a compiler which supports using inline ASM.
>
> No. Inline asm has its place, for example in system headers,
> to access features which are not otherwise accessible.
>
> However, if I would have to jump through hoops like the above
> for a new architecture...
>

No one is forced to use the language extensions...


It also isn't anywhere near as low-level of a design as something like
"xmmintrin.h" was; it is, if anything, more like part of
GLSL glued onto C.

There is a common subset that exists with GCC's notation though, so it
is possible to use this and write code that works on both compilers.

The main limitation (relative to GCC) is that the vectors are
more-or-less required to be a padded to a power-of-2 size and within a
more limited set of possible combinations of types and sizes, whereas
GCC's vector system is a little more open-ended as to what it allows.

Though, it is possible I could consider going over to allowing more
open-ended vectors, using a type representation more like what I did for
"_BitInt" support.

Though, likely, it would be vaguely similar:
Packed types: Byte/Word/DWord/QWord, Half/Single/Double;
Would still only allow vectors of primitive-type elements.
Vector sizes 2/3/4/N;
Storage sizes: 8B, 16B, or a multiple of 16B.
N-element storage being padded up to what fits in one of the above.

So, for example:
float v13 __attribute__((vector_size(52)));
Would be padded to 64-bytes and 16 elements.




Unlike GCC's system, it does borrow a few minor things from xmmintrin,
though they behave a bit differently, eg:
__m128 is cast-compatible with other 128-bit vector types;
It is cast compatible with __int128;
...

Similarly, it is possible to use cast-conversion to sidestep some other
traditional conversions, eg:
double f;
uint64_t fi;
memcpy(&fi, &f, 8); //traditional approach.
fi=(uint64_t)((__m64)f); //alternative (non-memcpy)


But, nothing says one can't still use memcpy if they want.
However, memcpy, unions, and pointer derefs have the drawback that they
force the values to be spilled to memory and use memory ops, whereas
the casts allow doing it via register operations.

So, eg, __m64 and __m128 are sort of like the equivalent of "void *" for
SIMD and Floating Point types, whereas direct-casting between
incompatible vector types either wont work, or will try to perform a
value-type conversion.


It is unclear whether or not a large vector type would be allowed to be
bit-cast via _BitInt, eg:
float v13 __attribute__((vector_size(52)));
_BitInt(416) lvi;
lvi=(_BitInt(416))v13; //should this be allowed?...

...

But, I guess it could be nicer if notations were more consistent.


Unlike GLSL though, neither BGBCC's nor GCC's vector system includes a
way to natively express things like matrices or matrix math though.


>> One can just write traditional scalar code, and have it perform as such.
>> Its performance may suck in comparison, but it isn't precluded.
>
> Sure.
>
> However, even with the sorry state of auto-vectorization, it often
> generates better code than pure scalar code (and compilers are
> indeed getting better at this). You are saying you do not want
> this, and want to force the user to write your specialized version
> if decent (even if non-optimal) performance is required.
>
> Your architecture, your choice. Just count me out.
>

I would be less opposed to auto vectorization if it could be done in
ways which weren't prone to occasionally break stuff or make performance
worse than in the baseline language.

I also have similar reservations about strict aliasing semantics.

But, granted, then one can argue that it is an uphill battle to convince
programmers to use the 'restrict' keyword or similar (or argue that
'restrict' is useless if the compiler uses strict aliasing semantics by
default, ...), but alas.


>> IME, auto vectorization only sometimes helps on traditional targets, and
>> frequently (if the optimizer is overzealous) can turn into a wrecking
>> ball on the performance front (performing worse, sometimes
>> significantly, than its non vectorized counterpart).
>
> Rarely, and if it actually turns out to be a problem, you can turn
> it off.
>

Except when the #pragma that is supposed to disable it, or the command
line options to turn it off, weren't working. Though, this being with
MSVC on X64; which seems to have lost the notion of being able to do an
optimized build with vectorization disabled (well, unless one is using
"/Os" rather than "/O2").

In one case like this, I ended up resorting to doing a
type-conversion via manual bit twiddling, as this was the only option I
found that worked, and breaking the vectorization with a
function call and bit-twiddling on floating point values was faster than
just letting the compiler continue to do what it wanted to do.


> One problem, of course, is C's over-dependence on pointers, which
> make a lot of vectorization options impossible because the compiler
> cannot tell that there is no aliased load and stores.
>

Some of the compilers assume non-aliasing in situations where the
aliasing does occur. This can lead to code misbehaving and crashing.

Compiler writers then pass blame off on the code for containing strict
aliasing violations or similar (and/or treat strict aliasing as opt-out
rather than opt-in).

Ironically, that leaves MSVC as one of the few compilers where it is
disabled by default.


>> If it were up to me though, there would be explicit attributes to tell
>> the compiler what it should or should not vectorize.
>
> There is the OpenMP SIMD directive, for example.
>

Possible.


>> Or, maybe, if some standardized way were defined to specify vector types
>> in C (preferably more concise than GCC's notation).
>>
>> Say, for example, if the compiler allowed:
>> float[[vector(4)]] vec; //define a 4-element floating point vector.
>> Or:
>> [[vector(4)]] float vec; //basically equivalent.
>
> Unless this becomes part of a language standard, this is even worse.
>

Potentially...

The C attributes are supposed to be for things which don't change the
functional behavior of the program, though this is not so true with many
traditional uses of __declspec or __attribute__...


>>
>>> [...]
>>>
>>>> My approach also tries to be much less of an "ugly wart" on the C
>>>> language, so does things in ways that I feel are more consistent with
>>>> traditional C semantics (can use native operators, cast conversions, ...).
>>>
>>> There are other programming languages than C. What would you
>>> propose for Fortran, for example?
>>>
>>
>> Dunno, not enough overlap between use-cases.
>>
>> BJX2 is not intended for supercomputers or scientific computing
>
>> rather
>> I was more intending it for robot and machine control tasks.
>
> I use Fortran because it is a nice general-purpose language (and
> because I know it well), not especially because it is the language
> of supercomputers.
>
> Of course, if your target is the embedded market for C, then you
> are likely not even using a hosted implementation, right? In that
> case, Fortran with its big libraries is probably the wrong language
> for you.
>

Looking at it some, Fortran seems to support explicit operations over
arrays:
This is not autovectorization, this is explicit vectorization.



At present, programs run as mostly:
Bare ROM (size limited);
Binary loaded at boot time (has most "kernel" functionality statically
linked);
Application binary, loaded by the kernel/shell.


At present, in the "TestKern OS", the kernel and shell are essentially
the same program. It launches binaries, and any system calls return back
to it. Most of what would be separate utilities in a traditional Unix
are integrated into the shell as well...

So, by architectural analogy, it is like BusyBox were shoved into the
Linux kernel, which was itself reduced to starting itself up, trying to
launch an "autoexec" shell script or program, and then dumping the user
at the command-line in the default case.

Other cases, the binary is statically linked with all the kernel stuff
(filesystem and memory manager), and then put on the SDcard by itself to
boot and run.


At present, it uses a single big address space, and doesn't use a
preemptive scheduler. Partly this is because multiple address spaces and
preemptive scheduling are a bad situation for real-time programs.

Instead, main alternatives are cooperative multi-threading, and an
explicit task scheduler loop (typically operates using something more
akin to "continuation passing style"). For similar reasons, there is no
garbage collector, ...

One wants to be able to schedule events down to a time window of a few
microseconds for things like servomotor control and similar.


Some other parts were intended partly for processing image data from
camera modules, but had a non-zero overlap with what was needed for a
software rasterizer (and using them to implement a software-rasterized
OpenGL backend seemed like a reasonable test case), ...

This part involves lots of small-vector low-precision SIMD, and some
more obscure operators like taking the dot-products of vectors and
comparing them against a threshold value, ...

...


But, as noted, I am primarily testing it with Doom and Quake and similar...

BGB
Aug 14, 2021, 1:03:51 AM

On 8/13/2021 3:55 PM, MitchAlsup wrote:
> On Friday, August 13, 2021 at 2:13:02 PM UTC-5, Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> schrieb:
>>> On 8/13/2021 11:44 AM, Thomas Koenig wrote:
>>>> BGB <cr8...@gmail.com> schrieb:
> <snip>
>>>
>>> Or use the common subset that exists with GCC's
>>> "__attribute__((vector_size(N)))" system.
>> See above...
>>>
>>> Otherwise, it is like saying that no one can use inline ASM, or that one
>>> will refuse to use a compiler which supports using inline ASM.
>> No. Inline asm has its place, for example in system headers,
>> to access features which are not otherwise accessible.
>>
>> However, if I would have to jump through hoops like the above
>> for a new architecture...
> <
> Anyone designing a new architecture at this point should provide the
> means where::
>>
> for( i = 0; I < MAX; i++ )
> a[i] = b[i];
>>
> simply runs at the maximum rate this hardware can run it, and it is doubtful
> that any future implementation will require a different means to run it as
> fast as the machine can perform the desired semantics.

Making this universally true would require some sort of "arcane magic"...


Then again, if 'a' and 'b' are "long long *" or similar, it will already
run at full L2 speed on BJX2 (the difference between 64 and 128 bit
load/store mostly only becomes obvious for L1 copies).
On traditional targets, one's options are typically one of:
Relatively small and concise loop;
Mountain of SIMD instructions and special cases.


Even if on average it is faster, the cases where it has not worked out
so well have kinda soured me on it.

More so, the:
1, small or moderate N, called rarely;
2, large N, called semi-frequently;
3, small N, called very frequently.
Where:
1, don't want vectorizing, this wastes space;
2, vectorizing usually helps;
3, don't want vectorizing, this usually makes it worse.

The compilers typically have no idea which category a given function
falls into, and compiling everything as if it were scenario 2 is very
bad for functions which fall into category 3.

If compiling a program that has a lot more scenario 3 functions than
scenario 2 functions, then vectorizing will often make it worse.


The compiler should at least have "some" idea about the relative call
frequencies and average 'n' values and similar. Given that it doesn't,
any decisions it can make in these areas are suspect at best.

Granted, if one replaces something like:
void VectorAdd(vec_t *a, vec_t *b, vec_t *c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
With:
void VectorAdd3(vec_t *a, vec_t *b, vec_t *c)
{
    int i;
    for (i = 0; i < 3; i++)
        c[i] = a[i] + b[i];
}

Generally, the compiler will do a somewhat better job, where it will
typically just unroll the loop body into the equivalent of, say:
c[0]=a[0]+b[0];
c[1]=a[1]+b[1];
c[2]=a[2]+b[2];

...


>> There is the OpenMP SIMD directive, for example.
>>> Or, maybe, if some standardized way were defined to specify vector types
>>> in C (preferably more concise than GCC's notation).
>>>
>>> Say, for example, if the compiler allowed:
>>> float[[vector(4)]] vec; //define a 4-element floating point vector.
>>> Or:
>>> [[vector(4)]] float vec; //basically equivalent.
>> Unless this becomes part of a language standard, this is even worse.
> <
> These should all be wrapped into some kind of typedef so one can type::
> <
> double_VEC4 variable;

In practice, this is usually what happens.
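
For concreteness, a minimal sketch of the usual wrapping, using GCC/Clang
vector extensions (the type and function names here are illustrative, not
from any particular ABI):

#include <stddef.h>

/* 4 doubles packed into one GCC-style vector type */
typedef double double_VEC4 __attribute__((vector_size(32)));

double_VEC4 axpy4(double a, double_VEC4 x, double_VEC4 y)
{
    /* element-wise multiply-add with native operators; the scalar 'a'
     * is broadcast across the vector (supported by recent GCC/Clang) */
    return a * x + y;
}

The typedef hides the attribute syntax, and the arithmetic then reads
like ordinary scalar code.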

Thomas Koenig

unread,
Aug 14, 2021, 2:48:00 AM8/14/21
to
Quadibloc <jsa...@ecn.ab.ca> schrieb:
> On Friday, August 13, 2021 at 10:44:10 AM UTC-6, Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> schrieb:
>
>> > I am actively against autovectorization on BJX2.
>> > To what extent vectors are usable, they are in the form of explicit
>> > language extensions.
>
>> Count me out for using your architecture, then.
>
>> Explicit language extensions
>
>> - lock in the user to a specific architecture and compiler
>> - expose architecture details which should not be visible
>> - make code hard to write, read and thus maintain
>
>> [...]
>> > My approach also tries to be much less of an "ugly wart" on the C
>> > language, so does things in ways that I feel are more consistent with
>> > traditional C semantics (can use native operators, cast conversions, ...).
>
>> There are other programming languages than C. What would you
>> propose for Fortran, for example?
>
> There is an obvious solution.
>
> Extend Fortran so that:
>
> DOUBLE PRECISION A(1000), B(1000), C(1000)
> ...
> A = B+C
>
> is included in the language for *both* vector computers and purely
> scalar computers,

Your suggestion is rather amusing, given that this has been the
case since 1991 (ever since Fortran 90 came out).

However, this does not solve the whole problem, as there are loops
which cannot be expressed with simple vector arithmetic.

Thomas Koenig

unread,
Aug 14, 2021, 2:57:01 AM8/14/21
to
MitchAlsup <Mitch...@aol.com> schrieb:

> C's precarious definition of "what a pointer can point at" is no problem
> when using the vector facility of My 66000. The compiler has to perform
> zero (none, zero, zilch) analysis to convert a loop from scalar form into
> vector form.

The compiler still has to follow the language semantics, and C
semantics (which are particularly bad for aliasing) are not the
only ones.

Then there is the problem of what a "loop" is. For example, most
Fortran compilers will translate

a = b + c

into a loop upon converting the code to the intermediate language,
but Fortran also has DO and DO CONCURRENT loops (where the order
is indeterminate for the DO CONCURRENT case).

There is also

a(i:j) = a(i+1:j+1) + c

or, even worse,

a(i+k:j+k) = a(i:j) + c

for which, by the language semantics, the right-hand side has
to appear to the programmer to be evaluated before the left-hand
side, no matter what (and no matter the sign of k).
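
Spelled out in C (0-based indexing, names are illustrative), the general
case has to behave as if the whole right-hand side were evaluated before
any store to the left-hand side, e.g. via a temporary:

#include <stdlib.h>

/* Conceptual lowering of  a(i+k:j+k) = a(i:j) + c  when the sign of k
 * is not known at compile time: evaluate the RHS in full first, then
 * perform the stores. */
void slice_assign(double *a, int i, int j, int k, double c)
{
    int n = j - i + 1;
    double *tmp = malloc(n * sizeof *tmp);
    for (int m = 0; m < n; m++)
        tmp[m] = a[i + m] + c;        /* evaluate the RHS first  */
    for (int m = 0; m < n; m++)
        a[i + k + m] = tmp[m];        /* then do the stores      */
    free(tmp);
}

A vectorizer either has to prove the slices cannot overlap the wrong way,
or fall back to something equivalent to this.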

So, without having a way to try it, I am not convinced that My 66000
does the right thing for a language that is not C that you have not
tried yourself :-)

Thomas Koenig

unread,
Aug 14, 2021, 3:18:25 AM8/14/21
to
Still non-portable, and a PITA to use. I have only used it to
generate test cases for stuff where auto-vectorization did
not work.

>
>
>>>
>>> Otherwise, it is like saying that no one can use inline ASM, or that one
>>> will refuse to use a compiler which supports using inline ASM.
>>
>> No. Inline asm has its place, for example in system headers,
>> to access features which are not otherwise accessible.
>>
>> However, if I would have to jump through hoops like the above
>> for a new architecture...
>>
>
> No one is forced to use the language extensions...

No.

I'm beginning to sound repetitive there...

If your architecture needs language extensions to get adequate
performance because you do not want auto-vectorization on principle,
people who do not use the extensions will not get adequate performance,
so they are likely not to use your architecture at all.

Including myself.

[...]

> Unlike GLSL though, neither BGBCC's nor GCC's vector system includes a
> way to natively express things like matrices or matrix math though.

Use Fortran :-)

>
>
>>> One can just write traditional scalar code, and have it perform as such.
>>> Its performance may suck in comparison, but it isn't precluded.
>>
>> Sure.
>>
>> However, even with the sorry state of auto-vectorization, it often
>> generates better code than pure scalar code (and compilers are
>> indeed getting better at this). You are saying you do not want
>> this, and want to force the user to write your specialized version
>> if decent (even if non-optimal) performance is required.
>>
>> Your architecture, your choice. Just count me out.
>>
>
> I would be less opposed to auto vectorization if it could be done in
> ways which weren't prone to occasionally break stuff or make performance
> worse than in the baseline language.

Profile, and use -fno-tree-vectorize in gcc if it really makes
things slow.

>
> I also have similar reservations about strict aliasing semantics.
>
> But, granted, then one can argue that it is an uphill battle to convince
> programmers to use the 'restrict' keyword or similar (or argue that
> 'restrict' is useless if the compiler uses strict aliasing semantics by
> default, ...), but alas.

If a compiler were to assume restrict semantics for C code, that would
be a serious bug and should be reported and fixed.

Strict aliasing semantics is something else: Assume an architecture
that has different register types for float and int (as is usual).
The value of a variable has been loaded into a float register. A
store through a pointer to int occurs. Should the value in the
floating point register be invalidated, yes or no?
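
A minimal C illustration of that question (my example, not from the
original post):

float read_after_int_store(float *fp, int *ip)
{
    float f = *fp;   /* value loaded into a floating-point register     */
    *ip = 42;        /* store through int*: under strict aliasing this  */
                     /* is assumed not to modify *fp                    */
    return f + *fp;  /* *fp may be reused from the register, unreloaded */
}

With -fno-strict-aliasing the second read of *fp has to come back from
memory; with strict aliasing in effect the compiler may keep it in the
register across the store.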

What does it do on your architecture?

>>> IME, auto vectorization only sometimes helps on traditional targets, and
>>> frequently (if the optimizer is overzealous) can turn into a wrecking
>>> ball on the performance front (performing worse, sometimes
>>> significantly, than its non vectorized counterpart).
>>
>> Rarely, and if it actually turns out to be a problem, you can turn
>> it off.
>>
>
> Except when the #pragma that is supposed to disable it, or the command
> line options to turn it off, weren't working.

Again, a serious bug that should be reported and fixed.

> Though, this being with
> MSVC on X64; which seems to have lost the notion of being able to do an
> optimized build with vectorization disabled (well, unless one is using
> "/Os" rather than "/O2").

Hm, maybe Microsoft doesn't care as much about bug reports...

[...]

>> One problem, of course, is C's over-dependence on pointers, which
>> make a lot of vectorization options impossible because the compiler
>> cannot tell that there is no aliased load and stores.
>>
>
> Some of the compilers assume non-aliasing in situations where the
> aliasing does occur. This can lead to code misbehaving and crashing.

If the code is not in accordance with the language specification,
what should a compiler do?

Is there a second specification somewhere for what would be an
acceptable sort of violation of the language standards? Which body
decided on it and passed it?

If this is an available and generally accepted document, maybe it
would be possible to have compiler writers use it.

> Compiler writers then pass blame off on the code for containing strict
> aliasing violations or similar (and/or treat strict aliasing as opt-out
> rather than opt-in).

Ah, yes... where is that document?

> Ironically, that leaves MSVC as one of the few compilers where it is
> disabled by default.

Maybe Microsoft doesn't care much about standards.

[...]

Anton Ertl

unread,
Aug 14, 2021, 5:39:25 AM8/14/21
to
BGB <cr8...@gmail.com> writes:
>IME, moving to vectors larger than 128 bits doesn't tend to gain much
>over 128 bits.

What experience is that?

>As the register size gets larger, there is less that can
>utilize it effectively.

Less, yes, but so much that it does not gain much?

>Once the SIMD vector exceeds the length of the data one would likely
>express using vectors, its utility drops off sharply.

Pictures tend to be much larger than 128 bits. Sounds tend to be much
larger than 128 bits. Videos tend to be much larger than 128 bits.

Even for stuff like memcpy and memset, the length is often larger than
128 bits (16 bytes).

Anton Ertl

unread,
Aug 14, 2021, 6:21:12 AM8/14/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>AVX512 is a fiasco, but mainly because Intel overspent its heat
>budget and has to clock down the CPU a _lot_ to use it, destroying
>most if any advantage in using it.

The problem was (is?) in how they did the down-clocking. Earlier they
clocked down all the cores for 1ms on encountering a single AVX-256 or
AVX-512 instruction (given that all the cores are in the same power
domain, this makes sense if they want to do voltage scaling, not just
frequency scaling; but OTOH, they tend to take longer for voltage
scaling, so maybe that's not the reason). More recently I read that
they clock down only one core, they differentiate between light,
medium and heavy instructions, and have an overall more balanced
approach. Not sure if it is balanced enough yet.

In principle wider SIMD offers an energy advantage: With N times as
wide SIMD instructions, for the same computation you need to execute
1/N times as many of them; the data portion of each instruction
consumes N times as much energy, but the control portion consumes
hardly more. So overall the data portion consumes the same energy for
the task, while the control portion consumes less. But of course, it
does that in a shorter time, increasing power consumption. If you
then use frequency *and* voltage scaling to stay in the same power
envelope, you have to clock down by much less than a factor of N, not
just because of control (where ever-longer SIMD has diminishing
returns), but also because of voltage scaling, so overall you are
faster.
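
As a rough back-of-the-envelope model (my notation, not part of the
original argument): for a task of $I$ scalar-equivalent lane operations,
with per-lane data energy $E_d$ and per-instruction control energy $E_c$,

    $E_{\mathrm{task}}(N) = \frac{I}{N}\,(N E_d + E_c) = I E_d + \frac{I E_c}{N}$

i.e. widening the SIMD factor $N$ leaves the data energy unchanged and
only shrinks the control term; the shorter execution time is what raises
power and forces the frequency/voltage adjustment.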

It's a bit more complicated with a mixture of SIMD and scalar stuff,
but given that longer SIMD gives energy benefits even without voltage
scaling, it should be possible to do ok even then. As a trivial
approach, just run the double-wide SIMD units at half speed if we are
exceeding the power limit. Double-wide SIMD instructions will then
run as fast as two single-wide SIMD instructions, but consume less
power for control, and the rest will consume the same power. A more
refined approach should be possible.

>Explicit language extensions
>
>- lock in the user to a specific architecture and compiler

Really?

APL locks the user to a specific architecture and compiler? I know of
several APL implementations for a number of architectures.

Fortran array expressions lock the user to a specifc architecture and
compiler?

>- expose architecture details which should not be visible

Which architecture details that APL and Fortran array expressions
expose should not be visible?

>- make code hard to write, read and thus maintain

I expect that APL and Fortran programmers will disagree.

Thomas Koenig

unread,
Aug 14, 2021, 6:32:36 AM8/14/21
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
[...]

>>Explicit language extensions
>>
>>- lock in the user to a specific architecture and compiler
>
> Really?
>
> APL locks the user to a specific architecture and compiler? I know of
> several APL implementations for a number of architectures.
>
> Fortran array expressions lock the user to a specifc architecture and
> compiler?

A language extension is, by definition, something that is not
in the standard.

Array expressions have been in the Fortran standard since 1991,
so they are, by definition, not a language extension.

>>- make code hard to write, read and thus maintain
>
> I expect that APL and Fortran programmers will disagree.

I am a Fortran programmer, as you may have noticed.

The point was extending C with non-standardized constructs, that
is a whole different kettle of fish.

Anton Ertl

unread,
Aug 14, 2021, 6:43:18 AM8/14/21
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Friday, August 13, 2021 at 3:05:43 AM UTC-6, Anton Ertl wrote:
>> My take is that auto-vectorization is
>> a flawed concept, mainly because it is unreliable: you rub the
>> compiler the wrong way, and it will stop auto-vectorizing the loop
>> without giving any warning.
>
>A very good point, and I agree.
>
>However, I would add that I still see auto-vectorization
>as a useful idea.
>
>For two reasons:
>
>1) It can supply additional performance for code that
>one didn't realize one could manually vectorize.

For non-benchmark code: If a programmer does not realize it, a
compiler will find that it cannot vectorize it, either. That's
because the programmer typically puts in something that throws off the
auto-vectorizer. And I found that, when I tried to make the code
amenable to auto-vectorization (e.g., by turning an array-of-structs
into a struct-of-arrays), the compiler still failed to auto-vectorize
it.

For benchmark code: Compiler writers tend to analyse benchmarks for
such things, and then try (and sometimes succeed) in teaching that to
the compiler. E.g., Sun managed to teach an array-of-structs to
structs-of-arrays transformation to the compiler for one of the SPEC
CPU 2006 or 2000 benchmarks, resulting in a huge speedup. But a large
number of conditions had to be satisfied in order to apply that
transformation, and so it is extremely unlikely that your
non-benchmark code would benefit from that.
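
For readers who have not met the transformation being referred to, a
minimal sketch (names are illustrative, not taken from any benchmark):

/* array-of-structs: the fields of one element are interleaved in
 * memory, so a loop over .x is a strided access and harder to
 * vectorize */
struct particle_aos { float x, y, z, mass; };

void scale_x_aos(struct particle_aos *p, int n, float s)
{
    for (int i = 0; i < n; i++)
        p[i].x *= s;
}

/* struct-of-arrays: each field is contiguous, so the same loop is a
 * plain unit-stride candidate for the auto-vectorizer */
struct particles_soa { float *x, *y, *z, *mass; };

void scale_x_soa(struct particles_soa *p, int n, float s)
{
    for (int i = 0; i < n; i++)
        p->x[i] *= s;
}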

>2) There is the issue of code being transportable to
>computer systems which don't have a vector feature.

Compiling language-level vector operations to scalar native code is
trivial, unlike the other direction (auto-vectorization).

Anton Ertl

unread,
Aug 14, 2021, 6:51:38 AM8/14/21
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Friday, August 13, 2021 at 10:44:10 AM UTC-6, Thomas Koenig wrote:
>> There are other programming languages than C. What would you
>> propose for Fortran, for example?
>
>There is an obvious solution.
>
>Extend Fortran so that:
>
> DOUBLE PRECISION A(1000), B(1000), C(1000)
>...
> A = B+C
>
>is included in the language

This (or something very similar) has been in Fortran since Fortran 90.

>It isn't a language _extension_ if it's part of the language
>standard for all machines on which the language is implemented.

Relative to Fortran 77, the stuff above is a language extension. But
who cares if it is? The point is that you can write this.

George Neuner

unread,
Aug 14, 2021, 9:50:19 AM8/14/21
to
On Thu, 12 Aug 2021 14:09:21 -0700 (PDT), MitchAlsup
<Mitch...@aol.com> wrote:

>Operating System people are pretty ignorant, except for the size of the
>stuff they have to save/restore around context switches, and when they
>get inside various system library functions that the compiler spit out in
>SIMD form.

I don't think they're ignorant so much as they just don't use SIMD
much and so they don't really care about it - except (as you said) in
relation to saving/restoring register context.

It's hard to care deeply about something you (almost) never use. Other
than memmove/memcopy and perhaps encryption there's no call for SIMD
in OS code.

YMMV,
George

Terje Mathisen

unread,
Aug 14, 2021, 10:59:27 AM8/14/21
to
OTOH many OSs spend significant time doing memset to zero out memory
blocks before reissuing them. This should have been trivial, i.e. just
do a 4KB rep STOS (total code size less than 20 bytes: mov rdi,target /
mov rcx,512 / xor rax,rax / rep stosq), but when they notice that a
non-temporal unrolled SIMD block store is faster, then we end up with
model-specific versions. :-(
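
Rendered as compilable C with GCC/Clang-style inline assembly, the
"trivial" version described above would look roughly like this (a sketch
of the described sequence, not any OS's actual code):

#include <stdint.h>

/* zero one 4 KiB page with rep stosq: 512 quadword stores */
static inline void zero_page(void *target)
{
    void    *rdi = target;
    uint64_t rcx = 512;
    __asm__ volatile("rep stosq"
                     : "+D"(rdi), "+c"(rcx)   /* rdi = dest, rcx = count */
                     : "a"(0ULL)              /* rax = 0                 */
                     : "memory");
}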

Thomas Koenig

unread,
Aug 14, 2021, 11:04:29 AM8/14/21
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Quadibloc <jsa...@ecn.ab.ca> writes:
>>On Friday, August 13, 2021 at 10:44:10 AM UTC-6, Thomas Koenig wrote:
>>> There are other programming languages than C. What would you
>>> propose for Fortran, for example?
>>
>>There is an obvious solution.
>>
>>Extend Fortran so that:
>>
>> DOUBLE PRECISION A(1000), B(1000), C(1000)
>>...
>> A = B+C
>>
>>is included in the language
>
> This (or something very similar) has been in Fortran since Fortran 90.
>
>>It isn't a language _extension_ if it's part of the language
>>standard for all machines on which the language is implemented.
>
> Relative to Fortran 77, the stuff above is a language extension.

Fortran 77 has been dated for around 30 years now - about three
eternities in computer science :-)

> But
> who cares if it is? The point is that you can write this.

... and you cannot use non-standard extensions in code that is in
any way supposed to be portable.

By the way, even Fortran's array statements are often translated
into an IR as loops, on which the autovectorizer then tries
its thing.

Anton Ertl

unread,
Aug 14, 2021, 11:39:47 AM8/14/21
to
George Neuner <gneu...@comcast.net> writes:
> Other
>than memmove/memcopy and perhaps encryption there's no call for SIMD
>in OS code.

Moreover, they want to avoid it in the kernel to reduce the number of
registers that they have to save on system calls.

Anton Ertl

unread,
Aug 14, 2021, 12:16:33 PM8/14/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> Thomas Koenig <tko...@netcologne.de> writes:
>[...]
>
>>>Explicit language extensions
>>>
>>>- lock in the user to a specific architecture and compiler
>>
>> Really?
>>
>> APL locks the user to a specific architecture and compiler? I know of
>> several APL implementations for a number of architectures.
>>
>> Fortran array expressions lock the user to a specifc architecture and
>> compiler?
>
>A language extension is, by definition, something that is not
>in the standard.

By definition? Where do I find this definition?

Certainly the Forth standard contains word sets with optional extensions:

|Each word set may have an extension, containing words that offer
|additional functionality.
<https://forth-standard.org/standard/intro>:

>Array expressions have been in the Fortran standard since 1991,
>so they are, by definition, not a language extension.

The are not in Fortran 77 or earlier standards, so they certainly
extend these standards.

>>>- make code hard to write, read and thus maintain
>>
>> I expect that APL and Fortran programmers will disagree.
>
>I am a Fortran programmer, as you may have noticed.
>
>The point was extending C with non-standardized constructs

Maybe that's what you understood. But the whole subthread started
with me stating that manual vectorization is better, citing Sections
2.2-2.4 of <http://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf>
which happen to give Fortran 90 array operations as one example, and
do not mention C at all. BGB took the cue and mentioned language
extensions without, at first, mentioning C.

And in the discussion about auto-vectorization vs. manual
vectorization that is certainly a very relevant point: You don't
auto-vectorize an array language like APL, you auto-vectorize a scalar
language. And you cannot manually express vectorization in a scalar
language like Fortran 66 (reigning when the Cray-1 came to the market)
or Fortran 77, you need to extend it, e.g., with the Fortran 90 array
expressions. I think that opaque vectors are better than working with
transparent arrays, but even with the Fortran 90 array stuff, it is
possible (with a little discipline) to write code that makes things as
easy for the compiler as opaque vectors.

MitchAlsup

unread,
Aug 14, 2021, 12:52:45 PM8/14/21
to
On Saturday, August 14, 2021 at 12:03:51 AM UTC-5, BGB wrote:
> On 8/13/2021 3:55 PM, MitchAlsup wrote:
> > On Friday, August 13, 2021 at 2:13:02 PM UTC-5, Thomas Koenig wrote:
> >> BGB <cr8...@gmail.com> schrieb:
> >>> On 8/13/2021 11:44 AM, Thomas Koenig wrote:
> >>>> BGB <cr8...@gmail.com> schrieb:
> > <snip>
> >>>
> >>> Or use the common subset that exists with GCC's
> >>> "__attribute__((vector_size(N)))" system.
> >> See above...
> >>>
> >>> Otherwise, it is like saying that no one can use inline ASM, or that one
> >>> will refuse to use a compiler which supports using inline ASM.
> >> No. Inline asm has its place, for example in system headers,
> >> to access features which are not otherwise accessible.
> >>
> >> However, if I would have to jump through hoops like the above
> >> for a new architecture...
> > <
> > Anyone designing a new architecture at this point should provide the
> > means where::
> >>
> > for( i = 0; I < MAX; i++ )
> > a[i] = b[i];
> >>
> > simply runs at the maximum rate this hardware can run it, and it is doubtful
> > that any future implementation will require a different means to run it as
> > fast as the machine can perform the desired semantics.
<
> Making this universally true would require some sort of "arcane magic"...
<
It is called the Virtual Vector Method.
<
>
>
> Then again, if 'a' and 'b' are "long long *" or similar, it will already
> run at full L2 speed on BJX2 (the difference between 64 and 128 bit
> load/store mostly only becomes obvious for L1 copies).
<
The above statement was under the assumption that a and b are machine types
{{int, long}×{signedness}, float}×{sizes}.
Vectorized My 66000 loops are more often smaller than their scalar counterparts.
<
> 2, vectorizing usually helps;
> 3, don't want vectorizing, this usually makes it worse.
<
Vectorized loops in My 66000 almost never run slower than scalar code
even when the loop is executed once.
<
>
> The compilers typically have no idea which category a given function
> falls into, and compiling everything as if it were scenario 2 is very
> bad for functions which fall into category 3.
>
> If compiling a program that has a lot more scenario 3 functions than
> scenario 2 functions, then vectorizing will often make it worse.
<
Then fix the base hardware so it stops pessimizing the situation.
>
>
> The compiler should at least have "some" idea about the relative call
> frequencies and average 'n' values and similar. Given that it doesn't,
> any decisions it can make in these areas are suspect at best.
>
> Granted, if one replaces something like:
> void VectorAdd(vec_t *a, vec_t *b, vec_t *c, int n)
> {
> int i;
> for(i=0; i<n; i++)
> c[i]=a[i]+b[i];
> }
> With:
> void VectorAdd3(vec_t *a, vec_t *b, vec_t *c)
> {
> int i;
> for(i=0; i<3; i++)
> c[i]=a[i]+b[i];
> }
>
> Generally, the compiler will do a somewhat better job, where it will
> typically just unroll the loop body into the equivalent of, say:
> c[0]=a[0]+b[0];
> c[1]=a[1]+b[1];
> c[2]=a[2]+b[2];
<
And above you spoke about code bloat........

BGB

unread,
Aug 14, 2021, 1:26:48 PM8/14/21
to
Many programs do their vectors mostly using "float *" pointers.
Granted, yes, it kinda sucks, don't necessarily want the compiler making
it worse.

>>
>>
>>>>
>>>> Otherwise, it is like saying that no one can use inline ASM, or that one
>>>> will refuse to use a compiler which supports using inline ASM.
>>>
>>> No. Inline asm has its place, for example in system headers,
>>> to access features which are not otherwise accessible.
>>>
>>> However, if I would have to jump through hoops like the above
>>> for a new architecture...
>>>
>>
>> No one is forced to use the language extensions...
>
> No.
>
> I'm beginning to sound repetetive there...
>
> If your architecture needs language extensions to get adequate
> performance because you do not want auto-vectorization on principle,
> people who do not use the extensions will not get adequate performance,
> so they are likely not to use your architecture at all.
>
> Including myself.
>

As I see it, it is better to get consistent and reliable results for
baseline C code, rather than necessarily the fastest possible results.

If an optimization can't be done in a way that is "mostly invisible", or
its results tend to be rather inconsistent, then as I see it, it
probably shouldn't be the default.


> [...]
>
>> Unlike GLSL though, neither BGBCC's nor GCC's vector system includes a
>> way to natively express things like matrices or matrix math though.
>
> Use Fortran :-)
>

Dunno.

Don't have compiler support for it in BGBCC, ...


My vector system was fairly useful in writing my OpenGL rasterizer,
which sort of turned into a big detour into working on compiler support
for SIMD and vector operations.

But, as it was prior to this point, getting passable performance
otherwise looked pretty hopeless. Still ended up needing to write much
of the core in ASM.

Ironically, despite "kinda sucking", and being limited to 50MHz, my GL
backend still manages to be faster on BJX2 than a port of the C version
running on a RasPi (despite the RasPi having roughly a 20x clock-speed
advantage).

However, the RasPi is drastically faster when it comes to running
software rendered Doom and Quake (or most other "general purpose" code).


>>
>>
>>>> One can just write traditional scalar code, and have it perform as such.
>>>> Its performance may suck in comparison, but it isn't precluded.
>>>
>>> Sure.
>>>
>>> However, even with the sorry state of auto-vectorization, it often
>>> generates better code than pure scalar code (and compilers are
>>> indeed getting better at this). You are saying you do not want
>>> this, and want to force the user to write your specialized version
>>> if decent (even if non-optimal) performance is required.
>>>
>>> Your architecture, your choice. Just count me out.
>>>
>>
>> I would be less opposed to auto vectorization if it could be done in
>> ways which weren't prone to occasionally break stuff or make performance
>> worse than in the baseline language.
>
> Profile, and use -fno-tree-vectorize in gcc if it really makes
> things slow.
>

The "vectorizing makes things suck" issue is a lot more common in MSVC IME.

But, they don't have a way to turn it off on x64 with "/O2" or similar,
best one can do is use "/O1" or "/Os" for code where it is a problem.


As noted, it is mostly a problem case for code with lots of functions
manipulating small vectors via "float *" pointers or similar.


Comparably, doing vectors as structs containing the elements avoids this
issue, though I did run into a problem case before where structs larger
than 16 bytes were significantly slower than structs of 16 bytes or less,
eg:
12 or 16 bytes, MSVC was using direct register-based copies;
24 bytes, MSVC was setting up and using "REP MOVSB" (not ideal).

This issue led to a funky workaround of bit-packing 3 truncated doubles
into 128 bits, since the performance of the bit-twiddling to
unpack/repack at the point of use was better than that of passing them
around as 3x double structs.
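
The exact layout isn't given; as a sketch of the general idea (my guess
at one possible encoding: keep the top 42 bits of each double, three of
which fit in 128 bits):

#include <stdint.h>
#include <string.h>

typedef struct { uint64_t lo, hi; } packed_vec3;   /* 128-bit container */

/* keep the top 42 bits of a double (sign, exponent, upper mantissa) */
static inline uint64_t trunc42(double d)
{
    uint64_t u; memcpy(&u, &d, 8);
    return u >> 22;
}
static inline double expand42(uint64_t t)
{
    uint64_t u = t << 22; double d; memcpy(&d, &u, 8);
    return d;
}

static inline packed_vec3 pack_vec3(double x, double y, double z)
{
    uint64_t a = trunc42(x), b = trunc42(y), c = trunc42(z);
    packed_vec3 p;
    p.lo = a | (b << 42);            /* all of x, low 22 bits of y  */
    p.hi = (b >> 22) | (c << 20);    /* high 20 bits of y, all of z */
    return p;
}

static inline void unpack_vec3(packed_vec3 p, double *x, double *y, double *z)
{
    const uint64_t m42 = (1ULL << 42) - 1;
    *x = expand42(p.lo & m42);
    *y = expand42(((p.lo >> 42) | (p.hi << 22)) & m42);
    *z = expand42((p.hi >> 20) & m42);
}

The attraction being that the whole value then moves in two 64-bit
registers, so the compiler never falls back to a memory-block copy for it.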


>>
>> I also have similar reservations about strict aliasing semantics.
>>
>> But, granted, then one can argue that it is an uphill battle to convince
>> programmers to use the 'restrict' keyword or similar (or argue that
>> 'restrict' is useless if the compiler uses strict aliasing semantics by
>> default, ...), but alas.
>
> If a compiler were to assume restrict semantics for C code, that would
> be a serious bug and should be reported and fixed.
>
> Strict aliasing semantics is something else: Assume an architecture
> that has different register types for float and int (as is usual).
> The value of a variable has been loaded into a float register. A
> store through a pointer to int occurs. Should the value in the
> floating point register be invalidated, yes or no?
>
> What does it do on your architecture?
>

My ISA has a single register space for everything.


Default case for the compiler is to assume every (explicit) memory load
or store needs to be done as-written (doesn't cache anything loaded from
a structure or array, doesn't delay storing back any assignment to a
field or element).

So:
i=foo->bar->x;
j=foo->bar->y;
Will perform the 'foo->bar' part twice, though it can be (manually)
optimized via something like:
p=foo->bar;
i=p->x;
j=p->y;

A similar optimization is generally effective with MSVC, which tends to
have the same general behavior (so, for perf reasons one generally needs
to manually cache any values they are working with in local variables).


There are exceptions for variables with implicit load-store:
Stack variables are assumed to not alias unless their address is taken;
Typical case is to spill local variables back to memory at the end of a
basic-block (unless statically assigned to a register);
Global variables are always spilled at the end of each basic block;
Members of 'this' are treated basically like globals.

The compiler defaults to using heuristics to decide if and when a local
variable should be statically assigned to a register, but interprets the
"register" keyword as a hint to do so.

Temporaries frequently also end up getting spilled to the stack,
but this is more because it is difficult to avoid this without
introducing other bugs (though avoiding these spills can result in a
fairly obvious speedup, I haven't yet come up with a good heuristic to
"prove" that a given temporary is no longer relevant).

Then again, I guess it is possible I could add a "DISCARD" IR op or
similar, which is generated implicitly by the frontend and flags
temporaries whose value is no longer relevant (potentially
allowing the register spill to be avoided).


Eg:
register int i;
Add "1000 points" to 'i' being picked for being assigned to a register.

It also adds a bunch of points toward the function being allowed to use 32
GPRs, where by default small functions are limited to only using R8..R14
for variables, and need to cross a certain register-pressure threshold
to enable use of the rest of the registers.

At present, R32..R63 are not generally used by the compiler. This may
change, but some initial testing implies the potential gains from using
R32..R63 in general purpose code are relatively minor (they only really
"make sense" for functions with an extremely high register pressure).


Cases where they make sense are mostly things like walking the edges of
a triangle in the GL rasterizer or doing matrix multiply (the 64 GPR
case having sufficient registers to perform the matrix multiply entirely
in registers).


>>>> IME, auto vectorization only sometimes helps on traditional targets, and
>>>> frequently (if the optimizer is overzealous) can turn into a wrecking
>>>> ball on the performance front (performing worse, sometimes
>>>> significantly, than its non vectorized counterpart).
>>>
>>> Rarely, and if it actually turns out to be a problem, you can turn
>>> it off.
>>>
>>
>> Except when the #pragma that is supposed to disable it, or the command
>> line options to turn it off, weren't working.
>
> Again, a serious bug that should be reported and fixed.
>
>> Though, this being with
>> MSVC on X64; which seems to have lost the notion of being able to do an
>> optimized build with vectorization disabled (well, unless one is using
>> "/Os" rather than "/O2").
>
> Hm, maybe Microsoft doesn't care as much about bug reports...
>

Looking into it, I think the answer is that there is currently no way to
disable vectorization in MSVC in /O2 on x64 targets...

Why, dunno, seems kinda stupid given how hit/miss it is.


> [...]
>
>>> One problem, of course, is C's over-dependence on pointers, which
>>> make a lot of vectorization options impossible because the compiler
>>> cannot tell that there is no aliased load and stores.
>>>
>>
>> Some of the compilers assume non-aliasing in situations where the
>> aliasing does occur. This can lead to code misbehaving and crashing.
>
> If the code is not in accordance with the language specification,
> what should a compiler do?
>
> Is there a second specification somewhere for what would be an
> acceptable sort of violation of the language standards? Which body
> decided on it and passed it?
>
> If this is an available and generally accepted document, maybe it
> would be possible to have compiler writers use it.
>

Usual answer is legacy code, eg:
Does this code originally written for Watcom C on MS-DOS still work as
expected?...

Granted, this is after allowing for more egregious issues to be fixed
(eg: array overruns or code which breaks if not using 32-bit pointers).

Ironically, it is usually easier to first port this code to MSVC on x64,
and then to BGBCC or GCC or Clang or similar.


>> Compiler writers then pass blame off on the code for containing strict
>> aliasing violations or similar (and/or treat strict aliasing as opt-out
>> rather than opt-in).
>
> Ah, yes... where is that document?
>

No formal document, just piles of "legacy code", and implicit rules for
"stuff has always worked this way".

There is a lot of this stuff...


>> Ironically, that leaves MSVC as one of the few compilers where it is
>> disabled by default.
>
> Maybe Microsoft doesn't care much about standards.
>

They have generally been pretty good about keeping legacy code running.

The code that is compiled may be 30 or 40% slower than the same code
compiled with GCC or Clang, it was stuck at roughly C95 levels until
roughly VS2015, and the binaries may be 2x or 3x larger, but like, it
works...

IIRC, MS had experimented with using an LLVM/Clang based alternative
with Visual Studio, but results were mixed, since Clang is prone to
break a lot of existing code.

BGB

unread,
Aug 14, 2021, 1:52:53 PM8/14/21
to
OK.

If you do something like:
char *cs, *ct;
...
while(*cs)
*ct++=*cs++;

On BJX2, its performance will "kinda suck pretty hard".
This construct manages to be slower than the DRAM copy speed.
OK, this is much less true of SSE.

There is AVX, but I generally didn't enable AVX in my builds because:
If one enables it, MSVC tries to autovectorize with it;
It manages to often make the performance situation worse than had one
not enabled it.

I am not entirely sure what MS's reasoning is behind the behavior of
their vectorizer, or why, given its issues, they don't have any explicit
command line options to control its behavior.


They don't seem to have any option for "no, don't use vectorization; but
I still want to be able to use it manually when it makes sense...".

Seemingly, you either get full autovectorization, or code is not allowed
to use SIMD at all...


>>
>> The compilers typically have no idea which category a given function
>> falls into, and compiling everything as if it were scenario 2 is very
>> bad for functions which fall into category 3.
>>
>> If compiling a program that has a lot more scenario 3 functions than
>> scenario 2 functions, then vectorizing will often make it worse.
> <
> Then fix the base hardware so it stops pessimizing the situation.

Could be.


>>
>>
>> The compiler should at least have "some" idea about the relative call
>> frequencies and average 'n' values and similar. Given that it doesn't,
>> any decisions it can make in these areas are suspect at best.
>>
>> Granted, if one replaces something like:
>> void VectorAdd(vec_t *a, vec_t *b, vec_t *c, int n)
>> {
>> int i;
>> for(i=0; i<n; i++)
>> c[i]=a[i]+b[i];
>> }
>> With:
>> void VectorAdd3(vec_t *a, vec_t *b, vec_t *c)
>> {
>> int i;
>> for(i=0; i<3; i++)
>> c[i]=a[i]+b[i];
>> }
>>
>> Generally, the compiler will do a somewhat better job, where it will
>> typically just unroll the loop body into the equivalent of, say:
>> c[0]=a[0]+b[0];
>> c[1]=a[1]+b[1];
>> c[2]=a[2]+b[2];
> <
> And above you spoke about code bloat........


You have no idea the horrors the compiler will unleash...

Like, a simple loop, if one looks at the ASM output, can turn into pages
upon pages of cruft...

Simply executing the loop body N times can be small and concise in
comparison, and also performs a lot better for small N.


With older versions of MSVC, it also seemed to have some sort of thing
where it would try to transform anything that looked like a memory copy
or memory zeroing loop into "REP MOVSB" or "REP STOSB", which wouldn't
be too bad (at least its encoding is small), apart from the fact that it
is needlessly slow (at least on most of the CPUs I was using).

Though, this appears to have gone away in more recent versions.

BGB

unread,
Aug 14, 2021, 2:07:00 PM8/14/21
to
Except when using a compiler which is overzealous and uses
transformations in places where they are not appropriate.


>> 2) There is the issue of code being transportable to
>> computer systems which don't have a vector feature.
>
> Compiling language-level vector operations to scalar native code is
> trivial, unlike the other direction (auto-vectorization).
>

Agreed.

IMO, support for vector style operations should ideally be provided by
the compiler whether or not the underlying hardware supports them
natively, since it is usually fairly straightforward to decompose them
into loops or similar.

Auto-vectorization has been a long standing crap-show IMHO.

Maybe works well for benchmarks or high-profile libraries, where the
compiler can be tuned for it, but "in the wild" the results tend to not
be nearly so good.

MitchAlsup

unread,
Aug 14, 2021, 2:13:19 PM8/14/21
to
On Saturday, August 14, 2021 at 12:26:48 PM UTC-5, BGB wrote:
> On 8/14/2021 2:18 AM, Thomas Koenig wrote:
> > BGB <cr8...@gmail.com> schrieb:

> >
> > Strict aliasing semantics is something else: Assume an architecture
> > that has different register types for float and int (as is usual).
> > The value of a variable has been loaded into a float register. A
> > store through a pointer to int occurs. Should the value in the
> > floating point register be invalidated, yes or no?
> >
> > What does it do on your architecture?
> >
> My ISA has a single register space for everything.
<
The single register file solves a bunch of issues, this being one of them.
It also reduces the surprise factor when separately compiled subroutines
pass data of the wrong types.
<
x = foo( float );
.....
float foo( int ) {}
<
and reduces the idiosyncrasy of varargs.
>
>
> Default case for the compiler is to assume every (explicit) memory load
> or store needs to be done as-written (doesn't cache anything loaded from
> a structure or array, doesn't delay storing back any assignment to a
> field or element).
>
> So:
> i=foo->bar->x;
> j=foo->bar->y;
> Will perform the 'foo->bar' part twice, though it can be (manually)
> optimized via something like:
> p=foo->bar;
> i=p->x;
> j=p->y;
<
Unless i is in the same memory space as foo-> or bar->, that optimization is
always allowed. So if foo-> and bar-> are data memory and i has been
allocated to the local stack or a register and cannot alias, then the optimization
is legal.
<snip>
> >
> > Hm, maybe Microsoft doesn't care as much about bug reports...
<
Unless generated from within MS.
> >
> Looking into it, I think the answer is that there is currently no way to
> disable vectorization in MSVC in /O2 on x64 targets...
>
> Why, dunno, seems kinda stupid given how hit/miss it is.
> > [...]
> >
> >>> One problem, of course, is C's over-dependence on pointers, which
> >>> make a lot of vectorization options impossible because the compiler
> >>> cannot tell that there is no aliased load and stores.
> >>>
> >>
> >> Some of the compilers assume non-aliasing in situations where the
> >> aliasing does occur. This can lead to code misbehaving and crashing.
> >
> > If the code is not in accordance with the language specification,
> > what should a compiler do?
> >
> > Is there a second specification somewhere for what would be an
> > acceptable sort of violation of the language standards? Which body
> > decided on it and passed it?
> >
> > If this is an available and generally accepted document, maybe it
> > would be possible to have compiler writers use it.
> >
> Usual answer is legacy code, eg:
> Does this code originally written for Watcom C on MS-DOS still work as
> expected?...
<
I might note: My 66000 vectorization violates none of the assumptions
old code may have used, nor does it violate the aliasing rules of strict von Neumann
ordering. Code just runs slower when these aliasings occur. CRAY-style
vectors blow up in their entirety, and so do SIMD vectors.
<
>
> Granted, this is after allowing for more egregious issues to be fixed
> (eg: array overruns or code which breaks if not using 32-bit pointers).
<
One can (CAN) fix it in architecture (ISA) design. And given the sins of the
past remaining with us for eternity, that seems the best way forward.
>

Brian G. Lucas

unread,
Aug 14, 2021, 2:19:01 PM8/14/21
to
Well, for Fortran we have to wait until the LLVM Fortran front-end "flang"
can compile the above code to test My66000.

brian

Thomas Koenig

unread,
Aug 14, 2021, 2:56:00 PM8/14/21
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>> Thomas Koenig <tko...@netcologne.de> writes:
>>[...]
>>
>>>>Explicit language extensions
>>>>
>>>>- lock in the user to a specific architecture and compiler
>>>
>>> Really?
>>>
>>> APL locks the user to a specific architecture and compiler? I know of
>>> several APL implementations for a number of architectures.
>>>
>>> Fortran array expressions lock the user to a specifc architecture and
>>> compiler?
>>
>>A language extension is, by definition, something that is not
>>in the standard.
>
> By definition? Where do I find this definition?

It's what is commonly used.

Example:

https://gcc.gnu.org/onlinedocs/gfortran/Extensions.html

> Certainly the Forth standard contains word sets with optional extensions:
>
>|Each word set may have an extension, containing words that offer
>|additional functionality.
><https://forth-standard.org/standard/intro>:
>
>>Array expressions have been in the Fortran standard since 1991,
>>so they are, by definition, not a language extension.
>
> The are not in Fortran 77 or earlier standards, so they certainly
> extend these standards.

A new standard is a new standard, not an extension of an existing
standard.

However, I will call it "non-standard feature" from now on, if
it makes you feel better.

>>>>- make code hard to write, read and thus maintain
>>>
>>> I expect that APL and Fortran programmers will disagree.
>>
>>I am a Fortran programmer, as you may have noticed.
>>
>>The point was extending C with non-standardized constructs
>
> Maybe that's what you understood. But the whole subthread started
> with me stating that manual vectorization is better, citing Sections
> 2.2-2.4 of <http://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf>
> which happen to give Fortran 90 array operations as one example, and
> do not mention C at all. BGB took the cue and mentioned language
> extensions without, at first, mentioning C.

And I consider trying to push users into using a non-standard
feature for production code harmful. What will they do when the
compiler which supports that particular feature is discontinued?
What will they do if they want to run their code on a platform
which does not have that particular compiler?

It is also well-known that non-standard features are often poorly
integrated with the rest of the language they extend.


>
> And in the discussion about auto-vectorization vs. manual
> vectorization that is certainly a very relevant point: You don't
> auto-vectorize an array language like APL, you auto-vectorize a scalar
> language. And you cannot manually express vectorization in a scalar
> language like Fortran 66 (reigning when the Cray-1 came to the market)
> or Fortran 77, you need to extend it, e.g., with the Fortran 90 array
> expressions.

The Fortran compilers I know (very well for gfortran, somewhat for
NAG) rely on auto-vectorization to get the loops they translate their
array expressions to to vectorize, or on Cray-style IVDEP directives.

> I think that opaque vectors are better than working with
> transparent arrays, but even with the Fortran 90 array stuff, it is
> possible (with a little discipline) to write code that makes things as
> easy for the compiler as opaque vectors.

It took until Fortran 2008 for a very important detail to be added
to the language: The contiguous attribute for arrays...

BGB

unread,
Aug 14, 2021, 5:14:19 PM8/14/21
to
On 8/14/2021 1:13 PM, MitchAlsup wrote:
> On Saturday, August 14, 2021 at 12:26:48 PM UTC-5, BGB wrote:
>> On 8/14/2021 2:18 AM, Thomas Koenig wrote:
>>> BGB <cr8...@gmail.com> schrieb:
>
>>>
>>> Strict aliasing semantics is something else: Assume an architecture
>>> that has different register types for float and int (as is usual).
>>> The value of a variable has been loaded into a float register. A
>>> store through a pointer to int occurs. Should the value in the
>>> floating point register be invalidated, yes or no?
>>>
>>> What does it do on your architecture?
>>>
>> My ISA has a single register space for everything.
> <
> The single register file solves a bunch of issues, this being one of them.
> It also reduces the surprise factor when separately compiled subroutines
> pass data of the wrong types.
> <
> x = foo( float );
> .....
> float foo( int ) {}
> <
> and reduces the idiosyncrasy of varargs.

Yeah. The varargs are a little more stable as pretty much all 32 or 64
bit types are passed in the same way.

Passing 128-bit elements is still a special case though, since these use
paired registers rather than a single register.


I am also left with internal debate as to whether to change the ABI to
only pass 128-bit types using even-paired-registers, since this is what
the ISA requires to operate on them (and otherwise the arguments may be
in a form which is unusable until moved into different registers or
similar).

The tradeoff is that this would effectively "lose" an argument register
for cases which pass a 128-bit argument in a way where it would otherwise be
aligned on an odd register (though, this would likely be "less bad" than
needing to deal with misaligned vector arguments).

...


>>
>>
>> Default case for the compiler is to assume every (explicit) memory load
>> or store needs to be done as-written (doesn't cache anything loaded from
>> a structure or array, doesn't delay storing back any assignment to a
>> field or element).
>>
>> So:
>> i=foo->bar->x;
>> j=foo->bar->y;
>> Will perform the 'foo->bar' part twice, though it can be (manually)
>> optimized via something like:
>> p=foo->bar;
>> i=p->x;
>> j=p->y;
> <
> Unless i is in the same memory space as foo-> or bar-> that optimization is
> always allowed. So if foo-> and bar-> are data memory and I has been
> allocated to the local stack or a register and cannot alias, then the optimization
> is legal.
> <snip>

It is legal, but my compiler doesn't do it...

Typically, MSVC doesn't do it either as far as I can tell.


>>>
>>> Hm, maybe Microsoft doesn't care as much about bug reports...
> <
> Unless generated from within MS.

One can also point out that it took nearly two decades before they
really started working on adding C99 support.


Trying recently to use some "fenv.h" stuff for sake of working on adding
it to BGBCC, one can observe that MSVC doesn't support the standard pragma.

So, eg:
#pragma STDC FENV_ACCESS ON
MSVC: "What the hell is this?..."
Meanwhile:
#pragma fenv_access(on)
MSVC: "OK"


I recently added these to BGBCC, which are sort-of required for "fenv"
to actually do anything; By default, the FPU only uses a fixed rounding
mode and does not record status flags. Though, the user-supplied
rounding mode/... will only have an effect within functions compiled
with the pragma. This did mean giving in and adding status and
control flags for this stuff (kinda awkwardly shoved into GBR(63:48) and
ignored by default).

In effect, there is now sort of a special (fixed) rounding mode which
tells the FPU to pull its rounding mode from these bits and update the
flag bits. Not super happy about this, but this appeared basically
necessary to get the described semantics.


Implicitly, the emulator also ended up using "fenv.h" functions for sake
of adding operations with an explicit rounding mode (it was either this,
or fake it in software), which is where the above came from.

For BGBCC, I added both pragmas and more-or-less treat them as equivalent.
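
For reference, a minimal sketch of the kind of standard C99 code those
pragmas exist to make well-defined (compiles as-is with GCC/Clang):

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON   /* tell the compiler the FP environment is live */

int main(void)
{
    fesetround(FE_UPWARD);               /* change the dynamic rounding mode */
    volatile double one = 1.0, three = 3.0;
    printf("%.20f\n", one / three);      /* divided at run time, rounded up  */

    feclearexcept(FE_ALL_EXCEPT);
    volatile double big = 1e308;
    big *= 10.0;                         /* raises FE_OVERFLOW               */
    if (fetestexcept(FE_OVERFLOW))
        printf("overflow flag is set (big = %g)\n", big);
    return 0;
}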


>>>
>> Looking into it, I think the answer is that there is currently no way to
>> disable vectorization in MSVC in /O2 on x64 targets...
>>
>> Why, dunno, seems kinda stupid given how hit/miss it is.
>>> [...]
>>>
>>>>> One problem, of course, is C's over-dependence on pointers, which
>>>>> make a lot of vectorization options impossible because the compiler
>>>>> cannot tell that there is no aliased load and stores.
>>>>>
>>>>
>>>> Some of the compilers assume non-aliasing in situations where the
>>>> aliasing does occur. This can lead to code misbehaving and crashing.
>>>
>>> If the code is not in accordance with the language specification,
>>> what should a compiler do?
>>>
>>> Is there a second specification somewhere for what would be an
>>> acceptable sort of violation of the language standards? Which body
>>> decided on it and passed it?
>>>
>>> If this is an available and generally accepted document, maybe it
>>> would be possible to have compiler writers use it.
>>>
>> Usual answer is legacy code, eg:
>> Does this code originally written for Watcom C on MS-DOS still work as
>> expected?...
> <
> I might note: My 66000 vectorization violates none of the assumptions
> old code may have used, nor does it violate the aliasing rules of strict von Neumann
> ordering. Code just runs slower when these aliasings occur. CRAY-style
> vectors blow up in their entirety, and so do SIMD vectors.
> <

Yeah.


>>
>> Granted, this is after allowing for more egregious issues to be fixed
>> (eg: array overruns or code which breaks if not using 32-bit pointers).
> <
> One can (CAN) fix it in architecture (ISA) design. ANd given the sins of the
> past remaining with us for eternity, that seems the best way forward.

Array overruns and pointer size issues fall into the "this really should
be fixed" category, at least when porting the code to a new architecture
which may differ in these areas.

Though, this does sort of mean (in my case) being gradually forced into
making more use of "stdint.h" types and similar, since now pretty much
all targets I use support it (since it has types to deal with things
like target-specific pointer sizes, ...).

BGB

unread,
Aug 14, 2021, 7:04:48 PM8/14/21
to
On 8/14/2021 2:58 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> IME, moving to vectors larger than 128 bits doesn't tend to gain much
>> over 128 bits.
>
> What experience is that?
>

Various, mostly 3D programming, signal processing, ...
Also "general reasoning about what is going on".


>> As the register size gets larger, there is less that can
>> utilize it effectively.
>
> Less, yes, but so much that it does not gain much?
>

Typically.

There are typically other factors that counter-balance any gains.


Typically a lot has to do what sort of granularity one is working with.
Potential gains drop off sharply when one exceeds the granularity of the
data they are working with.

It is much like the utility of larger integer types dropping off once
one exceeds the range of the values which they are used to contain.


>> Once the SIMD vector exceeds the length of the data one would likely
>> express using vectors, its utility drops off sharply.
>
> Pictures tend to be much larger than 128 bits. Sounds tend to be much
> larger than 128 bits. Videos tend to be much larger than 128 bits.
>

While the image as a whole is bigger, things like individual pixels
don't get any bigger.


One can process multiple pixels at a time, but what one gains from
working on multiple pixels at once may be less than what one gains
from being able to work on a whole pixel at a time.

Packing multiple logical pixels in each vector may start to face
diminishing returns in terms of needing to deal with more wrangling to
get the values where they need to go, ...


Similar with things like position vectors:
These fit nicely into 128 bits (3x or 4x single precision).

But, if doing calculation on position vectors, one may observe they
don't actually have much use or need for a larger vector than what is
needed to hold the coordinates.

Depending on the task, doing the same calculation on two sets of vectors
at a time may be irrelevant.

However, the gains from being able to work with a full vector at a time,
vs doing the math using a bunch of scalar operations, may be significant.


Say, for example, one is dealing with an operation like:
dv = obja.pos - objb.pos;
r1 = obja.rad;
r2 = objb.rad;
d2 = dot3(dv, dv);
dm = r1*r1 + r2*r2;
return(dm >= d2);

Say, the algorithm checks one bounding sphere against another sphere at
a time, but a larger scale change, like checking two spheres against two
other spheres, isn't worthwhile.


Well, and also one may note that auto-vectorization typically isn't
exactly going to reorganize the high-level parts of the algorithm to
perform multiple sets of sphere-checks in parallel. It likely can't see
the algorithm much past the collision function itself, say:
if(checkSphereCollideP(obja, objb))
{ doCollideStuff }

These sorts of limitations gets more significant if the code is written
according to traditional OOP conventions.

...


> Even for stuff like memcpy and memset, the length is often larger than
> 128 bits (16 bytes).
>

Doesn't necessarily mean these will benefit from a larger vector either.


If one has a vector which is twice the size, but the load or store
operation takes twice as long as the 128-bit load or store, is it still
a win?...

There also tends to be sort of a "RAM bandwidth wall", which once one
runs into it, there is little point in trying to optimize further.


One might observe that for larger copies, there is an unavoidable limit
at around 4GB/s, and for small copies (under 16K) a limit at around
16GB/s, and using YMM registers does not help.

More so, copying buffers in the MB range, they may hit this wall just
using QWORD moves.


Throwing multiple processes at the problem, one might observe that
there is a system-wide limit at around 11 GB/s or so, at which point
spawning more processes just causes everything else to get slower, ...

MitchAlsup

unread,
Aug 14, 2021, 10:06:25 PM8/14/21
to
On Saturday, August 14, 2021 at 4:14:19 PM UTC-5, BGB wrote:
> On 8/14/2021 1:13 PM, MitchAlsup wrote:
> > On Saturday, August 14, 2021 at 12:26:48 PM UTC-5, BGB wrote:
> >> On 8/14/2021 2:18 AM, Thomas Koenig wrote:
> >>> BGB <cr8...@gmail.com> schrieb:
> >
> >>>
> >>> Strict aliasing semantics is something else: Assume an architecture
> >>> that has different register types for float and int (as is usual).
> >>> The value of a variable has been loaded into a float register. A
> >>> store through a pointer to int occurs. Should the value in the
> >>> floating point register be invalidated, yes or no?
> >>>
> >>> What does it do on your architecture?
> >>>
> >> My ISA has a single register space for everything.
> > <
> > The single register file solves a bunch of issues, this being one of them.
> > It also reduces the surprise factor when separately compiled subroutines
> > pass data of the wrong types.
> > <
> > x = foo( float );
> > .....
> > float foo( int ) {}
> > <
> > and reduces the idiosyncrasy of varargs.
> Yeah. The varargs are a little more stable as pretty much all 32 or 64
> bit types are passed in the same way.
>
> Passing 128-bit elements is still a special case though, since these use
> paired registers rather than a single register.
<
My ABI passes them in consecutive registers without even-odd pairing.
Arguments as large as 8 DoubleWords can be passed (either direction)
in the registers.
>
>
> I am also left with internal debate as to whether to change the ABI to
> only pass 128-bit types using even-paired-registers, since this is what
> the ISA requires to operate on them (and otherwise the arguments may be
> in a form which is unusable until moved into different registers or
> similar).
<
I would recommend against even-odd pairings, but recommend simply
consecutive registers.
It is sort of like having to shoot yourself in the foot all the while hoping
the rubber balls do as little damage as possible.
>
<snip>
> >> Granted, this is after allowing for more egregious issues to be fixed
> >> (eg: array overruns or code which breaks if not using 32-bit pointers).
> > <
> > One can (CAN) fix it in architecture (ISA) design. And given the sins of the
> > past remaining with us for eternity, that seems the best way forward.
<
> Array overruns and pointer size issues fall into the "this really should
> be fixed" category, at least when porting the code to a new architecture
> which may differ in these areas.
<
We had a graphics related bug where graphics memory used even the last
byte covered by an MTRR, and ended up prefetching across the boundary
into MMIO space-----------waiting for the laughter to die down........

Marcus

unread,
Aug 15, 2021, 4:28:34 AM8/15/21
to
On 2021-08-13 18:20, MitchAlsup wrote:
> On Friday, August 13, 2021 at 1:53:04 AM UTC-5, Marcus wrote:

[snip]

>> Then I created my own vector ISA and wrote some demos, and I was like
>> "Wow! It's really a joy to program this thing".
> <
> You would enjoy programming in My 66000 ISA.

I am pretty sure I would!

I willingly admit that your My 66000 and its VVM architecture is the
most elegant, programmer friendly and compiler friendly ISA that I have
seen. I am hoping to see it hit silicon.

/Marcus

Thomas Koenig

unread,
Aug 15, 2021, 4:43:12 AM8/15/21
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Saturday, August 14, 2021 at 12:26:48 PM UTC-5, BGB wrote:
>> On 8/14/2021 2:18 AM, Thomas Koenig wrote:
>> > BGB <cr8...@gmail.com> schrieb:
>
>> >
>> > Strict aliasing semantics is something else: Assume an architecture
>> > that has different register types for float and int (as is usual).
>> > The value of a variable has been loaded into a float register. A
>> > store through a pointer to int occurs. Should the value in the
>> > floating point register be invalidated, yes or no?
>> >
>> > What does it do on your architecture?
>> >
>> My ISA has a single register space for everything.
><
> The single register file solves a bunch of issues, this being one of them.

It's not a problem per se.

I was merely pointing out that on some architectures this is an issue,
which is why it makes sense for C to put the aliasing restriction into
the language as long as these architectures are supported.

> It also reduces the surprise factor when separately compiled subroutines
> pass data of the wrong types.
><
> x = foo( float );
> .....
> float foo( int ) {}

The "surprise factor" is that the code is dead wrong in just about
any reasonable language.

A compiler should make every reasonable effort to catch this error
and report it to the user, and language design should make every
effort to make sure it does not happen.

Case in point:

A recent version of gfortran will discover and reject the previously
undiscovered error in

SUBROUTINE FOO
CALL BAZ(1)
END

SUBROUTINE BAR
CALL BAZ(1.0)
END

with

wrong.f:6:15:

2 | CALL BAZ(1)
| 2
......
6 | CALL BAZ(1.0)
| 1
Error: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(4)/INTEGER(4)).

I put in that error message, and it certainly led to discovery
of some decades-old bugs in existing programs. Many people were
glad, some less so.

Thomas Koenig

unread,
Aug 15, 2021, 4:48:29 AM8/15/21
to
Brian G. Lucas <bag...@gmail.com> schrieb:
... and wait for an implementation to test it on to see if it
does the right thing. I certainly could not tell from reading
the assembler code.

I guess the question on when such an implementation might be available
as a soft core or in hardware would remain unanswered, so this
is a rather theoretical discussion...

BGB

unread,
Aug 15, 2021, 10:36:11 AM8/15/21
to
Yeah, this is one practical difference with my BJX2 project.

A prototype exists, it is on GitHub:
https://github.com/cr88192/bgbtech_btsr1arch

Whether or not it is "good" is another matter...

I was generally focusing more on things I could realistically implement,
rather than necessarily on design elegance. There are things I could
have done differently in retrospect, and layers of hackery have resulted
in a certain amount of cruft.

...

BGB

unread,
Aug 15, 2021, 11:04:50 AM8/15/21
to
My case is only 1 or 2 registers (64 or 128 bits).
Anything larger is passed by reference.

>>
>>
>> I am also left with internal debate as to whether to change the ABI to
>> only pass 128-bit types using even-paired-registers, since this is what
>> the ISA requires to operate on them (and otherwise the arguments may be
>> in a form which is unusable until moved into different registers or
>> similar).
> <
> I would recommend against even-odd pairings, but recommend simply
> consecutive registers.

Thus far I had been using (mostly) consecutive registers, though this
makes a few annoyances, namely:
* Odd register pairs can't be expressed in the current numbering scheme;
* None of the existing 128-bit ops will work with them;
* Because it is R4..R7, R20..R23, some cases may pass in R20:R7;
* ...

So, this means that storing them to memory requires a pair of MOV.Q
instructions (vs a single MOV.X), or moving them to/from different
registers requires a pair of MOV instructions (vs a single MOVX).
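
As a toy illustration of the two assignment policies being weighed here
(the register names and the 8-slot argument area below are made up for
the sketch, not the actual BJX2 ABI):

#include <stdio.h>

/* 64-bit args take one slot, 128-bit args take two, either starting at
   the next free slot or rounded up to an even slot so the pair stays
   usable by 128-bit register-pair ops */
static void assign(const int sizes[], int n, int even_align_128)
{
    int slot = 0;                     /* slots 0..7 ~ "R4..R7, R20..R23" */
    for (int i = 0; i < n && slot < 8; i++) {
        int need = (sizes[i] == 16) ? 2 : 1;
        if (need == 2 && even_align_128 && (slot & 1))
            slot++;                   /* skip a slot to get an even pair */
        if (slot + need > 8) break;
        printf("arg%d (%2d bytes): slots %d..%d\n",
               i, sizes[i], slot, slot + need - 1);
        slot += need;
    }
}

int main(void)
{
    int sizes[] = { 8, 16, 8, 16 };
    printf("consecutive:\n");  assign(sizes, 4, 0);
    printf("even-aligned:\n"); assign(sizes, 4, 1);
    return 0;
}

With the mixed argument list above, the consecutive policy produces a
misaligned 128-bit pair, while the even-aligned policy wastes one slot
but keeps every pair directly usable.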

The original design made more sense before I added operations which work
with 128-bit register pairs (but which only work on even-numbered
registers).
This wasn't really the direction I wanted to go, but there wasn't really
much good alternative if I wanted the "fenv" stuff to behave roughly as
described in the C99 standard.

Sticking the bits in the high bits of GBR is also a bit of a hack, but
then again, short of adding a new control register, there wasn't much
better.

So, the options are now:
FADD/FSUB/FMUL (3R):
Hard wired to Round-To-Nearest.
Has no effect on status flags.
FADDG/FSUBG/FMULG (3R):
Like the above, but uses a dynamic rounding mode and updates flags.
FADD/FSUB/FMUL (4RI):
Explicit rounding modes, one of which behaves like FADDG.

If FENV_ACCESS is enabled, it uses FADDG/FSUBG/FMULG encodings in place
of the normal FADD/FSUB/FMUL.
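
For context, the C-level behavior this is meant to support looks roughly
like the sketch below (standard <fenv.h> only; whether a given compiler
honors the pragma, and whether it then switches to the flag-updating
dynamic-rounding encodings, is implementation-specific):

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile double a = 1.0, b = 3.0;

    /* the dynamic rounding mode is process state set via fesetround();
       code compiled under FENV_ACCESS must honor it, which is what the
       FADDG/FSUBG/FMULG-style encodings (dynamic mode, flags updated)
       exist to support, versus ops hard-wired to round-to-nearest */
    fesetround(FE_DOWNWARD);
    double lo = a / b;                  /* an easy-to-verify inexact op */
    fesetround(FE_UPWARD);
    double hi = a / b;
    fesetround(FE_TONEAREST);

    printf("lo=%.17g hi=%.17g (differ by one ulp)\n", lo, hi);
    if (fetestexcept(FE_INEXACT))
        printf("inexact flag is set, as expected for 1/3\n");
    return 0;
}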


>>
> <snip>
>>>> Granted, this is after allowing for more egregious issues to be fixed
>>>> (eg: array overruns or code which breaks if not using 32-bit pointers).
>>> <
>>> One can (CAN) fix it in architecture (ISA) design. And given the sins of the
>>> past remaining with us for eternity, that seems the best way forward.
> <
>> Array overruns and pointer size issues fall into the "this really should
>> be fixed" category, at least when porting the code to a new architecture
>> which may differ in these areas.
> <
> We had a graphics related bug where graphics memory used even the last
> byte covered by an MTRR, and ended up prefetching across the boundary
> into MMIO space-----------waiting for the laughter to die down........

Hmm...

In my recent effort to get the MMU working again, it kept deadlocking
the bus.

Turns out the "zero page", being a sort of ROM, rather than quietly
accepting attempts to write dirty cache lines to it and giving the
"Store OK" response, was giving a "Load OK" response. This response
would hit the L1 which would then deadlock.

Where, there are a few special ROM pages:
Zero Page: Any access here returns zeroes;
NOP Page: Any access here returns NOP instructions;
RTS Page: Any access here returns RTS instructions.

The contents of these pages is synthetic, given it is just a repeating
value. Trying to store into the ROM regions is quietly ignored. These
are mostly intended for internal use for memory-management.


Had also added options to expand the Boot ROM to 48K, and optionally the
SRAM to 16K. Though, as-is, a decent chunk of the ROM has ended up being
used for boot-time sanity checking (trying to verify that core parts of
the ISA are giving the expected results).

It was originally intended to be more like the x86 ROM BIOS, but in
effect is mostly just the code to load a boot image from FAT, and a
bunch of code to do sanity testing (the shell also does some amount of
its own sanity testing, generally of higher-level features).

Anton Ertl

unread,
Aug 15, 2021, 11:08:40 AM8/15/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>There is also
>
> a(i:j) = a(i+1:j+1) + c
>
>or, even worse,
>
> a(i+k:j+k) = a(i:j) + c
>
>for which, by the language semantics, the right-hand side has
>to appear to the programmer to be evaluated before the left-hand
>side, no matter what (and no matter the sign of k).

Which makes it trivial to vectorize. Great! You still miss the
alignment and padding benefits of opaque vectors.

If you want to do it without storing the intermediate result, you get
back at least some of the complications that auto-vectorization has to
deal with; but at least you start and end with vectorized code
(possibly with an extra copying operation) rather than with scalar
code in case of auto-vectorization.

>So, without having a way to try it, I am not convinced that My 66000
>does the right thing for a language that is not C that you have not
>tried yourself :-)

My66000's VVM is designed for scalar code. If a simple compiler
compiles the code above to a scalar loop for the addition and a scalar
loop for copying the result, VVM will auto-vectorize both loops. I
don't think it will get rid of the loop for copying the result.
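
In scalar terms the naive lowering is the two loops being talked about.
A rough C rendering of the Fortran above (0-based indexing, and the
compiler-generated intermediate shown as a caller-provided temporary for
simplicity):

/* rough scalar lowering of  a(i:j) = a(i+1:j+1) + c  when the compiler
   does not reason about the overlap: evaluate the right-hand side into
   a temporary first, then copy it back; the second loop is the copy
   that would remain even after auto-vectorization */
void shift_add(double *a, int i, int j, double c, double *tmp)
{
    int n = j - i + 1;
    for (int k = 0; k < n; k++)       /* loop 1: compute the RHS */
        tmp[k] = a[i + 1 + k] + c;
    for (int k = 0; k < n; k++)       /* loop 2: copy into a(i:j) */
        a[i + k] = tmp[k];
}

For this particular overlap direction a single forward loop writing
directly into a(i:j) would in fact be safe; proving that is exactly the
kind of analysis that lets a smarter compiler drop the copy.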

Anton Ertl

unread,
Aug 15, 2021, 12:39:22 PM8/15/21
to
BGB <cr8...@gmail.com> writes:
>While the image as a whole is bigger, things like individual pixels
>don't get any bigger.
>
>
>One can process multiple pixels at a time, but what one gains from
>working with multiple pixels at a time may be less than what they gain
>from being able to work a whole pixel at a time.
>
>Packing multiple logical pixels in each vector may start to face
>diminishing returns in terms of needing to deal with more wrangling to
>get the values where they need to go, ...

For image manipulation tasks like blurring, sharpening, colour or
brightness adjustments, you can benefit from very long SIMD registers.
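
(A concrete instance of the kind of loop meant here: a saturating
brightness adjustment over 8-bit channel values. Each element is
independent, so it maps onto whatever SIMD width the hardware has; the
function below is just an illustrative scalar version, not code from any
particular library.)

#include <stddef.h>
#include <stdint.h>

/* add 'delta' to every 8-bit channel value, clamping to 0..255;
   a compiler or an intrinsics version can process 16/32/64 of these
   per operation on wider SIMD units */
void brighten_u8(uint8_t *px, size_t n, int delta)
{
    for (size_t i = 0; i < n; i++) {
        int v = px[i] + delta;
        px[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}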

>> Even for stuff like memcpy and memset, the length is often larger than
>> 128 bits (16 bytes).
>>
>
>Doesn't necessarily mean these will benefit from a larger vector either.
>
>
>If one has a vector which is twice the size, but the load or store
>operation takes twice as long as the 128-bit load or store, is it still
>a win?...

Not for copying, that's why load/store units tend to get wider over
time, just as SIMD registers get wider. E.g., Haswell and Zen 1 have
128-bit wide load/store units, Skylake and Zen 2 have 256 bit wide
units, and IIRC Ice Lake has 512 bit wide units.

>There also tends to be sort of a "RAM bandwidth wall", which once one
>runs into it, there is little point in trying to optimize further.

Main memory bandwidth tends to get higher over time, as does the
bandwidth to the caches.

>One might observe that for larger copies, there is an unavoidable limit
>at around 4GB/s, and for small copies (under 16K) a limit at around
>16GB/s, and using YMM registers does not help.

Looking at <2019Dec1...@mips.complang.tuwien.ac.at>, I see that
with 16K block sizes, on Skylake the fastest copy copies 29
bytes/cycle (pretty close to the 32b/c speed-of-light), which at 4GHz
(which this Skylake runs at) is 116GB/s. That's with REP MOVSB, where
the performance is very brittle, depending on alignment and block
size. A more stable AVX-based implementation gets 90-101GB/s. The
corresponding SSE-based implementation gets 49-53GB/s.

Where does your unavoidable limit come from?

Anton Ertl

unread,
Aug 15, 2021, 12:56:20 PM8/15/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> Thomas Koenig <tko...@netcologne.de> writes:
>>>Array expressions have been in the Fortran standard since 1991,
>>>so they are, by definition, not a language extension.
>>
>> The are not in Fortran 77 or earlier standards, so they certainly
>> extend these standards.
>
>A new standard is a new standard, not an extension of an existing
>standard.
>
>However, I will call it "non-standard feature" from now on, if
>it makes you feel better.

We were not discussing non-standard features, but extensions (or, if
you prefer, features) in general, whether standardized or not. In
case of Fortran such a feature has been standardized.

>> And in the discussion about auto-vectorization vs. manual
>> vectorization that is certainly a very relevant point: You don't
>> auto-vectorize an array language like APL, you auto-vectorize a scalar
>> language. And you cannot manually express vectorization in a scalar
>> language like Fortran 66 (reigning when the Cray-1 came to the market)
>> or Fortran 77, you need to extend it, e.g., with the Fortran 90 array
>> expressions.
>
>The Fortran compilers I know (very well for gfortran, somewhat for
>NAG) rely on auto-vectorization to get the loops they translate their
>array expressions to to vectorize, or on Cray-style IVDEP directives.

That looks perverse, but I guess that is the result of both wanting to
auto-vectorize benchmarks, and having to also support scalar-only
hardware. So just auto-scalarize the vectorized source code, and then
hope that the back end will auto-vectorize it again.

>It took until Fortran 2008 for a very important detail to be added
>to the language: The contiguous attribute for arrays...

What does that do?

MitchAlsup

unread,
Aug 15, 2021, 1:41:32 PM8/15/21
to
Everytime SPARC had to do some integer manipulation on some FP data,
it had to store it to memory and reload it as integer. There was no movFP2Int
instruction.
Code explosion and you have not even gotten to SIMD......

BGB

unread,
Aug 15, 2021, 2:24:42 PM8/15/21
to
On 8/15/2021 11:08 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> While the image as a whole is bigger, things like individual pixels
>> don't get any bigger.
>>
>>
>> One can process multiple pixels at a time, but what one gains from
>> working with multiple pixels at a time may be less than what they gain
>> from being able to work a whole pixel at a time.
>>
>> Packing multiple logical pixels in each vector may start to face
>> diminishing returns in terms of needing to deal with more wrangling to
>> get the values where they need to go, ...
>
> For image manipulation tasks like blurring, sharpening, colour or
> brightness adjustments, you can benefit from very long SIMD registers.
>

I was not trying to claim larger vectors are useless, but rather that
for many use-cases, if the program isn't structured in a way that can
use them, there is little real advantage.

More so if one is using a processor where the operations aren't any
faster than doing a pair of 128 bit operations.


>>> Even for stuff like memcpy and memset, the length is often larger than
>>> 128 bits (16 bytes).
>>>
>>
>> Doesn't necessarily mean these will benefit from a larger vector either.
>>
>>
>> If one has a vector which is twice the size, but the load or store
>> operation takes twice as long as the 128-bit load or store, is it still
>> a win?...
>
> Not for copying, that's why load/store units tend to get wider over
> time, just as SIMD registers get wider. E.g., Haswell and Zen 1 have
> 128-bit wide load/store units, Skylake and Zen 2 have 256 bit wide
> units, and IIRC Ice Lake has 512 bit wide units.
>

Could be.

Looking into it, the CPU I am using is Zen1 based, which apparently
performs 256-bit operations internally as a pair of 128 bit operations.


>> There also tends to be sort of a "RAM bandwidth wall", which once one
>> runs into it, there is little point in trying to optimize further.
>
> Main memory bandwidth tends to get higher over time, as does the
> bandwidth to the caches.
>

OK.


>> One might observe that for larger copies, there is an unavoidable limit
>> at around 4GB/s, and for small copies (under 16K) a limit at around
>> 16GB/s, and using YMM registers does not help.
>
> Looking at <2019Dec1...@mips.complang.tuwien.ac.at>, I see that
> with 16K block sizes, on Skylake the fastest copy copies 29
> bytes/cycle (pretty close to the 32b/c speed-of-light), which at 4GHz
> (which this Skylake runs at) is 116GB/s. That's with REP MOVSB, where
> the performance is very brittle, depending on alignment and block
> size. A more stable AVX-based implementation gets 90-101GB/s. The
> corresponding SSE-based implementation gets 49-53GB/s.
>
> Where does your unavoidable limit come from?
>

A similar sort of "memory bandwidth wall" seems to appear on both my
prior Piledriver-based PC (AMD FX) and my current Ryzen (Zen 1 based).


It seems like there is a bottleneck between the memory bandwidth
available to a single core and the total RAM bandwidth of the PC:
once one runs 4 processes each doing a full-speed memcpy, the values in
the individual processes start to drop (resulting in the 11GB/s figure).

Whereas, if one has a single thread doing memcpy, it hits a limit at
4.0GB/s.


There seems to be a pair of faster limits, at 8GB/s and 16GB/s,
depending on the size of the buffer being copied:
8GB/s for 32K .. 256K
16GB/s for 0 .. 16K
This being for memcpy style tests.

Numbers do seem to be a little higher for memset style operations than
memcpy though (numbers roughly double here).



On my FX, the memcpy limit seemed to happen at around 3.2 GB/s (other
numbers were smaller by a similar ratio).

The numbers followed a similar pattern, just my Ryzen seems to give
roughly 35% higher memory bandwidth relative to clock-speed vs the
Piledriver based AMD-FX.


Though, due to some weird scheduling quirk, I need to spawn multiple
processes rather than use threads.


I don't know as much how this compares with Intel chips, only Intel
system I have is running a Xeon E5410, which kinda predates AVX.

Though, I did previously note that there does seem to be a difference
here in terms of how the Xeon behaves, but would need more testing here
to say much more.

BGB

unread,
Aug 15, 2021, 3:28:32 PM8/15/21
to
Luckily, no trips through memory are required, though most traditional C
idioms here would require using a trip through memory.

There are some extensions though that allow sidestepping a trip through
memory.

I ended up adding an ABI control flag for "128-bit argument types use
even registers only", which is currently enabled for builds with SIMD
operations enabled. Still TBD.
As noted, this scenario kinda sucks.

Only other options would have been:
Don't implement "fenv.h", stick with fixed rounding only;
Fully give in to global mutable state which modifies FPU behavior.


While I am trying to keep costs low, this isn't necessarily about
minimizing the number of encodings or instruction mnemonics.


But, yeah, unlike x86, I don't have:
MOVAPS/MOVUPS/MOVDQA/MOVDQU/VMOVDQA/VMOVDQU/...
I have:
MOV.X and MOV.Q


Then, FP-SIMD:
PADD.F, PADDX.F, PADD.H, PADDX.D
PSUB.F, PSUBX.F, PSUB.H, PSUBX.D
PMUL.F, PMULX.F, PMUL.H, PMULX.D
...

But, like, it is "less" than on x86.
There are also no rounding modes for the packed SIMD ops; they are
hard-wired to an implementation choice of either Nearest or Truncate.

Where:
F=Single, H=Half, D=Double;
W=Int16, L=Int32, UW=UInt16, UL=UInt32;
'X' denoting 128-bit forms.

No packed byte operations in this case, only converters.


Given everything is based around using paired 64-bit registers, the
"vector size" is ambiguous.

So, say:
PADDX.L R4, R6, R8
Is functionally analogous to:
PADD.L R5, R7, R9 | PADD.L R4, R6, R8
...

Thomas Koenig

unread,
Aug 15, 2021, 3:37:07 PM8/15/21
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>> Thomas Koenig <tko...@netcologne.de> writes:
>>>>Array expressions have been in the Fortran standard since 1991,
>>>>so they are, by definition, not a language extension.
>>>
>>> The are not in Fortran 77 or earlier standards, so they certainly
>>> extend these standards.
>>
>>A new standard is a new standard, not an extension of an existing
>>standard.
>>
>>However, I will call it "non-standard feature" from now on, if
>>it makes you feel better.
>
> We were not discussing non-standard features, but extensions (or, if
> you prefer, features) in general, whether standardized or not. In
> case of Fortran such a feature has been standardized.

I know "extension" as something non-standard.

I have nothing against something that has made it into a standard
(even a supplementing standard like OpenMP to C or Fortran is
OK to my mind).

Like I said, I consider non-standard extensions harmful.

>
>>> And in the discussion about auto-vectorization vs. manual
>>> vectorization that is certainly a very relevant point: You don't
>>> auto-vectorize an array language like APL, you auto-vectorize a scalar
>>> language. And you cannot manually express vectorization in a scalar
>>> language like Fortran 66 (reigning when the Cray-1 came to the market)
>>> or Fortran 77, you need to extend it, e.g., with the Fortran 90 array
>>> expressions.
>>
>>The Fortran compilers I know (very well for gfortran, somewhat for
>>NAG) rely on auto-vectorization to get the loops they translate their
>>array expressions to to vectorize, or on Cray-style IVDEP directives.
>
> That looks perverse, but I guess that is the result of both wanting to
> auto-vectorize benchmarks,

And of wanting to write a compiler that generates good code in general
(which, of course, needs to be checked with micro-benchmarks at
least).

Would it help if I told you that, in what I did for optimization
on gfortran (mostly frontend-passes.c, if you're interested), I actually
did not consult benchmarks, not even the free Polyhedron suite?

I rather abhor SPEC because of its closed source nature and SPEC's
refusal to change benchmarks which rely on broken source code,
by the way.

> and having to also support scalar-only
> hardware. So just auto-scalarize the vectorized source code, and then
> hope that the back end will auto-vectorize it again.

The scalarizer is actually one of the most complex pieces of code
in a Fortran front end, and one of the most difficult to write
and maintain.

It would certainly be preferable from the standpoint of the Fortran
front end guys to hand all that off to the middle end.

>
>>It took until Fortran 2008 for a very important detail to be added
>>to the language: The contiguous attribute for arrays...
>
> What does that do?

In Fortran, you can pass array slices, like

real :: a(10,10)
call foo(a(1:5:2,1:5:2))

which will pass an array which will be visible to the callee as
having the bounds 1:3, 1:3, with the values of a(1,1), a(3,1),
a(5,1), a(1,3), a(3,3) ... and so on. This was introduced in
Fortran 90, and is very powerful, but has its drawbacks.

The compiler needs to pass stride information to the callee,
using an "array descriptor" set up by the caller (also called
"dope vector"). The callee usually does not know at compile
time if it gets passed a contiguous amount of memory or not,
so it needs to assume the worst, i.e. variable strides.

If the argument is declared as

real, intent(inout), dimension(:,:), contiguous :: a

then the callee can assume contiguous memory, can load with
unity stride, etc.

The caller has access to that information and then has to make
sure the callee only gets a contiguous argument, for example
by copy-in / copy-out. Good compilers have a diagnostic for that
so users are not completely surprised when their code does this
kind of thing :-)

The callee can also not declare an argument as contiguous and check
at runtime if it is, with an intrinsic function unsurprisingly
called "is_contiguous".

This feature was only added in Fortran 2008.
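
To make the descriptor idea concrete, here is a very rough C sketch of
what such a dope vector can look like and why a non-contiguous slice
forces strided addressing. Field names and layout are invented for
illustration; this is not gfortran's actual descriptor:

#include <stddef.h>

/* invented 2-D array descriptor: base pointer plus per-dimension
   extent and stride (strides in elements rather than bytes, for
   clarity) */
typedef struct {
    double   *base;
    ptrdiff_t extent[2];
    ptrdiff_t stride[2];
} desc2d;

/* what a callee must do when it cannot assume contiguity:
   every access goes through the stride arithmetic */
double sum_strided(const desc2d *a)
{
    double s = 0.0;
    for (ptrdiff_t j = 0; j < a->extent[1]; j++)
        for (ptrdiff_t i = 0; i < a->extent[0]; i++)
            s += a->base[i * a->stride[0] + j * a->stride[1]];
    return s;
}

/* what the CONTIGUOUS attribute buys: unit stride, one flat loop,
   and something a vectorizer can handle without loop versioning */
double sum_contig(const double *a, ptrdiff_t n)
{
    double s = 0.0;
    for (ptrdiff_t i = 0; i < n; i++)
        s += a[i];
    return s;
}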

Brian G. Lucas

unread,
Aug 15, 2021, 4:53:52 PM8/15/21
to
One doesn't have to wait for silicon (or FPGA) to test compilers. A
functional simulator will do. Been there, done that. Tested binutils, gcc,
the RTOS and the phone user interface for a new ISA, all before silicon.
When silicon arrived, the code was ready.

Can't time benchmarks, however.

brian

MitchAlsup

unread,
Aug 15, 2021, 4:55:45 PM8/15/21
to
These have been called Dope Vectors since at least 1970..................
likely from the mid-1960s...........maybe earlier.........

Michael S

unread,
Aug 15, 2021, 6:32:28 PM8/15/21
to
On Sunday, August 15, 2021 at 7:39:22 PM UTC+3, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
> >While the image as a whole is bigger, things like individual pixels
> >don't get any bigger.
> >
> >
> >One can process multiple pixels at a time, but what one gains from
> >working with multiple pixels at a time may be less than what they gain
> >from being able to work a whole pixel at a time.
> >
> >Packing multiple logical pixels in each vector may start to face
> >diminishing returns in terms of needing to deal with more wrangling to
> >get the values where they need to go, ...
> For image manipulation tasks like blurring, sharpening, colour or
> brightness adjustments, you can benefit from very long SIMD registers.
> >> Even for stuff like memcpy and memset, the length is often larger than
> >> 128 bits (16 bytes).
> >>
> >
> >Doesn't necessarily mean these will benefit from a larger vector either.
> >
> >
> >If one has a vector which is twice the size, but the load or store
> >operation takes twice as long as the 128-bit load or store, is it still
> >a win?...
> Not for copying, that's why load/store units tend to get wider over
> time, just as SIMD registers get wider. E.g., Haswell and Zen 1 have
> 128-bit wide load/store units,

Haswell has 256-bit Load and Store data paths and dual 256-bit FMAC
Sandy Bridge and Ivy Bridge have 128-bit Load and Store data paths, one 256-bit FP_ADD and one 256-bit FP_MUL
Zen1 has everything 128-bit - Load, Store and dual FMAC.

MitchAlsup

unread,
Aug 15, 2021, 7:13:21 PM8/15/21
to
On Sunday, August 15, 2021 at 11:39:22 AM UTC-5, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
> >While the image as a whole is bigger, things like individual pixels
> >don't get any bigger.
> >
> >
> >One can process multiple pixels at a time, but what one gains from
> >working with multiple pixels at a time may be less than what they gain
> >from being able to work a whole pixel at a time.
> >
> >Packing multiple logical pixels in each vector may start to face
> >diminishing returns in terms of needing to deal with more wrangling to
> >get the values where they need to go, ...
> For image manipulation tasks like blurring, sharpening, colour or
> brightness adjustments, you can benefit from very long SIMD registers.
> >> Even for stuff like memcpy and memset, the length is often larger than
> >> 128 bits (16 bytes).
> >>
> >
> >Doesn't necessarily mean these will benefit from a larger vector either.
> >
> >
> >If one has a vector which is twice the size, but the load or store
> >operation takes twice as long as the 128-bit load or store, is it still
> >a win?...
<
> Not for copying, that's why load/store units tend to get wider over
> time, just as SIMD registers get wider. E.g., Haswell and Zen 1 have
> 128-bit wide load/store units, Skylake and Zen 2 have 256 bit wide
> units, and IIRC Ice Lake has 512 bit wide units.
<
The LD/ST units or the data cache get wider over time in order to
service these ever-increasing register widths, not the other way
around. If the registers were to remain the same size and one wanted
more BW (to match processor capabilities), then the LD/ST units would
get more ports into the cache and the cache would become more banked.

Thomas Koenig

unread,
Aug 16, 2021, 1:42:48 AM8/16/21
to
Interesting. Can you tell what CPU this was for?

> Can't time benchmarks, however.

It would still be possible to count cycles (if the functional simulator
is set up to do that).

Is there a publically available simulator for My 66000?

Ivan Godard

unread,
Aug 16, 2021, 2:10:57 AM8/16/21
to
On 8/15/2021 10:42 PM, Thomas Koenig wrote:
> Brian G. Lucas <bag...@gmail.com> schrieb:
>> On 8/15/21 3:48 AM, Thomas Koenig wrote:
>
>>> I guess the question on when such an implementation might be available
>>> as a soft core or in hardware would remain unanswered, so this
>>> is a rather theoretical discussion...
>>>
>> One doesn't have to wait for silicon (or FPGA) to test compilers. A
>> functional simulator will do.
>
>> Been there, done that. Tested binutils, gcc,
>> the RTOS and the phone user interface for a new ISA, all before silicon.
>> When silicon arrived, the code was ready.
>
> Interesting. Can you tell what CPU this was for?
>

Not an uncommon experience. On the B6500 (new ISA, OS, 5 compilers,
etc.) we were working in sim for three years before we got the first
hardware prototype. Two weeks to the day later we were compiling under
that OS on the new hardware.

Marcus

unread,
Aug 16, 2021, 4:41:05 AM8/16/21
to
On 2021-08-14 17:38, Anton Ertl wrote:
> George Neuner <gneu...@comcast.net> writes:
>> Other
>> than memmove/memcopy and perhaps encryption there's no call for SIMD
>> in OS code.
>
> Moreover, they want to avoid it in the kernel to reduce the number of
> registers that they have to save on system calls.
>
> - anton
>

...moreover, they want the kernel code to work on any CPU generation
while at the same time being as compact as possible. So (because of
"flaw 1") having the kernel dynamically select the SIMD flavor for every
loop is not really an option. I think that it's only done for things
where performance really matters. E.g. IIRC for some routines (RAID,
crypto?), different SIMD versions are benchmarked during boot and the
fastest one is selected.

/Marcus

Marcus

unread,
Aug 16, 2021, 4:45:39 AM8/16/21
to
On 2021-08-13 22:35, Terje Mathisen:
> Marcus wrote:
>> On 2021-08-13 10:54, Terje Mathisen wrote:
>>> MitchAlsup wrote:
>>>> On Thursday, August 12, 2021 at 12:08:41 PM UTC-5, Marcus wrote:
>>>>> I posted this article on my blog:
>>>>>
>>>>> https://www.bitsnbites.eu/three-fundamental-flaws-of-simd
>>>>
>>>> The fundamental flaw in SIMD is
>>>>
>>>> a) one should be able to encode a vectorizable loop once and have it
>>>> run at the maximum performance of each future generation machine.
>>>> One should never have to spit out different instructions just because
>>>> the width of the SIMD registers was widened.
>>>>
>>>> b) one should be able to encode a SIMD instruction such that it
>>>> performs
>>>> as wide as the implementation has resources (or as narrow) using the
>>>> same OpCodes.
>>>
>>> This has been done well in C#, they have vector operations that will
>>> turn into optimal SIMD instructions during the final JIT/AOT stage,
>>> this way they can optimize for the local CPU, and update the compiler
>>> for future platforms.
>>
>> ...which probably means that they will be able to map well to other
>> kinds of vector architectures too.
>>
>> Even if a compiler / language can hide the details of the underlying
>> vector architecture, packed SIMD still suffers from code bloat though:
>> Essentially you need to describe things like pipelining and tail
>> handling in code rather than having the HW take care of it, which hurts
>> code density /and/ increases front end load.
>
> Not really an issue as long as you have a way to do a masked store, i.e.
> you use the tail end count to initialize the write mask, after doing the
> final iteration the normal way. You usually have to make two loop copies
> though, one using regular stores for the full blocks, plus a final one
> which could handle either 0 to N-1 or 1 to N elements.

...which I would consider "code bloat".

>
> Mill is far nicer here, as is Mitch's VVM.

...as is MRISC32 (for certain tasks) ;-)

Michael S

unread,
Aug 16, 2021, 5:02:32 AM8/16/21
to
> Mill is far nicer here, as is Mitch's VVM.
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Does Mill even try to be competitive in FP MUL/ADD/MADD throughput?
Esp. on single precision front? Today, anything near or below 0.1 peak SP TFLOPS per core is considered low end.
For me, nice and dead slow is not considered nice. I want work done.

Terje Mathisen

unread,
Aug 16, 2021, 7:40:17 AM8/16/21
to
Marcus wrote:
> On 2021-08-13 22:35, Terje Mathisen:
>> Not really an issue as long as you have a way to do a masked store,
>> i.e. you use the tail end count to initialize the write mask, after
>> doing the final iteration the normal way. You usually have to make two
>> loop copies though, one using regular stores for the full blocks, plus
>> a final one which could handle either 0 to N-1 or 1 to N elements.
>
> ...which I would consider "code bloat".
>

Yeah, I agree: If you use the masked store for all blocks then you don't
need to double the code, but it tends to make the main block slightly
slower since you have to use that masked store with an all-enabled mask
for all but the last iteration, and you need to include the code to
select the right mask.

However, if you are moving intermediate-sized blocks, then using
masked stores will often allow you to get away with a single iteration.
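
As an illustration of the single-loop form (AVX-512 is used here only
because its mask registers make it concise; the function name is made
up and this is a sketch, not tuned code, to be built with AVX-512F
enabled): the same masked load/store handles both the full blocks and
the tail, at the cost of computing a mask each iteration.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* dst[i] = src[i] + 1 for n int32 elements, 16 per iteration.
   One loop handles everything: full blocks get an all-ones mask, the
   final partial block gets a mask with only the remaining lanes set,
   so no scalar tail loop and no second loop copy is needed. */
void add1_masked(int32_t *dst, const int32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        size_t left = n - i;
        __mmask16 m = (left >= 16) ? (__mmask16)0xFFFF
                                   : (__mmask16)((1u << left) - 1);
        __m512i v = _mm512_maskz_loadu_epi32(m, src + i);
        v = _mm512_add_epi32(v, _mm512_set1_epi32(1));
        _mm512_mask_storeu_epi32(dst + i, m, v);
    }
}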

>>
>> Mill is far nicer here, as is Mitch's VVM.
>
> ...as is MRISC32 (for certain tasks) ;-)

Beauty is in the eye of the beholder. :-)

Terje Mathisen

unread,
Aug 16, 2021, 7:44:26 AM8/16/21
to
Michael S wrote:
> On Friday, August 13, 2021 at 11:35:09 PM UTC+3, Terje Mathisen
>> Not really an issue as long as you have a way to do a masked store,
>> i.e. you use the tail end count to initialize the write mask, after
>> doing the final iteration the normal way. You usually have to make
>> two loop copies though, one using regular stores for the full
>> blocks, plus a final one which could handle either 0 to N-1 or 1 to
>> N elements.
>>
>> Mill is far nicer here, as is Mitch's VVM. Terje
>>
> Does Mill even try to be competitive in FP MUL/ADD/MADD throughput?
> Esp. on single precision front? Today, anything near or below 0.1
> peak SP TFLOPS per core is considered low end. For me, nice and dead
> slow is not considered nice. I want work done.
>
I'll let Ivan post the exact details, but a Mill Gold is very much a
high throughput engine. For simple and even not so simple loops, it will
easily saturate whatever memory subsystem you can afford.

If you only work with data that fits in cache, then the same Mill will
still be quite happy. :-)

Michael S

unread,
Aug 16, 2021, 8:11:39 AM8/16/21
to
According to my understanding of old Mill slides, Gold peaks at 4 SP FMA per clock and frequency is ~1.5 GHz
That's more than 10 times slower than cores I am used to.

MitchAlsup

unread,
Aug 16, 2021, 9:50:21 AM8/16/21
to
How many vector instruction sets are productive on the str* and mem* libraries ?

MitchAlsup

unread,
Aug 16, 2021, 9:50:46 AM8/16/21
to
Brian has a functional simulator. You might ask him.

Michael S

unread,
Aug 16, 2021, 10:23:06 AM8/16/21
to
Being productive on str* and mem* is a bonus, rather than the reason I want SIMD in my CPU.
For me, it's fine if, for anything unrelated to "dense" numeric processing, there is no gain at all.

Brian G. Lucas

unread,
Aug 16, 2021, 12:48:08 PM8/16/21
to
On 8/16/21 12:42 AM, Thomas Koenig wrote:
> Brian G. Lucas <bag...@gmail.com> schrieb:
>> On 8/15/21 3:48 AM, Thomas Koenig wrote:
>
>>> I guess the question on when such an implementation might be available
>>> as a soft core or in hardware would remain unanswered, so this
>>> is a rather theoretical discussion...
>>>
>> One doesn't have to wait for silicon (or FPGA) to test compilers. A
>> functional simulator will do.
>
>> Been there, done that. Tested binutils, gcc,
>> the RTOS and the phone user interface for a new ISA, all before silicon.
>> When silicon arrived, the code was ready.
>
> Interesting. Can you tell what CPU this was for?
>

Motorola MCore. It was used in some cell phones and handheld GPS units.
Now licensed to the Chinese, who call it CSky1. There's even an
official Linux port.

>> Can't time benchmarks, however.
>
> It would still be possible to count cycles (if the functional simulator
> is set up to do that).
>
> Is there a publically available simulator for My 66000?
>

https://github.com/bagel99/progtools
You might not be happy with the source code since its written
in my personal programming language:
https://github.com/bagel99/esl

But it should be easy to transliterate it to C or C++.


BGB

unread,
Aug 16, 2021, 1:01:41 PM8/16/21
to
On 8/16/2021 6:40 AM, Terje Mathisen wrote:
> Marcus wrote:
>> On 2021-08-13 22:35, Terje Mathisen:
>>> Not really an issue as long as you have a way to do a masked store,
>>> i.e. you use the tail end count to initialize the write mask, after
>>> doing the final iteration the normal way. You usually have to make
>>> two loop copies though, one using regular stores for the full blocks,
>>> plus a final one which could handle either 0 to N-1 or 1 to N elements.
>>
>> ...which I would consider "code bloat".
>>
>
> Yeah, I agree: If you use the masked store for all blocks then you don't
> need to double the code, but it tends to make the main block slightly
> slower since you have to use that masked store with an all-enabled mask
> for all but the last iteration, and you need to include the code to
> select the right mask.
>
> However, if you are moving intermediate-sized blocks, then using
> masked stores will often allow you to get away with a single iteration.
>

It seems like a "masked splice" could also make sense.

Say, for example:
MCSLICE.B Rm, Ro, Rn

Which replaces the first Ro bytes in Rn with those from Rm.

This could maybe make sense for operations like strcpy and strcat, where
dealing with the last few bytes tends to add a lot of extra cycles, but
these functions aren't really allowed to stomp memory past the end of
the copy.
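
A plain-C statement of the intended semantics, just to pin the idea
down (byte order assumes the little-endian convention used elsewhere in
these examples; the function name is made up):

#include <stdint.h>

/* software model of the proposed "masked splice":
   take the low 'count' bytes from 'src' and the remaining high bytes
   from 'dst'; count is clamped to 0..8 for a 64-bit register */
uint64_t splice_bytes(uint64_t dst, uint64_t src, unsigned count)
{
    if (count >= 8) return src;
    uint64_t mask = ((uint64_t)1 << (count * 8)) - 1;   /* low-byte mask */
    return (src & mask) | (dst & ~mask);
}

Presumably the tail would be handled by loading the existing destination
word, splicing in the new bytes, and storing the whole word back, so the
trailing byte values are left unchanged.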


>>>
>>> Mill is far nicer here, as is Mitch's VVM.
>>
>> ...as is MRISC32 (for certain tasks) ;-)
>
> Beauty is in the eye of the beholder. :-)
>

Probably.

I like my ISA design as well, despite it being a very different design
philosophy; more of a bottom up design focused more on trying to
optimize for performance relative to cost. Traditional metrics of
elegance were given a low priority at this level, and it is assumed that
features would be enabled/disabled/adapted as needed for specific
use-cases (more like what is common in the microcontroller space).

This is in contrast for PC-like use-cases, where the ISA would need to
focus more on program interchange and long-term stability (eg, where one
doesn't want an N-way fracturing of sub-variants most of which are not
binary compatible with eachother).


So, for example, I might not go in my current design direction if the
goal were to try to compete with x86-64 or ARMv8 or similar.

...



Would be nicer if performance were better though.

The main limiting factor in-general still seems to mostly be memory
bandwidth though.

There would appear to be a bottleneck inside the L2 cache but I haven't
figured it out as of yet.

Bidirectional speeds (memcpy):
The DDR controller, by itself, pushes ~ 20MB/s;
The L1 <-> L2 interface pushes ~ 50MB/s;
The L2 <-> DDR interface still only pulls off ~ 8MB/s;
L1 (native): ~ 250 MB/s (~ 62% of theoretical maximum, *1).

The issue does not appear to be with ringbus latency. A special case
shortcut exists to keep the missed requests circling over the L2 until
they hit (without traveling the rest of the ring, unless displaced by
another request); which helps some but does not resolve the issue.

Similarly, there was another trick to speed up the original bus
interface via a sequence numbering trick (allows skipping the "teardown"
stage for each request).

Other per-cycle debug prints did not point out any obvious issues...
Yet, for whatever reason, the effective bandwidth is cut in half.


I did see a speed boost from a hack to increase the internal cache-line
size of the L2 cache from 16 to 32 bytes, but it was unstable (cause not
yet identified).

There were also speedups from using logic to evict older dirty cache
lines when encountered. Though, I guess it is possible that some logic
could be added mostly to sweep the L2 and evict old dirty cache lines
proactively (rather than waiting until they are encountered during an L1
miss).

...


*1: Assuming 100% hit rate and no interlock penalties, the L1 could
theoretically achieve ~ 400MB/s via "MOV.X" ops. In the general case, a
100% hit rate is unlikely with a direct-mapped cache (with some cycles
also being lost to looping overheads, ...).

The L1 <-> L2 bandwidth could (in theory) be improved some by
integrating the TLB into the L1 caches, since it adds ~ 4 cycles of
latency to the L1 ring. This would effectively require redesigning the
L1s and increasing cost though.

And/or use a mini-TLB in the L1 cache, and then have a case to skip over
the main TLB for pretranslated requests. The mini-TLB would likely be
fairly small and limited (only dealing with "simple cases"), ...

...

Ivan Godard

unread,
Aug 16, 2021, 1:30:44 PM8/16/21
to
This is the SIMD version of Mill's "pick" operation, which is the direct
hardware of "a?b:c". In SIMD a vector of bool chooses elements of b or c
vectors to produce a vector result. No need for special mask management,
a vector of bool is just a normal vector like a vector of anything else.

Ivan Godard

unread,
Aug 16, 2021, 1:42:24 PM8/16/21
to
You get one FP operation per cycle per FPU lane in your configuration.
Thus in a high-end config with 4 quad-scalar FPUs doing single you are
getting 4 (single per quad) *4 (FPUs) = 16 single ops per cycle, the
same as a 512-bit SIMD, so I'd call it competitive.

Like Mitch's proposed higher end, we do multiple scalars instead of wide
vector. That lets us use the elements as independent scalar (including
quad arithmetic) instructions as well as SIMD, and avoid the encoding
mess that is AVX/SSE/etc.

Ivan Godard

unread,
Aug 16, 2021, 1:51:42 PM8/16/21
to
IIRC, our slides never talk about precision, just about function units
and ops. You get one instruction per FU per cycle at max scalar width,
split into as many SIMD as fits. As Gold is quad (128-bit), at single
precision SIMD you are getting four-lane SIMD into as many FUs as you
have power for - 4 in Gold.

Mill is a throughput-oriented architecture, so we expect that customers
will use it more in wide-low-clock configs to save on power, instead of
narrow-high-clock configs, but but the architecture can be configured
for either. How many FMAs per second do you want to cool?

BGB

unread,
Aug 16, 2021, 2:24:12 PM8/16/21
to
For what it is worth, I originally wanted to do multiple lanes of 2x
Single, so that a 4x Single operation could be done by running two of
these 2x single operations in parallel. This could have given better
performance than the current 4x single op, as well as being more flexible.

However... This would have had a higher resource cost than just feeding
everything through the existing double-precision unit internally (which
was still a bit faster than using scalar operations).


I might revive this idea at some point, if floating-point throughput
becomes more of an issue.

At present, by current estimates, for most code cycles spent in
floating-point operations are a relatively small part of the total (the
bulk of the clock-cycles is still mostly Load/Store ops followed by ALU
ops and Branch ops, ...).

Michael S

unread,
Aug 16, 2021, 2:31:14 PM8/16/21
to
So, the same # of FLOPS/cycle as I am used to. But MUCH lower clock.
And probably not as good a peak-to-sustained ratio, because for the sort
of operations that I have in mind Gold appears to have too short a belt,
which is Mill's equivalent of too few registers.
For some of important math kernels this defect can be ameliorated when
FMA unit has short latency through adder, but for other kernels it's not enough.

> Mill is a throughput-oriented architecture, so we expect that customers
> will use it more in wide-low-clock configs to save on power, instead of
> narrow-high-clock configs, but but the architecture can be configured
> for either. How many FMAs per second do you want to cool?

As many as needed to specific task.
As said above, I am not going to bother with less than 0.1 SP TFLOPS per core.
But I'd very much like more than that.

Michael S

unread,
Aug 16, 2021, 2:55:23 PM8/16/21
to
Pay attention, that all current FP throughput-oriented cores that feature 512bit SIMD,
i.e. Intel Skylake Server ~= Cascade Lake and Fujitsu A64FX have *dual* 512-bit FMA.
Only Intel's long-discontinued and mostly forgotten Knights Corner (the one, Terje likes)
and Ice/Tiger/Rocket Lake client cores have single FMA.
Rumors are that incoming Ice Lake server and its followup Sapphire Rapids will also
have dual 512-bit FMA. We're going to see soon.


> Like Mitch's proposed higher end, we do multiple scalars instead of wide
> vector. That lets us use the elements as independent scalar (including
> quad arithmetic) instructions as well as SIMD, and avoid the encoding
> mess that is AVC/SSE/etc.

And for a given throughput makes bypass network much more complicated.

Ivan Godard

unread,
Aug 16, 2021, 3:09:58 PM8/16/21
to
Belt length is configurable too, but you probably use most of your
registers for persistent operands rather than transients and are making
the mistake of expecting to put those on the belt. The belt is not
registers-with-temporal-access, and you will get poor results if you try
to use them as such. The belt is for transients, and it's a very rare
code that has more than two transients per instruction; Gold's 32 is
likely overkill.

For longer-lived persistent data you use spatial addressing in the
scratchpad, or rather, the compiler uses that; don't try to asm-program
a Mill. Scratch can be configured as big as you want, so the bounding
limit is the scratch-to-FU bandwidth configured. Sizing that is an
economics exercise, akin to deciding how many ports you put on the RF in
a legacy ISA. The Mill has an advantage there - while a legacy needs
three RF read ports and one write port per FMA FU to get the three
arguments and put the result, a Mill can split the bandwidth demand
between scratch and belt transients. In our (admittedly rough and
preliminary) guesswork a bandwidth load of two belt transients, one
scratch persistent, and a third of a scratch write for three operands
seems to be a reasonable balance for general code.

> For some of important math kernels this defect can be ameliorated when
> FMA unit has short latency through adder, but for other kernels it's not enough.
>
>> Mill is a throughput-oriented architecture, so we expect that customers
>> will use it more in wide-low-clock configs to save on power, instead of
>> narrow-high-clock configs, but but the architecture can be configured
>> for either. How many FMAs per second do you want to cool?
>
> As many as needed to specific task.
> As said above, I am not going to bother with less than 0.1 SP TFLOPS per core.
> But I'd very much like more than that.
>

Mill is a general-purpose machine. For some FP applications any
general-purpose architecture will be out-performed by a special-purpose
one. You may find, for example, that a wave-front barrel architecture
like a GPU suits your app better. Mill is not a GPU, nor a coin-mining
ASIC, nor an 8-bit IOT in a credit card. It may not be suitable for your
needs either.

Terje Mathisen

unread,
Aug 16, 2021, 3:19:37 PM8/16/21
to
I hope (nearly) all of them!

Ivan Godard

unread,
Aug 16, 2021, 3:27:16 PM8/16/21
to
The MUX fan-in to the FUs does get deeper, which impacts cycle time. On
the other hand, the need for shuffle ops drops rather dramatically. But
the Mill specification-driven configuration machinery is very flexible -
if it suits your app, you can configure 4096-bit wide SIMD FUs, although
wire area constraints would likely limit you to two of them.

All architectures are limited by what the fabs can build, what power,
cooling, and pinout can be supported when running, and what you can
afford. Mill is not a magic wand to evade these constraints. But those
are not architectural constraints. Mill's claim is that, within the
non-architectural constraints, it offers a better performance X power X
cost product than other architectures on general-purpose loads.

If that is not enough, then you'll have to look elsewhere.


Thomas Koenig

unread,
Aug 16, 2021, 4:06:34 PM8/16/21
to
This seems a little over the top (I didn't learn a definition
language to read the definition of STEP, either. When I need to
define a 3D model in a self-written program, I write out IGES,
which has a human-readable standard. But I digress).

So, in order to simulate running a My 66000 program, it would be
necessary to install two additional front ends to LLVM. Hmm...

> But it should be easy to transliterate it to C or C++.

I don't suppose I will be doing that (unless you have a 1:1
esl to C converter).

BGB

unread,
Aug 16, 2021, 5:46:30 PM8/16/21
to
Hmm...

For my ISA, I just sorta wrote an interpreter in C...

It isn't even a particularly good interpreter, but in this case I just
sorta needed to match similar speeds to the FPGA version (there was a
JIT, but it hasn't really been maintained and is currently non-functional).

But, an interpreter running on a desktop PC is plenty able to be faster
than an FPGA core running at 50MHz.

Running the emulator on a RasPi or cellphone is a bit harder, as it
seems a 1.2 or 1.8 GHz Cortex-A53 doesn't have quite enough "go power"
to maintain real-time emulation speeds with an interpreter.


Though, I guess it is an interpreter design which is reasonably tried
and true in my case, namely:
while(tr)
{ tr=tr->Run(ctx, tr); }
Where, tr->Run is an unrolled list of function calls to opcode handlers:
ctx->tr_next=tr->next;
...
ops=tr->ops;
op=*ops++; op->Run(ctx, op);
op=*ops++; op->Run(ctx, op);
...
return(ctx->tr_next);

Where 'tr_next' is set to NULL if an exception occurs.


In this case, the decoder is partially decoupled, and generally only
invoked when new traces are encountered.

Generally it is organized this way because, if one tries to decode and
execute instructions one at a time in an interpreter, the decoder will
tend to become a bottleneck.

...
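
A stripped-down, self-contained version of that dispatch structure
(the types and the two toy opcodes below are invented for illustration;
the real emulator's context, register file, and decoder are obviously
much larger):

#include <stdio.h>

typedef struct Ctx Ctx;
typedef struct Op  Op;
typedef struct Tr  Tr;

struct Op { void (*Run)(Ctx *, Op *); int rd, rs, imm; };
struct Tr { Tr *next; Op *ops[4]; Tr *(*Run)(Ctx *, Tr *); };
struct Ctx { long regs[8]; Tr *tr_next; };

/* opcode handlers: each does one instruction's worth of work */
static void op_addi(Ctx *c, Op *o)  { c->regs[o->rd] = c->regs[o->rs] + o->imm; }
static void op_print(Ctx *c, Op *o) { printf("r%d = %ld\n", o->rd, c->regs[o->rd]); }

/* trace body: an unrolled list of handler calls; an exception handler
   would clear ctx->tr_next to break out of the outer loop */
static Tr *tr_run4(Ctx *c, Tr *t)
{
    c->tr_next = t->next;
    Op **ops = t->ops, *op;
    op = *ops++; op->Run(c, op);
    op = *ops++; op->Run(c, op);
    op = *ops++; op->Run(c, op);
    op = *ops++; op->Run(c, op);
    return c->tr_next;
}

int main(void)
{
    Op o0 = { op_addi,  1, 1, 5 };
    Op o1 = { op_addi,  2, 1, 7 };
    Op o2 = { op_print, 1 };
    Op o3 = { op_print, 2 };
    Tr t = { 0, { &o0, &o1, &o2, &o3 }, tr_run4 };
    Ctx ctx = { {0} };

    /* outer dispatch loop, as in the text above */
    Tr *tr = &t;
    while (tr) { tr = tr->Run(&ctx, tr); }
    return 0;
}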

Anton Ertl

unread,
Aug 16, 2021, 5:48:23 PM8/16/21
to
Michael S <already...@yahoo.com> writes:
>Rumors are that incoming Ice Lake server and its followup Sapphire Rapids will also
>have dual 512-bit FMA. We're going to see soon.

You can see it now:

|# of AVX-512 FMA Units: 2
From <https://www.intel.com/content/www/us/en/products/sku/212287/intel-xeon-platinum-8380-processor-60m-cache-2-30-ghz/specifications.html>