I am a little less strict on this topic, but my general opinion is that
things like strict aliasing should be "opt in" rather than "opt out".
But, yeah, the guy who wrote that does in general have a point, but IMO
he also tends to be a bit overly alarmist about these issues.
For example, consider the case of caching struct loads, where there are
multiple possible approaches:
1, Always do the load (my compiler did this originally);
2, Cache, but only within a single basic-block, flush on store (current
rule);
3, Cache, invalidate entries on store if matching types (strict aliasing);
4, Assume that no aliasing occurs whatsoever (playing with fire).
I have noted experimentally that option 2 can gain a fair bit of
speedup over option 1, and is pretty much invisible within a single
thread.
Option 3 is closer to what GCC and Clang do by default, but I opt
against it by default given that it is unsafe. Doom and Quake and
similar still work with this option. It also seems to be only modestly
faster than option 2.
Option 4 is unsafe; it is a little faster than option 3 at Dhrystone,
but Doom and Quake are prone to crash if one uses it.
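To make the hazard concrete, here is a minimal C sketch (assuming
32-bit 'unsigned' and IEEE floats) of the kind of type punning that
Doom/Quake-era code leans on:

#include <stdio.h>

/* Classic pointer type-punning. Under option 3 (strict aliasing), the
   compiler may assume an 'unsigned' store cannot alias a 'float',
   keep 'f' cached across the store, and return stale bits. Options 1
   and 2 give the "expected" result, since any store flushes the
   cached load. */
float negate_via_bits(float f)
{
    unsigned *p = (unsigned *)&f;  /* aliasing violation in ISO C */
    *p ^= 0x80000000u;             /* flip the IEEE-754 sign bit */
    return f;                      /* option 3/4 may return the old f */
}

int main(void)
{
    printf("%f\n", negate_via_bits(1.0f));  /* expected: -1.000000 */
    return 0;
}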
A variation of this optimization has also been applied to array loads.
My preference is also that things like signed integer operations remain
modulo in the signed sense (wrapping on overflow), with unsigned
operations likewise being unsigned modulo, and that signed right shift
behaves in the usually-expected way (arithmetic shift), ...
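As a small illustration (ISO C calls some of this undefined or
implementation-defined; the point is what a compiler should do with it
anyway):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* signed overflow is UB in ISO C; the preference here is
       two's-complement wrapping instead */
    int32_t x = INT32_MAX;
    printf("%d\n", (int32_t)(x + 1));  /* wrapping: -2147483648 */

    /* right-shifting a negative value is implementation-defined in
       ISO C; the expected behavior here is an arithmetic shift */
    printf("%d\n", -8 >> 1);           /* arithmetic: -4 */
    return 0;
}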
So, ideally, one can try to optimize things in ways that don't turn
into semantic quicksand, even if this means skipping over some possible
types of optimization.
Similarly, corner cutting is acceptable in cases where the expected
results can be reasonably well-defined, or are "basically invisible"
within reasonable use-cases.
For example, there can be a fairly significant cost difference between
"FPU that implements IEEE-754 correctly", and one that gives answers
that are "more or less correct".
Expensive FPU:
Separate FPU registers (arguably less register pressure);
FADD, FSUB, FMUL, FDIV, FSQRT, FMAC, ...
Always +/- 0.5 ULP;
Implements Denormals;
...
Cheaper FPU:
Reuse integer registers;
FADD, FSUB, FMUL;
Relaxed rounding;
Denormal as Zero;
...
For most programs, the difference between the cheap and expensive FPU is
not noticeable; and the sorts of workloads which "actually care" are
unlikely to be run on a processor that is too cheap to be able to afford
the fancier FPU (and where the relative slowness of doing everything
with software emulation is undesirable).
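As one concrete example, denormal-as-zero is observable in principle
but nearly invisible in practice (a sketch; assumes the comparison
actually runs on the FPU rather than being constant-folded):

#include <stdio.h>

int main(void)
{
    volatile float tiny = 1.0e-40f;  /* subnormal, below FLT_MIN (~1.18e-38) */
    /* an IEEE-754-correct FPU preserves the value, so tiny > 0 holds;
       a denormal-as-zero FPU flushes it to 0.0f and the test fails.
       Most programs never produce values this small. */
    printf("tiny > 0 : %d\n", tiny > 0.0f ? 1 : 0);
    return 0;
}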
One might find, though, that the more useful property is not so much
that the FPU is fully accurate, but rather that things like
"Single->Double->Single" conversions keep the original value intact.
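A quick sketch of that round-trip property:

#include <stdio.h>
#include <string.h>

/* Single -> Double -> Single should be value-preserving: every float
   is exactly representable as a double, so narrowing back recovers
   the original value. */
int roundtrip_ok(float x)
{
    double d = (double)x;  /* widening conversion: exact */
    float  y = (float)d;   /* narrowing back: recovers x */
    return memcmp(&x, &y, sizeof(x)) == 0;
}

int main(void)
{
    printf("%d\n", roundtrip_ok(0.1f));  /* expect 1 */
    return 0;
}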
Compiler optimization is an ongoing challenge though.
Mostly I have been fixing various issues; recent notable bug fixes:
Binary operators needlessly converting arguments to the destination type;
Type conversion operators doing operations via unnecessary temporaries;
Unsigned loads using redundant zero-extension ops;
...
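For illustration, the zero-extension case shows up with source like
the following (illustrative only; the redundancy was in the generated
code, not the C):

/* a byte load already zero-extends to register width, so emitting a
   second zero-extension op for 'p[i]' before the add was pure waste */
unsigned sum_bytes(const unsigned char *p, int n)
{
    unsigned s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}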
And, some sorts of partially-addressed issues:
Temporaries always spilling to stack even when the value is no longer
needed (I tried to add 'phi' operators to the IR, but this was a fail;
I did add logic though for "don't bother storing a temporary if it is
never reloaded", which also helps, but is less fine-grained than a
'phi' would have been).
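As a sketch of the kind of case where a 'phi' would have been
finer-grained than the store-elision heuristic:

/* 't' has two reaching definitions that merge at the return. With a
   'phi' in the IR, the merge could stay in registers; without one,
   the simple approach spills 't' to the stack at each assignment.
   The "never reloaded" heuristic doesn't fire here, since 't' is in
   fact reloaded at the return. */
int select_add(int c, int a, int b)
{
    int t;
    if (c) t = a + 1;  /* definition 1 */
    else   t = b - 1;  /* definition 2 */
    return t;          /* t = phi(t1, t2) in SSA terms */
}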
Had recently (briefly) gotten Dhrystone up to ~65k, but it has since
dropped to ~63k; it seems very sensitive to code-generation options.
This is on my custom ISA at 50MHz, so ~0.74 DMIPS/MHz.
While, conceivably, one can push it a little closer to ~0.78 via strict
aliasing, I don't consider this worthwhile ATM.
This seems to be in the sort of intermediate area between "slow" and
"fast" cores (though I have not yet figured out the nature of the
dividing line). Having looked briefly at the SweRV core, though, I have
noted that they deal with the pipeline in a very different way (for
example, they seem to handle load/store with a queue and interlocks,
rather than running the L1 D$ in lockstep with the pipeline and
stalling the pipeline on each L1 miss).
It does seem like this could be a possible way to improve the speed of a
core, but would introduce its own complexities. I am not sure if this
accounts for a lot of the differences I am seeing (but, in
testing/models, the core is dumping a lot of its clock-cycles into cache
misses, and in theory this could at least help here, if it could be done
cheaply enough).
Though, it can also be noted that in my case, my CPU core lacks an
integer divider, and only provides a 32-bit integer multiplier (a full
64-bit multiplier would be slow and/or expensive). Generally, faking
64-bit multiply as an instruction-blob has been "fast enough".
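For reference, roughly the shape of what such a blob computes,
composing the low 64 bits of a multiply from 32x32->64 partial
products (shown in C; the real thing would be ASM):

#include <stdint.h>

/* low 64 bits of a 64x64 multiply from three 32x32->64 partial
   products; the ah*bh term only affects bits >= 64 and is dropped */
uint64_t mul64_lo(uint64_t a, uint64_t b)
{
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
    uint64_t lo  = (uint64_t)al * bl;
    uint64_t mid = (uint64_t)al * bh + (uint64_t)ah * bl;
    return lo + (mid << 32);  /* only the low 32 bits of mid survive */
}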
Pretending to have instructions, but faking them with traps, is the
worst of both worlds (pointless and slow); at least a runtime call to
an ASM function can be reasonable.
I did recently fiddle around with seeing if I could make the ASM code
for integer divide faster, where the options are:
Binary Long Division (current mainline option);
Lookup table of reciprocals (sorta works);
Lookup table + shifted-reciprocal (didn't win).
I had recently tried switching from solely using long division as the
default, to using a lookup table with long division as a fallback. This
is a little faster for small divisors, but a little slower for divisors
that fall outside the range of the table. It also adds the cost of the
table lookups (trading logic for cache misses) and uses up space, which
favors a fairly small lookup table (under 1K), such that for
divide-intensive code the table will mostly fit in the L1 cache.
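A much-simplified sketch of the idea (in C, with hypothetical names;
the actual routine would be hand-written ASM):

#include <stdint.h>

#define RECIP_N 256                 /* 256 x 4 bytes: table stays ~1K */
static uint32_t recip_tbl[RECIP_N]; /* recip_tbl[d] = floor(2^32 / d) */

void recip_init(void)
{
    for (uint32_t d = 2; d < RECIP_N; d++)
        recip_tbl[d] = (uint32_t)(0x100000000ull / d);
}

uint32_t udiv32(uint32_t n, uint32_t d)
{
    if (d == 1)
        return n;
    if (d >= 2 && d < RECIP_N) {
        /* q lands on floor(n/d) or floor(n/d)-1; one conditional
           fixup makes it integer exact for any 32-bit n */
        uint32_t q = (uint32_t)(((uint64_t)n * recip_tbl[d]) >> 32);
        if ((uint64_t)(q + 1) * d <= n)
            q++;
        return q;
    }
    /* fallback: binary long division for divisors past the table */
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);
        if (r >= d) { r -= d; q |= 1u << i; }
    }
    return q;
}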
Shifted reciprocal can extend the range of such a small lookup table to
cover the entire integer range, but it comes with a big issue: the
result is not integer exact; in effect, the error grows in proportion
to the size of the input values.
So, it is fast for things like 3D rendering calculations where one
doesn't really care if the result is integer exact, but for C's "a/b"
operator, being integer exact is mandatory.
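A sketch of the shifted-reciprocal variant, reusing the hypothetical
recip_tbl above (approximate by design; assumes d >= 2, with in-table
divisors taking the exact path):

/* normalize a large divisor down into table range, then compensate
   with an extra shift. Since (d >> s) << s <= d, the quotient tends
   to run high, with the error growing with the input magnitudes. */
uint32_t udiv32_approx(uint32_t n, uint32_t d)
{
    int s = 0;
    while ((d >> s) >= RECIP_N)  /* shift d into the table's range */
        s++;
    return (uint32_t)((((uint64_t)n * recip_tbl[d >> s]) >> 32) >> s);
}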
The problem then becomes that, in many cases, trying to fix this ends
up costing more than if one had just used the long division loop.
Most of the workarounds tend to be "maybe helps, maybe makes it worse".
However, most other code doesn't lean quite so heavily on integer
divide performance, so it usually isn't quite as relevant.
I am hesitant about adding an instruction for IDIV, as for years even
fairly high-end x86 CPU lines were plagued with an IDIV instruction
that was slower than doing it in software, and where performance would
likely have been better had the C compilers just called a runtime
function and done the binary long division in the C library.
Including the divide instruction in an ISA carries a non-trivial risk of
such a thing happening again (if compilers decide to use it by default
because "hey, there is an instruction for it, and dedicated instruction
implies fast, right?...").
But, in some sense, it is easier to use special-case mechanisms to make
a runtime call faster than it is to avoid a potentially significant
overhead from emulating instructions which don't exist in hardware.
Granted, a possible intermediate option could be to have an instruction
whose dedicated purpose is to call into a location in a table while
also doing two register moves, ..., which then branches to the
appropriate runtime function (sorta like a special GOT, mostly for
stuff that was too expensive to add to the ISA directly, but where one
kinda wants to pretend that a CPU instruction exists).
Then one can just sort of declare that all the scratch registers will be
left in an undefined state following the operation.
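Modeled roughly in C (all names here hypothetical; no such instruction
currently exists):

#include <stdint.h>

typedef uint64_t (*rt_fn)(uint64_t, uint64_t);

static uint64_t rt_udiv64(uint64_t a, uint64_t b) { return a / b; }
static uint64_t rt_umod64(uint64_t a, uint64_t b) { return a % b; }

enum { RT_UDIV64, RT_UMOD64 };  /* fixed indices into the table */
static const rt_fn rt_call_tbl[] = { rt_udiv64, rt_umod64 };

/* "RTCALL #idx, Rs, Rt" would: move Rs/Rt into the argument
   registers, branch through rt_call_tbl[idx], and leave the result in
   the return register, with the scratch registers undefined
   afterward. A compiler could then lower, e.g., 64-bit '/' to
   rtcall(RT_UDIV64, a, b). */
static inline uint64_t rtcall(int idx, uint64_t rs, uint64_t rt)
{
    return rt_call_tbl[idx](rs, rt);
}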
...