On 6/16/2022 6:56 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Tuesday, June 14, 2022 at 12:29:41 PM UTC-5, BGB wrote:
>>> On 6/14/2022 6:20 AM, Anton Ertl wrote:
>>>> Early MIPS CPUs were high-performance for their time.
>>>>
>>> Probably true, it would appear that (prior to the Pentium) x86 and
>>> friends weren't exactly all that high-performance either.
>>>
>> Pentium Pro was the first x86 that anyone would have considered fast.
>
> IMHO, high-perf x86 actually started with the Pentium:
>
> The PPro/P6 was simply the first model that was fast when running
> compiled code!
>
Going by what stats I could find, it seemed to be doing pretty well vs
386 and 486.
As noted before, my DMIPS/MHz score only slightly beats out a 486, but
how much is due to architecture vs compiler vs ... is uncertain.
Even as weak as my compiler is, though, architecturally I should have an
advantage over the 486 in nearly every area (well, apart from branch
latency, my core apparently having a somewhat longer pipeline than the 486).
My emulator currently shows a fair number of cycles now going into
branches, but it does not model the branch predictor at present.
Mostly because branching was relatively less of an issue back when the
core had more significant memory-access bottlenecks (e.g., L1<->L2 was
then the bottleneck, vs L2<->DRAM now).
Though, as-is it appears even a 100% L2 hit rate isn't enough to
consistently break 10fps in GLQuake at 50MHz.
Based on the "simulation status LEDs", I am starting to see a few more
stalls due to non-memory reasons. It would appear the FPU is starting to
make its presence known.
> By investing a lot of programmer resources, it was in fact possible to
> get really high performance out of the classic Pentium. My favorite
> algorithm to show this is word count where my 60 MHz Pentium could count
> characters/words/lines at an actual (measured) performance of 40 MB/s.
>
> Similarly, as soon as we got the MMX 64-bit SIMD extension, it was
> possible to do full rate DVD decoding with zero frame drops. (Zoran
> SoftDVD was the first program to achieve this, I helped out a little bit
> with the asm code.)
>
I have my doubts as to how much I could pull off here.
Based on my own experiences, I would suspect 640x480p30 MPEG decoding
would be a bit outside of what I could likely pull off on a 50 MHz CPU core.
I was having enough of a challenge as-is with simpler codec designs at
320x200.
Though, in this case, one faces both the issue of making the decoder
fast enough, and of keeping the bitrate low enough to not choke on the
SD card (e.g., why I couldn't just play MS-CRAM video; the bitrate
needed for CRAM to look "not completely awful" was pretty high).
But, as noted, what I had the best results with was mostly:
Differential color endpoint vectors;
Skipped if no delta;
1/2 bytes for small deltas;
3+ bytes for larger deltas.
6-bit lookup-based patterns (1);
LZ post-compressor.
Experimented with both byte-oriented and Huffman encoded LZ;
Huffman only sometimes won (usually with I-Frames).
Encoded frame data could effectively be seen as a stream of command
bytes, with commands in a few categories:
Draw a block (6b pattern, or a larger block if needed);
Update color endpoints (for a subsequent block);
Skip N blocks;
Combined color endpoint and block (and/or larger blocks and runs).
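As a concrete sketch of the command-byte dispatch (note: the 2-bit
opcode layout below is made up for illustration; the actual codec's bit
assignments aren't given here):

```c
#include <stdint.h>

enum cmd_kind { CMD_SKIP, CMD_DRAW, CMD_ENDPOINT, CMD_COMBINED };

#define CMD_ARG(b) ((b) & 0x3F)   /* low 6 bits: pattern / count / delta */

/* Classify one command byte by its (hypothetical) top 2 bits. */
static enum cmd_kind classify_cmd(uint8_t b)
{
    switch (b >> 6) {
    case 0:  return CMD_SKIP;      /* skip N blocks                 */
    case 1:  return CMD_DRAW;      /* draw block with 6-bit pattern */
    case 2:  return CMD_ENDPOINT;  /* update color endpoints        */
    default: return CMD_COMBINED;  /* endpoints + block in one go   */
    }
}
```

The decoder's inner loop then becomes a switch on `classify_cmd()`,
with any payload (delta bytes, run lengths) read after the command byte.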
1:
h_sign, v_sign, h_freq(2b), v_freq(2b)
With freq based on a pattern, eg (0..7):
00: 5555 / 2222 (S=1)
01: 6521 / 1256 (S=1)
10: 6116 / 1661 (S=1)
11: 7070 / 0707 (S=1)
Both the horizontal and vertical patterns would be combined, and then
used to generate the color-selection pattern.
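A rough sketch of how such patterns could be expanded (the freq tables
are the ones listed above; the bit layout of the index and the
sum-and-threshold combine rule are my guesses, not the codec's actual
rule):

```c
#include <stdint.h>

/* Frequency tables (values 0..7); the sign bit picks the reversed
   variant of each pair. */
static const uint8_t freq_tab[4][2][4] = {
    { {5,5,5,5}, {2,2,2,2} },  /* 00 */
    { {6,5,2,1}, {1,2,5,6} },  /* 01 */
    { {6,1,1,6}, {1,6,6,1} },  /* 10 */
    { {7,0,7,0}, {0,7,0,7} },  /* 11 */
};

/* Expand a 6-bit index (h_sign, v_sign, h_freq, v_freq) into a 16-bit
   4x4 color-selection mask, bit (4*y+x); combine rule assumed to be
   "sum the weights and threshold at the midpoint". */
static uint16_t expand_pattern(int idx6)
{
    int h_sign = (idx6 >> 5) & 1;
    int v_sign = (idx6 >> 4) & 1;
    const uint8_t *h = freq_tab[(idx6 >> 2) & 3][h_sign];
    const uint8_t *v = freq_tab[idx6 & 3][v_sign];
    uint16_t mask = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            if (h[x] + v[y] >= 8)   /* assumed threshold */
                mask |= (uint16_t)1 << (y * 4 + x);
    return mask;
}
```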
The color endpoints were seen as a pair of RGB555 values, but one
doesn't always want to encode a pair of RGB555 endpoints, so one would
want a scheme to compact this into a smaller representation (such as a
delta applied to the prior endpoints, a joint-coded endpoint, ...).
In the small-case, one can spend say, 5-bits, which indicates to
add/subtract 1 to each component (3^3=27), with 28..31 as a few escape
cases.
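A sketch of that small-delta decode (the base-3 digit order is my
assumption; codes at or above 27 are simply treated as escapes here,
per the 28..31 escape range mentioned above):

```c
/* Decode one of the 27 small-delta codes (0..26) into per-component
   steps of -1/0/+1, one base-3 digit per component.  Returns 0 for
   escape codes, where the caller would read the larger delta forms. */
static int decode_small_delta(int code, int step[3])
{
    if (code >= 27)
        return 0;                  /* escape case */
    for (int i = 0; i < 3; i++) {
        step[i] = (code % 3) - 1;  /* digit 0 -> -1, 1 -> 0, 2 -> +1 */
        code /= 3;
    }
    return 1;
}
```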
This codec didn't really do motion compensation, mostly block skip /
non-skip (translation was defined, but this effectively requires
double-buffering the decoder, which means skipped blocks need to be
copied from one frame to another).
This works well if the background stays static, but is not particularly
effective with camera movements (one effectively needs to re-encode the
whole frame in these cases). Though, CRAM and RPZA also have this
limitation.
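A sketch of what the double-buffered skip-copy amounts to (assuming 4x4
blocks of RGB555 pixels; the names and buffer layout are made up):

```c
#include <stdint.h>

/* With two frame buffers, a skipped block is not "free": its pixels
   must be copied from the previous frame into the current one.
   stride is the frame width in pixels; (bx,by) is the block index. */
static void copy_skipped_block(uint16_t *cur, const uint16_t *prev,
                               int stride, int bx, int by)
{
    const uint16_t *s = prev + (by * 4) * stride + bx * 4;
    uint16_t       *d = cur  + (by * 4) * stride + bx * 4;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y * stride + x] = s[y * stride + x];
}
```

With a single frame buffer, skipped blocks cost nothing, which is why
translation/motion support (which forces the two-buffer scheme) was
defined but not really used.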
> Terje
>
> PS. Funnily enough, my 60 MHz Pentium asm code would actually run slower
> on a 200 MHz PPro! The P6 needed an algorithm which avoided all the
> partial register updates my Pentium code took advantage of.
>
Yep.
I made a few codecs before which managed to be particularly fast on
Phenom and Bulldozer/Piledriver, but then took a relative performance
hit on Ryzen.
Partly, this was because on the former, it was often faster to load
values as a packed integer, and then bit-shift and mask out the parts
one was interested in.
On the Ryzen, it was often faster to load discrete byte or word values
from memory, without using any shift-and-mask steps.
This did not fare super well for codecs built heavily around "read 64
bits at a time and then shift-and-mask everything".
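Roughly, the two styles look like this (function names are made up; the
shift-and-mask version assumes a little-endian target):

```c
#include <stdint.h>
#include <string.h>

/* Style A: one wide load, then shift-and-mask out the field.
   This was the faster option on Phenom/Bulldozer. */
static uint8_t get_byte_shift(const uint8_t *buf, int i)
{
    uint64_t v;
    memcpy(&v, buf, 8);             /* unaligned-safe 64-bit load */
    return (uint8_t)(v >> (i * 8)); /* little-endian byte extract */
}

/* Style B: discrete narrow loads, no shifting.
   This was often the faster option on Ryzen. */
static uint8_t get_byte_direct(const uint8_t *buf, int i)
{
    return buf[i];
}
```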
There are also some differences for optimizing for 32-bit or 64-bit x86:
x86-32: favors using 32-bit types and "small and tight" loops;
Computed values are kept close to their point of use;
Using lookup tables rather than computing stuff works well;
...
x86-64: more favors 64-bit types and unrolling.
Computed values may be spread out a little more.
In comparison, BJX2 is sorta like x86-64 here, but more so:
It tends to favor fairly aggressive loop unrolling;
Favors distancing computing values from their point of use.
So, one calculates something, and then throws some other independent
expressions between this and the expression that uses the result, with a
fair amount of unrolling, ...
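As a toy illustration of the two styles (a plain reduction, rolled vs
4-way unrolled with independent accumulators to put distance between
each load and its use):

```c
/* Tight rolled loop: the x86-32 style, one dependency chain. */
static long sum_rolled(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled with 4 independent accumulators: the BJX2/x86-64 style.
   Each load's result is not consumed until 3 other independent
   operations have been issued in between. */
static long sum_unrolled(const int *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)    /* tail */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```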
Actually, in some ways, optimizing stuff for my core is kind of
suspiciously similar to the behavior of Bulldozer and friends (though I
don't have a particularly strong explanation for why this would be).
Though, for Bulldozer, there was also sort of a rule of "avoid 'if()'
blocks by any means possible" (use a bunch of extra bit-twiddly and
arithmetic operations, but whatever you do, don't use an 'if()' ...),
which is at least slightly less of an issue with BJX2.
Comparably, my Ryzen seems to not care about branching as much as on
Bulldozer, so the 'if()' branches are slightly less "avoid at any cost".
Though both are unlike the RasPi, which doesn't seem to care all that
much if or where one uses a branch (one can use 'if()' branches, get
clever with function pointers all over the place, ..., and the ARM CPU
doesn't seem to care).
But, if I try to run code optimized for BJX2 on a RasPi, in terms of
significant unrolling with a "whole truckload" of variables, ..., it
tends to perform like garbage (favoring smaller/tighter loops and closer
association between related expressions, more like on 32-bit x86).
Though, a lot of this is likely explainable by 32-bit ARM not liking it
when there are more in-flight variables than the CPU has registers to
hold.
...
One other thing I can note (both my prior and current CPU in my PC):
Don't enable AVX support in MSVC, as this basically ruins performance.
It seems AVX performs like crap compared with plain SSE; but if
enabled, MSVC's auto-vectorization will try to use it, and there is no
good way in MSVC to turn the auto-vectorization off.
Though, in this case I am currently using a CPU that is still using a
128-bit SIMD unit internally (Zen+).
...