On 8/2/2022 6:08 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> To support Reg/Mem "in general", seems like it would need mechanisms for:
>> Perform a Load, then perform the operation (Load-Op);
>
> Sure, that's what is done in every implementation.
>
>> Ability to perform certain ops directly in the L1 cache.
>
> I don't know any implementation that does that, although there have
> been funky memory subsystems that supported fetch-and-add or other
> synchronization primitives; AFAIK they do it in the remote memory
> controller, not in the controlled memory, though.
>
I did an experiment where I put a mini-ALU in the L1 cache.
Interestingly, it sorta works:
No significant change to architecture (vs Load/Store);
Resource cost is modest;
Timing seems to survive;
...
Drawbacks:
Doesn't work for general operations;
A few 'useful' cases (like direct CMP against memory) are left out;
No good way to route a status flag update out of this.
Compare would require doing the flag update in EX3, after the result
arrives, but this wouldn't save much over the 2-op sequence.
This trick wouldn't work for a full ISA (like x86), but is probably OK
for "well, we'll stick a few ALU ops here".
It would also be wholly insufficient for an ISA like the M68K.
>> Latter is assuming that only a subset of operations support memory as a
>> destination, which appears to be the case in x86 at least
>> (interestingly, both RISC-V's 'A' extension
>
> That's the atomic extension, i.e., what I called "synchronization
> primitives" above. These operations are unlikely to be fast relative
> to non-atomic operations.
>
Possibly true.
But, it does add a limited set of RMW operations.
The extension I added should be more or less able to emulate the
behavior of the 'A' extension, but is nowhere near as far reaching as x86.
Then again, much past basic ALU operations, compilers tend to treat
x86-64 as if it were Load/Store, so it may not be a huge loss.
Well, and also x86 tends to have only LoadOp forms for most
instructions, with StoreOp forms limited primarily to basic ALU
instructions and similar.
Still not entirely sure how I will use them in my compiler (where
everything was written around the assumption of Load/Store); this will
be another thing to resolve.
Maybe I would add special cases to a few of the 3AC ops: if doing a
binary op where the arguments are not in registers (and this extension
is enabled), do a few extra checks and use the LoadOp or StoreOp
encodings if appropriate, along the lines of the sketch below.
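Something like (a minimal sketch; all of these helper names are
hypothetical, not actual compiler code):

typedef struct BccOperand_s BccOperand;   /* opaque 3AC operand */

extern int  ext_regmem_enabled;                    /* Reg/Mem extension on? */
extern int  OpIsBasicAlu(int op);                  /* op has Reg/Mem forms? */
extern int  InReg(BccOperand *opr);                /* operand in a register? */
extern int  SameVar(BccOperand *a, BccOperand *b);
extern void EmitLoadOp(int op, BccOperand *dst, BccOperand *srcA, BccOperand *srcB);
extern void EmitStoreOp(int op, BccOperand *dst, BccOperand *srcB);
extern void EmitViaLoadStore(int op, BccOperand *dst, BccOperand *srcA, BccOperand *srcB);

void EmitBinOp3AC(int op, BccOperand *dst, BccOperand *srcA, BccOperand *srcB)
{
    if (ext_regmem_enabled && OpIsBasicAlu(op)) {
        /* Load-Op form: dst = srcA OP [mem] */
        if (InReg(dst) && InReg(srcA) && !InReg(srcB)) {
            EmitLoadOp(op, dst, srcA, srcB);
            return;
        }
        /* Store-Op (RMW) form: [mem] = [mem] OP srcB */
        if (!InReg(dst) && SameVar(dst, srcA) && InReg(srcB)) {
            EmitStoreOp(op, dst, srcB);
            return;
        }
    }
    /* Otherwise, fall back to the usual Load/Store sequence. */
    EmitViaLoadStore(op, dst, srcA, srcB);
}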
>> However, something like extending EX to 5 stages, or needing "split this
>> operation into micro-ops" logic, would be a little steep. The former
>> would adversely affect branch latency
>
> Use a branch predictor, like the big boys.
>
I do use a branch predictor, but the latency is still there on a
misprediction.
>> the
>> latter would cause these instructions to rather perform poorly
>> (defeating the purpose of adding them).
>
> That's the way the 486 and Pentium went. Yes, load-and-op
> instructions took just as long as a load and an op. I wonder how a
> 486 with an additional EX stage would have performed: one load-and-op
> per cycle would increase performance, but you would have to wait
> another cycle before a conditional branch resolves, you would need
> more bypasses and more area overall.
>
Yeah, dunno. It is a mystery.
I do wonder how stuff back then performed as well as it did. By most
of my current metrics, performance should have been "kinda awful" on
386 and 486 PCs, but they still ran things like Doom and Win95 and
similar pretty well.
Well, also mysteries like how things like JPEG and ZIP were "fast",
where I am currently only getting:
~ 0.4 Mpix/sec from JPEG decoding;
~ 2 MB/s from Deflate (decoding);
...
Which doesn't really seem all that fast.
Well, some era-appropriate video codecs also sorta work, but I have to
compress them further because, while I have enough CPU power to play
CRAM video, I don't generally have the IO bandwidth (and by the time one
gets the bitrate low enough, it looks like broken crap). Similar codec
designs with an extra LZ stage thrown on work pretty OK though (can do
320x200 at 30fps).
MPEG is still a little bit of a stretch though (unless I do it at
160x120 or 200x144 or something...).
For a moment, I was having (~ childhood) memories of movies on VCD being
playable on PCs, but then remembered that this was on a Pentium, so
probably doesn't really count for whether or not they would have worked
acceptably on a 486.
Well, I also have memories of things like FMV games and similar from
that era; the game was basically a stack of CDs with poor-quality
video, most not offering much in terms of either gameplay or replay value.
In my case, Doom runs in the double-digits, but only rarely gets up near
the 32 fps limit.
Granted, I am using RGB555, drawing to an off-screen buffer (followed by
a "blit"), and using RGB alpha-blending to implement things like screen
color flashes, which possibly adds cost in a few areas.
Well, and for example, games like Hexen and Quake2 faked alpha-blending
by using lookup tables (rather than operating on the RGB values).
Doom's original "invisibility" effect was also done using colormap
trickery (whereas my port had switched to doing it via RGB math after
switching stuff over to 16-bit pixels), ...
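As an aside, one cheap way to do a 50/50 blend directly on RGB555
pixels (a sketch of a common trick, not necessarily exactly what my
port does; the function name is made up):

#include <stdint.h>

/* 50/50 blend of two RGB555 pixels: masking off the low bit of each
   5-bit channel before halving keeps the adds from carrying across
   channel boundaries, so all three channels blend in parallel. */
uint16_t Blend555Half(uint16_t a, uint16_t b)
{
    return (uint16_t)(((a & 0x7BDE) >> 1) + ((b & 0x7BDE) >> 1));
}

Other blend weights (say, for a flash fading out) can be approximated
by chaining a few of these half-blend steps.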
With my newer TKGDI experiment, it is possible I could revisit the
original idea of doing everything with RGB555, and maybe look at the
possibility of going back to 8-bit indexed color for some things here
(and then convert during "blit").
Mostly this would require me to add indexed-color bitmap support to
TKGDI; say, via the traditional approach:
BITMAPINFOHEADER:
biBitCount=8, biClrUsed=256, biClrImportant=256, ...
Then one appends the color palette after the end of the BITMAPINFOHEADER
structure (at 32 bits/color for whatever reason).
The blit operation would then be responsible for the indexed-color to
RGB555 conversion.
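A minimal sketch of what such a blit could look like (hypothetical
names; assumes 0x00RRGGBB palette entries as in the Windows DIB
convention, preconverted to RGB555 whenever the palette changes):

#include <stdint.h>

/* Preconvert one 32-bit palette entry (0x00RRGGBB) to RGB555,
   keeping the top 5 bits of each 8-bit channel. */
uint16_t PalEntryToRgb555(uint32_t rgb)
{
    return (uint16_t)(((rgb >> 9) & 0x7C00) |   /* R: bits 23..19 -> 14..10 */
                      ((rgb >> 6) & 0x03E0) |   /* G: bits 15..11 ->  9..5  */
                      ((rgb >> 3) & 0x001F));   /* B: bits  7..3  ->  4..0  */
}

/* The per-pixel work in the blit then reduces to a table lookup. */
void BlitPal8ToRgb555(uint16_t *dst, const uint8_t *src,
                      const uint16_t *pal555, int npix)
{
    int i;
    for (i = 0; i < npix; i++)
        dst[i] = pal555[src[i]];
}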
Could maybe go further, and have a color-cell encoder that can operate
on index-color input (with a table of precomputed Y values and similar),
but this would be kinda moot if window backing buffers and the main
screen buffer were all still RGB555.
But, this still leaves a mystery of how things like Win95 and similar
were so responsive on the fairly limited hardware of the time. Though, I
can guess they didn't use per-window backing buffers (but then how does
one do the window stack redraw without windows drawing on top of each
other, ... ?).
Well, also, on that era of hardware they were using raw bitmapped
framebuffers in a hardware-native format, rather than trying to feed
everything through a color-cell encoder (*1).
Well, say, because 640x400x16bpp would need 512K and 32MB/s for the
screen-refresh, and there isn't enough bandwidth to pull off the display
refresh in this case (it would turn into a broken mess).
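(Rough math: 640*400 = 256000 pixels at 2 bytes/pixel is 512000 bytes;
refreshing that at ~60Hz works out to roughly 30MB/s, on the order of
the 32MB/s figure depending on the exact refresh rate.)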
But, OTOH, 640x400 16-color RGBI looks awful, ...
Which is part of why I had originally used a color-cell display to begin
with (as did a lot of the early game consoles and similar).
*1: This takes blocks of 4x4 pixels, figures out a "dark color" and a
"light color" (converts RGB->Y for this), and then generates a 2-bpp
interpolation value per pixel (interpolates between the A and B
endpoints). This process appears to be a performance bottleneck in the
640x400 mode.
Having tested several options, the interpolation bits are generated
with a process like:
rcp=896/((ymax-ymin)+1); //cached (shared between all the pixels)
ix=(((pix_y-avg_y)*rcp)>>8)+2; //per pixel
block=(block<<2)|ix;
Generally, this is done with 2 passes over all the pixels (a fuller
sketch follows below):
First pass: figure out the ranges and color endpoints;
Second pass: map the pixels to 2-bpp values.
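Putting the two passes together, a rough sketch (the name
EncodeBlock2bpp is made up; assumes the 16 per-pixel Y values were
already computed, and leaves out picking the actual A/B colors):

#include <stdint.h>

uint32_t EncodeBlock2bpp(const int ypix[16])
{
    int i, ymin, ymax, avg_y, rcp, ix;
    uint32_t block;

    /* Pass 1: find the Y range (the color endpoints would be
       derived from the pixels near ymin/ymax; omitted here). */
    ymin = ymax = ypix[0];
    for (i = 1; i < 16; i++) {
        if (ypix[i] < ymin) ymin = ypix[i];
        if (ypix[i] > ymax) ymax = ypix[i];
    }
    avg_y = (ymin + ymax) >> 1;
    rcp = 896 / ((ymax - ymin) + 1);  /* cached, shared by all 16 pixels */

    /* Pass 2: map each pixel to a 2-bit interpolation index (0..3). */
    block = 0;
    for (i = 0; i < 16; i++) {
        ix = (((ypix[i] - avg_y) * rcp) >> 8) + 2;
        block = (block << 2) | (uint32_t)ix;
    }
    return block;  /* 16 pixels * 2 bits = 32 bits of indices */
}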
The per-pixel multiply was the faster approach in this case; I had
also tested, e.g.:
ix=(pix_y>=avg_y)?((pix_y>=avg_hi_y)?3:2):((pix_y>=avg_lo_y)?1:0);
But this comparison ladder was slower than the per-pixel multiply
(and the multiply also tends to be faster on x86).
For "higher quality" encoders, there might be multiple gamma functions,
and some fancier math for calculating endpoints (cluster averaging), but
this is a bit too slow for real-time encoding on the BJX2 core
(normally, one would want to select a "gamma curve" which maximizes
contrast between the high and low sets; and then calculate endpoints
roughly representing a weighted average of the extremes and the centroid
regions of each set of pixels).
For speed, one mostly has to live with a single gamma function, and
merely use the minimum and maximum values, ...
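For reference, a loose sketch of the cluster-averaging endpoint idea
(the 50/50 weighting here is a guess; the exact weights aren't
specified above):

/* Split pixels into low/high sets around the average Y, then form
   each endpoint as a mix of that set's centroid and its extreme. */
void ClusterEndpoints(const int ypix[16], int *y0, int *y1)
{
    int i, ymin, ymax, avg_y;
    int lo_sum = 0, lo_n = 0, hi_sum = 0, hi_n = 0;

    ymin = ymax = ypix[0];
    for (i = 1; i < 16; i++) {
        if (ypix[i] < ymin) ymin = ypix[i];
        if (ypix[i] > ymax) ymax = ypix[i];
    }
    avg_y = (ymin + ymax) >> 1;

    for (i = 0; i < 16; i++) {
        if (ypix[i] < avg_y) { lo_sum += ypix[i]; lo_n++; }
        else                 { hi_sum += ypix[i]; hi_n++; }
    }

    /* 50/50 mix of centroid and extreme (weights are a guess). */
    *y0 = lo_n ? (((lo_sum / lo_n) + ymin) >> 1) : ymin;
    *y1 = hi_n ? (((hi_sum / hi_n) + ymax) >> 1) : ymax;
}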
Though, one other option would be normalizing the window backing buffers
to my UTX2 format, with UTX2 also being used for the main screen buffer.
This would allow the screen-redraw and conversion into the VRAM format
to be faster.
Mostly, I would still prefer for Doom to be able to maintain
double-digit framerates in a "Full GUI" mode.
Then again, from some of the videos I have seen of people showing off
running Doom on RISC-V soft-cores, they are often getting low
single-digit framerates (so, I am probably not doing too horribly in
this sense).
Like, at least in 320x200 (RGB555 mode), on a 50MHz core, I am getting
double-digit framerates.
> - anton