Why does Ryzen 7000 have 32 KB of L1 cache when the Athlon X2 had 64 KB?


Skybuck Flying

Sep 15, 2022, 5:27:43 AM
Why is the L1 cache today half the size it was back in 2006, or even 2001?

64 KB then vs. 32 KB now?

What decision was it based on ?

Bye,
Skybuck Flying.

MitchAlsup

Sep 15, 2022, 1:27:32 PM
On Thursday, September 15, 2022 at 4:27:43 AM UTC-5, Skybuck Flying wrote:
> Why is the L1 cache half the size today of what it was years ago in 2006 ? or 2001 ?
>
> 64 KB then vs 32 KB now ?
>
> What decision was it based on ?
<
Generally it is based on frequency.
<
Wire delay is getting worse as lithography gets smaller, and L1 caches are sized
to just make the target operating frequency. 32KB is smaller than 64KB, and thus
the associated wire delay is less--probably just enough to make the frequency target.
<
If you go back and observe Alpha, it had 8KB->16KB->8KB->16KB as they went
through the changing processes. Here I know the size was chosen to make
frequency.
>
> Bye,
> Skybuck Flying.

Anton Ertl

Sep 16, 2022, 4:26:22 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Thursday, September 15, 2022 at 4:27:43 AM UTC-5, Skybuck Flying wrote:
>> Why is the L1 cache half the size today of what it was years ago in 2006 ? or 2001 ?
>>
>> 64 KB then vs 32 KB now ?

Already the K7 in 1999 had 64+64KB L1 cache (2-way). Which is
interesting, because the contemporary Pentium III had 16+16KB (4-way),
both with 3 cycles latency.

>Wire delay is getting worse as lithography gets smaller,

Interestingly, the cache sizes are remarkably stable. The K7 and its
successors had 64+64KB from the K7 1999 (250nm) until the Phenom II
2008 (45nm). Intel has had 32+32KB from at least the Core 2 Duo 2006
(65nm) until the Comet Lake 2019 (14nm); they switched from 3 cycles
latency to 4 cycles latency at some point during this evolution. They
switched to 32KB I-cache + 48KB D-cache with 5 cycles latency with Ice
Lake in 2019 (10nm, but also in the Rocket Lake port of Ice Lake to
14nm).

>If you go back and observe Alpha, it had 8KB->16Kb->8Kb->16Kb as they went
>through the changing processes. Here I know the size was chosen to make
>frequency.

        size     cyc  MHz
EV4     8+8KB    2    200   750nm
EV45    16+16KB  2    300   500nm
EV5     8+8KB    2    333   500nm  96KB L2 cache on-chip
EV56    8+8KB    2    600   350nm  666MHz version by Samsung in 300nm
PCA56   16I+8D   2    533   350nm  no on-chip L2
PCA56   32I+16D  2    666   280nm  Samsung, no on-chip L2
EV6     64+64KB  3    575   350nm  no on-chip L2
EV67    64+64KB  3    833   250nm  Samsung
EV68A   64+64KB  3    940   180nm  Samsung aluminium interconnects
EV68CB  64+64KB  3    1333  180nm  IBM copper interconnects

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

Sep 16, 2022, 2:50:49 PM
FWIW, in my own testing there is not as much gain in going from 32K to
64K, as both already tend to get upwards of a 90% hit rate. The gains
from 32->64 are a lot smaller than 16->32 or 8->16.
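As a rough illustration of those diminishing returns, here is a toy direct-mapped cache model in Python (the workload, the 24 KB hot set, and the 95%/5% access mix are made-up numbers for illustration, not actual measurements):

```python
import random

def hit_rate(addrs, cache_kb, line=64):
    # direct-mapped: the block number modulo the line count picks the slot
    nlines = cache_kb * 1024 // line
    tags = [None] * nlines
    hits = 0
    for a in addrs:
        blk = a // line
        idx = blk % nlines
        if tags[idx] == blk:
            hits += 1
        else:
            tags[idx] = blk
    return hits / len(addrs)

random.seed(1)
addrs = []
for _ in range(200_000):
    if random.random() < 0.95:
        addrs.append(random.randrange(24 * 1024))        # hot 24 KB working set
    else:
        addrs.append(random.randrange(8 * 1024 * 1024))  # occasional cold access
rates = {kb: hit_rate(addrs, kb) for kb in (8, 16, 32, 64)}
for kb in (8, 16, 32, 64):
    print(kb, round(rates[kb], 3))
```

Once the cache covers the hot working set (here at 32K), doubling again barely moves the hit rate.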


Ironically, it seems to be a similar issue to register file size, where
going from 32 to 64 GPRs appears to be borderline useless for most code
(contrast to the much more obvious benefits of 8->16 or 16->32).


L2 is different, as one generally gets better results with a big L2 cache.
So, "as big as you can make it" appears to be the general rule for L2.


I mostly stuck with direct-mapped caches though:
The gains from increasing associativity are small;
They are also hit or miss (helps some workloads but hurts others);
On an FPGA, increasing cache size is generally cheaper than increasing
associativity.

I had also noted that modulo indexing appears to be "generally better"
for the L1 caches than hashed indexing. Fewer issues in general, though
it has a weakness in that it isn't ideal in the face of block-aligned data.

Hashed indexing generally appears to work better for the L2 cache.


I am not entirely sure of the factors that seem to favor associative
caches over direct-mapped caches in ASIC implementations though.



The TLB is 4-way, but mostly this is because this seems to be roughly
the minimum needed to keep the CPU from potentially getting stuck in an
infinite loop.

This does create concern for the 96 bit mode, which at present works by
dropping TLB associativity (thus reintroducing the potential of getting
stuck in an infinite loop). Possible solutions being either to increase the
TLB to 8-way associativity (in 64-bit mode) or to expand the TLBEs to
256 bits (and mostly only using the low 128 bits in 64-bit mode).

Though, it may not matter if using modulo indexing rather than hashed
indexing, as modulo indexing seems to have fewer issues with this (with
other pros/cons for modulo-indexing the TLB).


Well, and then there is the ACL cache which in my present implementation
is fully associative but also only has 4 entries (though, in most cases
one is only really likely to see ACL misses on system calls or task
switches, so this isn't likely to be a big issue).



MitchAlsup

Sep 16, 2022, 4:46:04 PM
On Friday, September 16, 2022 at 1:50:49 PM UTC-5, BGB wrote:
> On 9/16/2022 2:44 AM, Anton Ertl wrote:
<snip>
> FWIW, in my own testing there is not as much gain in going from 32K to
> 64K, as both tend to be already get upwards of 90% hit rate. The gains
> from 32->64 being a lot smaller than 16->32 or 8->16.
>
>
> Ironically, it seems to be a similar issue to register file size, where
> going from 32 to 64 GPRs appears to be borderline useless for most code
> (contrast to the much more obvious benefits of 8->16 or 16->32).
>
All of this is completely reasonable. 16 registers should be 20% faster than 8,
32 registers should be 3% faster than 16, 64 registers should be ½% faster
than 32.
>
> L2 is different, as better results generally with having a big L2 cache.
> So, "as big as you can make it", appears to be the general rule for L2.
>
My data indicate that when you add 1 cycle to L2 access you need the L2
to be 3× as big (2×+1 cycle underperforms 1×+0 cycles) from L2=16KB to
2048KB. My layout understanding is that one can achieve 6 cycle access
to L2 at 256 KB and 7 cycle access at 512 KB and 1024 KB, then 8 cycles at
2048 KB. {Many L2s have longer latency, so the addition of 1 more cycle
may hurt more or less depending on the design of the accompanying
pipeline.}
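As a back-of-envelope sketch of that break-even (the 6- and 7-cycle hit latencies follow the figures above; the 10% miss rate and 60-cycle miss penalty are hypothetical numbers): the extra hit cycle is only repaid if the bigger L2 removes at least 1/penalty worth of misses.

```python
def amat(hit_cycles, miss_rate, penalty=60):
    # average memory access time: hit latency plus expected miss cost
    return hit_cycles + miss_rate * penalty

small = amat(6, 0.10)           # e.g. 256 KB L2 at 6 cycles
worse = amat(7, 0.10)           # same miss rate, 1 cycle slower: strictly worse
needed = 1 / 60                 # miss-rate drop needed to pay for the +1 cycle
big = amat(7, 0.10 - needed)    # the slower-but-bigger L2 just breaks even
print(small, worse, big)
```

Whether doubling or tripling the capacity achieves that miss-rate drop depends entirely on the workload's miss-rate-vs-size curve, which is what the 3× observation is about.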
<
But sets of associativity have an additional benefit in the L1s--avoidance
of VA<->PA aliasing in the tags, that don't materialize in VA first level
architectures (Mill) but do in more normal (everything Physical) cache
architectures.
>
> I mostly stuck with direct-mapped caches though:
> The gains from increasing associativity are small;
especially after L1s are bigger than 16KB.
> They are also hit or miss (helps some workloads but hurts others);
> On an FPGA, increasing cache size is generally cheaper than increasing
> associativity.
>
> I had also noted that modulo indexing appears to be "generally better"
> for the L1 caches than hashed indexing. Less issue in general, though
> has a weakness that it isn't ideal in the face of block aligned data.
<
This is contrary to the Seznec paper from 30-odd years ago.
>
> Hashed indexing generally appears to work better for the L2 cache.
>
Are you using the same hash per set-of-associativity ?

BGB

Sep 16, 2022, 11:20:24 PM
On 9/16/2022 3:46 PM, MitchAlsup wrote:
> On Friday, September 16, 2022 at 1:50:49 PM UTC-5, BGB wrote:
>> On 9/16/2022 2:44 AM, Anton Ertl wrote:
> <snip>
>> FWIW, in my own testing there is not as much gain in going from 32K to
>> 64K, as both tend to be already get upwards of 90% hit rate. The gains
>> from 32->64 being a lot smaller than 16->32 or 8->16.
>>
>>
>> Ironically, it seems to be a similar issue to register file size, where
>> going from 32 to 64 GPRs appears to be borderline useless for most code
>> (contrast to the much more obvious benefits of 8->16 or 16->32).
>>
> All of this is completely reasonable. 16 registers should be 20% faster than 8,
> 32 registers should be 3% faster than 16, 64 registers should be ½% faster
> than 32.

I don't have a good model here, but in general this appears to be the case.

For most code, the potential gains are pretty small in any case.
Some niche-case code can benefit, slightly, but most code does not.

I am starting to suspect my XGPR extension is "kinda pointless".


Mostly I have been getting more gains recently from compiler fiddling.


Though, did recently add a few cases which allowed (via a 32-bit encoding):
MOV.L (GBR, Disp10*4), Rn
MOVU.L (GBR, Disp10*4), Rn
MOV.Q (GBR, Disp10*8), Rn
MOV.X (GBR, Disp10*8), Xn
MOV.L Rn, (GBR, Disp10*4)
MOV.Q Rn, (GBR, Disp10*8)
MOV.X Xn, (GBR, Disp10*8)
...
Limited to L, UL, Q, and X.

Which can reduce binary size by around 4% and increase bundling by
around 3%, along with resulting in a slight framerate increase in Doom
(probably the biggest gains I have seen recently).


This seems to be because the (GBR, Disp9u*1) can only reach 512B,
whereas (GBR, Disp10u*8) can reach 8K, and one can fit a lot more global
variables into 8K than into 512B.
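The reach arithmetic behind this, as a trivial sketch:

```python
def reach(disp_bits, scale):
    # an unsigned displacement gives 2**disp_bits slots, each 'scale' bytes
    return (1 << disp_bits) * scale

print(reach(9, 1))    # (GBR, Disp9u*1):  512 bytes
print(reach(10, 4))   # (GBR, Disp10u*4): 4096 bytes (4K)
print(reach(10, 8))   # (GBR, Disp10u*8): 8192 bytes (8K)
```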

When global variables are sorted according to usage frequency, 512B can
hit ~ 29% of the globals, whereas 8K can hit 95%, at least for Doom.

Sorting the globals by usage frequency also seems to help some with
performance (seeming to result in a noticeable reduction in L1 miss rate
vs unsorted globals).


>>
>> L2 is different, as better results generally with having a big L2 cache.
>> So, "as big as you can make it", appears to be the general rule for L2.
>>
> My data indicate that when you add 1 cycle to L2 access you need the L2
> to be 3× as big (2×+1 cycle underperforms 1×+0 cycles) from L2=16KB-to-
> 2048Kb. My layout understanding is that one can achieve 6 cycle access
> to L2 at 256 KB and 7 cycle access at 512 KB and 1024 KB, then 8 cycle at
> 2048 KB. {Many L2 have longer latency so the addition of 1 more cycle
> may hurt more or less depending on the design of the accompanying
> pipeline.}
> <
> But sets of associativity have an additional benefit in the L1s--avoidance
> of VA<->PA aliasing in the tags, that don't materialize in VA first level
> architectures (Mill) but do in more normal (everything Physical) cache
> architectures.

Possibly. In my case, the L1 tags encode basically the full address:
(11:5): Shared between VA and PA;
(47:12): VA
(32:12): PA
Some other bits are metadata/flag bits, additional bits for the 96-bit
VA extension, ...

Though, with some of the other tag bits, this means that a 32K cache
uses roughly 64K of block RAM. Possibly a more efficient way to manage
this exists.


In my case, for the FPGA I am using, I can do a 256K L2 with an internal
access latency of around 2 cycles at 50MHz. At 100MHz would likely need
a higher latency than this.

The L1 caches generally assume a single cycle of latency.
Having 2 cycle latency could be possible, but would require reworking
stuff a bit.

>>
>> I mostly stuck with direct-mapped caches though:
>> The gains from increasing associativity are small;
> especially after L1s are bigger than 16KB.

As noted, I am generally using 32K L1s at present for a single-core
configuration.

A 64K L1 D$ puts a strain on the FPGA budget and doesn't really gain
much of anything in terms of performance.


The 256K L2 cache eats most of the rest of the block-RAM in the FPGA,
and I had to give up the dedicated VRAM to afford this (the VRAM is now
mapped to the L2 cache, with a mechanism set up to basically stream
cache lines from the L2 into a smaller cache in the display output).

The VRAM thing is a bit timing sensitive though and prone to minor glitches.


>> They are also hit or miss (helps some workloads but hurts others);
>> On an FPGA, increasing cache size is generally cheaper than increasing
>> associativity.
>>
>> I had also noted that modulo indexing appears to be "generally better"
>> for the L1 caches than hashed indexing. Less issue in general, though
>> has a weakness that it isn't ideal in the face of block aligned data.
> <
> This is contrary to the Seznic paper from 30-odd years ago.

I have had mixed results here.

In most other scenarios hashing would be an obvious win, but in the L1
caches hashing seems to result in a slightly higher rate of conflict
misses on average (albeit hashing is immune to issues with block-aligned
data). Modulo seems to win so long as one isn't dealing with data that
is aligned in such a way as to cause the data items to knock each other
out of the cache.

One other tradeoff is that modulo indexing allows for double mapping
pages (without creating potential memory consistency issues), whereas
hashed indexing does not. At least, assuming the logical alignment for
mapped regions is larger than the size of the L1 cache (say, forcing 64K
for the alignment of mmap'ed regions).
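A small sketch of the aliasing argument (the 32 KB geometry and the xor-fold hash here are hypothetical): if two virtual aliases of a page differ by a multiple of the L1 size, modulo indexing sends every offset to the same line, so both mappings see one copy; a hashed index gives no such guarantee.

```python
LINE, NLINES = 64, 512                  # 32 KB direct-mapped toy L1

def modulo_index(va):
    return (va // LINE) % NLINES

def hashed_index(va):
    blk = va // LINE
    return (blk ^ (blk >> 9)) % NLINES  # xor-fold upper bits into the index

va1, va2 = 0x10000, 0x90000             # aliases 512 KB apart (multiple of 32 KB)
same_mod  = all(modulo_index(va1 + o) == modulo_index(va2 + o)
                for o in range(0, 4096, 64))
same_hash = all(hashed_index(va1 + o) == hashed_index(va2 + o)
                for o in range(0, 4096, 64))
print(same_mod, same_hash)   # True False
```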

One could avoid this issue probably by keeping the L1 cache keyed by
physical address rather than virtual address, but this would add other
issues (would require each L1 cache to also have its own copy of the
TLB, etc).


So, consider one has two arrays (that both fit in the L1 cache) and one
is accessing between both arrays:
With hashing, the misses between them happen more or less at random;
With modulo, either there are no conflict misses, or they end up
mutually stomping each other (extra bad) if aligned in a certain way.
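This is easy to reproduce with a toy model (the 2 KB cache, the array placement, and the xor-fold hash are made up for illustration, not the actual indexing): alternating between two arrays, modulo either never conflicts or always conflicts depending on alignment, while the hash can break up an alignment that modulo handles worst-case.

```python
def run(addrs, nlines, index_fn, line=64):
    # count misses in a direct-mapped cache with one tag per line
    tags = [None] * nlines
    misses = 0
    for a in addrs:
        blk = a // line
        idx = index_fn(blk, nlines)
        if tags[idx] != blk:
            misses += 1
            tags[idx] = blk
    return misses

NL = 32                                        # 32 lines x 64B = 2 KB toy L1
modulo = lambda blk, n: blk % n
hashed = lambda blk, n: (blk ^ (blk >> 5)) % n  # xor-fold upper bits

def interleave(a_base, b_base, nelem=16, reps=100, line=64):
    # alternate accesses between two 1 KB arrays, element by element
    out = []
    for _ in range(reps):
        for i in range(nelem):
            out += [a_base + i * line, b_base + i * line]
    return out

CACHE = NL * 64
aliased = interleave(0, 16 * CACHE)   # second array collides under modulo
offset  = interleave(0, CACHE // 2)   # second array lands in the other half

print(run(aliased, NL, modulo))  # 3200: every access misses (mutual stomping)
print(run(offset,  NL, modulo))  # 32: compulsory misses only
print(run(aliased, NL, hashed))  # 32: the hash separates the aliased arrays
```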

The eventual winner seemed to be modulo addressing.
Though, this was more because it allowed for double mapping pages.


For the I$, it isn't really as much of an issue, since alignment isn't
really a thing for the I$.



>>
>> Hashed indexing generally appears to work better for the L2 cache.
>>
> Are you using the same hash per set-of-associativity ?

L1 caches are direct-mapped, currently with modulo addressing.

L2 is currently also direct mapped.
Was previously 2-way, but switched back to direct-mapped as this seemed
to be what Doom preferred.
L2 uses a hashed index scheme.

Had noted before that I get better results (in terms of performance) if
the L1 and L2 mapping scheme is different.


With the L2 cache in 2-way mode, the same index would be used for both ways.

There were some other restrictions though on the design of the L2 cache
that reduced its effectiveness, such as only allowing dirty lines in the
'A' set (with slightly convoluted rules for how stuff moves around).


A more effective 2-way design would likely be:
Loads always happen into A;
Stores always happen from B;
Both A or B may be dirty.

So, when an L2 miss happens:
B is written back to RAM;
A moves to B;
The new cache line is loaded into A.

This scheme could also be easily extended to 4-way:
Loads always happen into A;
Stores always happen from D;
Both A/B/C/D may be dirty.
L2 miss happens:
D is written back to RAM;
A moves to B, B moves to C, C moves to D;
The new cache line is loaded into A.
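A sketch of that 4-way shift scheme as a per-set FIFO (simplified: dirty bits and the actual writeback of D are not modeled, and hits are assumed not to move lines):

```python
class ShiftCache4:
    # ways[0..3] correspond to A..D; on a miss D is evicted (where the
    # writeback would happen), A->B->C->D shift, and the new line enters A
    def __init__(self, nsets):
        self.nsets = nsets
        self.ways = [[None] * nsets for _ in range(4)]

    def access(self, blk):
        idx = blk % self.nsets
        if any(w[idx] == blk for w in self.ways):
            return True                  # hit: no movement in this scheme
        for w in range(3, 0, -1):        # shift down, dropping D
            self.ways[w][idx] = self.ways[w - 1][idx]
        self.ways[0][idx] = blk          # new line loads into A
        return False

c = ShiftCache4(nsets=1)
for b in (1, 2, 3, 4):
    c.access(b)          # four compulsory misses fill A..D
print(c.access(1))       # True: block 1 has shifted down to way D
print(c.access(5))       # False: this miss evicts D (block 1)
print(c.access(1))       # False: block 1 was just evicted
```

This is FIFO replacement per set, which is cheap in LUTs but, as noted, would need modeling to show whether it consistently beats direct-mapped.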

Would likely need to do modeling though to determine if this approach
would be "consistently" better than a direct-mapped L2 (or, at least,
better enough to justify the LUT cost).


One minor drawback is that this cache design would need to be able to
perform 2 or 4 cache miss events in order to be able to flush a cache
line in the 'A' set.

For L2 at least this is slightly less of an issue, as one can use the
request cycling around the bus to prod it each step along the way (and
cache flushes aren't really meant to be all that fast anyways).

Quadibloc

Sep 19, 2022, 12:19:20 AM
On Friday, September 16, 2022 at 2:46:04 PM UTC-6, MitchAlsup wrote:

> All of this is completely reasonable. 16 registers should be 20% faster than 8,
> 32 registers should be 3% faster than 16, 64 registers should be ½% faster
> than 32.

In that case, the 360 had it right, and RISC has it wrong. Which makes sense,
I guess, since 32 registers were originally an attempt to remove the need for
OoO, but nowadays they aren't nearly enough for that.

John Savard

MitchAlsup

Sep 19, 2022, 11:34:24 AM
On Sunday, September 18, 2022 at 11:19:20 PM UTC-5, Quadibloc wrote:
> On Friday, September 16, 2022 at 2:46:04 PM UTC-6, MitchAlsup wrote:
>
> > All of this is completely reasonable. 16 registers should be 20% faster than 8,
> > 32 registers should be 3% faster than 16, 64 registers should be ½% faster
> > than 32.
<
> In that case, the 360 had it right, and RISC has it wrong.
<
It can be argued that excellent silicon design is finding the best "knee" of the
curve--and under that general guideline, 16 registers might be considered better.
<
On the other hand, since reading SRAM is more power expensive than reading
a register file (port) the more registers are not only buying more perf but also
lowering your power consumption.
<
> Which makes sense,
> I guess, since 32 registers were originally an attempt to remove the need for
> OoO, but nowadays they aren't nearly enough for that.
<
That was not their intended purpose. Their intended purpose was to get rid of
excessive stack accesses by having enough registers the compiler does not
run out.
>
> John Savard

BGB

Sep 19, 2022, 1:19:32 PM
In this case, a 3% gain is noticeable, whereas a fraction of a percent
is less so.


It is not necessarily a universal constant though:
Some functions fit easily in 16, so 32 wouldn't gain much;
Some functions might not fit well in 32, so 64 can pay off.

A function which ends up turning into a wall of almost entirely spill
and fill will not perform well compared with one where stuff can fit in
registers.



In my case, I originally thought up the XGPR extension in the context of
my WEX-6W idea, where to make effective use of 6-wide bundling one would
need to get more creative with the register mapping, which would then
need more than 32 GPRs.

Some initial prototyping attempts were strongly pointing in the
direction that WEX-6W would be non-viable.

However, XGPR was relatively cheap on an FPGA (its main costs being more
related to hackery needed for the instruction encoding).



Since, as noted, BJX2 is primarily designed around 32 GPRs:
XGPR can only directly be encoded with a subset of the ISA;
With a reduced set of features (no predication for 32b ops);
Much of the rest requires Op64 or Op96 encodings;
...

This means that using it where it is not appropriate is slightly
detrimental to code density and performance.
Though, 128-bit SIMD ops can use them without any penalty; the
restrictions mostly apply to instructions operating on 64-bit data.


It does offer advantages in a few types of cases though, namely code
with a fairly large amount of in-flight data (a large number of
variables and large/complicated expressions).

However, in programs like Doom and friends, this type of code does not
exist. In its basic form, Doom is mostly "small and tight" loops
typically operating with relatively few variables.

It seems to be mostly optimized for machines with a relatively small
number of registers and cheap looping (not particularly the case on
BJX2; spilling and reloading variables for every basic-block being not
ideal for small/tight loops).


Annoyingly, the XGPR extension doesn't significantly affect the rate of
register spills, as in many cases the register spills would have been
"necessary" due to the compiler design regardless of the number of
registers available.

The main hope would be to increase the number of functions that could
"static assign everything".

But, most of the small functions (where these sorts of small/tight loops
exist) already static-assign everything. So, it doesn't change things
all this much. Other cases are mostly those that fall outside the
criteria for which "full static assignment" can be used.


The remainder then are mostly cases where full static assignment can't
be used:
Functions which take the address of a local variable (*1);
Functions which make use of global variables (*2);
...

*1: Taking the address of a variable in a function will cause it to be
treated as volatile for the entire function, so it will be reloaded
and then spilled every time the variable in question is accessed (as
opposed to the default behavior of loading it from the stack the first
time it is accessed, and then spilling it back at the end of the current
basic block).


*2: Another remaining case is global variables, where:
Doom uses global variables inside loops rather frequently;
Global variables necessarily reload and spill in my compiler.

The optimization needed to eliminate spill and reload for global
variables is not one my compiler can currently perform (could look into
this for leaf functions; would mostly need to load any globals in the
prolog, and write back any dirty globals in the epilog).



Less sure how much effect there would be (hypothetically) if GCC or
similar were used, since it seems to assign registers in a "block to
block" sense, rather than using a static/dynamic scheme.

The scheme GCC uses requires a smarter compiler though (namely, one
which can track how the variables in one block flow into the other
blocks), whereas the static/dynamic scheme works with a relatively naive
compiler (the compiler doesn't need to know/care what happens in the
other blocks), but does seem to need a larger number of registers to
work efficiently.

..


Ironically, my (GBR,Disp10*N) instructions would likely have been less
useful had the compiler not been using reload-and-spill semantics for
globals.

Say:
some_global_a += 3;
some_global_b += 5;
Turns into, say:
MOV.L some_global_a, R4
ADD R4, 3, R4
MOV.L R4, some_global_a
MOV.L some_global_b, R5
ADD R5, 5, R5
MOV.L R5, some_global_b
...

And, then with luck, the WEXifier turns it into, say:
MOV.L some_global_a, R4
ADD R4, 3, R4 | MOV.L some_global_b, R5
ADD R5, 5, R5 | MOV.L R4, some_global_a
MOV.L R5, some_global_b
...

However, without the new instructions, the WEXifier usually couldn't do
anything in this case (it can't reorder or bundle 64-bit encodings).


Though, still kind of annoying when I can enable 64 GPRs, and still see
a bunch of stack spill and fill for functions which could "very
obviously" be able to fit into the available registers.

But, battling otherwise pointless spill-and-fill cases is roughly in the
same category as battling pointless register-register MOV's (where, say,
compilers like GCC can seemingly avoid putting a bunch of otherwise
pointless reg-reg MOV's and similar into the compiler output).

Though, a lot of these MOV's are a partial side-effect of things that
exist in earlier stages, such as:
Cast operations which are effectively no-ops in the final output;
The result of an expression going into a stack-temporary in the RIL3 IR
(and then being "stored" into a local variable), without the Op-Store
sequence being merged into a combined "OpAndStore" operation (mostly
only done for things like binary operators and similar);
..

Still could probably be a lot worse.


I guess the question is if it could be possible (in theory) to squeeze
another 40-60% out of this (from what I can tell, this should be enough
to close the gap vs GCC).


Or such...



Thomas Koenig

Sep 20, 2022, 1:09:09 AM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Sunday, September 18, 2022 at 11:19:20 PM UTC-5, Quadibloc wrote:
>> On Friday, September 16, 2022 at 2:46:04 PM UTC-6, MitchAlsup wrote:
>>
>> > All of this is completely reasonable. 16 registers should be 20% faster than 8,
>> > 32 registers should be 3% faster than 16, 64 registers should be ½% faster
>> > than 32.
><
>> In that case, the 360 had it right, and RISC has it wrong.
><
> It can be argued that excellent silicon design is finding the best "knee" of the
> curve--and under that general guideline, 16 registers might be considered better.

It is also a question of what you need registers for.

Architectures with memory operands need fewer registers than do
load-store architectures. Having immediates also reduces the need
for registers. For software, loop unrolling increases the need
for registers. You also lose some registers if they are dedicated
to certain functions (PC, constant zero, stack pointer, frame
pointer, TOC/GOT, ...).

MitchAlsup

Sep 20, 2022, 12:17:54 PM
On Tuesday, September 20, 2022 at 12:09:09 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Sunday, September 18, 2022 at 11:19:20 PM UTC-5, Quadibloc wrote:
> >> On Friday, September 16, 2022 at 2:46:04 PM UTC-6, MitchAlsup wrote:
> >>
> >> > All of this is completely reasonable. 16 registers should be 20% faster than 8,
> >> > 32 registers should be 3% faster than 16, 64 registers should be ½% faster
> >> > than 32.
> ><
> >> In that case, the 360 had it right, and RISC has it wrong.
> ><
> > It can be argued that excellent silicon design is finding the best "knee" of the
> > curve--and under that general guideline, 16 registers might be considered better.
> It is also a question of what you need registers for.
>
> Architectures with memory operands need fewer registers than do
> load-store architectures.
<
An x86 with 16 registers performs about as well as a Ld/St machine with 24
<
> Having immediates also reduces the need
> for registers.
<
My 66000 typically uses 3 fewer registers than RISC-V, mainly due to
allowing constants as any operand. Constants are the lowest-power
operand delivery mechanism.
<
> For software, loop unrolling increases the need
> for registers.
<
VVM automagically unrolls loops without consuming registers.
<
> You also lose some registers if they are dedicated
> to certain functions (PC, constant zero, stack pointer, frame
> pointer, TOC/GOT, ...
<
Which My 66000 needs none of.
a) with immediates in any operand position R0 is just another register
b) while there is a dedicated SP and FP, FP is only used in dynamic
stack sized language subroutines; otherwise--its just another register.
c) IP is not a GPR
d) since one has 32-bit and 64-bit displacements TOC and GOT are
directly addressable.

BGB

Sep 20, 2022, 1:59:07 PM
On 9/20/2022 11:17 AM, MitchAlsup wrote:
> On Tuesday, September 20, 2022 at 12:09:09 AM UTC-5, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> On Sunday, September 18, 2022 at 11:19:20 PM UTC-5, Quadibloc wrote:
>>>> On Friday, September 16, 2022 at 2:46:04 PM UTC-6, MitchAlsup wrote:
>>>>
>>>>> All of this is completely reasonable. 16 registers should be 20% faster than 8,
>>>>> 32 registers should be 3% faster than 16, 64 registers should be ½% faster
>>>>> than 32.
>>> <
>>>> In that case, the 360 had it right, and RISC has it wrong.
>>> <
>>> It can be argued that excellent silicon design is finding the best "knee" of the
>>> curve--and under that general guideline, 16 registers might be considered better.
>> It is also a question of what you need registers for.
>>
>> Architectures with memory operands need fewer registers than do
>> load-store architectures.
> <
> An x86 with 16 registers performs about as well as a Ld/St machine with 24
> <

In my case, stuff is still mostly Load/Store, but did add some LoadOp
instructions. My compiler doesn't use them yet though, and they haven't
really been tested as of yet.


Also added a stat to classify certain types of functions, and noted
(following some other recent compiler fiddling, *) that apparently
around 95% of the leaf functions are in the "tiny leaf" case (all
variables statically assigned to scratch registers).

Probably part of the reason why XGPR doesn't make much difference here.


*: Now leaf functions (and tiny leaf) can also access global variables,
and the use of constant values no longer hinders a leaf function from
using full static assignment (apparently this was previously an issue).

I also went and added the "load global into a register in prolog, store
back in epilog" optimization for leaf functions.


Have also since fixed the remaining bug that was affecting the
(GBR,Disp10*4|8) ops; turns out it wasn't a Verilog bug but rather some
bugs in BGBCC's WEXifier that my emulator was failing to lint properly
(resulting in these instructions being bundled in inappropriate ways).

Have since fixed up the flags and added stuff to detect a few more of
these cases.


Also fiddling (modeling stuff):
Doom appears to have a higher L2 miss rate with a 4-way set-associative
L2 than with a direct-mapped L2 for some reason:
DM : 4096 x 64B lines, xor hashed index.
4-way: 1024x4 x 64B lines, also hashed by "bucket", LRU.

Messed with both LRU and MRU, both seemingly do worse than DM in the
case of Doom.

Hash used for L2 is generally (for 4096 entries), say:
Index = Addr(17:6) ^ Addr(29:18);

One can try to get more clever, such as by re-ordering the high bits,
but this seems to work reasonably well.

Say:
Index = Addr(17:6) ^ { Addr(21:18), Addr(25:22), Addr(29:26) };
Being arguably slightly better in terms of hit rate (for Doom at least,
though seems to increase the miss rate for the VRAM buffer).
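Spelled out with the Addr(hi:lo) notation above (a direct transcription of the two formulas, untested against the actual Verilog):

```python
def bits(addr, hi, lo):
    # Addr(hi:lo), inclusive bit range
    return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)

def index_plain(addr):
    # Index = Addr(17:6) ^ Addr(29:18)
    return bits(addr, 17, 6) ^ bits(addr, 29, 18)

def index_swizzled(addr):
    # Index = Addr(17:6) ^ { Addr(21:18), Addr(25:22), Addr(29:26) }
    swz = (bits(addr, 21, 18) << 8) | (bits(addr, 25, 22) << 4) | bits(addr, 29, 26)
    return bits(addr, 17, 6) ^ swz

print(index_plain(1 << 18))     # 1: bit 18 lands in the low index bit
print(index_swizzled(1 << 18))  # 256: the same bit lands 8 positions up
```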


Still not entirely clear why Doom and friends seem to more strongly
prefer a 1-way cache over my various attempts at a set-associative design.

The VRAM also seems to prefer 1-way / direct-mapped as well.


But, alas...


>> Having immediates also reduces the need
>> for registers.
> <
> My 66000 typically uses 3 less registers than RISC-V mainly due to
> constants as any operand. Constants are the lowest power of any
> operand delivery mechanism.
> <

Immediate values mostly limited to Load/Store ops, ALU ops, and a few
other things. Not fully generic.

Constant loading is at least "reasonably cheap" (not necessary to use
either multi-instruction sequences or memory loads).

Minor issue though that encodings larger than 32-bits are considered
immovable by the WEXifier, so larger constants may hinder the "shuffle
and bundle" logic here.


>> For software, loop unrolling increases the need
>> for registers.
> <
> VVM automagically unrolls loops without consuming registers.
> <

In my case it mostly has to be done manually, as my compiler doesn't
really support automatic loop unrolling.

Mostly though this is a case of "compiler isn't really smart enough to
be able to automatically unroll loops without sometimes breaking stuff
and/or making performance worse than with non-unrolled loops".


>> You also lose some registers if they are dedicated
>> to certain functions (PC, constant zero, stack pointer, frame
>> pointer, TOC/GOT, ...
> <
> Which My 66000 needs none of.
> a) with immediates in any operand position R0 is just another register
> b) while there is a dedicated SP and FP, FP is only used in dynamic
> stack sized language subroutines; otherwise--its just another register.
> c) IP is not a GPR
> d) since one has 32-bit and 64-bit displacements TOC and GOT are
> directly addressable.

RISC-V has many of these...

It is said to have 32 GPRs; realistically it is more like 25.

BJX2, in its 32 GPR variant, effectively has 29 GPRs (R0, R1, and R15
being partially excluded).

Some stuff that RISC-V does in GPR space is mapped to CR space in BJX2.


R15 is stack pointer, more or less stuck with this, and is still a
hold-over from SuperH.

R0 and R1 are in a special category of:
Can't be used in some instructions, as they encode special cases;
May be stomped by the assembler without warning if it tries to fake an
instruction as a multi-op sequence (though, this is less common now than
it was earlier on).

R16/R17 were considered to also be in a similar category to R0 and R1,
but are for all practical purposes effectively GPRs at this point.


Depending on how one defines it, 32-bit x86 doesn't really have GPRs,
more like 8 SPRs that are "semi-general".



MitchAlsup

Sep 20, 2022, 3:42:00 PM
Yet, these are the very functions which don't need to save and restore
registers (or return address).
>
>
> Have also since fixed the remaining bug that was effecting the
> (GBR,Disp10*4|8) ops, turns out it wasn't a Verilog bug but rather some
> bugs in BGBCC's WEXifier that my emulator was failing to lint properly
> (resulting in these instructions being bundled in inappropriate ways).
>
> Have since fixed up the flags and added stuff to detect a few more of
> these cases.
>
>
> Also fiddling (modeling stuff):
> Doom appears to have a higher L2 miss rate with a 4-way set-associative
> L2 than with a direct-mapped L2 for some reason:
> DM : 4096 x 64B lines, xor hashed index.
> 4-way: 1024x4 x 64B lines, also hashed by "bucket", LRU.
>
> Messed with both LRU and MRU, both seemingly do worse than DM in the
> case of Doom.
>
> Hash used for L2 is generally (for 4096 entries), say:
> Index = Addr(17:6) ^ Addr(29:18);
<
Index = Addr(17:6) ^ Addr(18:29); // would probably work a bit better.
>
> One can try to get more clever, such as by re-ordering the high bits,
> but this seems to work reasonably well.
>
> Say:
> Index = Addr(17:6) ^ { Addr(21:18), Addr(25:22), Addr(29:26) };
> Being arguably slightly better in terms of hit rate (for Doom at least,
> though seems to increase the miss rate for the VRAM buffer).
<
Rearrange by the bit, not by the nibble.
>
>
> Still not entirely clear why Doom and friends seem to more strongly
> prefer a 1-way cache over my various attempts at a set-associative design.
<
Data structures of modulo-cache-size ?
Where many is ½ of them.
<
But RISC-V is going to have major problems when .text, .bss, .data need
to be positioned above 32-bit address space. My 66000 has no such
problem and in fact the code can be the same.
>
> It is said to have 32 GPRs, realistically it is more like 25.
<
Yes, exactly, and its ABI wastes a register unnecessarily, too.
>
> BJX2, in its 32 GPR variant, effectively has 29 GPRs (R0, R1, and R15
> being partially excluded).
<
Realistically, My 66000 has 30. R0 and R31 (SP) are somewhat special.
<
R0 gets the return address without being explicitly specified.
R0 is a proxy for IP when used as a base register.
R0 is a proxy for zero when used as an index register.
Otherwise, R0 is as real as any other register.
<
SP is implied in the prologue and epilogue instructions.
>
> Some stuff that RISC-V does in GPR space is mapper to CR space in BJX2.
>
My 66000 does not even have control registers (that are program accessible.)
>
> R15 is stack pointer, more or less stuck with this, and is still a
> hold-over from SuperH.
>
> R0 and R1, are in a special category of:
> Can't be used in some instructions, as they encode spacial cases;
> May be stomped by the assembler without warning if it tries to fake an
> instruction as a multi-op sequence (though, this is less common now than
> it was earlier on).
<
My 66000 has NO (absolutely none) registers that get stomped by assembler
or linker or dynamic loader. This is what constants are for and PIC is for.

BGB

unread,
Sep 21, 2022, 3:03:30 AM
to
Previously, the tiny leaf functions could not use global variables. Now
they can (mostly), though it excludes a few cases (this case can't be
used for leaf functions which are used as function pointers, for ABI
reasons).

It seems leaf functions which:
Access global variables (and are also called as a function pointer);
And/or can't fit everything into ~ 12 scratch registers.

Are actually only a small minority.

I would have thought XGPR would have helped, since it expands the
scratch-register count from 12 to 28, but had failed to run the stats
previously to realize that around 95% of the leaf functions already fit
into 12 scratch registers...


>>
>>
>> Have also since fixed the remaining bug that was effecting the
>> (GBR,Disp10*4|8) ops, turns out it wasn't a Verilog bug but rather some
>> bugs in BGBCC's WEXifier that my emulator was failing to lint properly
>> (resulting in these instructions being bundled in inappropriate ways).
>>
>> Have since fixed up the flags and added stuff to detect a few more of
>> these cases.
>>
>>
>> Also fiddling (modeling stuff):
>> Doom appears to have a higher L2 miss rate with a 4-way set-associative
>> L2 than with a direct-mapped L2 for some reason:
>> DM : 4096 x 64B lines, xor hashed index.
>> 4-way: 1024x4 x 64B lines, also hashed by "bucket", LRU.
>>
>> Messed with both LRU and MRU, both seemingly do worse than DM in the
>> case of Doom.
>>
>> Hash used for L2 is generally (for 4096 entries), say:
>> Index = Addr(17:6) ^ Addr(29:18);
> <
> Index = Addr(17:6) ^ Addr(18:29); // would probably work a bit better.
>>
>> One can try to get more clever, such as by re-ordering the high bits,
>> but this seems to work reasonably well.
>>
>> Say:
>> Index = Addr(17:6) ^ { Addr(21:18), Addr(25:22), Addr(29:26) };
>> Being arguably slightly better in terms of hit rate (for Doom at least,
>> though seems to increase the miss rate for the VRAM buffer).
> <
> Rearrange by the bit, not by the nibble.

Nibble was easier, mostly.

I guess it could be worth looking into in any case whether bit-vs-nibble
is better, and nibble-flipped does at least appear to be slightly better
than non-flipped.

More testing, seems not to have much effect on VRAM either way, but does
appear to help slightly with Doom.
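Say, both index variants as plain C, with the bit-reversal being one reading of "rearrange by the bit" (bit ranges follow the Verilog-style Addr(hi:lo) notation used above):

```c
#include <stdint.h>

/* Index = Addr(17:6) ^ Addr(29:18), for a 4096-entry x 64B-line L2. */
static uint32_t l2_index(uint32_t addr) {
    uint32_t lo = (addr >> 6)  & 0xFFFu;   /* Addr(17:6)  */
    uint32_t hi = (addr >> 18) & 0xFFFu;   /* Addr(29:18) */
    return lo ^ hi;
}

/* Index = Addr(17:6) ^ Addr(18:29): XOR against the bit-reversed high
   field, so adjacent high-address bits land on distant index bits. */
static uint32_t l2_index_rev(uint32_t addr) {
    uint32_t lo  = (addr >> 6)  & 0xFFFu;
    uint32_t hi  = (addr >> 18) & 0xFFFu;
    uint32_t rev = 0;
    for (int i = 0; i < 12; i++)
        rev |= ((hi >> i) & 1u) << (11 - i);
    return lo ^ rev;
}
```

In hardware the reversal is free (it is just wiring); in a software cache model it costs a small loop like the above.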


>>
>>
>> Still not entirely clear why Doom and friends seem to more strongly
>> prefer a 1-way cache over my various attempts at a set-associative design.
> <
> Data structures of modulo-cache-size ?


Doom uses a lot of "flats" which are typically 4K (64x64, 8 bits per pixel).

Many of the wall "patches" are either 64 or 128 pixels tall IIRC.

Some amount of the other structures are power-of-2 sizes.
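One way to see why power-of-2 sizes hurt a plain direct-mapped index, and what the XOR hash buys (a generic sketch using the 4096 x 64B geometry from above): any two addresses a whole cache size (256K) apart collide on every line with the plain index, while the hashed index separates them.

```c
#include <stdint.h>

#define L2_LINES 4096u   /* 4096 x 64B lines = 256K direct-mapped L2 */

/* Plain modulo index: Addr(17:6). */
static uint32_t idx_plain(uint32_t addr) {
    return (addr >> 6) & (L2_LINES - 1);
}

/* XOR-hashed index: Addr(17:6) ^ Addr(29:18). */
static uint32_t idx_hashed(uint32_t addr) {
    return ((addr >> 6) ^ (addr >> 18)) & (L2_LINES - 1);
}
```

So two buffers placed 256K apart get identical plain indices for every line (each access to one evicts the corresponding line of the other), whereas the hashed indices differ; structures at power-of-2 offsets are exactly the case where this matters.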

The screen buffer is 320x200 at 16bpp RGB555 (125K), but ends up padded
to 128K (exact) by the memory allocator (it falls over to the page-based
allocator).

The VRAM is also 128K, mapped to a fixed address.
Screen redraw ends up copying and repacking the screen from raster order
to VRAM order (pixel block-oriented).


I had experimented before with modifying Doom to use UTX2 textures and
RGB modulation in place of the color-maps, but it "didn't quite look
right" and didn't really help much with performance either.

In this case, flats would have been reduced to 2K and stored in Morton
order.
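Say, the Morton ("Z-order") index for a 64x64 flat just interleaves the x and y bits, so 2D-adjacent texels stay close in memory (a generic sketch, not necessarily the exact layout used):

```c
#include <stdint.h>

/* Morton index for a 64x64 texel grid: interleave the 6 x-bits with
   the 6 y-bits (x in the even bit positions, y in the odd ones). */
static uint32_t morton64(uint32_t x, uint32_t y) {
    uint32_t m = 0;
    for (int i = 0; i < 6; i++) {
        m |= ((x >> i) & 1u) << (2 * i);
        m |= ((y >> i) & 1u) << (2 * i + 1);
    }
    return m;
}
```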

Wall patches were still kept in column order, though grouped into
multiples of 4 columns. Masked spans were handled using alpha testing
rather than by drawing multiple spans.

Sprite drawing was also using alpha testing in this case.

IIRC, some amount of the code in this case was derived from parts of
TKRA-GL. Just, say, with a span-drawing function modified into a column
drawing function.


The RGB modulation can't exactly preserve the "look" of the original
colormap tables, as while it can preserve more detail with varying light
levels, it tends to make things slightly more "dull" looking, and also
leads to more obvious banding artifacts in some cases.


I could have gone basically either way from the performance front, but
decided at the time to stick with the colormap based rendering.



I have also since moved Doom to "TKGDI", which turns this process into a
"Draw this buffer to this HDC context according to this BITMAPINFOHEADER
structure" (loosely similar in concept to how it worked in the Win32 API).

Had also experimented with drawing stuff GUI style:
Put screen into 640x400 mode, drawing Doom into a window;
Doom draws into window backing buffer;
GUI draws all the windows into the 640x400 screen buffer;
Then transfer this buffer to VRAM.

But, this bogged everything down to around 8-10 fps. Not entirely sure
how Win3.x or Win9x kept the GUIs as responsive as they were.


In my case, this process did generally involve running the screen through
a hi-color to color-cell repacker pretty much every time the screen was
updated, which was not needed with ye olde VGA (and is a big chunk of
CPU time in this case).

In theory though, could consider helper instructions for color-cell
encoding.

This being because the VRAM remains 128K (color cell), whereas 640x400
RGB555 otherwise requires a 512K framebuffer.
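Say, a minimal BTC-style 4x4 color-cell encoder (two RGB555 endpoint colors plus a 16-bit selector mask per cell); the split-by-mean-luma scheme and the cell layout here are guesses for illustration, not necessarily the actual format:

```c
#include <stdint.h>

typedef struct {
    uint16_t col_a, col_b;  /* RGB555 endpoint colors */
    uint16_t sel;           /* 1 bit per pixel: 0 = col_a, 1 = col_b */
} ColorCell;

/* Cheap luma proxy: green channel of an RGB555 pixel. */
static int luma5(uint16_t p) { return (p >> 5) & 0x1F; }

/* Encode one 4x4 block of RGB555 pixels into a two-color cell:
   split pixels about the mean luma, average each side per channel. */
static ColorCell encode_cell(const uint16_t px[16]) {
    ColorCell c = {0, 0, 0};
    int mean = 0;
    for (int i = 0; i < 16; i++)
        mean += luma5(px[i]);
    mean /= 16;

    int ra = 0, ga = 0, ba = 0, na = 0;
    int rb = 0, gb = 0, bb = 0, nb = 0;
    for (int i = 0; i < 16; i++) {
        int r = (px[i] >> 10) & 0x1F, g = (px[i] >> 5) & 0x1F, b = px[i] & 0x1F;
        if (luma5(px[i]) > mean) {
            rb += r; gb += g; bb += b; nb++;
            c.sel |= 1u << i;
        } else {
            ra += r; ga += g; ba += b; na++;
        }
    }
    if (na) c.col_a = (uint16_t)(((ra / na) << 10) | ((ga / na) << 5) | (ba / na));
    if (nb) c.col_b = (uint16_t)(((rb / nb) << 10) | ((gb / nb) << 5) | (bb / nb));
    return c;
}
```

Even a scheme this simple is a per-pixel pass over the whole screen on every update, which gives a feel for why the repacker eats a big chunk of CPU time (and why helper instructions look attractive).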


I kinda suspect old style GUIs probably did stuff a little differently
though (well, as opposed to potentially trying to shovel around several
MB of graphical data roughly 20 or 30 times per second).


In the 320x200 mode, it sidesteps this hit mostly by copying Doom's
frame buffer more directly into VRAM (apart from repacking the pixels
into 4x4 blocks).
In my case, these work fine in theory at least.

Virtual memory works in a basic sense, but still isn't really quite as
stable as would be ideal (particularly when mapping stuff to pagefile
backed memory).

Haven't poked at this stuff too much recently, and what I most recently
thought might have been a virtual memory bug was actually more related
to other issues (namely, bugs in my C compiler).



IIRC, the way I last had it set up for program loading was:
.text and the stack are in direct-remapped page ranges;
.data/.bss and the heap were in pagefile backed memory.

There were previously intermittent issues if .text or the stack were put
into pagefile backed memory for a reason that was never really determined.


This was also similar to another issue, namely ROTT's demos tend to
desync in unpredictable ways, where pretty much every rebuild and every
run through a given demo could seemingly change the way in which the
desync plays out.

Granted, this overlaps some with the "feature" where my compiler will
randomly shuffle the relative order of everything in the binary every
time the binary is rebuilt. This is partly intended as a sort of passive
security (buffer overflow exploits should be harder to use if everything
keeps getting shuffled).

For things like global sorting, this does mean there is a certain level
of "looseness" in the sorting to prevent the sorting from entirely
defeating the shuffling.
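Say, one generic way to get that looseness (not necessarily what BGBCC does): shuffle first, then stable-sort on a coarsened key, so items with equal coarse keys keep their randomized relative order.

```c
#include <stdlib.h>

/* Coarse sort key: bucket items by size class only (hypothetical). */
static int coarse_key(int size) { return size >> 8; }

/* Fisher-Yates shuffle with an explicit seed. */
static void shuffle(int *a, int n, unsigned seed) {
    srand(seed);
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Stable insertion sort by coarse key: items comparing equal keep
   the (randomized) order the shuffle left them in. */
static void loose_sort(int *a, int n) {
    for (int i = 1; i < n; i++) {
        int v = a[i], j = i - 1;
        while (j >= 0 && coarse_key(a[j]) > coarse_key(v)) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = v;
    }
}
```

The stability of the sort is the important part; an unstable sort (plain `qsort`, say) could impose a deterministic order within each bucket and quietly undo the shuffle.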

But, for whatever reason, shuffling the globals in ROTT appears to cause
issues with demo playback.

Then again, it could be a case of more out-of-bounds array access (the
engine had a lot of this).


>>
>> It is said to have 32 GPRs, realistically it is more like 25.
> <
> Yes, exactly, and its ABI wastes a register unnecessarily, too.

Yep.


>>
>> BJX2, in its 32 GPR variant, effectively has 29 GPRs (R0, R1, and R15
>> being partially excluded).
> <
> Realistically, My 66000 has 30. R0 and R31 (SP) are somewhat special.
> <
> R0 gets the return address without being explicitly specified.
> R0 is a proxy for IP when used as a base register.
> R0 is a proxy for zero when used as an index register.
> Otherwise, R0 is as real as any other register.
> <
> SP is implied in the prologue and epilogue instructions.

At this point, SP is special mainly because it is swapped with SSP in
the interrupt mechanism.


>>
>> Some stuff that RISC-V does in GPR space is mapper to CR space in BJX2.
>>
> My 66000 does not even have control registers (that are program accessible.)


This is another "inherited" feature, though SH had both SR and CR spaces
(Status and Control Registers), both of which programs needed to prod at
(along with some architecturally significant registers being stuck off
in MMIO space).

I at least collapsed all of these down into a single CR space.


Arguably, LR and GBR probably could have been in GPR space, in which
case usermode code would not need to access CR's. Well, except maybe
TBR, but this register is still special in that it effectively needs to
be Read-Only for usermode code.

I was conservative and left them as CRs, for better or worse.


>>
>> R15 is stack pointer, more or less stuck with this, and is still a
>> hold-over from SuperH.
>>
>> R0 and R1, are in a special category of:
>> Can't be used in some instructions, as they encode spacial cases;
>> May be stomped by the assembler without warning if it tries to fake an
>> instruction as a multi-op sequence (though, this is less common now than
>> it was earlier on).
> <
> My 66000 has NO (absolutely none) registers that get stomped by assembler
> or linker or dynamic loader. This is what constants are for and PIC is for.

Early on, this seemed like a necessary evil.

I wanted the compiler and assembler to be able to pretend certain
instructions existed even when they didn't.

Say:
ADD R4, 0x1234, R7
No problem, emits:
MOV 0x1234, R0
ADD R4, R0, R7

Granted, most of these cases now exist, so it is less significant than
it once was.

It was also intended to mimic features from the predecessor ISA which no
longer existed (as actual machine instructions, *), so I needed a way to
express these. Some of this has decayed though.


MOV.L @R4+, R7 //invoke the voodoo magic
MOV.L @(R0, R4), R11 //invoke more magic