Lewis Cole <l_c...@juno.com> writes:
>I'm up to my ass in alligators IRL and so I don't (and likely won't for some time) have a lot of time to read and respond to posts here.
>So I'm going to respond to only Mr. Fuld's posts rather than any others.
>And since I am up to my ass in alligators, I'm going to break up my response to Mr. Fuld's post into two parts so that I can get SOMETHING out Real Soon Now.
>So here is the first part:
>
>On 8/15/2023 11:48 AM, Lewis Cole wrote:
>>> On Tuesday, August 15, 2023 at 12:06:53AM UTC-7, Stephen Fuld wrote:
>>> <snip>
>>>>>> Yeah. Only using half the cache at any one time would seem to decrease
>>>>>> performance. :-)
>>>>>
>>>>> Of course, the smiley face indicates
>>>>> that you are being facetious.
>>
>>No. I wasn't. See below.
>
>I thought you didn't want to argue about caching? ;-)
>Well, hopefully we both argue on pretty much everything and any disagreement is likely due to us not being on the same page WRT our working assumptions, so perhaps this is A Good Thing.
>
>However, I apologize to you for any annoyance I caused you due to my assumptions about your reply.
>
>>>>> But just on the off chance that
>>>>> someone wandering through the group
>>>>> might take you seriously, let me
>>>>> point out that re-purposing half of
>>>>> a cache DOES NOT necessarily reduce
>>>>> performance, and may in fact increase
>>>>> it if the way that the "missing" half
>>>>> is used somehow manages to increase
>>>>> the overall hit rate ...
>>
>> Splitting a size X cache into two size X/2 caches will almost certainly
>> *reduce* hit rate.
>
>There are several factors that influence hit rate; one of them is cache size.
>Others, such as the number of associativity ways, the replacement policy, and the line size, are also obvious influences.
>In addition, there are other factors, such as cycle time, that can make up for a slightly reduced hit rate so that such a cache can still be competitive with a cache that has a slightly higher hit rate.
>
>If two caches are literally identical in every way except for size, then you are *CORRECT* that the hit rate will almost certainly be lower for the smaller cache.
>However, given two caches, one of which just happens to be half the size of the other, it does *NOT* follow that the smaller cache must necessarily have a lower hit rate than the other, as changes to some of the other factors that affect hit rate might just make up for what was lost due to the smaller size.
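>
>Just to put some (completely made up) numbers on the cycle time point, here's a back-of-the-envelope sketch in C; the latencies and hit rates are invented purely for illustration, not measurements of any real machine:
>
>  /* Back-of-the-envelope AMAT (average memory access time) comparison.
>   * All latencies and hit rates below are invented for illustration. */
>  #include <stdio.h>
>
>  static double amat(double hit_time, double hit_rate, double miss_penalty)
>  {
>      /* AMAT = hit time + miss rate * miss penalty */
>      return hit_time + (1.0 - hit_rate) * miss_penalty;
>  }
>
>  int main(void)
>  {
>      double miss_penalty = 100.0;  /* cycles to go to the next level down */
>
>      /* Hypothetical bigger cache: better hit rate, slower access. */
>      double big = amat(4.0, 0.97, miss_penalty);
>
>      /* Hypothetical half-size cache: worse hit rate, faster access. */
>      double small = amat(2.0, 0.95, miss_penalty);
>
>      printf("large, slower cache: %.1f cycles/access\n", big);   /* 7.0 */
>      printf("small, faster cache: %.1f cycles/access\n", small); /* 7.0 */
>      return 0;
>  }
>
>With those particular numbers the smaller, faster cache breaks even with the larger, slower one despite its lower hit rate, which is all I'm claiming here.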
>
>> Think of it this way. The highest hit rate is
>> obtained when the number of most likely to be used blocks are exactly
>> evenly split between the two caches.
>
>Ummm, no. I guess we are going to have an argument over caching after all ....
>
>The highest hit rate is obtained when a cache manages to successfully anticipate, load up into its local storage, and then provide that which the processor needs *BEFORE* the processor actually makes a request to get it from memory. Period.
>This is true regardless of whether or not we're talking about one cache or multiple caches.
>From the point of view of performance, it's what makes caching work.
>(Note that there may be other reasons for a cache, such as to reduce bus traffic, but let's ignore these reasons for the moment.)
>Whether or not, say, an I-cache happens to have the same number of likely-to-be-used blocks as the D-cache is irrelevant.
>They may have the same number. They may not have the same number. I suspect, for reasons that I'll wave my arms at shortly, that they usually don't.
>What matters is whether they have what's needed and can deliver it before the processor actually requests it.
>
>Now if an I-cache is getting lots and lots of hits, then presumably it is likely filled with code loops that are being executed frequently.
>The longer that the processor can continue to execute these loops, the more it will execute them at speeds that approach what it would achieve if main memory were as fast as the cache memory.
>And the more that this happens, the more this speed offsets the much slower speed of the processor when it isn't getting constant hits.
>
>However, running in cached loops doesn't imply much about the data that these loops are accessing.
>They may be marching through long arrays of data or they may be pounding away at a small portion of a data structure, such as the front of a ring buffer. It's all very much application dependent.
>About the only thing that we can infer is that because the code is executing loops, there is at least one instruction (the jump to the top of a loop) which doesn't have/need a corresponding piece of data in the D-cache.
>IOW, there will tend to be one fewer piece of data in the D-cache than in the I-cache.
>Whether or not this translates into equal numbers of cache lines rather than data words just depends.
>(And note I haven't even touched on the effect of writes to the contents of the D-cache.)
>
>So if you think that caches will have their highest hit rate when the "number of most likely to be used blocks are exactly evenly split between the two caches", you're going to have to provide a better argument/evidence to support your reasoning before I will accept this premise, either in the form of a citation or, better yet, a "typical" example.
>
>> That would make the contents of
>> the two half sized caches exactly the same as those of the full sized
>> cache.
>
>No, it wouldn't. See above.
>
>> Conversely, if one of the caches has a different (which means
>> lesser used) block, then its hit rate would be lower.
>
>No, it wouldn't. See above.
>
>> There is no way
>> that splitting the caches would lead to a higher hit rate.
>
>As I waved my arms at before, it is possible if more changes are made than just to its size.
>
>For example, if a cache happens to be a direct mapped cache, then there's only one spot in the cache for a piece of data with a particular index.
>If another piece of data with the same index is requested, then the old piece of data is lost/replaced with the new one.
>This is basic direct mapped cache behavior 101.
>
>OTOH, if a cache happens to be a set associative cache of any way greater than one (i.e. not direct mapped), then the new piece of data can end up in a different spot within the same set for the given index, from which it can be returned if it is not lost/replaced for some other reason.
>This is basic set associative cache behavior 101.
>
>The result is that if the processor has a direct mapped cache and just happens to make alternating accesses to two pieces of data that have the same index, the direct mapped cache will *ALWAYS* take a miss on every access (i.e. have a hit rate of 0%), while the same processor with a set associative cache of any way greater than one will *ALWAYS* take a hit once both pieces have been loaded (i.e. have a hit rate of essentially 100%).
>And note that nowhere in the above description is there any mention of cache size.
>Cache size DOES implicitly affect the likelihood of a collision, and so "typically" a smaller cache will see more collisions, which will cause a direct mapped cache to perform worse than a set associative cache.
>And you can theoretically (although not practically) go one step further by making a cache fully associative, which will eliminate conflict misses entirely.
>In short, there most certainly is "a way" that the hit rate can be higher on a smaller cache than on a larger one, contrary to your claim.
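>
>In case anyone wants to watch the ping-pong happen rather than take my word for it, here's a toy simulation in C; the addresses, the number of sets, and the LRU policy are all illustrative choices on my part, not a model of any real machine:
>
>  /* Alternating accesses to two addresses that map to the same index:
>   * direct mapped vs. 2-way set associative (LRU replacement).
>   * All parameters are invented for illustration. */
>  #include <stdio.h>
>
>  #define NUM_SETS 64
>  #define ACCESSES 1000
>
>  /* Simulate a cache with 'ways' lines per set; return the number of hits. */
>  static int run(int ways, const unsigned long *trace, int n)
>  {
>      unsigned long tags[NUM_SETS][8] = {{0}};
>      int age[NUM_SETS][8] = {{0}};     /* bigger age = older (for LRU) */
>      int valid[NUM_SETS][8] = {{0}};
>      int hits = 0;
>
>      for (int i = 0; i < n; i++) {
>          unsigned long set = trace[i] % NUM_SETS;
>          unsigned long tag = trace[i] / NUM_SETS;
>          int hit = 0, victim = 0;
>
>          for (int w = 0; w < ways; w++) {
>              age[set][w]++;
>              if (valid[set][w] && tags[set][w] == tag) {
>                  hit = 1;
>                  age[set][w] = 0;      /* mark most recently used */
>              }
>          }
>          if (hit) { hits++; continue; }
>
>          for (int w = 1; w < ways; w++)  /* pick an empty or LRU way */
>              if (!valid[set][w] || age[set][w] > age[set][victim])
>                  victim = w;
>          valid[set][victim] = 1;
>          tags[set][victim] = tag;
>          age[set][victim] = 0;
>      }
>      return hits;
>  }
>
>  int main(void)
>  {
>      unsigned long trace[ACCESSES];
>
>      /* Two addresses whose set index (addr % NUM_SETS) is identical. */
>      for (int i = 0; i < ACCESSES; i++)
>          trace[i] = (i & 1) ? 0x1000 : 0x2000;
>
>      printf("direct mapped     : %d/%d hits\n", run(1, trace, ACCESSES), ACCESSES);
>      printf("2-way associative : %d/%d hits\n", run(2, trace, ACCESSES), ACCESSES);
>      return 0;
>  }
>
>The direct mapped version misses on every single access, while the 2-way version misses only on the first two accesses and hits on everything after that.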
>
>> But hit rate
>> isn't the only thing that determines cache/system performance.
>
>Yes. Finally, we agree on something.
>
>>>>> such as
>>>>> replacing a unified cache that's used
>>>>> to store both code and data with a
>>>>> separate i-cache for holding
>>>>> instructions and a separate d-cache
>>>>> for holding data which is _de rigueur_
>>>>> on processor caches these days.
>>
>> Separating I and D caches has other advantages. Specifically, since
>> they have separate (duplicated) hardware logic both for addressing and
>> the actual data storage, the two caches can be accessed simultaneously,
>> which improves performance, as the instruction fetch part of a modern
>> CPU is totally asynchronous with the operand fetch/store part, and they
>> can be overlapped. This ability, to do an instruction fetch from cache
>> simultaneously with handling a load/store is enough to overcome the
>> lower hit rate.
>
>Having a separate I-cache and D-cache may well have other advantages besides increased hit rate.
>And increased concurrency may well be one of them.
>However, my point in mentioning the existence of separate I-caches and D-caches was to point out that, given a sufficiently Good Reason, splitting/replacing a single cache with smaller caches may be A Good Idea.
>Increased concurrency doesn't change that argument in the slightest.
>Simply replace any mention of "increased hit rate" with "increased concurrency" and the result is the same.
>
>If you want to claim that increased concurrency was the *MAIN* reason for the existence of separate I-caches and D-caches, then I await with bated breath for you to present evidence and/or a better argument to show this was the case.
>And if you're wondering why I'm not presenting -- and am not going to present -- any evidence or argument to support my claim that it was due to increased hit rate, that's because we both seem to agree on the basic premise I mentioned before, namely, that given a sufficiently Good Reason, splitting/replacing a single cache with smaller caches may be A Good Idea.
>Any argument that you present strengthens that premise without the need for me to do anything.
>
>I will point out, however, that I think increased concurrency seems like a pretty weak justification.
>Yes, separate caches might well allow for increased concurrency, but you have to find those things that can be done during instruction execution that can be done in parallel.
>And if you manage to find that parallelism, then you need to not only be able to issue separate operations in parallel, you have to make sure that these parallel operations don't interfere with one another, which is to say that your caches remain "coherent" despite doing things like modifying the code stream currently being executed (i.e. self modifying code).
>Given the limited transistor budget In The Early Days, I doubt that dealing with these issues was something that designers were willing to mess with if they didn't have to.
>(The first caches tended to be direct mapped because they were the simplest and therefore the cheapest to implement while also having the fastest access times.
>Set associative caches performed better, but were more complicated and therefore more expensive, as well as having slower access times, and so came later.)
>ISTM that a more plausible reason other than hit rate would be to further reduce bus traffic, which was one of the other big reasons that DEC (IIRC) got into using them In the Beginning.
>
>> Note that this advantage doesn't apply to a
>> user/supervisor separation, as the CPU is in one mode or the other, not
>> both simultaneously.
>
>Bullshit.
>
>Assuming that you have two separate caches that can be kept fed and otherwise operate concurrently, then All You Have To Do to make them both do something at the same time is to generate "select" signals for each so that they know that they should operate at the same time.
>Obviously, a processor knows whether or not it is in user mode or supervisor mode when it performs an instruction fetch, and so it is trivially easy for a processor in either mode to generate the correct "select" signal for an instruction fetch from the correct instruction cache.
>It should be equally obvious that a processor in user mode or supervisor mode knows (or can know) when it is executing an instruction that should operate on data that is in the same mode as the instruction it's executing.
>And it should be obvious that you don't want a user mode instruction to ever be able to access supervisor mode data.
>The only case this leaves to address when it comes to the generation of a "select" signal is when a processor running in supervisor mode wants to do something with user mode code or data.
>
>But generating a "select" signal that will access instructions in a user mode instruction cache or data in a user mode data cache is trivially easy as well, at least conceptually, especially if one is willing to make use of/exploit that which is common practice in OSs these days.
>In particular, since even before the Toy OSs grew up, there has been a fixation with dividing the logical address space into two parts, one part for user code and data and the other part for supervisor code and data.
>When the logical space is divided exactly in half (as was the case for much of the time for 32-bit machines), the result was that the high order bit of the address indicates (and therefore could be used as a select line for) user space versus supervisor space cache access.
>While things have changed a bit since 64-bit machines have become dominant, it is still at least conceptually possible to treat some part of the high order part of a logical address as such an indicator.
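>
>As a purely conceptual sketch of what I mean -- assuming a 32-bit logical address with the supervisor half sitting above 0x80000000; the split point and the names are mine, not any real machine's:
>
>  #include <stdint.h>
>
>  /* Hypothetical select-signal derivation: the high-order address bit
>   * picks the user or supervisor cache bank.  The 2 GB split point is
>   * an assumption made for illustration only. */
>  enum cache_bank { USER_BANK = 0, SUPERVISOR_BANK = 1 };
>
>  static inline enum cache_bank select_bank(uint32_t logical_addr)
>  {
>      /* Bit 31 set => supervisor half of the logical address space. */
>      return (logical_addr & 0x80000000u) ? SUPERVISOR_BANK : USER_BANK;
>  }
>
>Note that the selection depends only on the address, not on the current privilege mode, which is exactly why a processor running in supervisor mode can still reach into the user mode caches when it needs to.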
>
>"But wait ... ," you might be tempted to say, "... something like that does=
>n't work at all on a system like a 2200 ... the Exec has never had the same=
> sort of placement fixation in either absolute or real space that the forme=
>r Toy OSs had/have", which is true.
>But the thing is that the logical address of any accessible word in memory =
>is NOT "U", but rather "(B,U)" (both explicitly in Extended mode and implic=
>itly in Basic Mode) where B is the number of a base register, and each B-re=
>gister contains an access lock field which in turn is made up of a "ring" a=
>nd a "domain".
>Supervisor mode and user mode is all about degrees of trust which is a simp=
>lification of the more general "ring" and "domain" scheme where some collec=
>tion of rings are supposedly for "supervisor" mode and the remaining collec=
>tion are supposedly for "user" mode.
>Whether or not this is actually the way things are used, it is at least con=
>ceptually possible that an address (B,U) can be turned into a supervisor or=
> user mode indicator that can be concatenated with U which can then be sent=
> to the hardware to select a cache and then a particular word within that c=
>ache.
>So once again, we're back to being able to identify supervisor mode code/da=
>ta versus user mode code/data by its address.
>(And yes, I know about the Processor Privilege [PP] flags in the designator=
> register, and reconciling their use with the ring bits might be a problem,=
> but at least conceptually, PP does not -- or at least need not -- matter w=
>hen it comes to selecting a particular cache.)
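>
>Again purely as a conceptual sketch -- the field widths, the ring threshold, and the layout below are invented for illustration and are not the actual 2200 base register/access lock format:
>
>  #include <stdint.h>
>
>  /* Hypothetical (B,U) classification by the ring field of the base
>   * register the access goes through. */
>  struct base_reg {
>      unsigned ring   : 2;    /* access lock "ring" (assumed 2 bits)    */
>      unsigned domain : 16;   /* access lock "domain" (assumed 16 bits) */
>      /* base, limit, etc. omitted */
>  };
>
>  enum cache_bank { USER_BANK = 0, SUPERVISOR_BANK = 1 };
>
>  static enum cache_bank select_bank_bu(const struct base_reg *b, uint32_t u)
>  {
>      /* Treat the lower rings as "supervisor" and the rest as "user";
>       * U would then be concatenated with this bit to pick the word. */
>      (void)u;
>      return (b->ring <= 1) ? SUPERVISOR_BANK : USER_BANK;
>  }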
>
>If you want to say no one in their right mind -- certainly no real live CPU designer -- would think in terms of using some part of an address as a "ring" together with an offset, I would point out to you that this is not the case: a real, live CPU designer *DID* choose to merge security modes with addressing, and the result was a relatively successful computer.
>It was called the Data General Eclipse, and Kidder's book, "Soul of a New Machine", mentions this being done.
>
>What I find ... "interesting" ... here, however, is that you would try to make an argument at all about the possible lack of concurrency WRT a possible supervisor cache.
>As I have indicated before, I assume that any such cache would be basically at the same level as current L3 caches, and it is my understanding that, for the most part, they're not doing any sort of concurrent operations today.
>It seems, therefore, that you're trying to present a strawman by suggesting a disadvantage that doesn't exist at all when compared to existing L3 caches.
>
>>>>>
>>>>> I think it should be clear from the
>>>>> multiple layers of cache these days,
>>>>> each layer being slower but larger
>>>>> than the one above it, that the
>>>>> further you go down (towards memory),
>>>>> the more a given cache is supposed to
>>>>> cache instructions/data that is "high
>>>>> use", but not so much as what's in
>>>>> the cache above it.
>>
>> True for an exclusive cache, but not for an inclusive one.
>
>I don't know what you mean by an "exclusive" cache versus an "inclusive" one.
>Please feel free to elaborate on what you mean.
>In every multi-layered cache in a real live processor chip that I'm aware of, each line in the L1 cache is also represented by a larger line in the L2 cache that contains the L1 line as a subset, and each line in the L2 cache is also represented by a larger line in the L3 cache that contains the L2 line as a subset.
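>
>(For what it's worth, here is that property written as a toy check in C -- the structures and sizes are invented, and I'm ignoring the detail that the lower level lines are larger:)
>
>  #include <stdbool.h>
>  #include <stddef.h>
>
>  #define L1_LINES 512
>  #define L2_LINES 4096
>
>  struct line { bool valid; unsigned long addr; };
>
>  static bool l2_contains(const struct line *l2, unsigned long addr)
>  {
>      for (size_t i = 0; i < L2_LINES; i++)
>          if (l2[i].valid && l2[i].addr == addr)
>              return true;
>      return false;
>  }
>
>  /* True if every valid L1 line is also present in L2, i.e. the L1
>   * contents are a subset of the L2 contents. */
>  static bool is_inclusive(const struct line *l1, const struct line *l2)
>  {
>      for (size_t i = 0; i < L1_LINES; i++)
>          if (l1[i].valid && !l2_contains(l2, l1[i].addr))
>              return false;
>      return true;
>  }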
>
>At this point, I'm going to end my response to Mr. Fuld's post here and go off and do other things before I get back to making a final reply to the remaining part of his post.