Lewis Cole <l_c...@juno.com> writes:
>I'm up to my ass in alligators IRL and so I don't (and likely won't for some time) have a lot of time to read and respond to posts here.
>So I'm going to respond to only Mr. Fuld's posts rather than any others.
>And since I am up to my ass in alligators, I'm going to break up my response to Mr. Fuld's post into two parts so that I can get SOMETHING out Real Soon Now.
>So here is the first part:
>
>On 8/15/2023 11:48 AM, Lewis Cole wrote:
>>> On Tuesday, August 15, 2023 at 12:06:53AM UTC-7, Stephen Fuld wrote:
>>> <snip>
>>>>>> Yeah. Only using half the cache at any one time would seem to decrease
>>>>>> performance. :-)
>>>>>
>>>>> Of course, the smiley face indicates
>>>>> that you are being facetious.
>>
>>No. I wasn't. See below.
>
>I thought you didn't want to argue about caching? ;-)
>Well, hopefully we both argue on pretty much everything and any disagreement is likely due to us not being on the same page WRT our working assumptions, so perhaps this is A Good Thing.
>
>However, I apologize to you for any annoyance I caused you due to my assumptions about your reply.
>
>>>>> But just on the off chance that
>>>>> someone wandering through the group
>>>>> might take you seriously, let me
>>>>> point out that re-purposing half of
>>>>> a cache DOES NOT necessarily reduce
>>>>> performance, and may in fact increase
>>>>> it if the way that the "missing" half
>>>>> is used somehow manages to increase
>>>>> the overall hit rate ...
>>
>> Splitting a size X cache into two size X/2 caches will almost certainly
>> *reduce* hit rate.
>
>There are several factors that influence hit rate; one of them is cache size.
>Others, such as the number of associativity ways, the replacement policy, and the line size, are also obvious influences.
>In addition, there are other factors, such as cycle time, that can make up for a slightly reduced hit rate so that such a cache can still be competitive with a cache that has a slightly higher hit rate.
>
>If two caches are literally identical in every way except for size, then you are *CORRECT* that the hit rate will almost certainly be lower for the smaller cache.
>However, given two caches, one of which just happens to be half the size of the other, it does *NOT* follow that the smaller cache must necessarily have a lower hit rate than the other, as changes to some of the other factors that affect hit rate might just make up for what was lost due to the smaller size.
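>
>Just to put some (completely made up) numbers on the cycle time point, here's a back-of-the-envelope sketch in C; the latencies and hit rates are invented purely for illustration, not measurements of any real machine:
>
>  /* Back-of-the-envelope AMAT (average memory access time) comparison.
>   * All latencies and hit rates below are invented for illustration. */
>  #include <stdio.h>
>
>  static double amat(double hit_time, double hit_rate, double miss_penalty)
>  {
>      /* AMAT = hit time + miss rate * miss penalty */
>      return hit_time + (1.0 - hit_rate) * miss_penalty;
>  }
>
>  int main(void)
>  {
>      double miss_penalty = 100.0;  /* cycles to go to the next level down */
>
>      /* Hypothetical bigger cache: better hit rate, slower access. */
>      double big = amat(4.0, 0.97, miss_penalty);
>
>      /* Hypothetical half-size cache: worse hit rate, faster access. */
>      double small = amat(2.0, 0.95, miss_penalty);
>
>      printf("large, slower cache: %.1f cycles/access\n", big);   /* 7.0 */
>      printf("small, faster cache: %.1f cycles/access\n", small); /* 7.0 */
>      return 0;
>  }
>
>With those particular numbers the smaller, faster cache breaks even with the larger, slower one despite its lower hit rate, which is all I'm claiming here.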
>
>> Think of it this way. The highest hit rate is
>> obtained when the number of most likely to be used blocks are exactly
>> evenly split between the two caches.
>
>Ummm, no. I guess we are going to have an argument over caching after all ....
>
>The highest hit rate is obtained when a cache manages to successfully anticipate, load up into its local storage, and then provide that which the processor needs *BEFORE* the processor actually makes a request to get it from memory. Period.
>This is true regardless of whether or not we're talking about one cache or multiple caches.
>From the point of view of performance, it's what makes caching work.
>(Note that there may be other reasons for a cache, such as to reduce bus traffic, but let's ignore these reasons for the moment.)
>Whether or not, say, an I-cache happens to have the same number of likely-to-be-used blocks as the D-cache is irrelevant.
>They may have the same number. They may not have the same number. I suspect, for reasons that I'll wave my arms at shortly, that they usually don't.
>What matters is whether they have what's needed and can deliver it before the processor actually requests it.
>
>Now if an I-cache is getting lots and lots of hits, then presumably it is likely filled with code loops that are being executed frequently.
>The longer that the processor can continue to execute these loops, the more it will execute them at speeds that approach what it would achieve if main memory were as fast as the cache memory.
>And the more that this happens, the more this speed offsets the much slower speed of the processor when it isn't getting constant hits.
>
>However, running in cached loops doesn't imply much about the data that these loops are accessing.
>They may be marching through long arrays of data or they may be pounding away at a small portion of a data structure, such as the front of a ring buffer. It's all very much application dependent.
>About the only thing that we can infer is that because the code is executing loops, there is at least one instruction (the jump to the top of a loop) which doesn't have/need a corresponding piece of data in the D-cache.
>IOW, there will tend to be one fewer piece of data in the D-cache than in the I-cache.
>Whether or not this translates into equal numbers of cache lines rather than data words just depends.
>(And note I haven't even touched on the effect of writes to the contents of the D-cache.)
>
>So if you think that caches will have their highest hit rate when the "number of most likely to be used blocks are exactly evenly split between the two caches", you're going to have to provide a better argument/evidence to support your reasoning before I will accept this premise, either in the form of a citation or, better yet, a "typical" example.
>
>> That would make the contents of
>> the two half sized caches exactly the same as those of the full sized
>> cache.
>
>No, it wouldn't. See above.
>
>> Conversely, if one of the caches has a different (which means
>> lesser used) block, then its hit rate would be lower.
>
>No, it wouldn't. See above.
>
>> There is no way
>> that splitting the caches would lead to a higher hit rate.
>
>As I waved my arms at before, it is possible if more changes are made than just to its size.
>
>For example, if a cache happens to be a direct mapped cache, then there's only one spot in the cache for a piece of data with a particular index.
>If another piece of data with the same index is requested, then the old piece of data is lost/replaced with the new one.
>This is basic direct mapped cache behavior 101.
>
>OTOH, if a cache happens to be a set associative cache of any way greater than one (i.e. not direct mapped), then the new piece of data can end up in a different spot within the same set for the given index, from which it can be returned if it is not lost/replaced for some other reason.
>This is basic set associative cache behavior 101.
>
>The result is that if the processor has a direct mapped cache and just happens to make alternating accesses to two pieces of data that have the same index, the direct mapped cache will *ALWAYS* take a miss on every access (i.e. have a hit rate of 0%), while the same processor with a set associative cache of any way greater than one will *ALWAYS* take a hit once both pieces have been loaded (i.e. have a hit rate of essentially 100%).
>And note that nowhere in the above description is there any mention of cache size.
>Cache size DOES implicitly affect the likelihood of a collision, and so "typically" a smaller cache will see more collisions, which will cause a direct mapped cache to perform worse than a set associative cache.
>And you can theoretically (although not practically) go one step further by making a cache fully associative, which will eliminate conflict misses entirely.
>In short, there most certainly is "a way" that the hit rate can be higher on a smaller cache than on a larger one, contrary to your claim.
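>
>In case anyone wants to watch the ping-pong happen rather than take my word for it, here's a toy simulation in C; the addresses, the number of sets, and the LRU policy are all illustrative choices on my part, not a model of any real machine:
>
>  /* Alternating accesses to two addresses that map to the same index:
>   * direct mapped vs. 2-way set associative (LRU replacement).
>   * All parameters are invented for illustration. */
>  #include <stdio.h>
>
>  #define NUM_SETS 64
>  #define ACCESSES 1000
>
>  /* Simulate a cache with 'ways' lines per set; return the number of hits. */
>  static int run(int ways, const unsigned long *trace, int n)
>  {
>      unsigned long tags[NUM_SETS][8] = {{0}};
>      int age[NUM_SETS][8] = {{0}};     /* bigger age = older (for LRU) */
>      int valid[NUM_SETS][8] = {{0}};
>      int hits = 0;
>
>      for (int i = 0; i < n; i++) {
>          unsigned long set = trace[i] % NUM_SETS;
>          unsigned long tag = trace[i] / NUM_SETS;
>          int hit = 0, victim = 0;
>
>          for (int w = 0; w < ways; w++) {
>              age[set][w]++;
>              if (valid[set][w] && tags[set][w] == tag) {
>                  hit = 1;
>                  age[set][w] = 0;      /* mark most recently used */
>              }
>          }
>          if (hit) { hits++; continue; }
>
>          for (int w = 1; w < ways; w++)  /* pick an empty or LRU way */
>              if (!valid[set][w] || age[set][w] > age[set][victim])
>                  victim = w;
>          valid[set][victim] = 1;
>          tags[set][victim] = tag;
>          age[set][victim] = 0;
>      }
>      return hits;
>  }
>
>  int main(void)
>  {
>      unsigned long trace[ACCESSES];
>
>      /* Two addresses whose set index (addr % NUM_SETS) is identical. */
>      for (int i = 0; i < ACCESSES; i++)
>          trace[i] = (i & 1) ? 0x1000 : 0x2000;
>
>      printf("direct mapped     : %d/%d hits\n", run(1, trace, ACCESSES), ACCESSES);
>      printf("2-way associative : %d/%d hits\n", run(2, trace, ACCESSES), ACCESSES);
>      return 0;
>  }
>
>The direct mapped version misses on every single access, while the 2-way version misses only on the first two accesses and hits on everything after that.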
>
>> But hit rate
>> isn't the only thing that determines cache/system performance.
>
>Yes. Finally, we agree on something.
>
>>>>> such as
>>>>> replacing a unified cache that's used
>>>>> to store both code and data with a
>>>>> separate i-cache for holding
>>>>> instructions and a separate d-cache
>>>>> for holding data which is _de rigueur_
>>>>> on processor caches these days.
>>
>> Separating I and D caches has other advantages. Specifically, since
>> they have separate (duplicated) hardware logic both for addressing and
>> the actual data storage, the two caches can be accessed simultaneously,
>> which improves performance, as the instruction fetch part of a modern
>> CPU is totally asynchronous with the operand fetch/store part, and they
>> can be overlapped. This ability, to do an instruction fetch from cache
>> simultaneously with handling a load/store is enough to overcome the
>> lower hit rate.
>
>Having a separate I-cache and D-cache may well have other advantages besides increased hit rate.
>And increased concurrency may well be one of them.
>However, my point in mentioning the existence of separate I-caches and D-caches was to point out that, given a sufficiently Good Reason, splitting/replacing a single cache with smaller caches may be A Good Idea.
>Increased concurrency doesn't change that argument in the slightest.
>Simply replace any mention of "increased hit rate" with "increased concurrency" and the result is the same.
>
>If you want to claim that increased concurrency was the *MAIN* reason for the existence of separate I-caches and D-caches, then I await with bated breath for you to present evidence and/or a better argument to show this was the case.
>And if you're wondering why I'm not presenting -- and am not going to present -- any evidence or argument to support my claim that it was due to increased hit rate, that's because we both seem to agree on the basic premise I mentioned before, namely, that given a sufficiently Good Reason, splitting/replacing a single cache with smaller caches may be A Good Idea.
>Any argument that you present strengthens that premise without the need for me to do anything.
>
>I will point out, however, that I think increased concurrency seems like a pretty weak justification.
>Yes, separate caches might well allow for increased concurrency, but you have to find those things that can be done during instruction execution that can be done in parallel.
>And if you manage to find that parallelism, then you need to not only be able to issue separate operations in parallel, you have to make sure that these parallel operations don't interfere with one another, which is to say that your caches remain "coherent" despite doing things like modifying the code stream currently being executed (i.e. self modifying code).
>Given the limited transistor budget In The Early Days, I doubt that dealing with these issues was something that designers were willing to mess with if they didn't have to.
>(The first caches tended to be direct mapped because they were the simplest and therefore the cheapest to implement while also having the fastest access times.
>Set associative caches performed better, but were more complicated and therefore more expensive, as well as having slower access times, and so came later.)
>ISTM that a more plausible reason other than hit rate would be to further reduce bus traffic, which was one of the other big reasons that DEC (IIRC) got into using them In the Beginning.
>
>> Note that this advantage doesn't apply to a
>> user/supervisor separation, as the CPU is in one mode or the other, not
>> both simultaneously.
>
>Bullshit.
>
>Assuming that you have two separate caches that can be kept fed and otherwise operate concurrently, then All You Have To Do to make them both do something at the same time is to generate "select" signals for each so that they know that they should operate at the same time.
>Obviously, a processor knows whether or not it is in user mode or supervisor mode when it performs an instruction fetch, and so it is trivially easy for a processor in either mode to generate the correct "select" signal for an instruction fetch from the correct instruction cache.
>It should be equally obvious that a processor in user mode or supervisor mode knows (or can know) when it is executing an instruction that should operate on data that is in the same mode as the instruction it's executing.
>And it should be obvious that you don't want a user mode instruction to ever be able to access supervisor mode data.
>The only case this leaves to address when it comes to the generation of a "select" signal is when a processor running in supervisor mode wants to do something with user mode code or data.
>
>But generating a "select" signal that will access instructions in a user mode instruction cache or data in a user mode data cache is trivially easy as well, at least conceptually, especially if one is willing to make use of/exploit that which is common practice in OSs these days.
>In particular, since even before the Toy OSs grew up, there has been a fixation with dividing the logical address space into two parts, one part for user code and data and the other part for supervisor code and data.
>When the logical space is divided exactly in half (as was the case for much of the time for 32-bit machines), the result was that the high order bit of the address indicates (and therefore could be used as a select line for) user space versus supervisor space cache access.
>While things have changed a bit since 64-bit machines have become dominant, it is still at least conceptually possible to treat some part of the high order part of a logical address as such an indicator.
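>
>As a purely conceptual sketch of what I mean -- assuming a 32-bit logical address with the supervisor half sitting above 0x80000000; the split point and the names are mine, not any real machine's:
>
>  #include <stdint.h>
>
>  /* Hypothetical select-signal derivation: the high-order address bit
>   * picks the user or supervisor cache bank.  The 2 GB split point is
>   * an assumption made for illustration only. */
>  enum cache_bank { USER_BANK = 0, SUPERVISOR_BANK = 1 };
>
>  static inline enum cache_bank select_bank(uint32_t logical_addr)
>  {
>      /* Bit 31 set => supervisor half of the logical address space. */
>      return (logical_addr & 0x80000000u) ? SUPERVISOR_BANK : USER_BANK;
>  }
>
>Note that the selection depends only on the address, not on the current privilege mode, which is exactly why a processor running in supervisor mode can still reach into the user mode caches when it needs to.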
>
>"But wait ... ," you might be tempted to say, "... something like that does=
>n't work at all on a system like a 2200 ... the Exec has never had the same=
> sort of placement fixation in either absolute or real space that the forme=
>r Toy OSs had/have", which is true.
>But the thing is that the logical address of any accessible word in memory =
>is NOT "U", but rather "(B,U)" (both explicitly in Extended mode and implic=
>itly in Basic Mode) where B is the number of a base register, and each B-re=
>gister contains an access lock field which in turn is made up of a "ring" a=
>nd a "domain".
>Supervisor mode and user mode is all about degrees of trust which is a simp=
>lification of the more general "ring" and "domain" scheme where some collec=
>tion of rings are supposedly for "supervisor" mode and the remaining collec=
>tion are supposedly for "user" mode.
>Whether or not this is actually the way things are used, it is at least con=
>ceptually possible that an address (B,U) can be turned into a supervisor or=
> user mode indicator that can be concatenated with U which can then be sent=
> to the hardware to select a cache and then a particular word within that c=
>ache.
>So once again, we're back to being able to identify supervisor mode code/da=
>ta versus user mode code/data by its address.
>(And yes, I know about the Processor Privilege [PP] flags in the designator=
> register, and reconciling their use with the ring bits might be a problem,=
> but at least conceptually, PP does not -- or at least need not -- matter w=
>hen it comes to selecting a particular cache.)
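>
>Again purely as a conceptual sketch -- the field widths, the ring threshold, and the layout below are invented for illustration and are not the actual 2200 base register/access lock format:
>
>  #include <stdint.h>
>
>  /* Hypothetical (B,U) classification by the ring field of the base
>   * register the access goes through. */
>  struct base_reg {
>      unsigned ring   : 2;    /* access lock "ring" (assumed 2 bits)    */
>      unsigned domain : 16;   /* access lock "domain" (assumed 16 bits) */
>      /* base, limit, etc. omitted */
>  };
>
>  enum cache_bank { USER_BANK = 0, SUPERVISOR_BANK = 1 };
>
>  static enum cache_bank select_bank_bu(const struct base_reg *b, uint32_t u)
>  {
>      /* Treat the lower rings as "supervisor" and the rest as "user";
>       * U would then be concatenated with this bit to pick the word. */
>      (void)u;
>      return (b->ring <= 1) ? SUPERVISOR_BANK : USER_BANK;
>  }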
>
>If you want to say no one in their right mind -- certainly no real live CPU designer -- would think in terms of using some part of an address as a "ring" together with an offset, I would point out to you that this is not the case: a real, live CPU designer *DID* choose to merge security modes with addressing, and the result was a relatively successful computer.
>It was called the Data General Eclipse, and Kidder's book, "Soul of a New Machine", mentions this being done.
>
>What I find ... "interesting" ... here, however, is that you would try to make an argument at all about the possible lack of concurrency WRT a possible supervisor cache.
>As I have indicated before, I assume that any such cache would be basically at the same level as current L3 caches, and it is my understanding that, for the most part, they're not doing any sort of concurrent operations today.
>It seems, therefore, that you're trying to present a strawman by suggesting a disadvantage that doesn't exist at all when compared to existing L3 caches.
>
>>>>>
>>>>> I think it should be clear from the
>>>>> multiple layers of cache these days,
>>>>> each layer being slower but larger
>>>>> than the one above it, that the
>>>>> further you go down (towards memory),
>>>>> the more a given cache is supposed to
>>>>> cache instructions/data that is "high
>>>>> use", but not so much as what's in
>>>>> the cache above it.
>>
>> True for an exclusive cache, but not for an inclusive one.
>
>I don't know what you mean by an "exclusive" cache versus an "inclusive" one.
>Please feel free to elaborate on what you mean.
>In every multi-layered cache in a real live processor chip that I'm aware of, each line in the L1 cache is also represented by a larger line in the L2 cache that contains the L1 line as a subset, and each line in the L2 cache is also represented by a larger line in the L3 cache that contains the L2 line as a subset.
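>
>(For what it's worth, here is that property written as a toy check in C -- the structures and sizes are invented, and I'm ignoring the detail that the lower level lines are larger:)
>
>  #include <stdbool.h>
>  #include <stddef.h>
>
>  #define L1_LINES 512
>  #define L2_LINES 4096
>
>  struct line { bool valid; unsigned long addr; };
>
>  static bool l2_contains(const struct line *l2, unsigned long addr)
>  {
>      for (size_t i = 0; i < L2_LINES; i++)
>          if (l2[i].valid && l2[i].addr == addr)
>              return true;
>      return false;
>  }
>
>  /* True if every valid L1 line is also present in L2, i.e. the L1
>   * contents are a subset of the L2 contents. */
>  static bool is_inclusive(const struct line *l1, const struct line *l2)
>  {
>      for (size_t i = 0; i < L1_LINES; i++)
>          if (l1[i].valid && !l2_contains(l2, l1[i].addr))
>              return false;
>      return true;
>  }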
>
>At this point, I'm going to end my response to Mr. Fuld's post here and go off and do other things before I get back to making a final reply to the remaining part of his post.