
Versatile Cache from IBM


Quadibloc

Sep 6, 2021, 1:22:46 AM

Just read this:

https://www.anandtech.com/show/16924/did-ibm-just-preview-the-future-of-caches

IBM's new Telum chip, for its next generation of Z-series mainframes,
has 8 processors per die, each one with 32 megabytes of L2 cache.

There is no L3 or L4 cache... in the usual sense.

However, processors on the same die may make some of their L2
cache available as L3 cache for other processors on the die, and
processors on other dies on the same multi-chip module may even
provide some of their L2 cache as L4 cache for processors on other
dies.
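
A quick back-of-the-envelope for the pooled capacity (the two-dies-per-module
figure below is my assumption, not something the article states):

/* Combined capacity of the shared L2s ("virtual" L3/L4).
   Per-core L2 size and cores per die are from the article;
   dies per module is an assumption. */
#include <stdio.h>

int main(void)
{
    int l2_mb_per_core  = 32;   /* from the article */
    int cores_per_die   = 8;    /* from the article */
    int dies_per_module = 2;    /* assumption */

    int virtual_l3 = l2_mb_per_core * cores_per_die;    /* per die    */
    int virtual_l4 = virtual_l3 * dies_per_module;       /* per module */

    printf("virtual L3 per die: %d MB\n", virtual_l3);      /* 256 */
    printf("virtual L4 per module: %d MB\n", virtual_l4);   /* 512 */
    return 0;
}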

John Savard

Terje Mathisen

Sep 6, 2021, 2:34:15 AM

Isn't this how many/most multi-socket machines work?

All the CPUs look at memory traffic from other chips, and update their
own cache state correspondingly, flushing dirty lines to RAM or a common
L3/L4. Sending an updated line directly to the requesting core is a
common optimization.
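
Roughly, per cache, the snoop side amounts to something like this (a generic
MESI-style sketch, nothing vendor specific):

/* Generic snoop handling, roughly MESI-style (illustrative only). */
enum state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

struct line {
    unsigned long tag;
    enum state    st;
    /* data payload omitted */
};

/* This cache observes another chip's read of 'tag'. */
void snoop_read(struct line *l, unsigned long tag)
{
    if (l->st == INVALID || l->tag != tag)
        return;
    if (l->st == MODIFIED) {
        /* supply the dirty line to the requester and/or flush it
           to RAM / a shared L3, then keep only a shared copy */
    }
    l->st = SHARED;
}

/* This cache observes another chip's write (read-for-ownership). */
void snoop_write(struct line *l, unsigned long tag)
{
    if (l->st != INVALID && l->tag == tag)
        l->st = INVALID;   /* our copy is now stale */
}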

If the IBM idea is an OS-level feature where one CPU can reserve some of
its own L2 cache space for another CPU, presumably to serve as a victim
cache, then that is new I guess?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP

Sep 6, 2021, 8:25:48 AM

The article didn't say anything about reserving space - it sounded dynamic.
They eliminated L3 & L4, and allow a remote node's L2 to be used
as another node's victim cache.

Where a traditional coherence protocol only allows a core to
pull a line from a remote node's cache, this one additionally
allows it to push a line into a remote node's cache.
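
In code form the eviction path gains an extra branch, roughly (a sketch of
the idea only; every name here is invented, this is not IBM's mechanism):

/* Sketch of the "push" idea: on eviction a node tries to lend the line
   to a peer L2 with spare capacity instead of writing it to DRAM. */
#include <stdio.h>
#include <stdbool.h>

struct cache_line { unsigned long addr; bool dirty; };

static bool peer_has_room = true;   /* stand-in for real bookkeeping */

static void install_in_peer_l2(struct cache_line *l)
{
    printf("pushed %#lx into a peer L2 (a later 'virtual L3' hit, maybe)\n",
           l->addr);
}

static void write_back_to_dram(struct cache_line *l)
{
    printf("wrote %#lx back to DRAM\n", l->addr);
}

static void evict(struct cache_line *l)
{
    if (peer_has_room)
        install_in_peer_l2(l);   /* the new "push" path */
    else
        write_back_to_dram(l);   /* the traditional path */
}

int main(void)
{
    struct cache_line victim = { 0x1000, true };
    evict(&victim);
    return 0;
}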

If you have idle cores and the remote L2 would otherwise be unused,
this is a win.

But it makes matching application working set to cache size a bit fuzzy.

I can also see the potential for system wide thrashing if the total
working set size exceeds capacity, similar to old Unix systems with
their global working set management.

Anton Ertl

Sep 6, 2021, 9:43:08 AM

Unfortunately, both the Hot Chips slides and the article (that is
supposed to be written after additional Q&A with the IBM guys;
apparently the IBM guys were not allowed to reveal anything more) miss
pretty much all the interesting details. Basically, the only thing we
learn is that other CPUs' L2 caches can be used as victim caches.

Questions that come to my mind:

How does a line become "unused" and available for other CPUs? EricP
suggests that cache lines on idle cores are unused, but a currently
idle core could continue on the same task the very next microsecond,
so that does not sound like a good strategy. Cache lines belonging to
no-longer-allocated ASIDs might be a better approach; it probably
requires some scrubbing (probably in hardware), though. Or maybe a
long-unaccessed cache line that is still accessible could also be
categorized as "unused".
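
A donation test might combine those heuristics roughly like this (pure
speculation on my part; every field name and threshold is invented):

/* Speculative "is this line donatable?" check. */
#include <stdbool.h>
#include <stdint.h>

struct l2_line {
    uint16_t asid;          /* address-space ID the line belongs to */
    uint64_t last_access;   /* cycle count of the last access */
};

/* stand-in: a real machine would consult its ASID/translation state */
static bool asid_still_allocated(uint16_t asid) { (void)asid; return true; }

bool donatable(const struct l2_line *l, uint64_t now)
{
    const uint64_t too_old = 1ull << 20;   /* arbitrary age threshold, cycles */

    if (!asid_still_allocated(l->asid))
        return true;                       /* owning address space is gone */
    if (now - l->last_access > too_old)
        return true;                       /* long unaccessed */
    return false;                          /* keep it for the local core */
}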

Does this system employ a directory or snooping? With hundreds of L2
caches, snooping seems inefficient, but what do I know. A directory
might be based on the memory controller for the physical RAM that
holds the memory if not in cache; i.e., send the request to the memory
controller, and it will send it on to RAM or the appropriate L2 cache
based on the directory.
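
I.e., something along these lines (a toy sketch of a generic home-node
directory, not anything IBM has described):

/* Home-node directory lookup: the memory controller that owns a physical
   address decides whether to answer from DRAM or forward the request to
   whichever L2 currently holds the line.  All structure here is invented. */
#include <stdio.h>

#define NO_OWNER (-1)

struct dir_entry { int owner_l2; };     /* -1 = only DRAM has the line */

static struct dir_entry directory[16];  /* toy: one entry per "line" */

static void handle_request(unsigned line, int requester)
{
    struct dir_entry *e = &directory[line % 16];

    if (e->owner_l2 == NO_OWNER)
        printf("line %u: serve from DRAM, owner becomes L2 %d\n",
               line, requester);
    else
        printf("line %u: forward to L2 %d, then hand over to L2 %d\n",
               line, e->owner_l2, requester);

    e->owner_l2 = requester;
}

int main(void)
{
    for (int i = 0; i < 16; i++)
        directory[i].owner_l2 = NO_OWNER;

    handle_request(5, 0);   /* miss: comes from DRAM */
    handle_request(5, 3);   /* hit in L2 0: forwarded */
    return 0;
}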

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Stefan Monnier

Sep 6, 2021, 10:27:52 AM

> How does a line become "unused" and available for other CPUs?

Is there any reason to think this isn't just decided by the OS?


Stefan

Quadibloc

Sep 6, 2021, 11:04:25 AM

On Monday, September 6, 2021 at 12:34:15 AM UTC-6, Terje Mathisen wrote:

> Isn't this how many/most multi-socket machines work?
>
> All the CPUs look at memory traffic from other chips, and update their
> own cache state correspondingly, flushing dirty lines to RAM or a common
> L3/L4.

That's just maintaining cache coherency. Actually using the caches of other
processors as a higher-level cache is something different.

John Savard

John Levine

Sep 6, 2021, 2:49:07 PM

According to Stefan Monnier <mon...@iro.umontreal.ca>:
>> How does a line become "unused" and available for other CPUs?
>
>Is there any reason to think this isn't just decided by the OS?

Seems unlikely. IBM systems run a lot of legacy code including old systems in
virtual machines and if this is going to be useful it won't depend on OS upgrades.

I suppose the physical OS knows when it's reclaimed a page frame but that seems
rather coarse granularity for a cache flush.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Quadibloc

Sep 6, 2021, 4:20:48 PM

On Monday, September 6, 2021 at 12:49:07 PM UTC-6, John Levine wrote:

> Seems unlikely. IBM systems run a lot of legacy code including old systems in
> virtual machines and if this is going to be useful it won't depend on OS upgrades.

It's not clear to me why an upgraded OS conflicts with legacy applications.

I mean, one could even have the cache allocated at the hypervisor level if you wanted
to run each legacy application under its own legacy operating system.

John Savard

Terje Mathisen

Sep 6, 2021, 4:27:59 PM

Which was in my next (snipped) paragraph. Why didn't you respond to that
part instead?

John Dallman

Sep 6, 2021, 4:46:03 PM

In article <42fd1c31-424a-4ec8...@googlegroups.com>,
jsa...@ecn.ab.ca (Quadibloc) wrote:

> It's not clear to me why an upgraded OS conflicts with legacy
> applications.

Some aspects of those operating systems are surprisingly primitive. As in
data structures expected to be accessed at absolute addresses, for
example. There are some very baroque sets of registers for remapping
memory in weird ways.

The "Principles of Operation" manual is free, available here:
<https://www.ibm.com/support/pages/zarchitecture-principles-operation>

I read some sections of it recently, wondering what the programming
environment was like. It's a weird mixture of primitive and sophisticated.
My conclusion was "Stick to Linux, if I ever have to deal with this
platform."

John

Anne & Lynn Wheeler

Sep 6, 2021, 5:21:37 PM

j...@cix.co.uk (John Dallman) writes:
> The "Principles of Operation" manual is free, available here:
> <https://www.ibm.com/support/pages/zarchitecture-principles-operation>
>
> I read some sections of it recently, wondering what the programming
> environment was like. It's a weird mixture of primitive and sophisticated.
> My conclusion was "Stick to Linux, if I ever have to deal with this
> platform."

access registers (multiple concurrent, active address spaces) & program
call started out as part of 811 (for 370/xa arch. documents dated
nov1978) for MVS. OS/360 was heavily pointer passing API. MVT (real
storage) was initially mapped to a single 16mbyte virtual address space
for VS2/R1 ... then for MVS, VS2/R2 each application and subsystem was
given its own 16mbyte virtual address space. However, for the pointer
passing APIs, they mapped an 8mbyte image of the kernel into every
application virtual address space (leaving 8mbytes for application).
Then for subsystem calls (each in its own private 16mbyte virtual
address space), they created the common segment (1mbyte) mapped into
every 16mbyte virtual address space; parameter list/return storage can
be allocated in the common segment area and the pointer passed to the
called subsystem.

Common area size requirement is somewhat proportional to number of
concurrent applications and number of subsystems ... by 3033 (before
370/xa & 31bit addressing, still 24bit/16mbytes) installations were
requiring 5-6 mbytes for the common area (renamed from common segment
area, CSA, to common system area, CSA) leaving 2-3mbytes for
applications ... but threatening to increase to 8mbytes ... leaving zero
bytes for applications.
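
spelled out, the squeeze on the 24bit address space is just:

/* restating the 24bit squeeze described above */
#include <stdio.h>

int main(void)
{
    int address_space_mb = 16;   /* 24bit addressing */
    int kernel_image_mb  = 8;    /* mapped into every address space */
    int csa_mb           = 6;    /* high end of the 5-6 mbyte figure */

    printf("left for the application: %d mbytes\n",
           address_space_mb - kernel_image_mb - csa_mb);     /* 2 */
    printf("if CSA grows to 8 mbytes: %d mbytes\n",
           address_space_mb - kernel_image_mb - 8);          /* 0 */
    return 0;
}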

Come 3081, there is both 370 mode and 370/xa mode ... however customers
weren't migrating to MVS/XA as expected, which was putting increasing
pressure on MVS operation as environments scaled up with more
concurrent applications executing and more running subsystems.

With 370/xa, program call and access registers ... there is a privileged
("MVS") kernel table of subsystems and address space pointers.
Subsystem calls reference a specific entry in that table and the
hardware moves the caller's address space pointer into secondary (access
register) and loads the subsystem address space pointer as primary and
transfers to the subsystem. The subsystem can now directly access the
caller's parameter list in its secondary address space (eliminating the
enormous pressure on the CSA as well as kernel call software overhead
for the switch). The return instruction restores the application's
address space to primary and returns to the caller.

--
virtualization experience starting Jan1968, online at home since Mar1970

Stephen Fuld

Sep 6, 2021, 6:11:10 PM

You are not being exactly fair. First of all, what you see in the POO
is a lot of the "cruft" remaining from 60 years of development history,
always maintaining backward compatibility. Second, you are comparing
hardware as shown in the manual versus software (Linux).


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Al Grant

Sep 7, 2021, 4:10:33 AM

On Monday, September 6, 2021 at 9:46:03 PM UTC+1, John Dallman wrote:
> I read some sections of it recently, wondering what the programming
> environment was like. It's a weird mixture of primitive and sophisticated.
> My conclusion was "Stick to Linux, if I ever have to deal with this
> platform."

The programming environment is very usable. It's true that
some data is located in the 4K block located at 0x0, but that
page is more like an extended set of system registers
(the sort that on other architectures are accessed using
special instructions) - it's mapped to a different physical
address per CPU. Almost everything else is accessed via
pointer chaining - it's like Linux, but with documentation.

Linux runs on z/Series, and if new features need OS support,
they might show up in the upstream kernel.

I wrote a lot of 370 assembler in the 90s and read a lot
more. Code looked pretty much like it did in the 1960s.
Looking at the output of GCC for z/Series recently, I barely
recognized it. There are a lot of new instructions - and this
is the regular userspace ISA (or "problem state"). So the
userspace ISA has changed far more in its last 30 years
than in its first 30 years. Something must be using those
new instructions! The compilers and the major applications
(CICS and DB2 etc.) will be, at least. So, they may have
added support for the new cache features. My guess is
it would involve the "Next Instruction Access Intent"
instruction, which gives hints on cache placement for
whatever data is touched by the next instruction.
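
For example, roughly (illustrative only: NIAI itself is real, but the
intent codes below are placeholders I haven't checked against the POO):

/* Hypothetical use of NIAI from C via GCC inline asm on s390x.
   The intent codes (1,0) are placeholders - check the Principles of
   Operation for the actual encodings.  Needs a -march level that has
   the instruction (zEC12 or later, I believe). */
static inline long load_with_hint(const long *p)
{
    /* state an access intent for the next instruction's storage operand */
    __asm__ volatile("niai 1,0" ::: "memory");
    return *p;
}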

There may be a lot of legacy assembler code, but it was
written to run in small footprints and on slow CPUs, so
performance on modern machines is less of an issue.

John Dallman

Sep 7, 2021, 4:49:39 AM

In article <sh63lq$v79$1...@dont-email.me>, sf...@alumni.cmu.edu.invalid
(Stephen Fuld) wrote:

> You are not being exactly fair. First of all, what you see in the
> POO is a lot of the "cruft" remaining from 60 years of development
> history, always maintaining backward compatibility. Second, you
> are comparing hardware as shown in the manual versus software
> (Linux).

Indeed. Part of an operating system's job is to hide hardware complexity
from the application programmer. However, since the traditional OSes for
Z architecture have grown with the hardware design, they seem to conceal
rather less than Linux.

John

Thomas Koenig

Sep 7, 2021, 6:08:38 AM

Al Grant <algra...@gmail.com> schrieb:

> I wrote a lot of 370 assembler in the 90s and read a lot
> more. Code looked pretty much like it did in the 1960s.
> Looking at the output of GCC for z/Series recently, I barely
> recognized it. There are a lot of new instructions - and this
> is the regular userspace ISA (or "problem state"). So the
> userspace ISA has changed far more in its last 30 years
> than in its first 30 years. Something must be using those
> new instructions!

The main change was the switch from 31 to 64 bits, with
new registers, new instructions to access them, new whatnot.

It is not surprising that you recognize few instructions.

Stephen Fuld

Sep 7, 2021, 12:00:58 PM

That's true, but I think the big difference is that Linux, indeed Unix,
was from the start written in a higher level language, i.e. not
assembler. Thus, while it was not the initial intent, it was fairly
easy to port to new architectures. The IBM OSs, in contrast, were
written entirely in assembler, so were, pretty much by definition,
hardware specific. By the time most of them had been rewritten in a
higher level language (presumably to aid development and maintenance) it
was too late to consider them to be portable.

Thus the Unix specific model of an OS, and Linux in particular, became
the lingua franca of a programming environment. You may not realize how
much the decisions of the original Unix developers pervade what we
think of as an OS, but there were many other decisions that they could
have made that would make things different. That those alternatives seem
odd to many people illustrates my point.

John Dallman

Sep 7, 2021, 12:35:52 PM

In article <sh82bo$86n$1...@dont-email.me>, sf...@alumni.cmu.edu.invalid
(Stephen Fuld) wrote:

> Thus the Unix specific model of an OS, and Linux in particular,
> became the lingua franca of a programming environment. You may not
> realize how much the decisions of the original Unix developers
> pervade what we think of as an OS, but there were many other
> decisions that they could have made that would make things
> different.

I've done enough VMS, MacOS Classic and CP/M to have some idea. I've also
seen the Unix ideas pushed too far, in HeliOS, which really wasn't
practical. The Unix abstractions are not perfect, but they are pretty
good.

John

Stephen Fuld

Sep 7, 2021, 12:58:02 PM

On 9/7/2021 9:34 AM, John Dallman wrote:
> In article <sh82bo$86n$1...@dont-email.me>, sf...@alumni.cmu.edu.invalid
> (Stephen Fuld) wrote:
>
>> Thus the Unix specific model of an OS, and Linux in particular,
>> became the lingua franca of a programming environment. You may not
>> realize how much the decisions of the original Unix developers
>> pervade what we think of as an OS, but there were many other
>> decisions that they could have made that would make things
>> different.
>
> I've done enough VMS, MacOS Classic and CP/M to have some idea.

OK, but those are relatively recent compared to mainframe OSs such as
OS/360. The differences are even bigger.

> I've also
> seen the Unix ideas pushed too far, in HeliOS, which really wasn't
> practical.

While I have no knowledge of that, I certainly can believe you.


> The Unix abstractions are not perfect, but they are pretty
> good.

Agreed. I think the biggest omission in original Unix (since addressed)
was native threads. Other people probably have other things.

Stephen Fuld

Sep 7, 2021, 1:32:44 PM

On 9/6/2021 6:28 AM, Anton Ertl wrote:

snip


> Unfortunately, both the Hot Chips slides and the article (that is
> supposed to be written after additional Q&A with the IBM guys;
> apparently the IBM guys were not allowed to reveal anything more) miss
> pretty much all the interesting details. Basically, the only thing we
> learn is that other CPUs' L2 caches can be used as victim caches.

IBM is masterful at giving you lots of information, giving the
impression that they are very open, but not actually giving you the
details you want. They have been doing that for decades! :-(


> Questions that come to my mind:
>
> How does a line become "unused" and available for other CPUs? EricP
> suggests that cache lines on idle cores are unused, but a currently
> idle core could continue on the same task the very next microsecond,
> so that does not sound like a good strategy. Cache lines belonging to
> no-longer-allocated ASIDs might be a better approach; it probably
> requires some scrubbing (probably in hardware), though. Or maybe a
> long-unaccessed cache line that is still accessible could also be
> categorized as "unused".
>
> Does this system employ a directory or snooping. With hundreds of L2
> caches, snooping seems inefficient, but what do I know. A directory
> might be based on the memory controller for the physical RAM that
> holds the memory if not in cache; i.e., send the request to the memory
> controller, and it will send it on to RAM or the appropriate L2 cache
> based on the directory.

You and I agree that we don't know. However, perhaps we can take some
hints from the following quotation from the article:

> In the Q&A following the session, Dr. Christian Jacobi (Chief Architect of Z) said that the system is designed to keep track of data on a cache miss, uses broadcasts, and memory state bits are tracked for broadcasts to external chips. These go across the whole system, and when data arrives it makes sure it can be used and confirms that all other copies are invalidated before working on the data. In the slack channel as part of the event, he also stated that lots of cycle counting goes on!

So perhaps they are using cycle counts since last reference as an
indicator of "oldness"? They broadcast the cycle count of the to-be-evicted
line to see if any other cache has an older one that is
preferable to evict? Of course, this is pure speculation. :-(
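
In code form, the guess would be something like this (again, pure
speculation, nothing from IBM):

/* Speculative restatement of the guess above: broadcast the age of our
   eviction candidate and only push it to a peer whose own candidate is
   even older.  All values are made up. */
#include <stdio.h>
#include <stdint.h>

#define NCACHES 4

static uint64_t oldest_candidate_age[NCACHES] = { 900, 20, 5000, 100 };

static int find_peer_with_older_victim(int self, uint64_t my_age)
{
    for (int i = 0; i < NCACHES; i++)
        if (i != self && oldest_candidate_age[i] > my_age)
            return i;        /* peer would rather evict its own line */
    return -1;
}

int main(void)
{
    int self = 0;
    uint64_t my_age = 900;   /* cycles since last reference */

    int peer = find_peer_with_older_victim(self, my_age);
    if (peer >= 0)
        printf("push our victim into L2 %d\n", peer);
    else
        printf("no good peer; write back to memory\n");
    return 0;
}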