Fantasy-Land Hierarchal NUMA Memory-Model on Vertical Multi-Core/Memory Cube-Processing Unit.

Chris Thomasson

Feb 14, 2008, 5:16:12 AM
The physical size of a single liquid-cooled processing-unit would be
4"x4"x4" in the form of pluggable processing entities, that connect directly
into the motherboard of a super-computer which can indeed fit "comfortably"
under a given company's "average" desk size per-office. These units would
consist of many cores intertwined with multiple layers of
_highly_-integrated on-chip CPU-Vendor created memory-regions distributed;
they reside within a per-layer level of processing-units. The motherboard
has dedicated on-board logic that manages the traffic sent through the
master per-super-computing board interconnects:

http://groups.google.com/group/comp.arch/msg/7e0d9f39335f7677

http://groups.google.com/group/comp.arch/msg/c9d99ae1251f2462

http://groups.google.com/group/comp.programming.threads/msg/47ec14c94eb80701


Hierarchical in the sense of strong SC memory-barriers associated with
interlocked operations (think LOCK prefix in x86) __compared__ to a weaker
model (e.g., DEC Alpha, or even a SPARC under RMO; which seems to exclude
Solaris because it has an affinity for TSO). The weaker and more distributed
the memory-model is, the greater the scalability factor. Indeed the Alpha
has a really weak memory-model such that it does not even support implied
data-dependencies on "raw" pointer loads. X86 implies a '#LoadStore |
#LoadLoad' membar per-load; think in RMO SPARC terms wrt barrier instruction
operand. Weak mem-model per-processing unit indeed. Inter-Processing
Communication is not recommended (programming wise) unless you have a darn
good reason to do so.

The chip would be large enough to contain many layers of CPUs and memory.
The memory model is distributed into a "processing-layer", and over the
height of the unit. You stack layers on top of each other. There is a series
of pins that go directly through the layers; this contact will be the arrays
of communication elevators that are responsible for supporting inter-layer
messages. The physical distance between layers is VERY small; however, these
super-computing units are 4" tall. Level 1 better have a damn good reason to
contact Level 666 or something. This would be documented as NUMA on
steroids. Inter-layer communication is not recommended.

Code awareness would be provided by a basic thread-affinity method. Each
thread would have its execution bound to a specific super-computing
processing-unit; and a core contained therein. A function call similar to
POSIX pthread_getspecific() would be enough to let the calling thread know
what processing-unit it was running on.
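A minimal sketch of that kind of lookup, in Python rather than C. Here `threading.local` stands in for POSIX thread-specific data, and the names `bind_to_unit` and `current_unit` are hypothetical. The sketch only records which processing-unit and core a thread was assigned to; it does not pin the thread to real hardware.

```python
import threading

# Per-thread slot, playing the role of a pthread key.
_placement = threading.local()

def bind_to_unit(unit_id, core_id):
    """Record the (hypothetical) unit and core the calling thread is bound to."""
    _placement.unit_id = unit_id
    _placement.core_id = core_id

def current_unit():
    """Return (unit_id, core_id) for the calling thread, or None if unbound."""
    unit = getattr(_placement, "unit_id", None)
    if unit is None:
        return None
    return unit, _placement.core_id

def worker(unit_id, core_id, results):
    bind_to_unit(unit_id, core_id)
    # Each thread sees only its own binding.
    results[(unit_id, core_id)] = current_unit()

results = {}
threads = [threading.Thread(target=worker, args=(u, c, results))
           for u in range(2) for c in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Code that knows its unit can then prefer data structures local to that unit and treat anything else as a deliberate, expensive remote access.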

Programming would be in the form of a distributed PDR algorithm which is
compatible with a weak cache-friendly memory-model (think vanilla RCU; as it
was in early Linux Kernel implementations); very-weak distributed memory
access is the game; search comp.programming.threads for more info. On-chip
PDR can be used for hot-pluggable CPU units. Think cracking open a
live-server and sticking another high-end processing-unit into a free
interconnect on the motherboard within a given box.
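As a rough illustration of the read-copy-update pattern mentioned above, here is a toy Python sketch. It is not the Linux kernel's RCU and not any particular PDR implementation. The class name `RcuCell` is hypothetical, and Python's garbage collector stands in for the deferred-reclamation step that a real implementation has to do by hand.

```python
import threading

class RcuCell:
    """Holds a reference to an immutable snapshot; writers replace it wholesale."""

    def __init__(self, initial):
        self._snapshot = tuple(initial)
        self._writer_lock = threading.Lock()  # serializes writers only

    def read(self):
        # Readers take no lock: they grab the current reference and keep
        # using that version even if a writer publishes a newer one.
        return self._snapshot

    def update(self, fn):
        with self._writer_lock:
            copy = list(self._snapshot)    # copy the current version
            fn(copy)                       # modify the private copy
            self._snapshot = tuple(copy)   # publish with one reference store

cell = RcuCell([1, 2, 3])
old = cell.read()                          # a reader's view before the update
cell.update(lambda items: items.append(4))
new = cell.read()                          # a reader's view after the update
```

The reader holding `old` is never disturbed by the writer, which is the property that makes this style a good fit for weak, cache-friendly memory models.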

Before I go on and on and complete my naive science-fiction fiction book,
what do you think?

Apparently, Intel could not get their 80-core chip to access "normal"
existing memory units. They are working directly with memory-vendors to
create highly-processor specific distributed memory-slabs that can fit
directly on top of a physical chip.


--
Chris M. Thomasson
http://appcore.home.comcast.net

John Dallman

Feb 14, 2008, 7:27:00 AM
In article <_eWdnVUW-4Nwjyna...@comcast.com>,
cri...@comcast.net (Chris Thomasson) wrote:

> Before I go on and on and complete my naive science-fiction fiction
> book, what do you think?

It seems to have the weakness of all the "desktop supercomputing" ideas
that float around: a sharp discontinuity of hardware, software
compatibility, and maintenance from the existing systems doing jobs that
don't require "supercomputing" performance. This makes it hard for it to
leverage their hardware volumes, which pushes up costs fast. That makes
it hard for you to interest existing software vendors in it: they want a
reasonably large market, and are often cynical about hardware designers'
understanding of the software business.

Various companies have tried bringing things somewhat like this to
market; Clearspeed is an example. The situation is somewhat similar to
that of the first 8-bit home computers: nobody achieved any serious
commercial success, until Apple got the Apple ][ something close enough
to right. They demonstrated a hardware design pattern, which the IBM-PC
followed, and that started to be a large and successful business. But
it's really hard to get venture capital to start a business which will
almost certainly fail but may show someone else how to do it right. The
fact that Apple came back with the Mac is essentially coincidental;
there was very little continuity of hardware or software between the
different generations of Apple systems, and a different company, or
none, could have come up with the Mac.

To make this kind of thing succeed, you need to present a compelling
turnkey solution to at least one problem that lots of companies need to
solve regularly; a few companies who have lots of money riding on it
will do instead, but the competition will be fiercer.

> Apparently, Intel could not get their 80-core chip to access "normal"
> existing memory units. They are working directly with memory-vendors
> to create highly-processor specific distributed memory-slabs that can
> fit directly on top of a physical chip.

"A sharp discontinuity of hardware ... pushes up costs fast."

--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.

Chris Thomasson

Feb 14, 2008, 8:36:57 AM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:_eWdnVUW-4Nwjyna...@comcast.com...

> The physical size of a single liquid-cooled processing-unit would be
> 4"x4"x4" in the form of pluggable processing entities, that connect
> directly into the motherboard of a super-computer which can indeed fit
> "comfortably" under a given company's "average" desk size per-office.

[...]

The cost would be huge... However, you're spending OR$&D on a hard-core
expandable distributed super-computer. A possible selling point could be
something like:

Of course its per-unit-cost is expensive, it's a super computing framework
after all...

Of course a unit would be physically "huge" (e.g., 4"x4"x4"); the
interconnects which accept it and the programming rules contained therein
coalesce into a massive expandable super-computing plug-in bay. Spending
$$$ on a super computer "receiver" which can exist under a desk in many of
your offices, well, if you don't need it; don't buy it...

:^(

MitchAlsup

Feb 14, 2008, 1:04:01 PM
On Feb 14, 4:16 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> The physical size of a single liquid-cooled processing-unit would be
> 4"x4"x4" in the form of pluggable processing entities, that connect directly
> into the motherboard of a super-computer which can indeed fit "comfortably"
> under a given company's "average" desk size per-office.

Let's say, for example, that we make this monster from 12mm*12mm*1mm
processor chips with all the usual stuff on die (CPU, caches,
crossbars, controllers) and a 3D interconnect scheme. Thus, this 4*4*4
cube would house some 8*8*100 of these dies all packed together
nicely. Thus, we now have an energy density (at 100W/die) of 64KW in a
cube the size of a light bulb. But let us ignore that minor heat flow
problem for now.

Let us assume you intend to mount this chip via a socket on the
motherboard. So, we now have 640,000 processors (single CPU/die)
consuming memory bandwidth through 9000-20,000 pins. As they say in
the hills: "Ain't gonna verk". You are off by a factor of 100 in the
pin interconnect (minimum); even if the pins are all wiggling at
3+GHz. But let's ignore this (again) for now.
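A quick check of the packing arithmetic, taking the inputs stated in this post at face value (8*8*100 dies, 100 W per die, 9000-20,000 socket pins). The function name `cube_budget` is hypothetical. With these inputs the product comes to 6,400 dies and 640 kW rather than the 640,000 and 64KW quoted, so those figures are best read as order-of-magnitude; either way only a handful of pins is available per die.

```python
def cube_budget(dies_x, dies_y, layers, watts_per_die, pins):
    """Return (die count, total power in kW, socket pins available per die)."""
    dies = dies_x * dies_y * layers
    power_kw = dies * watts_per_die / 1000.0
    pins_per_die = pins / dies
    return dies, power_kw, pins_per_die

# Inputs taken from the post: 8*8*100 dies at 100 W each.
dies, power_kw, pins_low = cube_budget(8, 8, 100, 100, 9000)
_, _, pins_high = cube_budget(8, 8, 100, 100, 20000)

print(dies, "dies,", power_kw, "kW")
print("pins per die:", pins_low, "to", pins_high)
```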

The motherboard will have to support some 640,000 DRAM DIMMs (for
adequate memory footprint and adequate BW) and have these all within
about 10" of the CPU cube (acceptable latency). I think there is a
fundamental limit to supercomputing that will stymie this effort far
before we get to the 4*4*4 cube. And this limit is the number of pins
(or amount of information) that can pass through the surface area of
that cube (per unit time).

Now, imagine the problem if each die contains several CPUs (4-64),
strengthening the problem of the surface BW and amount of main memory
used to support the processing resources.

So the fundamental problem to be solved is: "How much information can
pass through a surface (of this cube) per unit time"? (Attribute:
somebody at IBM about a decade ago). Then build the internal resources
of that cube to consume that BW and not much more.

Chris Thomasson

Feb 22, 2008, 8:46:37 PM
"MitchAlsup" <Mitch...@aol.com> wrote in message
news:ce6f2f29-b878-42e7...@m78g2000hsh.googlegroups.com...

On Feb 14, 4:16 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> The physical size of a single liquid-cooled processing-unit would be
> 4"x4"x4" in the form of pluggable processing entities, that connect
> directly into the motherboard of a super-computer which can indeed fit
> "comfortably" under a given company's "average" desk size per-office.

> Let's say, for example, that we make this monster from 12mm*12mm*1mm
> processor chips with all the usual stuff on die (CPU, caches,
> crossbars, controllers) and a 3D interconnect scheme. Thus, this 4*4*4
> cube would house some 8*8*100 of these dies all packed together
> nicely. Thus, we now have an energy density (at 100W/die) of 64KW in a
> cube the size of a light bulb. But let us ignore that minor heat flow
> problem for now.

> Let us assume you intend to mount this chip via a socket on the
> motherboard. So, we now have 640,000 processors (single CPU/die)
> consuming memory bandwidth through 9000-20,000 pins. As they say in
> the hills: "Ain't gonna verk". You are off by a factor of 100 in the
> pin interconnect (minimum); even if the pins are all wiggling at
> 3+GHz. But let's ignore this (again) for now.

> The motherboard will have to support some 640,000 DRAM DIMMs (for
> adequate memory footprint and adequate BW) and have these all within
> about 10" of the CPU cube (acceptable latency). I think there is a
> fundamental limit to supercomputing that will stymie this effort far
> before we get to the 4*4*4 cube. And this limit is the number of pins
> (or amount of information) that can pass through the surface area of
> that cube (per unit time).

[...]

Points well taken. Okay... Well, I guess I could reduce the idea down to an
integration of a fairly large amount of memory directly into the cores of a
single processing-cube. Something like 8 cores with hundreds of kilobytes
(6-7 hundred) of per-core memory, perhaps even 1MB per-core... This would
be a NUMA setup, with a DMA channel interface for inter-core communication.
Like the Cell processor. The memory connects can sit directly below and on
top of the "cpu-layer". Like a sandwich with memory layers for bread, and
cpu layer for meat. Is that feasible?

Dmitriy V'jukov

Feb 25, 2008, 4:03:25 PM
On 23 Feb, 04:46, "Chris Thomasson" <cris...@comcast.net> wrote:

> Points well taken. Okay... Well, I guess I could reduce the idea down to an
> integration of a fairly large amount of memory directly into the cores of a
> single processing-cube. Something like 8 cores with hundreds of kilobytes
> (6-7 hundred) of per-core memory, perhaps even 1MB per-core... This would
> be a NUMA setup, with a DMA channel interface for inter-core communication.
> Like the Cell processor. The memory connects can sit directly below and on
> top of the "cpu-layer". Like a sandwich with memory layers for bread, and
> cpu layer for meat. Is that feasible?

I don't know whether it's feasible or not. But I can propose another
design for 'Fantasy-Land Unit'. Theoretical.

8 cores. 8 MB of memory. All memory is divided into, for example, 64
KB blocks. Every memory block is connected to *one* core through a
programmable commutator. This design directly supports producer-
consumer and pipeline patterns, and eliminates the need for a DMA
channel.

Usage pattern: the 'producer' core fills a memory block with data. Then
the 'producer' core remaps the memory block to the 'consumer' core, and
notifies the 'consumer' core that there is some work to do.

There can be some NUMA artifacts anyway, i.e. some memory blocks are
'closer' to a core and some 'farther' from it. But potentially every
memory block can be connected to every core.
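The usage pattern above can be sketched as a small software model. This is only a sketch under assumptions: the `Commutator` class and its `attach`, `access`, and `remap` methods are hypothetical names, a `queue.Queue` stands in for the core-to-core notification, and ownership checks in software stand in for what the hardware switch would enforce.

```python
import queue

class Commutator:
    """Maps each memory block to exactly one owning core at a time."""

    def __init__(self, num_blocks, block_size):
        self._blocks = [bytearray(block_size) for _ in range(num_blocks)]
        self._owner = [None] * num_blocks

    def attach(self, block, core):
        if self._owner[block] is not None:
            raise RuntimeError("block is already attached to a core")
        self._owner[block] = core

    def access(self, block, core):
        # Only the current owner can touch the block's memory.
        if self._owner[block] != core:
            raise PermissionError("core does not own this block")
        return self._blocks[block]

    def remap(self, block, from_core, to_core):
        # Ownership moves; no data is copied, so no DMA is needed.
        if self._owner[block] != from_core:
            raise PermissionError("only the owner can remap a block")
        self._owner[block] = to_core

BLOCK_SIZE = 64 * 1024
fabric = Commutator(num_blocks=128, block_size=BLOCK_SIZE)  # 8 MB in total
mailbox = queue.Queue()   # stands in for the notification to the consumer

PRODUCER, CONSUMER = 0, 1
fabric.attach(5, PRODUCER)
fabric.access(5, PRODUCER)[:5] = b"hello"   # producer fills the block
fabric.remap(5, PRODUCER, CONSUMER)         # hand the block over
mailbox.put(5)                              # tell the consumer about it

ready = mailbox.get()
received = bytes(fabric.access(ready, CONSUMER)[:5])
```

After the remap the producer can no longer reach the block, which is what makes the handoff safe without any cache-coherence traffic between the two cores.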

Dmitriy V'jukov

Chris Thomasson

Feb 26, 2008, 9:59:50 PM
"John Dallman" <j...@cix.co.uk> wrote in message
news:memo.2008021...@jgd.compulink.co.uk...

> In article <_eWdnVUW-4Nwjyna...@comcast.com>,
> cri...@comcast.net (Chris Thomasson) wrote:
>
>> Before I go on and on and complete my naive science-fiction fiction
>> book, what do you think?
>
> It seems to have the weakness of all the "desktop supercomputing" ideas
> that float around: a sharp discontinuity of hardware, software
> compatibility, and maintenance from the existing systems doing jobs that
> don't require "supercomputing" performance.
[...]

>> Apparently, Intel could not get their 80-core chip to access "normal"
>> existing memory units. They are working directly with memory-vendors
>> to create highly-processor specific distributed memory-slabs that can
>> fit directly on top of a physical chip.
>
> "A sharp discontinuity of hardware ... pushes up costs fast."

Yup. I have drastically scaled this back to a 2GB memory-card with sockets
for 1/2 multi-core SPU's to plug into. I just want to know if the real-world
issues wrt hardware implementation/programming would render the idea
useless.

John Dallman

Feb 27, 2008, 7:16:00 PM
In article <cYSdnU8VS5WeTVna...@comcast.com>,
cri...@comcast.net (Chris Thomasson) wrote:

> Yup. I have drastically scaled this back to a 2GB memory-card with
> sockets for 1/2 multi-core SPU's to plug into. I just want to know if
> the real-world issues wrt hardware implementation/programming would
> render the idea useless.

Presumably you have a bunch of these cards making up a system. What is
the communication between them? If there is none, there's no point in
them being in the same system.

If it's message-passing, you have the kind of system that is programmed
with MPI. One notes that usage of MPI is confined to certain quite specific
kinds of scientific processing: it is not impacting the broader world of
computing, and it's less than clear that a cheap MPI system would see
much take-up.

If there is an address space spanning the whole system, but a processor
has faster access to its local memory than to memory attached to other
processors, then you have a NUMA system. There are quite a few ways to
do this, and reading up on them is advisable. They are fairly complex to
build if they're to be cache-coherent, but if they aren't, nobody will
buy them.

Chris Thomasson

Feb 28, 2008, 12:16:29 AM
"John Dallman" <j...@cix.co.uk> wrote in message
news:memo.2008022...@jgd.compulink.co.uk...

Well, I would use the NUMA setup... IMVHO, I think that ccNUMA is an
oxymoron. There are algorithms that can deal with very weak cache-coherency,
e.g., RCU. I need to think a lot more on this. Thanks for your patience.

:^)

Rob Warnock

Feb 28, 2008, 6:25:20 AM
Chris Thomasson <cri...@comcast.net> wrote:
+---------------

| IMVHO, I think that ccNUMA is an oxymoron.
+---------------

Why do you say that?!? ccNUMA systems such as SGI Origin & Altix
deliver sequential consistency, just like well-behaved ccSMP systems,
as does AMD's Opteron [though Opteron's *implementation* of ccNUMA
doesn't scale well at all].

The "cc" and "NUMA" attributes are, to a very large extent, orthogonal.
You can have one, or the other, or both, or neither.


-Rob

p.s. Oh, and note the footnote in the chapter on ccNUMA in
Greg Pfister's "In Search of Clusters" where he admits that an IBM 3081(?)
with >4 CPUs is actually ccNUMA, not SMP, though with only ~1.4:1
remote-to-local latency ratio compared to 2-3:1 for SGI's Origin.
[I could have the IBM model number wrong, since I don't have the
book at hand at the moment.]
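To put those remote-to-local ratios in perspective, here is a simple weighted-average latency model. It is only a sketch: the 100 ns local latency and the 50% local-hit fraction are assumed values for illustration, 2.5 is taken as the midpoint of the 2-3:1 range, and the model ignores caching and contention.

```python
def average_latency(local_ns, remote_ratio, local_fraction):
    """Mean memory latency when a fraction of accesses stay local."""
    remote_ns = local_ns * remote_ratio
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

# Assumed: 100 ns local latency, half of all accesses local.
mild = average_latency(100.0, 1.4, 0.5)    # ~1.4:1 ratio
steep = average_latency(100.0, 2.5, 0.5)   # midpoint of 2-3:1

print(mild, steep)
```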

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Anne & Lynn Wheeler

Feb 28, 2008, 9:21:33 AM

rp...@rpw3.org (Rob Warnock) writes:
> p.s. Oh, and note the footnote in the chapter on ccNUMA in
> Greg Pfister's "In Search of Clusters" where he admits that an IBM 3081(?)
> with >4 CPUs is actually ccNUMA, not SMP, though with only ~1.4:1
> remote-to-local latency ratio compared to 2-3:1 for SGI's Origin.
> [I could have the IBM model number wrong, since I don't have the
> book at hand at the moment.]

prior to 3081 ... 360 & 370 two-processor SMP were actually two
independent processor complexes ... with provisions to coordinate memory
as one set ... but could also be split and run as two independent
processor complexes.

3081 was referred to as "dyadic" ... since the two processors were built
into the same box and no longer possible to split into two separate,
independent processor complexes. 3084 was two independent 3081s tied
together to operate as 4-way smp (but could be split apart and operate
as two independent 3081s).

for a little x-over from this thread:
http://www.garlic.com/~lynn/2008e.html#38 Any benefit to programming a RISC processor by hand?

801/risc provided for no cache consistency ... in much the same way that
risc simplification was extreme counter reaction to the complexity of
future system ... no cache consistency was reaction to heavy performance
penalty paid by both 370 and future system for cache consistency.

standard two-way 370 smp cache machine ran machine cycle at .9 times
uniprocessor machines ... to allow for x-cache chatter (any actual
x-cache invalidates would slow processing down further). so standard 370
two-way smp basic hardware was rated at 1.8 times a uniprocessor (actual
thruput was frequently quoted at 1.3-1.5 times a uniprocessor when
x-cache invalidates and smp software overhead was included). cleaving a
(two-way) 370 smp to operate as two independent processors ... would
automatically bump machine cycle to full speed.
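The arithmetic behind those ratings can be sketched as follows. The function names are hypothetical; the inputs (a 0.9 cycle factor, two processors, and 1.3-1.5 times observed thruput) are the figures quoted above.

```python
def rated_throughput(processors, cycle_factor):
    """Hardware rating relative to one full-speed uniprocessor."""
    return processors * cycle_factor

def extra_loss(observed, rated):
    """Fraction of the rated throughput lost to invalidates and software overhead."""
    return 1.0 - observed / rated

rated = rated_throughput(2, 0.9)        # two processors at 0.9 speed
loss_worst = extra_loss(1.3, rated)     # observed 1.3 times a uniprocessor
loss_best = extra_loss(1.5, rated)      # observed 1.5 times a uniprocessor

print(rated, round(loss_worst, 3), round(loss_best, 3))
```

So on top of the 10% cycle slowdown, roughly a sixth to more than a quarter of the rated capacity went to cross-cache invalidates and SMP software overhead.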

there was never initially any intention to provide single processor
version of 308x machine ... however some of the operating systems were
slow in upgrading kernel for multiprocessor support ... primarily acp
(airline control program, renamed tpf, transaction processing facility,
to reflect its heavy use by financial transaction networks). because of
acp/tpf issue ... they finally released 3083 ... basically 3081 with 2nd
processor removed from the box and machine cycle bumped up to full
speed.

3084 4-way operation was resulting in some amount of cache thrashing
performance impact (i.e. each cache getting signals from three other
caches, rather than one other cache). in the early time-frame of 3084
deployments, there were major operating system efforts to restructure
kernel storage allocation for cache line sensitivity ... make sure that
kernel data structures were cache-line aligned ... and in units of
cache-line. these kernel storage cache-line sensitivity efforts resulted
in five+ percent overall system thruput improvement.
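A small sketch of what "cache-line aligned and in units of cache-line" means for a storage allocator. The 128-byte line size is an assumption for illustration, not a figure from this post, and `round_up` and `layout` are hypothetical names.

```python
def round_up(value, line):
    """Round value up to the next multiple of the cache-line size."""
    return (value + line - 1) // line * line

def layout(sizes, line):
    """Place structures so each starts on a line boundary and fills whole lines.

    Returns a list of (offset, padded_size) pairs.
    """
    offset = 0
    placed = []
    for size in sizes:
        padded = round_up(size, line)
        placed.append((offset, padded))
        offset += padded
    return placed

LINE = 128  # assumed cache-line size in bytes
placed = layout([40, 128, 130], LINE)
print(placed)
```

Because no two structures share a line, a store by one processor to its own structure does not invalidate a line that another processor is using for an unrelated structure.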

going into 3090 with larger number of processors ... they had to resort
to driving cache at significantly higher machine cycle than the rest of
the machine.

3090 had a different kind of NUMA problem with physical packaging of
memory ... i.e. the physical distance of some of the memory was
beginning to exceed the design point for latency. to address this issue
... effectively the same memory technology was packaged in two different
ways. there was the standard memory storage bus that handled standard
instruction operations (cache misses, etc). The additional memory
storage was packaged as "extended storage" ... on a new kind of wider
bus. Synchronous processor instructions were used by the kernel to copy
pages back&forth between "normal" memory and "extended storage" memory
(as sort of pseudo paging device).

later machine generations got around the physical packaging issue and
eliminated physical extended storage ... however, they retained
microcode configuration operations to simulate "extended storage" using
regular memory ... finding that splitting the memory into two different
categories alleviated problems in kernel page replacement algorithm
implementations (with performance guidelines about "extended storage"
configurations continuing up into this decade).

a different issue regarding partitioning memory is discussed
in this post regarding recent z10 announcement
http://www.garlic.com/~lynn/2008e.html#39 z10 presentation on 26 Feb

involving "HSA" and its use as system-wide disk cache ... with some
discussion of global LRU replacement vis-a-vis (partitioned) local LRU
replacement.

misc. past posts mentioning SMP operation
http://www.garlic.com/~lynn/subtopic.html#smp

above also mentions working with charlie at the science center when he
invented compare&swap instruction.

for other cluster topic drift
http://www.garlic.com/~lynn/2000c.html#21 Cache coherence [was Re: TF-1]
http://www.garlic.com/~lynn/2001n.html#83 CM-5 Thinking Machines, Supercomputers
http://www.garlic.com/~lynn/2006w.html#40 Why so little parallelism?
http://www.garlic.com/~lynn/2006w.html#41 Why so little parallelism?

on this topic of work we were doing in ha/cmp scaleup
http://www.garlic.com/~lynn/subtopic.html#hacmp
and old MEDUSA, cluster-in-a-rack email
http://www.garlic.com/~lynn/lhwemail.html#medusa

there is possibly some justification for transferring the project and
being told to not work on anything with more processors ... related to
this comment about fall-out of future system failure:
http://www.garlic.com/~lynn/2001f.html#33

since we weren't just restricting cluster-in-a-rack scaleup to
numerical intensive ... but were also working on commercial
applications ... old reference
http://www.garlic.com/~lynn/95.html#13
http://www.garlic.com/~lynn/96.html#15

more recent reference discussing scaleup for cluster distributed
lock manager
http://www.garlic.com/~lynn/2008b.html#69 How does ATTACH pass address of ECB to child?
http://www.garlic.com/~lynn/2008c.html#81 Random thoughts

another recent mention of SCI for numa implementation:
http://www.garlic.com/~lynn/2008e.html#24 Berkeley researcher describes parallel path

John Dallman

Feb 28, 2008, 3:52:00 PM
In article <afKdnbvs5dQa3Fva...@comcast.com>,
cri...@comcast.net (Chris Thomasson) wrote:

> Well, I would use the NUMA setup...

The devil of these is very definitely in the details. Find out about how
the 128-way Sun systems work, before you start using terms like "the
NUMA setup". There are quite a lot of ways to do it.

Stephen Fuld

Mar 6, 2008, 11:24:58 PM
Anne & Lynn Wheeler wrote:

snip

> 3090 had a different kind of NUMA problem with physical packaging of
> memory ... i.e. the physical distance of some of the memory was
> beginning to exceed the design point for latency. to address this issue
> ... effectively the same memory technology was packaged in two different
> ways. there was the standard memory storage bus that handled standard
> instruction operations (cache misses, etc). The additional memory
> storage was packaged as "extended storage" ... on a new kind of wider
> bus. Synchronous processor instructions were used by the kernel to copy
> pages back&forth between "normal" memory and "extended storage" memory
> (as sort of pseudo paging device).
>
> later machine generations got around the physical packaging issue and
> eliminated physical extended storage ... however, they retained
> microcode configuration operations to simulate "extended storage" using
> regular memory ... finding that splitting the memory into two different
> categories alleviated problems in kernel page replacement algorithm
> implementations (with performance guidelines about "extended storage"
> configurations continuing up into this decade).

I remember this, was puzzled then and still remain puzzled. How could
it be better to use some memory as a paging device, actually moving data
around in memory than simply using it as additional memory to hold those
same pages? It certainly is counter-intuitive! Can you describe the
problems with the page replacement algorithm implementations that caused
this anomalous behavior?


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Anne & Lynn Wheeler

Mar 6, 2008, 11:40:38 PM
Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
> I remember this, was puzzled then and still remain puzzled. How could
> it be better to use some memory as a paging device, actually moving
> data around in memory than simply using it as additional memory to
> hold those same pages? It certainly is counter-intuitive! Can you
> describe the problems with the page replacement algorithm
> implementations that caused this anomalous behavior?

re:
http://www.garlic.com/~lynn/2008e.html#40 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical

smaller real storage would result in cycling the reference bits more
often ... leading to improved differentiation of what was being used
within periods of time ... larger real storage extended the interval so
that there was less differentiation about actual page use (i.e. all
pages with any use at all during the extended period appeared to be
equally used).
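A toy model of that effect, with invented parameters: 100 pages, a 10-page hot set that receives 9 of every 10 references, and reference bits that are all reset at the start of an interval. The interval length stands in for how long it takes to cycle through real storage. The function names are hypothetical.

```python
def reference_string(length, hot_pages=10, total_pages=100):
    """Deterministic trace: 9 of every 10 references hit the hot set."""
    cold_pages = total_pages - hot_pages
    hot_i = cold_i = 0
    for i in range(length):
        if i % 10 == 9:
            # Every tenth reference touches the next cold page in turn.
            yield hot_pages + cold_i % cold_pages
            cold_i += 1
        else:
            yield hot_i % hot_pages
            hot_i += 1

def marked_after(interval, total_pages=100):
    """Set reference bits for one interval and count the pages marked as used."""
    ref_bit = [False] * total_pages
    for page in reference_string(interval, total_pages=total_pages):
        ref_bit[page] = True
    return sum(ref_bit)

short_sweep = marked_after(50)     # bits cycled often (smaller storage)
long_sweep = marked_after(1000)    # bits cycled rarely (larger storage)
print(short_sweep, long_sweep)
```

With the short interval only the hot set and a few cold pages carry a reference bit, so the replacement algorithm can tell them apart. With the long interval every page is marked and the bits no longer distinguish hot pages from cold ones.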

there is also some possibility that the effective operation may
degenerate to FIFO ... aka default LRU can degenerate to FIFO ...
reference to coding sleight of hand where LRU degenerates to random
instead of FIFO:
http://www.garlic.com/~lynn/2008e.html#16 Kernels

lots of posts about doing replacement algorithms dating
back to undergraduate in 60s
http://www.garlic.com/~lynn/subtopic.html#wsclock

which 15 or so yrs later got me dragged into battle somebody was having
getting their Phd from stanford ... recent post:
http://www.garlic.com/~lynn/2008c.html#65 No Glory for the PDP-15

old communication on the subject
http://www.garlic.com/~lynn/2006w.html#email821019

Nick Maclaren

Mar 7, 2008, 5:20:15 AM

In article <us3Aj.286050$MJ6.1...@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
|>
|> I remember this, was puzzled then and still remain puzzled. How could
|> it be better to use some memory as a paging device, actually moving data
|> around in memory than simply using it as additional memory to hold those
|> same pages? It certainly is counter-intuitive! Can you describe the
|> problems with the page replacement algorithm implementations that caused
|> this anomalous behavior?

You're looking in the wrong place!

Think I/O. You could often read and write several pages for the cost
of one. Now, simply using a larger page size has other problems. But
aggregating pages that seem to be used together can have benefits.


Regards,
Nick Maclaren.

Stephen Fuld

Mar 7, 2008, 11:09:46 AM
Anne & Lynn Wheeler wrote:
> Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
>> I remember this, was puzzled then and still remain puzzled. How could
>> it be better to use some memory as a paging device, actually moving
>> data around in memory than simply using it as additional memory to
>> hold those same pages? It certainly is counter-intuitive! Can you
>> describe the problems with the page replacement algorithm
>> implementations that caused this anomalous behavior?
>
> re:
> http://www.garlic.com/~lynn/2008e.html#40 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
>
> smaller real storage would result in cycling the reference bits more
> often ... leading to improved differentiation of what was being used
> within periods of time ... larger real storage extended the interval so
> that there was less differentiation about actual page use (i.e. all
> pages with any use at all during the extended period appeared to be
> equally used).

Ahh! The result makes sense. There was a trade off between "accuracy"
of the knowledge of what was LRU and the amount of overhead needed to
obtain that knowledge. As the memory size grew, the amount of overhead
for the same degree of accuracy grew. I guess it was determined that
the overhead of the "extra" moving of pages around was less than that
for the increased accuracy. An ugly choice to have to make. :-(

Thanks

Stephen Fuld

Mar 7, 2008, 11:18:46 AM
Nick Maclaren wrote:
> In article <us3Aj.286050$MJ6.1...@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
> |>
> |> I remember this, was puzzled then and still remain puzzled. How could
> |> it be better to use some memory as a paging device, actually moving data
> |> around in memory than simply using it as additional memory to hold those
> |> same pages? It certainly is counter-intuitive! Can you describe the
> |> problems with the page replacement algorithm implementations that caused
> |> this anomalous behavior?
>
> You're looking in the wrong place!
>
> Think I/O.

As a former disk system architect, I usually do! :-)

> You could often read and write several pages for the cost
> of one.

On a real I/O, sure. But these weren't real I/Os. They were
essentially MVCL instructions to move a page from one location in memory
(that was part of the memory marked as normal memory), to and from
another part of the same physical memory, just marked as extended and
therefore non-directly referencable by normal instructions. So overhead
or latency is almost zero and the cost is almost proportional to the
amount of data moved. Besides, if the extended memory was used as
primary, then the amount of page movement of any kind would be reduced.

> Now, simply using a larger page size has other problems.

Yes, but no one was suggesting that, as far as I know.

> But
> aggregating pages that seem to be used together can have benefits.

Yes, but it is hard to see the benefits in this situation. Remember, it
was the same physical memory, so if you used it all for primary storage,
you could do whatever aggregation was useful. The arbitrary
"downgrading" of some of that memory to be used only for a "high speed
paging device" seemed to me to be a bad idea.

Nick Maclaren

Mar 7, 2008, 11:22:49 AM

In article <GVdAj.720913$kj1.6...@bgtnsc04-news.ops.worldnet.att.net>,

We are at cross-purposes. I didn't mean that there was any gain in
the actual aggregation phase, but that the gain comes when you want
to write it to real disk and back again.

I.e. the part of memory that is the high speed paging device is
used as much as an aggregating buffer as anything else.


Regards,
Nick Maclaren.

Stephen Fuld

Mar 7, 2008, 12:14:53 PM

I think I understand what you are saying, but given the availability of
scatter/gather I/O (data chaining in IBM speak), such aggregation is
just as easy from real memory as it is from extended. Thus there is no
advantage to moving the pages to some special location to aggregate
them, and you have to pay the cost of the move.

Nick Maclaren

Mar 7, 2008, 1:12:00 PM

In article <hKeAj.721128$kj1.4...@bgtnsc04-news.ops.worldnet.att.net>,

Yes and no. That became decreasingly true as time went on, and there
always were secondary costs. But I can't remember the details of the
argument after so long.

I take your point that it is vastly less relevant on a machine of that
class than on one where 4 KB blocks are seriously bad news.


Regards,
Nick Maclaren.

Anne & Lynn Wheeler

unread,
Mar 7, 2008, 2:36:12 PM3/7/08
to
Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
> I think I understand what you are saying, but given the availability
> of scatter/gather I/O (data chaining in IBM speak), such aggregation
> is just as easy from real memory as it is from extended. Thus there
> is no advantage to moving the pages to some special location to
> aggregate them, and you have to pay the cost of the move.

re:
http://www.garlic.com/~lynn/2008f.html#3 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical ...

3090 extended store had a synchronous fast move instruction custom for
the extended store ... sort of like a customized 4k move ... and
possibly also didn't affect processor cache (other than invalidates
... but didn't actually drag thru cache). i/o was not possible to
"extended store" ... i.e. any pages in extended store that "aged out"
would have to be copied back to regular memory before doing any i/o.
one might consider this analogous to "page migration" ... something i
had implemented for fixed-head (and electronic store) paging disks in
the 70s (required moving pages back into storage before writing to
lower thruput disks) ... actually the page migration was more
generalized ... but fixed-head to non-fixed-head was most apparent
effect.

a little before 3090 (early 80s starting with 308x) ... both mvs and vm
got "big pages" ... basically custom page replacement that attempted to
aggregate a full track of a process's pages (10 for 3380) for a single
transfer out. a subsequent fault on any member of a "big page" ... would
fetch all pages in a big page. the issue was effectively trading off
real storage (fetch of 40k) against 3380 access bottleneck (i.e. random
access for 3380 vis-a-vis 3330 only went up moderately ... but transfer
rate went up by a factor of ten). the idea was to somewhat
approximate the thruput of fixed-head disks ... with the much less
expensive 3380s (attempting to always perform full track operation per
access, possibly even doing multiple full track operations at the same
cylinder position ... before any requirement to move the arm).

big pages implemented a log-structured filesystem kind of logic (i.e.
disk arm access optimization) ... allocated space supporting the operation
was possibly ten times expected use ... moving cursor progressed across
surface ... with new write of a big page always going to next available
track on the leading edge of the moving cursor. advantage of paging
strategy (vis-a-vis log-structured filesystem) was there was no
"clean-up" requirement to periodically consolidate scattered written
file records into consecutive locations.
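
A rough sketch of the moving-cursor idea described above, in Python. This
is only an illustration under stated assumptions, not the actual MVS or VM
code: the class name, the method names, and the choice to free a track as
soon as its big page is fetched back are all hypothetical. Only the figure
of 10 pages per 3380 track comes from the text.

```python
PAGES_PER_TRACK = 10  # 3380 figure quoted in the post


class BigPageArea:
    """Toy model: each track holds one big page (up to 10 regular pages)."""

    def __init__(self, tracks):
        self.tracks = [None] * tracks  # track -> tuple of page ids, None if free
        self.cursor = 0                # leading edge of the moving cursor
        self.where = {}                # page id -> track currently holding it

    def write_out(self, pages):
        """Write a big page to the next free track at or after the cursor."""
        pages = tuple(pages)
        assert len(pages) <= PAGES_PER_TRACK
        for _ in range(len(self.tracks)):
            if self.tracks[self.cursor] is None:
                break
            self.cursor = (self.cursor + 1) % len(self.tracks)
        else:
            raise RuntimeError("no free track")
        track = self.cursor
        self.tracks[track] = pages
        for p in pages:
            self.where[p] = track
        self.cursor = (track + 1) % len(self.tracks)
        return track

    def fault(self, page):
        """A fault on any member fetches every page in its big page."""
        track = self.where[page]
        members = self.tracks[track]
        # Assumption: the track is released on fetch, so no clean-up pass
        # is ever needed to consolidate scattered records.
        self.tracks[track] = None
        for p in members:
            del self.where[p]
        return members


area = BigPageArea(tracks=100)  # allocated space far larger than expected use
area.write_out(range(0, 10))
area.write_out(range(10, 20))
print(area.fault(13))           # all ten pages 10..19 come back together
```

With the area sized at several times expected use, the cursor nearly
always finds a free track right away, which is what keeps arm motion low
in this model.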

discussion around recent releases implies that support for "big pages" has
been dropped because associated processor overhead was exceeding the
benefit.

the original purpose of 3090 extended store was because of physical
packaging and requiring something that differentiated storage with
longer latency. this is somewhat akin to the old 360 LCS ... where there
was a variety of both kinds of implementation ... 1) execution directly
out of LCS (with latency on every load/store) and 2) copying from LCS
down to faster storage. 3090 extended store only provided hardware
support for the "copying" strategy.

later machine generations addressed the latency associated with physical
packaging. however, it was still possible to configure microcode to
partition standard memory as (hardware) emulated "extended store".
http://www.garlic.com/~lynn/2008e.html#40 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical ...

the issue was that there was some experience with customers getting
better thruput with some configured extended store (trading off lower
amount of standard addressable storage for some amount of extended
storage). one could claim that a better algorithm implementation could
approach the accuracy of working with smaller amounts of real memory ...
while obtaining the benefits of having one global addressable memory
(w/o requiring moving overhead).

for total other topic drift ... there was a project to craft HiPPI I/O
support onto 3090 ... but standard channel interface wouldn't support
the 100mbyte/sec transfer ... so there was a hack to craft HiPPI support
into the side of the extended store bus. since the processor didn't
actually have channel program capability on the extended store bus ...
I/O commands were implemented using a peek/poke kind of architecture
(using the extended store, 4k "copy" instructions to peek & poke to
reserved extended store addresses).

for other topic drift ... past posts discussing big page (i.e. full
track transfers of multiple regular pages) implementation from early
80s:
http://www.garlic.com/~lynn/2001k.html#60 Defrag in linux? - Newbie question
http://www.garlic.com/~lynn/2002b.html#20 index searching
http://www.garlic.com/~lynn/2002c.html#29 Page size (was: VAX, M68K complex instructions)
http://www.garlic.com/~lynn/2002c.html#48 Swapper was Re: History of Login Names
http://www.garlic.com/~lynn/2002e.html#8 What are some impressive page rates?
http://www.garlic.com/~lynn/2002e.html#11 What are some impressive page rates?
http://www.garlic.com/~lynn/2002f.html#20 Blade architectures
http://www.garlic.com/~lynn/2002l.html#36 Do any architectures use instruction count instead of timer
http://www.garlic.com/~lynn/2002m.html#4 Handling variable page sizes?
http://www.garlic.com/~lynn/2002m.html#7 Handling variable page sizes?
http://www.garlic.com/~lynn/2003b.html#69 Disk drives as commodities. Was Re: Yamhill
http://www.garlic.com/~lynn/2003d.html#21 PDP10 and RISC
http://www.garlic.com/~lynn/2003f.html#5 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#9 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#16 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#48 Alpha performance, why?
http://www.garlic.com/~lynn/2003g.html#12 Page Table - per OS/Process
http://www.garlic.com/~lynn/2003o.html#61 1teraflops cell processor possible?
http://www.garlic.com/~lynn/2003o.html#62 1teraflops cell processor possible?
http://www.garlic.com/~lynn/2004.html#13 Holee shit! 30 years ago!
http://www.garlic.com/~lynn/2004e.html#16 Paging query - progress
http://www.garlic.com/~lynn/2004n.html#22 Shipwrecks
http://www.garlic.com/~lynn/2004p.html#39 100% CPU is not always bad
http://www.garlic.com/~lynn/2005h.html#15 Exceptions at basic block boundaries
http://www.garlic.com/~lynn/2005j.html#51 Q ALLOC PAGE vs. CP Q ALLOC vs ESAMAP
http://www.garlic.com/~lynn/2005l.html#41 25% Pageds utilization on 3390-09?
http://www.garlic.com/~lynn/2005n.html#18 Code density and performance?
http://www.garlic.com/~lynn/2005n.html#19 Code density and performance?
http://www.garlic.com/~lynn/2005n.html#21 Code density and performance?
http://www.garlic.com/~lynn/2005n.html#22 Code density and performance?
http://www.garlic.com/~lynn/2006j.html#2 virtual memory
http://www.garlic.com/~lynn/2006j.html#3 virtual memory
http://www.garlic.com/~lynn/2006j.html#4 virtual memory
http://www.garlic.com/~lynn/2006j.html#11 The Pankian Metaphor
http://www.garlic.com/~lynn/2006l.html#13 virtual memory
http://www.garlic.com/~lynn/2006r.html#35 REAL memory column in SDSF
http://www.garlic.com/~lynn/2006r.html#37 REAL memory column in SDSF
http://www.garlic.com/~lynn/2006r.html#39 REAL memory column in SDSF
http://www.garlic.com/~lynn/2006t.html#18 Why magnetic drums was/are worse than disks ?
http://www.garlic.com/~lynn/2006v.html#43 The Future of CPUs: What's After Multi-Core?
http://www.garlic.com/~lynn/2006y.html#9 The Future of CPUs: What's After Multi-Core?
http://www.garlic.com/~lynn/2007o.html#32 reading erased bits


Nick Maclaren

unread,
Mar 7, 2008, 3:40:32 PM3/7/08
to

In article <m363vyz...@garlic.com>,
Anne & Lynn Wheeler <ly...@garlic.com> writes:

|> Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
|> > I think I understand what you are saying, but given the availability
|> > of scatter/gather I/O (data chaining in IBM speak), such aggregation
|> > is just as easy from real memory as it is from extended. Thus there
|> > is no advantage to moving the pages to some special location to
|> > aggregate them, and you have to pay the cost of the move.
|>
|> re:
|> http://www.garlic.com/~lynn/2008f.html#3 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical ...
|>
|> 3090 extended store had a synchronous fast move instruction custom for
|> the extended store ...

Yes, but THAT extended store was designed to be cheaper by providing
ONLY such operations - i.e. the hardware couldn't be used as general
memory. I thought that this thread was about using part of general
memory as an in-store paging device.


Regards,
Nick Maclaren.

Anne & Lynn Wheeler

unread,
Mar 7, 2008, 4:57:57 PM3/7/08
to

nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> Yes, but THAT extended store was designed to be cheaper by providing
> ONLY such operations - i.e. the hardware couldn't be used as general
> memory. I thought that this thread was about using part of general
> memory as an in-store paging device.

just mentioning runup to the situation ... and it actually wasn't
designed to be cheaper ... they would have preferred to not have
extended store at all. 3090 system thruput could benefit from the
additional storage.

3090 extended store was same physical memory technology but physical
packaging resulted in different latency ... so they somewhat backed into
putting it on a different bus ... that was wider and only operated with
special move instruction (it wasn't as good as single global addressable
memory ... but it was much better than having to do physical i/o).

later machines eliminated the physical packaging differentiation ...
but provided microcode configuration option to use regular memory
technology as (microcode/hardware) emulated extended storage ... and
customers could select to treat storage as single global address space
... or partitioned between standard addressable storage and emulated
extended storage.

posts in this thread:
http://www.garlic.com/~lynn/2008e.html#40 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008f.html#3 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008f.html#6 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical

for other thread drift ... current generations of machines have separate
dedicated HSA (hardware storage area) ... uses include "system" disk
record caching:
http://www.garlic.com/~lynn/2008d.html#91 Z10 presentation on 26 Feb
http://www.garlic.com/~lynn/2008e.html#31 IBM announce Z10 ..why so fast...any problem on z 9
http://www.garlic.com/~lynn/2008e.html#39 IBM announce Z10 ..why so fast...any problem on z 9

as previously mentioned, in the late 70s, we did a special
implementation that would capture all disk record number access ... and
it was installed on several different systems in the san jose area
... capturing live activity for various kinds of application work
(engineering timesharing, administrative business, development,
etc). The information was used in I/O cache simulation ... which looked
at a variety of different caching strategies. One of the findings was
that for a given amount of electronic storage (for cache), optimal
system performance was with using all the storage as single global
system cache ... as opposed to various kinds of partitioning strategies
(where total aggregate electronic cache storage remained the
same)... channel level caches, control unit level caches, device level
caches.

This cache simulation work tended to validate my earlier findings in the
60s (as undergraduate) that "global LRU" implementation strategies
outperformed "local LRU" implementations (where "local LRU" is
equivalent to partitioning storage into various kinds of subsets).
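
To make the global-versus-partitioned comparison concrete, here is a small
Python sketch. It is not a reproduction of the trace-driven simulation
described above: the trace is a contrived, hypothetical one (device "a"
cycling over six records, device "b" rereading one record), chosen only to
show how a fixed split of the same total storage can hurt.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses for an LRU cache of the given capacity."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)         # most recently used goes last
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return misses


# Hypothetical workload: skewed demand across two devices.
trace = []
for i in range(60):
    trace.append(("a", i % 6))
    trace.append(("b", 0))

# One global cache of 8 entries ...
global_misses = lru_misses(trace, 8)

# ... versus the same 8 entries split 4 + 4 per device.
partitioned_misses = sum(
    lru_misses([rec for rec in trace if rec[0] == dev], 4)
    for dev in ("a", "b")
)

print(global_misses, partitioned_misses)
```

On this trace the global cache holds all seven distinct records, while the
four-entry partition for device "a" thrashes on its six-record cycle and
device "b" leaves most of its partition idle.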

These findings would also tend to support that single global addressable
area would outperform subsetting it into two different areas ... one of
them a simulated electronic paging device. For subsetting with some sort
of emulated electronic paging device, to provide improved throughput
... would indicate some kind of deficiency/idiosyncrasy in the
implementation (that the configuration variations compensate for).

Morten Reistad

unread,
Mar 8, 2008, 11:03:57 AM3/8/08
to
In article <us3Aj.286050$MJ6.1...@bgtnsc05-news.ops.worldnet.att.net>,

The "large memory" patches to BSD and Linux, to support >4G (3.75G, actually)
on a 32-bit ISA work in this fashion, plus some remapping of page tables.

Now there are 4 layers of memory; the 640k for booting, the 16M for
old-style DMA and 24-bit processes, 4G for 32-bit access, and the
"large memory" beyond 4G.

-- mrr

Anne & Lynn Wheeler

unread,
Mar 8, 2008, 1:32:08 PM3/8/08
to
Morten Reistad <fi...@last.name> writes:
> The "large memory" patches to BSD and Linux, to support >4G (3.75G, actually)
> on a 32-bit ISA work in this fashion, plus some remapping of page tables.
>
> Now there are 4 layers of memory; the 640k for booting, the 16M for
> old-style DMA and 24-bit processes, 4G for 32-bit access, and the
> "large memory" beyond 4G.

re:
http://www.garlic.com/~lynn/2008e.html#40 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008f.html#3 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008f.html#6 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008f.html#8 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical

the "original" of this was 3033 >16mbyte support.

360/67 had 24bit and 32bit addressing modes ... 370 retrenched to just
24bit addressing (real & virtual) modes.

come up to 3033 ... disks had relatively slower system thruput, fixed
storage requirements were increasing, and 3033 was running into a real
thruput crunch with the 16mbyte real memory constraint.

370 architecture had 16bit page table entry ... 12 bit real (4k byte)
page number, 2 defined bits and 2 undefined bits. 3033 did a hack to
allow addressing 64mbytes ... by allowing the 2 undefined bits to be
concatenated to the 12bit page number (for 14bit number). real & virtual
addressing was still limited to 24bit ... but pagetable/tlb could
translate that into 26bit address.
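
The arithmetic behind the 64mbyte figure can be sketched in a few lines of
Python. The function name and the placement of the two extra bits as the
high-order bits of the real page number are assumptions made for
illustration; the text only says the bits were concatenated to the 12bit
page number.

```python
PAGE_SHIFT = 12  # 4k-byte pages


def real_address(page_number_12, extension_2, virtual_address):
    """Form a 26-bit real address from a 12-bit page number plus 2 extra bits.

    Assumption: the 2 formerly undefined bits act as the high-order bits
    of a 14-bit real page number.
    """
    assert 0 <= page_number_12 < (1 << 12)
    assert 0 <= extension_2 < (1 << 2)
    frame = (extension_2 << 12) | page_number_12        # 14-bit page number
    offset = virtual_address & ((1 << PAGE_SHIFT) - 1)  # byte within the page
    return (frame << PAGE_SHIFT) | offset


# Without the extra bits the highest reachable page is just under 16mbytes.
print(hex(real_address(0xFFF, 0b00, 0x123456)))
# With both extra bits set the same entry reaches just under 64mbytes.
print(hex(real_address(0xFFF, 0b11, 0x123456)))
# 2**14 pages of 4k bytes each.
print((1 << 14) * 4096 // (1 << 20), "mbytes")
```

The virtual address stays 24bit in this sketch; only the translated result
grows to 26 bits, which matches the description above.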

normal channel programs were 24bit ... but "IDALs" (indirect data access
lists) were introduced for 370 channel programs ... which had 31bit
address field for data transfers. the original objective of IDALs was
that CCWs in channel programs have to be processed synchronously and
prefetching wasn't allowed. the "problem" was scatter/gather i/o
transfers crossing page boundaries and non-contiguous ... which in 360
required data-chaining. however, there were some situations where there
would be overrun because the (data-chaining) chained to CCW couldn't be
fetched within the time constraints. IDAL addresses were allowed to be
prefetched ... which facilitated converting virtually contiguous data
transfers ... to non-contiguous real data transfers (because of
non-contiguous virtual pages).

in any case, IDAL allowed for channel programs on 3033 to do i/o into
and out of real storage above the 16mbyte line (although all the actual
channel programs had to exist below the 16mbyte line).

the remaining problem was that there were certain operations requiring
virtual pages to be below the 16mbyte line ... if the page had
originally been above the 16mbyte line ... it then would have to be
moved below the line. an early design called for writing such a page to
disk and then bringing it back in below the 16mbyte line. i did a hack
for them that fiddled some page table entries and used a MVCL in special
virtual address space to move virtual pages back and forth between below
the 16mbyte line and above the 16mbyte line.

old email reference trying to explain that the page table, MVCL hack was
much better than the I/O strategy.
http://www.garlic.com/~lynn/2006t.html#email800121

old email discussing perturbations to global LRU with two distinct
areas for virtual pages (above and below the 16mbyte line)
http://www.garlic.com/~lynn/2007b.html#email860124

other old email discussing global LRU
http://www.garlic.com/~lynn/lhwemail.html#globallru

posts mentioning page replacement algorithms
http://www.garlic.com/~lynn/subtopic.html#wsclock

past posts discussing the 3033 >16mbyte hack
http://www.garlic.com/~lynn/2000d.html#82 "all-out" vs less aggressive designs (was: Re: 36 to 32 bit transition)
http://www.garlic.com/~lynn/2003f.html#4 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#24 New RFC 3514 addresses malicious network traffic
http://www.garlic.com/~lynn/2004c.html#34 Playing games in mainframe
http://www.garlic.com/~lynn/2004e.html#8 were dumb terminals actually so dumb???
http://www.garlic.com/~lynn/2004e.html#41 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004f.html#38 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004k.html#44 Wars against bad things
http://www.garlic.com/~lynn/2005.html#34 increasing addressable memory via paged memory?
http://www.garlic.com/~lynn/2005p.html#19 address space
http://www.garlic.com/~lynn/2005q.html#30 HASP/ASP JES/JES2/JES3
http://www.garlic.com/~lynn/2005u.html#44 POWER6 on zSeries?
http://www.garlic.com/~lynn/2006.html#13 VM maclib reference
http://www.garlic.com/~lynn/2006l.html#2 virtual memory
http://www.garlic.com/~lynn/2006m.html#27 Old Hashing Routine
http://www.garlic.com/~lynn/2006w.html#23 Multiple mappings
http://www.garlic.com/~lynn/2007b.html#34 Just another example of mainframe costs
http://www.garlic.com/~lynn/2007g.html#59 IBM to the PCM market(the sky is falling!!!the sky is falling!!)
http://www.garlic.com/~lynn/2007r.html#56 CSA 'above the bar'


Brian Boutel

unread,
Mar 9, 2008, 12:10:47 AM3/9/08
to

ISTR that VMS (at least in the early part of its life) kept some pages
as a staging area, i.e. they were marked for paging out, but it hadn't
yet happened, so if they came back into use they could be retrieved
cheaply. I don't know whether copying of contents was used, or simply
updating tables.

This seemed to work, although presumably performance was affected by
tuning the relative sizes of the active and staging regions.

In theory, some page replacement strategies have the property that
adding page frames cannot degrade performance. FIFO is provably[1] not
one of these. I think that VMS actually used FIFO for the decision to
mark pages, but the chance of cheap recovery if not too much time had
elapsed since marking it for paging compensated for its bad side. On the
plus side, FIFO is cheap to implement.


[1] Coffman and Denning give the sequence 1 2 3 4 1 2 5 1 2 3 4 5 first
with three page frames and then with 4. The number of page faults
increases from 9 to 10.
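
The footnote's numbers can be checked with a short FIFO simulation. This
Python sketch is only a checking aid; the function name is made up for the
example, and the reference string is the one quoted above.

```python
from collections import deque


def fifo_faults(refs, frames):
    """Count page faults under FIFO replacement with a fixed frame count."""
    resident = set()
    order = deque()  # oldest resident page on the left
    faults = 0
    for page in refs:
        if page in resident:
            continue  # hit: FIFO does not reorder on use
        faults += 1
        if len(resident) == frames:
            victim = order.popleft()
            resident.remove(victim)
        resident.add(page)
        order.append(page)
    return faults


REFS = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(REFS, 3), fifo_faults(REFS, 4))  # 9 10
```

The fourth frame delays the eviction of pages 1 and 2 just long enough
that they are thrown out right before they are needed again.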


Note, however that I haven't thought much about this for 25 years.


--brian

--
Wellington, New Zealand

"What's life? Life's easy. A quirk of matter. Nature's way of keeping
meat fresh."

Anne & Lynn Wheeler

unread,
Mar 9, 2008, 3:24:11 AM3/9/08
to

Brian Boutel <fa...@fake.org> writes:
> ISTR that VMS (at least in the early part of its life) kept some pages
> as a staging area, i.e they were marked for paging out, but it hadn't
> yet happened, so if they came back into use they could be retrieved
> cheaply. I don't know whether copying of contents was used, or simply
> updating tables.

standard LRU (whether global LRU or local LRU) tended to degrade to FIFO
... as previously mentioned, I had done a sleight of hand coding trick
where LRU would degrade to RANDOM instead of FIFO ... i.e.
http://www.garlic.com/~lynn/2008f.html#3 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical ...

and showed much better thruput characteristics.

synchronous page replacement ... when there was page fault ... invoke
the page replacement algorithm ... find a page for replacement ... and
allocate that page for replacement by the faulting page ... could run
into additional latency delays ... if the page selected for replacement
had been changed and first requires writing to backing store ... before
it can be "replaced".

asynchronous page replacement ... would attempt to keep a small pool of
pages immediately available for replacement ... i.e. the production of
pages to be replaced running slightly ahead of the consumption of the
pages ... attempting to mitigate the replace latency (when the page
needed to be written). It was possible to "reclaim" pages in such a pool
... if an application page faulted on one ... prior to it being
actually allocated for replacement. Circa 1980, somebody had thought up
this brilliant strategy for the favorite son operating system ... and
also called me about implementing such a "reclaim" strategy for VM. I
commented that since my days as an undergraduate in the 60s, it had
never occurred to me not to have implemented such a "reclaim" strategy
(in the implementation i had done for cp67 and later vm370, the
page table entry invalid flag was set ... but the page number was
not actually zeroed ... until the real page number was assigned for some
other virtual page; reclaim then just involved updating table info).
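
A toy model of the reclaim described above, in Python. The class, the
method names, and the data layout are hypothetical and stand in for the
real page table structures; the only behavior taken from the text is that
the entry is flagged invalid while the page number is left in place until
the frame is actually handed to another virtual page.

```python
from collections import deque


class Pager:
    def __init__(self):
        self.pte = {}             # virtual page -> {"frame": ..., "invalid": ...}
        self.free_pool = deque()  # frames selected for replacement, not yet reused
        self.owner = {}           # frame -> virtual page whose data it still holds

    def map_page(self, vpage, frame):
        self.pte[vpage] = {"frame": frame, "invalid": False}
        self.owner[frame] = vpage

    def select_for_replacement(self, vpage):
        entry = self.pte[vpage]
        entry["invalid"] = True                # next reference will fault
        self.free_pool.append(entry["frame"])  # page number stays in the entry

    def fault(self, vpage):
        entry = self.pte[vpage]
        frame = entry["frame"]
        if frame is not None and self.owner.get(frame) == vpage:
            self.free_pool.remove(frame)       # reclaim: table update only
            entry["invalid"] = False
            return "reclaimed"
        return "page-in required"

    def allocate_frame(self, new_vpage):
        frame = self.free_pool.popleft()
        old = self.owner[frame]
        self.pte[old]["frame"] = None          # only now is the old number cleared
        self.map_page(new_vpage, frame)
        return frame


p = Pager()
p.map_page("A", 0)
p.map_page("B", 1)
p.select_for_replacement("A")
p.select_for_replacement("B")
print(p.fault("A"))       # frame 0 was still in the pool: reclaimed
print(p.allocate_frame("C"))
print(p.fault("B"))       # frame 1 now belongs to C: page-in required
```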

then there is "duplicate" vis-a-vis "no-duplicate" strategy with
regard to managing secondary backing store location.

In the duplicate strategy ... the backing store location would remain
allocated, even if a page had been fetched into main memory for use. If
the page was subsequently selected for replacement and hadn't been
changed, the backing store copy was still valid ... the write
could be avoided and the real storage location could be immediately
available for use.

In the no-duplicate strategy ... the backing store location is
deallocated when the page is brought into main memory. Subsequent
selection of the page for replacement would always require it to be
written out.

The duplicate strategy could reduce page i/o traffic ... and also reduce
latency ... especially in synchronous page replacement strategy. The
no-duplicate strategy increases the amount of page i/o traffic, but
reduces the amount of secondary storage required ... especially in
situations with large real storage configurations and somewhat
constrained disk space (i.e. eliminates duplicates of a page both on
disk and in real memory).
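
The write-traffic side of that trade-off can be sketched with a small
event counter in Python. The event names and the function are invented for
illustration, and the model ignores the disk-space side entirely; it only
counts how many page-outs actually need a write under each strategy.

```python
def page_out_writes(events, duplicate):
    """Count backing-store writes for a stream of (op, page) events.

    ops: "in"    page fetched from backing store into real memory
         "dirty" page changed while resident
         "out"   page selected for replacement
    """
    clean_copy = set()  # resident pages whose backing-store copy is still current
    writes = 0
    for op, page in events:
        if op == "in":
            if duplicate:
                clean_copy.add(page)      # backing store slot stays allocated
            else:
                clean_copy.discard(page)  # slot is deallocated on fetch
        elif op == "dirty":
            clean_copy.discard(page)      # disk copy is now stale
        elif op == "out":
            if page not in clean_copy:
                writes += 1               # must be written before frame reuse
            clean_copy.discard(page)
    return writes


EVENTS = [
    ("in", "p"), ("out", "p"),                  # read-only use of p
    ("in", "p"), ("dirty", "p"), ("out", "p"),  # p changed this time
    ("in", "q"), ("out", "q"),                  # read-only use of q
]
print(page_out_writes(EVENTS, duplicate=True),
      page_out_writes(EVENTS, duplicate=False))
```

In this model the duplicate strategy writes only the page that was
changed, while the no-duplicate strategy writes on every page-out.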

Another early disagreement with the people working on moving the
favorite son batch operating system from purely real storage into
virtual memory operation (initially os/vs2 svs and then os/vs2
mvs). their modeling had "found" that a page replacement strategy could
reduce page i/o activity and latency if it were to first explicitly
search for a non-changed page ... before resorting to selecting a
changed page. My contention was that it would horribly distort any LRU
operational characteristics ... but they went ahead and did it
anyway. It wasn't until a number of years into MVS releases that they
realized that this "micro optimization" was selecting high-use, shared
program execution virtual pages before lower use private application
(changed) data pages. Of course this was only applicable to the
"duplicate" strategy where it was possible to avoid the write of a
non-changed page selected for replacement.

other posts in this thread:
http://www.garlic.com/~lynn/2008e.html#40 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008f.html#6 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008f.html#8 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008f.html#12 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical

misc. old email discussing things like global LRU vis-a-vis local LRU
http://www.garlic.com/~lynn/lhwemail.html#globallru

and previous posts mentioning page replacement algorithms
http://www.garlic.com/~lynn/subtopic.html#wsclock

previous posts specifically mentioning dup/no-dup strategies:
http://www.garlic.com/~lynn/93.html#12 managing large amounts of vm
http://www.garlic.com/~lynn/93.html#13 managing large amounts of vm
http://www.garlic.com/~lynn/94.html#9 talk to your I/O cache
http://www.garlic.com/~lynn/2000d.html#13 4341 was "Is a VAX a mainframe?"
http://www.garlic.com/~lynn/2001i.html#42 Question re: Size of Swap File
http://www.garlic.com/~lynn/2001l.html#55 mainframe question
http://www.garlic.com/~lynn/2001n.html#78 Swap partition no bigger than 128MB?????
http://www.garlic.com/~lynn/2002b.html#10 hollow files in unix filesystems?
http://www.garlic.com/~lynn/2002b.html#16 hollow files in unix filesystems?
http://www.garlic.com/~lynn/2002b.html#19 hollow files in unix filesystems?
http://www.garlic.com/~lynn/2002b.html#20 index searching


http://www.garlic.com/~lynn/2002e.html#11 What are some impressive page rates?
http://www.garlic.com/~lynn/2002f.html#20 Blade architectures

http://www.garlic.com/~lynn/2002f.html#26 Blade architectures

http://www.garlic.com/~lynn/2003o.html#62 1teraflops cell processor possible?

http://www.garlic.com/~lynn/2004g.html#17 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004g.html#18 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004g.html#20 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004h.html#19 fast check for binary zeroes in memory
http://www.garlic.com/~lynn/2004i.html#1 Hard disk architecture: are outer cylinders still faster than inner cylinders?
http://www.garlic.com/~lynn/2005c.html#27 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005m.html#28 IBM's mini computers--lack thereof
http://www.garlic.com/~lynn/2006c.html#8 IBM 610 workstation computer
http://www.garlic.com/~lynn/2006e.html#45 using 3390 mod-9s
http://www.garlic.com/~lynn/2006f.html#18 how much swap size did you take?
http://www.garlic.com/~lynn/2006i.html#41 virtual memory

http://www.garlic.com/~lynn/2007c.html#0 old discussion of disk controller chache
http://www.garlic.com/~lynn/2007e.html#60 FBA rant
http://www.garlic.com/~lynn/2007l.html#61 John W. Backus, 82, Fortran developer, dies

Nick Maclaren

unread,
Mar 9, 2008, 6:08:27 AM3/9/08
to

In article <fqvrgn$cmn$1...@registered.motzarella.org>,

Brian Boutel <fa...@fake.org> writes:
|>
|> ISTR that VMS (at least in the early part of its life) kept some pages
|> as a staging area, i.e they were marked for paging out, but it hadn't
|> yet happened, so if they came back into use they could be retrieved
|> cheaply. I don't know whether copying of contents was used, or simply
|> updating tables.

So did MVS and, I believe, many other mainframes.


Regards,
Nick Maclaren.

Morten Reistad

unread,
Mar 9, 2008, 6:32:20 AM3/9/08
to
In article <fr0cur$jn0$1...@gemini.csx.cam.ac.uk>,

This is a standard part of most paging algorithms.

See <http://en.wikipedia.org/wiki/Page_replacement_algorithm>

-- mrr

Peter Flass

unread,
Mar 9, 2008, 10:14:08 AM3/9/08
to
Brian Boutel wrote:
>
> ISTR that VMS (at least in the early part of its life) kept some pages
> as a staging area, i.e they were marked for paging out, but it hadn't
> yet happened, so if they came back into use they could be retrieved
> cheaply. I don't know whether copying of contents was used, or simply
> updating tables.
>

I think most, maybe all, systems do this these days.

Anne & Lynn Wheeler

unread,
Mar 9, 2008, 12:07:39 PM3/9/08
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> So did MVS and, I believe, many other mainframes.

re:
http://www.garlic.com/~lynn/2008f.html#19 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical ...

as mentioned ... not in early releases. somebody got an award for coming
up with the idea for mvs (very late 70s) ... and then i got a call asking if
they could also apply it to vm (and get another award?). i replied i had
never not done it that way since undergraduate days in the 60s.

i somewhat facetiously suggested that instead of giving awards for
making things the way they should have been originally ... the original
people responsible should be given negative awards.

Jeff Jonas

unread,
Mar 24, 2008, 7:24:18 PM3/24/08
to
>> The "large memory" patches to BSD and Linux, to support >4G (3.75G, actually)
>> on a 32-bit ISA work in this fashion, plus some remapping of page tables.
>>
>> Now there are 4 layers of memory; the 640k for booting, the 16M for
>> old-style DMA and 24-bit processes, 4G for 32-bit access, and the
>> "large memory" beyond 4G.

Non IBM systems suffered similar problems.

"trampoline buffers" were placed in low RAM
for peripherals that could not address all of RAM.
This inverted the concept of what made "good" controllers:
- PIO (programmed I/O) were the slowest
since the CPU had to move every word to/from the controller
but that meant anything the CPU can address can be a buffer.
- DMA were considered a little better
but still depended on the motherboard/system's DMA controller to access RAM
- "bus mastering" controllers perform the DMA themselves,
but only if they have enough address lines!
That's one BIG strike against the ISA bus :-(


Even z80 systems had similar situations.
The WaveMate Bullet system had 128k of DRAM
but with only 16 address lines, the Z80 can only directly address 64k.
Only the upper 48k of RAM was switchable, leaving 16k for system & common.
What of the other 16k? That was used for buffering since the DMA chip
had additional logic to add a 17th address line for read/write.
I don't think the CPU could access it directly,
analogous to the mainframe's extended memory.

-- Jeffrey Jonas
jeffj@panix(dot)com
The original Dr. JCL and Mr .hide
