
AMD 64bit Hammer CPU and VM


gat1024

Dec 13, 2000, 1:48:30 PM
AMD's approach to 64-bit computing is to extend all of the x86 registers and
tables out to 64 bits. So now the paging hardware uses 64-bit entries for
its page tables. Wouldn't this pose a problem for VMMs designed around
32-bit CPUs?

On 32-bit x86, the paging hardware uses forward-referencing tables. You
could theoretically maintain a global map of every page in the address
space in about 4MB and keep the entire table in RAM without swapping.
Using this technique on the Hammer would yield a page table 2^32 times
larger. You'd have to resort to some sort of sparse or multi-level data
structure (more than the 3 levels already present). But since there is
no hardware support for this sort of design, performance will probably
suffer.
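
A quick back-of-the-envelope check of those figures (assuming 4 KB pages,
4-byte entries on the 32-bit side and 8-byte entries on the 64-bit side;
the flat 64-bit table is of course hypothetical, which is the point):

    #include <stdio.h>

    int main(void)
    {
        /* 32-bit x86: 2^32 bytes / 4 KB pages = 2^20 pages; with
           4-byte entries a flat map of the whole space is 4 MB. */
        unsigned long long entries32 = 1ULL << (32 - 12);
        printf("32-bit flat table: %llu MB\n", entries32 * 4 >> 20);

        /* Flat map of a full 64-bit space with 8-byte entries:
           2^52 entries * 8 bytes = 2^55 bytes -- never buildable,
           hence sparse or multi-level structures. */
        unsigned long long entries64 = 1ULL << (64 - 12);
        printf("64-bit flat table: %llu GB\n", (entries64 * 8) >> 30);
        return 0;
    }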

The question is, knowing that you'd have to change your VMM design anyway,
why didn't AMD make it easier on OS programmers and go with some sort of
backward-referencing design like on the PowerPC?

Does anyone think the design will negatively affect the performance of
an OS's VM subsystem? Can we expect so-so performance from AMD's new
chip in real-world tests?

Thanks.



Aaron R. Kulkis

Dec 13, 2000, 5:49:57 PM
gat1024 wrote:
>
> AMD's approach to 64bit computing is to extend all of the x86 reg's and
> tables out to 64bits. So now the paging hardware uses 64bit entries for
> it's page tables. Wouldn't this pose a problem for VMM's designed around
> 32bit CPU's?

Any programmer who depends on a specific architecture's wordsize
should be flogged


....daily....


>
> On 32bit x86, the paging hardware uses forward referencing tables. You
> could theoretically maintain a global map of every page in the address
> space in about 4MB. You could keep the entire table in RAM without
> swapping. Using this technique on the Hammer would yield a page table
> 2^32 larger. You'd have to resort some sort of sparse or multi-level
> data structure (more than the 3 levels already present). But since there
> is not hardware support for this sort of design, performance will
> probably suffer.
>
> Question is, knowing that you'd have to change your VMM design anyway,
> why didn't AMD make it easier on OS programmers and go with some sort of
> backward referencing design like on the PowerPC?
>
> Does anyone think the design will negatively affect the performance of
> an OS's VM subsystem? Can we expect so-so performance from AMD's new
> chip on real world tests?
>
> Thanks.
>


--
Aaron R. Kulkis
Unix Systems Engineer
DNRC Minister of all I survey
ICQ # 3056642

Terje Mathisen

Dec 13, 2000, 5:38:30 PM
gat1024 wrote:
> On 32bit x86, the paging hardware uses forward referencing tables. You
> could theoretically maintain a global map of every page in the address
> space in about 4MB. You could keep the entire table in RAM without
> swapping. Using this technique on the Hammer would yield a page table
> 2^32 larger. You'd have to resort some sort of sparse or multi-level
> data structure (more than the 3 levels already present). But since there
> is not hardware support for this sort of design, performance will
> probably suffer.
>
> Question is, knowing that you'd have to change your VMM design anyway,
> why didn't AMD make it easier on OS programmers and go with some sort of
> backward referencing design like on the PowerPC?

AFAIK, IBM have gone away from the reverse tables, since they made it
harder to share memory pages using mapping tricks, i.e. it is hard to
make the same page appear at multiple virtual addresses, right?

Terje

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Daniel Wilcox

Dec 13, 2000, 7:12:30 PM
Aaron R. Kulkis <aku...@yahoo.com> wrote:
> gat1024 wrote:
>>
>> AMD's approach to 64bit computing is to extend all of the x86 reg's and
>> tables out to 64bits. So now the paging hardware uses 64bit entries for
>> it's page tables. Wouldn't this pose a problem for VMM's designed around
>> 32bit CPU's?

> Any programmer who depends on a specific architecture's wordsize
> should be flogged

Crucifixion is probably more suitable; the last thing we want is another
martyr!

Don't trust your pointers!

> ....daily....


>>
>> On 32bit x86, the paging hardware uses forward referencing tables. You
>> could theoretically maintain a global map of every page in the address
>> space in about 4MB. You could keep the entire table in RAM without
>> swapping. Using this technique on the Hammer would yield a page table
>> 2^32 larger. You'd have to resort some sort of sparse or multi-level
>> data structure (more than the 3 levels already present). But since there
>> is not hardware support for this sort of design, performance will
>> probably suffer.
>>
>> Question is, knowing that you'd have to change your VMM design anyway,
>> why didn't AMD make it easier on OS programmers and go with some sort of
>> backward referencing design like on the PowerPC?
>>
>> Does anyone think the design will negatively affect the performance of
>> an OS's VM subsystem? Can we expect so-so performance from AMD's new
>> chip on real world tests?
>>
>> Thanks.
>>


> --
> Aaron R. Kulkis
> Unix Systems Engineer
> DNRC Minister of all I survey
> ICQ # 3056642

--
Research Associate
High Performance Systems Group Office: +44 (0)2476 522485
Department of Computer Science Mobile: +44 (0)7974 912552
University of Warwick Fax: +44 (0)2476 573024
CV4 7AL (UK)

gat1024

Dec 13, 2000, 9:18:47 PM
In article <3A37FA66...@hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> gat1024 wrote:
> > ** snip **
> >
> > Question is, knowing that you'd have to change your VMM design anyway,
> > why didn't AMD make it easier on OS programmers and go with some sort of
> > backward referencing design like on the PowerPC?
>
> AFAIK, IBM have gone away from the reverse tables, since they made it
> harder to share memory pages using mapping tricks, i.e. it is hard to
> make the same page appear at multiple virtual addresses, right?
>
> Terje
>
> --
> - <Terje.M...@hda.hydro.com>
> Using self-discipline, see http://www.eiffel.com/discipline
> "almost all programming can be viewed as an exercise in caching"
>
I didn't know IBM had moved away from that. It would introduce a slight
incompatibility with their PowerPC architecture spec, which I thought
POWER4 adhered to. Maybe not.

I hear that backward referencing is harder to work with for the reason
you point out, but when the address space reaches a certain size, it
becomes a necessary evil. You'd spend more time traversing structures
and searching for pages with the forward way vs. quick hashing (via
dedicated circuitry) and locating pages in the reverse way.
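
Roughly the kind of lookup I have in mind -- a toy hashed/inverted table
in C, with invented names, sizes, and hash function, just to show that the
search cost is one hash plus a short chain rather than a multi-level walk:

    #include <stdint.h>

    /* Toy hashed ("inverted-style") page table: one entry per physical
       frame, chained on hash collisions.  All names and sizes are made up. */
    #define NFRAMES  (1u << 20)            /* e.g. 4 GB of 4 KB frames */
    #define NBUCKETS (1u << 18)

    struct ipt_entry {
        uint64_t vpn;                      /* virtual page number mapped by this frame */
        uint32_t asid;                     /* address-space id */
        int32_t  next;                     /* next frame in the hash chain, -1 = end */
    };

    static struct ipt_entry ipt[NFRAMES];
    static int32_t bucket[NBUCKETS];       /* head frame of each chain */

    void ipt_init(void)
    {
        for (uint32_t i = 0; i < NBUCKETS; i++) bucket[i] = -1;
        for (uint32_t i = 0; i < NFRAMES; i++)  ipt[i].next = -1;
    }

    static uint32_t hash(uint64_t vpn, uint32_t asid)
    {
        return (uint32_t)((vpn ^ (vpn >> 18) ^ asid) & (NBUCKETS - 1));
    }

    /* Returns the physical frame for (asid, vpn), or -1 if unmapped;
       the frame index itself *is* the translation. */
    int32_t ipt_lookup(uint32_t asid, uint64_t vpn)
    {
        for (int32_t f = bucket[hash(vpn, asid)]; f != -1; f = ipt[f].next)
            if (ipt[f].vpn == vpn && ipt[f].asid == asid)
                return f;
        return -1;                         /* miss: fall back to the OS's own maps */
    }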

I'm just thinking that AMD's design, while giving you a huge address
space, wouldn't necessarily allow you to work any faster than an
equivalently clocked 32bit Athlon. (Disregarding superlarge datasets of
course.) It might even be slightly slower in the area where the OS
spends a good deal of its time -- handling page faults and swapping
data.

Evelyne Michaud

Dec 14, 2000, 1:09:30 AM
"Aaron R. Kulkis" wrote:
> gat1024 wrote:
> >
> > AMD's approach to 64bit computing is to extend all of the x86 reg's and
> > tables out to 64bits. So now the paging hardware uses 64bit entries for
> > it's page tables. Wouldn't this pose a problem for VMM's designed around
> > 32bit CPU's?
>
> Any programmer who depends on a specific architecture's wordsize
> should be flogged
>
> ....daily....

An x86 OS VM programmer should not depend on the architecture word
size?

Right. I'd like to see you do this.

If you stick to high-level applications then I would more readily agree
with you. It is already difficult in low-level ones, like debuggers or
compilers. In a VM system, it is plain nonsense.

Eric

ttk_ciar

Dec 14, 2000, 3:44:53 AM

> From: Evelyne Michaud <evelyne...@pobox.com>
> Date: Thu, 14 Dec 2000 06:09:30 GMT

>
>> Any programmer who depends on a specific architecture's wordsize
>> should be flogged
>>
>> ....daily....
>
> An x86 OS VM programmer should not depend on the architecture word size?

Not unless they wanted to limit their OS to only running on x86
hardware, or resigned themselves to needlessly having to rewrite a
lot of code every time they ported their OS to a different arch.

> Right. I'd like to see you do this.
>
> If you stick to high-level applications then I would more readily agree
>with you. It is already difficult in low-level ones, like debuggers or
>compilers. In a VM system, it is plain nonsense.

Please qv _Understanding the Linux Kernel_, by Bovet and Cesati,
published by O'Reilly, pp 53-55. Linux's VM system is (wisely) not
dependent on any particular architecture's address word size.

-- TTK

Peter Boyle

Dec 14, 2000, 5:45:44 AM

On 14 Dec 2000, it was written:

Hi,

I think his point is that even e.g. the Linux kernel has to
have significant swathes of system-dependent code in
arch/<arch> and include/asm-<arch>.

Of course the interfaces are well thought out, making porting easy;
however, at some level the code *does* know...

Peter Boyle pbo...@physics.gla.ac.uk

Jan Vorbrueggen

Dec 14, 2000, 9:09:06 AM
gat1024 <gat...@my-deja.com> writes:

> You'd have to resort some sort of sparse or multi-level data structure
> (more than the 3 levels already present). But since there is not hardware
> support for this sort of design, performance will probably suffer.

Tell that to Alpha, which has exactly that structure and does software page
table walking (in PAL code). The current implementation uses three-level page
tables, IIRC, for up to 53-56 bits (again IIRC) of physical memory, but large
pages and four levels are architected for the full 64 bits.
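
For reference, the reach you get from that kind of walk with the usual
Alpha parameters (8 KB pages and 8-byte PTEs, so 1024 entries per table
page); treat the exact numbers as illustrative:

    #include <stdio.h>

    int main(void)
    {
        const int page_bits  = 13;             /* 8 KB pages */
        const int index_bits = page_bits - 3;  /* 8-byte PTEs -> 1024 per table page */

        for (int levels = 3; levels <= 4; levels++)
            printf("%d levels: %d-bit virtual address (%d offset + %d index bits)\n",
                   levels, page_bits + levels * index_bits,
                   page_bits, levels * index_bits);
        /* 3 levels -> 43 bits, 4 levels -> 53 bits; larger pages widen
           both the offset and each index field further. */
        return 0;
    }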

Jan

Anton Ertl

Dec 14, 2000, 11:55:09 AM
In article <919am2$olr$1...@nnrp1.deja.com>,

gat1024 <gat...@my-deja.com> writes:
>I hear that backward referencing is harder to work with for the reason
>you point out, but when the address space reaches a certain size, it
>becomes a necessary evil. You'd spend more time traversing structures
>and searching for pages with the forward way vs. quick hashing (via
>dedicated circuitry) and locating pages in the reverse way.

Well, some people did actually evaluate such things and IIRC the
findings contradict your claims:

@InProceedings{jacob&mudge98,
author = {Bruce L. Jacob and Trevor N. Mudge},
title = {A Look at Several Memory Management Units,
TLB-Refill Mechanisms, and Page Table Organizations},
crossref = {asplos98},
pages = {295--306},
annote = {Describes several MMU/TLB/Page Table Organizations
of various machines and their operating systems, and
evaluates and analyses the performance impact of the
various choices.}
}

@Proceedings{asplos98,
title = "Architectural Support for Programming Languages and
Operating Systems (ASPLOS-VIII)",
booktitle = "Architectural Support for Programming Languages and
Operating Systems (ASPLOS-VIII)",
year = "1998",
key = "ASPLOS-VIII"
}

@InProceedings{dougan+99,
author = {Cort Dougan and Paul Mackeras and Victor Yodaiken},
title = {Optimizing the Idle Task and Other {MMU} Tricks},
crossref = {osdi99},
pages = {229--237},
url = {http://hq.fsmlabs.com/~cort/papers/linuxppc-mm/linuxppc-mm.ps},
annote = {Discusses various optimizations of memory management
stuff in Linux/PPC. The optimizations are: mapping
the kernel with BAT (block address translation)
registers instead of the TLB; better choice of
segment IDs (VSIDs) to get a higher hash table hit
ratio; hand optimizing the TLB miss code; on the
603, don't use the hash tables upon TLB miss, use
the page tables directly; instead of flushing stale
entries from the TLB and hash table, just change the
involved VSID; turn off the cache on TLB miss to
avoid polluting the cache with page table entries;
clear free pages in the idle task, with caches
turned off. Not all of these optimizations are
supported with convincing data in the paper, but the
effect of their combination is quite good. One
interesting result was that apparently the kernel
compile benchmark was originally suffering quite a
lot from TLB misses (just mapping the kernel with
BATs reduced wall-clock time by a factor 1.25).}
}

@Proceedings{osdi99,
title = {Operating Systems Design and Implementation (OSDI '99)},
booktitle = {Operating Systems Design and Implementation (OSDI '99)},
year = {1999},
key = {OSDI '99}
}

>I'm just thinking that AMD's design, while giving you a huge address
>space, wouldn't necessarily allow you to work any faster than an
>equivalently clocked 32bit Athlon. (Disregarding superlarge datasets of
>course.)

I would not expect a 64-bit CPU to be faster unless the application
can use SIMD-type parallelism to a significant degree (and much of
that is probably already exploited through MMX etc. on the AMD
processors).

> It might even be slightly slower in the area where the OS
>spends a good deal of its time -- handling page faults and swapping
>data.

If the OS spends a lot of time in page faults and swapping, another
level of page tables is your least worry. I guess you actually mean
TLB misses; some applications suffer a lot from them. I guess AMD can
make the L2 TLBs larger if the longer traversal time is a problem.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

gat1024

Dec 14, 2000, 10:00:34 PM
In article <91au1d$rn5$4...@news.tuwien.ac.at>,

an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Well, some people did actually evaluate such things and IIRC the
> findings contradict your claims:
>
> @InProceedings{jacob&mudge98,
> author = {Bruce L. Jacob and Trevor N. Mudge},
> title = {A Look at Several Memory Management Units,
> TLB-Refill Mechanisms, and Page Table Organizations},
> crossref = {asplos98},
> pages = {295--306},
> annote = {Describes several MMU/TLB/Page Table Organizations
> of various machines and their operating systems, and
> evaluates and analyses the performance impact of the
> various choices.}
> ** snip **

Thanks for the info! Just looking over the first paper, and it answers
many questions. Although they conclude that the differences between
hierarchical and inverted tables are negligible overall, they believe
that a hardware TLB with a hardware-assisted inverted table (a la
PowerPC) would probably be the optimal solution. The interesting part
was that the Intel way worked best even though it had many L2 cache
misses, mainly because it had no interrupt-processing overhead.


> If the OS spends a lot of time in page faults and swapping, another
> level of page tables is your least worry. I guess you actually mean
> TLB misses; some applications suffer a lot from them. I guess AMD can
> make the L2 TLBs larger if the longer traversal time is a problem.
>
> - anton
> --

You're right, I meant TLB misses. And I guess using larger pages can
help as well -- at the expense of losing finer-grained protection on
blocks of memory.

Greg Pfister

Dec 14, 2000, 4:18:21 PM
Terje Mathisen wrote:
>
> gat1024 wrote:
> > On 32bit x86, the paging hardware uses forward referencing tables. You
> > could theoretically maintain a global map of every page in the address
> > space in about 4MB. You could keep the entire table in RAM without
> > swapping. Using this technique on the Hammer would yield a page table
> > 2^32 larger. You'd have to resort some sort of sparse or multi-level
> > data structure (more than the 3 levels already present). But since there
> > is not hardware support for this sort of design, performance will
> > probably suffer.
> >
> > Question is, knowing that you'd have to change your VMM design anyway,
> > why didn't AMD make it easier on OS programmers and go with some sort of
> > backward referencing design like on the PowerPC?
>
> AFAIK, IBM have gone away from the reverse tables, since they made it
> harder to share memory pages using mapping tricks, i.e. it is hard to
> make the same page appear at multiple virtual addresses, right?

Right. SMPs kill having a literally inverted page table, with one
slot for each physical page, since (for example) it's impossible
to support shared segments that way.

Ever since SMPs, the page tables have been hashed, not truly
inverted. But that's a sparse-ish data structure, of the kind
being discussed, too.

Greg Pfister

Jan Vorbrueggen

Dec 15, 2000, 3:15:09 AM
gat1024 <gat...@my-deja.com> writes:

> And I guess using larger pages can help as well -- at the expense of
> losing finer grain protection on blocks of memory.

That's what granularity hints regions in the OS and their TLB support are
for. You can have your cake _and_ eat it - well, at the cost of additional
complexity in the TLB, but what are a few thousand transistors among friends
today?

Jan

Joe Keane

Dec 16, 2000, 2:28:43 AM
In article <918g9r$19q$1...@nnrp1.deja.com>

gat1024 <gat...@my-deja.com> writes:
>Question is, knowing that you'd have to change your VMM design anyway,
>why didn't AMD make it easier on OS programmers and go with some sort of
>backward referencing design like on the PowerPC?

No no, you're thinking of POWER.

So far as i know, all PowerPCs have proper virtual memory, thus you get
working mapped files, shared memory, and so on.

--
Joe Keane, amateur mathematician

Alberto Moreira

Dec 16, 2000, 9:01:28 AM
Joe Keane wrote:

> No no, you're thinking of POWER.
>
> So far as i know, all PowerPCs have proper virtual memory, thus you get
> working mapped files, shared memory, and so on.

I'm curious to know what your concept of "proper" virtual memory is.
The way I see it, the obvious thing to do is to make pages big enough;
4K pages are a Mickey Mouse legacy of days gone by. And believe it or
not, people do get mapped files and shared memory in Win2K.


Alberto.

Sander Vesik

Dec 18, 2000, 4:14:40 PM
Alberto Moreira <junk...@moreira.mv.com> wrote:
> Joe Keane wrote:
>
>> No no, you're thinking of POWER.
>>
>> So far as i know, all PowerPCs have proper virtual memory, thus you get
>> working mapped files, shared memory, and so on.
>
> I'm curious to know what's your concept of "proper" virtual
> memory. The way I see it, the obvious thing to do is to make

One where the above-listed features work sufficiently well when
used by real-world applications depending on them.

> pages big enough, 4k pages is a mickeymouse legacy of days gone

Can you substantiate your claim?

> by. And believe it or not, people do get mapped files and shared
> memory in Win2K.

I can't see a single point in bringing win2k into this thread.

> Alberto.

--
Sander

FLW: "I can banish that demon"

Alberto Moreira

Dec 19, 2000, 9:05:30 AM
Sander Vesik wrote:

> One where the above-listed features work sufficiently well when
> used by real-world applications depending on them.

Real world applications couldn't care less about paging.


> Can you substantiate your claim?

Sure. Take a high-end PC with, say, 4GB of physical memory. Now
divide it into 4MB pages: you have 1024 pages to worry about. A 4K-byte
TLB is enough to hold them all, so page tables are, uh, unnecessary.
More, an application will have to need more than 4GB of memory before
paging, TLB included, is even necessary; how many applications can
claim such a large working set?

When paging was invented, large machines had 512K bytes of memory.
Those days, you know, are gone; we no longer need to relocate segments
up and down, or link applications into complex tree-like overlay
structures to fit them into the 131K words of 36-bit memory I had in my
Univac 1108s. The problem that paging was invented to solve is now a
thing of the past, yet the scheme is still there even though it doesn't
buy us much except headaches and expenditure in additional hardware and
control software.


> I can't see a single point in bringing win2k into this thread.

It's an operating system, you know, one of those real life
programs that uses paging. No worse than any other, no less
legitimate, and very, very pervasive throughout the computing
world.


Alberto.

Jeremy Harris [RU-UK]

Dec 19, 2000, 11:45:47 AM
In article <3A3F6B2A...@moreira.mv.com>,

Alberto Moreira <junk...@moreira.mv.com> writes:
> Sander Vesik wrote:
>
>> One where the above-listed features work sufficiently well when
>> used by real-world applications depending on them.
>
> Real world applications couldn't care less about paging.

But their users can, when the application slows down so much as
to be unusable.


>> Can you substantiate your claim?
>
> Sure. Take a high end PC, say, 4Gb of physical memory. Now,
> divide it in 4Mb pages: you have 1024 pages to worry about. A 4K
> byte TLB is enough to keep all pages, so page tables are, uh,
> unnecessary. More, an application will have to need more than
> 4Gb of memory before paging, TLB included, is even necessary,
> how many applications can claim such a large working set ?

Now try and run two thousand processes.

Oh dear. Internal fragmentation.


- Jeremy

Tim Bradshaw

Dec 19, 2000, 12:04:08 PM
Alberto Moreira <junk...@moreira.mv.com> writes:

>
> Sure. Take a high end PC, say, 4Gb of physical memory. Now,
> divide it in 4Mb pages: you have 1024 pages to worry about. A 4K
> byte TLB is enough to keep all pages, so page tables are, uh,
> unnecessary. More, an application will have to need more than
> 4Gb of memory before paging, TLB included, is even necessary,
> how many applications can claim such a large working set ?
>

What happens if this machine needs to run more than 1024 processes?
Looks to me like bad trouble.

--tim

Fred Kleinsorge

Dec 20, 2000, 2:32:14 PM

Tim Bradshaw wrote in message ...


This is along the lines of what I would ask. There is a compromise between
TLB misses, paging, and the number of addressable units available. A 2K
page size is great for a small, cheap system without a lot of memory (say
16MB); when you start to reach memory sizes of 1TB, 2K pages are not so
great, and maybe a 64KB page size would be more reasonable. A 4MB page
size seems excessive for a general-purpose system - today. And let's hope
you never have to take a hard page fault ;-)

Pages tend to be the fundamental unit for sharing and protection, so making
them too large means you have lots of wasted memory space. You can reduce
TLB misses without changing the page size - for instance Global Hints on
Alpha.

4GB, BTW, is a piddly amount of memory for most large non-PC servers today.
The ES40 sitting next to me has 16GB.


ttk_ciar

Dec 20, 2000, 9:02:22 PM
>This is along the lines of what I would ask. There is a compromise between
>TLB misses, paging, and number of addressable units available. A 2K page
>size is great for a small, cheap system with not a lot of memory (say 16mb),
>when you start to reach memory sizes of 1TB, then 2K pages are not-so-great,
>maybe a 64kb page size would be more reasonable. A 4mb page size seems
>excessive for a general purpose system - today. And let's hope you never
>have to take a hard page fault ;-)

Isn't this why some systems have variable-sized pages?
Is there a significant downside to using variable-sized pages?

-- TTK

Jan Vorbrueggen

Dec 21, 2000, 5:38:44 AM
TTK Ciar writes:

> Is there a significant downside to using variable-sized pages?

Allocation is a mess. Consider how long it took to get variable-length TLB
entries into hard- and software, and that's a much easier problem, with the
hard part being pushed to explicit specification by the user in many cases -
at least, that is how I understand VMS' granularity hints word (Fred?).

Jan

John R. Mashey

Dec 21, 2000, 4:03:33 AM
In article <91rob...@news2.newsguy.com>, TTK Ciar writes:
|> Organization: Subtle, but there

|>
|> >This is along the lines of what I would ask. There is a compromise between
|> >TLB misses, paging, and number of addressable units available. A 2K page
|> >size is great for a small, cheap system with not a lot of memory (say 16mb),
|> >when you start to reach memory sizes of 1TB, then 2K pages are not-so-great,
|> >maybe a 64kb page size would be more reasonable. A 4mb page size seems
|> >excessive for a general purpose system - today. And let's hope you never
|> >have to take a hard page fault ;-)

|> Isn't this why some systems have variable-sized pages?
Yes; in the MIPS case we decided ~1989 that the R4000 would have all-out
variable page-size support.

|> Is there a significant downside to using variable-sized pages?

(a) Hardware TLB design: consider a TLB design with fixed-size pages.
Some designs extend reasonably gracefully to supporting relatively
arbitrary mixtures of page sizes. Others don't, which is why there have been
plenty of designs with, for example, 2 parallel TLBs, one with
a lot of small page entries, and the other a few huge page entries for
special uses. That works by doing 2 lookups in parallel, having
extracted different numbers of bits.

In a MIPS-style TLB (and others do similar things), each actual TLB entry
could be filled with any of the choice of page-sizes, and the TLB lookup
works as follows:
(a) The largest Virtual Page Number (i.e., for the smallest page
size, 4K in MIPS) is sent in parallel to each TLB entry.
(b) Each entry logically records a VPN and a mask that says
how many of the high-order bits are valid. The associative
comparison is made to only use the valid bits. (The R4000
used some magic circuit design to do this.)

It's up to the OS to make sure that no more than one TLB entry actually
matches any given VPN, although there was a "TLB shutdown" trap in case
the OS screwed up, to avoid burning up the chip.

All of this sort of thing is relatively straightforward in a fully-associative
TLB with software-controlled refill. It is less straightforward to
allow arbitrary mixtures in (for example) a 4-way set-associative design,
where one wants to use low-order VPN bits to index into the arrays.

The fundamental issue for any general variable-page-size design is that,
given a Virtual Address, you don't really know the page size of the page
that contains the VA until you get to the TLB. Individual TLB entries have
to record the size of the virtual memory that they map, so they can determine
whether they actually map the incoming VA or not.
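
A software model of that match (field widths and the entry count are
simplified here, not the real R4000 layout): each entry carries its own
page size, so the number of bits that participate in the compare is a
property of the entry, not of the incoming address.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 48

    struct tlb_entry {
        bool     valid, global;
        uint8_t  asid;
        unsigned page_shift;   /* log2(page size): 12 = 4 KB ... 22 = 4 MB */
        uint64_t vpn;          /* va >> page_shift */
        uint64_t pfn;          /* pa >> page_shift */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Fully-associative lookup: hardware compares all entries in parallel,
       here we just loop.  The OS must ensure at most one entry can match
       any given VA. */
    bool tlb_translate(uint64_t va, uint8_t asid, uint64_t *pa)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            const struct tlb_entry *e = &tlb[i];
            if (!e->valid || (!e->global && e->asid != asid))
                continue;
            if ((va >> e->page_shift) == e->vpn) {          /* entry-sized compare */
                *pa = (e->pfn << e->page_shift)
                    | (va & ((1ULL << e->page_shift) - 1)); /* entry-sized offset */
                return true;
            }
        }
        return false;   /* TLB miss: the software refill handler takes over */
    }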

(b) Software: the OS, of course, has to deal with variable page sizes
in a useful way, and this is nontrivial, and usually takes years to mature
*after* you have the hardware around. A lot of 1980s OSs certainly thought
that for a given system, there was exactly one normal page-size.

However, even given (a) and (b), Moore's Law keeps increasing DRAM sizes,
and there are more and more machines out there with GBs of memory.
It is often more effective to use variable-page-size TLBs to increase TLB
coverage than to try to build fixed-small-page TLBs with enough
entries. For example, with an 8MB cache, and 4K pages, you need
2048 entries just to cover the cache. Even going to 64K pages gets this
down to 128 entries, and of course a 16MB page gets this to 1 entry.
Since some people actually run single apps with multiple GBs of memory,
such apps can badly thrash TLBs with moderate numbers of small pages.
Such apps occur in big technical codes and also in some kinds of DBMS
apps, where you want to have big shared regions.

--
-John Mashey EMAIL: ma...@sgi.com DDD: 650-933-3090 FAX: 650-851-4620
USPS: SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time; cell phone = 650-575-6347.
PERMANENT EMAIL ADDRESS: ma...@heymash.com

Peter Boyle

Dec 21, 2000, 6:22:25 AM

On Wed, 20 Dec 2000, Fred Kleinsorge wrote:

>
> Pages tend to be the funamental unit for sharing, and protection. So making
> them too large means that you have lots of wasted memory space. You can
> reduce TLB misses without changing the page size - for instance Global Hints
> on Alpha.
>

The Block Address Translation tables on the PowerPC would be ideal for
avoiding TLB misses on large-memory work. When active, these override
the TLB system.
If only there were user-hinted OS support...

Peter Boyle pbo...@physics.gla.ac.uk

Jeff Epler

Dec 19, 2000, 8:09:08 PM
On Tue, 19 Dec 2000 09:05:30 -0500, Alberto Moreira
<junk...@moreira.mv.com> wrote:
>Sander Vesik wrote:
>
>> One where the above listed features work sufficently well when
>> used by real world applications depending on them.
>
>Real world applications couldn't care less about paging.
>
>> Can you substaniate your claim?
>
>Sure. Take a high end PC, say, 4Gb of physical memory. Now,
>divide it in 4Mb pages: you have 1024 pages to worry about. A 4K
>byte TLB is enough to keep all pages, so page tables are, uh,
>unnecessary. More, an application will have to need more than
>4Gb of memory before paging, TLB included, is even necessary,
>how many applications can claim such a large working set ?

Take a common medium-end PC with 64 or 128 megs of RAM.

With 4M pages, you get 16 or 32 pages worth of memory. Imagine how Linux
would run on a system such as this.

Each process uses at least 3 pages of memory (code, heap, stack). My
machine currently has 43 processes running. The kernel uses at least
4 pages of memory (code, heap, stacks, and the lowest 4M page, which is
made unusable on the PC architecture).

Each memory-mapped object, such as a file or shared library, takes at least 1
page of memory. My machine currently has 88 mapped objects, including
executables.

This means that my system needs a minimum of 178 pages. But since my system
has 36 megs of RAM, it only has 9(!) 4M pages. That's not even enough to
run the kernel (4 pages), and /sbin/init, /usr/X11R6/bin/XF86_SVGA,
and /usr/X11R6/bin/xterm. Let alone a window manager or a shell in that
xterm! (As it is, with 4K pages, I can run all these programs without swap,
though due to having very little disk cache, performance is degraded)

178 * 4M pages would equate to 712 megs of RAM. People don't have this,
not even on their big desktop machines!

(My "big desktop machine", sitting at the Gnome desktop without any
applications running, has 64 processes and 113 mapped objects, for at
least 245 pages required --- still not quite a gig of RAM at 4M/page,
but getting close. With 4K pages, there are only about 28 megs of RAM
used, excluding buffers/cache)

And let's talk about swapping! How fast is your disk? How long is it
going to take to seek, write 4 megs, seek, read 4 megs, just to swap
one page to disk (or seek + read 4 megs, in the case of reading a mapped
file)? Disks are fast, but they're not that fast! UW-SCSI is 40MB/s,
so that means you can swap 5 pages per second. Great! That means it will
take at least .6 seconds just to fork+exec on your system, unless you just
happen to be running with 12 more megs free on top of the 980 megs in use
in my "Desktop" example above!

Maybe 4M pages would be great, but not if we continue to use pages as
we do today. (roughly defined, the granularity of memory as allocated
by the OS, and the granularity of memory protection)

Hey, how about if we map everything as though it were one big page,
starting at 0? Then we can do away with the TLB entirely, no matter how
much memory you have... all you need is enough RAM and enough address
bits.

Jeff

Rob Young

Dec 21, 2000, 10:32:15 AM
In article <slrn9401lj...@potty.housenet>, jep...@inetnebr.com (Jeff Epler) writes:

> And let's talk about swapping! How fast is your disk? How long is it
> going to take to seek, write 4 megs, seek, read 4 megs, just to swap
> one page to disk (or seek + read 4 megs, in the case of reading a mapped
> file)? Disks are fast, but they're not that fast! UW-SCSI is 40MB/s,
> so that means you can swap 5 pages per second. Great! That means it will
> take at least .6 seconds just to fork+exec on your system, unless you just
> happen to be running with 12 more megs free on top of the 980 megs in use
> in my "Desktop" example above!
>

You live in the past. Seagate is shipping 15K RPM UltraSCSI III
drives. Ultra III does 160 MByte/sec with Ultra IV coming out
in Y2001 at 320 MByte/sec. As an aside, when your co-workers start
chattering about FibreChannel drives ask them about bandwidth.

> Maybe 4M pages would be great, but not if we continue to use pages as
> we do today. (roughly defined, the granularity of memory as allocated
> by the OS, and the granularity of memory protection)
>
> Hey, how about if we do away with the TLB by mapping everything as though it
> were one big page, starting at 0? Then we can do away with the TLB entirely,
> no matter how much memory you have .. all you need is enough RAM and enough
> address bits.
>

4 Meg pages are not for the desktop. However, as memory sizes
get larger and larger in the Enterprise (Compaq is shipping a box
that can handle 256 GBytes and plans to support 1 Terabyte in
the future) memory structures and the time to traverse them get
unwieldy or prohibitive and so you will see larger page sizes
as a consequence.

Rob

Jeremy Harris [RU-UK]

Dec 21, 2000, 11:30:24 AM
In article <AAbs$q53...@eisner.decus.org>,

you...@eisner.decus.org (Rob Young) writes:
> In article <slrn9401lj...@potty.housenet>, jep...@inetnebr.com (Jeff Epler) writes:
>
>> And let's talk about swapping! How fast is your disk? How long is it
>> going to take to seek, write 4 megs, seek, read 4 megs, just to swap
>> one page to disk (or seek + read 4 megs, in the case of reading a mapped
>> file)? Disks are fast, but they're not that fast! UW-SCSI is 40MB/s,
>> so that means you can swap 5 pages per second. Great! That means it will
>> take at least .6 seconds just to fork+exec on your system, unless you just
>> happen to be running with 12 more megs free on top of the 980 megs in use
>> in my "Desktop" example above!
>>
>
> You live in the past. Seagate is shipping 15K RPM UltraSCSI III
> drives. Ultra III does 160 MByte/sec with Ultra IV coming out
> in Y2001 at 320 MByte/sec. As an aside, when your co-workers start
> chattering about FibreChannel drives ask them about bandwidth.

The channel may well do that, but what can the platters sustain?

- Jeremy

Andy Freeman

Dec 21, 2000, 12:24:50 PM
In article <slrn9401lj...@potty.housenet>,

jep...@inetnebr.com (Jeff Epler) wrote:
> On Tue, 19 Dec 2000 09:05:30 -0500, Alberto Moreira
> <junk...@moreira.mv.com> wrote:
> >Sure. Take a high end PC, say, 4Gb of physical memory. Now,
> >divide it in 4Mb pages: you have 1024 pages to worry about. A 4K
> >byte TLB is enough to keep all pages, so page tables are, uh,
> >unnecessary. More, an application will have to need more than
> >4Gb of memory before paging, TLB included, is even necessary,
> >how many applications can claim such a large working set ?
>
> Take a common medium-end PC with 64 or 128 megs of RAM.
>
> > With 4M pages, you get 16 or 32 pages worth of memory. Imagine how Linux
> > would run on a system such as this.

As Epler points out, it would thrash because real memory/page size is
so small that the system can't hold enough mappable objects. (A small
TLB isn't as big an issue if the system can keep unmapped objects in
RAM.) The page-in time would be horrible because 4MByte takes a while.

The minimum number of mappable objects required is fairly usage
dependent. A single-user system doesn't need as many as a server
shared by lots of users. A dos box doesn't need as many as a
single-user Linux system running X.

I'd guess that 1024 mappable objects would suffice for 6-80% of
multi-programmed systems, that 5% would be happy with 16, 3-50%
could live with 128-256, and that 5-10% require at least 16k.

The size of TLB entries must balance the need for enough mappable
objects with TLB speed, miss rates, and refill mechanisms. TLB speed
(for a given technology) is largely determined by architecture and
the number of entries. (A cache-like multi-level hardware TLB might
be useful for systems that need more than 1024 entries.)

Miss rates are affected by how applications use mappable objects, by
how the program's entities are allocated and by their usage patterns.
For example, some code and stack segments might be able to share a 4M
page; the internal fragmentation might be acceptable if it keeps the
miss-rate and the number of required mappable objects down.

Since caches are to RAM what RAM is to swap-disk, I suspect that we'll
see the same sorts of mechanisms in both. (We don't see unmapped
but in cache data yet.) What's interesting to me is that the fractal
analogy suggests that TLBs should have more "lines" than caches, but
that doesn't seem to be happening.

-andy

Rob Young

Dec 21, 2000, 12:51:17 PM


Likewise, what bandwidth can a fibrechannel disk sustain?

http://www.seagate.com/support/kb/disc/ultra3faq.html#2

What transfer rates can I expect from my Ultra160 SCSI drive?

Ultra160 SCSI provides SCSI bus maximum burst data rates of 160
Mbytes/sec. That is double the Ultra2 LVD drives (80 Mbytes/sec),
quadruple the fastest SCSI-2 standard (40 Mbytes/sec), and light years
ahead of the SCSI-1 standard used prior to 1992 in which SCSI bus rates
were as slow as 3 Mbytes/sec. More realistically, sustained data transfer
rates of 30-50 Mbytes/sec can be expected.

---

So maybe the UltraSCSI IV drives will sustain only 110 MByte/sec.

Now suppose your needs are 200 MByte/sec sustained? How to solve
that? Today I would create a 10 member 0+1 set on controllers
that support write-backed mirrored cache, etc.

We can spiral off to various bandwidth wars I suppose, but I will
acknowledge you can overrun a single drive a million different
ways.

Rob

Andi Kleen

Dec 21, 2000, 1:04:26 PM
ma...@mash.engr.sgi.com (John R. Mashey) writes:
>
> It's up to the OS to make sure that no more than one TLB entry actually
> matches any given VPN, although there was a "TLB shutdown" trap in case
> the OS screwed up, to avoid burning up the chip.

Would the chip really "burn" if this trap wasn't implemented ? Just curious.


-Andi

Brig Campbell

Dec 21, 2000, 2:38:12 PM

"Rob Young" <you...@eisner.decus.org> wrote in message
news:WozySI...@eisner.decus.org...

While the drives have great performance and support various interfaces such
as the new 200MB FC and Ultra160, I'd prefer to hide all this behind large
caches on the storage subsystem. Throw in a couple of FC controllers
connected to an FC SAN, and 200MB/sec is no problem.

Don't try this at home, unless you have lots of money then please email me.
:-)

-brig


Some disk performance data from Seagate:
http://www.seagate.com/cda/products/discsales/enterprise/family/0,1130,246,00.html

Performance                                        Ultra        FC
Internal Transfer Rate, ZBR (Mbits/sec)            385-512      395-508
Internal Formatted Transfer Rate (Mbytes/sec)      37.4-48.9    38-48.9
External Transfer Rate (Mbytes/sec)
  Ultra 8,16 bit/Ultra2/Fibre Channel (per loop)   200 160 200
Track-to-track Seek Read/Write (msec)              0.5/0.7      0.5/0.7
Average Seek Read/Write (msec)                     3.9/4.5      3.9/4.5
Average Latency (msec)                             2            2
Spindle Speed (RPM)                                15K          15K


Rob Young

Dec 21, 2000, 3:29:55 PM
In article <91tm77$ktf$1...@mail.pl.unisys.com>, "Brig Campbell" <brig.c...@unisys.com> writes:

>
> While the drives have great performance and support various interfaces such
> as the new 200MB FC and Ultra160, I'd prefer to hide all this behind large
> caches on the storage subsystem. Throw in a couple of FC controllers
> connect to a FC SAN and 200MB/sec is no problem.
>

Depends... if you only had 3 drives hanging off your controllers
you would have an impossible job sustaining 200 MB/sec as the writes
have to hit the platters. With a great filesystem you don't have
to mask filesystem inefficiencies with expensive controllers.

You can pump as much and much more through AIX with
enough SSA storage and the controllers are quite cheap/reasonable. You
can stripe and mirror at the OS level and get quite smashing
performance:

http://www.tpc.org/results/individual_results/IBM/ibm.s80.99110101.es.pdf

One thing to note in that config is they get 135K tpmC and there
is only 96 MByte of fast write cache.

Perhaps it looks like I want it both ways. But a fast and smart
filesystem is really the key to high performance and reasonable costs
(excepting, of course, the use of the "cluster cheat" to rocket tpmC
scores); that RS/6000 at 135K is "only" $7 million.
Here is 220K tpmC for "only" $9.5 million , again using SSA:

http://www.tpc.org/results/individual_results/Bull/bull.epc2450.00110701.es.pdf

Contrast this to the excellent GS320 number of 155K:

http://www.tpc.org/results/individual_results/Compaq/compaq.gs320.00111001.es.pdf

for $8.6 million. For $900K more you get 70K more tpmC with AIX.

"Dumb" controllers would put Compaq in a better price performance
category. As a for instance, there is $500,000 in controller
SOFTWARE costs in that config and an additional $390,000 in
controller hardware/software maintenance costs, not counting
the costs of the controllers themselves!

As CPUs get faster and faster, the overhead of software based
RAID is less of an issue allowing vendors to ship dumb controllers
as the smarts are already in the OS. Would make for more competitive
tpmC bidding if the expensive controllers would get ditched.

Rob

Thomas Womack

Dec 21, 2000, 3:31:57 PM
"Jeff Epler" <jep...@inetnebr.com> wrote

> <junk...@moreira.mv.com> wrote:
> >Sander Vesik wrote:
> >
> >> One where the above listed features work sufficiently well when

> >> used by real world applications depending on them.
> >
> >Real world applications couldn't care less about paging.
> >
> >> Can you substantiate your claim?

> >
> >Sure. Take a high end PC, say, 4Gb of physical memory. Now,
> >divide it in 4Mb pages: you have 1024 pages to worry about. A 4K
> >byte TLB is enough to keep all pages, so page tables are, uh,
> >unnecessary.

Aren't current TLBs a lot smaller than that -- I had a vague feeling the P3
one was no more than 64 entries [it couldn't map all of L2]. Or am I missing
some subtle factor to do with number of sets?

> >More, an application will have to need more than
> >4Gb of memory before paging, TLB included, is even necessary,
> >how many applications can claim such a large working set ?
>
> Take a common medium-end PC with 64 or 128 megs of RAM.
>
> With 4M pages, you get 16 or 32 pages worth of memory. Imagine how Linux
> would run on a system such as this.

OK, 4M pages seem silly.

How about, say, 64K or 128K pages? Not such stupid granularity, and you
still have the advantage of doing everything in TLB and losing page tables;
even if you choose not to do that, at least you can map the whole L2 cache
into the TLB.
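
For instance, the number of entries needed to cover an L2 entirely from
the TLB, one entry per page (the 512 KB L2 size is just an example):

    #include <stdio.h>

    int main(void)
    {
        const long l2 = 512L << 10;                       /* example L2 size */
        const long page[] = { 4L<<10, 64L<<10, 128L<<10, 4L<<20 };

        for (unsigned i = 0; i < sizeof page / sizeof *page; i++)
            printf("%6ld KB pages: %4ld entries to cover %ld KB of L2\n",
                   page[i] >> 10, (l2 + page[i] - 1) / page[i], l2 >> 10);
        /* 4 KB -> 128 entries (more than a 64-entry TLB has);
           64 KB -> 8; 128 KB -> 4; 4 MB -> one entry covers it all. */
        return 0;
    }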

I used for several years a machine (original Acorn Archimedes, 1987 vintage)
which asserted that you had 1024 pages per memory controller. For <=4M of
memory you had one memory controller; if you had 2M of memory, they were 2K
pages. You could get 512k machines, which I think ignored one address line
and claimed to be 1M, but the top half duplicated the bottom half.

For more than 4M you used one memory controller per 4M block of memory, and
could get up to 4096 pages, if you were lunatic enough to spend £2000 or so
on sixteen whole megabytes of RAM.

Of course, this precise approach won't work nowadays because you really
don't want chip boundaries between your CPU and your memory controllers, and
you really really don't want 8N chip boundaries for N gigabytes of memory.
Also, physical chips are expensive and board area is expensive and ...

Tom


Steve Crockett

Dec 21, 2000, 3:50:30 PM
"Thomas Womack" <t...@womack.net> writes:

> "Jeff Epler" <jep...@inetnebr.com> wrote
> > <junk...@moreira.mv.com> wrote:
> > >Sander Vesik wrote:
> > >
> > >> One where the above listed features work sufficiently well when
> > >> used by real world applications depending on them.
> > >
> > >Real world applications couldn't care less about paging.
> > >
> > >> Can you substantiate your claim?
> > >
> > >Sure. Take a high end PC, say, 4Gb of physical memory. Now,
> > >divide it in 4Mb pages: you have 1024 pages to worry about. A 4K
> > >byte TLB is enough to keep all pages, so page tables are, uh,
> > >unnecessary.
>
> Aren't current TLBs a lot smaller than that -- I had a vague feeling the P3
> one was no more than 64 entries [it couldn't map all of L2]. Or am I missing
> some subtle factor to do with number of sets?
>
> > >More, an application will have to need more than
> > >4Gb of memory before paging, TLB included, is even necessary,
> > >how many applications can claim such a large working set ?
> >
> > Take a common medium-end PC with 64 or 128 megs of RAM.
> >
> > With 4M pages, you get 16 or 32 pages worth of memory. Imagine how Linux
> > would run on a system such as this.
>
> OK, 4M pages seem silly.


No, they don't. Imagine instead how running a proprietary Unix on
a system with 128 processors or more using shared memory
with 2 GB/processor (e.g., 256 GB memory) would do with 4K pages.
Got any interest in trying to manage 67 million pages,
and still do useful computing?

Note that this is comp.arch, not comp.pc.arch or
comp.i'm-too-cheap.arch or even comp.os.linux.advocacy.
Some of the people who post and lurk here have multi-million
dollar budgets for hardware, and have actual needs for systems
similar to the one mentioned above.

Linux has its place, but it isn't the be-all and end-all of
computing. More to the point, I'm pretty sure Linus T.
and the cabal prefer it that way, never mind the
usual zealotry surrounding the issue.

Most people don't expect a Lamborghini to be run by
a 2-cycle Briggs & Stratton. Why limit computer
design decisions/discussions only to what works on
a desktop Linux box?

(For the record, I find a Briggs & Stratton to be
much better fit than a V10 for my lawn mower, and
a Linux desktop suits my e-mail and web browsing
needs much better than the 256-processor, 256 GB
Origin 2000 I use regularly for my day job.)

Use/design the right tool for your job, and let others
use/design the right tool for theirs.

--
Steve Crockett Supercomputing Apps-Energy
SGI Server & Supercomputing Bus. Unit
11490 Westheimer Rd, Ste. 100 e-mail: s...@sgi.com
Houston, TX 77077 phone: (281) 493-8349

Aaron R. Kulkis

Dec 21, 2000, 4:17:27 PM
Rob Young wrote:
>
> In article <slrn9401lj...@potty.housenet>, jep...@inetnebr.com (Jeff Epler) writes:
>
> > And let's talk about swapping! How fast is your disk? How long is it
> > going to take to seek, write 4 megs, seek, read 4 megs, just to swap
> > one page to disk (or seek + read 4 megs, in the case of reading a mapped
> > file)? Disks are fast, but they're not that fast! UW-SCSI is 40MB/s,
> > so that means you can swap 5 pages per second. Great! That means it will
> > take at least .6 seconds just to fork+exec on your system, unless you just
> > happen to be running with 12 more megs free on top of the 980 megs in use
> > in my "Desktop" example above!
> >
>
> You live in the past. Seagate is shipping 15K RPM UltraSCSI III
> drives. Ultra III does 160 MByte/sec with Ultra IV coming out
> in Y2001 at 320 MByte/sec. As an aside, when your co-workers start
> chattering about FibreChannel drives ask them about bandwidth.

Which doesn't change the MECHANICAL head-seek times.
Face it... 4M is too big. The costs FAR outweigh the meager benefits.

>
> > Maybe 4M pages would be great, but not if we continue to use pages as
> > we do today. (roughly defined, the granularity of memory as allocated
> > by the OS, and the granularity of memory protection)
> >
> > Hey, how about if we do away with the TLB by mapping everything as though it
> > were one big page, starting at 0? Then we can do away with the TLB entirely,
> > no matter how much memory you have .. all you need is enough RAM and enough
> > address bits.
> >
>
> 4 Meg pages are not for the desktop. However, as memory sizes
> get larger and larger in the Enterprise (Compaq is shipping a box
> that can handle 256 GBytes and plans to support 1 Terabyte in
> the future) memory structures and the time to traverse them get
> unwieldy or prohibitive and so you will see larger page sizes
> as a consequence.


If you have 1 TB of memory, then you have enough mem for smaller page sizes.
Duh.


>
> Rob


--
Aaron R. Kulkis
Unix Systems Engineer
DNRC Minister of all I survey
ICQ # 3056642

Greg Lindahl

Dec 21, 2000, 4:02:27 PM
> > With 4M pages, you get 16 or 32 pages worth of memory. Imagine how Linux
> > would run on a system such as this.
>
> OK, 4M pages seem silly.

At that memory size! But...

This argument has been a bit silly, since folks seem to be assuming
that it's about "should 4M pages be the default?" instead of "do 4M
pages make sense for a 16 Gbyte system that never pages?" The first
question is kind of dull; the second is much more interesting.

-- g

Bill Todd

Dec 21, 2000, 5:09:40 PM

Steve Crockett <s...@sgi.com> wrote in message
news:rozvgsd...@thulcandra.houst.sgi.com...

Yes, they do, for *most* purposes. The fact that you can find the
occasional exception does nothing to change this.

> Imagine instead how running a proprietary Unix on
> a system with 128 processors or more using shared memory
> with 2 GB/processor (e.g., 256 GB memory) would do with 4K pages.
> Got any interest in trying to manage 67 million pages,
> and still do useful computing?

It really depends upon the access (and possibly also sharing) granularity
of the application and its data, and on how the memory system interacts
with secondary storage (e.g., with 4 MB pages, many kinds of data won't be
stored in memory at all efficiently if storage is accessed at memory-page
granularity).

As was suggested earlier, a 4 MB page size has major performance
implications whenever it equates to the disk-access granularity (e.g., for
paging). Any randomly-accessed data that does not require larger transfers
should ideally be small enough such that the disk seek and rotational
latency (at least on average, and when queues are expected that 'average'
value decreases due to request reordering optimizations) significantly
exceeds the request transfer time, thus minimally impacting the attainable
I/O rate.

Contemporary high-volume disks have average access times on the order of 12
ms. (8 ms. for a 1/3-stroke seek, 4 ms. for half a rotation) and average
transfer rates of around 30 MB/sec, which would suggest something like a 64
KB page size (adding about 2 ms. to the base 12 ms. figure) as an upper
limit from that viewpoint (and 64 KB also happens to fall within the range
of competence of IDE disks, though their limitations in this area seem to be
on the verge of disappearing). 32 KB should be fine even with significant
queue optimization.
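
The same trade-off as a small calculation -- random I/O time is roughly
seek + rotation + size/rate, and you want the last term small next to the
first two (the 12 ms and 30 MB/s figures are the ones quoted here):

    #include <stdio.h>

    int main(void)
    {
        const double access_ms = 12.0;   /* avg seek + half a rotation */
        const double mb_per_s  = 30.0;   /* avg sustained media rate */
        const long   kb[] = { 4, 32, 64, 512, 4096 };

        for (unsigned i = 0; i < sizeof kb / sizeof *kb; i++) {
            double xfer = kb[i] / 1024.0 / mb_per_s * 1000.0;
            printf("%5ld KB: %4.1f ms access + %6.1f ms transfer -> %4.0f random I/Os per sec\n",
                   kb[i], access_ms, xfer, 1000.0 / (access_ms + xfer));
        }
        /* 64 KB adds ~2 ms to the 12 ms base; 4 MB adds ~133 ms and the
           random-access rate collapses to about 7 pages per second. */
        return 0;
    }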

As noted elsewhere in this thread, contemporary high-end disks have average
access times on the order of 6 ms. (4 for the 1/3-stroke seek, 2 for half a
rotation) and average transfer rates of around 45 MB/sec., so the desirable
max random transfer size doesn't differ that much from the above (though is
biased downward a bit).

Since access times generally improve at slower rates than transfer rates,
reasonable random-access sizes (in this context) tend to improve over time.
Given the continual drop in memory costs per bit, so does the amount of
internal fragmentation that is reasonable to allow due to partially-unused
space in a memory page. Thus over time memory page sizes tend to increase -
from 512 bytes on a VAX to 4 KB or 8 KB in the '90s to (at least in this
disk-access-oriented context) perhaps 64 KB today. But it will be a long
time before 4 MB pages are a good choice for general-purpose use.

- bill

Stephen Fuld

Dec 21, 2000, 5:12:14 PM
"Rob Young" <you...@eisner.decus.org> wrote in message
news:WozySI...@eisner.decus.org...


No! The sustained data rate is a function of the sustained data rate off
the disk heads. Currently this is limited by the electronics in the disk
and is about 500-600 Mbits per second. Note that this is at the outer
diameter and is less as you seek inward on the drive. Thus just changing
the interface won't affect the sustained transfer rate at all (unless the
bus interface is slower than the disk rate). This is true whether you are
going from SCSI 160 to SCSI 320, or to 100 MB Fibre Channel, or the new
200 MB Fibre Channel. The quotation you gave above supports the 40 MB/sec
rate given by the previous poster.


>
> Now suppose your needs are 200 MByte/sec sustained? How to solve
> that? Today I would create a 10 member 0+1 set on controllers
> that support write-backed mirrored cache, etc.

You didn't say how many controllers you would use, nor what the referencing
pattern was like (that is, the expected time the disk spends seeking and in
rotational latency versus transferring, which can radically affect the
achievable sustained transfer rate).


--
- Stephen Fuld

Stephen Fuld

Dec 21, 2000, 5:12:13 PM

"Rob Young" <you...@eisner.decus.org> wrote in message
news:AAbs$q53...@eisner.decus.org...

> In article <slrn9401lj...@potty.housenet>, jep...@inetnebr.com (Jeff Epler) writes:
>
> > And let's talk about swapping! How fast is your disk? How long is it
> > going to take to seek, write 4 megs, seek, read 4 megs, just to swap
> > one page to disk (or seek + read 4 megs, in the case of reading a mapped
> > file)? Disks are fast, but they're not that fast! UW-SCSI is 40MB/s,
> > so that means you can swap 5 pages per second. Great! That means it will
> > take at least .6 seconds just to fork+exec on your system, unless you just
> > happen to be running with 12 more megs free on top of the 980 megs in use
> > in my "Desktop" example above!
> >
>
> You live in the past. Seagate is shipping 15K RPM UltraSCSI III
> drives. Ultra III does 160 MByte/sec with Ultra IV coming out
> in Y2001 at 320 MByte/sec. As an aside, when your co-workers start
> chattering about FibreChannel drives ask them about bandwidth.


The speed of the SCSI bus is not the limiting factor here. For a single I/O
of 4 MB, the limit is the actual sustained transfer rate off the disk. The
higher bus speed is useful for handling multiple buffered disks on one
controller, but doesn't much affect the speed of a single I/O. For current
disks, depending on what zone of the disk we are talking about, 40 MB/sec
isn't a bad number.

As for parallel SCSI versus Fibre Channel, bandwidth is a relatively minor
part of the equation. The advantages of FC are more in the areas of better
connectivity (up to 127 drives per controller), better availability (true
dual port on the drive to stay up in the event of cable failure, etc.) and
easier cabling for systems with a large number of drives.

For most applications (except things like streaming video), the seek and
latency time of the disk swamps the data transfer time and since the SCSI
bus or FC loop isn't busy during the seek, the bus transfer rate rarely
limits performance.

--
- Stephen Fuld


Snip


> 4 Meg pages are not for the desktop. However, as memory sizes
> get larger and larger in the Enterprise (Compaq is shipping a box
> that can handle 256 GBytes and plans to support 1 Terabyte in
> the future) memory structures and the time to traverse them get
> unwieldy or prohibitive and so you will see larger page sizes
> as a consequence.
>

Agreed. What's more, due to the continuing trends in these areas, the
optimum size will continue to grow in the future. If 64K is "the right
size" i.e. the best compormise, now, then sometime in the future 4M will be.

> Rob
>


Steve Crockett

Dec 21, 2000, 5:40:56 PM
"Bill Todd" <bill...@foo.mv.com> writes:

> > > OK, 4M pages seem silly.
> >
> >
> > No, they don't.
>
> Yes, they do, for *most* purposes. The fact that you can find the
> occasional exception does nothing to change this.


No one advocating the use of large pages has tried to insist
that they be used indiscriminately. In fact, no advocate for
them has even insisted that they be used to the exclusion of any
other page size.

I'm looking at a system right now which shows the following
breakdown of pages:

Node[0]
Totalmem 758.2M
freemem 707.3M
64k pages 121
256k pages 30
1MB pages 7
4MB pages 117
16MB pages 0
Node[1]
Totalmem 760.7M
freemem 678.2M
64k pages 121
256k pages 30
1MB pages 7
4MB pages 128
16MB pages 0
Node[2]
Totalmem 760.7M
freemem 737.1M
64k pages 121
256k pages 30
1MB pages 7
4MB pages 114
16MB pages 0
Node[3]
Totalmem 760.7M
freemem 682.6M
64k pages 121
256k pages 30
1MB pages 7
4MB pages 76
16MB pages 0

The unlisted pages for each node are 16K.

>
> Imagine instead how running a proprietary Unix on
> > a system with 128 processors or more using shared memory
> > with 2 GB/processor (e.g., 256 GB memory) would do with 4K pages.
> > Got any interest in trying to manage 67 million pages,
> > and still do useful computing?
>
> It really depends upon the access (and possibly also sharing) granularity of
> the application and its data. And how the memory system interacts with
> secondary storage (e.g., with 4 MB pages, many kinds of data won't be stored
> in memory at all efficiently if storage is accessed at memory-page
> granularity).

[ snipping lots of stuff about disks ]


When you're dealing with large memory systems like this,
paging to disk isn't the issue--unless you can sustain
multiple GB/sec to disk, swapping is just out of the question
from a performance perspective. Anyone using such a large
system would double the memory before they would take
the performance hit swapping would cause.

The issue for pages in a system with memory of this size
is the rate at which you take TLB misses. Most people who
care about performance are aware of the performance degradation
caused by inefficient use of caches---for every cache miss whose
latency cannot be hidden, you can lose the equivalent of 20-70
operations. If you are so unlucky (or unwise) as to thrash your
cache unmercifully, you can give up 50+% of the capability
of your system.

The TLB is effectively a cache which preserves the
mapping of virtual memory to physical memory, and it is typically
quite small (64 entries on a MIPS R10000). If your memory access pattern
is non-sequential, or if you have very large strides between
successive accesses, you can easily thrash the TLB. TLB
misses are also rather costly. Larger pages tend to eliminate
TLB misses, since more memory is mapped by a single TLB entry.

I've seen situations where using 1 MB pages instead of 16 KB
pages for an application which ran in-core led to a performance
improvement of 25% (i.e., what ran in 50 minutes now runs in 40).
I suspect others have seen greater improvements.
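
For a feel of what a larger page buys in TLB reach, here is a tiny sketch;
the 64-entry figure is simply the R10000 number mentioned above taken at
face value, and no entry pairing or associativity is modeled:

/* Rough TLB-reach sketch: memory mapped by a 64-entry TLB at various page
 * sizes.  Entry count and page sizes are assumptions for illustration only. */
#include <stdio.h>

int main(void)
{
    const long entries = 64;
    const long sizes_kb[] = { 4, 16, 64, 256, 1024, 4096 };
    for (int i = 0; i < 6; i++)
        printf("%5ld KB pages -> TLB reach %8.2f MB\n",
               sizes_kb[i], entries * sizes_kb[i] / 1024.0);
    /* e.g. 16 KB pages cover only ~1 MB, while 4 MB pages cover ~256 MB --
     * a big difference against the ~758 MB per node listed above. */
    return 0;
}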

John R. Mashey

unread,
Dec 21, 2000, 5:58:13 PM12/21/00
to

Well, burn-out is probably more accurate than burn-up :-)
It was determined that if you filled up the TLB with identical entries,
the resulting effects of multiple matches & the extra current would
eventually cause physical damage, hence the "TLB-shutdown" feature for
absolute safety. Yes, I know this is weird and paranoid, but people had
visions of lawsuits if hackers got into a machine, took it over, and
physically damaged it this way.

Brig Campbell

unread,
Dec 21, 2000, 6:13:07 PM12/21/00
to

"Rob Young" <you...@eisner.decus.org> wrote in message
news:56CFHv...@eisner.decus.org...

> In article <91tm77$ktf$1...@mail.pl.unisys.com>, "Brig Campbell"
<brig.c...@unisys.com> writes:
>
> >
> > While the drives have great performance and support various interfaces
> > such as the new 200MB FC and Ultra160, I'd prefer to hide all this behind
> > large caches on the storage subsystem. Throw in a couple of FC controllers
> > connect to a FC SAN and 200MB/sec is no problem.
> >
>
> Depends... if you only had 3 drives hanging off your controllers
> you would have an impossible job sustaining 200 MB/sec as the writes
> have to hit the platters. With a great filesystem you don't have
> to mask filesystem inefficiencies with expensive controllers.

First, happy holidays to everyone. The writes don't always have to hit the
platters; in the case of an EMC Symmetrix, once the write hits the cache the
controller says OK and the physical write occurs later.

But this only works as long as you don't dirty all of the cache, because then
the Symmetrix must start doing synchronized writes. And at 200MB/sec, you
could sustain writes to a 16GB cache for about 80 seconds.

You probably want to use more than 3 drives just for bandwidth purposes,
nothing to do with capacity.
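
The arithmetic behind that 80-second figure, as a trivial sketch (16 GB and
200 MB/sec are taken from the post above as round numbers):

/* How long a write-back cache can absorb a sustained write stream before
 * dirty data must be destaged at the (slower) back-end rate. */
#include <stdio.h>

int main(void)
{
    double cache_mb   = 16.0 * 1024.0;  /* 16 GB of controller cache */
    double write_mb_s = 200.0;          /* sustained host write rate */
    printf("cache fills in about %.0f seconds\n", cache_mb / write_mb_s);
    /* ~82 s; after that, writes are limited by how fast the platters drain it. */
    return 0;
}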


<snip some tpc stuff>

> "Dumb" controllers would put Compaq in a better price performance
> category. As a for instance, there is $500,000 in controller
> SOFTWARE costs in that config and an additional $390,000 in
> controller hardware/software maintenance costs, not counting
> the costs of the controllers themselves!
>
> As CPUs get faster and faster, the overhead of software based
> RAID is less of an issue allowing vendors to ship dumb controllers
> as the smarts are already in the OS. Would make for more competitive
> tpmC bidding if the expensive controllers would get ditched.
>

How does InfiniBand change that effort? Vendors will have no problem
connecting lots of 2.1GB/sec Host Channel Adapters, so bandwidth into the
system will not be a problem. We still need to address the latency of the
darn slow disk. I like big, intelligent caches on the storage side that
analyze access patterns and attempt to preread the data I might need next,
independent of the host operating system.

Aaron R. Kulkis

unread,
Dec 21, 2000, 6:41:12 PM12/21/00
to
Rob Young wrote:
>
> In article <slrn9401lj...@potty.housenet>, jep...@inetnebr.com (Jeff Epler) writes:
>
> > And let's talk about swapping! How fast is your disk? How long is it
> > going to take to seek, write 4 megs, seek, read 4 megs, just to swap
> > one page to disk (or seek + read 4 megs, in the case of reading a mapped
> > file)? Disks are fast, but they're not that fast! UW-SCSI is 40MB/s,
> > so that means you can swap 5 pages per second. Great! That means it will
> > take at least .6 seconds just to fork+exec on your system, unless you just
> > happen to be running with 12 more megs free on top of the 980 megs in use
> > in my "Desktop" example above!
> >
>
> You live in the past. Seagate is shipping 15K RPM UltraSCSI III
> drives. Ultra III does 160 MByte/sec with Ultra IV coming out
> in Y2001 at 320 MByte/sec. As an aside, when your co-workers start
> chattering about FibreChannel drives ask them about bandwidth.
>

However, the point of virtual memory is NOT to explore the limits
of INefficiency...


> > Maybe 4M pages would be great, but not if we continue to use pages as
> > we do today. (roughly defined, the granularity of memory as allocated
> > by the OS, and the granularity of memory protection)
> >
> > Hey, how about if we do away with the TLB by mapping everything as though it
> > were one big page, starting at 0? Then we can do away with the TLB entirely,
> > no matter how much memory you have .. all you need is enough RAM and enough
> > address bits.
> >
>
> 4 Meg pages are not for the desktop. However, as memory sizes
> get larger and larger in the Enterprise (Compaq is shipping a box
> that can handle 256 GBytes and plans to support 1 Terabyte in
> the future) memory structures and the time to traverse them get
> unwieldy or prohibitive and so you will see larger page sizes
> as a consequence.
>
> Rob

--
Aaron R. Kulkis
Unix Systems Engineer
DNRC Minister of all I survey
ICQ # 3056642


H: "Having found not one single carbon monoxide leak on the entire
premises, it is my belief, and Willard concurs, that the reason
you folks feel listless and disoriented is simply because
you are lazy, stupid people"

I: Loren Petrich's 2-week stubborn refusal to respond to the
challenge to describe even one philosophical difference
between himself and the communists demonstrates that, in fact,
Loren Petrich is a COMMUNIST ***hole

J: Other knee_jerk reactionaries: billh, david casey, redc1c4,
The retarded sisters: Raunchy (rauni) and Anencephielle (Enielle),
also known as old hags who've hit the wall....

A: The wise man is mocked by fools.

B: Jet Silverman plays the fool and spews out nonsense as a
method of sidetracking discussions which are headed in a
direction that she doesn't like.

C: Jet Silverman claims to have killfiled me.

D: Jet Silverman now follows me from newgroup to newsgroup
...despite (C) above.

E: Jet is not worthy of the time to compose a response until
her behavior improves.

F: Unit_4's "Kook hunt" reminds me of "Jimmy Baker's" harangues against
adultery while concurrently committing adultery with Tammy Hahn.

G: Knackos...you're a retard.

Dik T. Winter

unread,
Dec 21, 2000, 7:42:43 PM12/21/00
to
In article <91tm77$ktf$1...@mail.pl.unisys.com> "Brig Campbell" <brig.c...@unisys.com> writes:
> While the drives have great performance and support various interfaces such
> as the new 200MB FC and Ultra160, I'd prefer to hide all this behind large
> caches on the storage subsystem. Throw in a couple of FC controllers
> connect to a FC SAN and 200MB/sec is no problem.

Or go with disk-striping.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/

Dik T. Winter

unread,
Dec 21, 2000, 7:53:24 PM12/21/00
to
In article <rozr931...@thulcandra.houst.sgi.com> Steve Crockett <s...@sgi.com> writes:
> I've seen situations where using 1 MB pages instead of 16 KB
> pages for an application which ran in-core led to a performance
> improvement of 25% (i.e., what ran in 50 minutes now runs in 40).
> I suspect others have seen greater improvements.

Older, but similar. On an application we were running on a CDC Cyber 205,
the speed was doubled when the small page size was increased from 512 bytes
to 2048 bytes, solely due to the decrease in in-processor TLB misses.
Had we thought about it originally and used large pages (1 MB), we would
have detected the speed increase immediately. And that is significant for a
program that had been running for about 2 CPU-months.

Pete Zaitcev

unread,
Dec 21, 2000, 8:48:29 PM12/21/00
to
On Thu, 21 Dec 2000 15:13:07 -0800, Brig Campbell <brig.c...@unisys.com> wrote:

> [...] We still need to address the latency of the


> darn slow disk. I like big, intelligent caches on the storage side that
> analyze access patterns and attempt to preread the data I'm might need next
> independant of the host operating system.

EMC went out of its way to market insanely huge caches
in its controllers, but I am not convinced. The same intelligence
that "analyzes access patterns" can reside in the host just
as easily.

I see some difference between storage cache and CPU cache.
First, storage cache can be effectively promoted to CPU RAM,
where it is more useful. Second, access patterns to storage
exhibit less locality (unless we talk sequential access, where
you only need a small cache or buffer for (de-)coalescing).
Because of the above, cache in the storage controller is
nearly useless or outright harmful.

Storage cache helps systems that cannot be easily extended,
so you cannot use $$$ that you spend on storage for RAM instead
and my first point is negated (MVS mainframes, for instance).

Some buffers are required in storage as long as it implements
RAID-5 and coalescing. But they are nowhere near the amount
that EMC advertises.

-- Pete

Stephen Fuld

unread,
Dec 21, 2000, 9:17:46 PM12/21/00
to
"Pete Zaitcev" <zai...@yahoo.com> wrote in message
news:slrn9453ek....@js006.zaitcev.lan...


A couple of points.

Remember, historically the EMC systems were designed for MVS (which doesn't
have a "file cache" a la Unix or even NT), so large caches were useful in
their original intended environment. They also needed a large cache because
they keep the data lengths of records in the cache for MVS, which allows them
to have 100% write hits. They added the open-system interfaces (SCSI and
then Fibre Channel) later to expand their market, but the architecture
remained basically unchanged. Yes, it is overkill for many (most?) open
system applications, but it is hard to argue with success.

One advantage of storage cache that you didn't mention is that the cache in
the storage system is non-volatile and isolated from the direct effects of
the operating system, so writes survive both power outages and OS crashes
and are more likely to survive an errant store by the OS, whereas RAM in the
CPU may not be non-volatile and usually gets lost upon OS failure.

--
- Stephen Fuld

>
> -- Pete


Bernd Paysan

unread,
Dec 22, 2000, 11:48:40 AM12/22/00
to
Thomas Womack wrote:
> How about, say, 64K or 128K pages? Not such stupid granularity, and you
> still have the advantage of doing everything in TLB and losing page tables;
> even if you choose not to do that, at least you can map the whole L2 cache
> into the TLB.

My rule of thumb is: make blocked IO operations so that seek and
transfer time get close together. For modern HDs, 256k would be around
the sweet spot (on the logarithmic scale a bit closer to 4M than to 4K).
Looking at the RSS from my Linux workstation, it looks like most
processes would take several of those pages, anyway. This would increase
memory usage only by about 10%, and that's not that bad.
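
A quick sketch of where that sweet spot comes from; the ~8 ms positioning
time and ~30 MB/sec media rate are assumed round figures:

/* Block size at which transfer time catches up with positioning time. */
#include <stdio.h>

int main(void)
{
    double position_ms = 8.0;    /* assumed average seek + rotational latency */
    double media_mb_s  = 30.0;   /* assumed sustained media rate              */
    double breakeven_kb = position_ms / 1000.0 * media_mb_s * 1024.0;
    printf("transfer time equals positioning time at about %.0f KB\n",
           breakeven_kb);        /* ~246 KB, i.e. roughly the 256k rule of thumb */
    return 0;
}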

If the MMU did support several page sizes - not just 4K and 4M - one
could even split down the last pages in a section (like
buddy-allocation).


--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Rob Young

unread,
Dec 22, 2000, 12:27:41 PM12/22/00
to
In article <91u2q7$3s7$1...@mail.pl.unisys.com>, "Brig Campbell" <brig.c...@unisys.com> writes:
>
> "Rob Young" <you...@eisner.decus.org> wrote in message
> news:56CFHv...@eisner.decus.org...
>> In article <91tm77$ktf$1...@mail.pl.unisys.com>, "Brig Campbell"
> <brig.c...@unisys.com> writes:
>>
>> >
>> > While the drives have great performance and support various interfaces
>> > such as the new 200MB FC and Ultra160, I'd prefer to hide all this behind
>> > large caches on the storage subsystem. Throw in a couple of FC controllers
>> > connect to a FC SAN and 200MB/sec is no problem.
>> >
>>
>> Depends... if you only had 3 drives hanging off your controllers
>> you would have an impossible job sustaining 200 MB/sec as the writes
>> have to hit the platters. With a great filesystem you don't have
>> to mask filesystem inefficiencies with expensive controllers.
>
> First, happy holidays to everyone. The writes don't always have to hit the
> platters; in the case of an EMC Symmetrix, once the write hits the cache the
> controller says OK and the physical write occurs later.
>
> But this only works as long as you don't dirty all of the cache, because then
> the Symmetrix must start doing synchronized writes. And at 200MB/sec, you
> could sustain writes to a 16GB cache for about 80 seconds.
>
> You probably want to use more than 3 drives just for bandwidth purposes,
> nothing to do with capacity.
>


But to get pathological and break your assumption above... suppose the
3 drives are being used for temp data that always changes and
streams at 200 MB/sec, 7x24. You can't write that much to those drives,
and the hope of combining writes in cache is lost because all the data
is different; and yes, the data is active for three minutes, i.e.
the writes have to hit the platters. Finally, if the Symmetrix
is supporting 5 OSes, the cache is split across 5 bands, so no
OS gets to see the entire 16 Gigs and we wouldn't get 80 seconds'
worth of time. Am I doing this? No. But just to point out
and concur: striping is a cheaper alternative, as another poster
alluded to in another thread.


>
> <snip some tpc stuff>
>
>> "Dumb" controllers would put Compaq in a better price performance
>> category. As a for instance, there is $500,000 in controller
>> SOFTWARE costs in that config and an additional $390,000 in
>> controller hardware/software maintenance costs, not counting
>> the costs of the controllers themselves!
>>
>> As CPUs get faster and faster, the overhead of software based
>> RAID is less of an issue allowing vendors to ship dumb controllers
>> as the smarts are already in the OS. Would make for more competitive
>> tpmC bidding if the expensive controllers would get ditched.
>>
>
> How does Infiniband change that effort. Vendors will have no problem
> connecting lots of 2.1GB/sec Host Channel Adapters so bandwidth into the
> system will not be a problem. We still need to address the latency of the
> darn slow disk. I like big, intelligent caches on the storage side that
> analyze access patterns and attempt to preread the data I'm might need next
> independant of the host operating system.
>

And I suppose the OS can't recognize a pattern, or OS research
isn't sufficiently advanced to compete with EMC?

Perhaps you missed where I mentioned that the RS/6000, in its
135K tpmC result, shows 96 MByte of fast write cache (write-back
cache). If the filesystem is sufficiently striped, you don't
need a ton of cache to stream writes. I will give you some
quarter on the "pre-read" side IF the reads are sequential.
But there isn't a whole lot of help for random I/O. Now, where
I have a leg up on you is that I could get 10-20 times as much filesystem
cache. So if you could carve out 1-2 Gigs of Symmetrix cache
in my band for me, I would instead take 32 Gigs of filesystem cache and
spend as much. That is another reason the RS/6000 number is a very
good number: the SGA (or whatever it is called in DB2) is most
likely very substantial, and they wouldn't get a whole lot of mileage
out of very expensive controller cache. Finally, we will see monster
memories where large systems have 1 Terabyte (today systems are
shipping with 256 Gigs of memory), and a Terabyte of memory would make
a pretty decent filesystem cache, one would think.

Oh, one other thing... if the EMC Symmetrix really helped and made
a difference with its "pattern analysis", you would think someone
somewhere would be using it for their tpmC benchmarks to
get a leg up on the competition. I haven't found one yet, but I haven't
read them all either.

Rob

Bill Todd

unread,
Dec 22, 2000, 12:48:13 PM12/22/00
to

Bernd Paysan <bpa...@mikron.de> wrote in message
news:3A4385E8...@mikron.de...

> Thomas Womack wrote:
> > How about, say, 64K or 128K pages? Not such stupid granularity, and you
> > still have the advantage of doing everything in TLB and losing page tables;
> > even if you choose not to do that, at least you can map the whole L2 cache
> > into the TLB.
>
> My rule of thumb is: make blocked IO operations so that seek and
> transfer time get close together. For modern HDs, 256k would be around
> the sweet spot (on the logarithmic scale a bit closer to 4M than to 4K).
> Looking at the RSS from my Linux workstation, it looks like most
> processes would take several of those pages, anyway. This would increase
> memory usage only by about 10%, and that's not that bad.

Unfortunately, while that approach helps otherwise unintelligent large
sequential I/O patterns, it doesn't optimize them (since the data rate
obtained is only about half what it could be) and also approximately halves
the random I/O rate for operations that could easily be satisfied with far
smaller data transfers. So using smaller transfers (e.g., 64 KB or smaller
as I suggested elsewhere) and clustering them such that sequential patterns
can bundle multiple such units into a single transfer (or
concurrently-queued multiple transfers that get to the platters without
wasted motion) wins on both fronts, since the bundle may be larger than the
compromise value you'd choose. Clustered paging operations use this
principle, at least on the write side (and so could certain kinds of file
systems for some write activity, though I'm not sure if any do, save for
log-structured ones).

- bill

Anne & Lynn Wheeler

unread,
Dec 22, 2000, 1:12:02 PM12/22/00
to

"Bill Todd" <bill...@foo.mv.com> writes:
> Unfortunately, while that approach helps otherwise unintelligent large
> sequential I/O patterns, it doesn't optimize them (since the data rate
> obtained is only about half what it could be) and also approximately halves
> the random I/O rate for operations that could easily be satisfied with far
> smaller data transfers. So using smaller transfers (e.g., 64 KB or smaller
> as I suggested elsewhere) and clustering them such that sequential patterns
> can bundle multiple such units into a single transfer (or
> concurrently-queued multiple transfers that get to the platters without
> wasted motion) wins on both fronts, since the bundle may be larger than the
> compromise value you'd choose. Clustered paging operations use this
> principle, at least on the write side (and so could certain kinds of file
> systems for some write activity, though I'm not sure if any do, save for
> log-structured ones).

some of the ibm mainframe systems going back to the early '80s
supported clustered page i/o for both read and write. basically a
cluster was rebuilt on outgoing and the whole cluster was brought back
in on incoming. basically something akin to working set was
partitioned into cluster sizes on outgoing ... which somewhat improved
the probability that pages that tended to be used at the same time
were in the same cluster. at the time it was 4k pages and 10-page
clusters.

--
Anne & Lynn Wheeler | ly...@garlic.com - http://www.garlic.com/~lynn/

Stephen Fuld

unread,
Dec 22, 2000, 5:06:37 PM12/22/00
to

"Rob Young" <you...@eisner.decus.org> wrote in message
news:fg+Nwr...@eisner.decus.org...


I absolutely agree. But to give this a slightly different perspective,
cache (any cache) is primarily a latency (or access time) improver. It can
help a lot when there is reference locality, either spatial or temporal. It
can improve bandwidth for modest sized sequential runs, but for runs longer
than the cache size, the cache is of marginal, at best, benefit. If you
really need 200 MB/sec for a long time (runs >> than the cache size) the
Symmetrix is not a particularly good solution. This is not the type of
application that will benefit from a cache. For such long runs, the access
times are not a big factor and striping without a cache, but with prefetch
buffers, is probably the best solution.

--
- Stephen Fuld


snip


> Rob
>


Terje Mathisen

unread,
Dec 22, 2000, 6:13:56 PM12/22/00
to
Bill Todd wrote:
> Contemporary high-volume disks have average access times on the order of 12
> ms. (8 ms. for a 1/3-stroke seek, 4 ms. for half a rotation) and average
> transfer rates of around 30 MB/sec, which would suggest something like a 64
> KB page size (adding about 2 ms. to the base 12 ms. figure) as an upper
> limit from that viewpoint (and 64 KB also happens to fall within the range
> of competence of IDE disks, though their limitations in this area seem to be
> on the verge of disappearing). 32 KB should be fine even with significant
> queue optimization.

Doesn't this kind of thinking indicate that the page size should be
something like 256K or 512K, simply because that makes the transfer time
comparable to the seek time?

Terje

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Bill Todd

unread,
Dec 23, 2000, 12:08:22 AM12/23/00
to

Terje Mathisen <terje.m...@hda.hydro.com> wrote in message
news:3A43E034...@hda.hydro.com...

> Bill Todd wrote:
> > Contemporary high-volume disks have average access times on the order of
> > 12 ms. (8 ms. for a 1/3-stroke seek, 4 ms. for half a rotation) and average
> > transfer rates of around 30 MB/sec, which would suggest something like a
> > 64 KB page size (adding about 2 ms. to the base 12 ms. figure) as an upper
> > limit from that viewpoint (and 64 KB also happens to fall within the range
> > of competence of IDE disks, though their limitations in this area seem to
> > be on the verge of disappearing). 32 KB should be fine even with
> > significant queue optimization.
>
> Doesn't this kind of thinking indicate that the page size should be
> something like 256K or 512K, simply because that make the transfer time
> comparable to the seek time?

My point was that for random accesses targeting data of modest size (i.e.,
where the likelihood of any additional adjacent data that is brought in on
an access being useful is small) you want the transfer time to be small
compared to the access time so as not to noticeably compromise the number of
accesses per unit time that the disk can sustain. If instead the transfer
time is comparable to the positioning time, then the disk access rate will
drop to about half this value, and the extra data you bring in won't do you
any measurable good (that being the assumption above; if it *would* be
useful, then by all means you want to use larger transfers, in some cases
megabytes - but you can always group small units into larger ones under such
circumstances, whereas if your base transfer size is large you're stuck with
its overhead whether the data is useful or not).
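
To make the trade-off concrete, a small sketch using assumed round figures
(~12 ms average positioning, ~30 MB/sec sustained media rate), not the exact
numbers from any particular drive:

/* Random-access rate as a function of transfer size. */
#include <stdio.h>

int main(void)
{
    const double position_ms = 12.0, media_mb_s = 30.0;
    const double sizes_kb[] = { 4, 32, 64, 256, 512, 4096 };
    for (int i = 0; i < 6; i++) {
        double xfer_ms = sizes_kb[i] / 1024.0 / media_mb_s * 1000.0;
        printf("%6.0f KB: %5.1f ms/access -> %5.1f accesses/sec\n",
               sizes_kb[i], position_ms + xfer_ms,
               1000.0 / (position_ms + xfer_ms));
    }
    /* Roughly: 64 KB gives ~71 accesses/sec, 256 KB ~49, 512 KB ~35 --
     * large fixed transfers cost real random-access throughput. */
    return 0;
}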

- bill

Anton Ertl

unread,
Dec 23, 2000, 8:06:56 AM12/23/00
to
In article <slrn9401lj...@potty.housenet>,
jep...@inetnebr.com (Jeff Epler) writes:
>Each process uses at least 3 pages of memory (code, heap, stack). My
>machine currently has 43 processes running. The kernel uses at least
>4 pages of memory (code, heap, stacks, and the lowest 4M page which is
>made unusable on the PC architecture)
>
>Each memory-mapped object, such as a file or shared library, takes at least 1
>page of memory. My machine currently has 88 mapped objects, including
>executables.

How do you count that? I currently have 42 processes running; with

cat /proc/[123456789]*/maps|wc

I get 780 mappings.

Now I try to find the unique mapped objects with

cat /proc/[123456789]*/maps|sort -k 3|uniq -f 2 -c|wc

and get 134. However, that includes 182 mappings of "00000000 00:00 0",
which are probably copy-on-write, and probably have been changed,
i.e., require different pages in reality; some other mapped objects
probably also are copy-on-write and are modified (e.g., data segments
of executables).

>This means that my system needs a minimum of 178 pages.

Why do you think a VM object needs several pages?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Terje Mathisen

unread,
Dec 23, 2000, 1:13:11 PM12/23/00
to
Bill Todd wrote:
>
> Terje Mathisen <terje.m...@hda.hydro.com> wrote in message
> > Doesn't this kind of thinking indicate that the page size should be
> > something like 256K or 512K, simply because that make the transfer time
> > comparable to the seek time?
>
> My point was that for random accesses targeting data of modest size (i.e.,
> where the likelihood of any additional adjacent data that is brought in on
> an access being useful is small) you want the transfer time to be small
> compared to the access time so as not to noticeably compromise the number of
> accesses per unit time that the disk can sustain.

Exactly my point!

If the transfer time is comparable to or less than the access/seek time,
then the maximum speedup you could get would be less than 50%, while the
effective bandwidth drops toward zero.

I'll accept that the optimal point might be less than 256/512, but the
difference between 64 and 256 is not really significant. I guess we both
agree that 4K is way too small when even laptops come with 128MB
standard.

Bruce Hoult

unread,
Dec 23, 2000, 9:08:05 PM12/23/00
to
In article <3A44EB37...@hda.hydro.com>, Terje Mathisen
<terje.m...@hda.hydro.com> wrote:

> I'll accept that the optimal point might be less than 256/512, but the
> difference between 64 and 256 is not really significant. I guess we both
> agree that 4K is way to small when even laptops come with 128MB
> standard.

Come back FAT, all is forgiven!

The problem with big allocation blocks is of course all the programs
(and users) that insist on making zillions of tiny little files. Such
as ... ahem ... most usenet software.

You want some way to allow small files to be small while at the same
time not fragmenting big files too much -- as you say, 64 KB - 256 KB
chunks are probably pretty good as a target.

-- Bruce

Terje Mathisen

unread,
Dec 24, 2000, 5:32:33 AM12/24/00
to

Hello Bruce, and Merry Christmas (yes, it is Christmas here in Norway
now)!

This discussion was actually regarding the need for memory allocation
blocks (pages) somewhere between the classic 4K and the 'new' (in x86)
4MB.

As long as you also support 'small' pages, then it should not be a
problem to set up a block of any size: first fill it with as many
large blocks as will fit, then fill up the remainder with small blocks.

This is more or less analogous to the case you brought up with file
system space allocation, where most modern systems seem to use some
combination of large blocks plus suballocation to handle the tail end.

Having a full set of powers-of-two block sizes gives an optimal
solution to the problem of minimizing the number of blocks needed to cover
any given request, but even having just two sizes helps a lot (as sketched
below).
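
A toy sketch of that "big blocks first, small blocks for the tail" idea; the
intermediate page sizes here are hypothetical (plain x86 only offers 4K and
4M), chosen just to show the decomposition:

/* Cover an arbitrary region with power-of-two pages, largest first. */
#include <stdio.h>

int main(void)
{
    long remaining_kb = 1300;                      /* example request */
    const long sizes_kb[] = { 4096, 1024, 256, 64, 16, 4 };
    for (int i = 0; i < 6; i++) {
        long n = remaining_kb / sizes_kb[i];
        if (n)
            printf("%ld page(s) of %ld KB\n", n, sizes_kb[i]);
        remaining_kb -= n * sizes_kb[i];
    }
    /* 1300 KB -> 1 x 1024 KB + 1 x 256 KB + 1 x 16 KB + 1 x 4 KB */
    return 0;
}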

On x86, the main benefit of the 4MB pages would seem to be that it is
easy to cover a large frame buffer, plus any large/static OS blocks,
without using up all the 4K TLB entries.

Anyway, the TLB is just another form of cache, with a constant
struggle between keeping it fast and large enough to avoid thrashing.

Anton Ertl

unread,
Dec 24, 2000, 6:04:19 AM12/24/00
to
In article <3A43E034...@hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
>Bill Todd wrote:
>> Contemporary high-volume disks have average access times on the order of 12
>> ms. (8 ms. for a 1/3-stroke seek, 4 ms. for half a rotation) and average
>> transfer rates of around 30 MB/sec, which would suggest something like a 64
>> KB page size (adding about 2 ms. to the base 12 ms. figure) as an upper
>> limit from that viewpoint (and 64 KB also happens to fall within the range
>> of competence of IDE disks, though their limitations in this area seem to be
>> on the verge of disappearing). 32 KB should be fine even with significant
>> queue optimization.
>
>Doesn't this kind of thinking indicate that the page size should be
>something like 256K or 512K, simply because that make the transfer time
>comparable to the seek time?

If you do a lot of paging, this will probably help; of course there
may be access patterns where a smaller page size does not increase the
number of page faults (or may even decrease it because more pages can
be kept in RAM).

However, nowadays paging is hopefully an unusual condition (except for
paging in binaries during first startup, where larger pages would
help; but doing eager loading instead of demand-paging is probably
even faster); we should also look at how the page size affects the
more usual usage. The effects I see:

1) TLB misses: larger pages will reduce them.

2) copy-on-write and page clearing: larger, partially used pages will
make that more expensive. The interaction with the cache may make
these activities more expensive even for fully-used pages.

In different applications these two factors have different weights, so
there probably is no generally optimal page size for these
considerations. An adaptive OS might help.
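
To illustrate point 2, a sketch of how the per-fault cost scales with page
size, assuming ~1 GB/s for the kernel's clear/copy loop; that bandwidth
figure is an assumption, used only to show the scaling:

/* Cost of the per-page work (clearing a fresh page, or copying a page on a
 * copy-on-write fault) as the page size grows. */
#include <stdio.h>

int main(void)
{
    const double mem_gb_s = 1.0;      /* assumed clear/copy bandwidth */
    const double sizes_kb[] = { 4, 16, 64, 256, 1024, 4096 };
    for (int i = 0; i < 6; i++)
        printf("%5.0f KB page: ~%7.0f us to clear or copy\n",
               sizes_kb[i],
               sizes_kb[i] / (mem_gb_s * 1024.0 * 1024.0) * 1e6);
    /* ~4 us for a 4 KB page versus ~4 ms for a 4 MB page, paid even if the
     * process only touches one byte of it. */
    return 0;
}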

Dan Foster

unread,
Dec 24, 2000, 6:20:10 AM12/24/00
to
In article <bruce-51DAE4....@news.nzl.ihugultra.co.nz>,

Bruce Hoult <br...@hoult.org> wrote:
>Come back FAT, all is forgiven!

:)

>The problem with big allocation blocks is of course all the programs
>(and users) that insist on making zillions of tiny little files. Such
>as ... ahem ... most usenet software.

Then that will come back to haunt the USENET administrator at some point :)

Applications can change their storage methods/back-ends. USENET news software
is no exception. This is one of the reasons why it now supports large
cyclical files - say, 2 gigabytes - where all articles get allocated in the
file's data blocks and a database maps from article ID to the actual spot in
the cyclical file. It's not a perfect approach, but it solves a lot of the
original nagging problems.

Or the USENET news administrator, if he/she is really so set on retaining
the traditional storage method, will simply have to *really* plan ahead and
make a filesystem with unheard-of available inode counts :) And also eat any
penalty incurred from this scheme, including wasted disk space, extra I/Os,
busting the inode cache in certain situations, etc.

It's all about how people want to use their applications, which ideally
should be configured to meet their needs, including efficient
performance and building in room for some long-term growth.

If an application remains static and its needs grow pretty large over
time... I don't quite think changing filesystems is the magic solution per se.
It only puts off the resolution for a bit longer.

>You want some way to allow small files to be small while at the same
>time not fragmenting big files too much -- as you say, 64 KB - 256 KB
>chunks are probably pretty good as a target.

There's at least one interesting filesystem that apparently tries to find
ways to optimize both ends of the spectrum reasonably well -- reiserfs.

Docs might be at http://www.namesys.com but it's down right now, so I have
no way of verifying the current reiserfs home page URL. It's been about
4 years since I last read the seminal white paper by Hans Reiser, and do
(vaguely) recall it was interesting reading.

-Dan

Casper H.S. Dik - Network Security Engineer

unread,
Dec 24, 2000, 6:51:47 AM12/24/00
to
[[ PLEASE DON'T SEND ME EMAIL COPIES OF POSTINGS ]]

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

>2) copy-on-write and page clearing: larger, partially used pages will
>make that more expensive. The interaction with the cache may make
>these activities more expensive even for fully-used pages.


Copy-on-write operations themselves become more expensive; however,
marking pages for copy-on-write becomes less expensive as you have
far fewer pages.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Miha Peternel

unread,
Dec 24, 2000, 8:37:48 AM12/24/00
to
In article <91tecn$een$1...@nnrp1.deja.com>, ana...@earthlink.net says...
> I'd guess that 1024 mappable objects would suffice for 6-80% of
> multi-programmed systems, that 5% would be happy with 16, 3-50%
> could live with 128-256, and that 5-10% require at least 16k.

Let's not forget the history...

If a problem is better suited for a small number of variably sized
large pages, the appropriate architectural solution is actually:

SEGMENTATION.

Anyway, now that we know that small pages are very useful and not a
performance hit, there's no reason to suddenly switch to really big
pages. The OS can cluster pages to optimize swapping and use bigger
pages only where they are better suited.

Miha

McCalpin

unread,
Dec 21, 2000, 11:46:19 AM12/21/00
to
In article <91rob...@news2.newsguy.com>, <TTK Ciar> wrote:
>>[...] There is a compromise between
>>TLB misses, paging, and number of addressable units available.
>
> Isn't this why some systems have variable-sized pages?
> Is there a significant downside to using variable-sized pages?

In the interests of clarity, it would be better to call this
"simultaneous support of multiple page sizes".

As I understand it, some of the Compaq OS's for Alpha support
both the default (small) page size and a large page size (4 MB).

SGI's IRIX allows any process to access almost all of the hardware-
supported page sizes simultaneously: 16kB, 64kB, 256kB, 1M, 4M, 16M.
It does not support the older 4kB page size, but since this machine
is a server with minimum configs having lots of real memory, this
does not really make much difference.

Intel's IA64 family supports multiple page sizes, up to even larger
sizes than supported on the SGI MIPS R1x000-based machines. I don't
know much about O/S support for this feature.

There is certainly a great deal of complexity added to the software
and the hardware in order to support this. There are inevitably
compromises in performance as well, but it is relatively easy to
identify application/OS areas where the decreased TLB miss rate
obtained by using the large pages is larger than the additional
penalty involved in instantiating the pages and managing higher-
level TLB operations.
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Scientist IBM POWER Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."

amoli...@visi-dot-com.com

unread,
Dec 24, 2000, 11:02:27 AM12/24/00
to
In article <924m5a$8l0$1...@node17.cwnet.frontiernet.net>,

Dan Foster <d...@frontiernet.net> wrote:
>In article <bruce-51DAE4....@news.nzl.ihugultra.co.nz>,
>Bruce Hoult <br...@hoult.org> wrote:
>>Come back FAT, all is forgiven!
>
>:)
>
>>The problem with big allocation blocks is of course all the programs
>>(and users) that insist on making zillions of tiny little files. Such
>>as ... ahem ... most usenet software.
>
>Then that will come back to haunt the USENET administrator at some point :)
>
>Applications can change its storage methods/back-ends. USENET news software
>is no exception. This is one of the reasons why it now supports large
>cyclical files - say, 2 gigabytes, all articles gets allocated in its data
>blocks, and there's a database to map from article ID to actual spot in
>the cyclical file. It's not a perfect approach, but solves a lot of the
>original nagging problems.
>
>Or the USENET news administrator, if he/she is really so bound on retaining
>the traditional storage method, will simply have to *really* plan ahead and
>make a filesystem with unheard-of available inode counts :) And also eat any
>penalty incurred from this scheme, including wasted disk space, extra I/Os,
>busting inode cache in certain situations, etc.

Absolutely. The RIGHT way to implement a filesystem is to make
it work very badly for files, and force applications to implement a
filesystem suitable for, well, files, inside the OS filesystem. Does
this log-structured filesystem-in-a-file scheme sound as insane to
anyone else as it does to me?

Jeff Epler

unread,
Dec 23, 2000, 11:40:39 PM12/23/00
to
On 23 Dec 2000 13:06:56 GMT, Anton Ertl

<an...@mips.complang.tuwien.ac.at> wrote:
>In article <slrn9401lj...@potty.housenet>,
> jep...@inetnebr.com (Jeff Epler) writes:
>>Each process uses at least 3 pages of memory (code, heap, stack). My
>>machine currently has 43 processes running. The kernel uses at least
>>4 pages of memory (code, heap, stacks, and the lowest 4M page which is
>>made unusable on the PC architecture)
>>
>>Each memory-mapped object, such as a file or shared library, takes at least 1
>>page of memory. My machine currently has 88 mapped objects, including
>>executables.
>
>How do you count that? I currently have 42 processes running; with
>
>cat /proc/[123456789]*/maps|wc
>
>I get 780 mappings.

Sounds about like my system.


>
>Now I try to find the unique mapped objects with
>
>cat /proc/[123456789]*/maps|sort -k 3|uniq -f 2 -c|wc

I did something like
cat /proc/[1-9]*/maps | awk '{print $6}' | sort -u
mine seems to count things such as
1 08048000-0804b000 r-xp 00000000 03:03 163165 /usr/sbin/atd
1 0804b000-0804d000 rw-p 00002000 03:03 163165 /usr/sbin/atd
as a single instance, though it's actually mapped in two sections. (code vs
data?)

>>This means that my system needs a minimum of 178 pages.
>
>Why do you think a VM object needs several pages?

178 pages was the sum of unique mmapped files, 3 pages (?) for the kernel, and
two read-write pages for each task (stack and heap). While most of these
objects are more than one page, I was trying to arrive at a minimum figure,
under the unixy assumption that each mapped object is at least a page, and
that stack, heap, and code all reside in separate mapped objects.

By the way, only three objects seem to be mapped larger than 4M on my system..
The most popular map size seems to be 16k.

Jeff

Stefan Monnier <foo@acm.com>

unread,
Dec 24, 2000, 12:36:47 PM12/24/00
to
>>>>> "amolitor-at" == amolitor-at <amoli...@visi-dot-com.com> writes:
> Absolutely. The RIGHT way to implement a filesystem is to make
> it work very badly for files, and force applications to implement a
> filesystem suitable for, well, files, inside the OS filesystem. Does
> this log-structured filesystem-in-a-file scheme sound as insane to
> anyone else as it does to me?

On the one hand, I can only agree that it should not be necessary, but on the
other it seems difficult to implement a filesystem that can provide POSIX like
semantics while at the same time also providing the best performance for other
files that do not require POSIX semantics (in the case of a news server: no
need for per-file access rights and time-stamps, no need to modify a file, no
need to remove files individually, ...).


Stefan

Bill Todd

unread,
Dec 24, 2000, 2:52:53 PM12/24/00
to

Stefan Monnier <f...@acm.com> <monnier+comp.arch/news/@flint.cs.yale.edu>
wrote in message news:5lvgs94...@rum.cs.yale.edu...

> >>>>> "amolitor-at" == amolitor-at <amoli...@visi-dot-com.com> writes:
> > Absolutely. The RIGHT way to implement a filesystem is to make
> > it work very badly for files, and force applications to implement a
> > filesystem suitable for, well, files, inside the OS filesystem.

Thanks for stating my own view more trenchantly than I might have chosen to.

> > Does this log-structured filesystem-in-a-file scheme sound as insane to
> > anyone else as it does to me?

Well, given a requirement to use an existing file system that doesn't meet
one's needs, it's understandable. It's the leap to suggesting that because
some, possibly even most, file systems may not handle this application well
that it's outside the realm of applications that a file system *should* be
able to handle well that, for me, leaves the realm of sanity.

>
> On the one hand, I can only agree that it should not be necessary, but on the
> other it seems difficult to implement a filesystem that can provide POSIX like
> semantics while at the same time also providing the best performance for other
> files that do not require POSIX semantics (in the case of a news server: no
> need for per-file access rights

Compact access-control information should be able to reside in the inode (or
other per-file root structure) - right beside the data for files up to some
reasonable size (I'd pick just under 32 KB or 64 KB for the reasons I've
noted elsewhere). Applications (news servers, databases) that perform their
own external access control do not require lengthy ACLs, and the CPU
overhead of checking a single ACE should be negligible.

> and time-stamps,

A file system concerned with performance should offer options for batching
timestamp updates (as opposed to making them persistent on every read and
write operation, though that option may also be necessary, at least for
writes) and for turning them off altogether where they're
counter-productive.

> no need to modify a file,

Such facilities usually add more to a file system's complexity than to its
overhead.

> no need to remove files individually, ...).

Maintaining a transaction log such that directories can be bulk-updated
without sacrificing individual operation persistence removes most of the
pain of individual removal as part of a bulk operation. My impression is
that SGI's XFS provides at least some support for batch-updates (in all
contexts, not just Removes) backed by a log.

I don't know why file systems constitute such a back-water of technology as
they usually do: *I* at least find them interesting, and their (remediable)
performance failings seem to contribute disproportionately to overall system
sluggishness in a great many contexts. For example, an earlier posting
moaned about the need to preallocate ridiculously large inode counts and
waste disk space as if the need to preallocate inodes, and indeed the need
for inodes to have some fixed size, were some kind of natural law rather
than a remnant of a decades-old on-disk structure that was nothing to brag
about (save perhaps for its simplicity) even at the time it was designed.

- bill

>
>
> Stefan


Bill Todd

unread,
Dec 24, 2000, 2:56:34 PM12/24/00
to

Terje Mathisen <terje.m...@hda.hydro.com> wrote in message
news:3A44EB37...@hda.hydro.com...

> Bill Todd wrote:
> >
> > Terje Mathisen <terje.m...@hda.hydro.com> wrote in message
> > > Doesn't this kind of thinking indicate that the page size should be
> > > something like 256K or 512K, simply because that make the transfer time
> > > comparable to the seek time?
> >
> > My point was that for random accesses targeting data of modest size (i.e.,
> > where the likelihood of any additional adjacent data that is brought in on
> > an access being useful is small) you want the transfer time to be small
> > compared to the access time so as not to noticeably compromise the number
> > of accesses per unit time that the disk can sustain.
>
> Exactly my point!
>
> If the transfer time is comparable or less than the access/seek time,
> then the maximum speedup you could get would be less than 50%, while the
> effective bandwidth dropped to zero.

If my need is for small random accesses, a near-zero bandwidth is exactly
what I want: why should I give up close to 50% of my access performance
(and consume bus and memory cycles) fetching data of no use to me?

>
> I'll accept that the optimal point might be less than 256/512, but the
> difference between 64 and 256 is not really significant.

Only if you believe (I don't) that a performance improvement of about 75%
(from about 40 accesses/sec/disk to about 70, for an average 7200 rpm drive)
is not really significant. Especially given that the improvement comes at
zero cost (in fact, you also reduce unnecessary load on other system
components like memory and busses).

I guess we both
> agree that 4K is way to small when even laptops come with 128MB
> standard.

512 bytes is not too small for a disk transfer if that's all the data you
need. The only question is whether supporting such small transfers costs
you something elsewhere. The propensity in recent systems to tie disk
transfer sizes to system memory page sizes (to support 'unified buffer
caching') need not compromise small disk transfers if the system is willing,
e.g., to fetch randomly-accessed data and track dirty data on a
per-disk-sector (rather than per-page) basis in its cache; if not, then
above a page size of (at present) something like 32 KB or 64 KB small random
disk access performance suffers, even if memory were free and address bits
infinite (and since they aren't, nor are memory and bus bandwidth, there are
some small negative effects even at smaller page sizes if they force you to
fetch and store more data than you need to).

- bill

Terje Mathisen

unread,
Dec 24, 2000, 5:42:20 PM12/24/00
to
Bill Todd wrote:
>
> Terje Mathisen <terje.m...@hda.hydro.com> wrote in message
> > I'll accept that the optimal point might be less than 256/512, but the
> > difference between 64 and 256 is not really significant.
>
> Only if you believe (I don't) that a performance improvement of about 75%
> (from about 40 accesses/sec/disk to about 70, for an average 7200 rpm drive)
> is not really significant. Especially given that the improvement comes at
> zero cost (in fact, you also reduce unnecessary load on other system
> components like memory and busses).

This would only work if all disk accesses were for paging; I believe I
also stated that a reasonable file system should support suballocation.


>
> I guess we both
> > agree that 4K is way to small when even laptops come with 128MB
> > standard.
>
> 512 bytes is not too small for a disk transfer if that's all the data you
> need.

The 4K was related to memory page size, where you want a reasonable
number of simultaneously active pages, relative to your TLB size, to
avoid thrashing.

I also prefer sector-based (i.e. usually 512 bytes) sub-allocation in
the file system, but this doesn't mean that the CPU and OS must support
512-byte memory pages.


> The only question is whether supporting such small transfers costs
> you something elsewhere. The propensity in recent systems to tie disk
> transfer sizes to system memory page sizes (to support 'unified buffer
> caching') need not compromise small disk transfers if the system is willing,
> e.g., to fetch randomly-accessed data and track dirty data on a
> per-disk-sector (rather than per-page) basis in its cache; if not, then
> above a page size of (at present) something like 32 KB or 64 KB small random
> disk access performance suffers, even if memory were free and address bits
> infinite (and since they aren't, nor are memory and bus bandwidth, there are
> some small negative effects even at smaller page sizes if they force you to
> fetch and store more data than you need to).

OK, I agree with you here.

Greg Pfister

unread,
Dec 23, 2000, 12:31:35 AM12/23/00
to
Bill Todd wrote:
>
[snip]

> My point was that for random accesses targeting data of modest size (i.e.,
> where the likelihood of any additional adjacent data that is brought in on
> an access being useful is small) you want the transfer time to be small
> compared to the access time so as not to noticeably compromise the number of
> accesses per unit time that the disk can sustain. ...
[snip some more]

Synchronization check:

Bill, I think you are talking about support of multiple different
applications running simultaneously, presenting what amounts to a
"random" load to the system. This is often reasonable, and the
starting point of many people doing OSs and file systems,
particularly thinking in terms of commercial applications.

On the other hand, I suspect that Terje Mathisen, Steve
Crockett, and others are mostly talking about running one
whopping big application that sucks down the whole machine and
grinds for a long time. This is also reasonable, and is typically
the starting point of people doing big technical and scientific
applications.

I suspect that this means both sides are right in their
respective domains.

Greg Pfister
<not necessarily my employer's opinion>

Erik Corry

unread,
Dec 25, 2000, 5:09:45 AM12/25/00
to
"Terje Mathisen" <3A43E034...@hda.hydro.com> wrote:

> Doesn't this kind of thinking indicate that the page size should be
> something like 256K or 512K, simply because that make the transfer time
> comparable to the seek time?

I get 25 Mbyte/s sustained from new disks according to hdparm, so
that makes for 10-20ms, which seems a little high when the seek times
are now 5-10ms.

But yes, it probably indicates that the OS should try to cluster
pages in groups of 128-256kbytes when scheduling for swap in and
swap out. This is what Linux 2.4 tries to do, I think (Stephen
Tweedie has been working on it).

There are various reasons why you might want to make individual pages
smaller than that. Copy on write in connection with fork and linker
fixups, support for lots of small processes and other stuff mentioned
in this thread.

--
There's really no way to fix this, and still keep Perl pathologically eclectic
--
Erik Corry er...@arbat.com Ceterum censeo, Microsoftem esse delendam!

Terje Mathisen

unread,
Dec 25, 2000, 6:08:00 AM12/25/00
to

Thanks, Greg, and Merry Christmas!

I did actually include small and big apps, but with particular concern
that the single GB-sized app wouldn't make everything else grind to a
halt.

I'd like to compliment MIPS (and I guess IA64) on having a series of
page sizes; this is 'The Right Thing', as long as you can get both the
HW and OS to do it correctly and the added complexity doesn't eat up
the gain. :-)

BTW, it really bothers me to see obviously bad results from otherwise
reasonable assumptions. Case in point: task migration on SMP WinNT
systems:

When I use Panavue's Image Assembler to stitch together huge bitmaps, I
often (as I mentioned previously) need 300-400 MB of RAM for that single
process.

According to the OS, this single-threaded task is using exactly 50% of
the CPU, i.e. 100% of one CPU, with the other CPU idle.

Several times per second, the OS will notice this, and promptly migrate
the process to the other CPU, incidentally blowing away any useful data
from the CPU caches.

If I then manually tweak the process, by setting its CPU affinity to
just one of the CPUs, both IA and all other tasks run better.

Any good ideas for MS on how to avoid this behaviour? Would SMP Linux
make equally bad decisions?
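
For reference, the manual tweak can also be done programmatically; here is a
minimal Win32 sketch (SetProcessAffinityMask and GetCurrentProcess are real
Win32 calls; pinning to the first CPU and the minimal error handling are
simplifying assumptions):

/* Pin the current process to CPU 0 so the scheduler stops bouncing it. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 1;   /* bit 0 = first processor */
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("pinned to CPU 0; caches stay warm\n");
    /* ... run the memory-hungry work here ... */
    return 0;
}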

tto...@bio.vu.nl

unread,
Dec 25, 2000, 6:53:37 AM12/25/00
to
Terje Mathisen wrote:

> Any good ideas for MS on how to avoid this behaviour? Would SMP Linux
> make equally bad decisions?

Move processes that take little CPU. These are essentially free to be
moved, as they do not get contiguous timeslices. Sleeping tasks are the
first ones to be migrated.

Set affinity to make a process stick to the CPU it has been running on.
The higher the CPU utilization was, the higher the stickiness. Maybe
that can be done with a very simple rule: do not migrate a task if it has
had exclusive use of the CPU for the two previous timeslices.

The second rule gives you some chance to pick out low-CPU processes to
move away. Mmm, I wonder if this will work.
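
A sketch of that rule in C (every name and field here is hypothetical, just
to make the heuristic concrete; it is not any real scheduler's code):

#include <stdio.h>

struct task {
    int last_cpu;            /* CPU the task ran on last                  */
    int exclusive_slices;    /* consecutive slices with the CPU to itself */
};

/* Only allow migration if the task did NOT have the CPU to itself for the
 * last two timeslices. */
static int may_migrate(const struct task *t, int target_cpu)
{
    if (target_cpu == t->last_cpu)
        return 1;                      /* staying put is always fine   */
    return t->exclusive_slices < 2;    /* CPU hogs stick to their CPU  */
}

int main(void)
{
    struct task hog  = { 0, 5 };  /* has had CPU 0 to itself for 5 slices */
    struct task idle = { 0, 0 };
    printf("hog may move to CPU 1:  %d\n", may_migrate(&hog, 1));   /* 0 */
    printf("idle may move to CPU 1: %d\n", may_migrate(&idle, 1));  /* 1 */
    return 0;
}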


Thomas

Terje Mathisen

unread,
Dec 25, 2000, 8:49:17 AM12/25/00
to
tto...@bio.vu.nl wrote:
>
> Terje Mathisen wrote:
>
> > Any good ideas for MS on how to avoid this behaviour? Would SMP Linux
> > make equally bad decisions?
>
> Move processes that take little CPU. These are essentially free to be
> moved as they fo not get contiguous timeslices. Sleeping tasks are the
> first ones to be migrated.

I believe this is what NT does first, but when all processes are more or
less idle on one CPU and the cpu-hog on the other cpu is at 100%, it
still migrates the hog.

I just wonder what kind of scheduling heuristic can come up with a
result like this. :-(

> Set affinity to make a process stick to the CPU it had been running on.
> The higher the CPU utilization was, the higher the stickyness. Maybe
> that can be done with a very simple rule: do not migrate if you have had
> the two previous timeslices exclusive use of the CPU.

Sounds reasonable, esp. if you add some consideration of the current
working size: Multi-MB apps are _probably_ not a good choice to migrate.

Terje

PS. Merry Christmas everyone!

Paul Repacholi

unread,
Dec 25, 2000, 10:22:03 AM12/25/00
to
mXr...@email.com (Miha Peternel) writes:

Agree, plus another factor.

If we take a page miss, and have no pages on the clean list, we
then have 4MB of memory latency in the kernel. Possibly in
interrupt mode for good measure.

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
Raw, Cooked or Well-done, it's all half baked.

Stefan Monnier <foo@acm.com>

unread,
Dec 25, 2000, 12:52:29 PM12/25/00
to
>>>>> "Terje" == Terje Mathisen <terje.m...@hda.hydro.com> writes:
> If I then manually tweak the process, by setting its cpu affinity to
> just one of the CPUs, both IA and all other tasks runs better.
> Any good ideas for MS on how to avoid this behaviour? Would SMP Linux
> make equally bad decisions?

I don't know if Linux would be better, but I remember similar problems
being discussed on the linux-kernel mailing-list. The basic reason for
such behavior was something like:
- cpu-hog process 1 on CPU 1.
- process 2 wakes up, run on CPU 2.
- process 3 wakes up, run on CPU 1 (temporarily putting the lower-priority
cpu-hog process 1 to sleep).
- process 2 blocks on something, goes to sleep.
- process 1 can now be woken up and CPU 1 is busy but CPU 2 is idle, so it
gets scheduled on CPU 2.

The affinity heuristic used in the scheduler should make sure that the
scheduler doesn't too eagerly move process 1 to CPU 2 but instead either
move process 3 to CPU 2 or wait a while (until process 3 blocks or runs
out of its time-slice).

Tuning the scheduler to do the right thing seemed non-trivial because
process 3 might also be big (like the Xserver responding to a request
from process 2) and thus costly to move as well.
Also I don't think that Linux's scheduler would know to stop process 3 and move
it to CPU 2, so it chooses to wait instead.


Stefan

Bill Todd

unread,
Dec 25, 2000, 2:03:28 PM12/25/00
to

Greg Pfister <pfi...@us.ibm.com> wrote in message
news:3A4438B7...@us.ibm.com...

> Bill Todd wrote:
> >
> [snip]
> > My point was that for random accesses targeting data of modest size (i.e.,
> > where the likelihood of any additional adjacent data that is brought in on
> > an access being useful is small) you want the transfer time to be small
> > compared to the access time so as not to noticeably compromise the number
> > of accesses per unit time that the disk can sustain. ...
> [snip some more]
>
> Synchronization check:
>
> Bill, I think you are talking about support of multiple different
> applications running simultaneously, presenting what amounts to a
> "random" load to the system. This is often reasonable, and the
> starting point of many people doing OSs and file systems,
> particularly thinking in terms of commercial applications.

Well, non-sequential access patterns are not the exclusive province of
multiple applications: a single application (a typical database being
perhaps the most obvious example, even when dedicated to a single
application) can exhibit them too, and accesses secondary storage at 2 KB or
32 KB granularity for most I/O (IIRC for Oracle, anyway, though those sizes
may have evolved over time). While some database objects may be made up of
multiple 32 KB pages that the database may be able to make adjacent, and
while instances of such objects may increase in the future as more stored data
moves in multi-media directions, at present the overwhelming majority of
database accesses are still, I suspect, very small compared with the 4 MB
page size under discussion here.

My original post on the subject qualified my comments as being solely
related to the interaction of page size with minimum disk access (not disk
allocation, which is a different issue) granularity, but that may have been
overlooked in some of the subsequent discussion. As long as system memory
page size does not place an effective minimum on disk-access (and caching,
and file update) granularity, large memory pages are fine with me if that's
what other system page use finds appropriate. But my impression is that
common system cache implementations do in fact tie cache and update
granularity (save at end of file) to system memory page size, and when
that's the case as a file system or application designer who may want
high-performance fine-granularity random access I want at least relatively
small memory pages - certainly no larger than 64 KB for
disk-access-performance reasons, and smaller yet if physical memory
available for caching is at all limited. If a system can support multiple
page sizes, I'll try to use them effectively: some file objects *are* big.
However, the utility of caching objects tends to vary inversely with their
size when memory available for caching is limited, and the ability to use
memory-mapping mechanisms to facilitate efficient caching of varying object
sizes is valuable, so reasonably fine-grained page sizes remain desirable
(for file-related activity) if one must choose only one page size.

- bill

Douglas Siebert

unread,
Dec 25, 2000, 11:32:26 PM12/25/00
to
Steve Crockett <s...@sgi.com> writes:

>> > Take a common medium-end PC with 64 or 128 megs of RAM.
>> >
>> > With 4M pages, you get 16 or 32 pages worth of memory. Imagine how Linux
>> > would run on a system such as this.
>>
>> OK, 4M pages seem silly.


>No, they don't. Imagine instead how running a proprietary Unix on
>a system with 128 processors or more using shared memory
>with 2 GB/processor (e.g., 256 GB memory) would do with 4K pages.
>Got any interest in trying to manage 67 million pages,
>and still do useful computing?

>Note that this is comp.arch, not comp.pc.arch or
>comp.i'm-too-cheap.arch or even comp.os.linux.advocacy.
>Some of the people who post and lurk here have multi-million
>dollar budgets for hardware, and have actual needs for systems
>similar to the one mentioned above.


This is all quite silly. There is no reason systems can't be designed so
that they work as both a desktop and a supercomputer, if a bit of thought
is put into the design to make it flexible. HP-UX and the PA-RISC CPU
fully support page sizes of 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, and 256M.
The OS will select what it thinks is the "right" page size for a process
at startup time, or you can set attributes on the executable to tell it
what page sizes you want (independently for instruction and data). So it
supports both lightweight processes (so your shell scripts won't crawl
or use up all the available memory) as well as programs with gigabytes of
data, on the same system, no worries.

From other posts in this thread, it sounds like IRIX does the same. I
don't believe that HP and SGI's engineers are substantially more brilliant
than those who work for Sun, IBM and Compaq, or the guys who hack the
Linux kernel. And while they certainly are more brilliant than those who
work for Microsoft, by definition, even the Microserfs could probably
code this support into Win2K, given a CPU that supports it, and a bean
counter who puts it on their to do list.

So if the Hammer supports 4K and 4M, any reasonable OS ought to allow
both to coexist on the same system, to support everyone's needs. Too
bad there isn't a better range, but that's better than trying to force
everyone to 4K or everyone to 4M.

--
Doug Siebert
dsie...@excisethis.khamsin.net

If at first you don't succeed, skydiving is not for you.

Douglas Siebert

unread,
Dec 25, 2000, 11:51:21 PM12/25/00
to
"Stephen Fuld" <s.f...@worldnet.att.net> writes:

>"Pete Zaitcev" <zai...@yahoo.com> wrote in message
>news:slrn9453ek....@js006.zaitcev.lan...
>>
>> EMC went out of its way to market insanely huge caches
>> in its controllers, but I am not convinced. Same intelligence
>> that "analyze access patterns" can reside in the host just
>> as easily.
>>

>One advantage of storage cache that you didn't mention is that the cache in
>the storage system is non-volatile and isolated from the direct effects of
>the operating system, so writes survive both power outages and OS crashes
and are more likely to survive an errant store by the OS, whereas RAM in the
>CPU may not be non-volatile and usually gets lost upon OS failure.


Plus, EMC's market is mainly multiply attaching the same storage system to
several hosts. The host can't analyze access patterns that are happening
in other hosts. The hosts are sometimes completely different. I.e., IBM
mainframe, Sun, Linux and NT all hooked up to the same EMC frame -- it
probably doesn't happen as often as EMC sales reps would want you to
believe, but for certain customers this is quite important.

Certainly one should strongly consider storage systems with the intelligence
based in the host, since they will be less expensive, and get upgraded
every time you upgrade the host. But while they may be appropriate for
you, they are not appropriate, or even possible, for everyone.

Stephen Fuld

unread,
Dec 25, 2000, 3:00:03 PM12/25/00
to
"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:3A467BCC...@hda.hydro.com...


Don't get too wedded to 512 byte disk blocks. A couple of years ago, I
participated in an IDEMA sub group dealing with the effects of increasing
disk areal density. One of the problems is that, as the density increases,
the signal to noise ratio decreases and more powerful ECCs are needed.
These require more ECC bits to be stored per disk sector. Not really a
problem in general, since the areal density is going up much faster than the
need for ECC bits, but soon the number of ECC bits becomes significant when
compared to the 512 byte sector size, thus limiting effective capacity
growth. Also, some of the required gaps don't scale down with density,
further complicating the problem. The consensus, among the disk engineers,
was a desire to increase the sector size, probably to something like 4K
bytes. This would allow the increases in areal density to be more
effectively translated into increased capacity, thus lower cost. My job on
the subgroup was to talk about the "system" implications (which are large).
But the technology implications may force us there eventually. Note that
this group was talking about something like a 5 year time frame before the
change would become necessary.

--
- Stephen Fuld

Snip

Stephen Fuld

unread,
Dec 25, 2000, 3:00:04 PM12/25/00
to

"Erik Corry" <er...@arbat.com> wrote in message
news:9276d9$7uj$1...@ec.arbat.com...

> "Terje Mathisen" <3A43E034...@hda.hydro.com> wrote:
>
> > Doesn't this kind of thinking indicate that the page size should be
> > something like 256K or 512K, simply because that make the transfer time
> > comparable to the seek time?
>
> I get 25 Mbyte/s sustained from new disks according to hdparm, so
> that makes for 10-20ms, which seems a little high when the seek times
> are now 5-10ms.


25 MB/sec seems low for a current generation disk. Also, you are forgetting
to add the rotational latency to the seek. For a 7200 RPM drive, this adds
about 4 ms; for a 10,000 RPM drive, it adds 3 ms.

But my real issue is with the often expressed "rule of thumb" that the
transfer time should be approximately equal to the seek plus latency. Is
there any empirical evidence that this is a good idea? On a purely
theoretical basis, it appears obvious that Bill Todd is right; you should
read in as much data as you are likely to need. If there is a reasonable
probability of needing the next sector, the cost of increasing the disk
transfer time to include that sector is very small compared with doing
another I/O. On the other hand, reading in that next sector if the
probability of using it is small just wastes resources.

It would seem like the probability of use for the data in the next sector
should be at least greater than the sector's transfer time divided by the
average access time to optimize things. The actual value should be larger
than that ratio to account for the extra use of the rest of the system
resources by the extra data. On the other hand, the extra system (primarily
CPU time) cost of doing an extra I/O should not be ignored.
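
(Plugging in the sort of numbers quoted in this thread makes the threshold concrete; a small C sketch, with the 25 MB/s transfer rate and the roughly 9 ms average access time taken as assumptions.)

    #include <stdio.h>

    /* Break-even probability for reading the next block along with this one:
     * prefetch pays off when p > (block transfer time) / (average access time),
     * ignoring the extra CPU and memory cost of handling the prefetched data. */
    int main(void)
    {
        const double rate   = 25e6;     /* bytes/s, from earlier in the thread */
        const double access = 0.009;    /* ~9 ms seek plus rotational latency  */
        const long sizes[]  = { 512, 4096, 65536 };

        for (int i = 0; i < 3; i++) {
            double xfer = sizes[i] / rate;
            printf("%6ld bytes: transfer %6.3f ms, prefetch if p > %5.2f%%\n",
                   sizes[i], xfer * 1000.0, 100.0 * xfer / access);
        }
        return 0;
    }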

An interesting problem!

--
- Stephen Fuld


Snip

> --

Alexis Cousein

unread,
Dec 26, 2000, 5:56:33 AM12/26/00
to Douglas Siebert
Douglas Siebert wrote:

> From other posts in this thread, it sounds like IRIX does the same.


Yes. The R10K also has performance counters that can measure how many
TLB misses you incur in a program run (and tools to support easily
running such experiments), so it's relatively easy to determine what
page size a certain application requires to avoid thrashing the TLB (I
usually use page sizes that are 4x what the runs I do seem to "need",
just to give some extra leeway in case data sizes expand during use).

FWIW, on systems with lots of memory (512MB or 1GB per CPU) and HPC
loads, I tend to use 75% 256k pages and 25% "default" 16KB pages, and
let the IRIX kernel precoalesce pages so that 75% of pages is readily
available as "large" pages. All system daemons use small pages, all HPC
workloads use large pages (all codes are either compiled with -bigp_on,
which lets you enable the use of large pages just with environment
variables, or use "dplace" to force the use of large pages if there's
only a binary available).

Buffer cache is using the smaller page sizes, though the granularity
it's using for *its* buffers is by default *64K*, but can be tuned down
to one page (of 16KB on the 64-bit kernels) (which I sometimes do use for
random I/O apps that insist on using the buffer cache with access
patterns yielding better hit rates with finer granularity given an
amount X of memory).

That setup seems to work well for a typical HPC workload.

I chose 256KB pages just because there are enough apps I bump into that
thrash the TLB with even 64K pages -- one application I've just met
spends 45% of its time in TLB misses with page sizes of 16KB, and 25%
with page sizes of 64KB, even though it used only 30MB. Applications
that require 4MB pages seem to be much more rare.
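
(Those percentages are consistent with a simple TLB-reach calculation; a quick sketch, assuming the R10000's 64-entry, two-pages-per-entry TLB, a figure recalled from memory that should be checked against the hardware manual.)

    #include <stdio.h>

    /* TLB reach = entries * pages-per-entry * page size.  The 64 x 2 figure
     * is my recollection of the R10000 TLB; check the hardware manual.
     * The 30 MB working set quoted above only fits once pages reach 256 KB. */
    int main(void)
    {
        const long entries = 64, pages_per_entry = 2;
        const long page_sizes[] = { 16L << 10, 64L << 10, 256L << 10 };

        for (int i = 0; i < 3; i++) {
            long reach = entries * pages_per_entry * page_sizes[i];
            printf("%4ld KB pages: TLB reach %6ld KB (%2ld MB)\n",
                   page_sizes[i] >> 10, reach >> 10, reach >> 20);
        }
        return 0;
    }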

Incidentally, parallel memory allocation (for non-independent jobs) on
larger machines can also be faster with large pages for a given memory
requirement, given that you descend into the kernel less often.

Robert Harley

unread,
Dec 26, 2000, 9:16:06 AM12/26/00
to

dsie...@excisethis.khamsin.net (Douglas Siebert) writes:
> HP-UX and the PA-RISC CPU fully support page sizes of 4K, 16K, 64K [...]

> From other posts in this thread, it sounds like IRIX does the same. I
> don't believe that HP and SGI's engineers are substantially more brilliant
> than those who work for Sun, IBM and Compaq, or the guys who hack the

> Linux kernel. [...]

Alpha has "granularity hints" in the PTE which indicate that blocks of
pages should be handled as large pages up to 4 MB, IIRC, to reduce TLB
misses. They are supported in Tru64 Unix (and VMS I believe) and
there is a patch for Alpha Linux.
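
(If I remember the encoding correctly, the hint is a two-bit field whose value N marks a naturally aligned group of 8^N pages, which with 8 KB base pages tops out at the 4 MB mentioned above. A small sketch of that arithmetic, to be checked against the Alpha architecture manual.)

    #include <stdio.h>

    /* Region size implied by an Alpha-style granularity hint of value gh:
     * 8^gh base pages, i.e. page_size << (3 * gh).  The encoding here is
     * recalled from memory; consult the architecture manual before relying
     * on it. */
    int main(void)
    {
        const long page = 8L << 10;                     /* 8 KB base page */

        for (int gh = 0; gh <= 3; gh++)
            printf("GH=%d -> %4ld KB\n", gh, (page << (3 * gh)) >> 10);
        return 0;
    }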

Bye,
Rob.
.-. .-.
/ \ .-. .-. / \
/ \ / \ .-. _ .-. / \ / \
/ \ / \ / \ / \ / \ / \ / \
/ \ / \ / `-' `-' \ / \ / \
\ / `-' `-' \ /
`-' `-'

Terje Mathisen

unread,
Dec 26, 2000, 10:43:43 AM12/26/00
to
Stephen Fuld wrote:
>
> "Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
> > I also prefer sector-based (i.e. usually 512 bytes) sub-allocation in
> > the file system, but this doesn't mean that the CPU and OS must support
> > 512-byte memory pages.
>
> Don't get too wedded to 512 byte disk blocks. A couple of years ago, I

You did notice my 'usually 512 bytes' caveat? :-)

I have lived with systems using 1K and 2K 'virtual sectors', back in the
days when sector addressing was limited by a 16-bit register size on Dos
< 3.31.

> participated in an IDEMA sub group dealing with the effects of increasing
> disk areal density. One of the problems is that, as the density increases,
> the signal to noise ratio decreases and more powerfull ECCs are needed.
> These require more ECC bits to be stored per disk sector. Not really a
> problem in general, since the areal density is going up much faster than the
> need for ECC bits, but soon the number of ECC bits beomes significant when
> compared to the 512 byte sector size, thus limiting effective capacity
> growth. Also, some of the required gaps don't scale down with density,
> further complicating the problem. The consensus, among the disk engineers,
> was a desire to increase the sector size, probably to something like 4K
> bytes. This would allow the increases in areal density to be more
> effectivly translated into increased capacity, thus lower cost. My job, on
> the subgroup was to talk about the "system" implications (which are large).
> But the technology implications may force us there eventually. Note that
> this group was talking about something like a 5 year time frame before the
> change would become necessary.

Interesting!

This is yet another example of the stuff we've been debating here: Basic
block sizes need to scale with system size.

Afaik, CDROMs have a 2.3K block size for audio, corresponding to 2048
bytes when used for data. I sort of assumed that this was chosen to make
the ECC overhead manageable, i.e. about 250-300 bytes for a 2K block.

With a block size of 512 bytes, you would still need nearly as many ECC
bytes, isn't that right? Anyway, useful capacity would be significantly
less.

Greg Lindahl

unread,
Dec 26, 2000, 10:29:38 AM12/26/00
to
"Stephen Fuld" <s.f...@worldnet.att.net> writes:

> But my real issue is with the often expressed "rule of thumb" that the
> transfer time should be approximately equal to the seek plus latency. Is
> there any empirical evidence that this is a good idea?

Well, 2 things:

1) Most applications don't page, so this is only talking about
workloads where things are paging.

2) Most modern OSes do clustered page ins and page outs. So in
reality, your objection is addressed: bigger is better, and
the rule of thumb gives an OK answer, but you can do better
if you know that the application is being sequential. And you
can avoid wasted work if you know it isn't, although the
transfer rate you'll see will be terrible.

> It would seem like the probability of use for the data in the next sector
> should be at least greater than the sector's transfer time divided by the
> average access time to optimize things.

Right, assuming the only pattern you can detect is sequential.

-- g

Andi Kleen

unread,
Dec 26, 2000, 12:03:14 PM12/26/00
to
dsie...@excisethis.khamsin.net (Douglas Siebert) writes:
>
> From other posts in this thread, it sounds like IRIX does the same. I
> don't believe that HP and SGI's engineers are substantially more brilliant
> than those who work for Sun, IBM and Compaq, or the guys who hack the
> Linux kernel. And while they certainly are more brilliant than those who
> work for Microsoft, by definition, even the Microserfs could probably
> code this support into Win2K, given a CPU that supports it, and a bean
> counter who puts it on their to do list.

Big problem of hacking large page support into an existing OS is that you need
a memory allocator that does not suffer from fragmentation; otherwise
it can be rather hard to allocate large contiguous pages. You probably need
some kind of zone allocator that reserves certain zones in advance
for them. This means that you need some reservation for the big page zones in
advance (no more fully automatic memory management). If you do not want
to effectively preallocate on bootup, you need a way to reorganize memory
during runtime so that large linear memory areas get freed when you
need the big pages. When the OS does not already have such a mechanism
in place, with support in its data structures, it is relatively hard (and
slow) to add. For example, it requires backlinks from pages to
user page tables: you look at physical pages, decide that you need a few
particular pages in order to free up a 4MB page, and then you need to find all
the user process page tables that reference those pages and patch their PTEs to
move the pages somewhere else. Doing this efficiently requires some
backlink structure, which complicates (and slows down) VM management.
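
(A minimal sketch of the kind of backlink structure being described, with invented names and none of the locking, data copying and TLB shootdown a real kernel would need.)

    #include <stdio.h>
    #include <stddef.h>

    typedef unsigned long pte_t;

    /* Each physical page remembers which PTEs map it, so the kernel can
     * relocate it when it wants to free up a contiguous large-page region. */
    struct rmap_entry {
        pte_t *pte;                  /* a PTE in some process's page table */
        struct rmap_entry *next;
    };

    struct phys_page {
        struct rmap_entry *rmap;     /* list of all mappings of this page */
    };

    /* Point every mapper of 'page' at the new frame.  (A real kernel would
     * also lock, copy the data and flush TLBs.) */
    static void remap_page(struct phys_page *page, unsigned long new_pfn)
    {
        for (struct rmap_entry *e = page->rmap; e != NULL; e = e->next)
            *e->pte = (new_pfn << 12) | (*e->pte & 0xfff);  /* keep flag bits */
    }

    int main(void)
    {
        pte_t pte = (5UL << 12) | 0x7;            /* maps frame 5, some flags */
        struct rmap_entry e = { &pte, NULL };
        struct phys_page page = { &e };

        remap_page(&page, 9);                     /* evict to frame 9 */
        printf("pte now maps frame %lu, flags 0x%lx\n", pte >> 12, pte & 0xfff);
        return 0;
    }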

With 4MB pages it is especially hard, because preallocation would
tie up a lot of memory, so you probably need some dynamic approach, and
a dynamic approach is hard.

The first Linux hammer port will also not support them, except for kernel
code mapping.

-Andi

Stephen Fuld

unread,
Dec 26, 2000, 12:28:46 PM12/26/00
to

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:3A48BCAF...@hda.hydro.com...

> Stephen Fuld wrote:
> >
> > "Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
> > > I also prefer sector-based (i.e. usually 512 bytes) sub-allocation in
> > > the file system, but this doesn't mean that the CPU and OS must support
> > > 512-byte memory pages.
> >
> > Don't get too wedded to 512 byte disk blocks. A couple of years ago, I
>
> You did notice my 'usually 512 bytes' caveat? :-)


Yes, I did notice, but thought the information I provided below was useful
anyway. As you know, we seem to get wedded to our assumptions, even when we
shouldn't (but I know YOU wouldn't do that :-)).


Yup.


>
> Afaik, CDROMs have a 2.3K block size for audio, corresponding to 2048
> bytes when used for data. I sort of assumed that this was chosen to make
> the ECC overhead manageable, i.e. about 250-300 bytes for a 2K block.
>
> With a block size of 512 bytes, you would still need nearly as many ECC
> bytes, isn't that right? Anyway, useful capacity would be significantly
> less.


While the ECCs used in CDROMS are different from those used in magnetic
disks, the basic principles are the same. That is, I don't know if 250 to
300 bytes would be needed for a 2K block on a magnetic disk. If the number
came out the same, it would be coincidence, since the error characteristics
of the channel (e.g. optical versus magnetic) are totally different. But it
is true that the number of bits of ECC scales much less than linearly with
the number of bits over which the ECC protects. Thus there will be far
fewer than twice the number of ECC bits for twice the number of data bits,
all other things equal, and your analysis of the effect on useful capacity
is exactly right. That is the major part of the motivation for larger disk
sectors.

--
- Stephen Fuld

Bill Todd

unread,
Dec 26, 2000, 4:05:44 PM12/26/00
to

Terje Mathisen <terje.m...@hda.hydro.com> wrote in message
news:3A48BCAF...@hda.hydro.com...
> Stephen Fuld wrote:

...

Indeed. I've wondered from time to time whether a move away from 512- (or
in Asia often 1024-)byte sectors was in the works. While it will turn the
performance of applications that assume they can blind-write (aligned) data
in those sizes to molasses (or actually break them if the system doesn't
provide transparent read/modify/write support for such accesses), it's
probably a good thing overall - as long as the new effective size standard
isn't too large. 4 KB sounds reasonable to me, and from the looks of this
thread it should be reasonable for most people.

>
> This is yet another example of the stuff we've been debating here: Basic
> block sizes needs to scale with system size.

There are quite a few assumptions inherent in that statement, at least some
of which I think I disagree with.

If you were talking solely about system page sizes (the context suggested
you were talking about disk sector sizes), then I'd find less to disagree
with, though even there I'd suggest that the (minimum) page size should
preferably scale non-linearly with system size (something like the square
root, perhaps) so as to balance the inefficiencies inherent in dealing with
large *numbers* of pages against those inherent in dealing with larger
minimum page size (fragmentation waste, copy-overhead waste, etc.). If not,
the large system at least in some uses sacrifices the economy of scale it
should enjoy compared with smaller configurations. And in any event, as
I've said elsewhere the minimum disk transfer/caching/update size should not
be tied to a large minimum page size (though if the page size is small
enough it's nice, since it facilitates cache management via the
memory-management mechanisms).
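
(A rough illustration of that square-root scaling, taking the 4 KB page / 128 MB machine mentioned earlier in the thread as an arbitrary baseline.)

    #include <stdio.h>
    #include <math.h>

    /* Page size grows as sqrt(memory), so page size and page count both
     * grow as the square root of the memory-size ratio.  Baseline: 4 KB
     * pages on a 128 MB machine (figures from earlier in the thread). */
    int main(void)
    {
        const double base_mem  = 128.0 * 1048576.0;     /* 128 MB */
        const double base_page = 4096.0;
        const double mems_gb[] = { 1.0, 16.0, 256.0 };

        for (int i = 0; i < 3; i++) {
            double mem  = mems_gb[i] * 1073741824.0;
            double page = base_page * sqrt(mem / base_mem);
            printf("%4.0f GB: page ~%4.0f KB, ~%9.0f pages\n",
                   mems_gb[i], page / 1024.0, mem / page);
        }
        return 0;
    }

(Under that scaling, the 256 GB machine discussed above ends up managing on the order of 1.5 million pages rather than 67 million.)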

If you were talking about disk sector sizes, I don't see much relationship
at all to system (or even disk) size. Rather, any sector size increase
beyond that required to ensure that not too high a percentage of total
storage space is devoted to things like inter-sector gaps, ECC, positioning
information, etc. effectively compromises the random-access nature of the
medium (e.g., as a 4 MB sector size most certainly would, even with disks
that could still contain 100,000 such sectors, thus equaling the sector
count of 512-byte-sector disks from around two decades ago). Furthermore,
until such time as interconnect bandwidths cease to be something we worry
about, unnecessarily-large minimum transfers affect network (or disk bus)
performance as well.

In other words, I'd suggest that basic block sizes are more related to what
the system is doing and to fundamental component cost/performance than to
system size (though system size may be related to the same underlying
issues, and thus some correlation may exist).

- bill

Paul Repacholi

unread,
Dec 26, 2000, 3:19:19 PM12/26/00
to
"Stephen Fuld" <s.f...@worldnet.att.net> writes:

> While the ECCs used in CDROMS are different from those used in magnetic
> disks, the basic principles are the same. That is, I don't know if 250 to
> 300 bytes would be needed for a 2K block on a magnetic disk. If the number
> cme out the same, it would be coincidence, since the error characteristics
> of the channel (e.g. optical versus magnetic) are totally different. But it
> is true that the number of bits of ECC scales much less than linearly with
> the number of bits over which the ECC protects. Thus there will be far
> fewer than twice the number of ECC bits for twice the number of data bits,
> all other things equal, and your analysis of the effect on useful capacity
> is exactly right. That is the major part of the motivation for larger disk
> sectors.

The main difference is that CDs interleave the data, so an error burst
is transformed into several smaller error bursts. Depending on how media
and head technologies shape up in the next several years, this may
or may not be needed.

Also, I'd not be surprised if the initial move was to a 512/4K block:
4K on the media, but readable/writable in 512-byte blocks in the drive electronics.
Some CD drives do this, but those are read-only, so write merges are not a
problem for them. However, as all disks now have extensive CPU and buffering,
write-merging should be doable at least as a transition measure.

Stephen Fuld

unread,
Dec 26, 2000, 4:52:39 PM12/26/00
to

"Bill Todd" <bill...@foo.mv.com> wrote in message
news:92b11n$jgo$1...@pyrite.mv.net...


I don't want to put words into Terje's mouth (nor his keyboard), but that is
the sense in which I was agreeing with him. The thing that has allowed
large systems is the scaling down of feature sizes in memory (see John
Mashey's earlier post) and (to a lesser extent) CPUs. This same
technological trend has allowed the scaling down of bit sizes on disks which
has led to the problem I discussed previously of "encouraging" larger disk
sector sizes.

Certainly, the disk sector size should be as small as reasonable to
accommodate the increased overhead, but there are tradeoffs in doing this. If
you use a larger sector size, then disk capacity may increase and average
transfer rate increases. So if you know you can use the larger sizes, you
should consider doing that. Note that on some disks today, that won't
matter as they don't support block sizes much larger than 512 or they
"aggregate" the 512 blocks internally, so you don't save anything.

--
- Stephen Fuld

>
> - bill
>
>
>


Stephen Fuld

unread,
Dec 26, 2000, 4:52:40 PM12/26/00
to

"Greg Lindahl" <lin...@pbm.com> wrote in message
news:92adh...@news2.newsguy.com...

> "Stephen Fuld" <s.f...@worldnet.att.net> writes:
>
> > But my real issue is with the often expressed "rule of thumb" that the
> > transfer time should be approximately equal to the seek plus latency. Is
> > there any empirical evidence that this is a good idea?
>
> Well, 2 things:
>
> 1) Most applications don't page, so this is only talking about
> workloads where things are paging.
>
> 2) Most modern OSes do clustered page ins and page outs. So in
> reality, your objection is addressed: bigger is better, and
> the rule of thumb gives an OK answer, but you can do better
> if you know that the application is being sequential. And you
> can avoid wasted work if you know it isn't, although the
> the transfer rate you'll see will be terrible.


I'm sorry Greg, but I think I lost what was going on here. Your point 1
seems to ignore some of the cases where I/O is very important but may
not be paging; for example, databases. Your point 2, in combination
with point 1, seems to imply that the rule of thumb is irrelevant because the
paging system "does the right thing" by clustering paging I/O as
appropriate, and thus ignoring the rule of thumb (which is certainly true as
long as the minimum page size is not too large - which was the original
thrust of this thread). So I think you are saying that the rule of thumb
doesn't matter any more and people shouldn't be using it. Is that your
position? I'm not trying to argue, just trying to clarify.


>
> > It would seem like the probability of use for the data in the next sector
> > should be at least greater than the sector's transfer time divided by the
> > average access time to optimize things.
>
> Right, assuming the only pattern you can detect is sequential.


Again, I am missing something. If the pattern is sequential, the
probability of needing the next sector is one, and blocks should be very
long (to minimize the number of I/Os), even if the transfer time is much
longer than the seek plus latency. It seems to me that the formula I gave is
correct, no matter what patterns you can detect. If you detect a pattern,
that affects the probability of reference, which changes the output of the
formula, but not the formula itself.

BTW, the detection of non-sequential access patterns is a very interesting,
and mostly ignored area of research. There are lots of opportunities for
good work there.

--
- Stephen Fuld

>
> -- g


Stephen Fuld

unread,
Dec 26, 2000, 11:23:44 PM12/26/00
to

"Paul Repacholi" <pr...@prep.synonet.com> wrote in message
news:87wvcnd...@k9.prep.synonet.com...

> "Stephen Fuld" <s.f...@worldnet.att.net> writes:
>
> > While the ECCs used in CDROMS are different from those used in magnetic
> > disks, the basic principles are the same. That is, I don't know if 250 to
> > 300 bytes would be needed for a 2K block on a magnetic disk. If the number
> > came out the same, it would be coincidence, since the error characteristics
> > of the channel (e.g. optical versus magnetic) are totally different. But it
> > is true that the number of bits of ECC scales much less than linearly with
> > the number of bits over which the ECC protects. Thus there will be far
> > fewer than twice the number of ECC bits for twice the number of data bits,
> > all other things equal, and your analysis of the effect on useful capacity
> > is exactly right. That is the major part of the motivation for larger disk
> > sectors.
>
> The main difference is that CDs interleave the data, so an error burst
> is transformed into several smaller error bursts.


You mean logically consecutive bits are on different sectors????? Wow. I
have never heard that before. Most disks use an interleaved Reed-Solomon
ECC that allows for handling a larger error burst as essentially multiple
smaller bursts that happen to be consecutive. Is that what you mean?

> Depending how media
> and head technologies shape up in the next several years, this may
> or may not be needed.
>
> Also, I'd not be surprised if the initial move was to a 512/4K block:
> 4K on the media, but readable/writable in 512-byte blocks in the drive electronics.
> Some CD drives do this, but those are read-only, so write merges are not a
> problem for them. However, as all disks now have extensive CPU and buffering,
> write-merging should be doable at least as a transition measure.
>

Yes, this was, in fact, one of my suggestions. The disk guys don't like it
as it is a lot of work, but I think they will be forced there. But there
are other problems. For example, the ATA/IDE interface doens't even have a
way to specify the block size - it is assumed by the interface. This some
method would have to be created to tell the drive that this request uses 4K
addressing rather than 512 for when the software catches up and wants to
avoid the performance hit on writes. Then you get into a lot of software
(driver) complications, etc., especially since the PC world wants at least
some backwards compatibility and no one wants to maintain two whole sets of
drivers. It gets messy. There are also many systems that use something
that is close to, but not exactly 512 bytes. (For example, the AS/400 uses
520 bytes per sector) and that has to be handled. It is all doable, but it
is messy. We'll see what happens.

--
- Stephen Fuld

Douglas Siebert

unread,
Dec 27, 2000, 3:42:57 AM12/27/00
to
"Stephen Fuld" <s.f...@worldnet.att.net> writes:

>Don't get too wedded to 512 byte disk blocks. A couple of years ago, I

>participated in an IDEMA sub group dealing with the effects of increasing
>disk areal density. One of the problems is that, as the density increases,
>the signal to noise ratio decreases and more powerful ECCs are needed.
>These require more ECC bits to be stored per disk sector. Not really a
>problem in general, since the areal density is going up much faster than the
>need for ECC bits, but soon the number of ECC bits beomes significant when
>compared to the 512 byte sector size, thus limiting effective capacity
>growth. Also, some of the required gaps don't scale down with density,


Does any major OS, other than Windows (yeah, I know, that one's pretty
major :) ) care that much about 512 byte disk blocks? Does Win2K even
care? If not, since the next consumer Windows will be based on Win2K
code, maybe this won't be too big of a problem by the time the 4K sector
drives arrive. Presumably the drives could have a mode (set up in
firmware, with a driver, or maybe they could decide to put themselves in)
where they still present 512 byte sectors to the user and simply do a
RMW for those cases where not all the sector is modified.
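
(In outline that read-modify-write path is simple; a hypothetical sketch, with a tiny in-memory array standing in for the media access that real drive firmware would do.)

    #include <stdio.h>
    #include <string.h>

    #define LOGICAL   512
    #define PHYSICAL  4096
    #define NSECTORS  16                  /* tiny in-memory "disk" for demo */

    static unsigned char disk[NSECTORS][PHYSICAL];

    /* Stand-ins for the drive's internal media access (illustrative only). */
    static void read_physical(unsigned long p, unsigned char buf[PHYSICAL])
    {
        memcpy(buf, disk[p], PHYSICAL);
    }
    static void write_physical(unsigned long p, const unsigned char buf[PHYSICAL])
    {
        memcpy(disk[p], buf, PHYSICAL);
    }

    /* Emulate a 512-byte logical write on 4K physical sectors:
     * read the containing sector, patch 512 bytes, write it back. */
    static void write_logical(unsigned long lsect, const unsigned char data[LOGICAL])
    {
        unsigned char buf[PHYSICAL];
        unsigned long psect  = lsect / (PHYSICAL / LOGICAL);
        unsigned long offset = (lsect % (PHYSICAL / LOGICAL)) * LOGICAL;

        read_physical(psect, buf);
        memcpy(buf + offset, data, LOGICAL);
        write_physical(psect, buf);
    }

    int main(void)
    {
        unsigned char block[LOGICAL];
        memset(block, 0xAB, sizeof block);
        write_logical(11, block);         /* logical sector 11 -> physical 1 */
        printf("physical sector 1, offset %d: 0x%02X\n",
               3 * LOGICAL, (unsigned)disk[1][3 * LOGICAL]);
        return 0;
    }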

I don't know about anything else, but I messed around on HP-UX about 4-5
years ago with reformatting SCSI disks using 1K, 2K and 4K sectors to see
how much additional capacity I could get. IIRC, it was about 7% moving to
1K, another 3-4% moving to 2K, and maybe 1% moving to 4K. I was able to
create filesystems on them, mount them up, etc. though I didn't use them
in production. I don't recall if I ever tried booting off one. I am
pretty sure I did actually use a disk with 2K sectors on my old NeXT for
a while, and successfully booted off it, but I wouldn't bet my life on
it (I may have intended to but gave up when it didn't work).

Perhaps the 4K sectors might initially appear in SCSI drives, since the
systems they are used on are more likely to be kept up to date, and if
necessary it would be easy to guess when to fall back to the fake 512
byte sectors based on the SCSI command set being used and the negotiated
speed.

Terje Mathisen

unread,
Dec 27, 2000, 5:06:16 AM12/27/00
to
Bill Todd wrote:
>
> Terje Mathisen <terje.m...@hda.hydro.com> wrote in message
> > This is yet another example of the stuff we've been debating here: Basic
> > block sizes need to scale with system size.
>
> There are quite a few assumptions inherent in that statement, at least some
> of which I think I disagree with.
>
> If you were talking solely about system page sizes (the context suggested
> you were talking about disk sector sizes), then I'd find less to disagree
> with, though even there I'd suggest that the (minimum) page size should
> preferably scale non-linearly with system size (something like the square
> root, perhaps) so as to balance the inefficiencies inherent in dealing with
> large *numbers* of pages against those inherent in dealing with larger
> minimum page size (fragmentation waste, copy-overhead waste, etc.).

Sorry, I should have been clearer:

Yes, I absolutely agree that geometric instead of linear scaling is the
proper solution for most of these cases, simply because that distributes
the work/transistor count/whatever more or less optimally.

> If you were talking about disk sector sizes, I don't see much relationship
> at all to system (or even disk) size. Rather, any sector size increase
> beyond that required to ensure that not too high a percentage of total
> storage space is devoted to things like inter-sector gaps, ECC, positioning
> information, etc. effectively compromises the random-access nature of the
> medium (e.g., as a 4 MB sector size most certainly would, even with disks
> that could still contain 100,000 such sectors, thus equaling the sector
> count of 512-byte-sector disks from around two decades ago). Furthermore,
> until such time as interconnect bandwidths cease to be something we worry
> about, unnecessarily-large minimum transfers affect network (or disk bus)
> performance as well.

Again, I don't want sectors that are any larger than they have to be, but
with huge increases in disk sizes/areal densities, I suspect that some
scaling will have to occur. If nothing else, because you must have some
way to recover from a slightly bad spot on the disk surface.

The obvious endpoint for all of this (besides having some sort of
in-disk raid, using multiple platters at the same time), would be when
the disk must treat all the data from a full revolution as a single
item.

Good blocks can be retrieved as soon as they turn up under the head,
other blocks must wait until the full, distributed ECC can be used to
recover the bad spots.

Terje

PS. Re my idea of an in-disk RAID: With a disk that's capable of
reading/writing from/to all platters at the same time, it seems like you
could get very decent IO rates while still having a significantly
improved MTBF, as long as single-platter/single-head failures are
common.

If many/most errors are related to non-duplicated parts of the drive,
then this doesn't help.

Chris Hedley

unread,
Dec 27, 2000, 7:53:22 AM12/27/00
to
In article <3A49BF18...@hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
> The obvious endpoint for all of this (besides having some sort of
> in-disk raid, using multiple platters at the same time), would be when

I'm obviously showing my ignorance of the inner workings of hard discs
here, but I'd pretty much taken it for granted that disc units would do
this by default (even if it means multiplying the bad-sector count by
the head count). Do they really only read from one head at a time?

Chris.

Terje Mathisen

unread,
Dec 27, 2000, 8:29:13 AM12/27/00
to

Old disks did indeed read from single heads, I assume because the drive
electronics would be too costly to replicate.

Afaik, current drives must use multiple heads in parallel. Otherwise it
would be very hard to sustain the multi-MB/s transfer rates we're seeing
even on cheap drives today.

... http://www.storage.ibm.com/hardsoft/diskdrdl/prod/ds75gxp40gv.htm

I.e. my IBM Deskstar 75 has 15GB per platter, running at 7200 RPM, and
sustains 37 MB/s.

If it did this from a single head, then it would need to read
37MB/(7200/60) = 37MB/120 =~ 300+ kB per track, or more than 600
sectors.

According to the IBM specs, this drive has 27,724 user cylinders and 10
heads, for an average of 270 KB/track.

This would seem to indicate that it has to read from at least two heads
in parallel, but 300+ and 270 are such suspiciously similar numbers that I'm
starting to wonder!

OTOH, that same page claims a max media transfer rate of 444 Mbit/s,
which corresponds to 550 MB/s, but that number is probably inclusive of
all overhead (addressing/ECC etc).

On the gripping hand, the page also claims that the transfer rate is the
same for all models, including the 15GB version with just a single
platter.

So, I'd conclude by guessing that IBM has duplicated the read
electronics, so they can read from both sides of a single platter
simultaneously.

This is the only way I can make all the different numbers more or less
consistent.

Terje

Larry Kilgallen

unread,
Dec 27, 2000, 8:30:14 AM12/27/00
to
In article <92ca2g$avr$1...@sword.avalon.net>, dsie...@excisethis.khamsin.net (Douglas Siebert) writes:

> "Stephen Fuld" <s.f...@worldnet.att.net> writes:
>
>>Don't get too wedded to 512 byte disk blocks. A couple of years ago, I
>>participated in an IDEMA sub group dealing with the effects of increasing
>>disk areal density. One of the problems is that, as the density increases,
>>the signal to noise ratio decreases and more powerful ECCs are needed.
>>These require more ECC bits to be stored per disk sector. Not really a
>>problem in general, since the areal density is going up much faster than the
>>need for ECC bits, but soon the number of ECC bits becomes significant when
>>compared to the 512 byte sector size, thus limiting effective capacity
>>growth. Also, some of the required gaps don't scale down with density,
>
>
> Does any major OS, other than Windows (yeah, I know, that one's pretty
> major :) ) care that much about 512 byte disk blocks? Does Win2K even
> care? If not, since the next consumer Windows will be based on Win2K
> code, maybe this won't be too big of a problem by the time the 4K sector
> drives arrive. Presumably the drives could have a mode (set up in
> firmware, with a driver, or maybe they could decide to put themselves in)
> where they still present 512 byte sectors to the user and simply do a
> RMW for those cases where not all the sector is modified.

That seems like it would get outstandingly complex on a multi-host
SCSI bus where two different systems in the cluster are modifying
adjacent "512 byte blocks".

The operating system which would have the least trouble might be MVS,
since it is able to handle variable-length disk blocks.

==============================================================================
Great Inventors of our time: Al Gore -> Internet; Sun Microsystems -> Clusters
==============================================================================

Magnus Redin

unread,
Dec 27, 2000, 8:57:37 AM12/27/00
to
Terje Mathisen <terje.m...@hda.hydro.com> writes:

> Old disks did indeed read from single heads, I assume because the
> drive electronics would be too costly to replicate.

> Afaik, current drives must use multiple heads in parallel. Otherwise
> it would be very hard to sustain the multi-MB/s transfer rates we're
> seeing even on cheap drives today.

The problem is not drive electronics cost but head positioning, since
all heads use the same coil and "suspension". The tracks are very close
to each other, and a small alignment change, for instance if the top
head arm is slightly warmer than the bottom, means you lose one or
the other track if you try to use both in parallel.

There is not room for individual coils for each head, and mechanical
parts cost a lot compared to a mm2 of silicon.

If you make the "suspension" and head arms stiffer, the added mass will
make the seeks slower or require a larger coil, which means even more
mass, larger volume and a significantly larger power consumption. More
mass also means greater sensitivity to mechanical shocks, since the
weight per supporting area of material ratio gets worse. It's the
miniaturisation that has made today's 3.5" and 2.5" drives pretty shock
resistant. Every milligram of head, arm and assembly "suspension" mass
is very valuable for nearly all areas of performance: cost, seek time,
power consumption, overall weight and shock resistance.

Regards,
--
--
Magnus Redin Lysator Academic Computer Society re...@lysator.liu.se
Mail: Magnus Redin, Klockaregården 6, 586 44 LINKöPING, SWEDEN
Phone: Sweden (0)70 5160046 and (0)13 214600

Stephen Uitti

unread,
Dec 27, 2000, 10:47:21 AM12/27/00
to
"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:3A48BCAF...@hda.hydro.com...

>Basic block sizes need to scale with system size.

What about the application? Take, for example, netnews.
It typically uses many small files. Netnews needs a small filesystem
block allocation size. It doesn't matter what scale your system is.
Some OSs support more than one filesystem at the same time, and
for netnews, the filesystem supporting the smallest blocks wins.

Stephen Fuld wrote:
[larger physical blocks have lower ECC overhead - interesting.]

Small filesystem blocks can be implemented on larger physical
blocks. Contiguous filesystems with lazy writes can eliminate
much of the read-modify-write that otherwise would be required.
The disk drivers attempt to do track reads and writes as it is.
In fact, one can view read ahead and write behind as a sort of
dynamic variable block size system, where the block size is
tailored to the application's behavior at a fine grained level.

Many modern filesystems can be set up by the system administrator
with different block sizes. Back when the Linux ext2 was new
(to me), I ran tests over the range of block sizes available. For
large files, there was essentially no difference in speed.
For small files, small allocation units result in more efficient
disk usage. [I also ran timing tests with zero "reserved" space,
and found no significant advantage in reserving any space
whatever.] Thus, I tend to set the filesystems up with the
smallest possible blocks [with no reserve space].

On the Subject topic - I have a 68K Mac, which as far as
I know is stuck with the old HFS file system. As I recall, my
blocks are 8 KB each, due to the size of my disks (about
700 MB). Smaller filesystems allow smaller blocks, and
waste less space per file. However, multiple smaller
filesystems have shown themselves to be more trouble
than they're worth. Thus, I've set up the system so that each
drive has just one filesystem, and I simply live with the wasted
space per file. I recall that one filesystem has something like
17,000 files on it. That's probably something like 70 MB
of waste - or 10% of the disk. Mind you, this system likely
runs slightly slower than it would with smaller blocks. After
all, the system has to read an 8 KB block even for a 2 KB
file (which is what I have - that's why I have so many files).
It doesn't seem to make much difference, however.
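
(A back-of-the-envelope check of that estimate, taking the quoted file count and assuming the files really do average about 2 KB; it lands a little above the 70 MB guess but in the same ballpark, and the exact figure depends on the real size distribution.)

    #include <stdio.h>

    /* Internal fragmentation: each small file wastes the unused tail of its
     * last allocation block.  Figures are the ones quoted above: ~17,000
     * files averaging ~2 KB on 8 KB blocks, on a ~700 MB disk. */
    int main(void)
    {
        const long long files = 17000, block = 8192, avg_file = 2048;
        long long waste = files * (block - avg_file);

        printf("~%lld MB wasted, about %lld%% of a 700 MB disk\n",
               waste >> 20, 100 * waste / (700LL << 20));
        return 0;
    }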

I used to run DOS applications under Linux via the DOS
emulator. Tests showed that DOS apps got significantly
better disk throughput using files in the Linux native ext2
filesystem than when using a native DOS filesystem. When it
became clear that none of my DOS applications behaved
badly talking to Linux files, I deleted my DOS filesystem
partition. My point here is that a modern filesystem can
overcome the performance problems of using smaller
blocks when compared to an older filesystem using larger
blocks on the same hardware.

Summary - I expect current modern filesystems to
perform pretty well using larger physical page sizes,
even while supporting smaller allocation units. YMMV.

Dennis O'Connor

unread,
Dec 27, 2000, 11:55:08 AM12/27/00
to
"Terje Mathisen" <terje.m...@hda.hydro.com> wrote ...

> Old disks did indeed read from single heads, I assume because the drive
> electronics would be too costly to replicate.
>
> Afaik, current drives must use multiple heads in parallel.

That seems unlikely. Track-to-track spacings are very tight
(28,300 tracks per inch on IBM's Deskstar 75GP, for example).
Since all the heads use the same positioning servo, I doubt
you would reliably get more than one head properly aligned
over the target tracks at any one time. The minor differences
in temperature (warmer in the center, cooler towards the case)
across the platters and arms would mess it up I think.

> Otherwise it
> would be very hard to sustain the multi-MB/s transfer rates we're seeing
> even on cheap drives today.
... http://www.storage.ibm.com/hardsoft/diskdrdl/prod/ds75gxp40gv.htm
>
> I.e. my IBM Deskstar 75 has 15GB per platter, running at 7200 RPM, and
> sustains 37 MB/s.
>
> If it did this from a single head, then it would need to read
> 37MB/(7200/60) = 37MB/120 =~ 300+ kB per track, or more than 600
> sectors.

Well, at 391k BPI (the claimed max recording density), that would require
6.46 inches of a single track, which would be a track with a 2" diameter.
Given that the active area of the disk is found in a 1" wide ring (27k
cylinders at 28k tracks/inch) then if the innermost track were 2"
in diameter the outermost would be 4". This would work fine.
Of course, the recording density varies by cylinder, so these can
only be approximations. Also, next-track-seek time needs to be
taken into account for sustained numbers; there is room to make
the tracks larger (say, have them go from 2.5" diameter to 4.5"
which would help that). So it looks like a single head works fine.
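
(For what it's worth, the per-track arithmetic in these two posts can be recomputed directly from the figures quoted; a small C check.)

    #include <stdio.h>

    /* Recompute the per-track numbers from the figures in this thread:
     * 37 MB/s sustained at 7200 RPM, and a claimed 391,000 bits/inch. */
    int main(void)
    {
        const double sustained = 37e6;        /* bytes/s       */
        const double rpm = 7200.0;
        const double bpi = 391000.0;          /* bits per inch */

        double per_rev  = sustained / (rpm / 60.0);       /* bytes/revolution */
        double track_in = per_rev * 8.0 / bpi;            /* track length, in */

        printf("%.0f KB per revolution\n", per_rev / 1024.0);
        printf("%.2f inches of track, i.e. a %.1f\" diameter at max density\n",
               track_in, track_in / 3.14159265);
        return 0;
    }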

> OTOH, that same page claims a max media transfer rate of 444 Mbit/s,
> which corresponds to 550 MB/s, but that number is probably inclusive of
> all overhead (addressing/ECC etc).

We know you mean 55MB/sec. :-) And the main reason for the drop
from 55MB/sec media rate to 37MB/sec sustained rate is probably
next-track-seek time. ECC overhead is very low, and IBM uses their
"No-ID" technology to reduce the overhead of addressing; they claim
this gives a 30% increase in storage capacity.

> This is the only way I can make all the different numbers more or less
> consistent.

In my analysis, a single head works fine. Have I missed something?
--
Dennis O'Connor dm...@primenet.com
Vanity Web Page: http://www.primenet.com/~dmoc/

