
Large page sizes on Origin 3800


Richard

Jan 22, 2003, 12:14:11 PM

Hi,

We're fiddling around trying to work out whether configuring our machine to
provide very large memory pages (i.e. 4MB, 16MB) will be beneficial, and what
the drawbacks would be.

Preliminary tests show that providing a large fraction of 16MB pages does
yield worthwhile performance improvements for our users' code (we do fluid
dynamics).

The question is, given that large pages are apparently better in terms of TLB
performance, what reasons might there be not to configure the whole machine's
memory to use the 16MB page size? What's the downside of such big pages? That's
my basic question.

Thanks,

Richard.


Randolph J. Herber

Jan 22, 2003, 1:05:18 PM
Simply stated, memory fragmentation. About one-half (actual experience
indicates about 69%) of the _last_ page of a region is unused, and in many
cases not all of a region is used, with the result that larger pages, which
tend to produce larger regions, also cause some waste.

Larger pages do reduce the real memory the kernel uses for virtual address
space management (the page tables are smaller).
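
To put rough numbers on the page-table point, here is a back-of-the-envelope
count of the translations needed to map a 4GB process image at various page
sizes (the 4GB figure and the 16KB base page are illustrative assumptions, and
real IRIX page tables have more structure than a flat array):

    # plain POSIX sh arithmetic, not an IRIX tool
    for pagesize_kb in 16 64 1024 16384; do
        echo "$pagesize_kb KB pages: $((4 * 1024 * 1024 / pagesize_kb)) translations"
    done
    # 16 KB pages    -> 262144 translations
    # 64 KB pages    ->  65536 translations
    # 1024 KB pages  ->   4096 translations
    # 16384 KB pages ->    256 translations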

But, as long as swap rates do not increase, there is not a good reason
to use larger pages as, as you indicate, TLB misses are a major impact
on performance. In my experience, a TLB miss costs about 50 instruction
times lost without a context switch -- if the translation is in the
page tables and the real memory backing that virtual address range
is already assigned -- and about 30K instruction times lost plus one
or two I/O waits if the data needs to be swapped back in. Two I/O
waits are needed if a dirty page must first be written out.

The following header lines retained to effect attribution:
|Date: Wed, 22 Jan 2003 17:14:11 +0000
|From: Richard <r...@astro.le.ac.uk>
|Subject: Large page sizes on Origin 3800
|To: info-ir...@ARL.ARMY.MIL

|Hi,

|Thanks,

|Richard.

Randolph J. Herber, her...@fnal.gov, +1 630 840 2966, CD/CDFTF PK-149F,
Mail Stop 318, Fermilab, Kirk & Pine Rds., PO Box 500, Batavia, IL 60510-0500,
USA. (Speaking for myself and not for US, US DOE, FNAL nor URA.) (Product,
trade, or service marks herein belong to their respective owners.)

Per Ekman

Jan 23, 2003, 2:49:07 AM
"Randolph J. Herber" <her...@dcdrjh.fnal.gov> writes:

> Simply stated, memory fragmentation. About one-half (actual experience
> indicates about 69%) of the _last_ page of a region is unused, and in many
> cases not all of a region is used, with the result that larger pages, which
> tend to produce larger regions, also cause some waste.
>
> Larger pages do reduce the real memory the kernel uses for virtual address
> space management (the page tables are smaller).
>
> But, as long as swap rates do not increase, there is not a good reason
> to use larger pages as, as you indicate, TLB misses are a major impact
> on performance.

Am I right in assuming that the "not" in the above sentence is a
typo?

*p

Alexis Cousein

Jan 23, 2003, 8:18:47 AM, to Richard
Richard wrote:
> Hi,
>
> We're fiddling around trying to work out whether configuring our machine to
> provide very large memory pages (i.e. 4MB, 16MB) will be beneficial, and what
> the drawbacks would be.
>
> Preliminary tests show that providing a large fraction of 16MB pages does
> yield worthwhile performance improvements for our users' code (we do fluid
> dynamics).
>
> The question is, given that large pages are apparently better in terms of TLB
> performance, what reasons might there be not to configure the whole machine's
> memory to use the 16MB page size?

Two:

- memory usage (as pages are added to program heaps in large-page increments)
- hotspots created in OpenMP/pthread programs

For the first topic, note that if you use PAGESIZE_TEXT (but not many people
do) or large pages for text, you need the latest and greatest rld patch and a
host of RLD flags if you want more than one DSO mapped into a large page...I
think there will shortly be a PDF on the Catia section of the www.sgi.com
page, as Catia uses DSOs extensively and jumps all over the place in them.

If you don't have that patch, pray that you aren't using thousands of DSOs...

For *parallel* programs, very large pages can have adverse effects, in that
shared data structures that are used by many threads may become smaller than a
page, which means they'll be mapped onto the memory of *one* C-brick only --
creating a memory hotspot.

Of course, fragmenting pages to get small pages is relatively painless.

If you get a *recent* IRIX 6.5 flavour (6.5.17, IIRC), dplace has
-data_lpage_wait and -stack_lpage_wait commands, so even on a system without
precoalesced large pages, you can force apps to kick coalesced into action
when they request memory.
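
In practice that can look something like the sketch below (only the
*_lpage_wait flags are taken from the text above; the -data_pagesize and
-stack_pagesize options and the ./a.out name are assumptions -- check
dplace(1) on your release):

    # Ask for 16MB data and stack pages and make dplace drive coalesced and
    # wait until such pages exist, instead of silently falling back:
    dplace -data_pagesize 16m -data_lpage_wait \
           -stack_pagesize 16m -stack_lpage_wait ./a.out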

On the other side, if you set large pages to use, say, up to 90% of the
system's memory, an application that does not elect to use large pages (e.g.
one not compiled with -bigp_on and not using the MP or MPI libraries, or one
that has unset the PAGESIZE_DATA etc. env vars) will split large pages fairly
rapidly.

A few hints, though:

- *First* check your application. Lots of TLB misses in an application are
indicative of a scattered working set. You can get rid of the TLB misses
without touching the code, but chances are that the code lines that trigger
TLB misses will still trigger lots of secondary cache misses. So profile your
application using ssrun with the TLB miss counter experiment to see where the
TLB misses are caused, and see if you can't get rid of them by changing the
algorithm or the code.
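
Concretely, something along these lines (a sketch: the -tlb_hwc experiment
name is my best guess at the SpeedShop TLB-miss counter experiment, and
./a.out is a placeholder -- check ssrun(1) for the exact experiment names on
your release):

    ssrun -tlb_hwc ./a.out      # record where TLB misses are taken
    prof a.out.tlb_hwc.*        # per-function breakdown of the experiment file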

- Secondly, try what the *minimal* page size is that will avoid TLB misses --
use perfex -y -a on your largest runs with different page sizes; perfex is
almost free anyway, so I tend to *always* run HPC jobs with a perfex
experiment written to a file.
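
That can be as simple as the following (sh syntax; the switches are the same
ones used for the measurements later in this thread -- if your perfex version
writes its report to stdout rather than stderr, adjust the redirection):

    perfex -a -x -y ./a.out 2> perfex.out   # -a multiplexes over all counters,
                                            # -y adds estimated times; keep the
                                            # report in perfex.out for later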

The way from a base page to a large page is *much* longer for 16M pages than
for 256K pages, and the latter are *usually* enough to avoid TLB misses. Your
chances of getting such pages even when some programs have fragmented large
pages and used parts of them as base pages are also greater ;).

- For sequential programs, use the largest page size that's readily available
on the system, even when your program doesn't need large pages. For *parallel*
programs, check what the *minimal* page size is that will eliminate TLB
misses. Remember that this optimum may become a lot smaller for runs of
constant problem size as the number of CPUs increases!

But basically, you have two strategies.

Determine what large page size the greatest TLB thrasher on the
system requires. Then you can either:

- set percent_totalmem_XXX_pages to 1, and force that application to use the
dplace -xxx_lpage_wait flags [*]
- set percent_totalmem_XXX_pages to e.g. 90, and force all the programs that
loathe or don't use large pages to fragment large pages (pretty cheap; almost
invisible).

And, of course, interesting mixes of the two.
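
A sketch of what the second strategy amounts to at the tunables level (the
16m tunable name is the one used elsewhere in this thread; the 64k
counterpart, the exact line format and appending rather than hand-editing the
file are assumptions -- see systune(1M) and stune(4)):

    # sh syntax: reserve most of memory for 16MB pages at boot and leave a
    # little for smaller large pages; reboot afterwards so the values take
    # effect.  (Line format assumed; see stune(4).)
    echo 'percent_totalmem_16m_pages = 90' >> /var/sysgen/stune
    echo 'percent_totalmem_64k_pages = 10' >> /var/sysgen/stune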

You can set coalesced_precoalesce to 1, but don't forget that this may be
intrusive: even though it will *directly* consume an insignificant fraction of
the CPU resources, don't use it on a machine with lots of large MP jobs that
have lots of barriers, as delaying one thread of those programs means the
entire job runs slower (while the other CPUs usually busy-spin waiting). [**]

[*] the flags have been there for longer than the actual implementation,
though -- it's not because it's in the man page that it works. perfex and
dlook are your friends. If you install 6.5.17, though, IIRC, you're certainly
safe.
[**] there *is* a Request For Enhancement to enable kthreads to be bound to
the boot cpuset; when that is in place, this may not be as critical. For the
same reason, either systune the craylink routerstat gathering off, or, for a
recent 6.5.x flavour, pin numastatd to processor 0.

--
<these messages express my own views, not those of my employer>
Alexis Cousein Senior Systems Engineer
SGI Belgium and Luxemburg a...@brussels.sgi.com

Richard

Jan 23, 2003, 9:35:02 AM

Thanks for the replies. Our machine is pretty much exclusively running large
parallel jobs.


Alexis Cousein

Jan 23, 2003, 9:51:54 AM, to Richard
Richard wrote:
> Thanks for the replies. Our machine is pretty much exclusively running large
> parallel jobs.
>
>
Then I'm surprised large pages help that much (unless you're addressing really
*huge* amounts of memory in the entire application). OTOH, if they do help,
you'd better use them, but it may well be that the optimum is *not* the
largest pages that are available, esp. if the applications are not cache
friendly.

Andy Nelson

Jan 23, 2003, 9:55:58 AM
Richard <r...@astro.le.ac.uk> wrote:

> Hi,

> We're fiddling around trying to work out whether configuring our machine to
> provide very large memory pages (i.e. 4MB, 16MB) will be beneficial, and what
> the drawbacks would be.


Hi folks,

Since it is my code that triggered Richard's message
I thought I would provide some background on the
code and the results that instigated his post. I don't
know exactly what significance it may have for the
answers that people give, but here it is in any case.

The machine is used entirely for computational astrophysics.
Some fraction of the codes do particle evolution
and the rest use grids of various flavors. The mix
is probably 50/50, but I don't know exactly. My own
experience with a grid code is that it doesn't care
anywhere near as much about the page size as a particle
code does.

The specific code in question is an SPH and/or Nbody
code that uses a binary (`Press') tree to determine a
list of nodes which pass an acceptability criterion to
be used to calculate the gravity on a set of particles.
It is parallelized using OpenMP and for the gravity
calculation alone (the most parallel part), the tests
I did yesterday show that it gets perfectly linear
speedup up to at least 96 processors. Other parts
of the code didn't fare so well, but that was somewhat
expected. Back to coding and twiddling for them :-/

The tree traversal takes of order 10% of the times
shown below and the gravity calculation the rest. The
gravity calculation requires 10 double precision
numbers to be loaded per node, consisting of 3 positions,
a mass and 6 quadrupole moments. One calculation is
one inverse sqrt and about 40-50 or so floating point ops
(I don't recall exactly, but it is something like that
number).

For memory optimization at all levels (both cache
and TLB) the values are stored in arrays for which the
first index is one of the quantities and the second is
the node number. For example:

(given that parameter(ix=1, iy=2, iz=3, im=4) )

then treept(ix,i) would refer to the x coordinate of
the i'th particle. A similar array structure exists
for the quadrupole elements and for the linking of nodes
in the tree (an integer array of daughters and siblings
for each node) to one another.

This means that although one calculation may miss
everything (cache and page) the first time an element
is needed, all the subsequent ones needed for the
same calculation will almost always hit the primary
cache (modulo weird corner cases).
traversals are ordinarily done for a particle or
group of particles that are physically very close
to the particle or group that was just completed.
Re your comment Alexis: I specifically designed it
this way based on output from ssrun and perfex.

The problem with a tree traversal and gravity calculation
like this is that it is very non-local in terms of its
memory access. Physically close (equivalent to often
accessed) nodes may be quite distant in memory and vice
versa. There is quite a bit of mitigation that can be (and
has been) done if you order the nodes so that physically
close nodes are nearby in the tree as well, but ultimately
you still need to grab information from relatively distant
parts of memory quite frequently. Richard told me of some
tests with another related code (a couple years ago?) that
didn't have this kind of mitigation incorporated that found
very little sensitivity to page size, and I conclude that it
has cache and page misses all the time no matter what else
you could do to the code.

The test problem was a set of 5 million particles arranged
in a homogeneous sphere. For cognoscenti, the tree
opening criterion was set to 0.7 (1.0 is theoretically
borderline unstable and -->0 goes to an n^2 gravity
calculation). Three gravity calculations were made
for all particles and the times for each were averaged.
The individual measurements varied by about 1% or less
from the number shown. Each test was computationally
identical to all of the others.

Round robin memory placement. Page migration off (a
long time ago I did tests that showed migration was
a very bad idea for this code). Total running code size
was approx 4GB, though I think somewhat less was actually
touched and used heavily in these tests, perhaps as
little as 1.5-2GB (these tests weren't designed to look
at some other sections of the code--gravity is where
most of the time is spent).

Large page sizes were used for the data and stack segments
(i.e. environment variables PAGESIZE_DATA and
PAGESIZE_STACK). The system was set to use only one page
size at a time during these tests (i.e. /var/sysgen/stune
had 0% for all page sizes but one and 100% for that one).
Page sizes for text segments were the default size. Timings
for the tree traversal+gravity calculation were:

CPUs   16MB pages   1MB pages   64k pages
  1        *            *        2361.8s
  8       86.4s       198.7s      298.1s
 16       43.5s        99.2s      148.9s
 32       22.1s        50.1s       75.0s
 64       11.2s        25.3s       37.9s
 96        7.5s        17.1s       25.4s

(*) test not done.

As near as I can tell the numbers show perfect
linear speedup for the runs for each page size.
Across different page sizes there is degradation
as follows:

16m --> 64k decreases by a factor 3.39 in speed
16m --> 1m decreases by a factor 2.25 in speed
1m --> 64k decreases by a factor 1.49 in speed


Profiling data from perfex:

Used perfex -a -x -y a.out

Use of -a means that the counts are a statistical
sample, and not an exact count of every event.

Sum over cpus of floating point times for each test
(very approximate also because of instruction latency
overlap: this table has the `typical' value obtained
from perfex):

CPUs   16MB pages   1MB pages   64k pages
  1      1424s
  8      1424s        1424s       1423s
 16      1424s        1424s       1424s
 32      1423s        1424s       1424s
 64      1425s        1424s       1424s
 96      1426s        1424s       1424s

Sum over cpus of TLB miss times for each test:

CPUs   16MB pages   1MB pages   64k pages
  1      3489s
  8       64.3s       1539s       3237s
 16       64.5s       1540s       3241s
 32       64.5s       1542s       3244s
 64       64.9s       1545s       3246s
 96       64.7s       1545s       3251s

Thus the 16MB pages rarely produced page misses,
while the 64kB pages used up 2.5x more time than
the floating point operations that we wanted to
have. I have at least some feeling that the 16MB pages
rarely caused misses because with a 128-entry
TLB (on the R12000 cpu) that gives about 1GB of
addressable memory before paging is required at all,
which I think is quite comparable to the size of
the memory actually used.


Make of this what you will.


Cheers,

Andy


--
Andy Nelson School of Mathematics
an...@maths.ed.ac.uk University of Edinburgh
http://maths.ed.ac.uk/~andy Edinburgh Scotland EH9 3JZ U. K.

Andy Nelson

Jan 23, 2003, 10:35:56 AM
Alexis Cousein <a...@brussels.sgi.com> wrote:

> If you get a *recent* IRIX 6.5 flavour (6.5.17, IIRC), dplace has
> -data_lpage_wait and
> -stack_lpage_wait commands, so even on a system without precoalesced large
> pages, you
> can force apps to kick coalesced into action when they request memory.

Just one note about this:

As far as I know/understand, this option of forcing a wait until
large pages are ready could cause a run to wait indefinitely if the
memory is quite fragmented or pages are unavailable. As a user with a
specified time allocation per quarter, I can see certain drawbacks
to this option that a systems person might overlook...


(Richard excepted, since I already told him about it :-)

Alexis Cousein

Jan 23, 2003, 11:09:29 AM, to Andy Nelson
Andy Nelson wrote:
> Re your comment Alexis: I specifically designed it
> this way based on output from ssrun and perfex.

Good. Then my generic catch-all comment need not apply. Of course
there *are* algorithms that generate cache and page misses.


>
> The problem with a tree traversal and gravity calculation
> like this is that it is very non-local in terms of its
> memory access.

Yup.

> Physically close (equivalent to often
> accessed) nodes may be quite distant in memory and vice
> versa. There is quite a bit of mitigation that can be (and
> has been) done if you order the nodes so that physically
> close nodes are nearby in the tree as well, but ultimately
> you still need to grab information from relatively distant
> parts of memory quite frequently. Richard told me of some
> tests with another related code (a couple years ago?) that
> didn't have this kind of mitigation incorporated that found
> very little sensitivity to page size, and I conclude that it
> has cache and page misses all the time no matter what else
> you could do to the code.

Probably right.

[...]


> Make of this what you will.

Well, you're probably right in wanting 16M pages. If you're not the only
system user, though, you're going to find it hard to hold on to large pages if
there are many small page users on the system.

If any of them have parallel codes using hotspot data structures that are not
much larger than 16MB, and if these codes are memory bandwidth limited, you
have a conflict of interest -- otherwise I'd advise every user to compile with
-bigp_on and just set PAGESIZE_DATA and PAGESIZE_STACK to 16384 for
everyone ;).
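
Spelled out, that per-user recipe looks roughly like this (a sketch: mycode.f
is a placeholder, and whether -bigp_on belongs on the compile or the link line
should be checked against the MIPSpro/ld documentation; the 16384 values are
kilobytes, i.e. 16MB pages, as above):

    # sh syntax
    f90 -O3 -bigp_on -o mycode mycode.f     # build with large-page support
    PAGESIZE_DATA=16384;  export PAGESIZE_DATA    # 16MB data pages
    PAGESIZE_STACK=16384; export PAGESIZE_STACK   # 16MB stack pages
    ./mycode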

I'd set percent_totalmem_16m_pages to 90 or 95 (or whatever your programs
need) rather than force the use of dplace lpage_wait flags -- if you allow
other users to grab small pages, you're going to find it harder to actually
get large pages.

Do get 6.5.18 -- there have been fairly recent changes to the fallback policy
when a very large page size can't be found: 6.5.18 will fall back to the next
smaller page size instead of falling back to base page sizes.

If you really need almost *everything* for your runs, though, you may have to
rein in the buffer cache on the nodes you're running in, either by clamping
down on it globally (but if you have other users who depend on the buffer
cache,...), or by running your jobs in a static cpuset with
MEMORY_KERNEL_AVOID.
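
For the cpuset route, the shape of the thing is roughly as follows (only the
MEMORY_KERNEL_AVOID attribute is taken from the text above; the file name, the
CPU list and the cpuset(1) invocations are assumptions -- check cpuset(1) and
cpuset(4) before trying this):

    # Sketch: a static cpuset whose nodes the kernel should avoid for its own
    # allocations, so the buffer cache stays off the compute bricks.
    #
    # hypothetical /var/cpuset/hpc.conf contents:
    #   MEMORY_KERNEL_AVOID
    #   CPU 8-95              (list syntax per cpuset(4))
    #
    cpuset -q hpc -c -f /var/cpuset/hpc.conf    # create the set
    cpuset -q hpc -A ./mycode                   # run the job inside it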

Alexis Cousein

Jan 23, 2003, 11:42:09 AM, to Andy Nelson
Andy Nelson wrote:
> Alexis Cousein <a...@brussels.sgi.com> wrote:
>
>
>>If you get a *recent* IRIX 6.5 flavour (6.5.17, IIRC), dplace has
>>-data_lpage_wait and
>>-stack_lpage_wait commands, so even on a system without precoalesced large
>>pages, you
>>can force apps to kick coalesced into action when they request memory.
>
>
> Just one note about this:
>
> As far as I know/understand, this option of forcing a wait until
> large pages are ready could cause a run to wait indefinitely if the
> memory is quite fragmented

No. It kicks coalesced into action immediately.

> or pages are unavailable.

Yes. More likely, you end up getting the pages from a node that's not where
you want them to come from (but then, as you're doing round-robin page
allocation, I doubt that's going to affect you).

Andy Nelson

Jan 23, 2003, 12:03:55 PM
Alexis Cousein <a...@brussels.sgi.com> wrote:

> Well, you're probably right in wanting 16M pages. If you're not the only
> system
> user, though, you're going to find it hard to hold on to large pages if there
> are many small page users on the system.

I am far from the only user on the system :-)

Does this mean that a job running with (and using) 16mb pages will
get them stolen, broken up into smaller pages and returned later?
Jobs on this system typically run for days at least, usually weeks
and always in parallel, except for edit/compile/debug stuff in the
boot cpuset (of 8 CPUs).

> If any of them have parallel codes using hotspot
> data structures that are not much larger than 16MB, and if these
> codes are memory bandwidth limited, you have a conflict of interest -
> otherwise
> I'd advise every user to compile with -bigp_on and just set PAGESIZE_DATA
> and PAGESIZE_STACK to 16384 for everyone ;).

I'd guess that many of the heavy users have similar requirements to
the code I outlined, so pages as small as 16k would be bad to a
greater or lesser extent for everyone. This is true even though
the grid codes used on the machine run similar memory/length (very
large/long) jobs but with a much different memory access pattern.
However, in my experience with grid codes it appears that
they (the ones I've used) don't care nearly as much about the page
size (16MB, 1MB, etc., so long as it is reasonably big) as a particle code
will.


> I'd set percent_totalmem_16m_pages to 90 or 95 (or whatever your programs
> need)
> rather than force the use of dplace lpage_wait flags -- if you allow other
> users to grab
> small pages, you're going to find it harder to actually get large pages.

I'm told by tPtB (Richard and Chris, our other sysadmin) that
we'll probably end up with a mix of page sizes, mostly 16, 4
and 1MB, with a few (10%?) 64k pages to keep things balanced
between the various page factions.

> Do get 6.5.18 -- there have been fairly recent changes to the fallback
> policy when a
> very large page size can't be found: 6.5.18 will fall back to the next
> smaller page size instead
> of falling back to base page sizes.

Richard and our other sysadmin did that exact upgrade yesterday, just
before this test. Prior to that, they said, there was an (apparently
reproducible) OS bug with turning on 16MB pages at all, so there was
no point in even trying.

> If you really need almost *everything* for your runs, though, you may have
> to rein in the buffer
> cache on the nodes you're running in, either by clamping down on it
> globally (but if you have
> other users who depend on the buffer cache,...), or by running your jobs
> in a static cpuset with
> MEMORY_KERNEL_AVOID.

I'll have to rtfm some on this, since I'm not sure I understand
what it means. :-)

Alexis Cousein

Jan 23, 2003, 12:52:55 PM, to Andy Nelson
Andy Nelson wrote:
> Alexis Cousein <a...@brussels.sgi.com> wrote:
>
>
>>Well, you're probably right in wanting 16M pages. If you're not the only
>>system
>>user, though, you're going to find it hard to hold on to large pages if there
>>are many small page users on the system.
>
>
> I am far from the only user on the system :-)
>
> Does this mean that a job running with (and using) 16mb pages will
> get them stolen, broken up into smaller pages and returned later?

No. It means that if *you* run using 16MB pages, and if after that other users
break those pages up, and then some of the bits out of them get reused by the
buffer cache, then when you try to restart your application, you may not get
that many 16MB pages -- esp. if you don't do an lpage_wait dplace incantation.

It all depends on usage patterns, of course. If the other users free all the
memory, and/or they never fill up nodes, they'll tend to stay clear of the
existing 16MB pages.

> I'm told by tPtB (Richard and Chris, our other sysadmin) that
> we'll probably end up with a mix of page sizes, mostly 16, 4
> and 1MB, with a few (10%?) 64k pages to keep things balanced
> between the various page factions.

Or set it up with the largest pages. People that use intermediate or base
pages will fragment the large pages anyway (and if they want to be sure --
there's dplace's lpage_wait flags), and if you set up precoalescing, they'll
be coalesced back, too. To enable use of a certain *different* page size, the
admins just have to make sure that the corresponding percent_totalmem systune
is "1" (or more).

Configuring more than one page size's percent as more than "1" is usually not
that simple to manage -- after some time, the distribution of what page sizes
are available on what bricks can become "interesting", and I'd wish anyone
good luck if they want to understand the performance implications ;}. Unless
you fence the programs in cpusets, classing them by their pagesize
preferences -- but I've rarely seen anyone do *that*.

Andy Nelson

Jan 23, 2003, 1:49:23 PM
Alexis Cousein <a...@brussels.sgi.com> wrote:

> Andy Nelson wrote:
>> Does this mean that a job running with (and using) 16mb pages will
>> get them stolen, broken up into smaller pages and returned later?

> No. It means that if *you* run using 16MB pages, and if after that other
> users
> break those pages up, and then some of the bits out of them get reused
> by the buffer cache, then when you try to restart your application,
> you may
> not get that many 16MB pages -- esp. if you don't do an lpage_wait dplace
> incantation.


Ok. Two more related questions before I'm done:

1) If I request large pages (say 16m) that aren't available so that
I end up with some smaller pages, is irix smart enough to remember
that I want large pages so that I can get them later in the run if
such become free? I can imagine this would be a difficult thing
to manage...


2) If I do the dplace waiting thing and as you say, my large pages
get allocated on another node, would it be useful to set page migration
on with a very high resistance (I forgot the exact name for it) so that
if memory becomes free later on my own nodes during my run, the page
moves over there, but otherwise everything stays pretty much where
it is? I realize that this is probably an opinion/guess question.

The question has some relevance because for some other parts of
the code (not discussed in my post before) I can see what I believe
is the memory latency to remote nodes and would prefer to make that
as low as possible. One specific and clear example is that I have
a quicksort that runs serially (because I haven't parallelized it
yet), which slows down by about 40% when it is
run on many processors (with round-robin placement) vs when it
runs on 1 or 8 (with the very low local latency implied).
Note that although I am reasonably certain that that particular
slowdown is not related to the other threads pinging the master,
I can't be 100% certain given the data I have in hand. In other
words, this paragraph may just be me talking out the side
of my head...

<various page configuration strategies and options>

> is usually not that simple to manage -- after some time, the distribution of
> what page sizes are available on what bricks can become "interesting"
> and I'd wish anyone good luck if they want to understand the
> performance implications ;}. Unless you fence the programs in cpusets


...uh yeah. I'll leave the system configuration stuff to the
gurus. :-)


Thanks!

Andy

Alexis Cousein

Jan 23, 2003, 2:08:52 PM, to Andy Nelson
Andy Nelson wrote:

> Ok. Two more related questions before I'm done:
>
> 1) If I request large pages (say 16m) that aren't available so that
> I end up with some smaller pages, is irix smart enough to remember
> that I want large pages so that I can get them later in the run if
> such become free? I can imagine this would be a difficult thing
> to manage...

Newly allocated pages are going to use whatever they can find, but it's
not going to migrate already mapped pages and update the translation table
on the fly, no. Once you get a small page, you're stuck with it - esp.
if it's on the heap, as IRIX never returns those virtual addresses to the
OS until the program stops.


>
>
> 2) If I do the dplace waiting thing and as you say, my large pages
> get allocated on another node, would it be useful to set page migration
> on with a very high resistance (I forgot the exact name for it) so that
> if memory becomes free later on my own nodes during my run, the page
> moves over there, but otherwise everything stays pretty much where
> it is? I realize that this is probably an opinion/guess question.

Probably not. If you want your app to work cleanly, better wrap a
MEMORY_MANDATORY and POLICY_PAGE cpuset around the job, start as many
processes as you have CPUs, let them allocate all the memory to push
everything else out, and then let them bail out.

Note: don't try this with a parallel job -- use several independent processes,
and sync them using Unix mechanisms. Don't use MPI or OpenMP to communicate
between them.

Oh, BTW -- make sure that the enable_devzero_opt kernel tunable is *off*. This
optimization (useful for small servers) is bad for memory allocation speed on
larger machines for large CPU count jobs, at least in 6.5.18.
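
To check that on a running system, something like the following should do (a
sketch: invoking systune with just the tunable's name should print its current
value, while changing it may require the -i or -b forms and possibly a reboot
-- see systune(1M)):

    systune enable_devzero_opt    # shows the current setting; 0 means off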

> the code (not discussed in my post before) I can see what I believe
> is the memory latency to remote nodes

Well, if you use round robin page placement, that's basically a given,
by definition.

Randolph J. Herber

Jan 23, 2003, 2:15:48 PM
The following header lines retained to effect attribution:
|Date: Thu, 23 Jan 2003 08:49:07 +0100
|From: Per Ekman <p...@pdc.kth.se>
|Subject: Re: Large page sizes on Origin 3800
|To: info-ir...@ARL.ARMY.MIL

|"Randolph J. Herber" <her...@dcdrjh.fnal.gov> writes:

|*p

Sorry, it is clear to me.

Try it this way:

It is possible that larger page sizes can create conditions that cause
an increase in swapping (the removal of entire processes from the system)
or paging (the removal of _portions_ of processes from the system). If
that happens, then, generally, the performance loss from swapping or paging
will negate any performance gain from eliminating TLB misses.

Other than that, there is no good reason not to use larger page sizes.

From my point of view, there was a typographical error of a missing ``not'';
the sentence should have read:

But, as long as swap rates do not increase, there is not a good reason
_not_ to use larger pages as, as you indicate, TLB misses are a major impact
on performance.

Randolph J. Herber, her...@fnal.gov, +1 630 840 2966, CD/CDFTF PK-149F,

Alexis Cousein

Jan 24, 2003, 7:39:37 AM
Randolph J. Herber wrote:
> Other than that, there is no good reason not to use larger page sizes.

Apart from what I pointed out, of course, on ccNUMA machines (in particular,
for parallel jobs and memory placement).

Randolph J. Herber

Jan 24, 2003, 12:07:38 PM
The following header lines retained to effect attribution:
|Date: Fri, 24 Jan 2003 13:39:37 +0100
|From: Alexis Cousein <a...@brussels.sgi.com>

|Subject: Re: Large page sizes on Origin 3800
|To: info-ir...@ARL.ARMY.MIL

|Randolph J. Herber wrote:


|> Other than that, there is no good reason not to use larger page sizes.

|Apart from what I pointed out, of course, on ccNUMA machines (in particular,
|for parallel jobs and memory placement).

|<these messages express my own views, not those of my employer>


|Alexis Cousein Senior Systems Engineer
|SGI Belgium and Luxemburg a...@brussels.sgi.com

If there are problems caused by non-uniform memory access or by memory
use granularity, then, in my experience, the first symptom seen by
users is an increased swapping and paging rate. You had not quoted
what the qualification ``other than that'' referred to. The phrase
referred to ``if increased swapping and paging rates do not occur.''
By omitting that, you changed my meaning.
