
L1 and L2 Line size in a 64 bit proc


John Oliver

Jan 4, 2004, 10:03:28 PM

Hello,

I admit that this is a 'homework' question but I'm just looking for
some pointers or advice. I have to justify a cache design and I don't
fully appreciate how to choose the level of associativity in the
cache.

I've to design a Virtual Mem. system for a theoretical 64 bit proc.
with 32 bit address lines. I've chosen to use 42 bits for virtual
space. I've to design the TLB, Page table scheme and a L1 and L2,
both unified. I just have the L1 and L2 left.

My problem is what level of associativity to use for the L1 and L2,
which are to be 16KB and 2MB respectively. In my course we've been
looking at the IA32 arch. which tends to be 4-way in both the L1 and
L2, and I've looked up 64 bit proc. specs on the web and found that
the Itanium2 uses 4 way for the L1 and 8 way for L2, but the Athlon
uses 2 way and 16 way respectively.

Once I have decided upon the associativity I'll be okay with the
number of sets and line size.

thanks,
John

Kai Harrekilde-Petersen

Jan 4, 2004, 10:25:16 PM

johnwp...@hotmail.com (John Oliver) writes:

> Hello,
>
> I admit that this is a 'homework' question but I'm just looking for
> some pointers or advice. I have to justify a cache design and I don't
> fully appreciate how to choose the level of associativity in the
> cache.

The level of associativity (i.e. the number of ways per set) determines
how effectively you can utilize the cache: a higher associativity level
means fewer conflict misses, so more of the data you actually need stays
in the cache, but the downside is that the speed of the cache drops as
the associativity increases.

Usually the number of entries in the on-chip TLB is so low that a
fully-associative version is chosen (TLB thrashing is a major
problem).

Also, sometimes upping the number of sets is the only way to go, due
to constraints I forget right now.

> Once I have decided upon the associativity I'll be okay with the
> number of sets and line size.

Line size is basically a decision independent of the associativity.
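
The bookkeeping is simple enough to sketch in C; the associativity and
line size below are my assumptions, not givens from the question:

#include <stdio.h>

int main(void)
{
    unsigned size = 16 * 1024;  /* the 16KB L1 from the question */
    unsigned ways = 4;          /* assumed associativity         */
    unsigned line = 64;         /* assumed line size in bytes    */
    unsigned sets = size / (ways * line);
    unsigned bits = 0;

    while ((1u << bits) < sets)
        bits++;                 /* bits = log2(sets), the index bits */

    printf("%u sets, %u index bits\n", sets, bits);
    return 0;
}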

Mitch Alsup

Jan 5, 2004, 3:33:14 AM

johnwp...@hotmail.com (John Oliver) wrote in message news:<44327f72.04010...@posting.google.com>...

> Hello,
>
> I admit that this is a 'homework' question but I'm just looking for
> some pointers or advice. I have to justify a cache design and I don't
> fully appreciate how to choose the level of associativity in the
> cache.
>
> I've to design a Virtual Mem. system for a theoretical 64 bit proc.
> with 32 bit address lines. I've chosen to use 42 bits for virtual
> space. I've to design the TLB, Page table scheme and a L1 and L2,
> both unified. I just have the L1 and L2 left.

What is the size of a page?

If you can make:

associativity * pagesize >= cachesize

then you can avoid all the linear/logical/virtual/real address alias
headaches that occur when a not-associative-enough cache supports the
lowest level of the cache hierarchy.
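
With an assumed 4KB page (John hasn't said what his page size is), the
rule works out like this in C; for the 16KB L1 it demands at least 4 ways:

#include <stdio.h>

int main(void)
{
    unsigned long pagesize  = 4096;       /* assumed page size */
    unsigned long cachesize = 16 * 1024;  /* the 16KB L1       */
    unsigned long assoc     = 4;

    if (assoc * pagesize >= cachesize)
        printf("safe: all index bits sit in the page offset\n");
    else
        printf("aliasing possible: need >= %lu ways\n",
               (cachesize + pagesize - 1) / pagesize);
    return 0;
}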

>
> My problem is what level of associativity to use for the L1 and L2,
> which are to be 16KB and 2MB respectively. In my course we've been
> looking at the IA32 arch. which tends to be 4-way in both the L1 and
> L2, and I've looked up 64 bit proc. specs on the web and found that
> the Itanium2 uses 4 way for the L1 and 8 way for L2, but the Athlon
> uses 2 way and 16 way respectively.

In reality, many times in chip design, you want a cache with more
associativity but cycle time constraints make you choose otherwise.

>
> Once I have decided upon the associativity I'll be okay with the
> number of sets and line size.
>
> thanks,
> John

Treat Line size as an independent variable.

Mitch

Norbert Juffa

Jan 5, 2004, 3:48:47 AM

"John Oliver" <johnwp...@hotmail.com> wrote in message news:44327f72.04010...@posting.google.com...

> Hello,
>
> I admit that this is a 'homework' question but I'm just looking for
> some pointers or advice. I have to justify a cache design and I don't
> fully appreciate how to choose the level of associativity in the
> cache.
>
> I've to design a Virtual Mem. system for a theoretical 64 bit proc.
> with 32 bit address lines. I've chosen to use 42 bits for virtual
> space. I've to design the TLB, Page table scheme and a L1 and L2,
> both unified. I just have the L1 and L2 left.
>
> My problem is what level of associativity to use for the L1 and L2,
> which are to be 16KB and 2MB respectively. In my course we've been
> looking at the IA32 arch. which tends to be 4-way in both the L1 and
> L2, and I've looked up 64 bit proc. specs on the web and found that
> the Itanium2 uses 4 way for the L1 and 8 way for L2, but the Athlon
> uses 2 way and 16 way respectively.

General pointer: It would be very useful to work through the relevant
chapters of Patterson & Hennessy, as they provide much quantitative
data for various cache/TLB configurations.

From a software perspective, it is best for all these structures to be
at least 4-way set associative. When processing data it is quite common
to have (at least) two sources and a destination, and one would like to
minimize conflict misses. Similarly, given that dynamic libraries are
very common these days, there are many situations where there are three
active areas of code: calling application, PLT, library. Depending on
the OS memory layout, these can easily cause ICache or ITLB misses if
the structures are only 2-way set associative. I have seen a lot of pain
and suffering in projects trying to optimize software for platforms where
caches or TLBs were less than 4-way set associative.

One additional remark: Make sure the TLB entries cover the contents of
the L1 and L2 caches, i.e. (page size * TLB entries) >= L2 cache size
if the L2 is (mostly) inclusive of L1, and >= L1 + L2 cache size if an
exclusive cache design (like the Opteron's) is used. Otherwise some strange
performance anomalies can arise, since data may be in the cache but
one needs to absorb a TLB miss to get to it.
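
With John's cache sizes and an assumed 4KB page, the arithmetic is easy
to sketch:

#include <stdio.h>

int main(void)
{
    unsigned long page = 4096;               /* assumed page size */
    unsigned long l1   = 16UL * 1024;        /* 16KB L1           */
    unsigned long l2   = 2UL * 1024 * 1024;  /* 2MB L2            */

    printf("inclusive L2: need >= %lu TLB entries\n", l2 / page);
    printf("exclusive L2: need >= %lu TLB entries\n", (l1 + l2) / page);
    return 0;
}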

Are you saying the fictitious CPU is to have a unified L1? That would
appear suboptimal to me. Separate 16KB caches would make more sense,
as the requirements for data caches (e.g. coherency protocol) and
instruction caches (e.g. pre-decode bits) are usually quite different,
and from what I understand the number of ports in a unified cache
would have to be quite large (I am not a HW designer, but this seems
undesirable).


> Once I have decided upon the associativity I'll be okay with the
> number of sets and line size.

Line size requirements differ depending on the application space. In
general, floating-point intensive applications tend to favor a longer
line length, while integer applications prefer a shorter line length.


-- Norbert Juffa


Nick Maclaren

Jan 5, 2004, 8:18:40 AM

In article <44327f72.04010...@posting.google.com>,
johnwp...@hotmail.com (John Oliver) writes:
|>
|> I admit that this is a 'homework' question but I'm just looking for
|> some pointers or advice. I have to justify a cache design and I don't
|> fully appreciate how to choose the level of associativity in the
|> cache.

Congratulations. Your posting is an example of appropriate use of
Usenet for homework! This is rare ....

|> My problem is what level of associativity to use for the L1 and L2,
|> which are to be 16KB and 2MB respectively. In my course we've been
|> looking at the IA32 arch. which tends to be 4-way in both the L1 and
|> L2, and I've looked up 64 bit proc. specs on the web and found that
|> the Itanium2 uses 4 way for the L1 and 8 way for L2, but the Athlon
|> uses 2 way and 16 way respectively.

Experience is that direct-mapped and 2-way associativity are a major
problem in that a high proportion of code will suffer badly from
clashes, where "badly" means virtually every access is a cache miss.
Consider, for example:

FOR i := 0 TO N DO
    x[i] := y[i] + z[i]
OD

Hardware designers sometimes choose them because the alternative is
to lose a cycle rather too often, but that invariably causes tuning
hell. Note that the alignment of the arrays in the above will, in
general, be data dependent.

So a good rule is to use at least 4-way if at all possible. It is
much rarer for it to cause trouble like the above, and programs that
are hit by it tend to expose the problem more predictably. The same
applies to TLBs, incidentally, for the same reasons.
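
You can see the effect with a toy C program; the geometry and the base
addresses below are mine, chosen to show the worst case:

#include <stdio.h>

int main(void)
{
    unsigned long line = 64, sets = 128;  /* e.g. a 16KB 2-way cache  */
    unsigned long x = 0x100000;           /* hypothetical array bases */
    unsigned long y = x + 64 * 1024;      /* spaced by a power-of-two */
    unsigned long z = y + 64 * 1024;      /* stride of 64KB           */

    /* All three map to set 0: three streams into two ways thrash.   */
    printf("set(x)=%lu set(y)=%lu set(z)=%lu\n",
           (x / line) % sets, (y / line) % sets, (z / line) % sets);
    return 0;
}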


Regards,
Nick Maclaren.

Anton Ertl

Jan 5, 2004, 9:08:20 AM

Mitch...@aol.com (Mitch Alsup) writes:
>johnwp...@hotmail.com (John Oliver) wrote in message news:<44327f72.04010...@posting.google.com>...
>> Hello,
>>
>> I admit that this is a 'homework' question but I'm just looking for
>> some pointers or advice. I have to justify a cache design and I don't
>> fully appreciate how to choose the level of associativity in the
>> cache.
>>
>> I've to design a Virtual Mem. system for a theoretical 64 bit proc.
>> with 32 bit address lines. I've chosen to use 42 bits for virtual
>> space. I've to design the TLB, Page table scheme and a L1 and L2,
>> both unified. I just have the L1 and L2 left.
>
>What is the size of a page?
>
>If you can make:
>
>associativity * pagesize >= cachesize
>
>then you can avoid all the linear/logical/virtual/real address alias
>headaches that occur when a not-associative-enough cache supports the
>lowest level of the cache hierarchy.

More precisely, if you satisfy this requirement, you can use a
virtually-indexed physically-tagged cache, and that allows you to
access the cache in parallel with the TLB without getting the same
physical line twice in the cache for different virtual addresses
(which would cause consistency trouble). This technique is usually
used for L1 caches, whereas L2 caches tend to be physically-indexed,
meaning that the access only starts after address translation, but has
no such restriction.
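
Spelled out with assumed numbers: with 64-byte lines, a 16KB 4-way cache
has one 4KB way, so the offset and index bits all fit inside an (assumed)
4KB page offset and are the same before and after translation. A trivial
C check:

#include <stdio.h>

int main(void)
{
    unsigned size = 16 * 1024, ways = 4;  /* assumed L1 geometry */
    unsigned page = 4096;                 /* assumed page size   */
    unsigned way_span = size / ways;      /* bytes covered by the
                                             offset + index bits */

    printf("way span %u, page %u: %s\n", way_span, page,
           way_span <= page ? "VIPT is alias-free" : "aliases possible");
    return 0;
}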

As the Alpha (EV45, EV6) and AMD (K6, K7, K8) guys show, there are
ways to achieve consistency in a virtually-indexed cache that don't
impose that restriction (AFAIK consistency is enforced during
replacement by evicting/invalidating lines with the same physical
address as the one that's loaded).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

John Oliver

Jan 5, 2004, 2:22:31 PM

Thanks very much for all your help. I'll spend some time now digesting and
contemplating all the replies. Off to the library to get the Computer
Organization and Design book!
Cheers,
John

Bernd Paysan

Jan 5, 2004, 9:41:48 PM

Anton Ertl wrote:
> As the Alpha (EV45, EV6) and AMD (K6, K7, K8) guys show, there are
> ways to achieve consistency in a virtually-indexed cache that don't
> impose that restriction (AFAIK consistency is enforced during
> replacement by evicting/invalidating lines with the same physical
> address as the one that's loaded).

I wonder if you could use the TLB to provide the "in cache" check. With
4K pages and 64-byte lines, you have 64 lines per page, so 64 bits per
TLB entry would be sufficient as "this is in L1 cache" flags. You only
need this for the (relatively small) first-level TLB*. 128 bits would
be sufficient to say "this is in L1 cache" and "where" when you have a
two-way set-associative cache (or a three-way). If you add the dirty bits
to the TLB entry, you can start evicting dirty lines to the L2 cache before
you throw out the TLB entry (making 192 bits per L1 TLB entry). And
then, you don't need any L1 cache tags - and especially no expensive
tag compares.
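
As a sketch, an entry in such an L1 TLB might look like this (the field
widths are my reading of the numbers above, not any real design):

#include <stdint.h>

struct l1_tlb_entry {
    uint64_t tag;    /* virtual page number plus process ID / ASID */
    uint64_t pfn;    /* physical frame number and permissions      */
    uint64_t in_l1;  /* bit i set: line i of this page is in L1    */
    uint64_t way;    /* bit i: which of the two ways holds line i  */
    uint64_t dirty;  /* bit i: line i needs write-back to the L2   */
};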

For fast task-switching, the TLB should contain process IDs or something
like that (on x86 perhaps a hash over the page table base pointer) so
that you don't have to flush it - flushing the TLB would now also flush
the L1 cache.

You can also use these bits for the replacement algorithm: You can take
two things into account: LRU *and* the number of cached lines. The
replacement policy in the cache itself makes sure that hot pages have
more active cache lines than infrequently used ones.

*) The Opteron has a two-level TLB

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

jeff carbonell

Jan 6, 2004, 1:39:30 AM

nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<btb6l0$78p$1...@pegasus.csx.cam.ac.uk>...

> In article <44327f72.04010...@posting.google.com>,
> johnwp...@hotmail.com (John Oliver) writes:
>
> Experience is that direct-mapped and 2-way associativity are a major
> problem in that a high proportion of code will suffer badly from
> clashes, where "badly" means virtually every access is a cache miss.
> Consider, for example:
>
> FOR i := 0 TO N DO
> x[i] := y[i]+z[i]
> OD
>

This makes sense, but then why did the designers of the K7 and K8
choose to make the L1 caches 2-way set associative? According to most
benchmarks, they perform well---Oh wait, the L2 cache is a huge victim
buffer for the L1 evictions.

jeff

Norbert Juffa

Jan 6, 2004, 8:06:23 AM

"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message news:2004Jan...@a0.complang.tuwien.ac.at...
> Mitch...@aol.com (Mitch Alsup) writes:
[...]

> >If you can make:
> >
> >associativity * pagesize >= cachesize
> >
> >then you can avoid all the linear/logical/virtual/real address alias
> >headaches that occur when a not-associative-enough cache supports the
> >lowest level of the cache hierarchy.
>
> More precisely, if you satisfy this requirement, you can use a
> virtually-indexed physically-tagged cache, and that allows you to
> access the cache in parallel with the TLB without getting the same
> physical line twice in the cache for different virtual addresses
> (which would cause consistency trouble). This technique is usually
> used for L1 caches, whereas L2 caches tend to be physically-indexed,
> meaning that the access only starts after address translation, but has
> no such restriction.

I am a bit rusty on these things, so I might have my terminology
confused. If associativity * pagesize >= cachesize, all the cache
index bits are taken from the non-translated portion of the address
bits, i.e. these bits are identical in the virtual and the physical
address. So why wouldn't such a cache be called "physically indexed"
as well?


-- Norbert


Nick Maclaren

Jan 6, 2004, 8:34:08 AM

In article <af33d33a.04010...@posting.google.com>,
jc_...@hotmail.com (jeff carbonell) writes:
|>
|> This makes sense, but, then why did the designers of the K7 and K8
|> choose to make the L1 caches 2-way set associative? According to most
|> benchmarks, they perform well---Oh wait, the L2 cache is a huge victim
|> buffer for the L1 evictions.

Never confuse benchmarketing with real performance analysis and
tuning. With the former, it is almost always possible to tweak
the code or compiler to bypass problems.

Far too many hardware designers do not listen to real applications
people - inadequate cache and TLB associativity is only one of
many classic misdesigns.


Regards,
Nick Maclaren.

Aaron Spink

Jan 6, 2004, 9:01:51 AM

"Bernd Paysan" <bernd....@gmx.de> wrote in message
news:s92pc1-...@cohen.paysan.nom...

> I wonder if you could use the TLB to provide the "in cache" check. With
> 4k pages and 64 byte lines, you have 64 lines per page. Thus 64 bit per
> TLB entry would be sufficient as flags "this is in L1 cache". You only
> need this for the (relatively small) first-level TLB*. 128 bits would
> be sufficient to say "this is in L1 cache" and "where" when you have a
> two-set associative cache (or a three-set). If you add the dirty bits
> to the TLB entry, you can start evicting dirty lines to L2 cache before
> you throw out the TLB entry (makes 192 bits per L1 TLB entry). And
> then, you don't need any L1 cache tags - and especially no expensive
> tests.
>
You might want to take a look at the HP presentations and papers on the
d-cache of the Itanium 2 from the 2002 ISSCC conference. They can be found
about halfway down the following page:

http://cpus.hp.com/technical_references/ia64.shtml


Anton Ertl

Jan 6, 2004, 9:56:10 AM

"Norbert Juffa" <ju...@earthlink.net> writes:
[on virtually-indexed physically-tagged caches]

>I am a bit rusty on these things, so I might have my terminology
>confused. If associativity * pagesize >= cachesize, all the cache
>index bits are taken from the non-translated portion of the address
>bits, i.e. these bits are identical in the virtual and the physical
>address. So why wouldn't such a cache be called "physically indexed"
>as well?

I don't know the background on this terminology, but my guess would be
that you call it a "virtually-indexed cache" to highlight the fact
that you can already index it before the MMU has translated the address.

Anton Ertl

Jan 6, 2004, 3:30:33 PM

jc_...@hotmail.com (jeff carbonell) writes:
>nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<btb6l0$78p$1...@pegasus.csx.cam.ac.uk>...
>> In article <44327f72.04010...@posting.google.com>,
>> johnwp...@hotmail.com (John Oliver) writes:
>>
>> Experience is that direct-mapped and 2-way associativity are a major
>> problem in that a high proportion of code will suffer badly from
>> clashes, where "badly" means virtually every access is a cache miss.
>> Consider, for example:
>>
>> FOR i := 0 TO N DO
>> x[i] := y[i]+z[i]
>> OD
>>
>
>This makes sense, but, then why did the designers of the K7 and K8
>choose to make the L1 caches 2-way set associative? According to most
>benchmarks, they perform well

There's an old theory that Usenet spans different universes. In
Nick's universe, a high proportion of the code misses on "virtually"
every access. In my universe and apparently the universe of the
benchmarkers, the K7 and K8 have relatively low L1 cache miss rates,
thanks to the relatively large size of the L1 cache.

Which universe do you live in? Measure the code you care for and find
out!

>---Oh wait, the L2 cache is a huge victim
>buffer for the L1 evictions.

The exclusive nature of the L2 cache makes code that misses L1 all the
time, but hits L2, twice as slow as it would be with an inclusive cache.
The reason is
that the performance in such situations is limited by bandwidth
between L1 and L2, and the exclusive cache design uses twice as much
bandwidth. Fortunately such code is rare in my universe.

Nick Maclaren

Jan 6, 2004, 4:01:16 PM

In article <2004Jan...@a0.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
|>
|> There's an old theory that Usenet spans different universes. In
|> Nick's universe, a high proportion of the code misses on "virtually"
|> every access. In my universe and apparently the universe of the
|> benchmarkers, the K7 and K8 have relatively low L1 cache miss rates,
|> thanks to the relatively large size of the L1 cache.

Please don't post nonsense and attribute it to me.

|> Which universe do you live in? Measure the code you care for and find
|> out!

Perhaps I need to spell it out, for you at least.

Data alignment is very problem- and data-dependent, and it is common
for an application to run well on one set of data and appallingly
on an apparently equivalent set of data. In particular, arrays
with sizes that are multiples of powers of two are a problem.
So all arrays should have sizes AND INDIVIDUAL DIMENSIONS that
aren't. Well, maybe.
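
In C that is the familiar padding dodge; the pad amount below is
arbitrary, anything that breaks the power-of-two stride will do:

#define N 1024

double bad[N][N];       /* 8KB row stride: columns collide in the cache */
double good[N][N + 9];  /* odd padding spreads columns across the sets  */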

Firstly, that means you must either use a single allocation and
break it up yourself in interesting and bizarre ways (very popular
in the 1960s and 1970s), or dynamically offset your use within
allocations. Without this, you run the risk of large allocations
all being on large power of two boundaries, which is often done
by systems for other reasons of efficiency.

In addition to the difficulty that it isn't always practical,
there is the problem of the physical mapping of caches. And the
unprivileged programmer can neither control nor even query his
virtual to physical address mapping. You then get applications
that run well 99% of the time and like a drain 1% of the time,
unpredictably.

This becomes REALLY serious for mission critical real-time codes,
such as chemical plant control. You absolutely do NOT want the
program slowing down by a large factor (ones of 3-10, overall,
can occur) because the system has distributed data in a way that
it rarely does or the application has changed its data sizes and
hit this problem.

And so it goes.


Regards,
Nick Maclaren.

Terje Mathisen

Jan 6, 2004, 4:11:26 PM

Nick Maclaren wrote:
> In addition to the difficulty that it isn't always practical,
> there is the problem of the physical mapping of caches. And the
> unprivileged programmer can neither control nor even query his
> virtual to physical address mapping. You then get applications
> that run well 99% of the time and like a drain 1% of the time,
> unpredictably.
>
> This becomes REALLY serious for mission critical real-time codes,
> such as chemical plant control. You absolutely do NOT want the
> program slowing down by a large factor (ones of 3-10, overall,
> can occur) because the system has distributed data in a way that
> it rarely does or the application has changed its data sizes and
> hit this problem.

Have I told the story about the OS which had process control structures
which were nice powers-of-two in size to speed up access to them?

With a lot of blocked threads, the system ran 4 X (or some number like
that) slower than usual, because the regular scanning of those blocked
threads to check if they needed some OS assistance caused both cache and
TLB thrashing.


A little padding later, and everything was back to normal. :-)
>
> And so it goes.

Right.

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Linus Torvalds

Jan 6, 2004, 6:40:01 PM

Terje Mathisen wrote:
>
> Have I told the story about the OS which had process control structures
> which were nice powers-of-two in size to speed up access to them?

This still happens, even when you try to be careful. And yes, it probably
happens more in OS code than other code, exactly because the OS generally
has to track a lot of data structures that are powers-of-two in size
because of simple hardware issues (ie fundamental memory allocation issues
wrt pages and superpages).

So when you have data structures that quite naturally end up being
page-aligned due to allocator issues, and have linked lists of these
things, walking the list ends up being really really bad. Even with
four-way caches or better.

And with cache lines generally having become bigger (they are almost
invariably 64 or 128 bytes these days), it's becoming harder to just
offset the data structures a bit. You've got to offset them rather a lot,
which can be very wasteful if your data structures are on the order of a
single page in size.

End result: avoid list walking like the plague for these data structures.
Even if it would be the simplest and obvious way, and the code generated
would be nice and understandable. Even if the average list size isn't that
big, it starts sucking rather quickly.

So even with four- or eight-way associativity there are problematic cases,
but at least at that point you usually can try to alleviate the few problem
spots by recoding. With lower associativity the "few problem spots" are
just too damn many and unrelated accesses end up disturbing each other.

Linus

Brian Hurt

Jan 6, 2004, 9:00:43 PM

nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<btem4c$2hl$1...@pegasus.csx.cam.ac.uk>...

> Data alignment is very problem- and data-dependent, and it is common
> for an application to run well on one set of data and appallingly
> on an apparently equivalent set of data. In particular, arrays
> with sizes that are multiples of powers of two are a problem.
> So all arrays should have sizes AND INDIVIDUAL DIMENSIONS that
> aren't. Well, maybe.

An excellent example of this in practice (IMHO):

http://citeseer.nj.nec.com/bodin94cache.html

*Any* level of associativity short of fully associative can lead
to conflict problems. Skewed associativity helps; it's not a
silver bullet, but it does help.

> In addition to the difficulty that it isn't always practical,
> there is the problem of the physical mapping of caches. And the
> unprivileged programmer can neither control nor even query his
> virtual to physical address mapping. You then get applications
> that run well 99% of the time and like a drain 1% of the time,
> unpredictably.

Virtual mapping of cache lines is worse, IMHO. You can't map the same
memory in two or more different locations and be cache coherent, and
you have to flush cache when switching memory spaces. Both the ARM
and PA-RISC made this mistake.

Supplying the physical addresses of the pages to the application also
has a lot of problems. For example- suppose the OS does this. It
wouldn't be hard to write an IOCTL to read the page tables, so it's
not technically hard to do. But how long does the page mapping stay
valid? When can the OS move physical pages around? And what can the
application do with the information? Other than know ahead of time
it's about to go splat.

Allowing the application to request certain memory regions have
certain properties- for example, force a certain memory region to be
"cache linear", that the memory region (no larger than cache) be
mapped so that none of the physical addresses collide in cache- has
the feel of an intractable problem from the OS's perspective. Even
just best effort starts getting tricky when resource allocation is
being stressed (i.e. most memory is being used). It certainly makes
page replacement algorithms a lot harder.

>
> This becomes REALLY serious for mission critical real-time codes,
> such as chemical plant control. You absolutely do NOT want the
> program slowing down by a large factor (ones of 3-10, overall,
> can occur) because the system has distributed data in a way that
> it rarely does or the application has changed its data sizes and
> hit this problem.
>

Hard realtime and virtual memory simply don't go together. I'm more
than a little confused by Wind River's recent adoption of Linux.
Linux is not, and never will be, a hard real time OS.

Brian

Nick Maclaren

Jan 6, 2004, 9:13:46 PM

In article <81f0f84e.04010...@posting.google.com>,
Brian Hurt <bh...@spnz.org> wrote:
>nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<btem4c$2hl$1...@pegasus.csx.cam.ac.uk>...
>
>> Data alignment is very problem- and data-dependent, and it is common
>> for an application to run well on one set of data and appallingly
>> on an apparently equivalent set of data. In particular, arrays
>> with sizes that are multiples of powers of two are a problem.
>> So all arrays should have sizes AND INDIVIDUAL DIMENSIONS that
>> aren't. Well, maybe.
>
>An excellent example of this in practice (IMHO):
>
>http://citeseer.nj.nec.com/bodin94cache.html
>
>*Any* level of associativity short of fully associative can lead
>to conflict problems. Skewed associativity helps; it's not a
>silver bullet, but it does help.

That is true, but is not the whole issue. All of the tuning experience
is that the problems drop off rapidly with associativity, and that
direct mapping and 2-way are VERY bad news to many common programming
paradigms. The one that I gave is not exactly rare! I have seen
problems with 4-way, but they aren't common, and I have NEVER seen any
as catastrophic as ones with 2-way except when trying to break systems.

>> In addition to the difficulty that it isn't always practical,
>> there is the problem of the physical mapping of caches. And the
>> unprivileged programmer can neither control nor even query his
>> virtual to physical address mapping. You then get applications
>> that run well 99% of the time and like a drain 1% of the time,
>> unpredictably.
>
>Virtual mapping of cache lines is worse, IMHO. You can't map the same
>memory in two or more different locations and be cache coherent, and
>you have to flush cache when switching memory spaces. Both the ARM
>and PA-RISC made this mistake.

That is not entirely true. You can use the single owner model and
maintain consistency - that is done for caches, after all!

And you do NOT have to flush cache when switching memory spaces;
just tag the lines with an ASID. That solution has been known and
used for decades, and it works, whether or not people get it wrong.

>Supplying the physical addresses of the pages to the application also
>has a lot of problems. For example- suppose the OS does this. It
>wouldn't be hard to write an IOCTL to read the page tables, so it's
>not technically hard to do. But how long does the page mapping stay
>valid? When can the OS move physical pages around? And what can the
>application do with the information? Other than know ahead of time
>it's about to go splat.

Exactly the same arguments apply AND WERE USED against allowing the
program to know how much CPU time it has used, or even what the time
of day is! Seriously. Do you remember when getting the time of day
was often a privileged operation? The above argument was used to
justify it.

>Allowing the application to request certain memory regions have
>certain properties- for example, force a certain memory region to be
>"cache linear", that the memory region (no larger than cache) be
>mapped so that none of the physical addresses collide in cache- has
>the feel of an intractible problem from the OS's perspective. Even
>just best effort starts getting tricky when resource allocation is
>being stressed (i.e. most memory is being used). It certainly makes
>page replacement algorithms a lot harder.

No, it's not. It's actually very easy. There are several approaches.
One of which is to supply hints and return information on what the
current situation is. Another (better) one is to have such options,
and to refuse the request if they can't be granted.

Yes, doing these makes writing the operating system harder. But NOT
doing so makes serious memory performance tuning bloody impossible.

>Hard realtime and virtual memory simply don't go together. I'm more
>than a little confused by Wind River's recent adoption of Linux.
>Linux is not, and never will be, a hard real time OS.

It doesn't go awfully well with HPC, either :-(


Regards,
Nick Maclaren.

Bernd Paysan

Jan 6, 2004, 9:56:40 PM

Aaron Spink wrote:
> You might want to take a look at the HP presentations and papers on
> the d-cache of the Itanium 2 from the 2002 ISSC conference. They can
> be found about halfway down the following page:
>
> http://cpus.hp.com/technical_references/ia64.shtml

Thanks, they implemented more or less exactly what I had in mind.

Hans de Vries

Jan 7, 2004, 3:17:04 AM

Bernd Paysan <bernd....@gmx.de> wrote in message news:<s92pc1-...@cohen.paysan.nom>...

>

> For fast task-switching, the TLB should contain process IDs or something
> like that (on x86 perhaps a hash over the page table base pointer) so
> that you don't have to flush it - it would also flush the L1 cache.
>

There is the mystery of the ASNs (Address Space Numbers) in the TLBs of
the Athlon 64 core. At MPF 2003, Fred Weber confirmed to Andreas Stiller,
who was kind enough to ask him on my behalf, that they are still in there.

On the other hand, there's not a word on ASNs in the K8 manuals. (AMD
patents give the impression that the process ASN is in a control register,
as in the Alpha.) So it's either 'not activated yet' or it's automatically
generated like you suggest.

The latter seems problematic. I made the same remark to Andi Kleen:

" It appears to me that simple hashing the translation base pointer
in CR3 down to 6 bit or so would do the job here without the need
for any CPUID stuff and special OS support "

Andi responded with:

" That would break badly on current Linux/x86-64 which currently uses a
single global per CPU PML4 page. Only entry 0 of it is rewritten on a
VM switch. When I designed this I was worried about it, but AMD
saw no problems so I went forward with this. "

" Also the CR3 hashing would break an important use case: Oracle (and DB2)
Oracle is lots of processes which have an own page table, but share
most of their VM in a big shared memory segment (the database cache)
Any ASN scheme from a CPU vendor who cares about TPC-* benchmark numbers
should handle this somehow, e.g. by putting the shared
memory segment into a special region and give it an own ASN. "

" Linux actually does not share the page tables currently, but other OS
do because it's an very beneficial optimization (can save a lot of
memory etc.) "

Regards, Hans.

James Cownie

Jan 7, 2004, 9:17:51 AM

Nick Maclaren wrote:

> In addition to the difficulty that it isn't always practical,
> there is the problem of the physical mapping of caches. And the
> unprivileged programmer can neither control nor even query his
> virtual to physical address mapping. You then get applications
> that run well 99% of the time and like a drain 1% of the time,
> unpredictably.
>
> This becomes REALLY serious for mission critical real-time codes,
> such as chemical plant control. You absolutely do NOT want the
> program slowing down by a large factor (ones of 3-10, overall,
> can occur) because the system has distributed data in a way that
> it rarely does or the application has changed its data sizes and
> hit this problem.

As I'm sure Nick realises this is also a significant issue for an MPI
like parallel code. If there's any interaction between the processes,
then having one of them running at 1/3 of the speed of the others will
likely slow them all down to that speed, and once one has 100+ parallel
processes this "unlikely" slowdown is highly probable.

So it's not just real time which is affected, but also SC.

--
-- Jim

James Cownie <jco...@etnus.com>
Etnus, LLC. +44 117 9071438
http://www.etnus.com

Nick Maclaren

Jan 7, 2004, 9:51:41 AM

In article <39QKb.8865$CM4.88...@news-text.cableinet.net>,

Yes - but at least it isn't life-threatening and does not USUALLY
cause total program failure!

My experience is that you will normally drop to the speed of the
slowest processor, but sometimes do worse. If the delay is such
that one of the processors gives up and changes from "fast wait"
to "slow wait" mode, things can run very slowly indeed. In the
worst case I have seen, the wait state was bouncing around the
nodes and the overall system degraded by a factor of over 1,000.

I don't know what triggered those, and have no evidence that it was
mapping-related delays, but no evidence it wasn't. All I know is
that it was triggered by ONE process running a bit slowly for a
short while, leading to complete meltdown. And I have seen it on
several systems, though only twice that badly.

As you know, the naive answer that such a problem is a simple bug
in the communications logic is fallacious. The problem arises
precisely because complicated optimisations usually improve things
but often have graceless - or even disgraceful :-) - degradation
modes. And that is a damn hard problem to resolve.


Regards,
Nick Maclaren.

Jan C. Vorbrüggen

Jan 7, 2004, 9:57:20 AM

> Allowing the application to request certain memory regions have
> certain properties- for example, force a certain memory region to be
> "cache linear", that the memory region (no larger than cache) be
> mapped so that none of the physical addresses collide in cache- has
> the feel of an intractable problem from the OS's perspective.

Page colouring should do for this problem. You can buy working systems
using it, although the manufacturer is traditionally reluctant to admit
that it sells such systems at all (infectious disease, infection occurred
by acquisition).

> Even just best effort starts getting tricky when resource allocation is
> being stressed (i.e. most memory is being used).

In such situations, your application is likely competing with others for
the processor, or starting to page/swap. A ~40% or even 100% slowdown is
likely irrelevant in such cases.

Jan

Bernd Paysan

Jan 7, 2004, 12:17:38 PM

Linus Torvalds wrote:
> End result: avoid list walking like the plague for these data structures.
> Even if it would be the simplest and obvious way, and the code generated
> would be nice and understandable. Even if the average list size isn't that
> big, it starts sucking rather quickly.

Isn't this a nice example where arrays of structures are better replaced by
structures of arrays? You probably have only to check a few items of a
structure per iteration, and with structures of arrays, aliasing is far
less critical.

If your arrays however match page size, and you need to go over all data of
a structure, aliasing becomes very critical, especially when you need more
than a few elements.

So if you want to be clever, split data between those you want to walk
through, and those you don't, and keep the walk-through data in an array.
Best if you don't need a list at all, i.e. if you usually walk through all
the data, use just an array, append at the tail, and fill in blanks from
deleted elements by copying from the tail. This is what current CPUs like -
streaming data access, no gaps, no pointer chasing.
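
A sketch of the two layouts (field names invented, 4-byte int assumed):

#define NTASKS 1024

struct task_aos {               /* array of structures: scanning     */
    int  state;                 /* 'state' drags the whole 128-byte  */
    char payload[124];          /* struct through the cache          */
};

struct tasks_soa {              /* structure of arrays: scanning     */
    int  state[NTASKS];         /* 'state' streams 16 entries per    */
    char payload[NTASKS][124];  /* 64-byte line, no pointer chasing  */
};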

BTW: Anton wrote in this thread that in his world, aliasing doesn't happen.
However, one of the first things we did in Gforth was to purposely dealias
the several stacks it uses - because it did hurt us big back in 486 times.
Maybe that's why we never saw any problem later ;-).

Glew News SBC

Jan 8, 2004, 6:36:55 AM

"Linus Torvalds" <torv...@osdl.org> wrote in message
news:bteuv9$str$1...@build.pdx.osdl.net...

> Terje Mathisen wrote:
> >
> > Have I told the story about the OS which had process control structures
> > which were nice powers-of-two in size to speed up access to them?
>
> This still happens, even when you try to be careful. And yes, it probably
> happens more in OS code than other code, exactly because the OS generally
> has to track a lot of data structures that are powers-of-two in size
> because of simple hardware issues (ie fundamental memory allocation issues
> wrt pages and superpages).
...

>
> End result: avoid list walking like the plague for these data structures.


Sigh. Non power of two moduli are easy enough to compute,
at least for an L2 access.

But instead of hardware and software working closer together,
we tend to work further and further apart.


Terje Mathisen

Jan 8, 2004, 7:51:23 AM

Glew News SBC wrote:

> "Linus Torvalds" <torv...@osdl.org> wrote in message
> news:bteuv9$str$1...@build.pdx.osdl.net...
>
>>Terje Mathisen wrote:
>>
>>>Have I told the story about the OS which had process control structures
>>>which were nice powers-of-two in size to speed up access to them?
>>
>>This still happens, even when you try to be careful. And yes, it probably
>>happens more in OS code than other code, exactly because the OS generally
>>has to track a lot of data structures that are powers-of-two in size
>>because of simple hardware issues (ie fundamental memory allocation issues
>>wrt pages and superpages).
>

> ....


>
>>End result: avoid list walking like the plague for these data structures.
>
> Sigh. Non power of two moduli are easy enough to compute,
> at least for an L2 access.

I do know how to calculate n mod m, where m is close to a power of two,
if I just have a fast integer mul, but...


>
> But instead of hardware and software working closer together,
> we tend to work further and further apart.

So what's the right thing to do?

Nick Maclaren

Jan 8, 2004, 9:39:08 AM

In article <bU6Lb.8214$qa4....@newssvr25.news.prodigy.com>,
"Glew News SBC" <andy-gle...@sbcglobal.net> writes:
|>
|> But instead of hardware and software working closer together,
|> we tend to work further and further apart.

The same applies for operating systems and applications people :-(


Regards,
Nick Maclaren.

Linus Torvalds

Jan 8, 2004, 4:00:01 PM

Glew News SBC wrote:

> "Linus Torvalds" <torv...@osdl.org> wrote in message
> news:bteuv9$str$1...@build.pdx.osdl.net...
>> Terje Mathisen wrote:
>> >
>> > Have I told the story about the OS which had process control structures
>> > which were nice powers-of-two in size to speed up access to them?
>>
>> This still happens, even when you try to be careful. And yes, it probably
>> happens more in OS code than other code, exactly because the OS generally
>> has to track a lot of data structures that are powers-of-two in size
>> because of simple hardware issues (ie fundamental memory allocation
>> issues wrt pages and superpages).
> ...
>>
>> End result: avoid list walking like the plague for these data structures.
>
> Sigh. Non power of two moduli are easy enough to compute,
> at least for an L2 access.

You didn't read my explanation of _why_ there are so many power-of-two
allocations.

It's not because we have an array which has entries the size of a
power-of-two. It's because the fundamental memory allocation unit HAS TO BE
a power of two for some rather basic reasons. When was the last time you
saw hardware that had non-power-of-two page sizes? Right. They just don't
exist, and they won't be existing any time soon.

So as a result, a lot of your OS allocations _will_ be based on power-of-two
offsets - whether they are related to each other or not. Which means that
your access patterns will invariably be biased towards those powers-of-two
as well.

Yes, we do offsetting etc. when we only use partial pages, but as I also
explained, with cache line sizes regularly being 128 bytes, you simply
_cannot_ color your data accesses very much if the basic allocation unit is
a page and you mostly work in single pages. And single pages are what you
work with in an OS. Big allocations are very much the exception.

And don't get me wrong: we can, and we do, try to do our best to avoid cache
conflicts. But in the end the software side can't do everything, and
hardware should try to avoid arbitrary limitations and just go for higher
associativity.

It's a question of give-and-take. Software tries its best to be
cache-friendly, but hardware should try its best to be sw-friendly too. If
either of those parts fail, your performance will suck. It's not "one or
the other", it's both.

I'm not asking for fully associative caches, btw. I accept that
associativity is complicated. I'm just stating that there is only so much
that software can do to reasonably avoid problems with low levels of
associativity, and that I personally consider anything less than 4-way to
be just too damn _fragile_ from a performance standpoint.

Linus

Glew News SBC

Jan 8, 2004, 7:45:43 PM

[Linus, paraphrased]
> n*2^m-sized data structures cause cache thrashing.
>
> End result: avoid list walking like the plague for these data structures.

[Glew]


> > Sigh. Non power of two moduli are easy enough to compute,
> > at least for an L2 access.

[Terje]


> I do know how to calculate n mod m, where m is close to a power of two,
> if I just have a fast integer mul, but...
> >
> > But instead of hardware and software working closer together,
> > we tend to work further and further apart.
>
> So what's the right thing to do?


The best thing that I can think of is for some architect like me to
take a chance, and build a better cache hash function.

It'll help on code that hasn't had this problem tuned away.
Unfortunately, it will not help on code that has already had
this problem tuned away, by folks like Linus.
And as such it would have trouble getting past the
Hennessy and Patterson watchdogs.

The hope being that, eventually, the natural way to write
such code would return.

Here's a question for Linus: if there were not problems with linked
lists colliding in the cache for n*2^m data structures (large m),
would you eventually return to using such data structures?
Or, are the non-linked-list techniques just plain better overall?
Can you give some examples of the non-linked-list techniques?

If the non-linked-list techniques are better overall,
then maybe it isn't worth it: cache thrashing because of such
resonances just supplies extra incentive to do the right thing.
Although there is nevertheless something to be said for making
non-highly-tuned code work well.

---

Obviously, I am posting this because I don't think my idea of
non-power-of-two hashing is likely to fly - it has not flown
in either of the companies I have been involved with recently,
AMD and Intel.

At Intel, there is definitely the "prove it" problem.

AMD has a more concrete issue: even if AMD did the right thing,
if Intel did not Linus would still be tuning his code to run well on
Intel's processors. So AMD would not get the full benefit until
Intel follows. (Unless AMD displaces Intel in the marketplace,
which I hope will happen. :-)

So I am afraid that this falls into the class of innovation where taking
the first step is difficult.


Glew News SBC

Jan 8, 2004, 7:59:14 PM

[Glew]

> > Sigh. Non power of two moduli are easy enough to compute,
> > at least for an L2 access.

[Linus]


> You didn't read my explanation of _why_ there are so many power-of-two
> allocations.
>
> It's not because we have an array which has entries the size of a
> power-of-two. It's because the fundamental memory allocation unit HAS TO BE
> a power of two for some rather basic reasons. When was the last time you
> saw hardware that had non-power-of-two page sizes? Right. They just don't
> exist, and they won't be existing any time soon.

You misunderstand my "sigh".

I am totally in agreement that it is unreasonable to ask software
to force data structures to be non-power-of-two.

What I regret is that it is fairly easy, in hardware, to avoid
the cache thrashing problems that power-of-two sized data
structures cause.

Consider a 4M L2 data cache with 64 byte cache lines, 16 way associative.
That's 4M/64/16 = exp2(22-6-4) = exp2(12) = 4K sets.
Essentially requiring modulo 4K in the cache index computation
- which is, of course, just a bitfield extract. Possibly with a few XORs
for hashing in of upper bits.

The circuit to compute modulo 4K-1 or 4K+1 is really not that hard.
It might not be reasonable to use in a fast L0 cache,
but it is pretty reasonable to use in an L2 cache.
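
In software terms the reduction is just digit folding, since 2^12 == 1
(mod 2^12 - 1); this is a sketch of the arithmetic, not of any shipped
indexing circuit:

#include <stdint.h>

static uint32_t mod_4095(uint64_t n)
{
    while (n > 4095)                     /* sum the base-4096 digits */
        n = (n >> 12) + (n & 4095);
    return n == 4095 ? 0 : (uint32_t)n;  /* 4095 itself folds to 0   */
}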

I personally like this idea of non-power-of-two, Mersenne,
or prime cache indexing. There's lots of literature on it.
There have been a few other proposals to improve the hash
function to avoid resonance collisions.

---

So, no, Linus when I complain about there being insufficient
hardware / software trading off, I think that in this case *hardware*
should change, not software.

If, that is, power-of-two data structures and linked lists
are still a good idea.


Greg Lindahl

Jan 8, 2004, 8:53:31 PM

In article <HriLb.8341$jy....@newssvr25.news.prodigy.com>,
Glew News SBC <andy-gle...@sbcglobal.net> wrote:

>AMD has a more concrete issue: even if AMD did the right thing,
>if Intel did not Linus would still be tuning his code to run well on
>Intel's processors. So AMD would not get the full benefit until
>Intel follows.

AMD would still get a benefit from all the non-tuned code in the
universe. There's probably a fair amount of such code in the Linux
kernel, and tons in user apps, which are easier to simulate.

Only a few compilers have optimizations to avoid this problem, and
those tweaks are mainly related to padding arrays to avoid collisions
-- a tiny subset of the problem.

Were you not able to show an improvement in SPECcpu in simulation?

There is a similar problem in memory systems with bank conflicts;
classic Cray PVP machines had a power-of-two problem, the perf
counters would report it, and users padded their arrays by hand. I
think Tera adopted a hashed fix for their machine. I can't think of
any more general-purpose examples.

-- greg

Brian Hurt

Jan 9, 2004, 4:53:43 AM

nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<btf8ea$fr6$1...@pegasus.csx.cam.ac.uk>...

> In article <81f0f84e.04010...@posting.google.com>,
> Brian Hurt <bh...@spnz.org> wrote:
> >nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<btem4c$2hl$1...@pegasus.csx.cam.ac.uk>...
> >
> >> Data alignment is very problem- and data-dependent, and it is common
> >> for an application to run well on one set of data and appallingly
> >> on an apparently equivalent set of data. In particular, arrays
> >> with sizes that are multiples of powers of two are a problem.
> >> So all arrays should have sizes AND INDIVIDUAL DIMENSIONS that
> >> aren't. Well, maybe.
> >
> >An excellent example of this in practice (IMHO):
> >
> >http://citeseer.nj.nec.com/bodin94cache.html
> >
> >*Any* cache level associativity less than fully associative can lead
> >to associativity problems. Skewed associativity helps- it's not a
> >silver bullet, but it does help.
>
> That is true, but is not the whole issue. All of the tuning experience
> is that the problems drop off rapidly with associativity, and that
> direct mapping and 2-way are VERY bad news to many common programming
> paradigms. The one that I gave is not exactly rare! I have seen
> problems with 4-way, but they aren't common, and I have NEVER seen any
> as catastrophic as ones with 2-way except when trying to break systems.

For an N-way set associative cache of size N*C, all you need is N+1
addresses which are equal to each other modulo C and you're doomed.
With N=2, all you need is three: easy enough to get with
code+heap+stack, or even code+code+stack. With N=4, you need five
addresses: harder to get, but not impossible. Especially if you're
doing big array walking in stupid ways.

> >Virtual mapping of cache lines is worse, IMHO. You can't map the same
> >memory in two or more different locations and be cache coherent, and
> >you have to flush cache when switching memory spaces. Both the ARM
> >and PA-RISC made this mistake.
>
> That is not entirely true. You can use the single owner model and
> maintain consistency - that is done for caches, after all!
>
> And you do NOT have to flush cache when switching memory spaces;
> just tag the lines with an ASID. That solution has been known and
> used for decades, and it works, whether or not people get it wrong.

The biggest problem with this is that it requires more bits of storage
per cache line to hold the ASID tags. Which means less cache per mm^2
of chip real estate.

>
> >Supplying the physical addresses of the pages to the application also
> >has a lot of problems. For example- suppose the OS does this. It
> >wouldn't be hard to write an IOCTL to read the page tables, so it's
> >not technically hard to do. But how long does the page mapping stay
> >valid? When can the OS move physical pages around? And what can the
> >application do with the information? Other than know ahead of time
> >it's about to go splat.
>
> Exactly the same arguments apply AND WERE USED against allowing the
> program to know how much CPU time it has used, or even what the time
> of day is! Seriously. Do you remember when getting the time of day
> was often a privileged operation? The above argument was used to
> justify it.

No, I don't remember.

I will note that both the time of day and the CPU time used are only
given within certain error bounds. The OS doesn't guarantee that
arbitrarily large amounts of time don't pass between two arbitrarily
close calls to time(). It *generally* doesn't, but no guarantees are
given.

>
> >Allowing the application to request certain memory regions have
> >certain properties- for example, force a certain memory region to be
> >"cache linear", that the memory region (no larger than cache) be
> >mapped so that none of the physical addresses collide in cache- has
> >the feel of an intractable problem from the OS's perspective. Even
> >just best effort starts getting tricky when resource allocation is
> >being stressed (i.e. most memory is being used). It certainly makes
> >page replacement algorithms a lot harder.
>
> No, it's not. It's actually very easy. There are several approaches.
> One of which is to supply hints and return information on what the
> current situation is. Another (better) one is to have such options,
> and to refuse the request if they can't be granted.
>
> Yes, doing these makes writing the operating system harder. But NOT
> doing so makes serious memory performance tuning bloody impossible.

I was about to launch into a mathematical proof of how you're wrong,
when I managed to mathematically prove I'm wrong :-).

OK, here's the idea. All the HPC people need is to keep their pages
at consistent cache colours. Basically, for an N-way set
associative cache of size N*C and page size P, if you have two virtual
addresses X1 and X2 such that X1/P != X2/P mod C/P, you want their
physical addresses Y1 and Y2 to also be Y1/P != Y2/P mod C/P. The
easiest way to do this is to simply guarantee that for every virtual
address X, the physical address Y will always follow X/P = Y/P + K mod
C/P for some arbitrary K. You would have the same K for all pages in a
virtual memory space. *What* K a given process has you don't care.
Of course, it'd make some corner cases (like memory maps) simpler if all
processes in the system had the same K, but you might have performance
advantages if you could vary Ks.
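
As a sketch, with all the constants assumed, the invariant is cheap to
state and check:

#include <stdint.h>

#define P        4096u          /* page size (assumed)            */
#define WAYSIZE  (64u * 1024u)  /* cache size / associativity (C) */
#define COLOURS  (WAYSIZE / P)  /* 16 page colours                */

/* Virtual and physical page must agree on the cache colour, up to
   the fixed per-process constant K: X/P = Y/P + K (mod C/P).     */
static int colour_ok(uintptr_t x, uintptr_t y, unsigned k)
{
    return (x / P) % COLOURS == ((y / P) + k) % COLOURS;
}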

But anyways. The OS has some algorithm by which it handles pages. A
classic example would be the clock algorithm, which I'll use as an
example; generalizing should be obvious. Instead of having a single
big clock, where the OS moves from page Y to page Y+1, the OS instead
keeps C/P clocks, each of which moves from page Y to Y + C/P. Instead
of having one big memory space which is handled by a single page
replacement algorithm, you have C/P copies of the same algorithm, each
with its own memory space: the pages with Y/P = K mod C/P.

The only problem with this is tuning the OS. Very large cache sizes
relative to the page sizes, or very small memory sizes relative to the
cache sizes, and the system would start having performance problems.
There seems to be a sweet spot right around having memory only 128
times larger than cache. With a 4K page size and memory only 128
times larger than cache, this means you have one memory manager for
every 512K of memory. For a 2G system, this means 4096 different
memory regions. This is a desktop system; server and HPC systems
would be way worse. Although desktop systems tend to have small caches
relative to the size of memory: 2G systems often have only 512K, or
even 256K, of cache, meaning each memory manager has 4-8M of memory
under its command. Much better.

Although I suppose you could just set C/P to some reasonable constant,
like 16 or something. Act like you have a 64K cache or so. Or pick
the mid-sized (L2) cache to do your calculations. Or just have a
system that doesn't have a lot of data overhead (like the clock
algorithm) and doesn't mind having a small amount of memory to work
with. We did do virtual memory systems in 512K. I go that far back.

Note that this idea doesn't have the OS telling the application diddly
squat. Nor does it require the application to tell the OS diddly
squat. The application can simply assume certain relationships
between the virtual memory and the cache behavior.

Actually, this solves at least one of the problems real time processes
have with virtual memory. The other, swapping, is easy to handle:
don't do it with real time processes. I like this.

>
> >Hard realtime and virtual memory simply don't go together. I'm more
> >than a little confused by Wind River's recent adoption of Linux.
> >Linux is not, and never will be, a hard real time OS.
>
> It doesn't go awfully well with HPC, either :-(
>

Although when an HPC system hits the jackpot and suddenly runs
significantly slower, lawyers don't tend to get involved. Although I
do wonder why HPC doesn't have dedicated OSs like real time does. You
guys have the money.

Linus Torvalds

Jan 9, 2004, 6:51:46 AM

Glew News SBC wrote:
>
> You misunderstand my "sigh".

I did indeed. I've gotten so used to the way caches work that I have become
myopic.

Your idea to use a hash to distribute the caches more evenly sounds very
feasible, and would likely solve some problems. On the other hand it would
make it harder to tune for caches, which some projects like doing.
Personally, I'd love to just try to keep the footprint down and not worry
about cache coloring, and a hashed cache index would force that. It
would likely be a lot more effective than what we do in software.

On the other hand, on the codes that _are_ tuned, it would possibly hurt, so
you'd get a fair amount of push-back on it.

And I suspect that with something like 8-way associativity, you won't find
many cases where it really matters. I think Intel already does 8-way on the
L2, and the Opteron does 16-way? I can't imagine that conflict misses end
up being that common any more at that point.

Linus

Jan C. Vorbrüggen

unread,
Jan 9, 2004, 8:44:25 AM1/9/04
to
> if there were not problems with linked lists colliding in the cache for
> n*2^m data structures (large m), would you eventually return to using
> such data structures? Or, are the non-linked-list techniques just plain
> better overall?

It's not so much a question of access - linked-list vs anything-else - but
of data layout - structure-of-arrays vs array-of-structures.

If you will access most or all of the information contained in the structure,
you want to use the array-of-structures, because the memory hierarchy is
prefetching the information you will need soon for you. If you are looking
at only one or a few items in the structure - say, no more than the
associativity of the nearest cache - go for the structure-of-arrays,
and do partially-linear searches (again, to exploit that spatial locality
your cache is offering). In principle, you could even combine the two:
search for the right item in an array, and once you have found it, get the
rest of the information from the corresponding structure.
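
For instance (a sketch; the record layout and all names are invented
for illustration):

#include <stddef.h>

struct record {
    int  key;
    char payload[60];    /* the rest of the record */
};

/* keys[i] shadows recs[i].key, so the search scans a dense array and
   gets the spatial locality the cache is offering; only the final hit
   touches the big struct. */
static const struct record *lookup(const int *keys,
                                   const struct record *recs,
                                   size_t n, int key)
{
    for (size_t i = 0; i < n; i++)
        if (keys[i] == key)
            return &recs[i];
    return NULL;
}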

So, the answer, as usual, is "it depends".

Jan

Mark Smotherman

unread,
Jan 9, 2004, 6:23:43 PM1/9/04
to
"Glew News SBC" <andy-gle...@sbcglobal.net> writes:
>Obviously, I am posting this because I don't think my idea of
>non-power-of-two hashing is likely to fly - it has not flown
>in either of the companies I have been involved with recently,
>AMD and Intel.

Cf. HP, who used to hash cache addresses when they used direct-mapped
caches in the 7x00 series but stopped in the 7300LC:

To reduce the impact on timing, we eliminated cache address hashing,
which had been used to reduce thrashing in directly mapped cache
designs. Once we added associativity, hashing was no longer necessary.

Terry W. Blanchard and Paul G. Tobin, "The PA 7300LC Microprocessor"
http://cpus.hp.com/technical_references/highly_integrated.pdf


I believe the earlier 7x00 hashing could be turned on or off by the OS.
If someone has an old HP box, it might make a nice comparsion to run a
list walk w/ and w/o the hashing.
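
Something like this would do (stride and counts picked arbitrarily; the
interesting stride is a power of two, to provoke collisions when the
hashing is off):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 4096   /* power-of-two stride: collides when unhashed */
#define NODES  1024
#define LAPS   10000000L

struct node { struct node *next; };

int main(void)
{
    char *buf = malloc((size_t)NODES * STRIDE);
    struct node *n;
    if (!buf)
        return 1;

    for (int i = 0; i < NODES; i++)   /* chain the nodes STRIDE apart */
        ((struct node *)(buf + i * STRIDE))->next =
            (struct node *)(buf + ((i + 1) % NODES) * STRIDE);

    n = (struct node *)buf;
    clock_t t0 = clock();
    for (long i = 0; i < LAPS; i++)
        n = n->next;                  /* dependent loads: pure latency */
    printf("%p: %.2f s\n", (void *)n,
           (double)(clock() - t0) / CLOCKS_PER_SEC);
    free(buf);
    return 0;
}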

--
Mark Smotherman, Computer Science Dept., Clemson University, Clemson, SC

Glew News SBC

unread,
Jan 11, 2004, 12:20:32 AM1/11/04
to

"Jan C. Vorbrüggen" <jvorbr...@mediasec.de> wrote in message
news:3FFE69E...@mediasec.de...


Struct of Array will always be a win with regards to the line size:
if you are only accessing one word per struct, and the cache line
contains W words, Array of Struct will waste (W-1)/W of the
cache storage, but Struct of Array won't.

Beyond the cache line, however, non-power-of-two cache index
hash functions potentially allow all lines in the cache to be used
for any field.


Glew News SBC

unread,
Jan 11, 2004, 12:20:32 AM1/11/04
to
"Linus Torvalds" <torv...@osdl.org>
> Glew News SBC wrote:
> > [non power of 2 cache index hash functions]:

>
> Your idea to use a hash to distribute the caches more evenly sounds very
> feasible, and would likely solve some problems. On the other hand it would
> make it harder to tune for caches, which some projects like doing.
> Personally, I'd love to just try to keep the footprint down and not worry
> about cache coloring, and a hashed cache index would force that. It
> would likely be a lot more effective than what we do in software.
>
> On the other hand, on the codes that _are_ tuned, it would possibly hurt, so
> you'd get a fair amount of push-back on it.

If the hash function is modulo a prime, then it only hurts codes that
have chosen struct or array sizes that are multiples of exactly that
prime.

Are there any other ways in which you can hurt?
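
For concreteness, a sketch of the two index functions (509 sets - a
prime just under 512 - is an arbitrary choice; real proposals differ in
how cheaply the modulus gets computed):

#include <stdint.h>

#define LINE_BITS  6        /* 64-byte lines, assumed */
#define SETS_POW2  512u
#define SETS_PRIME 509u     /* prime just below 512   */

static unsigned index_pow2(uintptr_t addr)
{
    return (addr >> LINE_BITS) & (SETS_POW2 - 1);  /* low bits only    */
}

static unsigned index_prime(uintptr_t addr)
{
    return (addr >> LINE_BITS) % SETS_PRIME;       /* every bit counts */
}

/* With index_pow2, a 4096-byte stride cycles through just 8 of the 512
   sets; with index_prime, lines only collide when the stride is a
   multiple of 509 lines - i.e. of the prime itself. */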


> And I suspect that with something like 8-way associativity, you won't find
> many cases where it really matters. I think Intel already does 8-way on the
> L2, and the Opteron does 16-way? I can't imagine that conflict misses end
> up being that common any more at that point.
>
> Linus

That's the gotcha.

Prime moduli really "spread" a struct member so that it can occupy any entry
in the cache - effectively increasing the associativity. But if you are
already doing 16 way associativity, the gain from increasing it much more is
small. It doesn't really improve average behavior; it does tend to make the
worst case behavior better.

(Note: sometimes I'll say "improved hash functions", and sometimes "prime
moduli", which are just a particular form of improved hash function. I have
a personal liking for prime moduli, although there have been many other
improved hash functions proposed.)

If you have an improved hash function, you will probably nevertheless want
to maintain associativity in case of collisions. However, cache collisions
are much less systematic than with power-of-2 caches.

E.g. with prime moduli you may be able to get by with 4 or 8 way
associativity, rather than 16 way associativity.

Hmm... I wonder if prime moduli would save power? Trade off matching more
tags vs. computing the modulus. I've never looked at it from the power point
of view.

Some improved hash functions, such as prime moduli, require more tag bits.

---

Probably the biggest problem with an improved hash function such as prime
moduli is that, while it is fairly easy to see that it can be used to access
most L2 caches, it is much harder to make that case for an L1 or L0 data
cache access that wants to be low latency.

I.e. it's easy to use hashing to reduce cache collisions for the L2 cache,
but not for the closer in L1 and L0 caches. Which means that aggressive
software will try to tune for cache resonances anyway.

E.g. you can't begin indexing the cache until all of the address bits are
available, since all contribute to the modulus. That's bad enough for a
virtually indexed cache; it's really bad, close to intolerable, for a
physically indexed cache. Modern caches do not perform the translation
before indexing the data cache; in fact, they don't even complete the
translation until a long time after supplying the load data.

I've played with schemes such as caching the translation and the moduli for
likely address base registers. This works for ptr->field, for structures
that are smaller than the resonance of the cache. It doesn't work so well
for ptr_to_array[large_offset1][large_offset2].
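
A sketch of that base-register trick (names invented; assumes the base
is line-aligned and the offset stays under the prime's worth of lines):

#include <stdint.h>

#define LINE_BITS 6
#define PRIME     509u

struct base_cache {
    uintptr_t base;      /* the base register's value          */
    unsigned  base_mod;  /* cached (base >> LINE_BITS) % PRIME */
};

/* ptr->field: one add and a conditional subtract instead of a full
   modulus, valid while offset >> LINE_BITS < PRIME and the base is
   line-aligned (no sub-line carry). */
static unsigned index_via_cache(const struct base_cache *bc,
                                unsigned offset)
{
    unsigned idx = bc->base_mod + (offset >> LINE_BITS);
    return idx >= PRIME ? idx - PRIME : idx;
}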

Jan C. Vorbrüggen

unread,
Jan 12, 2004, 10:25:40 AM1/12/04
to
> Struct of Array will always be a win with regards to the line size:
> if you are only accessing one word per struct, and the cache line
> contains W words, Array of Struct will waste (W-1)/W of the
> cache storage, but Struct of Array won't.

Depends on the details of the scenario. Imagine doing a binary search
on an array of structs: no matter what you do for the search part, you
waste bandwidth by the above factor (except when your search interval
is down to a "few" items, but that is only for a constant number of a
few final steps), but when you've finally reached the target item, the
rest of the struct has already been (effectively pre-)fetched.

Jan

Jan C. Vorbrüggen

unread,
Jan 12, 2004, 10:29:29 AM1/12/04
to
> Hmm... I wonder if prime moduli would save power? Trade off matching
> more tags vs. computing the modulus.

In the design rationale for the fully-associative cache on the T9000,
Inmos argued that, compared to a usefully set-associative design, the
fully-associative solution required less power. I believe it also made
it easier to run (parts of) the cache as on-chip memory (IIRC, one could
at start-up load the tag store of (parts of) the CAM with the desired
addresses).

Jan
