
Does the PowerPC 970 have Tagged TLBs (Address Space Identifiers)?


Behrang Saeedzadeh

May 12, 2003, 3:21:27 AM
Hi

As you know, a tagged TLB is useful for improving performance across
context switches because the entire TLB does not have to be refilled.
When a context switch happens, the new context does not have to refill
the TLB entries whose ASID matches its own.
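
In rough C terms, what I mean is something like the following sketch
(field widths and the fully associative search are just illustrative,
not any particular chip's format):

#include <stdint.h>
#include <stddef.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint32_t vpn;    /* virtual page number */
    uint32_t pfn;    /* physical frame number */
    uint8_t  asid;   /* address space identifier tag */
    uint8_t  valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 on hit; entries belonging to other address spaces simply
   don't match, so nothing has to be flushed at context-switch time. */
int tlb_lookup(uint32_t vpn, uint8_t cur_asid, uint32_t *pfn_out)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == cur_asid) {
            *pfn_out = tlb[i].pfn;
            return 1;
        }
    }
    return 0;   /* miss: refill from the page tables, tagging with cur_asid */
}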

The only processor with a tagged TLB architecture that I'm aware of
is MIPS. Tagged TLBs are key to the success of microkernel-based
OSes.

I just wanted to know if the Power4 and PowerPC 970 architectures are
equipped with tagged TLBs or not.

Yours,
Behrang S.

Bruce Hoult

May 12, 2003, 4:32:15 AM
In article <3486820e.03051...@posting.google.com>,
behr...@yahoo.com (Behrang Saeedzadeh) wrote:

All previous PowerPC chips have had tagged TLBs. Entries are tagged not
with the process ID but with the segment ID, where a segment is a 256 MB
block of address space. A single process may use multiple segments, and
multiple processes may share a segment.
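
Roughly, for a 32-bit effective address the lookup amounts to something
like this sketch (the 16 segment registers are the classic 32-bit
PowerPC arrangement; the VSID width here is illustrative):

#include <stdint.h>

/* The top 4 bits of the effective address select one of 16 segment
   registers, each covering a 256 MB (2^28 byte) segment.  The VSID
   fetched from the segment register is what tags the translation, so
   TLB entries survive context switches and shared segments can share
   entries across processes. */
static uint32_t segment_regs[16];   /* VSID per 256 MB segment */

uint64_t effective_to_virtual(uint32_t ea)
{
    uint32_t seg_index = ea >> 28;           /* which 256 MB segment    */
    uint32_t offset    = ea & 0x0FFFFFFF;    /* 28-bit offset within it */
    uint32_t vsid      = segment_regs[seg_index];

    /* The "virtual address" the TLB and page tables see is VSID:offset. */
    return ((uint64_t)vsid << 28) | offset;
}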

-- Bruce

Per Ekman

May 12, 2003, 8:23:45 AM
behr...@yahoo.com (Behrang Saeedzadeh) writes:

> Hi
>
> As you know, a tagged TLB is useful for improving performance across
> context switches because the entire TLB does not have to be refilled.
> When a context switch happens, the new context does not have to refill
> the TLB entries whose ASID matches its own.
>
> The only processor with a tagged TLB architecture that I'm aware of
> is MIPS. Tagged TLBs are key to the success of microkernel-based OSes.

I think tagged TLBs are the rule rather than the exception, at least
in the high end. AFAIK MIPS, Alpha and IA-64 all have ASIDs of one
form or another and the Power4 has a Segment Lookaside Buffer that I
assume could fulfill a similar function.

*p

Sander Vesik

May 12, 2003, 11:31:19 AM

SPARC has them too - aren't x86-xx and ARM (OK, not sure about that one) the
only ones that don't?

>
> *p
>

--
Sander

+++ Out of cheese error +++

Anne & Lynn Wheeler

May 12, 2003, 12:03:27 PM

Sort of the wrong question. Lots of virtual memory architectures have
associated TLB entries with an address space. In the original 370
architecture this was referred to as STO-associative (segment table
origin associative, where the segment table was the virtual address
space specific table and the STO, or segment table origin, was the
address of that address space's unique table).

With 801 (starting in the '70s) ... and inverted tables ... there was
no longer an address space specific table. 801, rather than having an
address space table in real storage that could be used as a tag to
uniquely tag TLB entries ... went to tag bits for each virtual segment.
In the original 370 architecture, this was referred to as
STE-associative (or PTO-associative ... TLB entries were tagged using
the origin address of each segment's page table).

ROMP supported a 12-bit tag. There were 16 segment registers (in a
32-bit address) and the currently active address space was defined by
loading specific tag values into each of the 16 segment registers. On a
TLB miss, a new entry was loaded with the real address ... and the
corresponding 12-bit segment tag value. In this sense, there was no
address space specific ID ... instead each virtual segment (256 mbytes
of 32-bit address space) had a unique TLB tag. Not only wasn't it
necessary to flush and reload the complete TLB on an address space
switch ... but since the TLB entries were virtual segment associative
... rather than virtual address space associative ... it was possible
for shared segments across multiple address spaces to share the same
TLB entries.

The original 360/67 from the '60s didn't support multiple address
spaces and therefore flushed and refilled the TLB (actually it had a
fully associative array) on each address space change.

The high-end 370s (starting in the early '70s) had multi-tagged
TLBs. The 370/165 & 370/168 had a seven-entry STO-stack ... aka they
could remember up to seven STOs and there was a three-bit tag for each
TLB entry ... corresponding to the seven STOs remembered in the
STO-stack. Loading an active STO that wasn't in the STO-stack
... resulted in scavenging one of the current seven STOs and flushing
all of the corresponding TLB entries.

The 12-bit segment-id tag for ROMP (PC/RT) from the early '80s ... gave
rise to the reference that ROMP had 40-bit virtual addressing (aka a
28-bit displacement within a segment plus the 12-bit tag). RIOS (RS/6000)
doubled the segment-id tag bits to 24 ... giving rise to descriptions
of RIOS having 52-bit virtual addressing.
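
Spelling out that arithmetic:

/* A 256 MB segment leaves a 28-bit displacement; the segment tag simply
   sits on top of it. */
enum {
    ROMP_VIRT_BITS = 28 + 12,   /* 12-bit tag -> "40-bit" virtual addressing */
    RIOS_VIRT_BITS = 28 + 24    /* 24-bit tag -> "52-bit" virtual addressing */
};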

The mainframe 370 generations never did go to STE-associative ... even
though they were finding that half of the TLB entries tended to be
shared kernel entries. In the early '80s ... the mainframe added a
special super bit tag ... which was effectively the common tag used to
refer to the set of common, shared entries across all address spaces.
This was a very specific case for the major operating system, MVS.

related discussion on whether or not the TLB tag bits can be considered
part of the virtual address space bits:
http://www.garlic.com/~lynn/2003e.html#0 Resolved: There Are No Programs With >32 Bits of Text
http://www.garlic.com/~lynn/2003e.html#12 Resolved: There Are No Programs With >32 Bits of Text

misc. past 801/romp/rios refs:
http://www.garlic.com/~lynn/subtopic.html#801

--
Anne & Lynn Wheeler | ly...@garlic.com - http://www.garlic.com/~lynn/
Internet trivia, 20th anniv: http://www.garlic.com/~lynn/rfcietff.htm

Per Ekman

May 12, 2003, 12:01:18 PM
Sander Vesik <san...@haldjas.folklore.ee> writes:

> Per Ekman <p...@pdc.kth.se> wrote:

> > I think tagged TLBs are the rule rather than the exception, at least
> > in the high end. AFAIK MIPS, Alpha and IA-64 all have ASIDs of one
> > form or another and the Power4 has a Segment Lookaside Buffer that I
> > assume could fulfill a similar function.
>
> SPARC has them too - aren't x86-xx and ARM (OK, not sure about that one) the
> only ones that don't?

ISTR that ARM has some PTE field that they didn't call ASID but that
looked like it could be used to that effect. Don't have the ARMARM
with me just now though.

*p

Peter Boyle

May 12, 2003, 12:30:57 PM

Don't know about power4/ppc970, but the lowly embedded PPC440 chip has
a PID field in its TLB, plus two translation spaces TS=0,1.

PID==0 is a wild card.
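
In other words, something like the following match rule (a sketch; the
field names are illustrative):

#include <stdint.h>

/* Each entry carries a TS bit and a PID field; a PID of 0 in the entry
   acts as a wildcard matching every process, as noted above. */
struct tlb440_entry {
    uint32_t epn;   /* effective page number */
    uint8_t  ts;    /* translation space, 0 or 1 */
    uint8_t  pid;   /* process id tag; 0 = match any PID */
    uint8_t  valid;
};

int tlb440_matches(const struct tlb440_entry *e,
                   uint32_t epn, uint8_t cur_ts, uint8_t cur_pid)
{
    return e->valid &&
           e->epn == epn &&
           e->ts  == cur_ts &&
           (e->pid == 0 || e->pid == cur_pid);
}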

Peter

> >
> > *p
> >
>
> --
> Sander
>
> +++ Out of cheese error +++
>

Peter Boyle pbo...@physics.gla.ac.uk

Andy Glew

May 12, 2003, 12:37:52 PM
> As you know, a tagged TLB is useful for improving performance across
> context switches because the entire TLB does not have to be refilled.
> When a context switch happens, the new context does not have to refill
> the TLB entries whose ASID matches its own.
>
> The only processor with a tagged TLB architecture that I'm aware of
> is MIPS. Tagged TLBs are key to the success of microkernel-based OSes.

Tagged TLBs are one way of allowing TLB entries to persist across
context switches. However, TLBs tagged with process IDs do not
help shared libraries or data that are shared between some, but not
all, processes.

I'm tempted to say that process-ID TLB tagging is now known to be
a dead end wrt instruction set architecture, because of these
other alternatives:
(0) Process ID tagged TLBs
(1) Folded Address Space
(2) Object ID tagged TLBs
(3) Snoopy TLBs

(0) - the original poster probably understands them. TLB entries are tagged
with a process ID, and are only hit if the current process ID matches that
in the TLB.
Usually, there is a kluge to allow the process ID tag for kernel TLB entries
to be ignored, so that kernel TLB entries can be shared amongst all processes.

(1) As others have pointed out, the sort of "folded address space" (my name)
that is in some of the Power chips, the IBM PA, and the Intel Itanium,
give you something better than tagged TLBs.
I call these "folded" because, given a V1-bit virtual address, the upper
S1 bits are looked up in a table that provide you with V2 = V1-S1+S2 bits
of virtual address - the smaller V1-bit virtual address is "unfolded" into
a larger V2-bit virtual address. I.e. the V1-bit virtual address space is
divided up into 2^S1 "segments" of (2^(V1-S1)) bits each.
Typical sizes are V1=64 bits, V2=80 bits, and S1=64-52 bits.

So, at the very least, you can just use this as a 16 bit process ID TLB tag.
But you should be able to see how this can be used for partial sharing.
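
A sketch of the unfolding step, with illustrative widths (following the
64 -> 80 bit case above; the 16-entry table is arbitrary - Itanium has 8
region registers, PowerPC uses segment tables / SLBs):

#include <stdint.h>

#define V1_BITS 64
#define S1_BITS 4              /* illustrative: 16 segments          */
#define S2_BITS 20             /* illustrative segment-ID width      */

static uint32_t segment_table[1u << S1_BITS];

/* Produces the unfolded address as id:offset, conceptually
   V2 = V1 - S1 + S2 bits wide (64 - 4 + 20 = 80 here). */
void unfold(uint64_t va, uint32_t *seg_id, uint64_t *offset)
{
    *seg_id = segment_table[va >> (V1_BITS - S1_BITS)];
    *offset = va & ((UINT64_C(1) << (V1_BITS - S1_BITS)) - 1);
    /* Two processes whose tables map different upper bits to the same
       seg_id share the same unfolded addresses, hence the same TLB
       entries. */
}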

(2) Let me just briefly mention object ID tagged TLBs, where entries are
tagged not with the process ID, but with an "objectID" that corresponds
to (a) a shared library id, or (b) a program text id, or (c) finally, the
process id for process-private, unshared data. I.e. instead of all TLB
entries having the same TLB tag, a process uses TLB entries with several
different TLB tags.
One implementation reads TLB entries, and compares the tag to a list of
"currently active" tags on the processor. A miss might constitute a TLB
miss, or possibly the list of currently active tags can itself be
considered to miss.
Another implementation uses "activate/deactivate" CAMs or scans.
It can be seen that the special handling of kernel TLB entries in
process ID tagged TLBs is just a runt form of this.
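
A sketch of that first implementation, with illustrative sizes and names:

#include <stdint.h>
#include <stddef.h>

#define ACTIVE_IDS 8

struct obj_tlb_entry {
    uint32_t vpn;
    uint32_t pfn;
    uint32_t object_id;   /* kernel, program text, a shared library, ... */
    uint8_t  valid;
};

/* An entry hits only if its object ID is in the processor's small list
   of currently active IDs. */
int obj_entry_hits(const struct obj_tlb_entry *e, uint32_t vpn,
                   const uint32_t active[ACTIVE_IDS])
{
    if (!e->valid || e->vpn != vpn)
        return 0;
    for (size_t i = 0; i < ACTIVE_IDS; i++)
        if (active[i] == e->object_id)
            return 1;   /* entry belongs to an active object */
    return 0;           /* treat as a TLB miss, or as an ID-activation trap */
}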

(3) Finally, there arises the possibility of snooped or coherent TLBs.
TLB entries can be snooped to remain consistent with the memory copy
of the page tables, and consistent with the current process.
AMD's "TLB Probe Filter" can be considered a step in this direction,
towards snoopy TLBs. Interestingly, while I was at Intel and at university
(both prior to Intel, and at Wisconsin after/between Intel stints) I
designed similar structures preceding AMD's announcement, and then I tried
to figure out what AMD had built based on the few scanty slides;
now that I have seen what AMD actually built, I can also see how
to do better.
If TLB miss costs matter, there could well be an arms race in this
area between AMD and Intel.

The nice thing about snoopy TLBs is that they do not require any
architectural changes, or OS changes. The bad thing is that, done
naively, they are quite expensive in hardware; potentially lots of
snoopers. Of course, you don't need to be naive; snoopers can be
shared between many different TLB entries, if there is any degree
of locality or non-sparseness in the virtual address space.

I feel reasonably confident in saying that snoopy TLBs are buildable
for conventional virtual memory architectures. I feel less confident
in saying that snoopy TLBs are buildable for IBM-style multilayer
virtual machine architectures; or, rather, they are buildable, but
virtualized page tables provide an extra combinatorial factor;
and, since the most common ways of dealing with page tables in
VMs involve the virtual machine host unmapping the virtual machine
guest's page tables, it is not clear that snoopy page tables need
to be extended to multilayer VMs. But they could be.

It is important to note that there are two issues here:
(1) snooping page table memory writes, so that the TLBs
can be consistent, whether instantaneously or delayed until
the next TLB "invalidate"
(2) tracking which TLB entries belong to which process
They are related.

While it would certainly be possible to have TLBs "instantaneously"
coherent with memory, it is not clear
(a) if that might not break some OSes
(b) if that might not be unnecessarily expensive
(b') if that might not prohibit some interesting implementations.

===

It's not clear if any of this is worthwhile, if TLB misses are cheap
- e.g. if they can be done speculatively, if they can use the cache, etc.

MIPS probably needed some help because of software TLB miss handling.
Although even this can be accelerated, e.g. on a multithreaded machine.

===

Anyway, bottom line:

Tagged TLBs are probably reasonable, since just about all implementations
described above have one form or another of TLB tags.

MIPS-style process ID tagged TLBs are probably a dead-end.

Folded virtual addresses or object ID tagged TLBs are probably better.

Snoopy TLBs are a bit more expensive, but not as expensive as the naive
think, and are architecturally invisible.


Rudi Chiarito

May 12, 2003, 3:59:24 PM
"Andy Glew" <andy-gle...@sbcglobal.net> writes:
> (1) As others have pointed out, the sort of "folded address space" (my name)
> that is in some of the Power chips, the IBM PA, and the Intel Itanium,
> give you something better than tagged TLBs.
> (2) Let me just briefly mention object ID tagged TLBs, where entries are
> tagged not with the process ID, but with an "objectID" that corresponds
> to (a) shared library id, or (b) program text id, or (c) finally, the
> process id for process private, unshared, data.

Intel Itanium and HP PA - I assume the above was a slip - actually
fall into category 2, too. The ID with which entries are tagged is
called "protection key" (IA64) or "protection ID" (PA). Itanium of
course has more ID registers (at least 16 vs. 4 or 8).

Even ARM has something similar and it's called "domains";
unfortunately there are only sixteen of them. That, though, was
probably adequate for Newton OS, which is more or less what the ARM
MMU was designed for.

--
Perfection is attained not when there is no longer anything to add,
but when there is no longer anything to take away (A. de Saint-Exupery)
Rudi Chiarito ru...@amiga.com

Jan C. Vorbrüggen

May 13, 2003, 4:01:39 AM
> > (1) As others have pointed out, the sort of "folded address space" [...]

> > that is in some of the Power chips, the IBM PA, and the Intel Itanium,
> > give you something better than tagged TLBs.
> > (2) Let me just briefly mention object ID tagged TLBs, where entries are
> > tagged not with the process ID, but with an "objectID" that corresponds
> > to (a) shared library id, or (b) program text id, or (c) finally, the
> > process id for process private, unshared, data.
>
> Intel Itanium and HP PA - I assume the above was a slip - actually
> fall into category 2, too. The ID with which entries are tagged is
> called "protection key" (IA64) or "protection ID" (PA). Itanium of
> course has more ID registers (at least 16 vs. 4 or 8).

Maybe Andy _was_ thinking of IBM, namely the recent description of what
is done by the PPC processors, which uses the mechanism of (1) to implement
(2) (by using OS conventions for deriving objectIDs from virtual addresses,
AIUI).

Jan

Anton Ertl

May 13, 2003, 4:07:55 AM
"Andy Glew" <andy-gle...@sbcglobal.net> writes:
>Tagged TLBs are one way of allowing TLB entries to persist across
>context switches. However, TLBs tagged with process IDs do not
>help shared libraries or data that are shared between some, but not
>all, processes.

They help them in just the same way as they help non-shared VMAs, i.e.,
by letting their TLB entries persist across context switches. They
may result in multiple TLB entries for the same object. But is this a
significant performance problem?

Trying to think of a typical scenario where persistence across context
switches helps significantly: There would be a lot of processes that
do little computation before activating the scheduler again (due to a
blocking system call, e.g., I/O or IPC), and there would be one or
several CPU-bound processes that would suffer from TLB misses after
each activation of a non-CPU-bound process if there were no ASIDs.

In such a scenario:

Is there significant sharing of active pages between the CPU-bound
processes and the non-CPU-bound processes? Probably not.

Is there sharing between the non-CPU-bound processes? There probably
would be, but if the CPU-bound processes need lots of TLB entries, they
will throw out the other TLB entries anyway, so this sharing cannot be
utilized. If the CPU-bound process does not need lots of TLB entries,
would utilizing the sharing with more sophisticated TLB tagging help
much?

Is there sharing between the CPU-bound processes? Maybe, but context
switches between them are rare enough that this is not a significant
issue.

>I'm tempted to say that process-ID TLB tagging is now known to be
>a dead end wrt instruction set architecture, because of these
>other alternatives:
> (0) Process ID tagged TLBs

Benefit: persistence across context switches

Cost: some changes to the OS

> (1) Folded Address Space

Benefit: sharing of TLB entries between objects

Cost: (In addition to OS changes) Various restrictions at the user
level if you want to make use of the benefit. E.g., AIX originally
allowed only 10 mmaps per process. I don't consider the benefits
worth such costs.

> (2) Object ID tagged TLBs

386 style segmentation? Or something like 2, but with more
flexibility?

The former requires lots of user-level changes.

> (3) Snoopy TLBs

Benefit: completely transparent to software.

Cost: Additional hardware complexity.

If the hardware cost can be made small enough, this looks like a
winner. Otherwise I think that (0) still has the best benefit/cost
ratio.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Jon Beniston

May 13, 2003, 4:44:35 PM

"Per Ekman" <p...@pdc.kth.se> wrote in message
news:mjeadds...@toby.pdc.kth.se...

CP15 register 13: Process ID, which is part of the Fast Context Switch
Extension, could be the one you're after.

Cheers,
JonB

> *p
>


Sander Vesik

May 13, 2003, 7:08:47 PM

Good point - if you have something that identifies processes at context
switch time then you can add "transparent" support for ASIDs with very
small to no actual changes to the OS. Basically, you have to inform the
processor when you switch to a task what its id is, and when an id is no
longer valid.
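
Something like this sketch - the register write and the ASID allocator
below are hypothetical stand-ins for whatever a given port actually has:

#include <stdint.h>
#include <stdio.h>

struct task { uint16_t asid; };          /* 0 = no ASID assigned yet */

static uint16_t next_asid = 1;

static void write_asid_reg(uint16_t asid)   /* stand-in for an MSR/CP write */
{
    printf("ASID register <- %u\n", asid);
}

static uint16_t allocate_asid(void)
{
    /* When the (small) ASID space wraps, a real allocator must flush the
       TLB and recycle IDs; omitted here. */
    return next_asid++;
}

void switch_address_space(struct task *next)
{
    if (next->asid == 0)
        next->asid = allocate_asid();
    write_asid_reg(next->asid);   /* old entries stay; they just stop matching */
}

int main(void)
{
    struct task a = {0}, b = {0};
    switch_address_space(&a);     /* note: no TLB flush anywhere on this path */
    switch_address_space(&b);
    switch_address_space(&a);
    return 0;
}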

>
> Cheers,
> JonB

Hans de Vries

May 13, 2003, 10:39:52 PM
"Andy Glew" <andy-gle...@sbcglobal.net> wrote in message news:<A5Qva.63$d97...@newssvr17.news.prodigy.com>...

> AMD's "TLB Probe Filter" can be considered a step in this direction,
> towards snoopy TLBs. Interestingly, while I was at Intel and at university
> (both prior to Intel, and at Wisconsin after/between Intel stints) I
> designed
> similar structures preceding AMD's announcement, and then I tried
> to figure out what AMD had built based on the few scanty slides;
> now that I have seen what AMD actually built, I can also see how
> to be better.
> If TLB miss costs matter, there could well be an arms race in this
> area between AMD and Intel.

For the curious

AMD Patent 6,510,508: Translation lookaside buffer flush filter

" A translation lookaside buffer (TLB) flush filter. In one embodiment,
a central processing unit includes a TLB for storing recent address
translations. A TLB flush filter monitors blocks of memory from which
address translations have been loaded and cached in the TLB. The TLB
flush filter is configured to detect if any of the underlying address
translations in memory have changed. If no changes have occurred, the
TLB flush filter may then prevent a flush of the TLB following the next
context switch. If changes have occurred to the underlying address
translations, the TLB flush filter may then allow a flush of the TLB
following a context switch. "
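
As a sketch of the idea in the patent text (the data structures and
granularity here are my guesses, not AMD's):

#include <stdint.h>

#define TRACKED_REGIONS 32

static uint64_t tracked_base[TRACKED_REGIONS]; /* page-table pages seen by the walker */
static int      tracked_count;
static int      tables_dirty;                  /* set when a store hits a tracked page */

void walker_loaded_from(uint64_t pt_page)      /* called on each table walk */
{
    if (tracked_count < TRACKED_REGIONS)
        tracked_base[tracked_count++] = pt_page;
    else
        tables_dirty = 1;                      /* filter full: be conservative */
}

void observe_store(uint64_t phys_page)         /* called for every snooped store */
{
    for (int i = 0; i < tracked_count; i++)
        if (tracked_base[i] == phys_page)
            tables_dirty = 1;                  /* an underlying translation changed */
}

int need_flush_on_context_switch(void)
{
    int flush = tables_dirty;
    if (flush) {
        tracked_count = 0;                     /* restart tracking after the flush */
        tables_dirty  = 0;
    }
    return flush;                              /* 0 = TLB contents may be kept */
}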

This looks a bit like a hyper-threading counter-strategy.
A "hyper-threading enabled" program, either hand-coded or compiler
generated, would run multiple threads in the same memory context.
C and Fortran source with explicit forking of short threads would
be slow when compiled for a non-SMT machine, with a lot of context
switch overhead.

Hyper-threading is supposed to be transparent in the sense that threads
may run on different logical or physical processors. After having a look
in the manual: threads are started via an IPI (Inter Processor Interrupt).
The interesting question would be if AMD could adapt the Hammer to be
x86 instruction compatible. This would for instance need a modified HALT
instruction, which is used by threads to finish.

Regards, Hans

Mitch Alsup

May 13, 2003, 11:25:14 PM
"Andy Glew" <andy-gle...@sbcglobal.net> wrote in message news:<A5Qva.63$d97...@newssvr17.news.prodigy.com>...
<snip>

>
> While it would certainly be possible to have TLBs "instantaneously"
> coherent with mmory, it is not clear
> (a) if that might not break some OSes
> (b) if that might not be unnecessarily expensive
> (b') if that might not prohibit some interesting implementations.
>
<snip>

I know of at least two BIOSs that will fail to boot with an instantaneous
TLB update.

But another great advantage would be the complete elimination of TLB
shootdown software. Need to remove access to a page in use everywhere?
Write the new entry into memory. By the time the store is performed
globally, all TLBs are coherent wrt the state of that page mapping.

Mitch

Andy Glew

May 14, 2003, 12:34:16 AM
"Rudi Chiarito" <ru...@amiga.com> wrote in message
news:m3wugw9...@amiga.com...

> "Andy Glew" <andy-gle...@sbcglobal.net> writes:
> > (1) As others have pointed out, the sort of "folded address space" (my name)
> > that is in some of the Power chips, the HP PA, and the Intel Itanium,
> > give you something better than tagged TLBs.
> > (2) Let me just briefly mention object ID tagged TLBs, where entries are
> > tagged not with the process ID, but with an "objectID" that corresponds
> > to (a) shared library id, or (b) program text id, or (c) finally, the
> > process id for process private, unshared, data.
>
> Intel Itanium and HP PA - I assume the above was a slip - actually
> fall into category 2, too. The ID with which entries are tagged is
> called "protection key" (IA64) or "protection ID" (PA). Itanium of
> course has more ID registers (at least 16 vs. 4 or 8).

Yes, I meant the HP PA.

The Itanium protection keys *are* an example of what I call object IDs.

This chip has *both* protection keys and folded address space,
in keeping with its kitchen-sink philosophy.


Andy Glew

May 14, 2003, 12:41:41 AM
> > [Me, Andy Glew]

> >Tagged TLBs are one way of allowing TLB entries to persist across
> >context switches. However, TLBs tagged with process IDs do not
> >help shared libraries or data that are shared between some, but not
> >all, processes.
>
> [Anton Ertl]

> They help them just in the same way as they help non-shared VMAs, i.e,
> by letting their TLB entries persist across context switches. They
> may result in multiple TLB entries for the same object. But is this a
> significant performance problem?
>
> Trying to think of a typical scenario where persistence across context
> switches helps significantly: There would be a lot of processes that
> do little computation before activating the scheduler again (due to a
> blocking system call, e.g., I/O or IPC), and there would be one or
> several CPU-bound processes who would suffer from TLB misses after
> each activation of a non-CPU-bound process if there were no ASIDs.

Consider a webserver type workload, except with the webserver using
separate processes instead of threads sharing process memory.
(E.g. fault isolation - if one thread goes wild, it brings them all down,
whereas processes are isolated. IBM zSeries' putting webservers
in different virtual machines is an extreme example of this.)

The webserver processes share code. They may even share page data
modulo fault containment.

If you are mmaping the data, TLB misses will happen.

If the data is XML DOM, TLB misses will happen.
(Of course, it should be SAX for efficiency...)


Andy Glew

May 14, 2003, 12:59:04 AM
> > (1) Folded Address Space
>
> Benefit: sharing of TLB entries between objects
>
> Cost: (In addition to OS changes) Various restrictions at the user
> level if you want to make use of the benefit. E.g., AIX originally
> allowed only 10 mmaps per process. I don't consider the benefits
> worth such costs.

Folded address spaces can be used exactly as process IDs.
But they can also be used for more advanced implementations.

Yes: I painfully remember AIX's limitations. Bad code: the
software abstraction was mapped 1:1 to the hardware facility,
instead of using the hardware facility to accelerate the common
case while providing generic code when the hardware capabilities
were exceeded.

I also agree that providing hardware facilities with small limits
tends to encourage such bad code. Personally, I don't like
folded address spaces, because they encourage such arbitrary
limits. But I have tried to understand them, and figure out
better ways.

Itanium only provides 8 regions. That's pretty bad.
Yet nevertheless, it is probably sufficient for most
of the "object groups" that might benefit from
TLB sharing:
1) the kernel
2) the process' program text
3) the most common shared libraries...

> > (2) Object ID tagged TLBs
>
> 386 style segmentation?

God, no!!!! (Although sometimes I think fondly of 2D memory...)

> Or something like 2, but with more flexibility?

Typically, the protection keys are stored with the page table entries in
memory; or, if using software TLB miss handling, it doesn't matter how
they get stored in memory; and the protection key is inserted into the
TLB along with the virtual address.

On IA64, the protection key for a translation is checked against
the protection key cache, which must be at least of size 16.
On a miss, a protection key miss trap to the OS is taken.
The OS can fault the access, or access a larger table of active
protection keys.

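As a sketch, with illustrative names and sizes:

#include <stdint.h>

#define KEY_REGS 16    /* IA-64 requires at least 16 key registers */

static uint32_t key_regs[KEY_REGS];

enum key_result { KEY_OK, KEY_MISS_TRAP };

/* The key stored with the translation is compared against the small
   on-chip set of key registers; a miss traps to the OS, which can refuse
   the access or install the key from its larger software table. */
enum key_result check_protection_key(uint32_t translation_key)
{
    for (int i = 0; i < KEY_REGS; i++)
        if (key_regs[i] == translation_key)
            return KEY_OK;
    return KEY_MISS_TRAP;   /* OS handler: fault, or evict a key register
                               and reload from its table of active keys */
}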

Sander Vesik

May 14, 2003, 10:12:11 AM

But it would probably get really hairy in cases like FreeBSD, where
hardware page tables are expendable, don't contain persistent information
and can be discarded and regenerated as needed.

>
> Mitch

Andy Glew

May 14, 2003, 11:30:41 AM
> > I know of at least two BIOSs that will fail to boot with an
> > instantaneous TLB update.

Of course, such BIOSes are also quite likely to break
on an aggressive out-of-order machine.

The "official" x86 rule is that any TLB entry,
at any time, can be thrashed out and refetched
from memory.


Iain McClatchie

May 14, 2003, 8:52:27 PM
> But another great advantage would be the complete elimination of TLB
> shootdown software.

Hooray! TLB shootdowns scale miserably on many CPUs, for the same
reasons that used to cripple cache coherency. It's nice to reuse the
same hardware keeping the caches coherent to keep the TLBs coherent.

(I proposed this back in '96 after seeing the work that some poor
O/S hacker was doing to make Irix do well on some multi-threaded
benchmark on a 32 CPU machine. The benchmark had bottlenecked on
a single thread through the TLB shootdown code, which ended up being
many threads passing a semaphore from one to the next.)

Andi Kleen

May 15, 2003, 3:47:22 AM
iai...@truecircuits.com (Iain McClatchie) writes:

>> But another great advantage would be the complete elimination of TLB
>> shootdown software.
>
> Hooray! TLB shootdowns scale miserably on many CPUs, for the same
> reasons that used to cripple cache coherency. It's nice to reuse the
> same hardware keeping the caches coherent to keep the TLBs coherent.

IA64 has hardware support for remote TLB shootdown (ptc.g - "global TLB
flush for a TLB coherence domain").

Unfortunately Intel in their wisdom specified that only one such
instruction can be outstanding for the whole machine, which makes it
require a global lock again. That's even worse than conventional
TLB flushes using interrupts, which can at least be implemented
multithreaded.

Also big machines are composed of multiple coherence domains, so they
can require IPIs again.

> (I proposed this back in '96 after seeing the work that some poor
> O/S hacker was doing to make Irix do well on some multi-threaded
> benchmark on a 32 CPU machine. The benchmark had bottlenecked on
> a single thread through the TLB shootdown code, which ended up being
> many threads passing a semaphore from one to the next.)

It is still a serious problem. E.g. one common issue is the swapper
running on one CPU, doing page aging and flushing TLBs for a process,
while that process runs on another CPU and keeps dirtying new pages.
Most of the time is then just spent processing TLB shootdown code
between the CPUs of the process and the swapper.

-Andi

Anton Ertl

May 15, 2003, 1:50:57 PM
"Andy Glew" <andy-gle...@sbcglobal.net> writes:
>> Cost: (In addition to OS changes) Various restrictions at the user
>> level if you want to make use of the benefit. E.g., AIX originally
>> allowed only 10 mmaps per process. I don't consider the benefits
>> worth such costs.
>
>Folded address spaces can be used exactly as process IDs.
>But they can also be used for more advanced imlementations.
>
>Yes: I painfully remember AIX's limitations. Bad code: the
>software abstraction was mapped 1:1 to the hardware facility,
>instead of using the hardware facility to accelerate the common
>case, but providing generic code when the hardware capabilities
>were exceeded.
>
>I also agree that providing hardware facilities with small limits
>tends to encourage such bad code.

Small limits encourage such bad code less than somewhat larger limits
(just by virtue of making the restrictions more painful).

But the "good code" is more complex than both the bad code and the
general code that does not utilize the hardware.

>Itanium only provides 8 regions. That's pretty bad.
>Yet nevertheless, it is probably sufficient for most
>of the "object groups" that might benefit from
>TLB sharing:
> 1) the kernel
> 2) the process' program text
> 3) the most common shared libraries...

For simple programs like the ones I write, yes:

cat /proc/28713/maps |grep r-x

reports 5 read-only mappings of files. But programs like Mozilla map
huge numbers of components (86 read-only mappings from files, 286
mappings overall in a mozilla process I just looked at), and with
application suites like KDE I think that the benefit of TLB sharing
only a few components is small even relative to the benefit of TLB
sharing all components (and I don't think that there is that much
benefit there). Is it worth the additional OS complexity on such
hardware?

Let's look at the example you gave, a web server; looking at one of
the Apache child processes at one of our servers, I see that 9 (out of
31) mappings are read-only file mappings. Ok, not so bad. But I
think the trend is going towards using more components in this area,
too.

Bernd Paysan

May 16, 2003, 4:28:21 AM
Anton Ertl wrote:
> reports 5 read-only mappings of files. But programs like Mozilla map
> huge numbers of components (86 read-only mappings from files, 286
> mappings overall in a mozilla process I just looked at), and with
> application suites like KDE I think that the benefit of TLB sharing
> only a few components is small even relative to the benefit of TLB
> sharing all components (and I don't think that there is that much
> benefit there). Is it worth the additional OS complexity on such
> hardware?

Application suites like KDE have lots of common components. They even use a
trick to decrease startup time: They have a kinit process, which loads all
the relevant libraries. To start a new KDE app, kinit forks and loads the
app as a library (the other libraries are already open and resolved),
instead of the app loading the libraries.
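
As a sketch of that trick (simplified and with made-up module names;
the real kdeinit does considerably more):

/* build with: cc kinit_sketch.c -ldl */
#include <dlfcn.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

int launch(const char *app_module)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;                        /* parent keeps serving requests */

    /* Child: the common libraries preloaded by the parent are already
       mapped; only the app-specific module needs to be brought in. */
    void *h = dlopen(app_module, RTLD_NOW);
    if (!h) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        _exit(1);
    }
    int (*kdemain)(int, char **) = (int (*)(int, char **))dlsym(h, "kdemain");
    if (!kdemain)
        _exit(1);
    char *argv[] = { (char *)app_module, NULL };
    _exit(kdemain(1, argv));
}

int main(void)
{
    /* The parent dlopen()s the common libraries once, up front. */
    dlopen("libqt.so", RTLD_NOW | RTLD_GLOBAL);   /* illustrative name */
    launch("libkonsole_app.so");                  /* illustrative name */
    return 0;
}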

I like POWER's or PA-RISC's "segments", which have several root page tables
depending on the highest address bits. This allows the OS to map the text
segment of all shared libraries into one shared page table.

> Let's look at the example you gave, a web server; looking at one of
> the Apache child processes at one of our servers, I see that 9 (out of
> 31) mappings are read-only file mappings. Ok, not so bad. But I
> think the trend is going towards using more components in this area,
> too.

Probably your Apache has turned most modules off.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
