Virtual address Memory Protection Unit

Paul A. Clayton

Jun 18, 2012, 4:22:24 PM

Given the desirability of including smaller, more energy-efficient
cores on the same chip with higher performance cores, it seemed
that there might be some place for extremely simple cores
(comparable to ARM Cortex M series) that use an MPU rather than a
TLB and that can use a beefy core's outer TLB to (optionally)
translate the virtual addresses used by the MPU.

Obviously this would have the issues of a virtually tagged cache,
but it might allow such a core to be used more broadly and more
easily than if the MPU used physical addresses. (Something like
the Stanford MIPS method of including the ASID in the virtual
address--using a variable length mask--might substantially reduce
the need for extra ASID bits. With a backing translation layer,
permissions could be applied to somewhat coarse-grained regions
as long as virtual addresses within a region that do not have
the same permissions are invalid.)

It might be worthwhile to support automated fill of the MPU from
the shared TLB.
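
A minimal sketch of the idea, in Python purely for illustration: the small core checks a virtual address against coarse MPU regions, and only afterwards is the address (optionally) translated through the larger core's TLB. The names `MpuRegion`, `mpu_check`, `translate` and `small_core_access`, the 4 KB page size, and the dictionary standing in for the shared TLB are all assumptions, not any real hardware interface. In a real design the translation would presumably happen only on a miss in the virtually tagged cache; that detail is left out here.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12  # assumed 4 KB pages in the shared TLB


@dataclass
class MpuRegion:
    base: int   # virtual base address of the region
    size: int   # region length in bytes
    perms: str  # permitted access kinds, e.g. "r", "rw", "rx"


def mpu_check(regions, vaddr, access):
    """Return True if some MPU region covers vaddr and permits the access."""
    return any(r.base <= vaddr < r.base + r.size and access in r.perms
               for r in regions)


def translate(shared_tlb, vaddr):
    """Translate via the big core's TLB (a dict vpn -> ppn); None if unmapped."""
    ppn = shared_tlb.get(vaddr >> PAGE_SHIFT)
    if ppn is None:
        return None
    return (ppn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))


def small_core_access(regions, shared_tlb, vaddr, access):
    """Permission check in the MPU first, translation only afterwards."""
    if not mpu_check(regions, vaddr, access):
        return "mpu-fault"
    paddr = translate(shared_tlb, vaddr)
    return "page-invalid" if paddr is None else paddr
```

With one 16 KB read/write region of which only some pages are mapped, an access to an unmapped page passes the MPU check but is rejected by the translation layer, which is what would let the MPU regions stay coarse.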

The main motivation behind this little idea was the lack of
use by Linux of the M-series cores on some OMAP chips. The fact
that the next OMAP will use software compatible small cores
(A-7 small cores with A-15 large cores, IIRC) seems to imply that
there is not much point dealing with the hassles of a virtually
tagged cache just to save a tiny bit of power, but there might be
some (less common) use for the idea (or this sharing might stir
someone's creativity).

MitchAlsup

Jun 21, 2012, 7:22:20 PM

We found that as you scale back the computation power (i.e. complexity) of the core, you can dramatically scale back the size of the TLB. So you might need a 64-entry FA TLB backed up by a 512-entry 4-way set-associative TLB for a great big OoO core; you can get away with a 32-entry FA TLB for a core of 1/2 the compute power of its GBOoO brethren.

Other smaller earlier cores point to a serious disadvantage if/when the TLB gets smaller than 24 entries (in any configuration).

Mitch

Terje Mathisen

Jun 22, 2012, 3:22:35 AM

Is it even possible to have good performance with a TLB which is too
small to cover the cache hierarchy?

I.e. I remember fondly (NOT!) the Pentium model that had 512 KB L2 cache
but only 64 entries for 4 KB pages, i.e. maximum 256 KB of coverage.
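
The coverage arithmetic can be checked with a few lines of Python. This is only a sketch; `tlb_reach_bytes` is a hypothetical helper name, and the figures are the ones quoted above.

```python
KB = 1024


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered by a TLB when every entry maps a distinct page."""
    return entries * page_bytes


reach = tlb_reach_bytes(64, 4 * KB)  # 64 entries of 4 KB pages
l2_size = 512 * KB                   # the L2 size mentioned above

print(f"TLB reach: {reach // KB} KB, L2 cache: {l2_size // KB} KB")
print(f"Fraction of L2 covered by the TLB: {reach / l2_size:.0%}")
```

The TLB reach works out to half of the L2 capacity, so a working set that still fits in L2 can already be missing in the TLB.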

We got very distinct performance knees when the working set passed the
L1 size, the TLB size and the L2 size.

Can you reload TLB entries fast enough (from cache?) that this problem
is ameliorated?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

nm...@cam.ac.uk

Jun 22, 2012, 4:28:46 AM

In article <s2jdb9-...@ntp6.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>MitchAlsup wrote:
>
>> We found that as you scale back the computation power (i.e.
>> complexity) of the core, you can dramatically scale back the size of
>> the TLB. So you might need a 64-entry FA TLB backed up by a 512-entry
>> 4-way set-associative TLB for a great big OoO core; you can get away
>> with a 32-entry FA TLB for a core of 1/2 the compute power of its
>> GBOoO brethren.
>>
>> Other smaller earlier cores point to a serious disadvantage if/when
>> the TLB gets smaller than 24 entries (in any configuration).
>
>Is it even possible to have good performance with a TLB which is too
>small to cover the cache hierarchy?
>
>I.e. I remember fondly (NOT!) the Pentium model that had 512 KB L2 cache
>but only 64 entries for 4 KB pages, i.e. maximum 256 KB of coverage.
>
>We got very distinct performance knees when the working set passed the
>L1 size, the TLB size and the L2 size.
>
>Can you reload TLB entries fast enough (from cache?) that this problem
>is ameliorated?

It's been done, but at horrendous cost - and I am talking about not
being able to provide other optimisations, and imposing undesirable
constraints on software, not direct cost.

I remain amazed at the fiendish contortions that people are prepared
to go through just to avoid admitting that the world of computing has
changed in the past 40 years, when the solution to this is so simple.
Simply abandoning the whole demand paging / TLB model in favour of
segment authorities (not x86 segments, of course) eliminates the
whole horrible mess for essentially no downside.

But, of course, I am totally heretical in regarding the need to
replace completely misdesigned, misimplemented and just plain broken
software as not a downside :-)

Regards,
Nick Maclaren.

Robert A Duff

Jun 22, 2012, 6:07:11 PM

nm...@cam.ac.uk writes:

> Simply abandoning the whole demand paging / TLB model in favour of
> segment authorities (not x86 segments, of course) eliminates the
> whole horrible mess for essentially no downside.

Could you please explain what you mean by "segment authorities"?
Putting it in google does not enlighten me.

I'm pretty sure I agree with "not x86 segments, of course".

> But, of course, I am totally heretical in regarding the need to
> replace completely misdesigned, misimplemented and just plain broken
> software as not a downside :-)

;-)

- Bob

nm...@cam.ac.uk

Jun 23, 2012, 12:21:26 AM

In article <wcctxy3...@shell01.TheWorld.com>,

Robert A Duff <bob...@shell01.TheWorld.com> wrote:
>
>> Simply abandoning the whole demand paging / TLB model in favour of
>> segment authorities (not x86 segments, of course) eliminates the
>> whole horrible mess for essentially no downside.
>
>Could you please explain what you mean by "segment authorities"?
>Putting it in google does not enlighten me.

Essentially TLBs, but referring to Unix-like segments, with
start addresses and lengths; they have had many names over
the years. So a whole segment would be contiguous.
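
One way to read this, as a rough Python sketch: each entry holds a start address and a length rather than a page number, and translation is a bounds check plus an offset. The `Segment` fields, the permission strings, and the choice of exceptions are illustrative assumptions, not a description of any particular machine.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    vbase: int   # virtual start address
    length: int  # segment length in bytes
    pbase: int   # physical start address (the whole segment is contiguous)
    perms: str   # permitted access kinds, e.g. "rw"


def seg_translate(segments, vaddr, access):
    """Translate vaddr by finding the segment whose range contains it."""
    for seg in segments:
        offset = vaddr - seg.vbase
        if 0 <= offset < seg.length:
            if access not in seg.perms:
                raise PermissionError(f"{access!r} not allowed at {vaddr:#x}")
            return seg.pbase + offset
    raise LookupError(f"no segment covers {vaddr:#x}")
```

A single entry can then cover an arbitrarily large contiguous segment, where a page TLB would need one entry per page.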

Regards,
Nick Maclaren.

MitchAlsup

Jun 23, 2012, 12:12:20 PM

On Friday, June 22, 2012 2:22:35 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > We found that as you scale back the computation power (i.e.
> > complexity) of the core, you can dramatically scale back the size of
> > the TLB. So you might need a 64-entry FA TLB backed up by a 512-entry
> > 4-way set-associative TLB for a great big OoO core; you can get away
> > with a 32-entry FA TLB for a core of 1/2 the compute power of its
> > GBOoO brethren.
> >
> > Other smaller earlier cores point to a serious disadvantage if/when
> > the TLB gets smaller than 24 entries (in any configuration).
>
> Is it even possible to have good performance with a TLB which is too
> small to cover the cache hierarchy?

In my opinion, yes. What it requires, however, is that the TWA (tablewalk accelerator) does not get scaled back like the TLB gets scaled back. The TWA contains the page directory cache, and some other data to short-circuit the table walk process. Since the TWA is not always active, it does not add much to the power burden.

If you do scale back the TWA, bad things happen as Nick mentions.
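
To illustrate why a page directory cache shortens the walk, here is a toy Python model. It assumes a two-level table with a 10/10/12 bit split and counts memory references per walk; the `Walker` class and its fields are hypothetical, and a real tablewalk accelerator holds more state than this.

```python
PAGE_SHIFT = 12   # assumed 4 KB pages
LEVEL_BITS = 10   # assumed 10 index bits per level


class Walker:
    def __init__(self, directory):
        self.directory = directory  # dir index -> {table index -> ppn}
        self.pde_cache = {}         # stands in for the page directory cache
        self.mem_refs = 0           # memory references spent on table walks

    def walk(self, vaddr):
        """Return the physical page number for vaddr after a TLB miss."""
        dir_idx = vaddr >> (PAGE_SHIFT + LEVEL_BITS)
        tbl_idx = (vaddr >> PAGE_SHIFT) & ((1 << LEVEL_BITS) - 1)
        table = self.pde_cache.get(dir_idx)
        if table is None:
            self.mem_refs += 1      # fetch the directory entry from memory
            table = self.directory[dir_idx]
            self.pde_cache[dir_idx] = table
        self.mem_refs += 1          # fetch the page table entry itself
        return table[tbl_idx]
```

In this model the first walk into a 4 MB area costs two references and every later walk into the same area costs one, even when the TLB itself is tiny.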

Mitch

Michael S

Jun 23, 2012, 5:58:35 PM

On Jun 23, 7:21 am, n...@cam.ac.uk wrote:
> In article <wcctxy3f1bk....@shell01.TheWorld.com>,

> Robert A Duff <bobd...@shell01.TheWorld.com> wrote:
>
>
>
> >> Simply abandoning the whole demand paging / TLB model in favour of
> >> segment authorities (not x86 segments, of course) eliminates the
> >> whole horrible mess for essentially no downside.
>
> >Could you please explain what you mean by "segment authorities"?
> >Putting it in google does not enlighten me.
>
> Essentially TLBs, but referring to Unix-like segments, with
> start addresses and lengths; they have had many names over
> the years. So a whole segment would be contiguous.
>
> Regards,
> Nick Maclaren.

Which granularity?
Are virtual address ranges of different segments allowed to overlap?
How do you propose to implement the analog of a "hardware page walker",
or do you think that a pure software-managed TLB is sufficient?
Do you realize that range checks are significantly more expensive than
equality checks, and that this means your "segment TLB" will have
significantly fewer entries (10x fewer? what do the HW experts think?)
than page TLBs today? Or maybe you are willing to give up full
associativity?

Of course, all those questions are relatively insignificant. More
significant is that OS designers love to manage fixed-size pages and hate
to manage variable-sized segments. So your idea is doomed for this
single reason alone, regardless of any other difficulties.

Ivan Godard

Jun 24, 2012, 7:40:28 PM

On 6/22/2012 1:28 AM, nm...@cam.ac.uk wrote:
<snip>

>
> I remain amazed at the fiendish contortions that people are prepared
> to go through just to avoid admitting that the world of computing has
> changed in the past 40 years, when the solution to this is so simple.
> Simply abandoning the whole demand paging / TLB model in favour of
> segment authorities (not x86 segments, of course) eliminates the
> whole horrible mess for essentially no downside.
>
> But, of course, I am totally heretical in regarding the need to
> replace completely misdesigned, misimplemented and just plain broken
> software as not a downside :-)
>
>
> Regards,
> Nick Maclaren.
>

Why not go all the way to caps?

Ivan

nm...@cam.ac.uk

Jun 25, 2012, 8:15:47 AM

In article <js88hb$p6e$2...@speranza.aioe.org>,

Ivan Godard <igo...@pacbell.net> wrote:
>
>> I remain amazed at the fiendish contortions that people are prepared
>> to go through just to avoid admitting that the world of computing has
>> changed in the past 40 years, when the solution to this is so simple.
>> Simply abandoning the whole demand paging / TLB model in favour of
>> segment authorities (not x86 segments, of course) eliminates the
>> whole horrible mess for essentially no downside.
>>
>> But, of course, I am totally heretical in regarding the need to
>> replace completely misdesigned, misimplemented and just plain broken
>> software as not a downside :-)
>

>Why not go all the way to caps?

Well, yes, but what I am proposing doesn't actually impact very
much, and could be used to run almost all existing software;
only the most foully misdesigned would need rewriting. The
trouble with capabilities is that they are incompatible with
the POSIX model, and therefore force major changes. Yes, I agree
that the POSIX model is broken, but abandoning it is a major
upheaval.

Regards,
Nick Maclaren.

Stefan Monnier

Jun 25, 2012, 11:48:10 AM

> But, of course, I am totally heretical in regarding the need to
> replace completely misdesigned, misimplemented and just plain broken
> software as not a downside :-)

Another simple approach (compared to segments) is a virtually-addressed
cache, with a pagetable in virtual memory.
Basically "physical=pagetable[virtual>>N]" so you don't need a TLB at
all (the page table is cached instead). You also save power because you
only need to do address translation upon cache misses.
Of course, it's not very friendly to POSIX OSes, so I'm not holding my breath.
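
A toy Python model of that scheme, under several assumptions: a linear page table at a hypothetical virtual address `PT_BASE`, word-granular "cache lines", and a small pinned `root` mapping for the table's own pages so that the lookup terminates. `VCache` and all its fields are invented for illustration.

```python
N = 12                  # page shift, as in physical = pagetable[virtual >> N]
PTE_BYTES = 8           # assumed size of one page table entry
PT_BASE = 0x4000_0000   # hypothetical virtual address of the linear page table


class VCache:
    def __init__(self, phys_mem, root):
        self.phys = phys_mem    # physical address -> value
        self.root = root        # pinned vpn -> ppn for the page table's own pages
        self.lines = {}         # virtually tagged cache: vaddr -> value
        self.translations = 0   # how often an address had to be translated

    def load(self, vaddr):
        if vaddr in self.lines:           # hit: no translation needed
            return self.lines[vaddr]
        value = self.phys[self.translate(vaddr)]
        self.lines[vaddr] = value
        return value

    def translate(self, vaddr):
        self.translations += 1
        vpn = vaddr >> N
        if vpn in self.root:              # pinned mapping ends the recursion
            ppn = self.root[vpn]
        else:                             # the PTE is itself loaded via the cache
            ppn = self.load(PT_BASE + vpn * PTE_BYTES)
        return (ppn << N) | (vaddr & ((1 << N) - 1))
```

Hits never translate, and a miss to an already-touched page finds its page table entry in the cache. The pinned `root` is the part the one-line formula leaves open: something outside the table has to map the table itself.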

Stefan

MitchAlsup

Jun 25, 2012, 12:24:40 PM

On Monday, June 25, 2012 7:15:47 AM UTC-5, nm...@cam.ac.uk wrote:
> In article <js88hb$p6e$2...@speranza.aioe.org>,
> Ivan Godard <igo...@pacbell.net> wrote:
> >
> >> I remain amazed at the fiendish contortions that people are prepared
> >> to go through just to avoid admitting that the world of computing has
> >> changed in the past 40 years, when the solution to this is so simple.
> >> Simply abandoning the whole demand paging / TLB model in favour of
> >> segment authorities (not x86 segments, of course) eliminates the
> >> whole horrible mess for essentially no downside.
> >>
> >> But, of course, I am totally heretical in regarding the need to
> >> replace completely misdesigned, misimplemented and just plain broken
> >> software as not a downside :-)
> >
> >Why not go all the way to caps?
>
> Well, yes, but what I am proposing doesn't actually impact very
> much, and could be used to run almost all existing software;
> only the most foully misdesigned would need rewriting. The
> trouble with capabilities is that they are incompatible with
> the POSIX model, and therefore force major changes.

Another basic problem with capabilities is that it has been proven
that they cannot provide the kind of security that they postulated
they could provide. {Aside from making all current applications
worthless.} Heck, even AS400 dropped the capabilities from System 38.

Mitch

MitchAlsup

Jun 25, 2012, 12:26:28 PM

On Monday, June 25, 2012 10:48:10 AM UTC-5, Stefan Monnier wrote:
> > But, of course, I am totally heretical in regarding the need to
> > replace completely misdesigned, misimplemented and just plain broken
> > software as not a downside :-)
>
> Another simple approach (compared to segments) is a virtually-addressed
> cache, with a pagetable in virtual memory.

Oh, please no.

Everyone who ever designed a chip with one of these is firmly in the camp of never wanting to do that again; FIRMLY.

Mitch

Anne & Lynn Wheeler

Jun 25, 2012, 1:58:02 PM

MitchAlsup <Mitch...@aol.com> writes:
> Another basic problem with capabilities is that it has been proven
> that they cannot provide the kind of security that they postulated
> they could provide. {Aside from making all current applications
> worthless.} Heck, even AS400 dropped the capabilities from System 38.

folklore is that S/38 is simplified flavor of Future System effort. FS
included self-describing data ... in the hardware ... every access
conceivably could result in five levels of storage access re-direction.
One of the nails in the FS coffin was that if a FS machine was made out
of the (then) currently fastest hardware (370/195), the throughput could
be that of a 370/145 (between 10:1 and 30:1 slow-down; specifically
compared was Eastern airlines System/one ACP/TPF reservation system that
ran on 370/195). S/38 was selling in low-end market and could get
hardware fast enough and it wasn't as cost sensitive. misc. past posts
mentioning FS
http://www.garlic.com/~lynn/submain.html#futuresys

I've frequently claimed that John's creation of 801/risc was heavily
oriented to going to the exact opposite of what was happening in the FS
effort.
http://www.garlic.com/~lynn/subtopic.html#801

circa 1980, there was an effort to move large variety of internal
microprocessors to 801/risc (as/400 follow-on to s/38, low&mid-range
370s, variety of "controller" microprocessors, etc). For variety of
reasons the efforts were all aborted and new generation of different
CISC processors were created (including for the as/400).

with the power/pc in the 90s, rochester finally did move as/400 from
cisc to 801/risc processor (although there were big arguments about
adding a "65th bit" to 64bit addressing)

recent post mentioning TYMSHARE developed GNOSIS, a capability
operating system that ran on 370 hardware
http://www.garlic.com/~lynn/2012i.html#40 GNOSIS & KeyKOS

when M/D bought TYMSHARE in the 80s, GNOSIS was spun-off as KeyKOS. As
part of the spin-off, I was brought in to audit GNOSIS. One of the
GNOSIS issues was it was targeted at allowing 3rd party vendors to
develop applications and services delivered on TYMSHARE service bureau
platform. There was enormous pathlength overhead in GNOSIS related to
accounting & charges.

With the spinoff, the accounting/charge pathlength was removed ... and
benchmarking claims in the late 80s & early 90s showed KeyKOS
outperforming the 370 high-performance ACP/TPF transaction system. Some
of the claims, was that with the high-level capability abstraction, a
lot of optimization could be done that wasn't possible in the ACP/TPF
transaction model.

GNOSIS
http://en.wikipedia.org/wiki/GNOSIS
KeyKOS
http://en.wikipedia.org/wiki/KeyKOS
EROS
http://en.wikipedia.org/wiki/EROS_%28microkernel%29
CapROS
http://en.wikipedia.org/wiki/CapROS
Coyotos
http://en.wikipedia.org/wiki/Coyotos

TPF
http://en.wikipedia.org/wiki/Transaction_Processing_Facility

--
virtualization experience starting Jan1968, online at home since Mar1970

nm...@cam.ac.uk

Jun 25, 2012, 3:45:24 PM

In article <ce5896c0-6cb1-4af9...@googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>
>Another basic problem with capabilities is that it has been proven
>that they cannot provide the kind of security that they postulated
>they could provide. {Aside from making all current applications
>worthless.} Heck, even AS400 dropped the capabilities from System 38.

Have they? I doubt it. I have used one that had an open challenge
to all users to break in, and nobody succeeded.

Regards,
Nick Maclaren.

Paul A. Clayton

Jun 25, 2012, 5:35:45 PM

MPUs seem to have eight to 16 entries. Admittedly, MPUs
are intended for embedded processors and each entry can
cover a large memory space (some use power-of-two sized
regions, some use a base and a bound).
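
The two region styles differ in how an address is matched. A small Python sketch, with hypothetical function names and made-up addresses:

```python
def pow2_region_hit(addr, base, size):
    """Match a power-of-two sized, size-aligned region with a mask and compare."""
    if size & (size - 1) != 0 or base % size != 0:
        raise ValueError("size must be a power of two and base size-aligned")
    return (addr & ~(size - 1)) == base


def base_bound_hit(addr, base, bound):
    """Match an arbitrary region [base, bound) with two magnitude comparisons."""
    return base <= addr < bound
```

The power-of-two form needs only an equality compare on the upper address bits, while base and bound needs two magnitude compares per entry but can describe a region of any length.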

(The ARM Cortex-M series seem to be wimpier than the
old MIPS R2000. E.g., two [Cortex-M0+] or three stage
pipeline.)

The thought presented would also allow pages within an
area guarded by an MPU entry to be invalid, so fewer
larger permission regions could be practical.

(Effectively, such a mechanism wraps the entire core
[including caches and MPU] in a virtualization layer
where the core's physical addresses become virtual
addresses. It was thought that this might simplify the
use of such extremely wimpy cores.)

Stefan Monnier

Jun 25, 2012, 7:17:53 PM

>> > But, of course, I am totally heretical in regarding the need to
>> > replace completely misdesigned, misimplemented and just plain broken
>> > software as not a downside :-)
>> Another simple approach (compared to segments) is a virtually-addressed
>> cache, with a pagetable in virtual memory.
> Oh, please no.
> Everyone who ever designed a chip with one of these is firmly in the camp of
> never wanting to do that again; FIRMLY.

Interesting. Why not?

Stefan

MitchAlsup

Jun 25, 2012, 7:41:23 PM

Huge number of headaches caused by loose ends.

Mitch

van...@vsta.org

Jun 25, 2012, 8:13:18 PM

MitchAlsup <Mitch...@aol.com> wrote:
> Huge number of headaches caused by loose ends.

I never worked on the Vax VM code for BSD, but I remember the folks who did
were *not* helped by the Vax's design decision that VM data structures use
the VM system. The circularity created a lot of special cases. Hopefully
one of'em is around here.

--
Andy Valencia
Home page: http://www.vsta.org/andy/
To contact me: http://www.vsta.org/contact/andy.html

Stefan Monnier

Jun 25, 2012, 9:13:08 PM

>> > Everyone who ever designed a chip with one of these is firmly in
>> > the camp of never wanting to do that again; FIRMLY.
>> Interesting. Why not?
> Huge number of headaches caused by loose ends.

I'm curious about what those can be,

Stefan

jacko

Jun 26, 2012, 7:38:28 PM

Consider the problem of where to locate the base table that the boot sequence code needs to refer to in order to start.