
Re: Loongson 3A5000?


Thomas Koenig

Apr 22, 2021, 1:01:42 PM
(x-post, f-up).

gareth evans <headst...@yahoo.com> schrieb:
> AIUI this is a new ISA that does not tread on the
> patents of ARM, X86 and MIPS.
>
> I've googled unsuccessfully so any pointers
> to this ISA?
>
> (As always, real computer scientists program
> in assembler :-) )

Maybe this question would be better answered in comp.arch?

EricP

Apr 22, 2021, 1:20:31 PM
Loongson is a licensed MIPS32/MIPS64 clone.

https://en.wikipedia.org/wiki/Loongson


George Neuner

Apr 22, 2021, 2:19:23 PM
On Thu, 22 Apr 2021 13:20:03 -0400, EricP
<ThatWould...@thevillage.com> wrote:


>Loongson is a licensed MIPS32/MIPS64 clone.
>
>https://en.wikipedia.org/wiki/Loongson



It is reported that they have created a new ISA. They have called the
1st chip to use it "MIPS64 compatible", but it sounds like the long
term plan is to move away from MIPS.

https://www.tomshardware.com/news/loongson-technology-develops-its-own-cpu-instruction-set-architecture
https://technode.com/2021/04/21/silicon-loongson-promises-self-reliance-with-new-architecture/

MitchAlsup

Apr 22, 2021, 2:23:38 PM
I have a FREE ISA I developed that does not tread on any known
(still valid) IP from other companies. All you have to do is ask.

John Dallman

Apr 22, 2021, 4:04:15 PM
In article <s5sa5k$b8l$1...@newsreader4.netcologne.de>,
https://mp.weixin.qq.com/s/a1M0OVzVyKKQoiG8i_0P0w is the announcement,
but I need to use Google Translate, being unable to read Chinese.

It stresses the importance of a software ecosystem. That is indeed very
important, but the MIPS64-compatible Loongson has had over a decade to
acquire dominance of the MIPS64 ecosystem for desktop and server
computing, and that has not happened. Nobody else is doing much with MIPS
except in the embedded field, so adding enough to the ecosystem to become
dominant was perfectly possible.

Abandoning it and creating a new ISA may work, but it is not guaranteed.
The Loongson doesn't seem to have had much market success in China, and
the Chinese government will be taking economic risks if it tries to force
businesses to adopt this new ISA (and a corresponding OS) in place of
Wintel.

Looking more carefully:

"In 2020, Dragon Core SMIC launched LoongArch architecture based
on two decades of CPU development and ecological construction,
including infrastructure and vector instructions, virtualization,
binary translation and other extended parts, nearly 2,000
instructions. The Dragon Core architecture does not contain the
MIPS instruction system."

"Dragon core architecture abandons some of the old contents of
the traditional instruction system that do not adapt to the
current development trend of hardware and software design
technology, and absorbs many advanced technological development
achievements in the field of instruction system design in recent
years. Compared with the original compatible instruction system,
not only is it easier to design high performance and low power
consumption in hardware, but also easier to compile optimization
and develop operating system and virtual machine in software."

Those quotes make it sound as if they've replaced the basic MIPS64
instruction set with something new, while retaining their various
extensions. Those included instructions to make x86 emulation easier,
which they would, under this hypothesis, have retained. However, I may
well be reading too much into a translation.

If they have an easy translation of MIPS64 into Dragon Core, that could
account for the statements about the 3A5000 supporting Loongson, or
Dragon Core, or it could be that it can actually run both. They'd need to
drop the Loongson capability, or make sure it was slower than Dragon Core,
to achieve a separate ecosystem, of course.

Various bits: SMIC is presumably
<https://en.wikipedia.org/wiki/Semiconductor_Manufacturing_International_Corporation>

"Dragon Core SMIC Technology Co., Ltd., is committed to Dragon
Core series CPU design, production, sales and service. Dragon
Core's main products include the "Dragon Core 1" small CPU for
industry applications, the "Dragon Core 2" CPU for industrial
and terminal applications, and the "Dragon Core 3" CPU for desktop
and server applications."

The replacement for Loongson is presumably Dragon Core 3.

<https://www.cnx-software.com/2021/04/17/loongson-loongarch-cpu-instruction-set-architecture/>
has more, as does <https://www.eet-china.com/kj/43a63100.html>

"LoongArch on the MIPS instruction translation efficiency is 100%
performance, arm instruction translation efficiency is 90%
performance, x86 Linux translation efficiency is 80% performance."

Hum, looks like the idea about the new architecture having a replacement
for MIPS that's quite similar might be correct. Here's some unofficial
documentation:
<https://github.com/loongson-community/docs/tree/master/unofficial/loongarch>
Can someone who knows MIPS say how similar they are?

John

gareth evans

Apr 22, 2021, 4:26:58 PM
I had already found that article, but it does not cover
the 3A5000, which comes after the MIPS64 products.

Theo Markettos

Apr 23, 2021, 4:42:44 PM
John Dallman <j...@cix.co.uk> wrote:
> Hum, looks like the idea about the new architecture having a replacement
> for MIPS that's quite similar might be correct. Here's some unofficial
> documentation:
> <https://github.com/loongson-community/docs/tree/master/unofficial/loongarch>
> Can someone who knows MIPS say how similar they are?

I'm not a MIPS expert but it looks like a superset of MIPS. There are some
extra instructions that look like additions or variations of existing MIPS
instructions, but I would guess that MIPS userland code could continue to
run. That might explain why they say 100% of MIPS performance (which is
poor in some circumstances, presumably hence the extra instructions).

There's no information on the privileged mode, which could be different
while still allowing MIPS userland code to run.

Theo

EricP

Apr 23, 2021, 6:29:35 PM
I don't know MIPS so can't comment.
I see references to 3A5000 implementing LoongISA v2.0.

LoongISA, also called LoongArch, is their set of extensions to the MIPS ISA
which allows it to run binary-translated software from many sources:
MIPS, x86, ARM. Version 1.0 seems to have been developed around 2015
and implemented on the Loongson-3A1500 four-core CPU.
Loongson chips also go by the name "Godson".
The recent microarchitecture names are GS464E and GS464V.

As they refer to LoongISA v2.0 I guess the 3A5000 is more of the above.
Since MIPS closed shop and the US put technology export restrictions
on China, they probably want their own in-house kit.
They also say they have removed any instructions covered by external patents.

There is a 2015 open-access paper in Chinese:

LoongISA for compatibility with mainstream instruction set architecture, 2015
https://www.sciengine.com/publisher/scp/journal/SSI/45/4/10.1360/N112014-00300?slug=fulltext

which does have an English abstract:

"This paper introduces the Loongson instruction set architecture (LoongISA),
which extends the MIPS instruction set architecture for compatibility with
X86 and ARM mainstream instruction set architectures.
New instructions, runtime environments, and system states are added to
MIPS through MIPS UDI (User Defined Interface) to accelerate the binary
translation of X86 and ARM binary codes to LoongISA binary code. In addition,
binary translation systems have been built based on LoongISA to run
MS-Windows and its applications, X86 Linux applications, and ARM Android
applications. LoongISA is implemented in the Loongson-3A1500 four-core
CPU product of Loongson Technology Corporation Limited.
Performance evaluations using the Loongson-3A1500 FPGA verification
platform show that with hardware support, the binary translation system
of Loongson 3A1500 can achieve very high efficiency."

It has some examples in English.

The LoongISA v1.0 MIPS instruction set extensions seem to deal with
things like x86 flags, which MIPS doesn't have, so they added
new instructions to generate the arithmetic flags.

MIPS uses software translated TLB which I gather behaves differently
than x86, so they added a hardware TLB that behaves like x86's.
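
To make that concrete, here is roughly what a translator has to
synthesize in software when the host has no flags register -- the
ZF/SF/CF/OF results of a 32-bit x86 ADD (a C sketch for illustration
only, not Loongson's actual semantics). One hardware flag-generating
instruction replaces this whole sequence:

   #include <stdint.h>

   typedef struct { int zf, sf, cf, of; } x86_flags;

   /* Compute the x86 arithmetic flags of a 32-bit ADD in plain code. */
   static x86_flags flags_of_add32(uint32_t a, uint32_t b)
   {
       uint32_t r = a + b;
       x86_flags f;
       f.zf = (r == 0);                          /* zero flag */
       f.sf = (r >> 31) & 1;                     /* sign flag */
       f.cf = (r < a);                           /* unsigned carry out */
       /* signed overflow: inputs same sign, result sign differs */
       f.of = ((~(a ^ b) & (a ^ r)) >> 31) & 1;
       return f;
   }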



MitchAlsup

Apr 23, 2021, 8:40:00 PM
MIPS uses a software-reloaded TLB; once loaded, the TLB performs translations
in HW while the cache is being accessed.

From about the time of R3000::

A TLB miss raises an exception. The exception handler reads a control register
which has a hash of the virtual address (and other info) and then accesses a
table in memory, checks that the entry matches the virtual address and task ID,
and if so installs the entry and returns from the exception. A miss of the memory table
has a software routine walk the page tables, install the entry in the memory table,
and then load the TLB and return.

A Table-hit costs only about 19 cycles. A Table-miss costs a bit over a hundred.
Table-miss where the MMU table is not in cache can be considerably more expensive.
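
In C-level pseudocode (register and routine names invented for
illustration; the real handler is a handful of assembly instructions),
the refill path looks roughly like:

   #include <stdint.h>

   struct tlb_entry { uint64_t tag; uint64_t pte; };
   extern struct tlb_entry table[];            /* big hash table in memory */
   extern uint64_t read_context_reg(void);     /* HW hash of VA + task ID */
   extern uint64_t current_tag(void);          /* VA + task ID to match */
   extern void tlb_write(uint64_t pte);        /* install into the TLB */
   extern void return_from_exception(void);
   extern void walk_page_tables_and_fill(uint64_t idx);

   void tlb_refill(void)
   {
       uint64_t idx = read_context_reg();
       if (table[idx].tag == current_tag()) {  /* table hit: ~19 cycles */
           tlb_write(table[idx].pte);
           return_from_exception();
       }
       walk_page_tables_and_fill(idx);         /* table miss: 100+ cycles */
       return_from_exception();
   }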

Anton Ertl

Apr 24, 2021, 7:31:02 AM
EricP <ThatWould...@thevillage.com> writes:
>MIPS uses software translated TLB which I gather behaves differently
>than x86, so they added a hardware TLB that behaves like x86's.

As Mitch Alsup mentioned, the TLB translates in hardware. On a TLB
miss software does the address translation, and it can implement
arbitrary translation schemes, including the multi-level page tables
of IA-32 and AMD64 (and I guess that if you look at Linux-MIPS, you
will find that it uses something close to this scheme).

Maybe Loongson added hardware support for faster dispatch between
multiple translation schemes, e.g, between their traditional page
table format, the IA-32 format, the AMD64 format (I think they have
some differences, e.g., the NX bit), and one or more ARM formats.

Thinking about it, I guess that the bottom-level page descriptions may
differ between what the TLB expects from you, the format used by
IA-32, the one used by AMD64, and one or more ARM formats. E.g.,
AFAIK the MIPS TLB has an X bit for executable access, while AMD64 has
an NX bit. So there is some bit manipulation necessary between these
formats. Maybe Loongson has added hardware support to make this
faster.
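
A sketch of that bit manipulation in C (the AMD64 bit positions are the
architectural ones; the MIPS-side EntryLo layout here is deliberately
simplified):

   #include <stdint.h>

   #define AMD64_P  (1ULL << 0)    /* present */
   #define AMD64_RW (1ULL << 1)    /* writable */
   #define AMD64_NX (1ULL << 63)   /* no-execute */

   /* Convert an AMD64 leaf PTE to a simplified MIPS-style EntryLo. */
   static uint64_t amd64_pte_to_entrylo(uint64_t pte)
   {
       uint64_t pfn = (pte >> 12) & 0xFFFFFFFFFFULL; /* 40-bit frame number */
       uint64_t lo  = pfn << 6;                      /* simplified PFN field */
       if (pte & AMD64_P)     lo |= 1 << 1;          /* valid */
       if (pte & AMD64_RW)    lo |= 1 << 2;          /* dirty/writable */
       if (!(pte & AMD64_NX)) lo |= 1 << 3;          /* X is the inverted NX */
       return lo;
   }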

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

Apr 24, 2021, 10:50:43 AM
Should be similar in my case as well...


The closest thing to an ancestor ISA was from a processor released ~ 24
years ago...

Apart from 'WEX' or similar, there isn't a whole lot that it does which
was not already commonplace 20 years ago.


OTOH, the design isn't particularly optimized for server/PC/workstation
use-cases, where the preference tends to be to do a scalar ISA and then
let superscalar or OoO hardware sort it out.


From what I can gather, though, most ISA-related patents appear to be
used more for the sake of protecting against unlicensed clones than for
fighting off unrelated ISA's which happen to have similar features, so
dunno there...

EricP

Apr 24, 2021, 11:36:59 AM
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> MIPS uses software translated TLB which I gather behaves differently
>> than x86, so they added a hardware TLB that behaves like x86's.
>
> As Mitch Alsup mentioned, the TLB translates in hardware. On a TLB
> miss software does the address translation, and it can implement
> arbitrary translation schemes, including the multi-level page tables
> of IA-32 and AMD64 (and I guess that if you look at Linux-MIPS, you
> will find that it uses something close to this scheme).

Yes, thanks, slip of the tongue.
I should have said Software Managed TLB (SM-TLB).

> Maybe Loongson added hardware support for faster dispatch between
> multiple translation schemes, e.g., between their traditional page
> table format, the IA-32 format, the AMD64 format (I think they have
> some differences, e.g., the NX bit), and one or more ARM formats.

It has been known since 1993 that SM-TLB can cost as much as 25-40%
performance loss due to pipeline flush and reload for the TLB-miss interrupt.
The pipelines for OoO are much larger now and Loongson seems
to be targeting long ROB's so the penalty would be even higher.
One wonders why MIPS was still using SM-TLB.

A hardware table walker eliminates those unnecessary TLB pipeline flushes.

If it was up to me I would have eliminated the SM-TLB long ago and
use a hardware table walker. Actually a hybrid HW-SM-TLB would have a
HW walker but also TLB entries that can be read & written by software.

> Thinking about it, I guess that the bottom-level page descriptions may
> differ between what the TLB expects from you, the format used by
> IA-32, the one used by AMD64, and one or more ARM formats. E.g.,
> AFAIK the MIPS TLB has an X bit for executable access, while AMD64 has
> an NX bit. So there is some bit manipulation necessary between these
> formats. Maybe Loongson has added hardware support to make this
> faster.
>
> - anton

Yes, the page table and PTE format too.
If they go for a HW table walker then they need HW mode bits
to indicate what kind of table to read/write.

If they stay with a SM-TLB then having special instructions to reformat
from Intel, AMD or ARM PTE format into a MIPS PTE format would likely
save many bitwise insert/extract instructions in the TLB miss handler.



Ivan Godard

Apr 24, 2021, 1:03:00 PM
On 4/24/2021 8:36 AM, EricP wrote:
> Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> MIPS uses software translated TLB which I gather behaves differently
>>> than x86, so they added a hardware TLB that behaves like x86's.
>>
>> As Mitch Alsup mentioned, the TLB translates in hardware.  On a TLB
>> miss software does the address translation, and it can implement
>> arbitrary translation schemes, including the multi-level page tables
>> of IA-32 and AMD64 (and I guess that if you look at Linux-MIPS, you
>> will find that it uses something close to this scheme).
>
> Yes, thanks, slip of the tongue.
> I should have said Software Managed TLB (SM-TLB).
>
>> Maybe Loongson added hardware support for faster dispatch between
>> multiple translation schemes, e.g, between their traditional page
>> table format, the IA-32 format, the AMD64 format (I think they have
>> some differences, e.g., the NX bit), and one or more ARM formats.
>
> It has been known since 1993 that SM-TLB can cost as much as 25-40%
> performance loss due to pipeline flush and reload for the TLB-miss
> interrupt.
> The pipelines for OoO are much larger now and Loongson seems
> to be targeting long ROB's so the penalty would be even higher.
> One wonders why MIPS was still using SM-TLB.
>
> A hardware table walker eliminates those unnecessary TLB pipeline flushes.

Cost depends on the frequency of misses and the cost of task/interrupt
switch. Mixed-size page ("jumbo") entries dramatically reduce the
frequency because they cover more addresses in fewer entries.
Architectures like Mill or Mitch's can reduce the switch cost to
tolerable. The flexibility and robustness advantages of SM-TLB remain.

Stefan Monnier

Apr 24, 2021, 1:52:31 PM
> If it was up to me I would have eliminated the SM-TLB long ago and
> use a hardware table walker. Actually a hybrid HW-SM-TLB would have a
> HW walker but also TLB entries that can be read & written by software.

I think most hardware page table walkers automatically provide this
"hybrid mode": just treat the hardware page table as "the TLB" and
add/remove entries from it as you see fit.

IIRC that's how some OSes (Mach maybe?) handled portability between
different architectures: they had their own "native" page tables which
were translated lazily (on the fly) to the hardware's format in response
to page faults.
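
Something like this, in outline (all names invented for illustration):

   /* The OS keeps its own machine-independent map; the fault handler
      translates one entry at a time into the hardware's format. */
   struct native_pte { int valid; unsigned long pfn, prot; };

   extern struct native_pte *native_lookup(unsigned long va);
   extern unsigned long to_hw_format(const struct native_pte *p);
   extern void hw_install(unsigned long va, unsigned long hw_pte);
   extern void handle_real_fault(unsigned long va);

   void on_page_fault(unsigned long va)
   {
       struct native_pte *p = native_lookup(va);  /* OS's own format */
       if (p && p->valid)
           hw_install(va, to_hw_format(p));       /* lazily fill HW table */
       else
           handle_real_fault(va);                 /* genuine page fault */
   }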


Stefan

Quadibloc

Apr 24, 2021, 2:15:40 PM
On Thursday, April 22, 2021 at 2:26:58 PM UTC-6, gareth evans wrote:

> I had already found that article but it does not cover
> the 3A5000 which comes after the MIPS64 products

The news item I saw about the Loongson 3A5000 confirms
that it has a new ISA, and is not a copy of the MIPS or the
Alpha like some Chinese chips. However, it also noted that the
ISA has so far been disclosed only to select business
partners of Loongson, so it isn't yet available on the Internet.

John Savard

MitchAlsup

Apr 24, 2021, 3:06:54 PM
On Saturday, April 24, 2021 at 12:52:31 PM UTC-5, Stefan Monnier wrote:
> > If it was up to me I would have eliminated the SM-TLB long ago and
> > use a hardware table walker. Actually a hybrid HW-SM-TLB would have a
> > HW walker but also TLB entries that can be read & written by software.
<
> I think most hardware page table walkers automatically provide this
> "hybrid mode": just treat the hardware page table as "the TLB" and
> add/remove entries from it as you see fit.
<
Most HW TLBs have the ability to read and write the TLB. Most vendors do
not give access to this feature.
<
>
> IIRC that's how some OSes (Mach maybe?) handled portability between
> different architectures: they had their own "native" page tables which
> were translated lazily (on the fly) to the hardware's format in response
> to page faults.
<
But this naturally leads to a discussion about 2-level MMU systems so
the OS manages the pages for its tasks and the VM manages the pages
for its active OSs. It seems to me that this is where the SW reloaded TLB
would have a good chance of falling flat on its face.

Given a nomenclature of:: Virtual address -> Guest address -> Physical
address:: where
Virtual address gets created during AGEN of a memory reference instruction
Guest address is a Virtual address translated through the OS MMU tables
Physical address is a Guest address translated through the VM MMU tables
Physical addresses are what addresses DRAM memory.

It seems to me that this 2D-MMU would make the SW reload problem
significantly worse than the 1D MMU.

But it very well might be amenable to the "memory table" being big enough
that the page table walks don't happen all that often. But making
SW walk through 2 sets of MMU tables seems to violate separation of
privilege between OS and VM.
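
A sketch of why (helper names invented): every access to a guest
page-table entry is itself at a guest address, so each guest level costs
a full walk of the VM's tables -- on the order of LxL memory reads for
L-level tables:

   #include <stdint.h>

   extern uint64_t host_walk(uint64_t ga);           /* GA -> PA, full walk */
   extern uint64_t read_phys(uint64_t pa);
   extern uint64_t index_of(uint64_t va, int level); /* VA bits for a level */
   extern uint64_t frame_of(uint64_t pte);

   uint64_t nested_translate(uint64_t va, uint64_t guest_root_ga, int levels)
   {
       uint64_t base_ga = guest_root_ga;
       for (int lvl = levels; lvl >= 1; lvl--) {
           /* the guest PTE lives at a guest address: translate it first */
           uint64_t pte_pa = host_walk(base_ga + index_of(va, lvl) * 8);
           base_ga = frame_of(read_phys(pte_pa)); /* next level's guest base */
       }
       return host_walk(base_ga | (va & 0xFFF));  /* leaf GA -> PA */
   }
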
<
>
>
> Stefan

Anton Ertl

Apr 25, 2021, 3:54:15 AM
EricP <ThatWould...@thevillage.com> writes:
>It has been known since 1993 that SM-TLB can cost as much as 25-40%
>performance loss

How much TLB misses cost depends on the TLB miss rate and on the cost
of invoking and executing the miss handler.

As an example for TLB miss rate, when doing a naive 700x700 matrix
multiply on an Ivy Bridge (hardware table walker) without huge pages
and without auto-vectorization, I saw a factor of 17 slowdown between
the fastest and the slowest loop arrangement, mainly due to TLB
misses.

But the miss rate can be reduced by letting the TLB cover more
physical memory. IIRC SGI used 16KB pages towards the end. Modern
CPUs tend to have thousands of TLB entries in a two-level structure;
of course a hardware table walker might cost less hardware than adding
more entries.

>due to pipeline flush and reload for the TLB-miss interrupt.
>The pipelines for OoO are much larger now and Loongson seems
>to be targeting long ROB's so the penalty would be even higher.

I don't think a pipeline flush is necessary (in general; not sure if
what I outline below is possible for MIPS; probably not). The TLB
miss handler could be inserted into the instruction stream right where
the instruction fetcher is at the moment, using a fresh set of
registers to work with, do its thing while the memory access is still
waiting to complete and indicate its completion with a special
finishing instruction. If the TLB miss handler was successful, the
memory access would continue, otherwise it would produce a page fault
(and that would flush the pipeline).

Disadvantages and possible problems:

* The TLB miss handler may have to wait for earlier instructions in
the front end and ready earlier instructions in the execution engine
to be processed. Probably not a long wait, however (unless a
scheduling queue is full, and the front end waits for space to
put its results; in that case, a solution may be to flush the
pipeline, see below); actually, I think that this would often be
faster than waiting for the missing memory access to reach
retirement (which is probably the usual way to treat such a thing as
an exception).

* A more serious problem is the waiting instructions, which may clog
up resources until there is a deadlock. If the TLB-missing memory
access waits in the load-store unit, just a few bunched TLB-misses
might have this effect. One could avoid that by either sending the
TLB miss back into the scheduling queue (which probably requires
extra complexity), or by having some waiting area in the load/store
unit (how big?). But the scheduling queues might also contain so
many instructions waiting for the memory access to finish that there
is no room for the TLB miss handler. One way to deal with that is
to flush the pipeline and do the TLB miss handler as a classical
exception if there is not enough room.

Instead of a full pipeline flush, an alternative is to just clear
enough of the queues to make room for the TLB miss handler; so the
miss handler would start with a statement of its queue requirements,
and the instruction fetcher would have to reset its
after-the-TLB-miss-handler PC to the oldest instruction evicted by
this, and all queues would have to drop or invalidate younger
instructions.

Of course, the complexity and hardware requirements of TLB miss
handling without pipeline flush may be worse than one or two table
walkers. AFAIK the multi-level page table has won now (unlike when
MIPS was first designed), so the flexibility of a software TLB miss
handler is no longer a practical advantage (except for researchers who
want to play with alternative page table schemes), shifting the
balance towards hardware table walkers.

MitchAlsup

Apr 25, 2021, 12:28:29 PM
On Sunday, April 25, 2021 at 2:54:15 AM UTC-5, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >It has been known since 1993 that SM-TLB can cost as much as 25-40%
> >performance loss
> How much TLB misses cost depends on the TLB miss rate and on the cost
> of invoking and executing the miss handler.
>
> As an example for TLB miss rate, when doing a naive 700x700 matrix
> multiply on an Ivy Bridge (hardware table walker) without huge pages
> and without auto-vectorization, I saw a factor of 17 slowdown between
> the fastest and the slowest loop arrangement, mainly due to TLB
> misses.
>
There are 4 "loops==transposes" in Matrix300 (SPEC89),
Loop 1 takes a TLB miss every 512 iterations of the loop
Loops 2 and 3 take a TLB miss every 64 iterations of the loop
Loop 4 takes a TLB miss every 1 iteration of the loop
{With a 32-entry FA TLB mapping 4KB pages}
>
> But the miss rate can be reduced by letting the TLB cover more
> physical memory. IIRC SGI used 16KB pages towards the end. Modern
> CPUs tend to have thousands of TLB entries in a two-level structure;
> of course a hardware table walker might cost less hardware than adding
> more entries.
<
There are lots of ways to accelerate HW TLB table walkers, some
allow walking of individual page mapping levels each cycle, others
allow the table walker to be walking more than one TLB miss
simultaneously, and then there is the classic 2-level TLB using
32-to-64 entry FA TLB at the first level and 512-2048 entry 4-way set
for the second level. With a 1 or 2 cycle miss penalty, the 2-level TLB
has become quite popular.
<
> >due to pipeline flush and reload for the TLB-miss interrupt.
> >The pipelines for OoO are much larger now and Loongson seems
> >to be targeting long ROB's so the penalty would be even higher.
> I don't think a pipeline flush is necessary (in general; not sure if
> what I outline below is possible for MIPS; probably not).
>
The pipeline flush is required to get instructions younger than the
TLB miss out of the execution window.
>
> The TLB
> miss handler could be inserted into the instruction stream right where
> the instruction fetcher is at the moment,
<
Remember there is the table-hit miss case, and the table-miss miss case;
while the former can be inserted, you are going to need a pipeline flush for
the latter. I suspect that the Table-hit miss case could be done in HW just
as easily as in SW.....
<
> using a fresh set of
> registers to work with, do its thing while the memory access is still
> waiting to complete and indicate its completion with a special
> finishing instruction. If the TLB miss handler was successful, the
> memory access would continue, otherwise it would produce a page fault
> (and that would flush the pipeline).
>
> Disadvantages and possible problems:
>
> * The TLB miss handler may have to wait for earlier instructions in
> the front end and ready earlier instructions in the execution engine
> to be processed. Probably not a long wait, however (unless a
> scheduling queue is full, and the front end waits for space to
> put its results; in that case, a solution may be to flush the
> pipeline, see below); actually, I think that this would often be
> faster than waiting for the missing memory access to reach
> retirement (which is probably the usual way to treat such a thing as
> an exception).
<
Here, you are actually arguing that the pipeline should be flushed.
<
>
> * A more serious problem is the waiting instructions, which may clog
> up resources until there is a deadlock. If the TLB-missing memory
> access waits in the load-store unit, just a few bunched TLB-misses
> might have this effect. One could avoid that by either sending the
> TLB miss back into the scheduling queue (which probably requires
> extra complexity), or by having some waiting area in the load/store
> unit (how big?). But the scheduling queues might also contain so
> many instructions waiting for the memory access to finish that there
> is no room for the TLB miss handler. One way to deal with that is
> to flush the pipeline and do the TLB miss handler as a classical
> exception if there is not enough room.
<
Flushing is simpler and works all the time.

EricP

Apr 25, 2021, 1:18:35 PM
MitchAlsup wrote:
> On Saturday, April 24, 2021 at 12:52:31 PM UTC-5, Stefan Monnier wrote:
>>> If it was up to me I would have eliminated the SM-TLB long ago and
>>> use a hardware table walker. Actually a hybrid HW-SM-TLB would have a
>>> HW walker but also TLB entries that can be read & written by software.
> <
>> I think most hardware page table walkers automatically provide this
>> "hybrid mode": just treat the hardware page table as "the TLB" and
>> add/remove entries from it as you see fit.
> <
> Most HW TLBs have the ability to read and write the TLB. Most vendors do
> not give access to this feature.

Right, for testing, so this is mostly a matter of documenting
existing control registers.

I would have a hardware table walker use a bottom-up translate algorithm,
which means the internal tree levels have their own small TLB's.
These interior TLB's are also exposed and documented and have
control registers to add, remove, read and probe entries.

> <
>> IIRC that's how some OSes (Mach maybe?) handled portability between
>> different architectures: they had their own "native" page tables which
>> were translated lazily (on the fly) to the hardware's format in response
>> to page faults.
> <
> But this naturally leads to a discussion about 2-level MMU systems so
> the OS manages the pages for its tasks and the VM manages the pages
> for its active OSs. It seems to me that this is where the SW reloaded TLB
> would have a good chance of falling flat on its face.
>
> Given a nomenclature of:: Virtual address -> Guest address -> Physical
> address:: where
> Virtual address gets created during AGEN of a memory reference instruction
> Guest address is a Virtual address translated through the OS MMU tables
> Physical address is a Guest address translated through the VM MMU tables
> Physical addresses are what addresses DRAM memory.

VA->GA->PA is sequential, but it could be optimized by having 3 TLB's:
one for VA->GA with ASID-tagged entries,
one for GA->PA with GAID-tagged entries,
and one that combines VA->PA with ASID-GAID tagged entries.

A miss on the combined TLB translates through the first two
and saves the combined result with both tags.

Using bottom-up translate table walkers for VA->GA and GA->PA
could minimize the translation accesses and allows all levels
of each of the VA tables and GA tables to be searched in parallel.

It might also be possible to eliminate the outer level GA->PA TLB
but keep its interior level TLB's and just use the combined
VA->PA TLB to hold final entries.
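
A sketch of the miss path (helper names invented):

   #include <stdbool.h>
   #include <stdint.h>

   extern bool combined_lookup(uint64_t va, int asid, int gaid, uint64_t *pa);
   extern bool stage1_lookup(uint64_t va, int asid, uint64_t *ga);
   extern bool stage2_lookup(uint64_t ga, int gaid, uint64_t *pa);
   extern uint64_t walk_guest_tables(uint64_t va, int asid); /* VA -> GA */
   extern uint64_t walk_host_tables(uint64_t ga, int gaid);  /* GA -> PA */
   extern void combined_fill(uint64_t va, int asid, int gaid, uint64_t pa);

   uint64_t translate(uint64_t va, int asid, int gaid)
   {
       uint64_t pa, ga;
       if (combined_lookup(va, asid, gaid, &pa))  /* common fast path */
           return pa;
       if (!stage1_lookup(va, asid, &ga))         /* VA->GA, ASID-tagged */
           ga = walk_guest_tables(va, asid);
       if (!stage2_lookup(ga, gaid, &pa))         /* GA->PA, GAID-tagged */
           pa = walk_host_tables(ga, gaid);
       combined_fill(va, asid, gaid, pa);         /* cache the composition */
       return pa;
   }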

> It seems to me that this 2D-MMU would make the SW reload problem
> significantly worse than the 1D MMU.

SW reload should work with bottom-up translate.
Just requires exposing the interior level TLB's control registers.

> But it very well might be amenable to the "memory table" being big enough
> that the page table walks don't happen all that often. But making
> SW walk through 2 sets of MMU tables seems to violate separation of
> privilege between OS and VM.

Different TLB miss exceptions, one goes to guest OS, one goes to VM.
Each handler runs with its own privilege context and looks at its own table.

VM handler adds the combined ASID-GAID entry to the third TLB.


MitchAlsup

Apr 25, 2021, 3:00:22 PM
On Sunday, April 25, 2021 at 12:18:35 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, April 24, 2021 at 12:52:31 PM UTC-5, Stefan Monnier wrote:
> >>> If it was up to me I would have eliminated the SM-TLB long ago and
> >>> use a hardware table walker. Actually a hybrid HW-SM-TLB would have a
> >>> HW walker but also TLB entries that can be read & written by software.
> > <
> >> I think most hardware page table walkers automatically provide this
> >> "hybrid mode": just treat the hardware page table as "the TLB" and
> >> add/remove entries from it as you see fit.
> > <
> > Most HW TLBs have the ability to read and write the TLB. Most vendors do
> > not give access to this feature.
> Right, for testing, so this is mostly a matter of documenting
> existing control registers.
>
> I would have a hardware table walker use a bottom-up translate algorithm,
> which means the internal tree levels have their own small TLB's.
<
I always called these table-walk accelerators, and mostly they were organized
around SRAMs (post 2000) but prior to that they were organized around
reusable buffer entries. These things have to be snooped so if someone
ends up writing an entry in the table, the buffer gets clobbered.
<
> These interior TLB's are also exposed and documented and have
> control registers to add, remove, read and probe entries.
<
Of the machines I know about, these registers are accessible through a
bus that is "not all that fast" so a read of the TLB might take 100+ cycles.
Something faster would be required if the CPU was to allow SW direct
access. There were literally hundreds of 64-bit (and larger) registers on
this bus; and we did not spend much time on making it fast as long as
it worked. This is fine for testing and not acceptable for actual use.
<
> > <
> >> IIRC that's how some OSes (Mach maybe?) handled portability between
> >> different architectures: they had their own "native" page tables which
> >> were translated lazily (on the fly) to the hardware's format in response
> >> to page faults.
> > <
> > But this naturally leads to a discussion about 2-level MMU systems so
> > the OS manages the pages for its tasks and the VM manages the pages
> > for its active OSs. It seems to me that this is where the SW reloaded TLB
> > would have a good chance of falling flat on its face.
> >
> > Given a nomenclature of:: Virtual address -> Guest address -> Physical
> > address:: where
> > Virtual address gets created during AGEN of a memory reference instruction
> > Guest address is a Virtual address translated through the OS MMU tables
> > Physical address is a Guest address translated through the VM MMU tables
> > Physical addresses are what addresses DRAM memory.
> VA->GA->PA is sequential, but it could be optimized by having 3 TLB's:
> one for VA->GA with ASID-tagged entries,
> one for GA->PA with GAID-tagged entries,
> and one that combines VA->PA with ASID-GAID tagged entries.
<
We always did this with 1 TLB, VA associated with PA. A miss would activate
the table walker (after the check for PTE in the L2 TLB) with its table-walk accelerators.

EricP

Apr 25, 2021, 4:59:59 PM
MitchAlsup wrote:
> On Sunday, April 25, 2021 at 12:18:35 PM UTC-5, EricP wrote:
>>
>> I would have a hardware table walker use a bottom-up translate algorithm,
>> which means the internal tree levels have their own small TLB's.
> <
> I always called these table-walk accelerators, and mostly they were organized
> around SRAMs (post 2000) but prior to that they were organized around
> reusable buffer entries. These things have to be snooped so if someone
> ends up writing an entry in the table, the buffer gets clobbered.
> <

Just to be sure we are talking about the same thing...

By bottom-up translator I'm thinking of a fully assoc lookup of the PTE for
each level of the page table. For example the x64 page table has 5 levels,
each covering a different number of the address bits:
VA[63:12] level 1
VA[63:21] level 2
VA[63:30] level 3
VA[63:39] level 4
VA[63:48] level 5

So each level has a TLB that contains PTE entries indexed by a
different part of the VA.

Initially the tables are empty and a TLB-miss triggers a top down walk.
As the PTE for each of levels 5 to 1 is read, if it is valid it is loaded
into the TLB entry for that level, with a fully assoc index using
the virtual address bits for that level.

A bottom-up walk looks up VA[63:12] in the level 1 TLB.
If it is a hit, it gives us the level 1 PTE that contains the
physical frame number of the code or data page.
If it is a miss, then look up VA[63:21] in the level 2 TLB.
If it is a hit, it gives us the level 2 PTE that contains the
physical frame number containing the level 1 PTE.
If a miss, look up VA[63:30] in level 3 TLB, and so on.
We walk backwards up the tree, reusing each level's PTE's
we loaded on the top-down walk.

The advantage is that if we miss at level 1, we often hit at level 2
and can load the level 1 PTE with a single memory read instead of 5
of a top-down walk.

One can also play with this depending on power budget, like do them all
in parallel and using a priority selector to choose the lowest level,
or some parallel and some pipelined-serial.
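
In outline (structures and helpers invented; level 1 holds the leaf PTE's):

   #include <stdint.h>

   extern int  tlb_probe(int level, uint64_t va, uint64_t *pte);
   extern void tlb_fill(int level, uint64_t va, uint64_t pte);
   extern uint64_t read_phys(uint64_t pa);
   extern uint64_t frame_of(uint64_t pte);       /* physical frame in a PTE */
   extern uint64_t index_of(uint64_t va, int level);
   extern uint64_t root_pointer(void);           /* treated as a PTE here */

   uint64_t bottom_up_refill(uint64_t va, int levels)
   {
       uint64_t pte = 0;
       int lvl;
       for (lvl = 1; lvl <= levels; lvl++)       /* lowest level that hits */
           if (tlb_probe(lvl, va, &pte))
               break;
       if (lvl > levels)
           pte = root_pointer();                 /* full miss: from the top */
       while (--lvl >= 1) {                      /* one read per level left */
           pte = read_phys(frame_of(pte) + index_of(va, lvl) * 8);
           tlb_fill(lvl, va, pte);
       }
       return pte;                               /* level-1 (leaf) PTE */
   }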

>> VA->GA->PA is sequential, but it could be optimized by having 3 TLB's:
>> one for VA->GA with ASID-tagged entries,
>> one for GA->PA with GAID-tagged entries,
>> and one that combines VA->PA with ASID-GAID tagged entries.
> <
> We always did this with 1 TLB, VA associated with PA. A miss would activate
> the table walker (after the check for PTE in the L2 TLB) with its table-walk accelerators.
> <

Yes but you were also concerned about the cost of TLB misses
costing a full what is now a 10-level table walk.
This might be able to load a VA->PA entry with just 1 or 2 memory reads.


MitchAlsup

Apr 25, 2021, 6:19:41 PM
On Sunday, April 25, 2021 at 3:59:59 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Sunday, April 25, 2021 at 12:18:35 PM UTC-5, EricP wrote:
> >>
> >> I would have a hardware table walker use a bottom-up translate algorithm,
> >> which means the internal tree levels have their own small TLB's.
> > <
> > I always called these table-walk accelerators, and mostly they were organized
> > around SRAMs (post 2000) but prior to that they were organized around
> > reusable buffer entries. These things have to be snooped so if someone
> > ends up writing an entry in the table, the buffer gets clobbered.
> > <
> Just to be sure we are talking about the same thing...
>
> By bottom-up translator I'm thinking of a fully assoc lookup of the PTE for
> each level of the page table. For example the x64 page table has 5 levels,
> each covering a different number of the address bits:
> VA[63:12] level 1
> VA[63:21] level 2
> VA[63:30] level 3
> VA[63:39] level 4
> VA[63:48] level 5
>
> So each level has a TLB that contains PTE entries indexed by a
> different part of the VA.
<
This was what we did in the Ross TWA. There were 3-4 entries in the
L1 TWA and 2 entries in the L2 TWA. In both cases, the PTP was
recorded. Later machines using similar TWAs captured entire cache
lines {post-Ross later machines}.
<
>
> Initially the tables are empty and a TLB-miss triggers a top down walk.
> As the PTE for each of levels 5 to 1 is read, if it is valid it is loaded
> into the TLB entry for that level, with a fully assoc index using
> the virtual address bits for that level.
>
> A bottom-up walk looks up VA[63:12] in the level 1 TLB.
> If it is a hit, it gives us the level 1 PTE that contains the
> physical frame number of the code or data page.
> If it is a miss, then look up VA[63:21] in the level 2 TLB.
> If it is a hit, it gives us the level 2 PTE that contains the
> physical frame number containing the level 1 PTE.
> If a miss, look up VA[63:30] in level 3 TLB, and so on.
> We walk backwards up the tree, reusing each level's PTE's
> we loaded on the top-down walk.
<
We did all the lookups simultaneously and then used an FF1 (find
first one) circuit to tell us where to start.
<
>
> The advantage is that if we miss at level 1, we often hit at level 2
> and can load the level 1 PTE with a single memory read instead of 5
> of a top-down walk.
<
And the more data one has at each level, the fewer memory accesses
are needed. Basically, you are constrained between the size of 1 PTE
on the small end and one cache line on the large end.
<
>
> One can also play with this depending on power budget, like do them all
> in parallel and using a priority selector to choose the lowest level,
> or some parallel and some pipelined-serial.
> >> VA->GA->PA is sequential, but it could be optimized by having 3 TLB's:
> >> one for VA->GA with ASID-tagged entries,
> >> one for GA->PA with GAID-tagged entries,
> >> and one that combines VA->PA with ASID-GAID tagged entries.
> > <
> > We always did this with 1 TLB, VA associated with PA. A miss would activate
> > the table walker (after the check for PTE in the L2 TLB) with its table-walk accelerators.
> > <
> Yes but you were also concerned about the cost of TLB misses
> costing a full what is now a 10-level table walk.
<
My 66000 table walking system can skip levels, which puts small applications
(like cat, ...) into <as small as> a 2-level system while still supporting a 64-bit
VA space.

But it is NOT a 10 level system !! it is a 5×5 = 25 level system !!
{Each lookup at the guest level requires an entire page walk at the HV level.
And this is why skipping levels and large pages are necessary ! It is also
why it cannot/should not be done in SW as 20% of the accesses occur in
OS VA space while the other 80% occur in HV/VM VA space.}
<
> This might be able to load a VA->PA entry with just 1 or 2 memory reads.
<
You can hold onto the cache line which contained the last PTE loaded
and possibly not access memory at all to install the next PTE, for
strip-mining applications' data access order.

EricP

Apr 26, 2021, 6:02:25 PM
MitchAlsup wrote:
> On Sunday, April 25, 2021 at 3:59:59 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Sunday, April 25, 2021 at 12:18:35 PM UTC-5, EricP wrote:
>>>> VA->GA->PA is sequential, but it could be optimized by having 3 TLB's:
>>>> one for VA->GA with ASID-tagged entries,
>>>> one for GA->PA with GAID-tagged entries,
>>>> and one that combines VA->PA with ASID-GAID tagged entries.
>>> <
>>> We always did this with 1 TLB, VA associated with PA. A miss would activate
>>> the table walker (after the check for PTE in the L2 TLB) with its table-walk accelerators.
>>> <
>> Yes but you were also concerned about the cost of TLB misses
>> costing a full what is now a 10-level table walk.
> <
> My 66000 table walking system can skip levels, which puts small applications
> (like cat, ...) into <as small as> a 2-level system while still supporting a 64-bit
> VA space.

Skipping over unused table levels helps.
And it works with bottom-up translate too.

But I think we can do better than that by using
various block address translate methods as an option,
in addition to page table translates.

> But it is NOT a 10 level system !! it is a 5×5 = 25 level system !!

Right. I knew that but forgot.

> {Each lookup at the guest level requires an entire page walk at the HV level.
> And this is why skipping levels and large pages are necessary ! It is also
> why it cannot/should not be done in SW as 20% of the accesses occur in
> OS VA space while the other 80% occur in HV/VM VA space.}
> <
>> This might be able to load a VA->PA entry with just 1 or 2 memory reads.
> <
> You can hold onto the cache line which contained the last PTE loaded
> and possibly not access memory at all to install the next PTE, for
> strip-mining applications' data access order.

I have two ways which eliminate the N^2 table walk cost.

In my MMU the top 4 address bits are an index into a hardware table
whose entries specify the method by which addresses in that range are translated.

Method-1 is Page Table Translate (PTT) and it specifies a table root
physical address. Address translates on this table can be optimized
with level skip and bottom-up.

Method-2 is Direct Block Translate (DBT) which specifies an area size,
protection, and a 64-bit base address to add to the Effective Address (EA)
to produce the physical address. This performs an arithmetic relocation
of a contiguous range of EA to a contiguous range of PA directly
without any memory reads.
This requires that the relocated block be physically contiguous.

Method-3 is Indirect Block Translate (IBT), a mixture of 1 & 2.
The entry specifies an area size, protection, and 64-bit
base address to add to the EA giving an area offset.
The page number is extracted from offset bits [63:12] and used as an
index into a 1 level physically contiguous PTE vector whose base
physical address is specified in the entry. The net result is a
single memory access to load a PTE to translate the VA->PA
while retaining the ability to page fault in that memory area.
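
Putting the three methods together, the per-access dispatch looks roughly
like this (layout and field names invented; bounds and protection checks
omitted):

   #include <stdint.h>

   enum method { PTT, DBT, IBT };
   struct area {
       enum method m;
       uint64_t base;               /* added to the EA (DBT/IBT) */
       uint64_t pte_vec;            /* PTE vector base (IBT) */
       uint64_t root;               /* page table root (PTT) */
   };
   extern struct area area_table[16];   /* indexed by top 4 VA bits */
   extern uint64_t read_phys(uint64_t pa);
   extern uint64_t frame_of(uint64_t pte);
   extern uint64_t walk_table(uint64_t root, uint64_t ea);

   uint64_t area_translate(uint64_t ea)
   {
       struct area *a = &area_table[ea >> 60];
       uint64_t off = ea + a->base;             /* arithmetic relocation */
       switch (a->m) {
       case DBT:                                /* no memory reads at all */
           return off;
       case IBT: {                              /* exactly one PTE read */
           uint64_t pte = read_phys(a->pte_vec + (off >> 12) * 8);
           return frame_of(pte) | (ea & 0xFFF); /* faults if PTE invalid */
       }
       case PTT:                                /* full page table walk */
       default:
           return walk_table(a->root, ea);
       }
   }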

In the majority of OSes, the fixed-size kernel space can use 3 DBT areas
to relocate the OS code, read-only data, and read-write data.
Another PTT area for dynamically managed memory for
things like kernel drivers.
A single DBT area can map graphics card memory.

====================================

A VMM can nest the above mapping methods to efficiently relocate
a guest at little or no cost.

The VMM reserves a fixed-size physical working set for a guest OS.
It doesn't have to apply to all guest OSes, just ones we want to optimize.
I would expect that many VMMs probably already allocate a fixed-size
physical working set to guest VM's to prevent thrashing.

The VMM sets up one block translate to relocate all guest addresses together.

Let's say the VMM creates guest OS A (Guest-A) and tells it that it has 8 GB.
Guest-A will manage memory within that range, only producing GA's
with Guest Frame Numbers (GFN) up to 2^33>>12 = 2^21.
When Guest-A is running and the MMU has its Guest Space ID (GSID)
we switch the mapping to block address translate.

A Direct Block Translate can relocate all 8 GB GA's to a
contiguous 8 GB area of physical memory with just an ADD.

With an Indirect Block Translate the VMM allocates a contiguous array
in physical memory to hold GOS-A's 2^21 PTE's, a 16 MB array.
That array's base address is given to the MMU associated with
Guest-A's GSID tag.

The Guest-A's VA->GA address translate proceeds as usual,
including optimizations like skipping and bottom-up.
The MMU matches the current GSID and retrieves the array base.
MMU uses the GFN to index directly to a PTE that maps GA->PA (or is invalid).
Cost is 1 physical memory operation per guest memory operation
but the VMM does have the option to page swap guest memory.
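
So the GA->PA step for such a guest collapses to one indexed load
(sketch; names invented):

   #include <stdint.h>

   #define PTE_VALID 1ULL

   struct guest { uint64_t pte_array_pa; };  /* per-GSID 16 MB PTE array */

   extern uint64_t read_phys(uint64_t pa);
   extern uint64_t frame_of(uint64_t pte);
   extern void vmm_page_fault(struct guest *g, uint64_t ga);

   uint64_t guest_ga_to_pa(struct guest *g, uint64_t ga)
   {
       uint64_t gfn = ga >> 12;              /* guest frame number */
       uint64_t pte = read_phys(g->pte_array_pa + gfn * 8);
       if (!(pte & PTE_VALID))
           vmm_page_fault(g, ga);            /* VMM swapped this guest page */
       return frame_of(pte) | (ga & 0xFFF);
   }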


MitchAlsup

Apr 26, 2021, 7:06:29 PM
I have a 3-bit level indicator that spans the whole table structure from
Root pointer to the PTEs. The Root pointer can point at a single 4KB
page of PTEs (for small applications), making this task a 1-level
lookup ! {And so on}

And at each level, a level indicator of 000 indicates this is a PTE and the
level tells one how big the page is.
<
>
> Method-1 is Page Table Translate (PTT) and it specifies a table root
> physical address. Address translates on this table can be optimized
> with level skip and bottom-up.
>
> Method-2 is Direct Block Translate (DBT) which specifies an area size,
> protection, and a 64-bit base address to add to the Effective Address (EA)
> to produce the physical address. This performs an arithmetic relocation
> of a contiguous range of EA to a contiguous range of PA directly
> without any memory reads.
> This requires that the relocated block be physically contiguous.
<
In my system I have something like this, but it is called a port hole.
A port hole consists of a base address (1st byte accessible), a bounds
(last byte accessible), and a root pointer, and this is used to translate
an address through a foreign address space (which can be 1 or 2 levels.)
Each layer in the translation structure can remove permissions to the
address space it translates (both PTP and PTE).

Port holes are how a debugger accesses a task it is controlling
(i.e., remotely) so the task being debugged does not know that it
is being debugged. The registers and thread header are memory
mapped, so all the debugger needs is to know how to walk
the system structures in order to gain access to everything the
thread can access.
<
>
> Method-3 is Indirect Block Translate (IBT), a mixture of 1 & 2.
> The entry specifies an area size, protection, and 64-bit
> base address to add to the EA giving an area offset.
> The page number is extracted from offset bits [63:12] and used as an
> index into a 1 level physically contiguous PTE vector whose base
> physical address is specified in the entry. The net result is a
> single memory access to load a PTE to translate the VA->PA
> while retaining the ability to page fault in that memory area.
>
> In the majority of OSes, the fixed-size kernel space can use 3 DBT areas
> to relocate the OS code, read-only data, and read-write data.
<
Similar, but the OS-level uses large pages to do the above, while the
HV uses more regular sized pages to perform swap-stuff.
<
> Another PTT area for dynamically managed memory for
> things like kernel drivers.
> A single DBT area can map graphics card memory.
<
single DBT per HV or a single DBT ?? (running under HV)
<
>
> ====================================
>
> A VMM can nest the above mapping methods to efficiently relocate
> a guest at little or no cost.
>
> The VMM reserves a fixed size physical working set for guests OS.
> It doesn't have to apply to all guest OS, just ones we want to optimize.
> I would expect that many VMM probably already allocate a fixed
> size physical working set to guest VM's to prevent thrashing.
>
> VMM sets up one block translates to relocate all guest addresses together.
>
> Lets say VMM creates guest OS A (Guest-A) and tells it that it has 8 GB.
> Guest-A will manage memory within that range, only producing GA's
> with Guest Frame Numbers (GFN) up to 2^33>>12 = 2^21.
> When Guest-A is running and the MMU has its Guest Space ID (GSID)
> we switch the mapping to block address translate.
<
Exactly my plan.....
<
>
> A Direct Block Translate can relocate all 8 GB GA's to a
> contiguous 8 GB area of physical memory with just an ADD.
>
> With an Indirect Block Translate the VMM allocates a contiguous array
> in physical memory to hold GOS-A's 2^21 PTE's, a 16 MB array.
> That array's base address is given to the MMU associated with
> Guest-A's GSID tag.
<
pretty much my plan.....
<
>
> The Guest-A's VA->GA address translate proceeds as usual,
> including optimizations like skipping and bottom-up.
> The MMU matches the current GSID and retrieves the array base.
> MMU uses the GFN to index directly to a PTE that maps GA->PA (or is invalid).
> Cost is 1 physical memory operation per guest memory operation
> but the VMM does have the option to page swap guest memory.
<
Did you ever run through the thought of having a HW PTE fetcher
(for TLB misses) that operates much like the MIPS SW PTE fetcher
for fetches that hit in the memory hash table, and trap if the big
memory table fails to have the entry ?

I ran through this about a decade ago as an excersize to rid the
processor of excessive table-walk-accelerators.

I mainly got bonked because it roughly doubled the memory footprint of
the MMU tables.

Ivan Godard

Apr 26, 2021, 7:45:20 PM
How does the debugger follow across a fork/spawn/visit?

Can it follow a syscall into system space?

Can the debugger use the MMIO to inspect registers of a process
that is in-flight in another core?

MitchAlsup

Apr 26, 2021, 8:00:24 PM
Given a MMU translation tables that enable such accesses, any thread can
do these.

What must a Mill debugger do to single step a program specialized for a
16-wide machine ?

Ivan Godard

Apr 26, 2021, 8:30:02 PM
I wasn't saying you didn't do it; I was asking for details of how you
did it. Please give :-)

> What must a Mill debugger do to single step a program specialized for a
> 16-wide machine ?

Mill single-step is cycle-by-cycle; that's the usual way for wide-issue
machines.

The sim has a GDB-esque UI that we expect to also use for a
target-resident debugger when we get around to one. You can probe the
machine state as of the current cycle, and also probe inflight state
such as operations that are in process in an FU or loads that aren't
retired yet. The debugger does not have special permissions; it has its
own, and those of the program being debugged.

If the program makes a portal call to another turf the debugger cannot
see into the callee. If the callee makes a callback to the program then
the debugger can see that. As portal calls are all in the same stack,
this means that the debugger sees a fragmented stack address range, the
same as what the program itself can see.

To answer my own questions:

1) the Mill debugger is thread-aware, and the UI lets you switch view
among threads in the same turf (or with complementary permissions in
general). The view switch is manual, but fork/spawn/visit are specially
recognized and prompt the debugger to switch spontaneously. This works
well for coroutine behavior, but debugging true parallel processing is
an unsolved problem for the field in general.

2) Syscalls are just portals on the Mill, and get no special handling.
If the debugger has both caller and callee permissions it follows along
just like a non-portal call; otherwise the callee is treated as atomic
and the debugger gets control when control returns to the debugged turf.
The return can be a call return, interrupt, a visit, or anything else
that gets a CPU to the turf/thread combination.

3) The debugger can inspect MMIO if it has permissions for it, but that
would be unusual. More often it would want to inspect spill history, so
as to make a backtrace, for example. That history is not usually accessible
with app permissions, which is all the debugger has. Instead there is a trusted
service that inspects spill state on behalf of a debugger (or the app
for that matter - what's a debugger, after all). Because the Mill spill
stack is disjoint from the data stack, this means that the spill state
can't be monkeyed with, precluding various exploits.


How do you do these things in your design?

Quadibloc

Apr 27, 2021, 10:20:18 AM

https://www.tomshardware.com/news/loongson-technology-develops-its-own-cpu-instruction-set-architecture

Thomas Koenig

Apr 27, 2021, 10:45:30 AM

Quadibloc <jsa...@ecn.ab.ca> schrieb:
> https://www.tomshardware.com/news/loongson-technology-develops-its-own-cpu-instruction-set-architecture

They could have used POWER instead :-)

EricP

Apr 27, 2021, 11:55:36 AM
MitchAlsup wrote:
> On Monday, April 26, 2021 at 5:02:25 PM UTC-5, EricP wrote:
>>
>> In the majority of OSes, the fixed-size kernel space can use 3 DBT areas
>> to relocate the OS code, read-only data, and read-write data.
> <
> Similar, but the OS-level uses large pages to do the above, while the
> HV uses more regular sized pages to perform swap-stuff.
> <

My goal is to optimize the access to the relatively small number of
large lumpy memory things that are static in a system address space.
E.g. to eliminate TLB misses on most interrupt handlers, etc.

Because a Direct Block Translate doesn't need to access memory
at all to do its job, it can be used for the main parts of a kernel
that are static in size, loaded at boot, and pinned in memory anyway.
Those didn't need the flexibility of a page table, but on most hardware
a page table was the only tool available for mapping VA->PA.

Memory resident on an I/O device in general and graphics memory in
particular are currently mapped into kernel virtual space using
PTE's because that is the only tool available.
Device memory is not allocated from the kernel pool on device mount,
and is not recycled back to the kernel pool on device dismount.

Graphics memory is notable because it is a great multi-GB contiguous lump.

None of this needs the flexibility of a page table as
it never page faults (and the OS should crash itself if it does).
It doesn't need the expense of walking a page table to load
a TLB with entries that never fault and rarely or never change.

Direct Block Translates are perfect for this.

Kernel heap that expands and contracts dynamically,
and kernel device drivers that are loaded and unloaded dynamically,
these would use a page table mapped area to load them into.

This applies both to the guest OS and the VMM.
The guest OS sets up its memory areas using guest area tables;
the HV sets up the host area table.
When HV switches running guests, it switches the
MMU's guest area map entries.

>> Another PTT area for dynamically managed memory for
>> things like kernel drivers.
>> A single DBT area can map graphics card memory.
> <
> single DBT per HV or a single DBT ?? (running under HV)
> <

Short question, long answer...

We will ignore for the moment how a HV and its guests can coordinate
their direct access to the graphics memory since they don't know
what each other are doing, and the guest doesn't know the HV exists.

Remember, what a guest thinks are physical addresses are
hypervisor virtual addresses. HV will set up its area map tables for
each guest to create a fake physical memory layout in HV's virtual space.
When HV switches guests, it switches any area table entries for that guest
(just like when an OS switches processes, it switches the page tables of
that process).

Say HV wants to create a guest with 8 GB of RAM,
plus pass a memory mapped 4 GB graphics memory to it.

HV allocates 8 GB of contiguous real physical memory and
assigns HV virtual address 0 as its location for this guest.
HV address 0 is controlled by area[0] so HV sets it to
Direct Block Translate and gives it the physical memory address.

HV assigns area[1] to the graphics card, so the top 4 bits of the
HV virtual address are 0x1, and sets a Direct Block Translate that maps
that HV virtual address to the graphics card's real physical memory.

HV scheduler selects guest to run, loads the 2 area entries for that
guest into MMU area table.
Guest executes and thinks it has 8 GB of RAM starting at
physical address 0, and 4 GB of graphics memory at
physical address 0x1000_0000_0000_0000.

Guest boots and wants to pass the graphics memory to the X-Windows process.
For that process, guest OS sets its guest area[2] to Direct Block Translate
and (what it thinks is) the physical address to 0x1000_0000_0000_0000,
causing the graphics memory to be mapped in the X-Windows process
at virtual address 0x2000_0000_0000_0000.

Guest OS selects the X-Windows process to run, loads area entries for it
into the guest's MMU area table.

X-Windows runs and reads byte of graphics memory located at
its virtual address 0x2000_0000_0000_0000.
Guest MMU looks up the VA's 4-bit area[2] and sees a DBT entry
and adds the offset to create GA 0x1000_0000_0000_0000.
GA is passed to the HV MMU.
HV MMU looks up the GA's 4-bit area[1] and sees a DBT entry
and adds the offset to create PA of real graphics memory.
Core reads the graphics card's byte.

No memory accesses were required to perform the relocations.

Whew...
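
In C-ish form the two arithmetic steps above are just the following (a
minimal sketch; the struct layout, the offset mask, and the physical
addresses are my own illustration, and I add the base to the area offset
rather than to the raw EA, which is the same relocation with a different
base constant):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

typedef struct {
    uint64_t base;   /* added to the offset within the area */
    uint64_t limit;  /* area size in bytes, for the bounds check */
} dbt_entry;

/* Top 4 VA bits pick the area; the rest is an offset that is
   relocated by pure arithmetic - no memory reads. */
static uint64_t dbt_xlate(const dbt_entry map[16], uint64_t addr)
{
    unsigned area   = (unsigned)(addr >> 60);
    uint64_t offset = addr & 0x0FFFFFFFFFFFFFFFull;
    if (offset >= map[area].limit)
        return ~0ull;                 /* bounds fault */
    return map[area].base + offset;
}

int main(void)
{
    dbt_entry guest[16] = {{0}}, host[16] = {{0}};
    /* guest area[2]: X-Windows VA -> GA 0x1000_0000_0000_0000, 4 GB */
    guest[2] = (dbt_entry){ 0x1000000000000000ull, 4ull << 30 };
    /* HV area[1]: GA -> real graphics memory PA (PA value made up) */
    host[1]  = (dbt_entry){ 0x0000002000000000ull, 4ull << 30 };

    uint64_t va = 0x2000000000000000ull;  /* X-Windows virtual address */
    uint64_t ga = dbt_xlate(guest, va);   /* guest MMU: VA -> GA */
    uint64_t pa = dbt_xlate(host,  ga);   /* HV MMU:    GA -> PA */
    printf("VA %016" PRIx64 " -> GA %016" PRIx64 " -> PA %016" PRIx64 "\n",
           va, ga, pa);
    return 0;
}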

> <
> Did you ever run through the thought of having a HW PTE fetcher
> (for TLB misses) that operates much like the MIPS SW PTE fetcher
> for fetches that hit in the memory hash table, and trap if the big
> memory table fails to have the entry ?
>
> I ran through this about a decade ago as an exercise to rid the
> processor of excessive table-walk-accelerators.
>
> I mainly got bonked because it roughly doubled the memory footprint of
> the MMU tables.

Sounds like an inverted page table "accelerator"?
No, I haven't seen that.

I read about a port of Linux to, I think it was, a PPC 750.
IIRC the PPC 750 had an inverted page table, but Linux code expects a
3 level hierarchical table so the port treated the hierarchical table
as the "real" one, and the inverted table like a software managed TLB.


Brett

unread,
Apr 29, 2021, 1:03:05 AM4/29/21
to
No, Power does not have load/store pair, without which you cannot compete
on the high end today.

The antique RISC instruction sets are now officially dead going forward.

Which raises the question of whether all the instruction sets proposed
here support load/store pair...

Thomas Koenig

unread,
Apr 29, 2021, 1:51:13 AM4/29/21
to
Brett <gg...@yahoo.com> schrieb:
> Thomas Koenig <tko...@netcologne.de> wrote:
>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>> https://www.tomshardware.com/news/loongson-technology-develops-its-own-cpu-instruction-set-architecture
>>
>> They could have used POWER instead :-)
>
> No, Power does not have load/store pair, without which you cannot compete
> on the high end today.

You mean something like

# Load VSX Vector instructions load a quadword from
# storage as a vector of 16 byte elements, 8 halfword
# elements, 4 word elements, 2 doubleword elements or
# a quadword element into a VSR.

?

Granted, it's from the not yet implemented 3.1 ISA, and it is
a prefixed instruction.

> The antique RISC instruction sets are now officially dead going forward.

It still seems to be alive :-)

MitchAlsup

unread,
Apr 29, 2021, 12:09:22 PM4/29/21
to
On Thursday, April 29, 2021 at 12:03:05 AM UTC-5, gg...@yahoo.com wrote:
> Thomas Koenig <tko...@netcologne.de> wrote:
> > Quadibloc <jsa...@ecn.ab.ca> schrieb:
> >> https://www.tomshardware.com/news/loongson-technology-develops-its-own-cpu-instruction-set-architecture
> >
> > They could have used POWER instead :-)
> No, Power does not have load/store pair, without which you cannot compete
> on the high end today.
>
> The antique RISC instruction sets are now officially dead going forward.
<
They were dead by 2005, but they just hadn't noticed it yet.
>
> Which begs the question of whether the all the instruction sets proposed
> here support load/store pair...

Do LM (Load Multiple), SM (Store Multiple) and MM (Move Memory)
qualify ?

MitchAlsup

unread,
Apr 29, 2021, 6:17:43 PM4/29/21
to
Could these bits be placed in the Root Pointer ?

But I also got to thinking that my top level table only uses 7-bits of the
virtual address, indexing only 128 entries where there are 512 doublewords
present. I "could" convert accesses into this level page from PTP/PTE
into a porthole descriptor using 2 of the additional doublewords, so we
would then have a base/bounds values and a PTP/PTE to the next level
in the table. The 3rd added doubleword would hold things like the ASID,
various flags for the OS/HV to use, and other stuff. <see below>
<
>
> Method-1 is Page Table Translate (PTT) and it specifies a table root
> physical address. Address translates on this table can be optimized
> with level skip and bottom-up.
>
> Method-2 is Direct Block Translate (DBT) which specifies an area size,
> protection, and a 64-bit base address to add to the Effective Address (EA)
> to produce the physical address. This performs an arithmetic relocation
> of a contiguous range of EA to a contiguous range of PA directly
> without any memory reads.
> This requires that the relocated block be physically contiguous.
>
> Method-3 is Indirect Block Translate (IBT), a mixture of 1 & 2.
> The entry specifies an area size, protection, and 64-bit
> base address to add to the EA giving an area offset.
> The page number is extracted from offset bits [63:12] and used as an
> index into a 1 level physically contiguous PTE vector whose base
> physical address is specified in the entry. The net result is a
> single memory access to load a PTE to translate the VA->PA
> while retaining the ability to page fault in that memory area.
>
> For the majority of OSes, the fixed-size kernel space can use 3 DBT areas
> to relocate the OS code, read-only data and read-write data.
<
<from above>
So if I use several of these top level descriptors, I could map contiguous
chunks of memory (with base and bounds limits) for those things which
are not paged.

What I do not know is whether HVs page pages that the OS thinks are never
paged ??

So, these things might look really good for mapping HV areas that are
actually never paged. Or, I could give the OS the illusion that certain OS
pages are contiguous, and let the HV page them as it desires.

Either way, the paging overhead goes down if the top descriptor porthole
is a PTE rather than a PTP--one access to memory with base and bounds
limits (where that 1 access is 1/2 a cache line.)

Needs more thought
Thanks for the ideas.

Ivan Godard

unread,
Apr 29, 2021, 8:42:45 PM4/29/21
to
What goes in the hardware TLB? Or do you have two, one for pages and one
(with a range comparator) for regions?

MitchAlsup

unread,
Apr 29, 2021, 8:54:36 PM4/29/21
to
On Thursday, April 29, 2021 at 7:42:45 PM UTC-5, Ivan Godard wrote:
> On 4/29/2021 3:17 PM, MitchAlsup wrote:
> > On Monday, April 26, 2021 at 5:02:25 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
<snip>
> > <from above>
> > So if I use several of these top level descriptors, I could map contiguous
> > chunks of memory (with base and bounds limits) for those things which
> > are not paged.
> >
> > What I do not know is whether HVs page pages that the OS thinks are never
> > paged ??
> >
> > So, these things might look really good for mapping HV areas that are
> > actually never paged. Or, I could give the OS the illusion that certain OS
> > pages are contiguous, and let the HV page them as it desires.
> >
> > Either way, the paging overhead goes down if the top descriptor porthole
> > is a PTE rather than a PTP--one access to memory with base and bounds
> > limits (where that 1 access is 1/2 a cache line.)
> >
> > Needs more thought
> What goes in the hardware TLB? Or do you have two, one for pages and one
> (with a range comparator) for regions?
<
I think there would have to be 2 structures, one a more classic TLB for
the pages, and another one (an RLB) that contains base and bounds.

The real question is what to do if the OS entries disagree in base and bounds
with the HV tables ???
<

Ivan Godard

unread,
Apr 29, 2021, 9:11:20 PM4/29/21
to
I understand the base-and-bounds - that's our PLB. But once you get past
the bounds check, what's the difference between adding a region start to
the address and what you would normally do at a large-granularity level
of the usual table (our TLB)?

> The real question is what to do if the OS entries disagree in base and bounds
> with the HV tables ???

Isn't a HV trying to page an OS region just a bug? seems to me that the
HV has to open the OS kimono. Aren't there other places where the HV has
to be aware of the OS view of the world?

Stephen Fuld

unread,
Apr 29, 2021, 10:19:49 PM4/29/21
to
On 4/29/2021 5:54 PM, MitchAlsup wrote:
Didn't at least one version of the PowerPC have both a TLB and (IIRC 2
or 4) base and bound registers? Again, IIRC, the idea was precisely to
take pressure off the TLB for large fixed areas such as the OS code and
data. I think if there was a disagreement, the base and bounds
registers took precedence, but don't hold me to that.

I remember thinking that seemed like a good idea, and wondered why no
one else seemed to do that.

Pause . . .

OK, I found my copy of "The PowerPC Architecture" from IBM (I have the
second edition, dated 1994). It describes what I remembered. They have
what is essentially a page table, which they call segments, and a "Block
Address Translation" (BAT) mechanism, running in parallel.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,
Apr 30, 2021, 10:21:50 AM4/30/21
to
There are areas of memory that are simply paged without B&B and there are
areas that are paged with B&B. Those paged with B&B may be made physically
contiguous by assigning a page level that is larger than the region described
by B&B.
<
> > The real question is what to do if the OS entries disagree in base and bounds
> > with the HV tables ???
<
> Isn't a HV trying to page an OS region just a bug? seems to me that the
> HV has to open the OS kimono. Aren't there other places where the HV has
> to be aware of the OS view of the world?
<
Consider that an HV is managing 1,000,000 OSs and each OS (but a half dozen)
is running under the assumption that it is running without an HV (DOS for
example). Those OSs assume they have all of main memory and somebody is
needed to provide that illusion.

Those OSs that are HV aware are "far more efficient", but the model needs to
enable OSs that are not HV aware at some level of performance. Many of those
OSs are not available outside of a binary image on a disk drive somewhere, so
there is no reasonable path to making them HV aware.

Michael S

unread,
Apr 30, 2021, 10:47:55 AM4/30/21
to
On Thursday, April 29, 2021 at 8:51:13 AM UTC+3, Thomas Koenig wrote:
> Brett <gg...@yahoo.com> schrieb:
> > Thomas Koenig <tko...@netcologne.de> wrote:
> >> Quadibloc <jsa...@ecn.ab.ca> schrieb:
> >>> https://www.tomshardware.com/news/loongson-technology-develops-its-own-cpu-instruction-set-architecture
> >>
> >> They could have used POWER instead :-)
> >
> > No, Power does not have load/store pair, without which you cannot compete
> > on the high end today.
> You mean something like
>
> # Load VSX Vector instructions load a quadword from
> # storage as a vector of 16 byte elements, 8 halfword
> # elements, 4 word elements, 2 doubleword elements or
> # a quadword element into a VSR.
>
> ?

No, he means Load/store pair in aarch64 style.
For load:
Rd0 = Eff_addr[0], Rd1 = Eff_addr[1]
For store:
Eff_addr[0]= Rs0, Eff_addr[1] = Rs1

IBM's new prefixed load/store instructions are just a way of fitting a very wide (34b, if I am not mistaken) immediate displacement into what once was a fixed-width ISA.
More like a move in the x86 direction than in the aarch64 direction.
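
For reference, those pair semantics rendered in plain C (illustration
only; a real LDP/STP is a single instruction and a single memory
transaction):

#include <stdint.h>

/* aarch64-style pair semantics, as two plain C functions. */
static void load_pair(const uint64_t *eff_addr, uint64_t *rd0, uint64_t *rd1)
{
    *rd0 = eff_addr[0];    /* Rd0 = Eff_addr[0] */
    *rd1 = eff_addr[1];    /* Rd1 = Eff_addr[1] */
}

static void store_pair(uint64_t *eff_addr, uint64_t rs0, uint64_t rs1)
{
    eff_addr[0] = rs0;     /* Eff_addr[0] = Rs0 */
    eff_addr[1] = rs1;     /* Eff_addr[1] = Rs1 */
}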

>
> Granted, it's from the not yet implemented 3.1 ISA, and it is
> a prefixed instruction.
> > The antique RISC instruction sets are now officially dead going forward.
> It still seems to be alive :-)

You don't understand Brett's definition of "officially dead".
"Officially dead" == not used by Apple and unlikely, in his opinion, to be used by Apple in foreseeable future.

EricP

unread,
Apr 30, 2021, 1:00:48 PM4/30/21
to
MitchAlsup wrote:
> On Monday, April 26, 2021 at 5:02:25 PM UTC-5, EricP wrote:
>> I have two ways which eliminate the N^2 table walk cost.
>>
>> In my MMU the top 4 address bits are an index into a hardware table
>> whose entries specify the method addresses in that range are translated.
> <
> Could these bits be placed in the Root Pointer ?
>
> But I also got to thinking that my top level table only uses 7-bits of the
> virtual address, indexing only 128 entries where there are 512 doublewords
> present. I "could" convert accesses into this level page from PTP/PTE
> into a porthole descriptor using 2 of the additional doublewords, so we
> would then have a base/bounds values and a PTP/PTE to the next level
> in the table. The 3rd added doubleword would hold things like the ASID,
> various flags for the OS/HV to use, and other stuff. <see below>
> <

I think you might have slightly misunderstood me.
This is all about enhancing x64 CR3 so that it allows new mapping methods.
In a normal non-virtual-machine context there is one CR3.
In a hypervisor context there are two, one controlled by the guest OS,
and a nested CR3 controlled by HV.

Currently x64 has CR3 containing the physical address of the page table
root frame. The table has 4 levels mapping a 48-bit virtual space.

In that root frame, PTEs [0:255] cover the lower 2^47 virtual address
range and are typically associated with the current process user space,
and PTEs [256:511] cover the upper 2^47 virtual address range and
are typically associated with the OS kernel.
There is nothing in the hardware that enforces those associations,
it is just convention.

Viewing addresses as unsigned integers, the above layout puts the
kernel "system space" at high addresses, the process "user space"
at the low addresses, and has a giant dead zone between them.

Each core in the SMP system has its own CR3 which points at its own
private page table root frame. All cores common map OS system space
so the root PTEs [256:511] are the same.
However each core can have a different process mapped so they
have different root PTEs [0:255].

When the OS wants to map a new process so it can run its thread,
it copies up to 256 process PTEs from the process header into its
private page table root frame. This of course is optimized to only copy
the range that is actually valid. If any process root PTEs change,
the updates are written to the process header, then an
Inter-Processor Interrupt (IPI) informs all other cores to
re-copy process PTEs [0:255] into their private root frame.

One issue I want to eliminate is the above PTE copying on process switch.
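
For concreteness, the copy being eliminated is roughly this (a sketch
with made-up names; 512 8-byte entries per root frame):

#include <stdint.h>
#include <string.h>

typedef uint64_t pte_t;

/* Map a new process on this core: copy its user-half root PTEs
   [0:255] into the core's private root frame. */
void map_process(pte_t core_root[512], const pte_t proc_root[256],
                 unsigned valid_count)   /* only the valid prefix is copied */
{
    memcpy(core_root, proc_root, valid_count * sizeof(pte_t));
    /* entries [256:511] (common-mapped system space) are untouched */
}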

============================================
In my first cut at improving this, I make CR3 a 2-entry array,
indexed by the virtual address msb. That would make CR3[0] map the
lower process user space, and CR3[1] map the upper system space.

There are now 2 page tables, one for the current process address space
and one for system address space.
To switch processes, the OS just changes CR3[0] to point at a new process
root frame leaving CR3[1] alone, so none of this PTE copying.

To optimize page table walks in the above, the CR3 root pointer
as well as interior PTE entries allow skipping levels.
For a small page table, CR3[0] can skip to page table tree at level 3.
And this still works with bottom-up translate.

Later I wanted to eliminate the TLB lookups for the large linear
allocations of virtual space. To do that, I want some form of
Block Address Translate (BAT) by arithmetic relocation.

To support BAT, CR3 becomes an array of 16 entries indexed by
virtual address bits [63:60], with entries specifying which
translation method, page table or BAT to use for that Area.

If a CR3-Area entry specifies a page translate method,
then it has details for it, root frame physical address and level #.
If a CR3-Area entry specifies a block address translate method,
then it has details for the BAT such as offset to physical address,
size in bytes, protection, cache control.

CR3-Area[0] can continue to point to a page table mapping user space,
CR3-Area[15] can continue to point to a page table mapping system space.
Additionally it can now add a BAT entry CR3-area[1] mapping the
graphics physical memory into virtual memory,
but now it requires no TLB lookups for that memory area.

To switch processes the OS just switches CR3-Area[0] to point
to that process page table, like before.
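
As a data-structure sketch of that 16-entry table (my own shorthand for
the idea, not a register spec):

#include <stdint.h>

enum xlate_method { AREA_INVALID, AREA_PTT, AREA_DBT, AREA_IBT };

typedef struct {
    enum xlate_method method;    /* how addresses in this area translate */
    union {
        struct { uint64_t root_pa; int level; } ptt;  /* table walk, can skip levels */
        struct { uint64_t base, size, prot; } dbt;    /* arithmetic relocation */
        struct { uint64_t base, size, prot,
                 pte_vec_pa; } ibt;                   /* 1-level PTE vector */
    } u;
} cr3_area;

cr3_area cr3[16];    /* indexed by virtual address bits [63:60] */

/* Process switch: repoint the user area, no PTE copying. */
void switch_process(const cr3_area *proc_user_map)
{
    cr3[0] = *proc_user_map;
}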

Then later when we started talking about hypervisors
I thought of a new kind of BAT which allowed paging.
It makes use of the fact the the guest OS has already mapped all its
scattered virtual addressed into compact linear range of GA's.
So we don't need a multi-level translate tree, just 1 level.
Thus was born the Indirect Block Translate.

>> Method-1 is Page Table Translate (PTT) and it specifies a table root
>> physical address. Address translates on this table can be optimized
>> with level skip and bottom-up.
>>
>> Method-2 is Direct Block Translate (DBT) which specifies an area size,
>> protection, and a 64-bit base address to add to the Effective Address (EA)
>> to produce the physical address. This performs an arithmetic relocation
>> of a contiguous range of EA to a contiguous range of PA directly
>> without any memory reads.
>> This requires that the relocated block be physically contiguous.
>>
>> Method-3 is Indirect Block Translate (IBT), a mixture of 1 & 2.
>> The entry specifies an area size, protection, and 64-bit
>> base address to add to the EA giving an area offset.
>> The page number is extracted from offset bits [63:12] and used as an
>> index into a 1 level physically contiguous PTE vector whose base
>> physical address is specified in the entry. The net result is a
>> single memory access to load a PTE to translate the VA->PA
>> while retaining the ability to page fault in that memory area.
>>
>> For the majority of OSes, the fixed-size kernel space can use 3 DBT areas
>> to relocate the OS code, read-only data and read-write data.
> <
> <from above>
> So if I use several of these top level descriptors, I could map contiguous
> chunks of memory (with base and bounds limits) for those things which
> are not paged.

Yes. A Direct Block Translate, after range and access checks,
takes the 64-bit effective address and adds a 64-bit offset.
If this is a Guest-OS then that is a GA to pass to HV for its translation.
If this is a HV or bare metal then that is a PA.

> What I do not know is whether HVs page pages that the OS thinks are never
> paged ??

The guest has its own CR3 and the hypervisor has its own CR3 as before.
It's just that each CR3 now specifies up to 16 maps.

Guest OS manages its page tables and BAT entries to map multiple
virtual spaces into the guest address "physical space",
then HV takes GA's and uses whichever method it wants to map
the mostly contiguous GA's to physical memory.

> So, these things might look really good for mapping HV areas that are
> actually never paged. Or, I could give the OS the illusion that certain OS
> pages are contiguous, and let the HV page them as it desires.

Yes. The first is the Direct Block Translate, the second is Indirect.
The Indirect Block Translate does look up addresses in TLB,
on a TLB-miss it reads PTEs and can page fault, but there is no
tree to walk - essentially a 1 level direct map.
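
In sketch form (names are mine; pte_vec stands in for the read of the
physically contiguous PTE vector):

#include <stdint.h>

typedef uint64_t pte_t;
#define PTE_VALID 1ull

typedef struct {
    uint64_t base;        /* added to the EA to give the area offset */
    uint64_t size;        /* area size in bytes */
    uint64_t pte_vec_pa;  /* base of the flat, contiguous PTE vector */
} ibt_entry;

/* One bounds check, one PTE read, no tree walk.  Returns 0 on fault. */
uint64_t ibt_xlate(const ibt_entry *e, const pte_t *pte_vec, uint64_t ea)
{
    uint64_t offset = ea + e->base;
    if (offset >= e->size)
        return 0;                          /* outside the area: fault */
    pte_t pte = pte_vec[offset >> 12];     /* the single memory access */
    if (!(pte & PTE_VALID))
        return 0;                          /* page fault: still pageable */
    return (pte & ~0xFFFull) | (offset & 0xFFFull);
}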

Both translations are making use of the fact that there are large
lumps of contiguous address ranges that don't need page table trees.

Those contiguous lumps exist either because it is inherent in the
memory, like a graphics card, or because the Guest-OS has already
used its time and energy to collect it together.
So no reason for HV to uselessly repeat this process.

> Either way, the paging overhead goes down if the top descriptor porthole
> is a PTE rather than a PTP--one access to memory with base and bounds
> limits (where that 1 access is 1/2 a cache line.)
>
> Needs more thought
> <
>
> <snip>
>
> Thanks for the ideas.

You are welcome.

MitchAlsup

unread,
Apr 30, 2021, 1:42:25 PM4/30/21
to
On Friday, April 30, 2021 at 12:00:48 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Monday, April 26, 2021 at 5:02:25 PM UTC-5, EricP wrote:
> >> I have two ways which eliminate the N^2 table walk cost.
> >>
> >> In my MMU the top 4 address bits are an index into a hardware table
> >> whose entries specify the method addresses in that range are translated.
> > <
> > Could these bits be placed in the Root Pointer ?
> >
> > But I also got to thinking that my top level table only uses 7-bits of the
> > virtual address, indexing only 128 entries where there are 512 doublewords
> > present. I "could" convert accesses into this level page from PTP/PTE
> > into a porthole descriptor using 2 of the additional doublewords, so we
> > would then have a base/bounds values and a PTP/PTE to the next level
> > in the table. The 3rd added doubleword would hold things like the ASID,
> > various flags for the OS/HV to use, and other stuff. <see below>
> > <
> I think you might have slightly misunderstood me.
> This is all about enhancing x64 CR3 so that it allows new mapping methods.
> In a normal non-virtual-machine context there is one CR3.
> In a hypervisor context there are two, one controlled by the guest OS,
> and a nested CR3 controlled by HV.
<
OK that part was unclear.
>
> Currently x64 has CR3 containing the physical address of the page table
> root frame. The table has 4 levels mapping a 48-bit virtual space.
>
> In that root frame, PTEs [0:255] cover the lower 2^47 virtual address
> range and are typically associated with the current process user space,
> and PTEs [256:511] cover the upper 2^47 virtual address range and
> are typically associated with the OS kernel.
<
Yes, this is known as a canonical address space mapping and all of
the unused address bits have to be the same as the top address bit.
<
> There is nothing in the hardware that enforces those associations,
> it is just convention.
<
AMD HW actually checks the untranslated address bits for being the same.
>
> Viewing addresses as unsigned integers, the above layout puts the
> kernel "system space" at high addresses, the process "user space"
> at the low addresses, and has a giant dead zone between them.
<
In AMD parlance, the OS is in the negative address space.
>
> Each core in the SMP system has its own CR3 which points at its own
> private page table root frame. All cores common map OS system space
> so the root PTEs [256:511] are the same.
> However each core can have a different process mapped so they
> have different root PTEs [0:255].
<
So at the top level, all tasks have the same data in the top 1/2 page
of the controlling MMU. This seems to be a waste and the waste could
have been minimized to a single PTP in the top page of the MMU table
for a task.
<
>
> When the OS wants to map a new process so it can run its thread,
> it copies up to 256 process PTEs from the process header into its
> private page table root frame. This of course is optimized to only copy
> the range that is actually valid. If any process root PTEs change,
> the updates are written to the process header, then an
> Inter-Processor Interrupt (IPI) informs all other cores to
> re-copy process PTEs [0:255] into their private root frame.
<
This is where that single PTP would come in handy.
>
> One issue I want to eliminate is the above PTE copying on process switch.
<
Given x86-64 MMU tables:: and that the OS is unlikely to exceed 2^39
one could simply utilize one PTP in the negative address space to
point at the page tables for that 39-bit address space and then you
only have to copy one PTP to initialize that end of the address map.
<
>
> ============================================
> In my first cut at improving this, I make CR3 a 2-entry array,
> indexed by the virtual address msb. That would make CR3[0] map the
> lower process user space, and CR3[1] map the upper system space.
>
> There are now 2 page tables, one for the current process address space
> and one for system address space.
<
And what would you do with an HV ? CR3 is a 3 entry array?
<
> To switch processes, the OS just changes CR3[0] to point a new process
> root frame leaving CR3[1] alone, so none of this PTE copying.
<
at this level it's all PTPs.
>
> To optimize page table walks in the above, the CR3 root pointer
> as well as interior PTE entries allow skipping levels.
> For a small page table, CR3[0] can skip to page table tree at level 3.
> And this still works with bottom-up translate.
>
> Later I wanted to eliminate the TLB lookups for the large linear
> allocations of virtual space. To do that, I want some form of
> Block Address Translate (BAT) by arithmetic relocation.
>
> To support BAT, CR3 becomes an array of 16 entries indexed by
> virtual address bits [63:60], with entries specifying which
> translation method, page table or BAT to use for that Area.
>
> If a CR3-Area entry specifies a page translate method,
> then it has details for it, root frame physical address and level #.
> If a CR3-Area entry specifies a block address translate method,
> then it has details for the BAT such as offset to physical address,
> size in bytes, protection, cache control.
>
> CR3-Area[0] can continue to point to a page table mapping user space,
> CR3-Area[15] can continue to point to a page table mapping system space.
> Additionally it can now add a BAT entry CR3-area[1] mapping the
> graphics physical memory into virtual memory,
> but now it requires no TLB lookups for that memory area.
<
I would argue that you do, indeed have a TLB--or something that smells
so similar to a TLB that the difference is not relevant. A TLB is a block
of logic that translates a virtual address into a physical address. And
certainly that function is being performed.
>
> To switch processes the OS just switches CR3-Area[0] to point
> to that process page table, like before.
>
> Then later when we started talking about hypervisors
> I thought of a new kind of BAT which allowed paging.
> It makes use of the fact that the guest OS has already mapped all its
> scattered virtual addresses into a compact linear range of GA's.
> So we don't need a multi-level translate tree, just 1 level.
<
Hmmmmm......
err, this is no longer 1 level, you have the add (above) and the translate (below).
Both add and translate will take ~one cycle each on a modern rather high frequency
CPU. Now the translation part of L1 lookup is slower than data lookup, and this
screws up the cache pipeline......
<
> If this is a Guest-OS then that is a GA to pass to HV for its translation.
> If this is a HV or bare metal then that is a PA.
<
> > What I do not know is whether HVs page pages that the OS thinks are never
> > paged ??
<
> The guest has its own CR3 and the hypervisor has its own CR3 as before.
> It's just that each CR3 now specifies up to 16 maps.
<
I like the underlying principle you are using here, just not some of the warts.
I continue to want the translate function to be as fast as the tag-read
part of L1. But I do think you are on the right track.
>
> Guest OS manages its page tables and BAT entries to map multiple
> virtual spaces into the guest address "physical space",
> then HV takes GA's and uses whichever method it wants to map
> the mostly contiguous GA's to physical memory.
<
So what happens if an address space covered by a BAT needs to be migrated to disk ?
{For example this subsystem will not need any CPU cycles for a week ?}
<
> > So, these things might look really good for mapping HV areas that are
> > actually never paged. Or, I could give the OS the illusion that certain OS
> > pages are contiguous, and let the HV page them as it desires.
<
> Yes. The first is the Direct Block Translate, the second is Indirect.
> The Indirect Block Translate does look up addresses in TLB,
> on a TLB-miss it reads PTEs and can page fault, but there is no
> tree to walk - essentially a 1 level direct map.
<
So let's say we have a large subsystem that is run once a week: it
contains 2^38 bytes and is covered by 1-2-3 IBTs. It is going to
take 10-100 seconds of disk time to migrate the subsystem to
disk. If it were pages, it would still take the same total time but
the time would be distributed across "lots" more system events
making for a smoother running system.
>
> Both translations are making use of the fact that there are large
> lumps of contiguous address ranges that don't need page table trees.
>
> Those contiguous lumps exist either because it is inherent in the
> memory, like a graphics card, or because the Guest-OS has already
> used its time end energy to collect it together.
> So no reason for HV to uselessly repeat this process.
<
I effectively map VA->PA in the TLB, the repetition only occurs during
table walk, and accelerators eliminate most actual memory accesses.
<
> > Either way, the paging overhead goes down if the top descriptor porthole
> > is a PTE rather than a PTP--one access to memory with base and bounds
> > limits (where that 1 access is 1/2 a cache line.)
> >
> > Needs more thought
> > <
> >
> > <snip>
> >
> > Thanks for the ideas.
>
> You are welcome.
<
Repeating, I do think you are on the right track.......

EricP

unread,
Apr 30, 2021, 1:48:21 PM4/30/21
to
I have seen the PPC BAT registers in manuals in the past.
However PPC's BAT's look more like x64's MTRR Memory Type Range Registers
than what I described, in that they pick off ranges of virtual space to
bypass, and then there is one page table for everything else.

Range mappers would need a big HW table of arithmetic comparators to
parallel check a whole bunch of (low_addr <= eff_addr <= upr_addr).
Like a fully associative index except with arithmetic compares,
so expensive, slow, power hungry, doesn't scale well.

PPC BAT's have a minimum size, 128kB I think, and the translation
is also all mixed up with their segmentation registers.
So other than the word "block", it is really not what I was thinking of.

What I'm describing is somewhat similar to PDP-11 MMU with its
segment relocations selected by high address bits as an index,
and a little like VAX MMU with its two page tables for system and
process address space such that address space switch takes one write.
And then rework it all for a 64-bit address space.



Ivan Godard

unread,
Apr 30, 2021, 2:31:11 PM4/30/21
to
Migration time organization does not have to be the same as execution
time organization. There's nothing that says you have to write the whole
thing before reusing space. Split it into 64 pieces and treat them as
individual writes, giving each written piece back to the slab allocator
when done. You get the same smoother system, without needing to have the
(unnecessary) page structure while it is resident.

Now loading it piecemeal is more of an issue, but you can do the same thing
by just mapping the entry 1/64th, loading it, starting it, and handling
the misses when the app touches a 64th that hasn't come in yet. Yes, you
have to remap with each piece, and you are likely to not get very far
without everything having been brought in, but that can be ameliorated
by a little smarts in the layout, and/or in the loading order based on
the profiles from when you brought it in before.

As an alternative to remapping every piece, you can load the pieces in
address order instead of demand order, and just increase the bound of
the base-and-bound with each one in.
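
A sketch of that last variant (names and the one-past-end bound
convention are mine):

#include <stdint.h>

typedef struct { uint64_t base, bound; } region;   /* base-and-bound map */

/* Bring the pieces in lowest-address-first; each arrival needs only
   one write to grow the bound, no remapping. */
void load_in_address_order(region *r, uint64_t piece_bytes, unsigned npieces,
                           void (*read_piece)(uint64_t va, uint64_t len))
{
    r->bound = r->base;                      /* nothing resident yet */
    for (unsigned i = 0; i < npieces; i++) {
        read_piece(r->bound, piece_bytes);   /* next piece from backing store */
        r->bound += piece_bytes;             /* grow the valid region */
    }
}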

MitchAlsup

unread,
Apr 30, 2021, 4:30:07 PM4/30/21
to
Not necessarily;

You could organize the top page of the mapping tables such that instead
of having a PTP per doubleword, you have a region descriptor of 4 DWs.
This descriptor is indexed by 7 higher-order address bits (depending on
the level) and the base and bounds apply to all the pages under this
entry. Thus, the address compares are only done on TableWalk and not
on TLB access.
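
A sketch of such a descriptor (the field layout is my guess; 128 of
these, at 4 DWs each, fill the 512 doublewords of the top-level page):

#include <stdint.h>

typedef struct {
    uint64_t ptp;     /* PTP/PTE to the next level of the table */
    uint64_t base;    /* lowest VA covered by the pages under this entry */
    uint64_t bounds;  /* highest VA covered */
    uint64_t misc;    /* ASID, OS/HV flags, other bookkeeping */
} region_desc;

/* Run only by the table walker on a TLB miss - never on a TLB hit. */
static int region_check(const region_desc *d, uint64_t va)
{
    return va >= d->base && va <= d->bounds;
}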

Brett

unread,
Apr 30, 2021, 8:54:38 PM4/30/21
to
That is unfair, I am not that big an Apple fanboy.
Apple switched to Intel in 2005 and in 2009 ended PowerPC support.
“Apple released Mac OS X v10.6 "Snow Leopard" on August 28, 2009 as
Intel-only”.

I did not declare PowerPC and all antique RISC’s dead on either of these
dates.
RISC has an obvious price and die size advantage over x86 and was very much
alive.

Antique RISC (as opposed to modern RISC) died when MIPS died, that is when
ARM8 went down market, and RISC-V went mass market. Leaving no market for
outdated architectures besides legacy compatibility decline. Free RISC-V or
fast ARM8, make your pick.

There is some room in the middle for something cheaper than an ARM8 license
but still fast; it's a tight squeeze that Mill and others here are trying to
fit. Sucks to be clearly better but have little or no market.

But here is some hope, do you want your car to run ARM8 so every hacker on
the planet can hack your car, or something different?

There is room for other instruction encodings if only for this reason; it
keeps POWER alive for IBM and will make a small market for Loongson in
China.

Quadibloc

unread,
May 1, 2021, 12:48:17 AM5/1/21
to
On Friday, April 30, 2021 at 6:54:38 PM UTC-6, gg...@yahoo.com wrote:

> But here is some hope, do you want your car to run ARM8 so every hacker on
> the planet can hack your car, or something different?

That's considered "security by obscurity", which isn't.

John Savard

MitchAlsup

unread,
May 1, 2021, 10:48:31 AM5/1/21
to
Only 3 letter agencies believe in security through obscurity.
>
> John Savard

Stefan Monnier

unread,
May 1, 2021, 11:54:26 AM5/1/21
to
To the extent that they only believe in it *for others*, I think they
are actually far from the only ones :-(


Stefan

Marcus

unread,
May 1, 2021, 3:17:48 PM5/1/21
to
Not sure that it's the same. Attack surfaces are attractive if enough
systems use them (or if there's a significant win). Having an uncommon
enough ISA, OS or system architecture will give you some level of
protection against the most common hacker attempts (for the hacker it's
a matter of effort vs outcome).

"Security by anomaly"? Still not a solid strategy though

/Marcus

Terje Mathisen

unread,
May 1, 2021, 4:29:08 PM5/1/21
to
Stuxnet proved that with a sufficiently interesting target, even a very
special cpu+api interface, plus air gap security, isn't enough to stop a
determined attacker.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stefan Monnier

unread,
May 1, 2021, 5:11:01 PM5/1/21
to
> Not sure that it's the same. Attack surfaces are attractive if enough
> systems use them (or if there's a significant win). Having an uncommon
> enough ISA, OS or system architecture will give you some level of
> protection against the most common hacker attempts (for the hacker it's
> a matter of effort vs outcome).
>
> "Security by anomaly"? Still not a solid strategy though

I'd call it "security by natural selection", in the sense that you
gain some safety by virtue of the others being slightly easier targets.


Stefan

Stefan Monnier

unread,
May 1, 2021, 5:20:48 PM5/1/21
to
> Stuxnet proved that with a sufficiently interesting target, even a very
> special cpu+api interface, plus air gap security, isn't enough to stop
> a determined attacker.

But if the attackers, while quite determined to attack someone, don't
care very much who they attack, running a somewhat unusual setup can be
enough.

Of course, nowadays many attacks don't care about the ISA, but if you
consider machine-language-level attacks, then maybe it could make sense
to use a kind of "randomized ISA" (along the same lines as ASLR).

I generally prefer language-based security, so I'd tend to presume that
implementing a "randomized ISA" feature wouldn't be worth the trouble,
but it's probably not much more complex than the kind of encrypted RAM
craziness that already exists.


Stefan

MitchAlsup

unread,
May 1, 2021, 6:25:30 PM5/1/21
to
On Saturday, May 1, 2021 at 4:20:48 PM UTC-5, Stefan Monnier wrote:
> > Stuxnet proved that with a sufficiently interesting target, even a very
> > special cpu+api interface, plus air gap security, isn't enough to stop
> > a determined attacker.
<
> But if the attackers, while quite determined to attack someone, don't
> care very much who they attack, running a somewhat unusual setup can be
> enough.
<
Attacks become much harder when only the address space of the user
is in the MMU. The sharing of the upper address space for the OS creates
a lot of opportunities for the attackers.
>
> Of course, nowadays many attacks don't care about the ISA, but if you
> consider machine-language-level attacks, then maybe it could make sense
> to use a kind of "randomized ISA" (along the same lines as ASLR).
<
Separating the address space of the JITer from the address space of
the JITed would do similar.

Brett

unread,
May 2, 2021, 12:21:07 AM5/2/21
to
There is also a time delay for hacks to distribute which is 90% of the
benefit of running POWER. IBM will have patched most exploits before enough
time has passed for the exploit to be ported to POWER. Besides of course
three letter agencies which only bother hostile enemies while spying on
all. I expect that I am spied on and don’t care.

The creator of a new exploit will go mass market to make the most from that
exploit, POWER is not what you would target first, which means basically
never as most exploits would be patched before the port started.

EricP

unread,
May 3, 2021, 1:04:28 PM5/3/21
to
MitchAlsup wrote:
> On Friday, April 30, 2021 at 12:00:48 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Monday, April 26, 2021 at 5:02:25 PM UTC-5, EricP wrote:
>>>> I have two ways which eliminate the N^2 table walk cost.
>>>>
>>>> In my MMU the top 4 address bits are an index into a hardware table
>>>> whose entries specify the method addresses in that range are translated.
>>> <
>>> Could these bits be placed in the Root Pointer ?
>>>
>>> But I also got to thinking that my top level table only uses 7-bits of the
>>> virtual address, indexing only 128 entries where there are 512 doublewords
>>> present. I "could" convert accesses into this level page from PTP/PTE
>>> into a porthole descriptor using 2 of the additional doublewords, so we
>>> would then have a base/bounds values and a PTP/PTE to the next level
>>> in the table. The 3rd added doubleword would hold things like the ASID,
>>> various flags for the OS/HV to use, and other stuff. <see below>
>>> <
>> I think you might have slightly misunderstood me.
>> This is all about enhancing x64 CR3 so that it allows new mapping methods.
>> In a normal non-virtual-machine context there is one CR3.
>> In a hypervisor context there are two, one controlled by the guest OS,
>> and a nested CR3 controlled by HV.
> <
> OK that part was unclear.

Ideally an OS, general purpose or HV, should be unable to determine
if it is running on a HV or bare metal.

We might be debugging a HV (guest) running on a HV (metal).
The HV (guest) might be running its own GP guests, Windows, Linux, etc.
to test for bugs.

However the hardware doesn't need to be optimized for
3 nested levels as there is insufficient market demand.

>> Currently x64 has CR3 containing the physical address of the page table
>> root frame. The table has 4 levels mapping a 48-bit virtual space.
>>
>> In that root frame, PTEs [0:255] cover the lower 2^47 virtual address
>> range and are typically associated with the current process user space,
>> and PTEs [256:511] cover the upper 2^47 virtual address range and
>> are typically associated with the OS kernel.
> <
> Yes, this is known of as a canonical address space mapping and all of
> the unused address bits have to be the same as the top address bit.
> <
>> There is nothing in the hardware that enforces those associations,
>> it is just convention.
> <
> AMD HW actually checks the untranslated address bits for being the same.
>> Viewing addresses as unsigned integers, the above layout puts the
>> kernel "system space" at high addresses, the process "user space"
>> at the low addresses, and has a giant dead zone between them.
> <
> In AMD parlance, the OS is in the negative address space.
>> Each core in the SMP system has its own CR3 which points at its own
>> private page table root frame. All cores common map OS system space
>> so the root PTEs [256:511] are the same.
>> However each core can have a different process mapped so they
>> have different root PTEs [0:255].
> <
> So at the top level, all tasks have the same data in the top 1/2 page
> of the controlling MMU. This seems to be a waste and the waste could
> have been minimized to a single PTP in the top page of the MMU table
> for a task.
> <

Yes, they could have done so with 2 root pointers with the
address msb selecting between them.
At least with multiple root pointers, it is an OS designers decision.

However there can also be times when you want the common mapped
system space for one SMP core to be slightly different.
To accomplish that, each core requires its own root (level 4) PTP,
and most of its entries point to shared level 3..1 PTPs.
The cost we are talking about is a few page frames per core,
and cores already have their own private data such as interrupt stack.

>> When the OS wants to map a new process so it can run its thread,
>> it copies up to 256 process PTEs from the process header into its
>> private page table root frame. This of course is optimized to only copy
>> the range that is actually valid. If any process root PTEs change,
>> the updates are written to the process header, then an
>> Inter-Processor Interrupt (IPI) informs all other cores to
>> re-copy process PTEs [0:255] into their private root frame.
> <
> This is where that single PTP would come in handy.
>> One issue I want to eliminate is the above PTE copying on process switch.
> <
> Given x86-64 MMU tables:: and that the OS is unlikely to exceed 2^39
> one could simply utilize one PTP in the negative address space to
> point at the page tables for that 39-bit address space and then you
> only have to copy one PTP to initialize that end of the address map.
> <

Yes, that is what I'm thinking.
Also each PTP root pointer has a level number so it can skip
right to the interior node level it wants.

For example, a PTP could point directly to a single level 2 page,
which contains mostly 2 MB PTE's.
That would cover 2^9 * 2^21 = 2^30 bytes, which is enough for
most uses, and eliminates most table walking.

>> ============================================
>> In my first cut at improving this, I make CR3 a 2-entry array,
>> indexed by the virtual address msb. That would make CR3[0] map the
>> lower process user space, and CR3[1] map the upper system space.
>>
>> There are now 2 page tables, one for the current process address space
>> and one for system address space.
> <
> And what would you do with an HV ? CR3 is a 3 entry array?
> <

(I'm just working this through as we talk, this is not fully thought out.)

This follows from the idea that an OS, GP or HV, should be
unable to tell whether it is running on metal or on an HV.
So each OS should think that it is managing its own MMU on metal.

The MMU mapping table is a 16 entry array indexed by address [63:60].

There are 2 sets of hardware MMU mapping registers,
one for guests and one for "self".

An OS always writes the "self" MMU map register to change its own map.
If it is running on metal then it actually writes the host MMU.

If it is on HV, then reading or writing the "self" MMU traps to HV.
Because the MMU HW supports 2 nested levels,
HV trap handler writes the guest's changes to the guest nested MMU.
HV writes changes to its own address space to its "self" MMU,
which usually would not trap because HV is running on metal.

But say we were running a guest GP OS, on a guest HV, on a HV on metal.
When the GP guest writes to its self-MMU, it traps to the guest HV.
When guest HV, thinking it is running on metal, writes to the guest MMU,
it actually traps to host HV, which must emulate the extra nested level.
(The emulation is like what hypervisors used to do on x86, before the
nested page tables, when a guest OS wrote to CR3.)

Does that make sense?
I think I get you.
I thought I could get away with two translate steps because they
were cheap: a 4:16 decode to select MMU map register, 64-bit add.
I think I see what you mean. The current translate mechanism uses
a single unified TLB to translate VA=>GA and GA=>PA in 1 lookup.

Whereas what I'm suggesting could require 2 distinct steps.
I'm understanding this now. Needs more thought.



MitchAlsup

unread,
May 3, 2021, 2:41:44 PM5/3/21
to
On Monday, May 3, 2021 at 12:04:28 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Friday, April 30, 2021 at 12:00:48 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Monday, April 26, 2021 at 5:02:25 PM UTC-5, EricP wrote:
> >>>> I have two ways which eliminate the N^2 table walk cost.
> >>>>
> >>>> In my MMU the top 4 address bits are an index into a hardware table
> >>>> whose entries specify the method addresses in that range are translated.
> >>> <
> >>> Could these bits be placed in the Root Pointer ?
> >>>
> >>> But I also got to thinking that my top level table only uses 7-bits of the
> >>> virtual address, indexing only 128 entries where there are 512 doublewords
> >>> present. I "could" convert accesses into this level page from PTP/PTE
> >>> into a porthole descriptor using 2 of the additional doublewords, so we
> >>> would then have a base/bounds values and a PTP/PTE to the next level
> >>> in the table. The 3rd added doubleword would hold things like the ASID,
> >>> various flags for the OS/HV to use, and other stuff. <see below>
> >>> <
> >> I think you might have slightly misunderstood me.
> >> This is all about enhancing x64 CR3 so that it allows new mapping methods.
> >> In a normal non-virtual-machine context there is one CR3.
> >> In a hypervisor context there are two, one controlled by the guest OS,
> >> and a nested CR3 controlled by HV.
> > <
> > OK that part was unclear.
> Ideally an OS, general purpose or HV, should be unable to determine
> if it is running on a HV or bare metal.
<
Yes, exactly. An OS may understand that it is running under an HV and
by way of this understanding perform less of its ancillary jobs (like page
swapping) allowing HV to do this, and SVCing to HV for certain common
functions (like file system and network stuff). And by doing so, be more
efficient and provide better performance.

However, the ability of an HV to give the complete illusion of "Gee, I think
I am running on bare metal" to any HV or OS it is "running" is required.
Can you give an example ?
Right now, I am thinking that "this MMU" table is the top page of where
the root pointer is pointing. The top page, instead of containing 512
PTPs/PTEs contains 128 portholes (a PTP with base and bounds
checking). As a memory based structure, it needs no context switch
maintenance overhead. The unused doubleword can contain things
like ASID, and maybe a foreign access mask of some sort.
>
> There are 2 sets hardware MMU mapping registers,
> one for guests and one for "self".
>
> An OS always writes the "self" MMU map register to change its own map.
> If it is running on metal then it actually writes the host MMU.
>
> If it is on HV, then reading or writing the "self" MMU traps to HV.
> Because the MMU HW supports 2 nested levels,
> HV trap handler writes the guest's changes to the guest nested MMU.
> HV writes changes to its own address space to its "self" MMU,
> which usually would not trap because HV is running on metal.
<
I found it "easier" all around to map all of these control registers into
memory and have HW snarf those accesses, that way, the host and
guest always simply write to memory and if they have been virtualized
the memory gets the values, but if they are actual, memory and the actual
control register gets the values. It takes the trapping out of the control.
>
> But say we were running a guest GP OS, on a guest HV, on a HV on metal.
> When the GP guest writes to its self-MMU, it traps to the guest HV.
> When guest HV, thinking it is running on metal, writes to the guest MMU,
> it actually traps to host HV, which must emulate the extra nested level.
> (The emulation is like what hypervisors used to do on x86, before the
> nested page tables, when a guest OS wrote to CR3.)
<
The memory abstraction is simply cleaner.
>
> Does that make sense?
<
Yes, makes perfect sense--IFF you are restricted to a control register architecture!
If you are not so restricted, having HW snarf memory accesses is much cleaner.
{You do not trap due to privilege, but to memory access protections, and 90%+
of the reason for having privilege disappears. If you can smite that last 10%, like
I did and Mill did, you no longer need the concept.}
For your amusement::
buffer to the 4:16 decoder is 1 gate
4:16 decode is 2 gates
Buffering up the 4:16 so it can read >48 & <64 bits for the add is 2 gates
Reading 1:16 registers to the adder is 2 gates of delay
>48 & <64 adder is 11 gates.
This is 18 gates of delay in total.
Yes because this TLB associates VA directly with PA ! ignoring the intermediates--
much like a "normal" TLB associates VA with PA and remembers nothing of the
table walk !

EricP

unread,
May 4, 2021, 11:50:56 AM5/4/21
to
MitchAlsup wrote:
> On Monday, May 3, 2021 at 12:04:28 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> <
>>> So at the top level, all tasks have the same data in the top 1/2 page of
>>> of the controlling MMU. This seems to be a waste and the waste could
>>> have been minimized to a single PTP in the top page of the MMU table
>>> for a task.
>>> <
>> Yes, they could have done so with 2 root pointers with the
>> address msb selecting between them.
>> At least with multiple root pointers, it is an OS designers decision.
>>
>> However there can also be times when you want the common mapped
>> system space for one SMP core to be slightly different.
> <
> Can you give an example ?
> <

We have an SMP system with 2 cores, c0 & c1.
The OS has data structures for each core containing its state,
indexed by the core number. But how does each core know its number?

On x86/x64 the FS and GS segment registers are repurposed in both
user and kernel mode to point to the structures of "this core".

But what if the hardware did NOT have such registers,
what do you do when you port the OS to that platform?

One way is to assign the same virtual address, say 1234,
to contain this information, like the core number but then set up
core c0 and c1 page tables to map that VA to different physical pages.
When c0 reads VA 1234 it gets 0, c1 reads 1234 and gets 1.
Now each core can index to its information in common memory.
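
In sketch form (the address and struct are made up for illustration;
the mapping is done by each core's private page table, so the same code
yields different data on each core):

#include <stdint.h>

typedef struct {
    uint32_t core_number;
    /* ... other per-core state ... */
} percpu_data;

/* Same VA on every core, mapped to a different frame per core. */
#define PERCPU_VA ((volatile percpu_data *)0x0000000000001234ull)

static inline uint32_t my_core_number(void)
{
    return PERCPU_VA->core_number;   /* c0 reads 0, c1 reads 1 */
}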

Another example: c0 page table root points to process P0 and
c1's root points to process P1. Both c0 and c1 need to
read and write their current process's page table entries,
which means each needs to see them mapped into its system space,
an array of PTE's starting at, say, kernel virtual address 8765_0000.
When c0 writes PTE's at that address, it modifies its current process P0.
When c1 writes PTE's at that address, it modifies its current process P1.

To achieve this the page frames mapped at VA 8765_0000 for core c0 are
different for c1, which means the chain of PTE's leading from c0's root
to those page frames is different from c1's.
Each core's kernel reads and writes to "my current process's page table",
but the current process is different, with different table frames.

Another example: there is a hardware device with memory mapped
control registers that's only physically addressable from core c0.
Only c0 needs to map those registers into its virtual space.

But most of the rest of the kernel code and data is common mapped,
where same virtual address denotes the same physical address.
These would share interior page table frames as much as possible.


robf...@gmail.com

unread,
May 4, 2021, 4:12:47 PM5/4/21
to
> But how does each core know its number?

Define a read-only core number register that returns the number of the core
accessing the register. The core number would need to be combined with the
hart_id in a multiprocessor system. Another alternative, which I think will
work if it is not possible to add a register, is to use a semaphore or other
atomic update to a value in memory.

For my current 64-bit mmu the top two address bits are “left over” and used
to index into a table of four root page directory entries. The ASID * 4 is
also used to index into the root page directory entries. This gives a
separate page table for each address space.

(PDE *) p = root_page_table[asid*4+vpn[63:62]];

Processes or tasks make use of address spaces, which is a common denominator,
so the thread id / core number / hart_id are not used. Everything is by the
address space.

Table depth is one to five page directories deep, depending on app needs
(upper levels of tables are bypassed for smaller apps). Each directory
consumes 10 bits of address space.

PTE’s contain more bookkeeping information than usual and are 20 bytes in
size. For instance the page key (20 bits) and asid number are stored in the
PTE in addition to the physical address page and virtual address page (each
24 bits).

A page table is 20kB in size; this gives 1024 x 20-byte page table entries.
Memory pages are 4kB so each page table is five memory pages in size.
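
A sketch of such a PTE (only the field sizes above are from the design;
the packing into five 32-bit words is an assumption):

#include <stdint.h>
#include <assert.h>

typedef struct {
    uint32_t ppn;     /* physical address page (24 bits used) */
    uint32_t vpn;     /* virtual address page (24 bits used)  */
    uint32_t key;     /* page key (20 bits used)              */
    uint32_t asid;    /* address space id                     */
    uint32_t flags;   /* valid, protection, bookkeeping       */
} pte20;

static_assert(sizeof(pte20) == 20, "PTE is 20 bytes");
/* 1024 PTEs x 20 B = 20 kB per page table = five 4 kB pages */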

EricP

unread,
May 4, 2021, 4:40:14 PM5/4/21
to
MitchAlsup wrote:
> On Monday, May 3, 2021 at 12:04:28 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> <
>>> And what would you do with an HV ? CR3 is a 3 entry array?
>>> <
>> (I'm just working this through as we talk, this is not fully thought out.)
>>
>> This follows from the idea that an OS, GP or HV, should be
>> unable to tell whether it is running on metal or on an HV.
>> So each OS should think that it is managing its own MMU on metal.
>>
>> The MMU mapping table is a 16 entry array indexed by address [63:60].
> <
> Right now, I am thinking that "this MMU" table is the top page of where
> the root pointer is pointing. The top page, instead of containing 512
> PTPs/PTEs contains 128 portholes (a PTP with base and bounds
> checking). As a memory based structure, it needs no context switch
> maintenance overhead. The unused doubleword can contain things
> like ASID, and maybe a foreign access mask of some sort.
>> There are 2 sets of hardware MMU mapping registers,
>> one for guests and one for "self".

Ok, if they have bounds then they act like MTRR's.
If those entries have a control field allowing one to select different
translation methods, and one can do arithmetic relocation,
then that is equivalent to a BAT.

>> I think I see what you mean. The current translate mechanism uses
>> a single unified TLB to translate VA=>GA and GA=>PA in 1 lookup.
> <
> Yes because this TLB associates VA directly with PA ! ignoring the intermediates--
> much like a "normal" TLB associates VA with PA and remembers nothing of the
> table walk !

>>>> Those contiguous lumps exist either because it is inherent in the
>>>> memory, like a graphics card, or because the Guest-OS has already
>>>> used its time end energy to collect it together.
>>>> So no reason for HV to uselessly repeat this process.
>>> <
>>> I effectively map VA->PA in the TLB, the repetition only occurs during
>>> table walk, and accelerators eliminate most actual memory accesses.
>>> <

I think I have an idea what this needs to do: the goal of the
unified TLB is to automatically load a single entry that maps
the largest range possible as defined by the guest and HV maps.

There are two nested levels, guest and host;
each can have its own TLB cache of its own entries so
that when walks are needed, they are as efficient as possible.
(I'm assuming such a TLB cache includes any BAT entries.)

And there is one unified TLB that covers both nested levels.

For AMD's current x64 nested tables, if the page sizes are different
between levels, then the unified TLB caches using the smaller page size.
There are just 3 page sizes, 4kB, 2MB and 1GB.
Also, there are no guest MTRR's.

So if we have something equivalent to guest and host MTRR/BAT's then
the question becomes when there are two of the above nested,
how does the unified TLB deal with BAT entries?

The unified TLB entry should cover as large a range as possible.
Lets assume the unified TLB uses a _ternary_ CAM so we can decide
on the fly which address bits to match and which are don't care.
Effectively allows it to match any naturally aligned page size.

A BAT entry only requires a single step to translate a block.
So its advantage for table walks is constant cost of 1.

If a page table page is nested on a PTP then the unified size
is the smaller of the two.
The same should apply for BATs, PTP on BAT, BAT on PTP, BAT on BAT,
the unified TLB entry is the intersection of the two size ranges.

On a unified TLB miss, it translates using the guest TLB,
passes that GA with the page/bat size to the HV TLB,
takes the intersection of the guest and HV levels,
and chooses the number of address bits to match creating single
unified entry that covers the largest mapping range possible.

E.g. a guest 6 GB BAT entry nested on a HV 10 GB BAT entry
produces a 6 GB unified range which, depending on the alignment,
separates into an aligned 4 GB entry plus a 2 GB entry
because TLB entries must align on natural boundaries.
The one containing the target address is loaded into the unified TLB
and the number of address bits to match gets set as either
[63:32] for 4 GB or [63:31] for 2 GB, pointing to the proper
offset in the HV's 10 GB BAT physical section.

Voila, we have automatically created a single unified TLB entry
covering the largest range of addresses possible based on the
current setup of the guest and HV mapping tables.
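
A sketch of that size pick, working in GA space and assuming both block
bases are themselves naturally aligned (helper and parameter names are
mine):

#include <stdint.h>

/* Intersect the guest block's GA image with the HV block, then take
   the largest naturally aligned power-of-two block containing the
   target GA.  Returns the unified TLB entry size in bytes. */
static uint64_t unified_block_size(uint64_t ga,
                                   uint64_t g_lo, uint64_t g_len,  /* guest block, GA image */
                                   uint64_t h_lo, uint64_t h_len)  /* HV block, GA range */
{
    for (uint64_t size = 1ull << 62; size >= 4096; size >>= 1) {
        uint64_t blk = ga & ~(size - 1);      /* natural alignment */
        if (size <= g_len && size <= h_len &&
            blk >= g_lo && blk - g_lo <= g_len - size &&
            blk >= h_lo && blk - h_lo <= h_len - size)
            return size;   /* ternary CAM matches bits [63:log2(size)] */
    }
    return 4096;                              /* fall back to a base page */
}

/* For the 6 GB-on-10 GB example above (both starting at GA 0), a target
   in the first 4 GB yields 4 GB; a target in the last 2 GB yields 2 GB. */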

MitchAlsup

unread,
May 4, 2021, 7:41:44 PM5/4/21
to
On Tuesday, May 4, 2021 at 10:50:56 AM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Monday, May 3, 2021 at 12:04:28 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> <
>>> So at the top level, all tasks have the same data in the top 1/2 page
> >>> of the controlling MMU. This seems to be a waste and the waste could
> >>> have been minimized to a single PTP in the top page of the MMU table
> >>> for a task.
> >>> <
> >> Yes, they could have done so with 2 root pointers with the
> >> address msb selecting between them.
> >> At least with multiple root pointers, it is an OS designers decision.
> >>
> >> However there can also be times when you want the common mapped
> >> system space for one SMP core to be slightly different.
> > <
> > Can you give an example ?
> > <
> We have an SMP system with 2 cores, c0 & c1.
> The OS has data structures for each core containing its state,
> indexed by the core number. But how does each core know its number?
<
When I was at Denelcor, ATT wanted a way more expensive license for
Unix if the cores could tell who they were. And the HW had no particular
mechanism for a thread to ask "Who am I"; indeed, due to task swapping
and other behind-the-scenes stuff, by the time it got a response to "who am
I" the response could be improper already........In any event HEP was a
machine that was inherently parallel with 64 user threads/tasks per CPU
and 64 OS threads/tasks per CPU. But since there was no way for a CPU
to get a useful response to such a query, ATT allowed us to use the lower
cost license..........but I digress......
>
> On x86/x64 the FS and GS segment registers are repurposed in both
> user and kernel mode to point to the structures of "this core".
>
> But what if the hardware did NOT have such registers,
> what do you do when you port the OS to that platform?
>
> One way is to assign the same virtual address, say 1234,
> to contain this information, like the core number but then set up
> core c0 and c1 page tables to map that VA to different physical pages.
> When c0 reads VA 1234 it gets 0, c1 reads 1234 and gets 1.
> Now each core can index to its information in common memory.
>
> Another example: c0 page table root points to process P0 and
> c1's root points to process P1. Both c0 and c1 need to
> read and write their current process's page table entries,
> which means each needs to see them mapped into its system space,
> an array of PTE's starting at, say, kernel virtual address 8765_0000.
<
At one place I worked, with multiple CPUs in a system (but each CPU was
its own core), the OS mapped an area of memory whereby, if one had a
physical address (like the bits from a PTP/PTE), one could add a known
constant (high-order bits) and access physical memory through virtual
memory.
<
So the OS could walk the page tables only a bit slower than the TLB
table walker could walk the page tables. Maybe this is a bit more
difficult now since the page tables are 2-D and there may be an HV involved.
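
A minimal sketch of that window, with a made-up PHYSMAP_BASE standing
in for the known constant:

#include <stdint.h>

#define PHYSMAP_BASE 0xFFFF800000000000ull  /* hypothetical, OS-chosen window base */

/* All of physical memory appears at PHYSMAP_BASE + pa, so any physical
   address pulled from a PTP/PTE is directly dereferenceable. */
static inline void *phys_to_virt(uint64_t pa) {
    return (void *)(uintptr_t)(PHYSMAP_BASE + pa);
}

/* A software table walk then just chases frame addresses through the window: */
static inline uint64_t read_pte(uint64_t table_pa, unsigned index) {
    return ((uint64_t *)phys_to_virt(table_pa))[index];
}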
<
But one thing I did in My 66000 was have the page fault handler get
the virtual address causing a page fault and the physical address of the
PTP/PTE which caused the exception to be raised. So the PFH did not have to
walk the tables to service the page fault.
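
As a hedged sketch (field and function names invented, not the actual
My 66000 definitions), the handler might receive something like:

#include <stdint.h>

void *phys_to_virt(uint64_t pa);   /* e.g. the window sketched above */

/* Hypothetical exception record: the walker reports both the faulting
   VA and the physical address of the PTP/PTE where it stopped. */
struct pf_info {
    uint64_t fault_va;   /* virtual address that faulted */
    uint64_t pte_pa;     /* physical address of the offending PTP/PTE */
};

void page_fault_handler(struct pf_info *pf) {
    uint64_t *pte = phys_to_virt(pf->pte_pa);  /* no table walk needed */
    /* ... allocate a frame, fill in *pte, resume ... */
    (void)pte;
}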
<
> When c0 writes PTE's at that address, it modifies its current process P0.
> When c1 writes PTE's at that address, it modifies its current process P1.
>
> To achieve this the page frames mapped at VA 8765_0000 for core c0 are
> different for c1, which means the chain of PTE's leading from c0's root
> to those page frames is different from c1's.
<
> Each core's kernel reads and writes to "my current process's page table",
> but the current process is different, with different table frames.
<
OK, I seem to be lost.

Why is one core writing to my current process page <table> any different
than thread[a] and thread[b] having different thread_global_data where
things like errno are kept on a per-thread basis ?
>
> Another example: there is a hardware device with memory mapped
> control registers that's only physically addressable from core c0.
> Only c0 needs to map those registers into its virtual space.
<
This sounds like a stupidly convoluted kernel model..........but that is just
my interpretation.
>
> But most of the rest of the kernel code and data is common mapped,
> where same virtual address denotes the same physical address.
> These would share interior page table frames as much as possible.
<
So why is the thread_global_data trick played out in non-privileged code
not workable here ?

MitchAlsup

unread,
May 4, 2021, 7:55:40 PM5/4/21
to
The only HW using a ternary CAM is the Mc 68851 (that I know of)
and we insiders know why.......

The alternative is to have a TLB for 4K, a TLB for 2M and a TLB for 1G
that add up to the number of entries you want (say 32 or 64 for FA,
256-1024 for L2 TLBs). Since the partitions are fixed, you can apply
the VA bits to all 3 sections and if a CAM matches, its data is read out
to the same bit lines, so only reload has to decide which section to write.
These are a lot more area- and power-efficient than ternary CAMs.
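
A toy C model of that organization (entry counts invented; the loop
stands in for what hardware does as one parallel CAM match):

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t vtag; uint64_t pbase; bool valid; } tlbe;

#define ENTRIES 16                      /* per-size section; sizes are made up */
static tlbe tlb4k[ENTRIES], tlb2m[ENTRIES], tlb1g[ENTRIES];

/* One section: the loop stands in for a parallel CAM match. */
static bool probe(const tlbe *t, uint64_t va, unsigned shift, uint64_t *pa) {
    for (int i = 0; i < ENTRIES; i++)
        if (t[i].valid && t[i].vtag == (va >> shift)) {
            *pa = t[i].pbase | (va & ((1ull << shift) - 1));
            return true;
        }
    return false;
}

/* All three fixed partitions see the VA at once; at most one hits,
   so a single set of output bit lines can be shared. */
static bool translate(uint64_t va, uint64_t *pa) {
    return probe(tlb4k, va, 12, pa) ||
           probe(tlb2m, va, 21, pa) ||
           probe(tlb1g, va, 30, pa);
}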
>
> A BAT entry only requires a single step to translate a block.
> So its advantage for table walks is constant cost of 1.
>
> If a page table page is nested on a PTP then the unified size
> is the smaller of the two.
> The same should apply for BATs, PTP on BAT, BAT on PTP, BAT on BAT,
> the unified TLB entry is the intersection of the two size ranges.
>
> On a unified TLB miss, it translates using the guest TLB,
> passes that GA with the page/BAT size to the HV TLB,
> takes the intersection of the guest and HV levels,
> and chooses the number of address bits to match, creating a single
> unified entry that covers the largest mapping range possible.
>
> E.g. a guest 6 GB BAT entry nested on a HV 10 GB BAT entry
> produces a 6 GB unified entry which, depending on the alignment,
> separates into an aligned 4 GB entry plus a 2 GB entry
> because TLB entries must align on natural boundaries.
> The one containing the target address is loaded into the unified TLB
> and the number of address bits to match gets set as either
> [63:32] for 4 GB or [63:31] for 2 GB, pointing to the proper
> offset in the HV's 10 GB BAT physical section.
>
> Voila, we have automatically created a single unified TLB entry
> covering the largest range of addresses possible based on the
> current setup of the guest and HV mapping tables.
<
The 2D page table in My 66000 can have the guest map VA->GA
using large pages, and still allow the HV to map GA->PA using
std page sizes. This allows the OS to be given the illusion it needs
to do nothing about paging (such as DOS) and allows the HV to
perform these duties.

Ivan Godard

unread,
May 4, 2021, 8:27:55 PM5/4/21
to
So clue us outsiders!



MitchAlsup

unread,
May 4, 2021, 9:40:04 PM5/4/21
to
So, a normal CAM has a 2-transistor comparator making an 8-T cell.
You can write a value into the cell using a pair of bit lines and a select line
You can read a value from the cell using a pair of bit lines and a select line
.........this is a seldom used function but present for completeness.
You can match a value in the cell using a pair of bit lines and driving a
.........single match line which is wired ORed to other match lines.

This 8-T cell is about 2× larger than a std SRAM cell--mainly because you
can no longer share the n-channel diffusion between adjacent cells in the
bit line direction and the bit line capacitance increases by a harmful amount.

If the comparator transistor pair is set up for matching, the comparator
transistors are P-channel; if the pair is set up for mismatching, the
comparator transistors are N-channel. N-channels are faster and stronger.

The Mc68851 used 2 SRAM cells side by side, one to record the true bit line,
the other to record the complement bit line. The comparator used one value
from the true cell and one value from the complement cell. A value of 00
would always mismatch, a value of 10 would mismatch if the bit lines
were 01, a value of 01 would mismatch if the bit lines were 10, and
a 11 would never mismatch (it did not participate.)
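
Modeled behaviorally in C (an illustration of the matching rule only,
nothing circuit-level):

#include <stdint.h>
#include <stdbool.h>

/* Per stored bit: {t=1,c=0} matches key 1, {t=0,c=1} matches key 0,
   {t=1,c=1} is don't-care (never mismatches), {t=0,c=0} always mismatches. */
static bool bit_matches(bool tcell, bool ccell, bool key_bit) {
    return key_bit ? tcell : ccell;
}

/* One 64-bit entry; the loop models the wire-OR of per-bit mismatches
   pulling the entry's match line low. */
static bool entry_matches(uint64_t tcells, uint64_t ccells, uint64_t key) {
    for (int i = 0; i < 64; i++)
        if (!bit_matches((tcells >> i) & 1, (ccells >> i) & 1, (key >> i) & 1))
            return false;
    return true;
}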

So this cell is about 3× the size of a std SRAM cell and 150% the size of a
standard CAM cell. Due to the size of the cells, the wire delay in and out
increases, and the write logic is a "bit" more complicated.

You need the cam to store 3 state, and this fundamentally re

MitchAlsup

unread,
May 4, 2021, 9:46:18 PM5/4/21
to
> You need the cam to store 3 states, and this fundamentally requires 2 bits

{There seems to be some key sequence I can type that posts the message
before I am done typing and have hit the "post message" button}

EricP

unread,
May 5, 2021, 12:18:01 PM5/5/21
to
MitchAlsup wrote:
> On Tuesday, May 4, 2021 at 10:50:56 AM UTC-5, EricP wrote:
>> When c0 writes PTE's at that address, it modifies its current process P0.
>> When c1 writes PTE's at that address, it modifies its current process P1.
>>
>> To achieve this the page frames mapped at VA 8765_0000 for core c0 are
>> different for c1, which means the chain of PTE's leading from c0's root
>> to those page frames is different from c1's.
> <
>> Each core's kernel reads and writes to "my current process's page table",
>> but the current process is different, with different table frames.
> <
> OK, I seem to be lost.
>
> Why is one core writing to my current process page <table> any different
> than thread[a] and thread[b] having different thread_global_data where
> things like errno are kept on a per-thread basis ?

Errno is likely in Thread Local Store, which on WinNT is dynamically
allocated for each thread and not done with 'stupid mapping tricks'.

I probably screwed up the explanation. Let me try again.

The question was why each core would set up its system space differently.

The short answer is because each core can map a different process,
and it needs read-write access to that process's page tables in order
to manage them. To get that RW access, the core must set up PTEs in
its system space that point to the process page table frames.
As each core can map a different process, each must have a different
page table root for the tree containing the PTEs that map its page table.

(Using a 32-bit, 3-level x86 table with 8-byte PTEs as reference
because it is easier.)

The short answer on how that is accomplished is that we note
that the process's PT2-level frames already contain PTEs that
point at all the valid PT1-level frames.

If we reuse those process PT2 frames as though they are PT1 frames,
then everything they point at appears in this core's virtual space.
And what do they point at? The PT1 page frames.

So by patching just two level-2 PTEs to point to the level-2 table frames,
reusing the level-2 table frames as though they were level-1 table frames,
the whole process page table appears in the core's virtual space
as RW data pages. Two other patches are needed to map the process
level-2 table frames in as RW data pages.

Which PTEs we patch decides where in the virtual address space it appears.
In 32-bit WinNT the page table is 512k PTEs, 4 MB starting at VA 0xC000_0000,
and the process-space portion of that is PTE entries [0:256k].
The level-2 table would be located just above the level-1 table.

The technical requirement to do the above is that level-2 PTEs need to be
compatible with level-1 PTEs as far as the hardware walker is concerned.

I just realized that if the page table allows skipping levels then its
design needs to coordinate with reuse of table frames at different levels.
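
A hedged sketch of the resulting address arithmetic, using the 32-bit,
3-level, 8-byte-PTE figures above (constants patterned on the WinNT
layout mentioned, not taken from any real kernel):

#include <stdint.h>

#define PT_BASE 0xC0000000u           /* where the patched PTEs make level-1 appear */
#define PT_SPAN 0x400000u             /* 512k PTEs x 8 bytes = 4 MB, per the post */
#define PD_BASE (PT_BASE + PT_SPAN)   /* level-2 frames mapped just above level-1 */

/* Because each core has its own root, this same window names *its*
   current process's table. */

/* Virtual address of the level-1 PTE mapping 'va' in the current process. */
static inline uint64_t *pte_va(uint32_t va) {
    return (uint64_t *)(PT_BASE + ((va >> 12) * 8));
}

/* Virtual address of the level-2 PDE covering 'va'
   (one 4 kB level-1 frame of 512 8-byte PTEs spans 2 MB). */
static inline uint64_t *pde_va(uint32_t va) {
    return (uint64_t *)(PD_BASE + ((va >> 21) * 8));
}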

Paul A. Clayton

unread,
May 5, 2021, 1:11:01 PM5/5/21
to
On Tuesday, May 4, 2021 at 7:55:40 PM UTC-4, MitchAlsup wrote:
[snip]
> The only HW using a ternary CAM is the Mc 68851 (that I know of)
> and we insiders know why.......

MIPS traditionally had fully associative TLBs with paired-power-of-two
page size (two aligned 4, 16, 64, 256 KiB etc. pages); not quite full
ternary (as two tag bits can share a mask bit). Itanium 2 L2 TLBs
also supported multiple page sizes with full associativity. I think
Zen's L1 TLB also used masked CAMs for two page sizes (4 KiB
[32 KiB clusters?] and 2 MiB).

> The alternative is to have a TLB for 4K, a TLB for 2M and a TLB for 1G
> that add up to the number of entries you want (say 32 or 64 for FA,
> 256-1024 for L2 TLBs). Since the partitions are fixed, you can apply
> the VA bits to all 3 sections and if a CAM matches, its data is read out
> to the same bit lines, so only reload has to decide which section to write.
> These are a lot more area- and power-efficient than ternary CAMs.

I do wish that my idea of using huge page entries for PDE entries
was adopted. Such reduces the utilization issue associated with
huge page TLB entries (a memory region will either use a huge page
or a PDE, so a larger number of 2 MiB entries may be reasonable,
though such does increase the data memory relative to huge pages
[unless one extends the physical address space for huge pages as
x86 did with PAE]). I have not read anything to indicate that such is
a bad design point, but I am not a hardware designer.

Another design point would be to use huge page size entries for
tag compression for base page size entries. The huge page entries
could indicate if mapped to a PDE or huge PTE with an index
(possibly shrunk by the tables not being fully associative) to the
appropriate table for the data. A 2 MiB *page* hit would immediately
know nine bits (and any permission et al. metadata common between
PTEs and PDEs) and the huge page extension data table might be
able to be close enough (and simply-enough-addressed) to fit
translation latency requirements.

(Another, unlikely to be used, trick would be using smaller tag "huge
page" entries for sign or zero extended base page size entries.)

For L2 TLBs, hash-rehash seems to be the preferred mechanism
(even though such wastes tag and data bits for huge pages) when
the number of page sizes is small and well-spread. (For small and
compact, Seznec's overlaid skewed associative TLB has some
attraction, especially if size prediction or constraint can be used
to reduce look-up associativity.)
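
A rough C model of such a hash-rehash probe (organization and sizes
invented; real designs differ):

#include <stdint.h>
#include <stdbool.h>

#define SETS 256
#define WAYS 4

typedef struct { uint64_t vtag; uint64_t pbase; uint8_t shift; bool valid; } l2e;
static l2e l2tlb[SETS][WAYS];

/* Probe once assuming a particular page size: the set index and tag
   are both derived from that size's shift. */
static bool probe_size(uint64_t va, unsigned shift, uint64_t *pa) {
    unsigned set = (unsigned)((va >> shift) & (SETS - 1));
    for (int w = 0; w < WAYS; w++) {
        const l2e *e = &l2tlb[set][w];
        if (e->valid && e->shift == shift && e->vtag == (va >> shift)) {
            *pa = e->pbase | (va & ((1ull << shift) - 1));
            return true;
        }
    }
    return false;
}

/* Hash with the base page size first, then rehash per huge size. */
static bool l2_lookup(uint64_t va, uint64_t *pa) {
    return probe_size(va, 12, pa) ||   /* 4 kB */
           probe_size(va, 21, pa) ||   /* 2 MB */
           probe_size(va, 30, pa);     /* 1 GB */
}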

By the way, one does not need two bits for three states; the <1.6
bits required are just less convenient for implementation.☺

MitchAlsup

unread,
May 5, 2021, 1:17:28 PM5/5/21
to
On Wednesday, May 5, 2021 at 12:11:01 PM UTC-5, Paul A. Clayton wrote:
> On Tuesday, May 4, 2021 at 7:55:40 PM UTC-4, MitchAlsup wrote:
> [snip]
> > The only HW using a ternary CAM is the Mc 68851 (that I know of)
> > and we insiders know why.......
<
> MIPS traditionally had fully associative TLBs with paired-power-of-two
> page size (two aligned 4, 16, 64, 256 KiB etc. pages); not quite full
> ternary (as two tag bits can share a mask bit). Itanium 2 L2 TLBs
> also supported multiple page sizes with full associativity. I think
> Zen's L1 TLB also used masked CAMs for two page sizes (4 KiB
> [32 KiB clusters?] and 2 MiB).
<
I built something like that for SPARC at Ross. It still occupied a lot
more space than a textbook FA CAM. And yes it had storage cells
that when set would mask off address bits from the comparison
and when clear would allow those cells to participate.
LoL

Terje Mathisen

unread,
May 5, 2021, 4:41:35 PM5/5/21
to
Even when you combine a pair of them, you get 9 states and still need 4
bits: It is only when you can stuff 3 or 5 of them together that you can
save a bit (or two):

3^3 = 27 -> 5 bits
3^5 = 243 -> 8 bits
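
A quick C illustration of the packing that realizes those savings
(five trits per byte):

#include <stdint.h>

/* Pack five ternary digits (3^5 = 243 <= 256 states) into one byte. */
static uint8_t pack5(const uint8_t t[5]) {   /* each t[i] in {0,1,2} */
    uint16_t v = 0;
    for (int i = 0; i < 5; i++)
        v = (uint16_t)(v * 3 + t[i]);        /* base-3 accumulate, max 242 */
    return (uint8_t)v;
}

static void unpack5(uint8_t b, uint8_t t[5]) {
    unsigned v = b;
    for (int i = 4; i >= 0; i--) { t[i] = (uint8_t)(v % 3); v /= 3; }
}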

a...@littlepinkcloud.invalid

unread,
May 6, 2021, 6:42:29 AM5/6/21
to
MitchAlsup <Mitch...@aol.com> wrote:
> On Monday, May 3, 2021 at 12:04:28 PM UTC-5, EricP wrote:
>
>> Ideally an OS, general purpose or HV, should be unable to determine
>> if it is running on a HV or bare metal.
> <
> Yes, exactly. An OS may understand that it is running under an HV and
> by way of this understanding perform less of its ancillary jobs (like page
> swapping) allowing HV to do this, and SVCing to HV for certain common
> functions (like file system and network stuff). And by doing so, be more
> efficient and provide better performance.
>
> However, the ability of an HV to give the complete illusion of "Gee, I think
> I am running on bare metal" to any HV or OS it is "running" is required.

I remember chatting to a Linux kernel architect about an IBM Power box
under his desk. I asked him how many processors were in there, and he
said "I think it's sixteen, but I'm not sure." I was surprised he
didn't know, and he replied with something to the effect that the
hypervisor said there were sixteen, but that might not be true,
because the hypervisor, running on its own processor, could fake a
configuration. IBM's virtualization was so good that the Linux kernel
would work perfectly, and always runs virtualized. From what I
remember, there wasn't any way to run without a hypervisor and neither
was there any point to doing so.

Andrew.

EricP

unread,
May 7, 2021, 8:42:44 AM5/7/21
to
MitchAlsup wrote:
> On Tuesday, May 4, 2021 at 7:27:55 PM UTC-5, Ivan Godard wrote:
>> On 5/4/2021 4:55 PM, MitchAlsup wrote:
>>> On Tuesday, May 4, 2021 at 3:40:14 PM UTC-5, EricP wrote:
>>>> MitchAlsup wrote:
>>>>> On Monday, May 3, 2021 at 12:04:28 PM UTC-5, EricP wrote:
>>>>>> MitchAlsup wrote:
>>>>>>> <
>>>>>>> And what would you do with an HV ? CR3 is a 3 entry array?
>>>>>>> <
>>>>>> (I'm just working this through as we talk, this is not fully thought out.)
>>>>>>
>>>>>> This follows from the idea that an OS, GP or HV, should be
>>>>>> unable to tell whether it is running on metal or on an HV.
>>>>>> So each OS should think that it is managing its own MMU on metal.
>>>>>>
>>>>>> The MMU mapping table is a 16 entry array indexed by address [63:60]..
> ..........this is a seldom used function but present for completeness.
> You can match a value in the cell using a pair of bit lines and driving a
> ..........single match line which is wired ORed to other match lines.
>
> This 8-T cell is about 2× larger than a std SRAM cell--mainly because you
> can no longer share the n-channel diffusion between adjacent cells in the
> bit line direction and the bit line capacitance increases by a harmful amount.
>
> If the comparator transistor pair is set up for matching, the comparator
> transistors are P-channel; if the pair is set up for mismatching, the
> comparator transistors are N-channel. N-channels are faster and stronger.
>
> The Mc68851 used 2 SRAM cells side by side, one to record the true bit line,
> the other to record the complement bit line. The comparator used one value
> from the true cell and one value from the complement cell. A value of 00
> would always mismatch, a value of 10 would mismatch if the bit lines
> were 01, a value of 01 would mismatch if the bit lines were 10, and
> a 11 would never mismatch (it did not participate.)
>
> So this cell is about 3× the size of a std SRAM cell and 150% the size of a
> standard CAM cell. Due to the size of the cells, the wire delay in and out
> increases, and the write logic is a "bit" more complicated.
>
> You need the cam to store 3 states, and this fundamentally requires 2 bits

Sure, a TCAM is more expensive than a CAM, which is more expensive than
registers. But designers regularly do things like double up the register file
to get more read ports. Why is that ok but spending an equivalent
or even lesser amount on a TCAM is not, particularly if it gets
functionality that cannot be obtained by other means?




MitchAlsup

unread,
May 7, 2021, 11:57:23 AM5/7/21
to
It comes down to how few clocks one can run the LD pipeline in. And
whereas when one doubles the register file for more read ports, the
speed of the file is not harmed, when you use TCAMs for your TLB
you often miss your speed target. Sure you can do it if you are
willing to add a clock to the load pipeline, but can you do TCAMs and
not add a clock to the load pipeline? AND not lose clock frequency !

George Neuner

unread,
May 7, 2021, 7:51:23 PM5/7/21
to
On Tue, 4 May 2021 18:46:17 -0700 (PDT), MitchAlsup
<Mitch...@aol.com> wrote:

>{There seems to be some key sequence i can type that posts the message
>before I am done typing...and hit the "post message" button}

May help to look at whatever "accessibility" features your OS offers.
If you can get a sound or visual indication of special keys being
pressed, you can train yourself to short-circuit the sequence.

YMMV,
George

MitchAlsup

unread,
May 7, 2021, 8:23:15 PM5/7/21
to
As a 60 words per minute typist, the message is gone before any info
could ever show up on the screen. But I'm sure it has something to do
with hitting the control key along with a key the left hand is over.
>
> YMMV,
> George

Stefan Monnier

unread,
May 7, 2021, 10:01:08 PM5/7/21
to
> As a 60 words per minute typist, the message is gone before any info
> could ever show up on the screen. But I'm sure it has something to do
> with hitting the control key along with a key the left hand is over.

In Emacs you can hit `C-h l` to view the last hundred or so keys you hit
(and the command that they ran). It seems a basic enough feature that
other editors probably offer it as well.


Stefan

Quadibloc

unread,
May 8, 2021, 8:54:30 AM5/8/21
to
On Thursday, May 6, 2021 at 4:42:29 AM UTC-6, a...@littlepinkcloud.invalid wrote:

> I remember chatting to a Linux kernel architect about an IBM Power box
> under his desk. I asked him how many processors were in there, and he
> said "I think it's sixteen, but I'm not sure." I was surprised he
> didn't know, and he replied with something to the effect that the
> hypervisor said there were sixteen, but that might not be true,
> because the hypervisor, running on its own processor, could fake a
> configuration. IBM's virtualization was so good that the Linux kernel
> would work perfectly, and always runs virtualized. From what I
> remember, there wasn't any way to run without a hypervisor and neither
> was there any point to doing so.

Hmm. When one turns the box on, doesn't one talk to the hypervisor
at least for long enough to choose an installed operating system to
use? And if so, why couldn't one ask the hypervisor how many cores
there were?

John Savard

MitchAlsup

unread,
May 8, 2021, 10:24:15 AM5/8/21
to
What/who are you going to ask that question if an HV is running OSs on
a virtualized HV? The real one or the fake one?

I remember a story told by Lynne and Anne where an IBM executive
was asking why that machine over there was running virtualized, and
it turned out the overhead was down in the 2% region, and they could
debug the virtualized HV under a real HV, finding those situations where
the VHV would/could crash without actually crashing the machine and the
other OSs and applications it was running.
<
>
> John Savard

Thomas Koenig

unread,
May 8, 2021, 12:23:08 PM5/8/21
to
MitchAlsup <Mitch...@aol.com> schrieb:

> What/who are you going to ask that question if a HV is running OSs on
> a virtualized HV ? The real one or the fake one ?

> I remember a story told by Lynne and Anne where an IBM executive
> was asking why that machine over there was running virtualized, and
> it turned out the overhead was down in the 2% region, and they could
> debug the virtualized HV under a real HV, finding those situations where
> the VHV would/could crash without actually crashing the machine and the
> other OSs and applications it was running.

They probably ran this under OS/VU.

http://www.weathergraphics.com/tim/ibm.htm

EricP

unread,
May 11, 2021, 11:55:06 AM5/11/21
to
It is convoluted!

Not Your Parents' Physical Address Space, 2015 usenix.org
http://www.barrelfish.org/publications/pas_hotos15.pdf


EricP

unread,
May 11, 2021, 1:02:36 PM5/11/21
to
(Looks like TCAMs get used a lot in network routers, so that's where the
research has been, rather than TLBs, which have stable designs.)

These folks built a TCAM with 11 transistors that can do a
16 row by 16-bit match in about 100 ps (I don't think that
time includes a priority selector, just match line output):

[paywalled]
Energy and Area Efficient 11-T Ternary Content Addressable Memory
for High-Speed Search 2019
https://ieeexplore.ieee.org/abstract/document/8754423

WRT MTRR's, Intel & AMD seem to be able to hang a bunch of those
on the address lines, manual says 96.

The address range binary comparator looks similar in cost per entry to a TCAM.
Testing A <= B doesn't require a full (64-12=) 52-bit subtract with
lookahead - it can be done by XOR'ing the bits, doing a FF1 from msb to lsb
to select the highest non-equal pair, and looking at B's bit in that position.
So the address line loading for a range compare is 2 XOR's per entry
and the delay is mostly for a 52-bit FF1.
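
Modeled in C, with a compiler builtin standing in for the FF1:

#include <stdint.h>
#include <stdbool.h>

static bool addr_le(uint64_t a, uint64_t b) {
    uint64_t diff = a ^ b;                 /* 1s mark every differing bit pair */
    if (diff == 0) return true;            /* equal */
    int msb = 63 - __builtin_clzll(diff);  /* FF1 from msb: highest non-equal bit */
    return (b >> msb) & 1;                 /* B owns the 1 there => B > A => A < B */
}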

It should be possible to have both TCAM and RR entries in a unified TLB.

There are a lot of people looking at and writing papers on nested page
table cost issues. A lot are looking at solutions similar to the ones we have
discussed: some form of block translation to eliminate table walks.



Quadibloc

unread,
May 11, 2021, 2:04:12 PM5/11/21
to
I just did a quick search: ternary content addressable memories
don't store the content in base-3; instead, they allow search
words with "don't care" bits.

John Savard

MitchAlsup

unread,
May 11, 2021, 3:42:05 PM5/11/21
to
Which is why you end up with 2 flip-flops to store the available states.
>
> John Savard

MitchAlsup

unread,
May 12, 2021, 3:33:57 PM5/12/21
to
After reading the paper, and considering the topics for a few days,
I think one could pull off such an abstraction in the My 66000 Memory
management tables.