
x86-S


Dan Cross

May 20, 2023, 1:07:18 PM
Likely to be filed under the "Too Little, Too Late" category:
Intel has recently put forth a proposal for simplifying the x86_64
architecture and, in particular, discarding some of the legacy
baggage of a 45-year-old architecture. Details, including a
link to the specifically proposed architectural changes, are
here:
https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html

Seemingly the biggest additions from an OS writer's perspective,
assuming a 64-bit system, are:

1. A pair of MSRs for jumping between using 4 and 5 level page
tables without the need to trampoline through 32-bit
protected mode and disabling paging as an intermediate
step, and
2. A change to the SIPI sequence for MP startup where APs come
up directly in 64-bit mode, with paging enabled. This is
accomplished by introduction of a new MSR where one can put
a pointer to a small data structure that includes initial
values for %rip, %cr0, %cr3, and %cr4.

In tandem, a whole bunch of stuff is eliminated: legacy 16 bit
modes, call gates, segment limits and base for 32-bit mode, etc.

Frankly, this seems like putting lipstick on a pig: all of this
seems "nice", I suppose, but I don't see the point of much of
it. Consider the 64-bit SIPI sequence, for example: sure, this
eliminates some boilerplate trampoline code at AP bringup, but
a) that code isn't very hard to write, and b) once it's written
it is not as though one goes around often changing it. I
suppose that a benefit is that an AP may be able to start
running against an address space that is defined by a page
table with the PML4 or PML5 somewhere above 4GiB in the physical
address space, but that seems like a minor win in the grand
scheme of things.

The L4<->L5 paging thing appears useful at first blush, but how
often is one doing that on a single system? It seems unlikely
to be particularly useful in practice, especially since
paging is under the control of the operating system: to what end
would it oscillate back and forth between two page-table depths?

On the other hand, many annoying vestiges of the past are left:
the TSS continues to have a stack table for mapping kernel
stacks (why not just make those MSRs?); an opportunity to
simplify (or eliminate) the IDT was lost; segmentation very much
remains part of the architecture (and part of the 64-bit syscall
mechanism!); removal of "unrestricted guest mode" from VMX makes
writing a VMM to support legacy operating systems that much
harder.

So it begs the question: what is the point of this proposal? It
doesn't seem to add much that is particularly useful, while
removing some things that are (unrestricted guest mode) and
ignoring most of the historical barnacles of the architecture.

Sorry, Intel: I know it's your cash cow, but it's time to put x86
to bed. This won't change that.

- Dan C.

JJ

May 21, 2023, 6:20:05 AM
On Sat, 20 May 2023 17:07:16 -0000 (UTC), Dan Cross wrote:
> Likely to be filed under the "Too Little, Too Late" category:
> Intel has recently put forth a proposal for simplifying the x86_64
> architecture and, in particular, discarding some of the legacy
> baggage of a 45-year-old architecture. Details, including a
> link to the specifically proposed architectural changes, are
> here:
> https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
>
> Seemingly the biggest additions from an OS writer's perspective,
> assuming a 64-bit system, are:
>
> 1. A pair of MSRs for jumping between using 4 and 5 level page
> tables without the need to trampoline through 32-bit
> protected mode and disabling paging as an intermediate
> step, and
> 2. A change to the SIPI sequence for MP startup where APs come
> up directly in 64-bit mode, with paging enabled. This is
> accomplished by introduction of a new MSR where one can put
> a pointer to a small data structure that includes initial
> values for %rip, %cr0, %cr3, and %cr4.
>
> In tandem, a whole bunch of stuff is eliminated: legacy 16 bit
> modes, call gates, segment limits and base for 32-bit mode, etc.
[snip]

Intel's i64 version 2.

And I bet that, if it's adopted, the CPU won't be any cheaper than x86-64,
despite being mostly just a trimmed-down version of x86-64.

Luke A. Guest

May 21, 2023, 6:24:35 AM
On 20/05/2023 18:07, Dan Cross wrote:
> Likely to be filed under the "Too Little, Too Late" category:
> Intel has recently put forth a proposal for simplifying the x86_64
> architecture and, in particular, discarding some of the legacy
> baggage of a 45-year-old architecture. Details, including a
> link to the specifically proposed architectural changes, are
> here:
> https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html

Or just dump the archaic x86(-64) arch and go full RISC; with current
manufacturing technology, they'd be speedy as fuck. Apparently all x86
chips have been RISC underneath for ages now anyway.

Dan Cross

May 21, 2023, 10:20:05 AM
In article <u4crh1$1hngg$1...@dont-email.me>,
The issue there is backwards compatibility with an enormous
installed base. Intel tried back in the early 00s with the
Itanium; it did not go well for them. They had previously tried
back in the 80s with the i432 and that was even worse. In an
amazing display of lack of self-awareness, they made many of the
same mistakes with Itanium that they had previously made with
i432 (waiting for the perfect compiler to make the thing go fast
was a mistake with both projects). They made similar errors
with the i860, too, and made other RISC mistakes with the i960
(which was marketed as a microcontroller)---notably, the 960
used register windows, like SPARC and the original Berkeley RISC
machine. History has shown this seemingly nifty idea not to be
that great in practice.

Modern x86 is a weird beast; I understand it's a pretty standard
dataflow processor underneath all of the x86 goo. The x86 stuff
is weird, but is actually a relatively small percentage of the
overall die surface (like, 5% or something). Some compiler
writers think of it as a relatively compact bytecode sitting
over a RISC core, with the side-benefit of being relatively
cache efficient. I'm not sure I buy that argument, particularly
as something like RISC-V, with its compressed opcodes, is both
a very clean RISC and competitively compact in terms of, e.g.,
icache space.

- Dan C.

Scott Lurndal

May 21, 2023, 10:49:05 AM
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>Likely to be filed under the "Too Little, Too Late" category:
>Intel has recently put forth a proposal for simplifying the x86_64
>architecture and, in particular, discarding some of the legacy
>baggage of a 45-year-old architecture. Details, including a
>link to the specifically proposed architectural changes, are
>here:
>https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
>
>Seemingly the biggest additions from an OS writer's perspective,
>assuming a 64-bit system, are:
>
>1. A pair of MSRs for jumping between using 4 and 5 level page
> tables without the need to trampoline through 32-bit
> protected mode and disabling paging as an intermediate
> step, and
>2. A change to the SIPI sequence for MP startup where APs come
> up directly in 64-bit mode, with paging enabled. This is
> accomplished by introduction of a new MSR where one can put
> a pointer to a small data structure that includes initial
> values for %rip, %cr0, %cr3, and %cr4.
>
>In tandem, a whole bunch of stuff is eliminated: legacy 16 bit
>modes, call gates, segment limits and base for 32-bit mode, etc.
>
>Frankly, this seems like putting lipstick on a pig: all of this
>seems "nice", I suppose, but I don't see the point of much of
>it.

It certainly will simplify the tasks for the architecture team,
the RTL implementation team(s), the verification team(s), the
post-silicon team(s), and the software folks. That's not something
to be sneezed at; it's a lot of work verifying the legacy support,
with all the odd corner cases and fifty-year-old test cases. There
will be some, likely
minor, improvements in area utilization at the current node, so perhaps
they can squeeze in another core or two in the same area. Or
additional memory controllers. etc.

I've heard rumblings about this for a decade now from colleagues
(mostly former intel processor designers).

Dan Cross

May 21, 2023, 12:17:10 PM
In article <2aqaM.646077$5S78....@fx48.iad>,
This is the thing though: how much of a difference will _these_
changes really make here? I'll certainly concede that they will
make some difference, but is it enough to be worthwhile given
the churn this will force on software writers?

Of all the weirdness in x86, eliminating this stuff seems like
pretty small potatoes. I mean, they could get rid of
segmentation entirely in 64-bit mode, and that seems like it
would have a bigger effect (no more `POP SS` style problems), or
they could fix NMI handling, or make the ISA generally more
orthogonal and predictable, so that you could reason about it
without resorting to deep study of Agner's site....

>There will be some, likely
>minor, improvements in area utilization at the current node, so perhaps
>they can squeeze in another core or two in the same area. Or
>additional memory controllers. etc.

I dunno.... Looking at the massive surface area of SIMD units
and their associated caches and registers on Xeon cores, I doubt
this is the limiting factor. Still, more memory controllers are
always welcome.

>I've heard rumblings about this for a decade now from colleagues
>(mostly former intel processor designers).

That's intriguing; I wonder what this says about x86 at Intel
generally. Are they running into a wall with their ability to
add functionality to the architecture?

- Dan C.

Scott Lurndal

May 21, 2023, 1:15:00 PM
I believe it to be a significant factor. Their primary concern is
Windows and Linux; it will require very little software churn there, and Intel
will likely provide patches directly to the vendor(s) (e.g. microsoft)
and directly to the linux kernel mailing list.

Yes, it will basically put Paul and PDOS out of business, but he can
always run in emulation.

>
>Of all the weirdness in x86, eliminating this stuff seems like
>pretty small potatoes. I mean, they could get rid of
>segmentation entirely in 64-bit mode,

Now, that would be churn. Granted AMD's SVM and Intel's VT-X
(once nested page tables/extended page tables were added) have
basically eliminated the last need for segment limit register
(which AMD had removed in amd64 but added back later because
XEN needed them for paravirtualization (before NPT/EPT)).

I see this proposal eliminates limit checking, which was only
added to x86_64 to support Xen.

Most of these changes will affect the boot loaders and secondary
bootstrap for the most part, which is where the progression from
power-on through real-mode, protected-mode, enable paging, enable
longmode occurs. There should be very few changes to the modern
kernels (windows, linux), if any.


> and that seems like it
>would have a bigger effect (no more `POP SS` style problems), or
>they could fix NMI handling, or make the ISA generally more
>orthogonal and predictable,

Again, that's churn that affects applications or OS. While getting
rid of the iomap and IOPL==3 -might- affect applications,
it is quite unlikely and there are alternatives for most modern
operating systems (e.g. virt-io in linux) to grant access to
the hardware to user-mode applications.


>>There will be some, likely
>>minor, improvements in area utilization at the current node, so perhaps
>>they can squeeze in another core or two in the same area. Or
>>additional memory controllers. etc.
>
>I dunno.... Looking at the massive surface area of SIMD units
>and their associated caches and registers on Xeon cores, I doubt
>this is the limiting factor. Still, more memory controllers are
>always welcome.
>
>>I've heard rumblings about this for a decade now from colleagues
>>(mostly former intel processor designers).
>
>That's intriguing; I wonder what this says about x86 at Intel
>generally. Are they running into a wall with their ability to
>add functionality to the architecture?

The main grumbling was always about the arcane boot process
still requiring full 8086 semantics and the long process to
get to long mode. The complexity of verification and post-
silicon testing (legacy 8259 PIC, et al.) was also discussed.

Each generation adds more features that need formal verification
during the design process (and to ensure correct functionality
with new internal interconnect structures (bus, ring, mesh, et alia))
and that costs more man-hours to integrate into the design without
breaking any of the legacy stuff. If this all lets them tape
a new chip out a month early, that's a win.

Dan Cross

May 21, 2023, 5:07:07 PM
In article <lksaM.450847$ZhSc....@fx38.iad>,
If, as you say, the concern is doing away with 16-bit entirely,
from a complexity/testing perspective, then I suppose I can see
it.

>Yes, it will basically put Paul and PDOS out of business,

What a shame.

>but he can always run in emulation.

What a shame.

>>Of all the weirdness in x86, eliminating this stuff seems like
>>pretty small potatoes. I mean, they could get rid of
>>segmentation entirely in 64-bit mode,
>
>Now, that would be churn. Granted AMD's SVM and Intel's VT-X
>(once nested page tables/extended page tables were added) have
>basically eliminated the last need for segment limit register
>(which AMD had removed in amd64 but added back later because
>XEN needed them for paravirtualization (before NPT/EPT)).

Would it really? Limits and base are already ignored in long
mode; about the only thing it's still used for is GSBASE/FSBASE
and for that we have MSRs. But, having to program non-null
segment selectors into STAR, and having to have a valid GDT,
adds seemingly unnecessary complexity. If they're going to
swap around how they do AP startup with a brand-new SIPI type,
it doesn't seem like a big lift to just do away with
segmentation entirely.
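
To put it concretely, here's roughly the bookkeeping I mean (a rough
sketch, not any particular kernel's code; the GDT layout below is
invented, though the MSR numbers and the fixed SYSCALL/SYSRET selector
arithmetic are, as far as I know, the architectural ones):

    #include <stdint.h>

    #define IA32_STAR  0xC0000081u  /* SYSCALL/SYSRET segment selectors */
    #define IA32_LSTAR 0xC0000082u  /* 64-bit SYSCALL target %rip */

    extern void syscall_entry(void);

    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
        __asm__ volatile("wrmsr" :: "c"(msr), "a"((uint32_t)val),
                                    "d"((uint32_t)(val >> 32)));
    }

    /*
     * Hypothetical GDT layout: 0x08 kernel code, 0x10 kernel data,
     * 0x20 user data, 0x28 user code.  SYSCALL loads %cs from
     * STAR[47:32] and %ss from STAR[47:32]+8; SYSRET loads %ss from
     * STAR[63:48]+8 and %cs from STAR[63:48]+16.  That fixed +8/+16
     * arithmetic is exactly why a GDT with non-null selectors can't
     * simply go away.
     */
    static void setup_syscall_msrs(void)
    {
        uint64_t star = ((uint64_t)0x18 << 48) | ((uint64_t)0x08 << 32);
        wrmsr(IA32_STAR, star);
        wrmsr(IA32_LSTAR, (uint64_t)(uintptr_t)syscall_entry);
    }

Even under x86-S, as far as I can tell, that little dance survives.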

>I see this proposal eliminates limit checking, which was only
>added to x86_64 to support Xen.

I believe that's for 32-bit mode? Both AMD and Intel already
ignore segment limits in 64-bit mode, and both effectively
ignore segment base as well. At least for CS, DS, ES and SS; FS
and GS are treated specially for thread- and core-local storage,
but I don't think any that is changing in this proposal.

>Most of these changes will affect the boot loaders and secondary
>bootstrap for the most part, which is where the progression from
>power-on through real-mode, protected-mode, enable paging, enable
>longmode occurs. There should be very few changes to the modern
>kernels (windows, linux), if any.

Yup. That code (as you well know) tends to be write-once and
mostly forget. I get that the hardware folks want to make their
lives easier, but in the short term, this adds complexity to the
code (which must now be specialized to detect whether it's
running on an x86-S CPU or current x86_64 and behave
accordingly). I suppose we could treat x86-S as an entirely
separate architecture.

>> and that seems like it
>>would have a bigger effect (no more `POP SS` style problems), or
>>they could fix NMI handling, or make the ISA generally more
>>orthogonal and predictable,
>
>Again, that's churn that affects applications or OS. While getting
>rid of the iomap and IOPL==3 -might- affect applications,
>it is quite unlikely and there are alternatives for most modern
>operating systems (e.g. virt-io in linux) to grant access to
>the hardware to user-mode applications.

I don't see how virtio can give a user-application pass-through
access to programmed IO, but I appreciate an argument that says
that there can be a uioring sort of thing to communicate IO
requests from userspace to the kernel without a trap.

For that matter, I don't see how doing away with segmentation in
the 64-bit mode really adds that much, either: typically, once
one gets into long mode, one sets the segment registers
exactly once and never touches them again, except where you're
forced to by some superfluous weirdness with syscall/sysret
and exceptions. On the other hand, if we just started treating
%cs, %ds, %es, and %ss as WI/RAZ registers, and eliminated
oddities like "interrupts are blocked until after the instruction
after a pop into %ss", it would simplify the hardware _and_ close
a stupid potential security bug. What would really be the problem here?
We could leave the trap format untouched for compatibility with
32-bit mode.

>>>There will be some, likely
>>>minor, improvements in area utilization at the current node, so perhaps
>>>they can squeeze in another core or two in the same area. Or
>>>additional memory controllers. etc.
>>
>>I dunno.... Looking at the massive surface area of SIMD units
>>and their associated caches and registers on Xeon cores, I doubt
>>this is the limiting factor. Still, more memory controllers are
>>always welcome.
>>
>>>I've heard rumblings about this for a decade now from colleagues
>>>(mostly former intel processor designers).
>>
>>That's intriguing; I wonder what this says about x86 at Intel
>>generally. Are they running into a wall with their ability to
>>add functionality to the architecture?
>
>The main grumbling was always about the arcane boot process
>still requiring full 8086 semantics and the long process to
>get to long mode. The complexity of verification and post-
>silicon testing (legacy 8259 PIC, et al.) was also discussed.

The process to get to long mode is annoying, sure, but it's not
that long: what, about fifty instructions or so to go from the
first instruction on the SIPI page to running in a high-level
language? Slightly more on the BSP, sure, but not much more.

I can appreciate the desire to get rid of 8086 semantics, but I
am much less sympathetic to the argument about code distance
from boot to long mode.

And systems like Linux are still quite adamant about using, e.g.,
the PIT and 8259 to calibrate the TSC when they can (which is
super annoying when you're trying to bring up a new hypervisor)!
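
For anyone who hasn't had the pleasure, the classic dance looks
roughly like this (a simplified sketch rather than Linux's actual
code, using the standard legacy port numbers):

    #include <stdint.h>

    static inline void outb(uint16_t port, uint8_t v)
    {
        __asm__ volatile("outb %0, %1" :: "a"(v), "Nd"(port));
    }
    static inline uint8_t inb(uint16_t port)
    {
        uint8_t v;
        __asm__ volatile("inb %1, %0" : "=a"(v) : "Nd"(port));
        return v;
    }
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Count TSC ticks while PIT channel 2 counts down a known interval. */
    uint64_t tsc_ticks_per_pit_window(void)
    {
        /* Gate channel 2 on, speaker output off (port 0x61, bits 0/1). */
        outb(0x61, (inb(0x61) & ~0x02) | 0x01);
        /* Channel 2, lobyte/hibyte access, mode 0 (terminal count). */
        outb(0x43, 0xB0);
        /* ~50 ms at 1.193182 MHz is 59659 counts. */
        outb(0x42, 59659 & 0xFF);
        outb(0x42, 59659 >> 8);

        uint64_t t0 = rdtsc();
        while (!(inb(0x61) & 0x20))   /* wait for OUT2 to go high */
            ;
        return rdtsc() - t0;          /* TSC ticks in ~50 ms */
    }

All of which the hypervisor has to faithfully emulate, ports and all.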

>Each generation adds more features that need formal verification
>during the design process (and to ensure correct functionality
>with new internal interconnect structures (bus, ring, mesh, et alia))
>and that costs more man-hours to integrate into the design without
>breaking any of the legacy stuff. If this all lets them tape
>a new chip out a month early, that's a win.

I suppose....

I mean, don't get me wrong, I believe you, but I find the entire
thing a bit dubious.

- Dan C.

BGB

May 21, 2023, 11:07:19 PM
It does help.

Though, this doesn't fix a lot of the ugly issues with x86-64.
But, by the time one cleans up the mess, what one would have isn't
really x86-64 anymore...


>> Yes, it will basically put Paul and PDOS out of business,
>
> What a shame.
>
>> but he can always run in emulation.
>
> What a shame.
>

At this point, it almost seems like maybe x86 should be "put out to
pasture" (as a native hardware-level instruction set) and pretty much
the whole mess moved over to emulation (a "good JIT" can be
reasonably effective).

Ideally, the native ISA should be open, and able to be used directly by
software. Or it could have a fallback "x86 emulation" mode, which can launch a
conventional OS in a sort of emulator/hypervisor (so, say, one can run
an x86-64 OS on it, probably with reduced performance).


But, OTOH, not like there is an ideal alternative:
* ARM: Not ideal, as it is not an open ISA;
* RISC-V: Open ISA, but still a little lacking;
* BJX2: Could work OK, but I am a bit biased here...


I would almost think RISC-V could be a good solution, but:
Some of its design choices, I don't feel, are ideal;
The maturity of the tooling is unexpectedly lacking;
GCC is missing things like PIE and shared objects;
Many of the extensions are half-baked and don't mesh together well.
...

Getting good performance from RISC-V would still require a "clever" CPU
core, as naive designs would fall short in terms of performance (and
there are some issues here that are "not fixable" within the existing
encoding scheme).

For a small device, something akin to a RISC-V core running an x86
emulator as part of its ROM could almost make sense (but performance
wouldn't be so great, as the use of condition codes by x86 would be a
serious issue for RISC-V's ISA design).



One could almost argue for reviving IA-64, except some of its design
choices don't make much sense either.

One likely needs, say:
Likely a VLIW or similar;
Supporting an in-order or out-of-order pipeline;
Probably 64 or 128 general-purpose registers;
Directly accessible, no register windows;
Has built-in support for SIMD operations;
Likely FPU existing as a subset of the SIMD operators;
Supports large-immediate forms in the instruction stream;
Needs to be able to support 32 and 64-bit inline constants.
Instruction predication;
...


Instruction size is an issue.

In my ISA, I was using 32-bit instructions, but it isn't really possible
to fit "everything I would want" into a 32-bit instruction format (in my
case, this leads to a certain inescapable level of non-orthogonality,
and also some larger variable-length instructions).


Could fit a bit more into a 48-bit instruction format, but code-density
would be worse. Well, and/or one uses an instruction encoding scheme
similar to IA-64... Both have drawbacks.


Then again, if Intel was like "We are bringing back something kinda
similar to IA-64 but with AVX glued on", possibly it wouldn't go over
very well...

Well, and also when one considers that some of the compiler
weaknesses/difficulties are still not an entirely solved issue.

Though, it is likely that rather than focusing on traditional native
compilation, the emphasis would be on JIT compiling x86 and x86-64 code
to the thing.


>>> Of all the weirdness in x86, eliminating this stuff seems like
>>> pretty small potatoes. I mean, they could get rid of
>>> segmentation entirely in 64-bit mode,
>>
>> Now, that would be churn. Granted AMD's SVM and Intel's VT-X
>> (once nested page tables/extended page tables were added) have
>> basically eliminated the last need for segment limit register
>> (which AMD had removed in amd64 but added back later because
>> XEN needed them for paravirtualization (before NPT/EPT)).
>
> Would it really? Limits and base are already ignored in long
> mode; about the only thing it's still used for is GSBASE/FSBASE
> and for that we have MSRs. But, having to program non-null
> segment selectors into STAR, and having to have a valid GDT,
> adds seemingly unnecessary complexity. If they're going to
> swap around how they do AP startup with a brand-new SIPI type,
> it doesn't seem like a big lift to just do away with
> segmentation entirely.
>

Ironically, if one goes over to software managed TLB, then the whole
"nested page table" thing can disappear into the noise (as it is all
software).


Or, you have people like me going and using B-Trees in place of
page-tables, since B-Trees don't waste as much memory when one has a
sparse address space managed with aggressive ASLR (the pages in the
upper levels of the page-tables being almost entirely empty with sparse
ASLR).

Granted, I don't expect many other people are likely to consider using
B-Trees in place of page-tables to be a sensible idea.



>> I see this proposal eliminates limit checking, which was only
>> added to x86_64 to support Xen.
>
> I believe that's for 32-bit mode? Both AMD and Intel already
> ignore segment limits in 64-bit mode, and both effectively
> ignore segment base as well. At least for CS, DS, ES and SS; FS
> and GS are treated specially for thread- and core-local storage,
> but I don't think any that is changing in this proposal.
>

Yeah.


>> Most of these changes will affect the boot loaders and secondary
>> bootstrap for the most part, which is where the progression from
>> power-on through real-mode, protected-mode, enable paging, enable
>> longmode occurs. There should be very few changes to the modern
>> kernels (windows, linux), if any.
>
> Yup. That code (as you well know) tends to be write-once and
> mostly forget. I get that the hardware folks want to make their
> lives easier, but in the short term, this adds complexity to the
> code (which must now be specialized to detect whether it's
> running on an x86-S CPU or current x86_64 and behave
> accordingly). I suppose we could treat x86-S as an entirely
> separate architecture.
>

Presumably you could only boot these from EFI anyways.
In some ways it makes some sense...


But, if one is going to be breaking backward compatibility anyways (as
far as the OS is concerned), it still almost makes sense to consider
abandoning x86 entirely and dealing with legacy software via emulation
(with a "good enough" emulator being provided so that the OSes are less
likely to try to force all of the binaries over to the new instruction set).

But, it does mean that, besides things being convenient for the OS devs,
one likely needs a design capable of getting up to ~ 90%+ of native x86
performance via JIT (this part would likely be a serious problem for
trying to emulate x86 via RISC-V).


> - Dan C.
>

Luke A. Guest

May 22, 2023, 6:00:49 AM
On 21/05/2023 15:18, Dan Cross wrote:
> In article <u4crh1$1hngg$1...@dont-email.me>,
> Luke A. Guest <lag...@archeia.com> wrote:
>> On 20/05/2023 18:07, Dan Cross wrote:
>>> Likely to be filed under the "Too Little, Too Late" category:
>>> Intel has recently put forth a proposal for simplifying the x86_64
>>> architecture and, in particular, discarding some of the legacy
>>> baggage of a 45-year-old architecture. Details, including a
>>> link to the specifically proposed architectural changes, are
>>> here:
>>> https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
>>
>> Or just dump the archaic x86(-64) arch and go full RISC; with current
>> manufacturing technology, they'd be speedy as fuck. Apparently all x86
>> chips have been RISC underneath for ages now anyway.
>
> The issue there is backwards compatibility with an enormous
> installed base. Intel tried back in the early 00s with the
> Itanium; it did not go well for them. They had previously tried

I know, I remember it well: all the decent RISC chip manufacturers
basically dumping their stuff for something inferior.

As I said already, chips already do x86/64 emulation on top of a RISC
architecture, so what's the problem?

> back in the 80s with the i432 and that was even worse. In an
> amazing display of lack of self-awareness, they made many of the
> same mistakes with Itanium that they had previously made with
> i432 (waiting for the perfect compiler to make the thing go fast
> was a mistake with both projects). They made similar errors
> with the i860, too, and made other RISC mistakes with the i960
> (which was marketed as a microcontroller)---notably, the 960
> used register windows, like SPARC and the original Berkeley RISC
> machine. History has shown this seemingly nifty idea not to be
> that great in practice.

It's not that it's "not too great"; it's more that they fucked it up by
doing weird shit, like making a CPU have so many instructions just to
run Ada. I use Ada, it's great, but that was just stupid. But that was a
time when Ada features required compiler features that hadn't been
invented yet, or had only just been, and were slow.


Dan Cross

May 22, 2023, 7:09:07 AM
In article <u4em65$213rr$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/21/2023 4:07 PM, Dan Cross wrote:
>> [snip]
>> Would it really? Limits and base are already ignored in long
>> mode; about the only thing it's still used for is GSBASE/FSBASE
>> and for that we have MSRs. But, having to program non-null
>> segment selectors into STAR, and having to have a valid GDT,
>> adds seemingly unnecessary complexity. If they're going to
>> swap around how they do AP startup with a brand-new SIPI type,
>> it doesn't seem like a big lift to just do away with
>> segmentation entirely.
>>
>
>Ironically, if one goes over to software managed TLB, then the whole
>"nested page table" thing can disappear into the noise (as it is all
>software).
>
>
>Or, you have people like me going and using B-Trees in place of
>page-tables, since B-Trees don't waste as much memory when one has a
>sparse address space managed with aggressive ASLR (the pages in the
>upper levels of the page-tables being almost entirely empty with sparse
>ASLR).
>
>Granted, I don't expect many other people are likely to consider using
>B-Trees in place of page-tables to be a sensible idea.

Software-managed TLBs actually dramatically complicate
address space management in a hypervisor, in part because the
page table format used by a guest is not generally knowable in advance
(a guest can just make up its own).

Why a B-tree instead of a radix tree, anyway?

- Dan C.

Scott Lurndal

May 22, 2023, 10:48:06 AM
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>In article <lksaM.450847$ZhSc....@fx38.iad>,
>Scott Lurndal <sl...@pacbell.net> wrote:

>
>>Yes, it will basically put Paul and PDOS out of business,
>
>What a shame.
>

:-)


>>Now, that would be churn. Granted AMD's SVM and Intel's VT-X
>>(once nested page tables/extended page tables were added) have
>>basically eliminated the last need for segment limit register
>>(which AMD had removed in amd64 but added back later because
>>XEN needed them for paravirtualization (before NPT/EPT)).
>
>Would it really? Limits and base are already ignored in long
>mode;

They were in the original AMD64 implementation, but that changed
quickly - the data segment limit is still enforced in long mode
(to support XEN (and VMware) paravirt).

https://www.pagetable.com/?p=25

A subsequent opteron added it back just for the data segment.
I can't speak for Intel in this instance, but AMD definitely
added support for the DS limit checking in long mode in the
second or third generation opteron.

> about the only thing it's still used for is GSBASE/FSBASE
>and for that we have MSRs. But, having to program non-null
>segment selectors into STAR, and having to have a valid GDT,
>adds seemingly unnecessary complexity. If they're going to
>swap around how they do AP startup with a brand-new SIPI type,
>it doesn't seem like a big lift to just do away with
>segmentation entirely.
>
>>I see this proposal eliminates limit checking, which was only
>>added to x86_64 to support Xen.
>
>I believe that's for 32-bit mode? Both AMD and Intel already
>ignore segment limits in 64-bit mode, and both effectively

See above.

>>Most of these changes will affect the boot loaders and secondary
>>bootstrap for the most part, which is where the progression from
>>power-on through real-mode, protected-mode, enable paging, enable
>>longmode occurs. There should be very few changes to the modern
>>kernels (windows, linux), if any.
>
>Yup. That code (as you well know) tends to be write-once and
>mostly forget. I get that the hardware folks want to make their
>lives easier, but in the short term, this adds complexity to the
>code (which must now be specialized to detect whether it's
>running on an x86-S CPU or current x86_64 and behave
>accordingly). I suppose we could treat x86-S as an entirely
>separate architecture.

A couple of checks of the CPUID output and a bit of new code
seems reasonable to position for the future.
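
Roughly something like the following, presumably (just a sketch; the
leaf, sub-leaf, and bit below are placeholders, since I haven't gone
and checked what enumeration the spec actually assigns to x86-S):

    #include <stdint.h>

    static inline void cpuid(uint32_t leaf, uint32_t subleaf,
                             uint32_t *a, uint32_t *b,
                             uint32_t *c, uint32_t *d)
    {
        __asm__ volatile("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "a"(leaf), "c"(subleaf));
    }

    /* Hypothetical enumeration: assume some structured extended-feature
     * leaf/bit advertises "legacy-reduced" operation; the real spec's
     * leaf and bit position would go here instead. */
    #define X86S_FEATURE_LEAF    0x7u        /* placeholder */
    #define X86S_FEATURE_SUBLEAF 0x1u        /* placeholder */
    #define X86S_FEATURE_BIT     (1u << 19)  /* placeholder */

    int cpu_is_x86s(void)
    {
        uint32_t a, b, c, d;
        cpuid(X86S_FEATURE_LEAF, X86S_FEATURE_SUBLEAF, &a, &b, &c, &d);
        return (d & X86S_FEATURE_BIT) != 0;
    }

The boot path then branches once, early, between the legacy 16/32-bit
trampoline and the new 64-bit startup, and never looks back.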

>
>I don't see how virtio can give a user-application pass-through
>access to programmed IO, but I appreciate an argument that says
>that there can be a uioring sort of thing to communicate IO
>requests from userspace to the kernel without a trap.

We do that all the time on our processors. Applications like DPDK
and Open Data Plane (ODP) rely on user-mode access to the
device MMIO (often using SR-IOV virtual functions) space and direct
DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.

Interrupts are still mediated by the OS (virt-io provides these
capabilities), although DPDK/ODP generally poll completion rings
rather than use interrupts.

>
>For that matter, I don't see how doing away with segmentation in
>the 64-bit mode really adds that much, either

Again, it's the hardware implementation and verification cost
that's being saved.

Dan Cross

May 22, 2023, 12:48:56 PM
In article <HfLaM.617957$Ldj8....@fx47.iad>,
Scott Lurndal <sl...@pacbell.net> wrote:
>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>In article <lksaM.450847$ZhSc....@fx38.iad>,
>>Scott Lurndal <sl...@pacbell.net> wrote:
>[snip]
>>>Now, that would be churn. Granted AMD's SVM and Intel's VT-X
>>>(once nested page tables/extended page tables were added) have
>>>basically eliminated the last need for segment limit register
>>>(which AMD had removed in amd64 but added back later because
>>>XEN needed them for paravirtualization (before NPT/EPT)).
>>
>>Would it really? Limits and base are already ignored in long
>>mode;
>
>They were in the original AMD64 implementation, but that changed
>quickly - the data segment limit is still enforced in long mode
>(to support XEN (and VMware) paravirt).
>
>https://www.pagetable.com/?p=25
>
>A subsequent opteron added it back just for the data segment.
>I can't speak for Intel in this instance, but AMD definitely
>added support for the DS limit checking in long mode in the
>second or third generation opteron.

Funny, I checked both the SDM and the AMD APM before posting,
and both say segment checking is not enforced in 64-bit mode.
(SDM vol. 3A sec. 5.3.1 and APM vol. 2 sec. 4.8.2). Ah, wait;
I see now: APM vol 2 sec 4.12.2 says that they _are_ enforced
if EFER.LMSLE is set to 1 (see, this is why we can't have nice
things). Apparently, Intel never did this since they had VT-x,
so I don't imagine they care that much for x86S. With SVM, I
wonder how much AMD cares, either.
Yeah. That adds some modicum of complexity to the boot path,
but is obviously doable.

It's still not entirely clear to me how the BSP/BSC is supposed
to boot, however. If the world starts in 64-bit mode, and that
still requires paging to be enabled, then who sets up the page
tables that the BSP starts up on?

>>I don't see how virtio can give a user-application pass-through
>>access to programmed IO, but I appreciate an argument that says
>>that there can be a uioring sort of thing to communicate IO
>>requests from userspace to the kernel without a trap.
>
>We do that all the time on our processors. Applications like DPDK
>and Open Data Plane (ODP) rely on user-mode access to the
>device MMIO (often using SR-IOV virtual functions) space and direct
>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.

Ok, sure, but that's not PIO. Unprivileged access to the PIO
space seems like it's just going away. I think that's probably
fine as almost all high-speed devices are memory-mapped anyway,
so we're just left with legacy things like the UART or PS/2
keyboard controller or whatever.

>Interrupts are still mediated by the OS (virt-io provides these
>capabilities), although DPDK/ODP generally poll completion rings
>rather than use interrupts.

Really? Even with SR-IOV and the interrupt remapping tables in
the IOMMU? Are you running in VMX non-root mode? Why not use
posted interrupts?

>>For that matter, I don't see how doing away with segmentation in
>>the 64-bit mode really adds that much, either
>
>Again, it's the hardware implementation and verification cost
>that's being saved.

Sorry, that was poorly worded: I meant, I don't see how it costs
that much to do away with it. Even on AMD it appears that one
has to go out of one's way to use it.

- Dan C.

Scott Lurndal

May 22, 2023, 1:19:39 PM
Yep, that's the one. I have been deep into AArch64 for the last
12 years, so I haven't dug through the AMD docs for a while.

>
>Yeah. That adds some modicum of complexity to the boot path,
>but is obviously doable.
>
>It's still not entirely clear to me how the BSP/BSC is supposed
>to boot, however. If the world starts in 64-bit mode, and that
>still requires paging to be enabled, then who sets up the page
>tables that the BSP starts up on?

I haven't dug into it, but perhaps they come up in some funky
identity mode when the PT root pointer (CR3?) hasn't been programmed.

>
>>>I don't see how virtio can give a user-application pass-through
>>>access to programmed IO, but I appreciate an argument that says
>>>that there can be a uioring sort of thing to communicate IO
>>>requests from userspace to the kernel without a trap.
>>
>>We do that all the time on our processors. Applications like DPDK
>>and Open Data Plane (ODP) rely on user-mode access to the
>>device MMIO (often using SR-IOV virtual functions) space and direct
>>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.
>
>Ok, sure, but that's not PIO.

By PIO are you referring to 'in' and 'out' instructions that have
been obsolete for three decades except for a few legacy devices
like the UART (and access to pci config space, although PCI
express defines the memory mapped ECAM as an alternative which
is used on non-intel/amd systems)?


> Unprivileged access to the PIO
>space seems like it's just going away. I think that's probably
>fine as almost all high-speed devices are memory-mapped anyway,
>so we're just left with legacy things like the UART or PS/2
>keyboard controller or whatever.

Plus, with PCI, a "io space" bar can be programmed to sit anywhere
in the physical address space. With most modern devices either
being PCI or providing PCI configuration space semantics, one can
still use PIO even on ARM processors via IO BAR. Not that there really are
any modern PCI/PCIe devices that use anything other than "memory space"
bars.

>
>>Interrupts are still mediated by the OS (virt-io provides these
>>capabilities), although DPDK/ODP generally poll completion rings
>>rather than use interrupts.
>
>Really? Even with SR-IOV and the interrupt remapping tables in
>the IOMMU? Are you running in VMX non-root mode? Why not use
>posted interrupts?

Hmm. I do seem to recall some mechanisms for interrupt virtualization
in the IOMMU, but I've been, as noted above, in the ARMv8 world for a while now.

Speaking for ARM systems, the guts of the interrupt controller
(including the interrupt acknowledge registers) are privileged. There
is no way to segregate user-mode-visible interrupts from all others
which is needed to ensure that a user-mode program can't royally screw
up the system, the kernel must accept and end the interrupt. The
ARM GICv3 is actually much more sophisticated than the local and I/O
APICs' on Intel and the GICv4 adds some level of interrupt virtualization
to support delivery directly to the guest without intervention from
the hypervisor. IIRC, the Intel IOMMU interrupt remapping tables
were to support that type of usage, not direct user mode access
(which would require user-mode access to the local APIC to end the
interrupt).


>
>>>For that matter, I don't see how doing away with segmentation in
>>>the 64-bit mode really adds that much, either
>>
>>Again, it's the hardware implementation and verification cost
>>that's being saved.
>
>Sorry, that was poorly worded: I meant, I don't see how it costs
>that much to do away with it. Even on AMD it appears that one
>has to go out of one's way to use it.

Gotcha. Concur.

Dan Cross

May 22, 2023, 2:18:05 PM
In article <ltNaM.3348574$9sn9.2...@fx17.iad>,
Scott Lurndal <sl...@pacbell.net> wrote:
>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>[snip]
>>It's still not entirely clear to me how the BSP/BSC is supposed
>>to boot, however. If the world starts in 64-bit mode, and that
>>still requires paging to be enabled, then who sets up the page
>>tables that the BSP starts up on?
>
>I haven't dug into it, but perhaps they come up in some funky
>identity mode when the PT root pointer (CR3?) hasn't been programmed.

Now that would genuinely be a useful change.

>>>>I don't see how virtio can give a user-application pass-through
>>>>access to programmed IO, but I appreciate an argument that says
>>>>that there can be a uioring sort of thing to communicate IO
>>>>requests from userspace to the kernel without a trap.
>>>
>>>We do that all the time on our processors. Applications like DPDK
>>>and Open Data Plane (ODP) rely on user-mode access to the
>>>device MMIO (often using SR-IOV virtual functions) space and direct
>>>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.
>>
>>Ok, sure, but that's not PIO.
>
>By PIO are you referring to 'in' and 'out' instructions that have
>been obsolete for three decades except for a few legacy devices
>like the UART

Well, yes. (The context was the removal of both ring 3 port
access instructions, as well as the IOPL from TSS.)

>(and access to pci config space, although PCI
>express defines the memory mapped ECAM as an alternative which
>is used on non-intel/amd systems)?

I try to blot that out of my mind.

I believe that PCI express deprecates the port-based access
method to config space; MMIO _must_ be supported and in
particular, is the only way to get access to the extended
capability space. So from that perspective we're not losing
anything. Certainly, I've used memory-mapped IO for dealing
with PCI config space on x86_64. The port-based access method
is really only for compatibility with legacy systems at this
point.

>> Unprivileged access to the PIO
>>space seems like it's just going away. I think that's probably
>>fine as almost all high-speed devices are memory-mapped anyway,
>>so we're just left with legacy things like the UART or PS/2
>>keyboard controller or whatever.
>
>Plus, with PCI, a "io space" bar can be programmed to sit anywhere
>in the physical address space. With most modern devices either
>being PCI or providing PCI configuration space semantics, one can
>still use PIO even on ARM processors via IO BAR. Not that there really are
>any modern PCI/PCIe devices that use anything other than "memory space"
>bars.

Yup. It really seems like the only devices that demand access
via port IO are the legacy "PC" devices; if the 8259A is going
away, what's left? The RTC, UART and keyboard controller? Is
the PIT dual-wired to an IOAPIC for interrupt generation?

>>>Interrupts are still mediated by the OS (virt-io provides these
>>>capabilities), although DPDK/ODP generally poll completion rings
>>>rather than use interrupts.
>>
>>Really? Even with SR-IOV and the interrupt remapping tables in
>>the IOMMU? Are you running in VMX non-root mode? Why not use
>>posted interrupts?
>
>Hmm. I do seem to recall some mechanisms for interrupt virtualization
>in the IOMMU, but I've been, as noted above, in the ARMv8 world for a while now.

Is this the point where I express my jealousy? :-D

But yes: the IOMMU can be used to deliver interrupts directly to
a VCPU (provided you're using APIC virtualization) by writing to
a posted-interrupt vector. The resulting interrupt will be
generated and delivered in the guest without intervention from
the hypervisor.

>Speaking for ARM systems, the guts of the interrupt controller
>(including the interrupt acknowledge registers) are privileged. There
>is no way to segregate user-mode-visible interrupts from all others
>which is needed to ensure that a user-mode program can't royally screw
>up the system, the kernel must accept and end the interrupt.

I'm not sure I understand; I thought the GIC was memory mapped,
including for the banked per-CPU registers? Is the issue that
you don't want to expose the entire mapping (I presume this has
to be on some page granularity) to userspace?

>The
>ARM GICv3 is actually much more sophisticated than the local and I/O
>APICs' on Intel and the GICv4 adds some level of interrupt virtualization
>to support delivery directly to the guest without intervention from
>the hypervisor. IIRC, the Intel IOMMU interrupt remapping tables
>were to support that type of usage, not direct user mode access
>(which would require user-mode access to the local APIC to end the
>interrupt).

That is correct; perhaps I'm misinterpreting what you meant
earlier: I think I gather now that you're talking about
overloading functionality meant for virtualization to provide
unprivileged access to devices in a host. That is, allocate a
virtual function and pass that through to a userspace process,
but don't enter a virtualized CPU context?

- Dan C.

BGB

May 22, 2023, 3:04:16 PM
If the guest is using the same type of software-managed TLB, one doesn't
emulate the guest's page tables; one emulates the guest's TLB
(effectively running the TLB through another level of virtual address
translation).
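
Say, roughly (just a sketch of the idea; the helper names here are
made up):

    #include <stdint.h>

    /* One guest TLB entry as the guest's LDTLB-style instruction
     * supplies it: guest virtual page -> guest physical page. */
    struct guest_tlbe {
        uint64_t gva_page;
        uint64_t gpa_page;
        uint32_t perms;
    };

    /* Hypervisor-side hook: when the guest loads a TLB entry, translate
     * the guest-physical page through the host's own mapping for that
     * guest and install gva -> host-physical into the real
     * (software-managed) TLB. */
    extern uint64_t host_phys_for_guest_phys(uint64_t gpa_page);
    extern void     hw_tlb_load(uint64_t va_page, uint64_t pa_page,
                                uint32_t perms);

    void on_guest_ldtlb(const struct guest_tlbe *e)
    {
        uint64_t hpa_page = host_phys_for_guest_phys(e->gpa_page);
        /* Permissions get intersected with whatever the host allows. */
        hw_tlb_load(e->gva_page, hpa_page, e->perms);
    }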


> Why a B-tree instead of a radix tree, anyway?
>

If you mean a radix tree, like conventional page-tables, the issue is
mostly a combination of a large address space and ASLR.

The page table works fine for a 48-bit address space, but starts to have
problems for a larger space.


In my case, the "full" virtual address space is 96 bits.

The upper levels of the page table end up being mostly empty, and if one
tries to ASLR addresses within this space, then the memory overhead from
the page tables ends up being *absurd* (huge numbers of page tables
often with only a single entry being non-zero).

I had ended up often using a hybrid strategy, where the upper-bits of
the address are managed using a B-Tree, and the low-order bits with a
more conventional page table.

Say, for 16K pages:
Addr(95:36): B-Tree
Addr(35:14): 2-level page-table

Then one can use ASLR freely without burning through excessive amounts
of memory (say, with an 8-level page table for 16K pages).

Note that 4K pages would require a 10-level page-table, and 64K pages a
7-level page-table.


A pure B-Tree would use less memory than the hybrid strategy, but the
drawback of a B-Tree is that it is slower.

It is possible to speed-up the B-Tree by using a hash-table to cache
lookups, but this is most effective with the hybrid strategy.


Using hash-tables as the primary lookup (rather than B-Trees) had been
looked into as well, but hash tables have drawbacks when used in this
way (they don't scale very well).

Had also experimented with using AVL Trees, however these ended up
slightly worse in terms of both memory overhead and performance when
compared with B-Trees (though, AVL Trees are a little simpler to implement).

...


Note that the memory layout here would have programs within their own
local 48 bits (programs not generally needing to care about anything
beyond their own 48-bit space), but the programs are placed randomly
within the 96-bit space (so that one program can't "guess" an address
into another program's memory; but memory can still be "shared" via
128-bit "huge" pointers).

Say:
void *ptr; //points within the local 48-bit space (64-bit pointer)
void * __huge ptr; //points within the 96-bit space (128-bit)
...

Say:
void ** __huge ptr; //128-bit pointer to 64-bit pointers
__huge void **ptr; //64-bit pointer to 128-bit pointers
__huge void ** __huge ptr; //128-bit pointer to 128-bit pointers

...

Though, this part is specific to BGBCC.



I had put off some of this for a little while, but it came up again
mostly because my CPU core can also run 64-bit RISC-V, but it turns out
GCC doesn't support either PIE or shared-objects for this target
("WTF?"), so to be able to load them up as programs, I need to give them
their own address space, and throwing them off into random parts of
96-bit land seemed the "lesser of two evils" (as compared with needing
to actually deal with multiple address spaces in the kernel).

This is a little bit of an annoyance as it does mean needing to widen
the virtual memory system and anything that deals directly with system
calls to deal with the larger address space (and, secondarily, use a
wrapper interface because, if any of this code is built with GCC in
RISC-V mode, then GCC doesn't support either 128-bit pointers or 128-bit
integers).

Though, in some cases, this will mean needing to copy things into local
buffers and copy them back into the program's address range (so, a
similar annoyance to if one was dealing with multiple address spaces).

...


Scott Lurndal

May 22, 2023, 3:20:46 PM
Intel, AMD and ARM chips all have hardware translation walkers. All
have a facility to support guest OS management of page tables
(nested page tables on AMD, extended page tables on Intel and stage 2
page tables on ARM). They also have I/O memory management units that
also support multiple translation stages to allow guests to program
guest physical addresses into hardware DMA engines, some can even
support translations using both stages to allow user-mode code to
directly program hardware DMA engines when running under
a guest OS (e.g. for a NIC or storage adapter virtual function assigned
directly to user-mode code).

Even those using MIPS chips (the last with software walkers) such as
Cavium investigated adding a hardware walker before they switched to
ARMv8.

There is no possibility that the software (operating system and
hypervisor) folks will support software managed TLBs in a new architecture.
Zero chance.

Dan Cross

May 22, 2023, 4:12:51 PM
Indeed, but that introduces complications: on a TLB miss
interrupt, the hypervisor must invoke the guest's TLB miss
handler to supply the translation (since it can't just walk the
guest's page tables, since it doesn't know about them), then
trap the TLB update. That's all straightforward, but it's also
slow. Often, to mitigate this, the hypervisor will cache recent
translations itself (e.g., Disco's L2TLB), but now we have to
emulate removals as well (again, straightforward, but something
to consider nonetheless). And when we consider nested
virtualization (that is, a hypervisor running under a
hypervisor) this overhead gets even worse.

L2PT's like the EPT and NPT are wins here; even in the nested
VM case, where we have to resort to shadow paging techniques, we
can handle L2 page faults in the top-level hypervisor.

There's a reason soft-TLBs have basically disappeared. :-)

- Dan C.

Scott Lurndal

May 22, 2023, 4:45:33 PM
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>In article <ltNaM.3348574$9sn9.2...@fx17.iad>,
>Scott Lurndal <sl...@pacbell.net> wrote:
>>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>[snip]
>>>It's still not entirely clear to me how the BSP/BSC is supposed
>>>to boot, however. If the world starts in 64-bit mode, and that
>>>still requires paging to be enabled, then who sets up the page
>>>tables that the BSP starts up on?
>>
>>I haven't dug into it, but perhaps they come up in some funky
>>identity mode when the PT root pointer (CR3?) hasn't been programmed.
>
>Now that would genuinely be a useful change.

The document describes IA32_SIPI_ENTRY_STRUCT as containing:

- A bit that selects startup or shutdown (+63 filler bits)
- The RIP for the AP to start executing at
- The CR3 value for the AP to use when starting
- The CR0 value
- The CR4 value

This is used to start all the secondary processors.
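
In C terms, presumably something along these lines (going just by the
field list above; I haven't cross-checked the exact layout or field
ordering against the spec):

    #include <stdint.h>

    /* Rough shape of the structure IA32_SIPI_ENTRY_STRUCT points at,
     * per the field list above; the real definition is in the x86S
     * specification. */
    struct sipi_entry {
        uint64_t command;   /* bit 0: 1 = startup, 0 = shutdown; rest filler */
        uint64_t rip;       /* 64-bit entry point for the AP */
        uint64_t cr3;       /* initial page-table root */
        uint64_t cr0;       /* initial CR0 (PG, PE, etc. already set) */
        uint64_t cr4;       /* initial CR4 (PAE/LA57 as desired) */
    };

    /* The BSP fills one of these in, points the new MSR at it, and the
     * SIPI then starts the AP directly at `rip` in 64-bit paged mode. */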

The bootstrap processor:

"The CPU starts executing in 64-bit paged mode after reset. The
Firmware Interface Table (FIT) contains a reset state structure
containing RIP and CR3 that defines the initial execution state of
the CPU."

I'm presuming that a management processor provided by the
mainboard vendor will initialize the FIT out-of-band before
releasing the BSP from reset, sufficient to execute the UEFI
firmware and boot loader.


>
>>>>>I don't see how virtio can give a user-application pass-through
>>>>>access to programmed IO, but I appreciate an argument that says
>>>>>that there can be a uioring sort of thing to communicate IO
>>>>>requests from userspace to the kernel without a trap.
>>>>
>>>>We do that all the time on our processors. Applications like DPDK
>>>>and Open Data Plane (ODP) rely on user-mode access to the
>>>>device MMIO (often using SR-IOV virtual functions) space and direct
>>>>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.
>>>
>>>Ok, sure, but that's not PIO.
>>
>>By PIO are you referring to 'in' and 'out' instructions that have
>>been obsolete for three decades except for a few legacy devices
>>like the UART
>
>Well, yes. (The context was the removal of both ring 3 port
>access instructions, as well as the IOPL from TSS.)

Ok. Coming from the unix/linux world, ring 3 access to those
has generally not been allowed and I don't see removal of that
capability as a loss, but rather a gain.

>
>>(and access to pci config space, although PCI
>>express defines the memory mapped ECAM as an alternative which
>>is used on non-intel/amd systems)?
>
>I try to blot that out of my mind.
>
>I believe that PCI express deprecates the port-based access
>method to config space; MMIO _must_ be supported and in
>particular, is the only way to get access to the extended
>capability space.

Intel cheated and used four unused bits in the high bits
of cf8 for the extended config space on some of the southbridge
chipsets in the early PCIe days. I've not checked recently
to see if newer chipsets/SoCs still support that.

> So from that perspective we're not losing
>anything. Certainly, I've used memory-mapped IO for dealing
>with PCI config space on x86_64. The port-based access method
>is really only for compatibility with legacy systems at this
>point.

For ARM, the standard is ECAM (See Server Base System Architecture
document).

https://developer.arm.com/documentation/den0029/latest/
>
>>> Unprivileged access to the PIO
>>>space seems like it's just going away. I think that's probably
>>>fine as almost all high-speed devices are memory-mapped anyway,
>>>so we're just left with legacy things like the UART or PS/2
>>>keyboard controller or whatever.
>>
>>Plus, with PCI, a "io space" bar can be programmed to sit anywhere
>>in the physical address space. With most modern devices either
>>being PCI or providing PCI configuration space semantics, one can
>>still use PIO even on ARM processors via IO BAR. Not that there really are
>>any modern PCI/PCIe devices that use anything other than "memory space"
>>bars.
>
>Yup. It really seems like the only devices that demand access
>via port IO are the legacy "PC" devices; if the 8259A is going
>away, what's left? The RTC, UART and keyboard controller? Is
>the PIT dual-wired to an IOAPIC for interrupt generation?

Don't they have an architected high-precision timer (HPET) that
is used instead of the PIT in these modern times?

>
>>>>Interrupts are still mediated by the OS (virt-io provides these
>>>>capabilities), although DPDK/ODP generally poll completion rings
>>>>rather than use interrupts.
>>>
>>>Really? Even with SR-IOV and the interrupt remapping tables in
>>>the IOMMU? Are you running in VMX non-root mode? Why not use
>>>posted interrupts?
>>
>>Hmm. I do seem to recall some mechanisms for interrupt virtualization
>>in the IOMMU, but I've been, as noted above, in the ARMv8 world for a while now.
>
>Is this the point where I express my jealousy? :-D

I've quite enjoyed the last decade working on a
significant architectural upgrade from ARMv7.

Watching the architecture grow from initial early
release documents and modeling the Processor, SMMU, and
Interrupt Controller has been educational and fun.


>>Speaking for ARM systems, the guts of the interrupt controller
>>(including the interrupt acknowledge registers) are privileged. There
>>is no way to segregate user-mode-visible interrupts from all others
>>which is needed to ensure that a user-mode program can't royally screw
>>up the system, the kernel must accept and end the interrupt.
>
>I'm not sure I understand; I thought the GIC was memory mapped,
>including for the banked per-CPU registers?

That was GICv2. That interface only supported 8 cores/threads,
so they designed GICv3 for ARMv8. That uses CPU system registers
to interface rather than the former memory mapped CPU interface.
Much cleaner, and easily accommodates many thousands of CPUs; also
expanded to handle large numbers (2^22) of interrupts, including
software generated interrupts for IPI, per-processor local interrupts
for things like timers, profiling, debugging, wired interrupts
(level or edge) and message signaled interrupts (edge only).
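
(As a rough illustration of that system-register interface, assuming
AArch64 at EL1 and an assembler that knows the ICC_* names -- older
binutils may need the raw S3_0_C12_C12_x encodings -- a minimal
acknowledge/EOI sequence looks something like:)

  #include <stdint.h>

  /* GICv3: the CPU interface is reached via MRS/MSR rather than MMIO. */
  static inline uint64_t gic_ack(void)
  {
      uint64_t intid;
      __asm__ volatile("mrs %0, ICC_IAR1_EL1" : "=r"(intid));
      return intid & 0xffffff;    /* INTID field is up to 24 bits */
  }

  static inline void gic_eoi(uint64_t intid)
  {
      __asm__ volatile("msr ICC_EOIR1_EL1, %0" :: "r"(intid));
      __asm__ volatile("isb" ::: "memory");
  }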

The GICv3 CPU interface also includes special hypervisor support. In
GICv3.0, the hypervisor could inject interrupts in the guest but still
was required to handle all physical interrupts itself (and pass them
to the guest as required). With GICv4.0, a mechanism was added such
that interrupts (message signaled) could be directly injected into
the guest by the hardware without hypervisor intervention. GICv4.1
added support for hardware injected virtual software generated
interrupts (vSGI), so SMP guests could send IPI's to other cores
assigned to it using virtual core numbers mapped by the GIC into
an interrupt to be injected (if the guest was actively scheduled on
the core, otherwise it would be recorded and delivered when the
guest virtual CPU is next scheduled on that core).




>
>That is correct; perhaps I'm misinterpreting what you meant
>earlier: I think I gather now that you're talking about
>overloading functionality meant for virtualization to provide
>unprivileged access to devices in a host. That is, allocate a
>virtual function and pass that through to a userspace process,
>but don't enter a virtualized CPU context?

Yes, that's the basic use pattern for the DPDK. Usermode drivers
directly access the networking hardware without operating system
intervention. Virtualized[*] _or_ bare-metal. Interrupts are the
complicating factor since most processors do not have the capability
to deliver interrupts to user-mode handlers directly. Thus DPDK
and ODP poll completion queues on the network hardware rather than
waiting for interrupts.

https://www.dpdk.org/
https://opendataplane.org/

[*] Requires the SMMU/IOMMU to do both stages of translation, i.e.
va->guestpa and guestpa->machinepa.
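
(As a rough illustration of that polling model, a minimal DPDK-style
RX loop -- rte_eth_rx_burst() is the real API; the port/queue numbers
are arbitrary and all setup/error handling is omitted:)

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  /* Busy-poll a completion (RX) queue instead of waiting on interrupts. */
  static void poll_rx(uint16_t port, uint16_t queue)
  {
      struct rte_mbuf *pkts[32];

      for (;;) {
          /* returns immediately with 0..32 packets; no syscall, no IRQ */
          uint16_t n = rte_eth_rx_burst(port, queue, pkts, 32);
          for (uint16_t i = 0; i < n; i++) {
              /* ... process pkts[i] ... */
              rte_pktmbuf_free(pkts[i]);
          }
      }
  }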

BGB

unread,
May 22, 2023, 9:38:07 PM5/22/23
to
I had gone with software managed TLB, but in my case this was partly
because my ISA had evolved out of SuperH and similar, and the approach
seemed to make sense in terms of a "make the hardware simple and cheap"
sense.


Similarly, some features, like B-Tree based page-tables or applying ACL
checks to virtual-memory pages (as opposed to a more conventional
protection-ring scheme), wouldn't really be quite as practical with a
hardware page-table walker.


But, with it being handled in software, one can basically do whatever
they want...


If the guest OS wants to use a page-table and pretend there is a
page-table walker, this is easy enough to pull off as well.

Granted, emulating a TLB on top of page-tables is more difficult, but
this is mostly because page-tables are less flexible in this case.


...



BGB

unread,
May 22, 2023, 10:00:47 PM5/22/23
to
IME, it isn't really all that difficult in practice.

Granted, for things like removing a page from the TLB, these were
generally handled by loading an empty TLBE (for a given virtual address)
roughly 8 times in a row. With a set-associative TLB, this was enough to
make sure the page was evicted (and should also be easy enough to detect
and handle).


> L2PT's like the EPT and NPT are wins here; even in the nested
> VM case, where we have to resort to shadow paging techniques, we
> can handle L2 page faults in the top-level hypervisor.
>

But, if one uses SW TLB, then NPT (as a concept) has no reason to need
to exist...


> There's a reason soft-TLBs have basically disappeared. :-)
>

Probably depends some on how the software-managed TLB is implemented.

In my case, TLB miss triggers an interrupt, and there is an "LDTLB"
instruction which basically means "Take the TLBE from these two
registers and shove it into the TLB at the appropriate place".

In this case, there is no way for the program to directly view or modify
the contents of the TLB.
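
(A minimal sketch of what such a handler looks like in C -- the
register/instruction accessors and the 3-level, 16K-page table layout
are illustrative assumptions, not the actual ISA encoding, and
validity/present checks are omitted:)

  #include <stdint.h>

  /* Hypothetical stand-ins for the ISA-specific pieces: a register read
   * for the faulting address and an LDTLB-style "insert this entry". */
  extern uint64_t read_miss_vaddr(void);
  extern void     ldtlb(uint64_t vpn_hi, uint64_t pte_lo);

  extern uint64_t *page_table_root;    /* maintained by the OS */

  /* Software TLB miss: walk a 3-level table (16 KiB pages, ~48-bit VA
   * assumed) and push the resulting entry into the TLB. */
  void tlb_miss_handler(void)
  {
      uint64_t va = read_miss_vaddr();
      uint64_t *l1 = page_table_root;
      uint64_t *l2 = (uint64_t *)(l1[(va >> 36) & 0x7ff] & ~0x3fffULL);
      uint64_t *l3 = (uint64_t *)(l2[(va >> 25) & 0x7ff] & ~0x3fffULL);
      uint64_t pte = l3[(va >> 14) & 0x7ff];

      ldtlb(va & ~0x3fffULL, pte);     /* hi: virtual page, lo: PTE */
  }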


As can be noted in my testing, TLB miss rate is typically low enough
(with 16K pages and a 256x 4-way TLB) that the performance impact of
handling TLB misses in software doesn't really have all that much effect
on the overall performance of the program.

As can be noted:
TLB miss rate increases significantly with 4K pages vs 16K pages;
Dropping to 64x 4-way also increases miss rate;
Miss rate is "pretty bad" at 16x 4-way.

Generally, stable operation of the TLB seems to require at least 4-way
associativity for the L2 TLB (an L1 TLB can get by more effectively with
1-way assuming a modulo indexing scheme; So, 16x 1-way for the L1 TLB).

Note that fully associative TLBs aren't really viable on an FPGA.


> - Dan C.
>

Dan Cross

unread,
May 23, 2023, 8:20:01 AM5/23/23
to
In article <u4h5b9$2afd7$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>[snip]
>Granted, emulating a TLB on top of page-tables is more difficult, but
>this is mostly because page-tables are less flexible in this case.

No, emulating the _TLB_ itself is easier, but overall memory
management is more complex, particularly in recursive VM
situations.

- Dan C.

muta...@gmail.com

unread,
May 23, 2023, 9:27:29 AM5/23/23
to
On Monday, May 22, 2023 at 1:15:00 AM UTC+8, Scott Lurndal wrote:

> Yes, it will basically put Paul and PDOS out of business, but he can
> always run in emulation.

I already have a nominally 64-bit version of PDOS that
runs under 64-bit UEFI.

I'm still waiting for compiler support to effectively force
it back to 32-bit.

For PDOS/386, as of yesterday, the entire toolchain except the
compiler is public domain, and all the assembler is masm
syntax.

i.e. we now have a public domain assembler that is sufficiently
masm-compatible for my purposes.

I haven't yet proven that the entire PDOS can be built with
Visual Studio, now that the language has been switched.

I've proven it with Watcom though.

If I dumb down the source base to SubC then I should be able
to have a completely public domain solution, but I am holding
out for Octogram C.

There is currently work being done to convert the public domain
toolchain to 64-bit, but direction, and even definition, is still being
negotiated.

BFN. Paul.

muta...@gmail.com

unread,
May 23, 2023, 11:11:51 AM5/23/23
to
On Tuesday, May 23, 2023 at 9:27:29 PM UTC+8, muta...@gmail.com wrote:

> I haven't yet proven that the entire PDOS can be built with
> Visual Studio, now that the language has been switched.

I just received the tool I needed to complete this (convert
a PE executable into a binary by inserting a jmp and
populating BSS), and ... it works!

So now PDOS can be built with professional Microsoft tools
instead of having to take your chances with jackasses on
the internet.

BFN. Paul.

Dan Cross

unread,
May 23, 2023, 12:24:31 PM5/23/23
to
In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/22/2023 3:10 PM, Dan Cross wrote:
>[snip]
>> L2PT's like the EPT and NPT are wins here; even in the nested
>> VM case, where we have to resort to shadow paging techniques, we
>> can handle L2 page faults in the top-level hypervisor.
>>
>
>But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>to exist...

Yes, at great expense.

>> There's a reason soft-TLBs have basically disappeared. :-)
>
>Probably depends some on how the software-managed TLB is implemented.

Not really; the design issues and the impact are both
well-known. Think through how a nested guest (note, not a
nested page table, but a recursive instance of a hypervisor)
would be handled.

>In my case, TLB miss triggers an interrupt, and there is an "LDTLB"
>instruction which basically means "Take the TLBE from these two
>registers and shove it into the TLB at the appropriate place".

That's pretty much the way they all work, yes.

- Dan C.

Dan Cross

unread,
May 23, 2023, 12:31:05 PM5/23/23
to
In article <u4ip84$hpp$1...@reader2.panix.com>,
Dan Cross <cr...@spitfire.i.gajendra.net> wrote:
>In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>On 5/22/2023 3:10 PM, Dan Cross wrote:
>>[snip]
>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>> VM case, where we have to resort to shadow paging techniques, we
>>> can handle L2 page faults in the top-level hypervisor.
>>>
>>
>>But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>>to exist...
>
>Yes, at great expense.
>
>>> There's a reason soft-TLBs have basically disappeared. :-)
>>
>>Probably depends some on how the software-managed TLB is implemented.
>
>Not really; the design issues and the impact are both
>well-known. Think through how a nested guest (note, not a
>nested page table, but a recursive instance of a hypervisor)
>would be handled.

Another thing to consider in a virtualized context with a
soft-TLB: suppose the host and guest want to occupy the same
region of virtual memory. How does the host wrest control
back from the guest, if the guest has usurped the host's
mappings? On MIPS, you have KSEGs, which is one approach
here, but note that under (say) Disco you had to modify the
guest kernel as a result.

- Dan C.

Dan Cross

unread,
May 23, 2023, 1:11:22 PM5/23/23
to
In article <KvQaM.2005263$t5W7.1...@fx13.iad>,
Scott Lurndal <sl...@pacbell.net> wrote:
>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>In article <ltNaM.3348574$9sn9.2...@fx17.iad>,
>>Scott Lurndal <sl...@pacbell.net> wrote:
>>>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>>[snip]
>>>>It's still not entirely clear to me how the BSP/BSC is supposed
>>>>to boot, however. If the world starts in 64-bit mode, and that
>>>>still requires paging to be enabled, then who sets up the page
>>>>tables that the BSP starts up on?
>>>
>>>I haven't dug into it, but perhaps they come up in some funky
>>>identity mode when the PT root pointer (CR3?) hasn't been programmed.
>>
>>Now that would genuinely be a useful change.
>
>The document describes IA32_SIPI_ENTRY_STRUCT as containing:
>
> - A bit that selects startup or shutdown (+63 filler bits)
> - The RIP for the AP to start executing at
> - The a CR3 value for the AP to use when starting
> - The CR0 value
> - The CR4 value
>
>This is used to start all the secondary processors.

Yeah, the AP (secondary processor) case was pretty clear.
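
(Spelled out, one plausible C view of that structure -- field order
follows the description above; the actual widths and offsets are
whatever the spec says, so treat this purely as a sketch:)

  #include <stdint.h>

  /* Hypothetical layout of the IA32_SIPI_ENTRY_STRUCT block. */
  struct sipi_entry {
      uint64_t start;   /* bit 0: 1 = startup, 0 = shutdown; rest filler */
      uint64_t rip;     /* initial %rip for the AP */
      uint64_t cr3;     /* page-table root the AP comes up on */
      uint64_t cr0;
      uint64_t cr4;
  };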

>The bootstrap processor:
>
> "The CPU starts executing in 64-bit paged mode after reset. The
> Firmware Interface Table (FIT) contains a reset state structure
> containing RIP and CR3 that defines the initial execution state of
> the CPU."
>
>I'm presuming that a management processor provided by the
>mainboard vendor will initialize the FIT out-of-band before
>releasing the BSP from reset sufficient to execute the UEFI
>firmware and boot loader.

Ah, the AMD PSP approach, aka, how to start the engine on a
ship. I guess the days of DRAM training from the BSP are over,
which isn't necessarily a bad thing.

I imagine that they'll just embed this into the SoC complex
directly.

>>>>>We do that all the time on our processors. Applications like DPDK
>>>>>and Open Data Plane (ODP) rely on user-mode access to the
>>>>>device MMIO (often using SR-IOV virtual functions) space and direct
>>>>>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.
>>>>
>>>>Ok, sure, but that's not PIO.
>>>
>>>By PIO are you referring to 'in' and 'out' instructions that have
>>>been obsolete for three decades except for a few legacy devices
>>>like the UART
>>
>>Well, yes. (The context was the removal of both ring 3 port
>>access instructions, as well as the IOPL from TSS.)
>
>Ok. Coming from the unix/linux world, ring 3 access to those
>has generally not been allowed and I don't see removal of that
>capability as a loss, but rather a gain.

ioperm(2) and iopl(2)? :-)

>>>(and access to pci config space, although PCI
>>>express defines the memory mapped ECAM as an alternative which
>>>is used on non-intel/amd systems)?
>>
>>I try to blot that out of my mind.
>>
>>I believe that PCI express deprecates the port-based access
>>method to config space; MMIO _must_ be supported and in
>>particular, is the only way to get access to the extended
>>capability space.
>
>Intel cheated and used four unused bits in the high bits
>of cf8 for the extended config space on some of the southbridge
>chipsets in the early PCIe days. I've not checked recently
>to see if newer chipsets/SoCs still support that.

Sigh. Actually, I checked the PCIe spec last night and I think
I was just wrong. It looks like you can still do it, but
address/data register pairs are problematic, so I always just
use ECAM.

I think port IO can be useful at very early boot for writing
debugging data to the UART, or for supporting legacy "PC"
devices; beyond that, I find it annoying. We did use it once in
an experimental hypervisor to execute the equivalent of a VMCALL
without first having to trap into (guest) kernel mode.

>> So from that perspective we're not losing
>>anything. Certainly, I've used memory-mapped IO for dealing
>>with PCI config space on x86_64. The port-based access method
>>is really only for compatibility with legacy systems at this
>>point.
>
>For ARM, the standard is ECAM (See Server Base System Architecture
>document).
>
>https://developer.arm.com/documentation/den0029/latest/

ARM has always been saner than x86. :-) It makes sense that
they'd start with and stick to ECAM, since (AFAIK) they never
had programmed IO instructions. How else _would_ you do it?
(That's a rhetorical question, btw.)

>>Yup. It really seems like the only devices that demand access
>>via port IO are the legacy "PC" devices; if the 8259A is going
>>away, what's left? The RTC, UART and keyboard controller? Is
>>the PIT dual-wired to an IOAPIC for interrupt generation?
>
>Don't they have an architected high-precision timer (HPET) that
>is used instead of the PIT in these modern times?

It does, but the HPET has weird problems of its own and is more
rarely used than one might otherwise expect. It's a really
annoying device in a lot of ways.

>>>>>Interrupts are still mediated by the OS (virt-io provides these
>>>>>capabilities), although DPDK/ODP generally poll completion rings
>>>>>rather than use interrupts.
>>>>
>>>>Really? Even with SR-IOV and the interrupt remapping tables in
>>>>the IOMMU? Are you running in VMX non-root mode? Why not use
>>>>posted interrupts?
>>>
>>>Hmm. I do seem to recall some mechanisms for interrupt virtualization
>>>in the IOMMU, but I've been, as noted above, in the ARMv8 world for a while now.
>>
>>Is this the point where I express my jealousy? :-D
>
>I've quite enjoyed the last decade working on a
>significant architectural upgrade from ARMv7.
>
>Watching the architecture grow from initial early
>release documents and modeling the Processor, SMMU, and
>Interrupt Controller has been educational and fun.

Definitely some really cool things have happened in that space
while those of us slumming it with x86 have looked on with envy.
This actually sounds remarkably like what one gets with the
x2APIC and APIC virtualization these days, though perhaps more
coherent than on x86: one interacts with the LAPIC in x2 mode
via MSRs (though the IOMMU still uses memory-mapped address/data
register pairs). MSI delivery coupled with posted interrupts
and the remapping tables in the IOMMU have cooperated to make it
easy to inject interrupts into a guest without hypervisor
intervention. AMD's AVIC supports functionality that makes lets
guests VCPUs send IPIs amongst themselves without hypervisor
intervention, but there are all kinds of problems with it. It
sounds like GICv4 centralizes all of this, whereas on the x86
platforms it gets scattered between a variety of components; AND
ARM gives you the bare-metal passthru functionality.

>>That is correct; perhaps I'm misinterpreting what you meant
>>earlier: I think I gather now that you're talking about
>>overloading functionality meant for virtualization to provide
>>unprivileged access to devices in a host. That is, allocate a
>>virtual function and pass that through to a userspace process,
>>but don't enter a virtualized CPU context?
>
>Yes, that's the basic use pattern for the DPDK. Usermode drivers
>directly access the networking hardware without operating system
>intervention. Virtualized[*] _or_ bare-metal. Interrupts are the
>complicating factor since most processors do not have the capability
>to deliver interrupts to user-mode handlers directly. Thus DPDK
>and ODP poll completion queues on the network hardware rather than
>waiting for interrupts.

Tracking. That's good stuff. It strikes me that DoE did
similar things with NIX as part of the FastOS work.

>https://www.dpdk.org/
>https://opendataplane.org/
>
>[*] Requires the SMMU/IOMMU to do both stages of translation, i.e.
> va->guestpa and guestpa->machinepa.

_nod_

- Dan C.

BGB

unread,
May 23, 2023, 2:18:57 PM5/23/23
to
On 5/23/2023 11:22 AM, Dan Cross wrote:
> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>> [snip]
>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>> VM case, where we have to resort to shadow paging techniques, we
>>> can handle L2 page faults in the top-level hypervisor.
>>>
>>
>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>> to exist...
>
> Yes, at great expense.
>

Doesn't seem all that expensive.


In terms of LUTs, a soft TLB uses far less than a page walker.

And, the TLB doesn't need to have a mechanism to send memory requests
and handle memory responses, ...

It uses some Block-RAM's for the TLB, but those aren't too expensive.


In terms of performance, it is generally around 1.5 kilocycle per TLB
miss (*1), but as-is these typically happen roughly 50 or 100 times per
second or so.

On a 50 MHz core, only about 0.2% of the CPU time is going into handling
TLB misses.
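
(For concreteness: 100 misses/s x ~1,500 cycles = ~150,000 cycles/s,
which is 0.3% of a 50,000,000 cycle/s budget; 50 misses/s gives
0.15%, hence roughly 0.2%.)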


Note that a page-fault (saving a memory page to an SD card and loading a
different page) is around 1 megacycle.


*1: Much of this goes into the cost of saving and restoring all the
GPRs, where my ISA has 64x 64-bit GPRs. The per-interrupt cost could be
reduced significantly via register banking, but then one pays a lot more
for registers which are only ever used during interrupt handling.


>>> There's a reason soft-TLBs have basically disappeared. :-)
>>
>> Probably depends some on how the software-managed TLB is implemented.
>
> Not really; the design issues and the impact are both
> well-known. Think through how a nested guest (note, not a
> nested page table, but a recursive instance of a hypervisor)
> would be handled.
>

The emulators for my ISA use SW TLB, and I don't imagine a hypervisor
would be that much different, except that they would likely use TLB ->
TLB remapping, rather than abstracting the whole memory subsystem.

One could also have the guest OS use page-tables FWIW.


I had originally intended to use firmware managed TLB with the OS using
page-tables, but this switched to plain software TLB mostly because I
ran out of space in the 32K Boot ROM (mostly due to things like
boot-time CPU sanity testing, *).

*: Idea being that during boot, the CPU tests many of the core ISA
features to verify they are working as intended (say, to detect things
like if a change to the Verilog broke the ALU or similar, ...).


Besides the sanity testing, the Boot ROM also contains a FAT filesystem
interface and PE/COFF / PEL4 loader (well, and also technically an ELF
loader, but I am mostly using PEL4).


Where PEL4 is:
PE/COFF but without the MZ stub;
Compresses most of the image using LZ4.
Decompressing LZ4 being faster than reading in more data.

The LZ4 compression seems to work well with binary code vs my own RP2
compression (which works better for general data, but not as well for
machine-code). Both formats being byte-oriented LZ variants (but they
differ in terms of how LZ matches are encoded and similar).

Have observed that LZ4 decompression tends to be slightly faster on
conventional machines (like x86-64), but on my ISA, RP2 is a little faster.

Note that Deflate can give slightly better compression, but is around an
order of magnitude slower.


Generally, in PEL4, the file headers are left in an uncompressed state,
but all of the section data and similar is LZ compressed.
Where, header magic:
PE\0\0: Uncompressed
PEL0: Also uncompressed (similar to PE\0\0)
PEL3: RP2 Compression (Not generally used)
PEL4: LZ4 Compression
PEL6: LZ4LLB (Modified LZ4, Length-Limited Encoding)

If the header is 'MZ', it checks for an offset to the start of the PE
header, but then assumes normal (uncompressed) PE/COFF.


Also PEL4 uses a different checksum algorithm from normal PE/COFF, as
the original checksum algorithm sucked and could not detect some of the
main types of corruption that result from LZ screw-ups.

The "linear sum with carry-folding" was instead replaced with a "linear
sum and sum-of-linear-sums with carry-folding XORed together". It is
significantly faster than something like Adler32 (or CRC32), while still
providing many of the same benefits (namely, better error detection than
the original checksums).

Checksum is verified after the whole image is loaded/decompressed into RAM.
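
(A rough sketch of that scheme -- the widths, fold interval, seed, and
function name below are assumptions for illustration, not necessarily
the exact PEL4 algorithm:)

  #include <stddef.h>
  #include <stdint.h>

  /* End-around carry: fold the high half back into the low half so
   * carries out of 32 bits aren't lost. */
  static uint32_t fold32(uint64_t v)
  {
      v = (v & 0xffffffffu) + (v >> 32);
      v = (v & 0xffffffffu) + (v >> 32);
      return (uint32_t)v;
  }

  /* Linear sum and sum-of-linear-sums with carry folding, XORed together. */
  uint32_t pel_checksum(const uint32_t *w, size_t nwords)
  {
      uint64_t s1 = 1, s2 = 0;   /* nonzero seed: all-zero data != 0 */

      for (size_t i = 0; i < nwords; i++) {
          s1 += w[i];
          s2 += s1;
          if ((i & 0xfff) == 0xfff) {  /* fold before accumulators overflow */
              s1 = fold32(s1);
              s2 = fold32(s2);
          }
      }
      return fold32(s1) ^ fold32(s2);
  }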


For my ABI, the "Global Pointer" entry in the Data directory was
repurposed into handling a floating "data section" which may be loaded
at a different address from ".text" and friends (so multiple program or
DLL instances can share the same copy of the ".text" and similar), with
the base-relocation table being internally split in this area (there is
a GBR register which points to the start of ".data", which in turn
points to a table which can be used for the program or DLLs to reload
their own corresponding data section into GBR; for "simple case" images,
this is simply a self-pointer).

Some sections, like the resource section, were effectively replaced (the
resource section now uses a format resembling the "Quake WAD2" format,
just with a different header and the offsets in terms of RVA's). Things
like "resource lumps" could then be identified with a 16-chacracter name
(typically uncompressed, apart from any compression due to the PEL4
compression, with bitmap images typically stored in the DIB/BMP format,
audio using RIFF/WAVE, ...).


Otherwise, the format is mostly similar to normal PE/COFF.


>> In my case, TLB miss triggers an interrupt, and there is an "LDTLB"
>> instruction which basically means "Take the TLBE from these two
>> registers and shove it into the TLB at the appropriate place".
>
> That's pretty much the way they all work, yes.
>

I think there were some that exposed the TLB as MMIO or similar, and the
ISR handler would then be expected to write the new TLBE into an MMIO array.

The SH-4 ISA also had something like this (in addition to the LDTLB
instruction), but I didn't keep this feature, and from what I could
tell, the existing OS's (such as the Linux kernel) didn't appear to use
it...

They also used a fully-associative TLB, which is absurdly expensive, so
I dropped to a 4-way set-associative TLB (while also making the TLB a
bit larger).


They had used a 64-entry fully-associative array, I ended up switching
to 256x 4-way, which is a total of around 1024 TLBEs.

So, in this case, the main TLB ends up as roughly half the size of an L1
cache (in terms of Block RAM), but uses fewer LUTs than an L1 cache.


As-is, a 16K L1 needs roughly 32K of Block-RAM (roughly half the space
eaten by tagging metadata with a 16-byte line size; while a larger
cache-line size would more efficiently use the BRAM's, it would also
result in a significant increase in LUT cost).


Dan Cross

unread,
May 23, 2023, 2:26:43 PM5/23/23
to
In article <u4ivui$2lenq$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/23/2023 11:22 AM, Dan Cross wrote:
>> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>>> [snip]
>>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>>> VM case, where we have to resort to shadow paging techniques, we
>>>> can handle L2 page faults in the top-level hypervisor.
>>>>
>>>
>>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>>> to exist...
>>
>> Yes, at great expense.
>>
>
>Doesn't seem all that expensive.
>
>
>In terms of LUTs, a soft TLB uses far less than a page walker.

You're thinking in terms of hardware, not software or
performance.

>And, the TLB doesn't need to have a mechanism to send memory requests
>and handle memory responses, ...
>
>It uses some Block-RAM's for the TLB, but those aren't too expensive.
>
>
>In terms of performance, it is generally around 1.5 kilocycle per TLB
>miss (*1), but as-is these typically happen roughly 50 or 100 times per
>second or so.
>
>On a 50 MHz core, only about 0.2% of the CPU time is going into handling
>TLB misses.

That's not the issue.

The hypervisor has to invoke the guest's
TLB miss handler, which will have to fault _again_ once it tries
to write to the TLB to insert an entry; this can lead to several
round-trips, bouncing between the host and guest several times.
With nested VMs, this gets significantly worse.

> [snip]
>One could also have the guest OS use page-tables FWIW.

How does the hypervisor know the format of the guest's page
tables, in general?

- Dan C.

Scott Lurndal

unread,
May 23, 2023, 2:42:14 PM5/23/23
to
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>In article <KvQaM.2005263$t5W7.1...@fx13.iad>,
>Scott Lurndal <sl...@pacbell.net> wrote:
>>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>>In article <ltNaM.3348574$9sn9.2...@fx17.iad>,
>>>Scott Lurndal <sl...@pacbell.net> wrote:
>>>>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>>>[snip]
>>>>>It's still not entirely clear to me how the BSP/BSC is supposed
>>>>>to boot, however. If the world starts in 64-bit mode, and that
>>>>>still requires paging to be enabled, then who sets up the page
>>>>>tables that the BSP starts up on?

>>The bootstrap processor:
>>
>> "The CPU starts executing in 64-bit paged mode after reset. The
>> Firmware Interface Table (FIT) contains a reset state structure
>> containing RIP and CR3 that defines the initial execution state of
>> the CPU."
>>
>>I'm presuming that a management processor provided by the
>>mainboard vendor will initialize the FIT out-of-band before
>>releasing the BSP from reset sufficient to execute the UEFI
>>firmware and boot loader.
>
>Ah, the AMD PSP approach, aka, how to start the engine on a
>ship. I guess the days of DRAM training from the BSP are over,
>which isn't necessarily a bad thing.

That doesn't necessarily follow. It is possible to run the
training code from the L1/L2 caches with careful coding.


>>
>>Ok. Coming from the unix/linux world, ring 3 access to those
>>has generally not been allowed and I don't see removal of that
>>capability as a loss, but rather a gain.
>
>ioperm(2) and iopl(2)? :-)

Hence "generally". Can't recall ever seeing it used.

>
>>>>(and access to pci config space, although PCI
>>>>express defines the memory mapped ECAM as an alternative which
>>>>is used on non-intel/amd systems)?
>>>
>>>I try to blot that out of my mind.
>>>
>>>I believe that PCI express deprecates the port-based access
>>>method to config space; MMIO _must_ be supported and in
>>>particular, is the only way to get access to the extended
>>>capability space.
>>
>>Intel cheated and used four unused bits in the high bits
>>of cf8 for the extended config space on some of the southbridge
>>chipsets in the early PCIe days. I've not checked recently
>>to see if newer chipsets/SoCs still support that.
>
>Sigh. Actually, I checked the PCIe spec last night and I think
>I was just wrong. It looks like you can still do it, but
>address/data register pairs are problematic, so I always just
>use ECAM.

Yeah, cf8/cfc was always an intel specific mechanism anyway;
I don't recall it being defined in the PCI Spec (checks 1993
copy, which states "System dependent issues ... such as mapping
various PCI address spaces (config, memory, i/o) into host
CPU address spaces, ordering rules, etc are described in the
PCI System Design guide" - which I don't have a copy of handy)

>
>I think port IO can be useful at very early boot for writing
>debugging data to the UART, or for supporting legacy "PC"
>devices; beyond that, I find it annoying.

Yes, there's no doubt about that. On the AArch64 chips I
work with, the UART is memory mapped (compatible with the
ARM PL011 UART), and has a PCI configuration space; so
the early boot code needs to scan the PCI bus, read the
BAR and use MMIO through that bar (the PCI Enhanced
Allocation capability is present, so the BARs are
fixed rather than programable). Interrupts are handled
via MSI-X vectors (and being level sensitive, a pair
of vectors are used, one to assert and one to deassert
the interrupt).


> We did use it once in
>an experimental hypervisor to execute the equivalent of a VMCALL
>without first having to trap into (guest) kernel mode.

It was useful to use the back door to "mask" NMI as well as
supporting legacy devices in the 3Leaf distributed hypervisor,
and for port 0x80 debugging (we had a PCI card with a pair of
seven segment displays which captured/displayed port 0x80).


>>For ARM, the standard is ECAM (See Server Base System Architecture
>>document).
>>
>>https://developer.arm.com/documentation/den0029/latest/
>
>ARM has always been saner than x86. :-)

We had settled on using ECAM in our ARMv8 chip and suggested
it during the early SBSA discussions with ARM and partners.

Granted ECAM didn't exist (and there wasn't really sufficient
address space in a 32-bit system for a full multi segment ECAM anyway) when
Intel 'invented' cf8/cfc.

> It makes sense that
>they'd start with and stick to ECAM, since (AFAIK) they never
>had programmed IO instructions. How else _would_ you do it?
>(That's a rhetorical question, btw.)

Well, existing art included various peek/poke backdoors similar
to cf8/cfc using MMIO registers at the time. I was on the architecture
team when we started the ARMv8 processors and pushed using ECAM in our
implementation; our MIPS chips had used peek/poke registers in the
PCI controller (onboard devices didn't look like PCI). In the
ARM chip, for discovery purposes, we chose to make all the on-chip
devices and accelerators look like PCI devices and thus something
like the ECAM became necessary.



>>
>>Watching the architecture grow from initial early
>>release documents and modeling the Processor, SMMU, and
>>Interrupt Controller has been educational and fun.
>
>Definitely some really cool things have happened in that space
>while those of us slumming it with x86 have looked on with envy.

I've been fortunate to have worked near the bleeding edge:
from new mainframe architecture (updating a 20 Y.O. Arch) in the early 80s to
uKernel based distributed Unix-like operating systems in the late 80s/early 90's
to a distributed version of IRIX in the late 90's transitioning to
linux (contributed KDB while at SGI) and hypervisors in late 90s
(somewhat in parallel with disco) and a distributed hypervisor
in the 2000's (3Leaf Systems); it's been quite a ride.

BGB

unread,
May 23, 2023, 3:42:15 PM5/23/23
to
On 5/23/2023 1:26 PM, Dan Cross wrote:
> In article <u4ivui$2lenq$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>> On 5/23/2023 11:22 AM, Dan Cross wrote:
>>> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>>>> [snip]
>>>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>>>> VM case, where we have to resort to shadow paging techniques, we
>>>>> can handle L2 page faults in the top-level hypervisor.
>>>>>
>>>>
>>>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>>>> to exist...
>>>
>>> Yes, at great expense.
>>>
>>
>> Doesn't seem all that expensive.
>>
>>
>> In terms of LUTs, a soft TLB uses far less than a page walker.
>
> You're thinking in terms of hardware, not software or
> performance.
>

Software cost is usually considered "virtually free" in comparison to
hardware. So what if the virtual memory subsystem needs a few more kB
due to a more complex hardware interface? ...

Would care more about performance if my benchmarks showed it "actually
mattered" in terms of macro-scale performance.


Spending a few orders of magnitude more clock-cycles on a TLB miss
doesn't matter if the TLB miss rate is low enough that it disappears in
the noise.

It is like making a big fuss over the clock-cycle budget of having a
1kHz clock-timer IRQ...

Yes, those clock IRQs eat clock-cycles, but mostly, there isn't too much
reason to care.


Well, except if one wants to do a 32kHz clock-timer like on the MSP430,
currently this is an area where the MSP430 wins (trying to do a 32kHz
timer IRQ basically eats the CPU...).



>> And, the TLB doesn't need to have a mechanism to send memory requests
>> and handle memory responses, ...
>>
>> It uses some Block-RAM's for the TLB, but those aren't too expensive.
>>
>>
>> In terms of performance, it is generally around 1.5 kilocycle per TLB
>> miss (*1), but as-is these typically happen roughly 50 or 100 times per
>> second or so.
>>
>> On a 50 MHz core, only about 0.2% of the CPU time is going into handling
>> TLB misses.
>
> That's not the issue.
>
> The hypervisor has to invoke the guest's
> TLB miss handler, which will have to fault _again_ once it tries
> to write to the TLB to insert an entry; this can lead to several
> round-trips, bouncing between the host and guest several times.
> With nested VMs, this gets significantly worse.
>

So?...

If it is maybe only happening 50 to 100 times a second or so, it doesn't
matter. Thousands or more per second, it does, but in the general case
it does not, provided the CPU has a reasonable sized TLB.

If it did start to be an issue (with programs with a fairly large
working set), one can make the TLB bigger (and/or go to 64K pages, but
this has its own drawbacks).


It maybe matters more if the OS also swaps page tables for multitasking
and if each page-table swap involves a TLB flush, but I am not doing it
that way (one could use ASIDs; in my case I am just using a huge
monolithic virtual address space).


Granted, the use of a monolithic address space is a serious
annoyance when trying to run RISC-V ELF objects on top of this, as GCC
apparently doesn't support either PIE or Shared Objects, ...

At least, my PEL4 binaries were designed to be able to deal with a
monolithic virtual address space (and also use in a NO-MMU environment).

But, in this case, there is also the fallback that I have a 96-bit
address-mode extension with a 65C816-like addressing scheme, which can
mimic having a number of 48-bit spaces within an otherwise monolithic
address space.

Though, does currently have the limitation of effectively dropping the
TLB to 2-way associative when active.


>> [snip]
>> One could also have the guest OS use page-tables FWIW.
>
> How does the hypervisor know the format of the guest's page
> tables, in general?
>

They have designated registers and the tree formats are documented as
part of the ISA/ABI specs...


One could define it such that, if page tables are used in one of the
defined formats and the page is present, the hypervisor could be
allowed to translate the page itself and skip the TLB Miss ISR (falling
back to the ISR if the page-table is flagged as an unknown format).

Though, generally, things like ACL Miss ISR's would still need to be
forwarded to the guest, but these are much less common (it is generally
sufficient to use a 4 or 8 entry cache for ACL checks).


As-is, the defined formats are:
xxx: N-level Page-Table
3 levels for 48b address and 16K pages.
4 levels for 48b address and 4K pages.
Bit pattern encodes tree depth and format.
013: AVL Tree (Full Address)
113: B-Tree (Full Address)
213: Hybrid B-Tree (last-level page table)
313: Hybrid B-Tree (last 2-levels page table)


The B-Tree cases being mostly intended for 96-bit modes, since:
48-bit mode works fine with a conventional page-table;
As noted, 8 level page tables suck...

At present, most of the page-table formats assume 64-bit entries with a
48-bit physical address.
Low order bits are control flags;
Upper 16 bits are typically the ACLID.
The ACLID indirectly encoding "who can do what with this page".

There was an older VUGID system, but this system requires more bits to
encode (user/group/other, rwxrwxrwx). So, it has been partially
deprecated in favor of using ACL checks for everything.


> - Dan C.
>

Dan Cross

unread,
May 23, 2023, 4:18:55 PM5/23/23
to
In article <u4j4qu$2luoh$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/23/2023 1:26 PM, Dan Cross wrote:
>[snip]
>>> On a 50 MHz core, only about 0.2% of the CPU time is going into handling
>>> TLB misses.
>>
>> That's not the issue.
>>
>> The hypervisor has to invoke the guest's
>> TLB miss handler, which will have to fault _again_ once it tries
>> to write to the TLB to insert an entry; this can lead to several
>> round-trips, bouncing between the host and guest several times.
>> With nested VMs, this gets significantly worse.
>>
>
>So?...

I wonder: have you looked into why essentially every modern
architecture in common use today uses hardware page tables?
The hardware engineers working on them are not stupid, and they are
perfectly well aware of everything you said about e.g. larger
TLBs. Yet there is a reason they chose to implement things
the way essentially every extant modern architecture has.
Perhaps they are aware of something you would find illuminating.

The issues I'm talking about very much exist and very much
affect real-world designs. I'll take the slightly larger cost
in transistors over the disadvantages, including forcing
pipeline flushes, thrashing the icache to handle TLB fault
misses, and significantly more complex virtualization.

Besides....what do you do if a guest decides it wants to insert
a mapping covering part the hypervisor itself into the TLB?

> [snip]
>>> One could also have the guest OS use page-tables FWIW.
>>
>> How does the hypervisor know the format of the guest's page
>> tables, in general?
>>
>
>They have designated registers and the tree formats are documented as
>part of the ISA/ABI specs...

The point of a hypervisor is to provide a faithful emulation
of the _hardware_: it's up to the guest to decide what ABI it
uses. The hypervisor can't really force that onto the guest,
and so there's no "ABI" as such in a non-paravirtualized
hypervisor. The whole point is that unmodified guests can run
without change and think that they're running directly on the
bare metal.

It's unclear what the point of an ISA-mandated page table format
would be in a system that doesn't use them. What prevents a
guest from just ignoring them and doing its own thing?

- Dan C.

Dan Cross

unread,
May 23, 2023, 4:36:35 PM5/23/23
to
In article <7O7bM.3161054$iS99.2...@fx16.iad>,
Are we assuming the external processor can write to those
caches? If so, perhaps, but it begs the question: why? I've
evidently got a perfectly capable processor running before the
x86 cores can even come out of reset, and it can mess with the
microarchitectural state of the x86 CPU anyway: why not just
let it train DRAM as well?

If it can't write to those caches, then I don't see how it could
initialize page tables that the CPU would start executing from,
unless it dumped them into a mandatory SRAM buffer or something.

>>>Ok. Coming from the unix/linux world, ring 3 access to those
>>>has generally not been allowed and I don't see removal of that
>>>capability as a loss, but rather a gain.
>>
>>ioperm(2) and iopl(2)? :-)
>
> Hence "generally". Can't recall ever seeing it used.

I suppose I misinterpreted "generally." A general-purpose
mechanism to expose the functionality exists, though I agree
that relatively few applications make use of it. A friend of
mine did a PhD back in the 90s and did use it pretty extensively
to take data from an ISA-bus device (no interrupts though; he
just burned CPU polling).
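
(For reference, the Linux user-space mechanism looks roughly like
this; ioperm(2) and inb() are the real interfaces, the port number is
just an example, and it needs root/CAP_SYS_RAWIO -- x86 only:)

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/io.h>       /* ioperm(), inb() -- x86 Linux */

  int main(void)
  {
      /* grant this process access to one byte-wide port */
      if (ioperm(0x80, 1, 1) != 0) {   /* 0x80: classic POST/debug port */
          perror("ioperm");
          return EXIT_FAILURE;
      }
      printf("port 0x80 = 0x%02x\n", inb(0x80));
      return 0;
  }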

>[snip]
>>Sigh. Actually, I checked the PCIe spec last night and I think
>>I was just wrong. It looks like you can still do it, but
>>address/data register pairs are problematic, so I always just
>>use ECAM.
>
>Yeah, cf8/cfc was always an intel specific mechanism anyway;
>I don't recall it being defined in the PCI Spec (check's 1993
>copy, which states "System dependent issues ... such as mapping
>various PCI address spaces (config, memory, i/o) into host
>CPU address spaces, ordering rules, etc are described in the
>PCI System Design guide" - which I don't have a copy of handy)

I checked my copy of the 6.0 spec yesterday and it mentions
"io" in addition to config and MMIO, which I interpret to mean
port-space IO.

>>I think port IO can be useful at very early boot for writing
>>debugging data to the UART, or for supporting legacy "PC"
>>devices; beyond that, I find it annoying.
>
>Yes, there's no doubt about that. On the AArch64 chips I
>work with, the UART is memory mapped (compatible with the
>ARM PL011 UART), and has a PCI configuration space; so
>the early boot code needs to scan the PCI bus, read the
>BAR and use MMIO through that bar (the PCI Enhanced
>Allocation capability is present, so the BARs are
>fixed rather than programable). Interrupts are handled
>via MSI-X vectors (and being level sensitive, a pair
>of vectors are used, one to assert and one to deassert
>the interrupt).

Level sensitive? Huh.

I wrote a driver for a memory-mapped UART in an AMD SoC complex
a few months ago; we use it for early boot on our machines. It
is soon enough after coming out of reset that I don't bother
with interrupts; we just poll.

Having to go through config space to map a BAR to drive a UART
seems excessive to me.

>[snip]
>Well, existing art included various peek/poke backdoors similar
>to cf8/cfc using MMIO registers at the time. I was on the architecture
>team when we started the ARMv8 processors and pushed using ECAM in our
>implemention; our MIPS chips had used peek/poke registers in the
>PCI controller (onboard devices didn't look like PCI). In the
>ARM chip, for discovery purposes, we chose make all the on-chip
>devices and accelerators look like PCI devices and thus something
>like the ECAM became necessary.

Oh of course; memory-mapped address/data register pairs would
give the same effect.

Making everything look like PCI certainly makes things regular.

>>Definitely some really cool things have happened in that space
>>while those of us slumming it with x86 have looked on with envy.
>
>I've been fortunate to have worked near the bleeding edge:
>from new mainframe architecture (updating a 20 Y.O. Arch) in the early 80s to
>uKernel based distributed Unix-like operating systems in the late 80s/early 90's
>to a distributed version of IRIX in the late 90's transitioning to
>linux (contributed KDB while at SGI) and hypervisors in late 90s
>(somewhat in parallel with disco) and a distributed hypervisor
>in the 2000's (3Leaf Systems) it's been quite a ride.

Nice. 3Leaf sounds interesting; are there any papers on it
available?

- Dan C.

Scott Lurndal

unread,
May 23, 2023, 5:00:41 PM5/23/23
to
BGB <cr8...@gmail.com> writes:
>On 5/23/2023 1:26 PM, Dan Cross wrote:
>> In article <u4ivui$2lenq$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>> On 5/23/2023 11:22 AM, Dan Cross wrote:
>>>> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>>>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>>>>> [snip]
>>>>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>>>>> VM case, where we have to resort to shadow paging techniques, we
>>>>>> can handle L2 page faults in the top-level hypervisor.
>>>>>>
>>>>>
>>>>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>>>>> to exist...
>>>>
>>>> Yes, at great expense.
>>>>
>>>
>>> Doesn't seem all that expensive.
>>>
>>>
>>> In terms of LUTs, a soft TLB uses far less than a page walker.
>>
>> You're thinking in terms of hardware, not software or
>> performance.
>>
>
>Software cost is usually considered "virtually free" in comparison to
>hardware.

That's not my experience. Software has a cost. Substantial even.

And for something so integrally associated with performance,
TLB refills are never free, and table walks aren't zero cost;
hardware or software.

Consider, for example, that a table walk for a guest access
requires up to 22 discrete translation table accesses to fill
a TLB. Hardware walkers often cache intermediate results to
reduce the cost for subsequent walks. In fast internal caches.

Software can't compete with that.



>Would care more about performance if my benchmarks showed it "actually
>mattered" in terms of macro-scale performance.

How representative of real-world workloads are your benchmarks?

>
>
>Spending a few orders of magnitude more clock-cycles on a TLB miss
>doesn't matter if the TLB miss rate is low enough that it disappears in
>the noise.

Good luck with that.

Scott Lurndal

unread,
May 23, 2023, 5:17:19 PM5/23/23
to
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>In article <7O7bM.3161054$iS99.2...@fx16.iad>,
>Scott Lurndal <sl...@pacbell.net> wrote:

>>>Ah, the AMD PSP approach, aka, how to start the engine on a
>>>ship. I guess the days of DRAM training from the BSP are over,
>>>which isn't necessarily a bad thing.
>>
>>That doesn't necessarily follow. It is possible to run the
>>training code from the L1/L2 caches with careful coding.
>
>Are we assuming the external processor can write to those
>caches? If so, perhaps, but it begs the question: why? I've
>evidently got a perfectly capable processor running before the
>x86 cores can even come out of reset, and it can mess with the
>microarchitectural state of the x86 CPU anyway: why not just
>let it train DRAM as well?

In my experience, training the controllers when you have a dozen
or more DRAM channels requires more oomph than the little microcontrollers
are capable off. And the microcontroller may not have access to
the internal bus (mesh/ring) structure required to program
the controllers, read the SPDs etc.

>
>If it can't write to those caches, then I don't see how it could
>initialize page tables that the CPU would start executing from,
>unless it dumped them into a mandatory SRAM buffer or something.

A small page table in part of the LLC should suffice. On the
arm chips, we come with paging disabled, push some code into the
LLC and take the bsp out of reset to initalize the DRAM controllers.

With respect to x86-S, it's all speculation anyway :-)
A PCI Endpoint BAR can be designated an I/O bar, in which case
it slots into the intel 64k I/O (pio) space (in and out instructions).
VERY few devices advertise I/O bars; it's basically deprecated.

An endpoint BAR designated as a memory bar, slots into the physical
address space.

The third PCI address space is the config space and the access
mechanism is up to the host. (peek/poke, ecam).

>
>>>I think port IO can be useful at very early boot for writing
>>>debugging data to the UART, or for supporting legacy "PC"
>>>devices; beyond that, I find it annoying.
>>
>>Yes, there's no doubt about that. On the AArch64 chips I
>>work with, the UART is memory mapped (compatible with the
>>ARM PL011 UART), and has a PCI configuration space; so
>>the early boot code needs to scan the PCI bus, read the
>>BAR and use MMIO through that bar (the PCI Enhanced
>>Allocation capability is present, so the BARs are
>>fixed rather than programable). Interrupts are handled
>>via MSI-X vectors (and being level sensitive, a pair
>>of vectors are used, one to assert and one to deassert
>>the interrupt).
>
>Level sensitive? Huh.

Heritage of the pl011.


>Having to go through config space to map a BAR to drive a UART
>seems excessive to me.

Ah, but it leverages the common OS discovery code. Granted it
makes boot software slightly more complicated, but the flexibility
is worth it.



>>
>>I've been fortunate to have worked near the bleeding edge:
>>from new mainframe architecture (updating a 20 Y.O. Arch) in the early 80s to
>>uKernel based distributed Unix-like operating systems in the late 80s/early 90's
>>to a distributed version of IRIX in the late 90's transitioning to
>>linux (contributed KDB while at SGI) and hypervisors in late 90s
>>(somewhat in parallel with disco) and a distributed hypervisor
>>in the 2000's (3Leaf Systems) it's been quite a ride.
>
>Nice. 3Leaf sounds interesting; are there any papers on it
>available?

There are a couple of patents (one granted, one abandoned)
at the USPTO. Not much else out in the public other than
the various trade mag stuff from that time period.

Basically we built an ASIC that extends the coherency domain
across infiniband (or 10G Ethernet) to allow creation of
large shared-memory cache-coherent systems from 1u or 2u
building blocks. The ASIC talked HyperTransport for AMD
Opteron CPUs and QuickPath for Intel CPUs.

Had a 32-node, 64-processor system at LLNL for evaluation
before the bottom dropped out of the markets, an acquision
fell through and we shut down.

Much the same as CXL-Cache today. We even considered PCIe
as the transport, but the switching latencies for IB were
less than 100ns.

muta...@gmail.com

unread,
May 23, 2023, 7:02:20 PM5/23/23
to
On Wednesday, May 24, 2023 at 5:17:19 AM UTC+8, Scott Lurndal wrote:

> >Nice. 3Leaf sounds interesting; are there any papers on it
> >available?
> There are a couple of patents (one granted, one abandoned)
> at the USPTO. Not much else out in the public other than
> the various trade mag stuff from that time period.
>
> Basically we built an ASIC that extends the coherency domain
> across infiniband (or 10G Ethernet) to allow creation of
> large shared-memory cache-coherent systems from 1u or 2u
> building blocks. The ASIC talked HyperTransport for AMD
> Opteron CPUs and QuikPath for Intel CPUs.
>
> Had a 32-node, 64-processor system at LLNL for evaluation
> before the bottom dropped out of the markets, an acquision
> fell through and we shut down.

What market dropped, when and why?

Thanks. Paul.

BGB

unread,
May 23, 2023, 9:08:43 PM5/23/23
to
I think a lot of this is making a big fuss over nothing, FWIW.

But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
along reasonably well with software-managed TLB.

Their downfall wasn't related to them spending an extra fraction of a
percent of CPU time on handling TLB Miss ISRs.

Similarly, this also wasn't what caused Itanium to fail (nor was it due
to it being VLIW based, etc).

And, likewise, the IBM POWER ISA is still around, ...

...



> Besides....what do you do if a guest decides it wants to insert
> a mapping covering part the hypervisor itself into the TLB?
>

There is no reason for the guest to be able to put something
into the TLB which would somehow circumvent the host, since anything the
guest tries to load into the TLB will need to first get translated
through the host.

This is like asking why a program running in virtual memory can't just
create a pointer into memory inside the kernel:
The application doesn't have access to the kernel's address space to
begin with.

Or, stated another way, the entire "physical address" space for the
guest would itself be a virtual memory space running in user-mode.



>> [snip]
>>>> One could also have the guest OS use page-tables FWIW.
>>>
>>> How does the hypervisor know the format of the guest's page
>>> tables, in general?
>>>
>>
>> They have designated registers and the tree formats are documented as
>> part of the ISA/ABI specs...
>
> The point of a hypervisor is to provide a faithful emulation
> of the _hardware_: it's up to the guest to decide what ABI it
> uses. The hypervisor can't really force that onto the guest,
>and so there's no "ABI" as such in a non-paravirtualized
> hypervisor. The whole point is that unmodified guests can run
> without change and think that they're running directly on the
> bare metal.
>
> It's unclear what the point of an ISA-mandated page table format
> would be in a system that doesn't use them. What prevents a
> guest from just ignoring them and doing its own thing?
>

You can have either accurate hardware level emulation, or slightly
better performance, and make a tradeoff there.

If the OS wants its own page-table format, it can specify that it is
using its own encoding easily enough via the tag bits in the TTB
register or similar.

And if it claims to be using a standard table format but is doing
something different, and crashes as a result, well, that is its problem;
and/or one adds a flag or similar to the emulator to disable any
"faster" page translation.


Not like it is likely to matter all that much.
Hence, why I was using B-Trees for the 96-bit mode...


One can note that fetching something from a B-Tree is not exactly a fast
operation, but still roughly 3 orders of magnitude faster than swapping
a page in the pagefile.

...


> - Dan C.
>

BGB

unread,
May 24, 2023, 1:12:42 AM5/24/23