
x86-S


Dan Cross

May 20, 2023, 1:07:18 PM
Likely to be filed under the "Too Little, Too Late" category,
Intel has recently put forth a proposal for simplifying the x86_64
architecture and, in particular, discarding some of the legacy
baggage of a 45-year-old architecture. Details, including a
link to the specifically proposed architectural changes, are
here:
https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html

Seemingly the biggest additions from an OS writer's perspective,
assuming a 64-bit system, are:

1. A pair of MSRs for switching between 4- and 5-level page
tables without the need to trampoline through 32-bit
protected mode and disable paging as an intermediate
step, and
2. A change to the SIPI sequence for MP startup where APs come
up directly in 64-bit mode, with paging enabled. This is
accomplished by introduction of a new MSR where one can put
a pointer to a small data structure that includes initial
values for %rip, %cr0, %cr3, and %cr4.

In tandem, a whole bunch of stuff is eliminated: legacy 16 bit
modes, call gates, segment limits and base for 32-bit mode, etc.
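
For concreteness, the startup structure in item (2) might look
something like the following as a C struct. This is only a sketch
based on the fields named above; the exact layout, field order, and
the name of the MSR that points at it are guesses, not the spec's:

#include <stdint.h>

/* Sketch of the x86-S 64-bit SIPI startup structure. A physical
 * pointer to this goes into a new MSR; an AP receiving the new
 * SIPI comes up directly in 64-bit paged mode with this state.
 * Field order and padding here are illustrative only. */
struct sipi_entry {
    uint64_t start;  /* bit 0: startup vs. shutdown; bits 1-63 reserved */
    uint64_t rip;    /* initial %rip for the AP */
    uint64_t cr3;    /* initial %cr3: root of the AP's page tables */
    uint64_t cr0;    /* initial %cr0 */
    uint64_t cr4;    /* initial %cr4 */
};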

Frankly, this seems like putting lipstick on a pig: all of this
seems "nice", I suppose, but I don't see the point of much of
it. Consider the 64-bit SIPI sequence, for example: sure, this
eliminates some boilerplate trampoline code at AP bringup, but
a) that code isn't very hard to write, and b) once it's written
it is not as though one goes around often changing it. I
suppose that a benefit is that an AP may be able to start
running against an address space that is defined by a page
table with the PML4 or PML5 somewhere above 4GiB in the physical
address space, but that seems like a minor win in the grand
scheme of things.

The L4<->L5 paging thing appears useful at first blush, but how
often is one doing that on a single system? It seems unlikely
to be particularly useful in practice, especially since
paging is under the control of the operating system: to what end
would it oscillate back and forth between two page-table depths?
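
If the mechanism is just a WRMSR, the switch would presumably look
something like the following. The MSR index and bit assignment here
are invented for illustration, and the ordering and TLB-flush rules
are likewise assumptions:

#include <stdint.h>

/* Placeholder MSR index; NOT a real architectural MSR number. */
#define MSR_PAGING_DEPTH 0x00000bad

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ __volatile__("wrmsr" : : "c"(msr),
                         "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

/* Switch between 4- and 5-level paging without leaving 64-bit
 * mode: select the new depth, then load a root of that depth.
 * (Which write goes first, and what gets flushed, is a guess.) */
static void set_paging_depth(int levels, uint64_t new_cr3)
{
    wrmsr(MSR_PAGING_DEPTH, (levels == 5) ? 1 : 0);
    __asm__ __volatile__("mov %0, %%cr3" : : "r"(new_cr3) : "memory");
}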

On the other hand, many annoying vestiges of the past are left:
the TSS continues to have a stack table for mapping kernel
stacks (why not just make those MSRs?); an opportunity to
simplify (or eliminate) the IDT was lost; segmentation very much
remains part of the architecture (and part of the 64-bit syscall
mechanism!); removal of "unrestricted guest mode" from VMX makes
writing a VMM to support legacy operating systems that much
harder.

So it raises the question: what is the point of this proposal? It
doesn't seem to add much that is particularly useful, while
removing some things that are (unrestricted guest mode) and
ignoring most of the historical barnacles of the architecture.

Sorry, Intel: I know it's your cash cow, but it's time to put x86
to bed. This won't change that.

- Dan C.

JJ

May 21, 2023, 6:20:05 AM
On Sat, 20 May 2023 17:07:16 -0000 (UTC), Dan Cross wrote:
> Likely to be filed under the "Too Little, Too Late" category,
> Intel has recently put forth a proposal for simplifying the x86_64
> architecture and, in particular, discarding some of the legacy
> baggage of a 45-year-old architecture. Details, including a
> link to the specifically proposed architectural changes, are
> here:
> https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
>
> Seemingly the biggest additions from an OS writer's perspective,
> assuming a 64-bit system, are:
>
> 1. A pair of MSRs for switching between 4- and 5-level page
> tables without the need to trampoline through 32-bit
> protected mode and disable paging as an intermediate
> step, and
> 2. A change to the SIPI sequence for MP startup where APs come
> up directly in 64-bit mode, with paging enabled. This is
> accomplished by introduction of a new MSR where one can put
> a pointer to a small data structure that includes initial
> values for %rip, %cr0, %cr3, and %cr4.
>
> In tandem, a whole bunch of stuff is eliminated: legacy 16 bit
> modes, call gates, segment limits and base for 32-bit mode, etc.
[snip]

Intel's i64 version 2.

And I bet that, if it's adopted, the CPU would not be cheaper than
x86-64, despite being mostly just a trimmed-down version of x86-64.

Luke A. Guest

May 21, 2023, 6:24:35 AM
On 20/05/2023 18:07, Dan Cross wrote:
> Likely to be filed under the "Too Little, Too Late" category,
> Intel has recently put forth a proposal for simplifying the x86_64
> architecture and, in particular, discarding some of the legacy
> baggage of a 45-year-old architecture. Details, including a
> link to the specifically proposed architectural changes, are
> here:
> https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html

Or just dump the archaic x86(-64) arch and go full RISC; with current
manufacturing technology, they'd be speedy as fuck. Apparently all x86
have been RISC underneath for ages now anyway.

Dan Cross

May 21, 2023, 10:20:05 AM
In article <u4crh1$1hngg$1...@dont-email.me>,
The issue there is backwards compatibility with an enormous
installed base. Intel tried back in the early 00s with the
Itanium; it did not go well for them. They had previously tried
back in the 80s with the i432 and that was even worse. In an
amazing display of lack of self-awareness, they made many of the
same mistakes with Itanium that they had previously made with
i432 (waiting for the perfect compiler to make the thing go fast
was a mistake with both projects). They made similar errors
with the i860, too, and made other RISC mistakes with the i960
(which was marketed as a microcontroller)---notably, the 960
used register windows, like SPARC and the original Berkeley RISC
machine. History has shown this seemingly nifty idea not to be
that great in practice.

Modern x86 is a weird beast; I understand it's a pretty standard
dataflow processor underneath all of the x86 goo. The x86 stuff
is weird, but is actually a relatively small percentage of the
overall die surface (like, 5% or something). Some compiler
writers think of it as a relatively compact bytecode sitting
over a RISC core, with the side-benefit of being relatively
cache efficient. I'm not sure I buy that argument, particularly
as something like RISC-V, with its compressed opcode profile, is
both a very clean RISC and competitively compact in terms of
e.g. icache space.

- Dan C.

Scott Lurndal

May 21, 2023, 10:49:05 AM
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>Likely to be filed under the "Too Little, Too Late" category,
>Intel has recently put forth a proposal for simplifying the x86_64
>architecture and, in particular, discarding some of the legacy
>baggage of a 45-year-old architecture. Details, including a
>link to the specifically proposed architectural changes, are
>here:
>https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
>
>Seemingly the biggest additions from an OS writer's perspective,
>assuming a 64-bit system, are:
>
>1. A pair of MSRs for switching between 4- and 5-level page
> tables without the need to trampoline through 32-bit
> protected mode and disable paging as an intermediate
> step, and
>2. A change to the SIPI sequence for MP startup where APs come
> up directly in 64-bit mode, with paging enabled. This is
> accomplished by introduction of a new MSR where one can put
> a pointer to a small data structure that includes initial
> values for %rip, %cr0, %cr3, and %cr4.
>
>In tandem, a whole bunch of stuff is eliminated: legacy 16 bit
>modes, call gates, segment limits and base for 32-bit mode, etc.
>
>Frankly, this seems like putting lipstick on a pig: all of this
>seems "nice", I suppose, but I don't see the point of much of
>it.

It certainly will simplify the tasks for the architecture team,
the RTL implementation team(s), the verification team(s), the
post-silicon team(s), and the software folks. Not something to be
sneezed at: it's a lot of work verifying the legacy support, with all
the odd corner cases and fifty-year-old test cases. There will be some, likely
minor, improvements in area utilization at the current node, so perhaps
they can squeeze in another core or two in the same area. Or
additional memory controllers. etc.

I've heard rumblings about this for a decade now from colleagues
(mostly former intel processor designers).

Dan Cross

May 21, 2023, 12:17:10 PM
In article <2aqaM.646077$5S78....@fx48.iad>,
This is the thing though: how much of a difference will _these_
changes really make here? I'll certainly concede that they will
make some difference, but is it enough to be worthwhile given
the churn this will force on software writers?

Of all the weirdness in x86, eliminating this stuff seems like
pretty small potatoes. I mean, they could get rid of
segmentation entirely in 64-bit mode, and that seems like it
would have a bigger effect (no more `POP SS` style problems), or
they could fix NMI handling, or make the ISA generally more
orthogonal and predictable, so that you could reason about it
without resorting to deep study of Agner's site....

>There will be some, likely
>minor, improvements in area utilization at the current node, so perhaps
>they can squeeze in another core or two in the same area. Or
>additional memory controllers. etc.

I dunno.... Looking at the massive surface area of SIMD units
and their associated caches and registers on Xeon cores, I doubt
this is the limiting factor. Still, more memory controllers are
always welcome.

>I've heard rumblings about this for a decade now from colleagues
>(mostly former intel processor designers).

That's intriguing; I wonder what this says about x86 at Intel
generally. Are they running into a wall with their ability to
add functionality to the architecture?

- Dan C.

Scott Lurndal

May 21, 2023, 1:15:00 PM
I believe it to be a significant factor. Their primary concern is
Windows and Linux; it will require very little software churn there, and Intel
will likely provide patches directly to the vendor(s) (e.g. Microsoft)
and directly to the Linux kernel mailing list.

Yes, it will basically put Paul and PDOS out of business, but he can
always run in emulation.

>
>Of all the weirdness in x86, eliminating this stuff seems like
>pretty small potatoes. I mean, they could get rid of
>segmentation entirely in 64-bit mode,

Now, that would be churn. Granted AMD's SVM and Intel's VT-X
(once nested page tables/extended page tables were added) have
basically eliminated the last need for segment limit register
(which AMD had removed in amd64 but added back later because
XEN needed them for paravirtualization (before NPT/EPT)).

I see this proposal eliminates limit checking, which was only
added to x86_64 to support Xen.

Most of these changes will affect the boot loaders and secondary
bootstrap for the most part, which is where the progression from
power-on through real-mode, protected-mode, enable paging, enable
longmode occurs. There should be very few changes to the modern
kernels (windows, linux), if any.


>and that seems like it
>would have a bigger effect (no more `POP SS` style problems), or
>they could fix NMI handling, or make the ISA generally more
>orthogonal and predictable,

Again, that's churn that affects applications or OS. While getting
rid of the iomap and IOPL==3 -might- affect applications,
it is quite unlikely and there are alternatives for most modern
operating systems (e.g. virt-io in linux) to grant access to
the hardware to user-mode applications.


>>There will be some, likely
>>minor, improvements in area utilization at the current node, so perhaps
>>they can squeeze in another core or two in the same area. Or
>>additional memory controllers. etc.
>
>I dunno.... Looking at the massive surface area of SIMD units
>and their associated caches and registers on Xeon cores, I doubt
>this is the limiting factor. Still, more memory controllers are
>always welcome.
>
>>I've heard rumblings about this for a decade now from colleagues
>>(mostly former intel processor designers).
>
>That's intriguing; I wonder what this says about x86 at Intel
>generally. Are they running into a wall with their ability to
>add functionality to the architecture?

The main grumbling was always about the arcane boot process
still requiring full 8086 semantics and the long process to
get to long mode. The complexity of verification and post
silicon testing (legacy 8259 PIC et al) was also discussed.

Each generation adds more features that need formal verification
during the design process (and to ensure correct functionality
with new internal interconnect structures (bus, ring, mesh, et alia))
and that costs more man-hours to integrate into the design without
breaking any of the legacy stuff. If this all lets them tape
a new chip out a month early, that's a win.

Dan Cross

May 21, 2023, 5:07:07 PM
In article <lksaM.450847$ZhSc....@fx38.iad>,
If, as you say, the concern is doing away with 16-bit entirely,
from a complexity/testing perspective, then I suppose I can see
it.

>Yes, it will basically put Paul and PDOS out of business,

What a shame.

>but he can always run in emulation.

What a shame.

>>Of all the weirdness in x86, eliminating this stuff seems like
>>pretty small potatoes. I mean, they could get rid of
>>segmentation entirely in 64-bit mode,
>
>Now, that would be churn. Granted AMD's SVM and Intel's VT-X
>(once nested page tables/extended page tables were added) have
>basically eliminated the last need for segment limit register
>(which AMD had removed in amd64 but added back later because
>XEN needed them for paravirtualization (before NPT/EPT)).

Would it really? Limits and base are already ignored in long
mode; about the only thing it's still used for is GSBASE/FSBASE
and for that we have MSRs. But, having to program non-null
segment selectors into STAR, and having to have a valid GDT,
adds seemingly unnecessary complexity. If they're going to
swap around how they do AP startup with a brand-new SIPI type,
it doesn't seem like a big lift to just do away with
segmentation entirely.

>I see this proposal eliminates limit checking, which was only
>added to x86_64 to support Xen.

I believe that's for 32-bit mode? Both AMD and Intel already
ignore segment limits in 64-bit mode, and both effectively
ignore segment base as well. At least for CS, DS, ES and SS; FS
and GS are treated specially for thread- and core-local storage,
but I don't think any of that is changing in this proposal.

>Most of these changes will affect the boot loaders and secondary
>bootstrap for the most part, which is where the progression from
>power-on through real-mode, protected-mode, enable paging, enable
>longmode occurs. There should be very few changes to the modern
>kernels (windows, linux), if any.

Yup. That code (as you well know) tends to be write-once and
mostly forget. I get that the hardware folks want to make their
lives easier, but in the short term, this adds complexity to the
code (which must now be specialized to detect whether it's
running on an x86-S CPU or current x86_64 and behave
accordingly). I suppose we could treat x86-S as an entirely
separate architecture.

>>and that seems like it
>>would have a bigger effect (no more `POP SS` style problems), or
>>they could fix NMI handling, or make the ISA generally more
>>orthogonal and predictable,
>
>Again, that's churn that affects applications or OS. While getting
>rid of the iomap and IOPL==3 -might- affect applications,
>it is quite unlikely and there are alternatives for most modern
>operating systems (e.g. virt-io in linux) to grant access to
>the hardware to user-mode applications.

I don't see how virtio can give a user-application pass-through
access to programmed IO, but I appreciate an argument that says
that there can be a uioring sort of thing to communicate IO
requests from userspace to the kernel without a trap.

For that matter, I don't see how doing away with segmentation in
the 64-bit mode really adds that much, either: typically, once
one gets into long mode, one sets the segment registers
exactly once and never touches them again, except where one is
forced to by some superfluous weirdness with syscall/sysret
and exceptions. On the other hand, if we just started treating cs,
ds, es, and ss as WI/RAZ registers, and eliminated oddities like
"interrupts are blocked until after the instruction after a pop
into ss", it would simplify the hardware _and_ close a stupid
potential security bug. What would really be the problem here?
We could leave the trap format untouched for compatibility with
32-bit mode.

>>>There will be some, likely
>>>minor, improvements in area utilization at the current node, so perhaps
>>>they can squeeze in another core or two in the same area. Or
>>>additional memory controllers. etc.
>>
>>I dunno.... Looking at the massive surface area of SIMD units
>>and their associated caches and registers on Xeon cores, I doubt
>>this is the limiting factor. Still, more memory controllers are
>>always welcome.
>>
>>>I've heard rumblings about this for a decade now from colleagues
>>>(mostly former intel processor designers).
>>
>>That's intriguing; I wonder what this says about x86 at Intel
>>generally. Are they running into a wall with their ability to
>>add functionality to the architecture?
>
>The main grumbling was always about the arcane boot process
>still requiring full 8086 semantics and the long process to
>get to long mode. The complexity of verification and post
>silicon testing (legacy 8259 PIC et al) was also discussed.

The process to get to long mode is annoying, sure, but it's not
that long: what, about fifty instructions or so to go from the
first instruction on the SIPI page to running in a high-level
language? Slightly more on the BSP, sure, but not much more.

I can appreciate the desire to get rid of 8086 semantics, but I
am much less sympathetic to the argument about code distance
from boot to long mode.
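
For reference, the part being eliminated is small. An outline of the
canonical sequence (the architectural constants are real; the rest is
a sketch, not any particular kernel's trampoline, and the 16/32-bit
steps are necessarily assembly in real life):

#include <stdint.h>

/* Constants used by the classic AP trampoline that the new SIPI
 * makes unnecessary. The sequence, in outline:
 *   1. Real mode (where the SIPI lands): lgdt a small GDT, set
 *      CR0.PE, far-jump into a 32-bit code segment.
 *   2. Protected mode: set CR4.PAE, load %cr3 with the PML4,
 *      set EFER.LME, set CR0.PG, far-jump into a 64-bit segment.
 *   3. Long mode: load a stack and call into C.
 * A few dozen instructions, written once and rarely touched. */
#define IA32_EFER 0xC0000080u   /* extended feature enable MSR */
#define EFER_LME  (1u << 8)     /* long mode enable */
#define CR0_PE    (1u << 0)     /* protection enable */
#define CR0_PG    (1u << 31)    /* paging enable */
#define CR4_PAE   (1u << 5)     /* physical address extension */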

And systems like Linux are still quite adamant about using e.g.
the PIT and 8259 to calibrate the TSC when they can (which is
super annoying when you're trying to bring up a new hypervisor)!

>Each generation adds more features that need formal verification
>during the design process (and to ensure correct functionality
>with new internal interconnect structures (bus, ring, mesh, et alia))
>and that costs more man-hours to integrate into the design without
>breaking any of the legacy stuff. If this all lets them tape
>a new chip out a month early, that's a win.

I suppose....

I mean, don't get me wrong, I believe you, but I find the entire
thing a bit dubious.

- Dan C.

BGB

May 21, 2023, 11:07:19 PM
It does help.

Though, this doesn't fix a lot of the ugly issues with x86-64.
But, by the time one cleans up the mess, what one would have isn't
really x86-64 anymore...


>> Yes, it will basically put Paul and PDOS out of business,
>
> What a shame.
>
>> but he can always run in emulation.
>
> What a shame.
>

At this point, it almost seems like maybe x86 should be "put out to
pasture" (as a native hardware-level instruction set) and pretty much
the whole mess moved over to emulation (using "a good JIT" can be
reasonably effective).

Ideally, the native ISA should be open, and able to be used directly by
software, with a fallback "x86 emulation" mode that can launch a
conventional OS in a sort of emulator/hypervisor (so, say, one can run
an x86-64 OS on it, probably with reduced performance).


But, OTOH, not like there is an ideal alternative:
* ARM: Not ideal, as it is not an open ISA;
* RISC-V: Open ISA, but still a little lacking;
* BJX2: Could work OK, but I am a bit biased here...


I would almost think RISC-V could be a good solution, but:
Some of its design choices I don't feel are ideal;
The maturity of the tooling is unexpectedly lacking:
GCC is missing things like PIE and shared objects;
Many of the extensions are half-baked and don't mesh together well.
...

Getting good performance from RISC-V would still require a "clever" CPU
core, as naive designs would fall short in terms of performance (and
there are some issues here that are "not fixable" within the existing
encoding scheme).

For a small device, something akin to a RISC-V core running an x86
emulator as part of its ROM could almost make sense (but, performance
wouldn't be so great as the use of condition-codes by x86 would be a
serious issue for RISC-V's ISA design).



One could almost argue for reviving IA-64, except some of its design
choices don't make much sense either.

One likely needs, say:
Likely a VLIW or similar;
Supporting an in-order or out-of-order pipeline;
Probably 64 or 128 general-purpose registers;
Directly accessible, no register windows;
Has built-in support for SIMD operations;
Likely FPU existing as a subset of the SIMD operators;
Supports large-immediate forms in the instruction stream;
Needs to be able to support 32 and 64-bit inline constants.
Instruction predication;
...


Instruction size is an issue.

In my ISA, I was using 32-bit instructions, but it isn't really possible
to fit "everything I would want" into a 32-bit instruction format (in my
case, this leads to a certain inescapable level of non-orthogonality,
and also some larger variable-length instructions).


Could fit a bit more into a 48-bit instruction format, but code-density
would be worse. Well, and/or one uses an instruction encoding scheme
similar to IA-64... Both have drawbacks.


Then again, if Intel was like "We are bringing back something kinda
similar to IA-64 but with AVX glued on", possibly it wouldn't go over
very well...

Well, and also when one considers that some of the compiler
weaknesses/difficulties are still not an entirely solved issue.

Though, it is likely that rather than focusing on traditional native
compilation, the emphasis would be on JIT compiling x86 and x86-64 code
to the thing.


>>> Of all the weirdness in x86, eliminating this stuff seems like
>>> pretty small potatoes. I mean, they could get rid of
>>> segmentation entirely in 64-bit mode,
>>
>> Now, that would be churn. Granted AMD's SVM and Intel's VT-X
>> (once nested page tables/extended page tables were added) have
>> basically eliminated the last need for segment limit register
>> (which AMD had removed in amd64 but added back later because
>> XEN needed them for paravirtualization (before NPT/EPT)).
>
> Would it really? Limits and base are already ignored in long
> mode; about the only thing it's still used for is GSBASE/FSBASE
> and for that we have MSRs. But, having to program non-null
> segment selectors into STAR, and having to have a valid GDT,
> adds seemingly unnecessary complexity. If they're going to
> swap around how they do AP startup with a brand-new SIPI type,
> it doesn't seem like a big lift to just do away with
> segmentation entirely.
>

Ironically, if one goes over to software managed TLB, then the whole
"nested page table" thing can disappear into the noise (as it is all
software).


Or, you have people like me going and using B-Trees in place of
page-tables, since B-Trees don't waste as much memory when one has a
sparse address space managed with aggressive ASLR (the pages in the
upper levels of the page-tables being almost entirely empty with sparse
ASLR).

Granted, I don't expect many other people are likely to consider using
B-Trees in place of page-tables to be a sensible idea.



>> I see this proposal eliminates limit checking, which was only
>> added to x86_64 to support Xen.
>
> I believe that's for 32-bit mode? Both AMD and Intel already
> ignore segment limits in 64-bit mode, and both effectively
> ignore segment base as well. At least for CS, DS, ES and SS; FS
> and GS are treated specially for thread- and core-local storage,
> but I don't think any of that is changing in this proposal.
>

Yeah.


>> Most of these changes will affect the boot loaders and secondary
>> bootstrap for the most part, which is where the progression from
>> power-on through real-mode, protected-mode, enable paging, enable
>> longmode occurs. There should be very few changes to the modern
>> kernels (windows, linux), if any.
>
> Yup. That code (as you well know) tends to be write-once and
> mostly forget. I get that the hardware folks want to make their
> lives easier, but in the short term, this adds complexity to the
> code (which must now be specialized to detect whether it's
> running on an x86-S CPU or current x86_64 and behave
> accordingly). I suppose we could treat x86-S as an entirely
> separate architecture.
>

Presumably you could only boot these from EFI anyways.
In some ways it makes some sense...


But, if one is going to be breaking backward compatibility anyways (as
far as the OS is concerned), it still almost makes sense to consider
abandoning x86 entirely and dealing with legacy software via emulation
(with a "good enough" emulator being provided that the OS's are less
likely to try to force all of the binaries over to the new instruction set).

But, it does mean that, besides things being convenient for the OS devs,
one likely needs a design capable of getting up to ~ 90%+ of native x86
performance via JIT (this part would likely be a serious problem for
trying to emulate x86 via RISC-V).


> - Dan C.
>

Luke A. Guest

May 22, 2023, 6:00:49 AM
On 21/05/2023 15:18, Dan Cross wrote:
> In article <u4crh1$1hngg$1...@dont-email.me>,
> Luke A. Guest <lag...@archeia.com> wrote:
>> On 20/05/2023 18:07, Dan Cross wrote:
>>> Likely to be filed under the "Too Little, Too Late" category,
>>> Intel has recently put forth a proposal for simplifying the x86_64
>>> architecture and, in particular, discarding some of the legacy
>>> baggage of a 45-year-old architecture. Details, including a
>>> link to the specifically proposed architectural changes, are
>>> here:
>>> https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
>>
>> Or just dump the archaic x86(-64) arch and go full RISC; with current
>> manufacturing technology, they'd be speedy as fuck. Apparently all x86
>> have been RISC underneath for ages now anyway.
>
> The issue there is backwards compatibility with an enormous
> installed base. Intel tried back in the early 00s with the
> Itanium; it did not go well for them. They had previously tried

I know, I remember it well: all the decent RISC chip manufacturers
basically dumping their stuff for something inferior.

As I said already, chips already do x86/64 emulation on top of a RISC
architecture, so what's the problem?

> back in the 80s with the i432 and that was even worse. In an
> amazing display of lack of self-awareness, they made many of the
> same mistakes with Itanium that they had previously made with
> i432 (waiting for the perfect compiler to make the thing go fast
> was a mistake with both projects). They made similar errors
> with the i860, too, and made other RISC mistakes with the i960
> (which was marketed as a microcontroller)---notably, the 960
> used register windows, like SPARC and the original Berkeley RISC
> machine. History has shown this seemingly nifty idea not to be
> that great in practice.

It's not that it's "not too great"; it's more that they fucked it up by
doing weird shit, like making a CPU have so many instructions just to
run Ada. I use Ada, it's great, but that was just stupid. But that was a
time when Ada features required compiler features that weren't invented
or had only just been, and were slow.


Dan Cross

May 22, 2023, 7:09:07 AM
In article <u4em65$213rr$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/21/2023 4:07 PM, Dan Cross wrote:
>> [snip]
>> Would it really? Limits and base are already ignored in long
>> mode; about the only thing it's still used for is GSBASE/FSBASE
>> and for that we have MSRs. But, having to program non-null
>> segment selectors into STAR, and having to have a valid GDT,
>> adds seemingly unnecessary complexity. If they're going to
>> swap around how they do AP startup with a brand-new SIPI type,
>> it doesn't seem like a big lift to just do away with
>> segmentation entirely.
>>
>
>Ironically, if one goes over to software managed TLB, then the whole
>"nested page table" thing can disappear into the noise (as it is all
>software).
>
>
>Or, you have people like me going and using B-Trees in place of
>page-tables, since B-Trees don't waste as much memory when one has a
>sparse address space managed with aggressive ASLR (the pages in the
>upper levels of the page-tables being almost entirely empty with sparse
>ASLR).
>
>Granted, I don't expect many other people are likely to consider using
>B-Trees in place of page-tables to be a sensible idea.

Software-managed TLBs actually dramatically complicate
address space management in a hypervisor, in part because the
page table used by a guest is not generally knowable in advance
(a guest can just make up their own).

Why a B-tree instead of a radix tree, anyway?

- Dan C.

Scott Lurndal

May 22, 2023, 10:48:06 AM
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>In article <lksaM.450847$ZhSc....@fx38.iad>,
>Scott Lurndal <sl...@pacbell.net> wrote:

>
>>Yes, it will basically put Paul and PDOS out of business,
>
>What a shame.
>

:-)


>>Now, that would be churn. Granted AMD's SVM and Intel's VT-X
>>(once nested page tables/extended page tables were added) have
>>basically eliminated the last need for segment limit register
>>(which AMD had removed in amd64 but added back later because
>>XEN needed them for paravirtualization (before NPT/EPT)).
>
>Would it really? Limits and base are already ignored in long
>mode;

They were in the original AMD64 implementation, but that changed
quickly - the data segment limit is still enforced in long mode
(to support XEN (and VMware) paravirt).

https://www.pagetable.com/?p=25

A subsequent Opteron added it back just for the data segment.
I can't speak for Intel in this instance, but AMD definitely
added support for the DS limit checking in long mode in the
second or third generation Opteron.

> about the only thing it's still used for is GSBASE/FSBASE
>and for that we have MSRs. But, having to program non-null
>segment selectors into STAR, and having to have a valid GDT,
>adds seemingly unnecessary complexity. If they're going to
>swap around how they do AP startup with a brand-new SIPI type,
>it doesn't seem like a big lift to just do away with
>segmentation entirely.
>
>>I see this proposal eliminates limit checking, which was only
>>added to x86_64 to support Xen.
>
>I believe that's for 32-bit mode? Both AMD and Intel already
>ignore segment limits in 64-bit mode, and both effectively

See above.

>>Most of these changes will affect the boot loaders and secondary
>>bootstrap for the most part, which is where the progression from
>>power-on through real-mode, protected-mode, enable paging, enable
>>longmode occurs. There should be very few changes to the modern
>>kernels (windows, linux), if any.
>
>Yup. That code (as you well know) tends to be write-once and
>mostly forget. I get that the hardware folks want to make their
>lives easier, but in the short term, this adds complexity to the
>code (which must now be specialized to detect whether it's
>running on an x86-S CPU or current x86_64 and behave
>accordingly). I suppose we could treat x86-S as an entirely
>separate architecture.

A couple of checks of the CPUID output and a bit of new code
seem reasonable to position for the future.
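
Something like the following, presumably; the CPUID leaf and bit
that would advertise x86-S are placeholders here, since the real
enumeration is whatever the final spec assigns:

#include <stdint.h>

static inline void cpuid(uint32_t leaf, uint32_t sub, uint32_t r[4])
{
    __asm__ __volatile__("cpuid"
                         : "=a"(r[0]), "=b"(r[1]), "=c"(r[2]), "=d"(r[3])
                         : "a"(leaf), "c"(sub));
}

/* Placeholder feature test: leaf 7/subleaf 0, EDX bit 0 is made up
 * for illustration; the real leaf/bit would come from the spec. */
static int cpu_is_x86s(void)
{
    uint32_t r[4];
    cpuid(7, 0, r);
    return (r[3] & 1u) != 0;
}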

>
>I don't see how virtio can give a user-application pass-through
>access to programmed IO, but I appreciate an argument that says
>that there can be a uioring sort of thing to communicate IO
>requests from userspace to the kernel without a trap.

We do that all the time on our processors. Applications like DPDK
and Open Data Plane (ODP) rely on user-mode access to the
device MMIO (often using SR-IOV virtual functions) space and direct
DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.

Interrupts are still mediated by the OS (virt-io provides these
capabilities), although DPDK/ODP generally poll completion rings
rather than use interrupts.
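
In outline, the fast path is just a poll loop over memory the device
writes; a sketch (the ring layout is invented for illustration, not
any real NIC's descriptor format or DPDK's actual API):

#include <stdint.h>

/* Hypothetical completion-queue entry; real devices define their
 * own layouts. The ring lives in memory the device DMAs into. */
struct cq_entry {
    volatile uint32_t done;   /* device sets nonzero on completion */
    uint32_t          len;    /* bytes transferred */
    uint64_t          buf;    /* buffer the data landed in */
};

/* Busy-poll the ring instead of taking an interrupt; there is no
 * kernel trap anywhere on this path. */
static struct cq_entry *cq_poll(struct cq_entry *ring,
                                uint32_t *head, uint32_t mask)
{
    struct cq_entry *e = &ring[*head & mask];
    if (!e->done)
        return 0;             /* nothing completed yet */
    (*head)++;
    return e;
}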

>
>For that matter, I don't see how doing away with segmentation in
>the 64-bit mode really adds that much, either

Again, it's the hardware implementation and verification cost
that's being saved.

Dan Cross

May 22, 2023, 12:48:56 PM
In article <HfLaM.617957$Ldj8....@fx47.iad>,
Scott Lurndal <sl...@pacbell.net> wrote:
>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>In article <lksaM.450847$ZhSc....@fx38.iad>,
>>Scott Lurndal <sl...@pacbell.net> wrote:
>[snip]
>>>Now, that would be churn. Granted AMD's SVM and Intel's VT-X
>>>(once nested page tables/extended page tables were added) have
>>>basically eliminated the last need for segment limit register
>>>(which AMD had removed in amd64 but added back later because
>>>XEN needed them for paravirtualization (before NPT/EPT)).
>>
>>Would it really? Limits and base are already ignored in long
>>mode;
>
>They were in the original AMD64 implementation, but that changed
>quickly - the data segment limit is still enforced in long mode
>(to support XEN (and VMware) paravirt).
>
>https://www.pagetable.com/?p=25
>
>A subsequent opteron added it back just for the data segment.
>I can't speak for Intel in this instance, but AMD definitely
>added support for the DS limit checking in long mode in the
>second or third generation opteron.

Funny, I checked both the SDM and the AMD APM before posting,
and both say segment checking is not enforced in 64-bit mode.
(SDM vol 3A sec 5.3.1 and APM vol 2 sec 4.8.2). Ah, wait;
I see now: APM vol 2 sec 4.12.2 says that they _are_ enforced
if EFER.LMSLE is set to 1 (see, this is why we can't have nice
things). Apparently, Intel never did this since they had VT-x,
so I don't imagine they care that much for x86S. With SVM, I
wonder how much AMD cares, either.
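
(Checking for it is at least trivial; EFER is MSR 0xC0000080, and
LMSLE is bit 13 on the AMD parts that implement it. A sketch:)

#include <stdint.h>

#define IA32_EFER  0xC0000080u
#define EFER_LMSLE (1ull << 13) /* AMD: long mode segment limit enable */

static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

/* DS limit checks apply in long mode only when this returns true
 * (and only on AMD parts that implement LMSLE at all). */
static int lm_limit_checks_enabled(void)
{
    return (rdmsr(IA32_EFER) & EFER_LMSLE) != 0;
}
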
Yeah. That adds a modicum of complexity to the boot path,
but is obviously doable.

It's still not entirely clear to me how the BSP/BSC is supposed
to boot, however. If the world starts in 64-bit mode, and that
still requires paging to be enabled, then who sets up the page
tables that the BSP starts up on?

>>I don't see how virtio can give a user-application pass-through
>>access to programmed IO, but I appreciate an argument that says
>>that there can be a uioring sort of thing to communicate IO
>>requests from userspace to the kernel without a trap.
>
>We do that all the time on our processors. Applications like DPDK
>and Open Data Plane (ODP) rely on user-mode access to the
>device MMIO (often using SR-IOV virtual functions) space and direct
>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.

Ok, sure, but that's not PIO. Unprivileged access to the PIO
space seems like it's just going away. I think that's probably
fine as almost all high-speed devices are memory-mapped anyway,
so we're just left with legacy things like the UART or PS/2
keyboard controller or whatever.

>Interrupts are still mediated by the OS (virt-io provides these
>capabilities), although DPDK/ODP generally poll completion rings
>rather than use interrupts.

Really? Even with SR-IOV and the interrupt remapping tables in
the IOMMU? Are you running in VMX non-root mode? Why not use
posted interrupts?

>>For that matter, I don't see how doing away with segmentation in
>>the 64-bit mode really adds that much, either
>
>Again, it's the hardware implementation and verification cost
>that's being saved.

Sorry, that was poorly worded: I meant, I don't see how it costs
that much to do away with it. Even on AMD it appears that one
has to go out of one's way to use it.

- Dan C.

Scott Lurndal

May 22, 2023, 1:19:39 PM
Yep, that's the one. I have been deep into AArch64 for the last
12 years, so I haven't dug through the AMD docs for a while.

>
>Yeah. That adds some modicum of complexity to the boot path,
>but is obviously doable.
>
>It's still not entirely clear to me how the BSP/BSC is supposed
>to boot, however. If the world starts in 64-bit mode, and that
>still requires paging to be enabled, then who sets up the page
>tables that the BSP starts up on?

I haven't dug into it, but perhaps they come up in some funky
identity mode when the PT root pointer (CR3?) hasn't been programmed.

>
>>>I don't see how virtio can give a user-application pass-through
>>>access to programmed IO, but I appreciate an argument that says
>>>that there can be a uioring sort of thing to communicate IO
>>>requests from userspace to the kernel without a trap.
>>
>>We do that all the time on our processors. Applications like DPDK
>>and Open Data Plane (ODP) rely on user-mode access to the
>>device MMIO (often using SR-IOV virtual functions) space and direct
>>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.
>
>Ok, sure, but that's not PIO.

By PIO are you referring to 'in' and 'out' instructions that have
been obsolete for three decades except for a few legacy devices
like the UART (and access to PCI config space, although PCI
express defines the memory-mapped ECAM as an alternative which
is used on non-Intel/AMD systems)?


> Unprivileged access to the PIO
>space seems like it's just going away. I think that's probably
>fine as almost all high-speed devices are memory-mapped anyway,
>so we're just left with legacy things like the UART or PS/2
>keyboard controller or whatever.

Plus, with PCI, a "io space" bar can be programmed to sit anywhere
in the physical address space. With most modern devices either
being PCI or providing PCI configuration space semantics, one can
still use PIO even on ARM processors via IO BAR. Not that there really are
any modern PCI/PCIe devices that use anything other than "memory space"
bars.

>
>>Interrupts are still mediated by the OS (virt-io provides these
>>capabilities), although DPDK/ODP generally poll completion rings
>>rather than use interrupts.
>
>Really? Even with SR-IOV and the interrupt remapping tables in
>the IOMMU? Are you running in VMX non-root mode? Why not use
>posted interrupts?

Hmm. I do seem to recall some mechanisms for interrupt virtualization
in the IOMMU, but I've been, as noted above, in the ARMv8 world for a while now.

Speaking for ARM systems, the guts of the interrupt controller
(including the interrupt acknowledge registers) are privileged. There
is no way to segregate user-mode-visible interrupts from all others
which is needed to ensure that a user-mode program can't royally screw
up the system, the kernel must accept and end the interrupt. The
ARM GICv3 is actually much more sophisticated than the local and I/O
APICs on Intel and the GICv4 adds some level of interrupt virtualization
to support delivery directly to the guest without intervention from
the hypervisor. IIRC, the Intel IOMMU interrupt remapping tables
were to support that type of usage, not direct user mode access
(which would require user-mode access to the local APIC to end the
interrupt).


>
>>>For that matter, I don't see how doing away with segmentation in
>>>the 64-bit mode really adds that much, either
>>
>>Again, it's the hardware implementation and verification cost
>>that's being saved.
>
>Sorry, that was poorly worded: I meant, I don't see how it costs
>that much to do away with it. Even on AMD it appears that one
>has to go out of one's way to use it.

Gotcha. Concur.

Dan Cross

May 22, 2023, 2:18:05 PM
In article <ltNaM.3348574$9sn9.2...@fx17.iad>,
Scott Lurndal <sl...@pacbell.net> wrote:
>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>[snip]
>>It's still not entirely clear to me how the BSP/BSC is supposed
>>to boot, however. If the world starts in 64-bit mode, and that
>>still requires paging to be enabled, then who sets up the page
>>tables that the BSP starts up on?
>
>I haven't dug into it, but perhaps they come up in some funky
>identity mode when the PT root pointer (CR3?) hasn't been programmed.

Now that would genuinely be a useful change.

>>>>I don't see how virtio can give a user-application pass-through
>>>>access to programmed IO, but I appreciate an argument that says
>>>>that there can be a uioring sort of thing to communicate IO
>>>>requests from userspace to the kernel without a trap.
>>>
>>>We do that all the time on our processors. Applications like DPDK
>>>and Open Data Plane (ODP) rely on user-mode access to the
>>>device MMIO (often using SR-IOV virtual functions) space and direct
>>>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.
>>
>>Ok, sure, but that's not PIO.
>
>By PIO are you referring to 'in' and 'out' instructions that have
>been obsolete for three decades except for a few legacy devices
>like the UART

Well, yes. (The context was the removal of both the ring 3 port
access instructions and the IOPL from the TSS.)

>(and access to pci config space, although PCI
>express defines the memory mapped ECAM as an alternative which
>is used on non-intel/amd systems)?

I try to blot that out of my mind.

I believe that PCI express deprecates the port-based access
method to config space; MMIO _must_ be supported and in
particular, is the only way to get access to the extended
capability space. So from that perspective we're not losing
anything. Certainly, I've used memory-mapped IO for dealing
with PCI config space on x86_64. The port-based access method
is really only for compatibility with legacy systems at this
point.
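
The MMIO method is also pleasantly simple; a sketch of the ECAM
address arithmetic (the base address is a placeholder; the real one
comes from the ACPI MCFG table, and this assumes the region is
identity-mapped):

#include <stdint.h>

/* ECAM: each PCI function gets a 4 KiB config window at a fixed
 * offset from the segment base reported by the ACPI MCFG table.
 * 0xE0000000 is an example value, not a universal address. */
#define ECAM_BASE 0xE0000000ull

static volatile uint32_t *
ecam_reg(uint32_t bus, uint32_t dev, uint32_t fn, uint32_t off)
{
    uint64_t pa = ECAM_BASE
                + ((uint64_t)bus << 20)  /* 256 buses */
                + ((uint64_t)dev << 15)  /* 32 devices per bus */
                + ((uint64_t)fn  << 12)  /* 8 functions, 4 KiB each */
                + (off & 0xffc);         /* dword-aligned register */
    return (volatile uint32_t *)(uintptr_t)pa;
}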

>> Unprivileged access to the PIO
>>space seems like it's just going away. I think that's probably
>>fine as almost all high-speed devices are memory-mapped anyway,
>>so we're just left with legacy things like the UART or PS/2
>>keyboard controller or whatever.
>
>Plus, with PCI, a "io space" bar can be programmed to sit anywhere
>in the physical address space. With most modern devices either
>being PCI or providing PCI configuration space semantics, one can
>still use PIO even on ARM processors via IO BAR. Not that there really are
>any modern PCI/PCIe devices that use anything other than "memory space"
>bars.

Yup. It really seems like the only devices that demand access
via port IO are the legacy "PC" devices; if the 8259A is going
away, what's left? The RTC, UART and keyboard controller? Is
the PIT dual-wired to an IOAPIC for interrupt generation?

>>>Interrupts are still mediated by the OS (virt-io provides these
>>>capabilities), although DPDK/ODP generally poll completion rings
>>>rather than use interrupts.
>>
>>Really? Even with SR-IOV and the interrupt remapping tables in
>>the IOMMU? Are you running in VMX non-root mode? Why not use
>>posted interrupts?
>
>Hmm. I do seem to recall some mechanisms for interrupt virtualization
>in the IOMMU, but I've been, as noted above, in the ARMv8 world for a while now.

Is this the point where I express my jealousy? :-D

But yes: the IOMMU can be used to deliver interrupts directly to
a VCPU (provided you're using APIC virtualization) by writing to
a posted-interrupt vector. The resulting interrupt will be
generated and delivered in the guest without intervention from
the hypervisor.

>Speaking for ARM systems, the guts of the interrupt controller
>(including the interrupt acknowledge registers) are privileged. There
>is no way to segregate user-mode-visible interrupts from all others
>which is needed to ensure that a user-mode program can't royally screw
>up the system, the kernel must accept and end the interrupt.

I'm not sure I understand; I thought the GIC was memory mapped,
including for the banked per-CPU registers? Is the issue that
you don't want to expose the entire mapping (I presume this has
to be on some page granularity) to userspace?

>The
>ARM GICv3 is actually much more sophisticated than the local and I/O
>APICs' on Intel and the GICv4 adds some level of interrupt virtualization
>to support delivery directly to the guest without intervention from
>the hypervisor. IIRC, the Intel IOMMU interrupt remapping tables
>were to support that type of usage, not direct user mode access
>(which would require user-mode access to the local APIC to end the
>interrupt).

That is correct; perhaps I'm misinterpreting what you meant
earlier: I think I gather now that you're talking about
overloading functionality meant for virtualization to provide
unprivileged access to devices in a host. That is, allocate a
virtual function and pass that through to a userspace process,
but don't enter a virtualized CPU context?

- Dan C.

BGB

May 22, 2023, 3:04:16 PM
If the guest is using the same type of software managed TLB, one doesn't
emulate the guest's page-tables, one emulates the guest's TLB
(effectively running the TLB through another level of virtual address
translation).


> Why a B-tree instead of a radix tree, anyway?
>

If you mean a radix tree, like conventional page-tables, the issue is
mostly a combination of a large address space and ASLR.

The page table works fine for a 48-bit address space, but starts to have
problems for a larger space.


In my case, the "full" virtual address space is 96 bits.

The upper levels of the page table end up being mostly empty, and if one
tries to ASLR addresses within this space, then the memory overhead from
the page tables ends up being *absurd* (huge numbers of page tables
often with only a single entry being non-zero).

I had ended up often using a hybrid strategy, where the upper-bits of
the address are managed using a B-Tree, and the low-order bits with a
more conventional page table.

Say, for 16K pages:
Addr(95:36): B-Tree
Addr(35:14): 2-level page-table

Then one can use ASLR freely without burning through excessive amounts
of memory (as a full 8-level page table for 16K pages would).

Note that 4K pages would require a 10-level page-table, and 64K pages a
7-level page-table.
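
A sketch of what that hybrid lookup might look like for the 16K-page
split above; the types and the btree_find() helper are hypothetical
stand-ins, not BGB's actual code:

#include <stdint.h>

typedef unsigned __int128 u128;  /* 96-bit VAs carried in 128 bits */

struct btree;                    /* opaque; keyed on Addr(95:36) */
uint64_t *btree_find(struct btree *bt, uint64_t vhi); /* assumed */

#define PG_SHIFT 14              /* 16K pages */
#define LVL_BITS 11              /* 16K / 8-byte PTEs = 2048 entries */
#define LVL_MASK ((1u << LVL_BITS) - 1)

/* Addr(95:36) selects a 2-level table via the B-Tree; Addr(35:14)
 * indexes the two levels (11 + 11 bits); Addr(13:0) is the page
 * offset. Returns the PTE, or 0 for a miss (raise a fault). */
static uint64_t translate(struct btree *bt, u128 va)
{
    uint64_t *l1 = btree_find(bt, (uint64_t)(va >> 36));
    if (!l1)
        return 0;
    uint64_t lo = (uint64_t)va & ((1ull << 36) - 1);
    uint64_t *l0 = (uint64_t *)(uintptr_t)
        l1[(lo >> (PG_SHIFT + LVL_BITS)) & LVL_MASK];
    if (!l0)
        return 0;
    return l0[(lo >> PG_SHIFT) & LVL_MASK];
}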


A pure B-Tree would use less memory than the hybrid strategy, but the
drawback of a B-Tree is that it is slower.

It is possible to speed-up the B-Tree by using a hash-table to cache
lookups, but this is most effective with the hybrid strategy.


Using hash-tables as the primary lookup (rather than B-Trees) had been
looked into as well, but hash tables have drawbacks when used in this
way (they don't scale very well).

Had also experimented with using AVL Trees, however these ended up
slightly worse in terms of both memory overhead and performance when
compared with B-Trees (though, AVL Trees are a little simpler to implement).

...


Note that the memory layout here would have programs within their own
local 48 bits (programs not generally needing to care about anything
beyond their own 48-bit space), but the programs are placed randomly
within the 96-bit space (so that one program can't "guess" an address
into another program's memory; but memory can still be "shared" via
128-bit "huge" pointers).

Say:
void *ptr; //points within the local 48-bit space (64-bit pointer)
void * __huge ptr; //points within the 96-bit space (128-bit)
...

Say:
void ** __huge ptr; //128-bit pointer to 64-bit pointers
__huge void **ptr; //64-bit pointer to 128-bit pointers
__huge void ** __huge ptr; //128-bit pointer to 128-bit pointers

...

Though, this part is specific to BGBCC.



I had put off some of this for a little while, but it came up again
mostly because my CPU core can also run 64-bit RISC-V, but it turns out
GCC doesn't support either PIE or shared-objects for this target
("WTF?"), so to be able to load them up as programs, I need to give them
their own address space, and throwing them off into random parts of
96-bit land seemed the "lesser of two evils" (as compared with needing
to actually deal with multiple address spaces in the kernel).

This is a little bit of an annoyance as it does mean needing to widen
the virtual memory system and anything that deals directly with system
calls to deal with the larger address space (and, secondarily, use a
wrapper interface because, if any of this code is built with GCC in
RISC-V mode, then GCC doesn't support either 128-bit pointers or 128-bit
integers).

Though, in some cases, this will mean needing to copy things into local
buffers and copy them back into the program's address range (so, a
similar annoyance to if one was dealing with multiple address spaces).

...


Scott Lurndal

May 22, 2023, 3:20:46 PM
Intel, AMD and ARM chips all have hardware translation walkers. All
have a facility to support guest OS management of page tables
(nested page tables on AMD, extended page tables on Intel and stage 2
page tables on ARM). They also have I/O memory management units that
also support multiple translation stages to allow guests to program
guest physical addresses into hardware DMA engines, some can even
support translations using both stages to allow user-mode code to
directly program hardware DMA engines when running under
a guest OS (e.g. for a NIC or storage adapter virtual function assigned
directly to user-mode code).

Even those using MIPS chips (the last with software walkers) such as
Cavium investigated adding a hardware walker before they switched to
ARMv8.

There is no possibility that the software (operating system and
hypervisor) folks will support software managed TLBs in a new architecture.
Zero chance.

Dan Cross

May 22, 2023, 4:12:51 PM
Indeed, but that introduces complications: on a TLB miss
interrupt, the hypervisor must invoke the guest's TLB miss
handler to supply the translation (since it can't just walk the
guest's page tables, since it doesn't know about them), then
trap the TLB update. That's all straight-forward, but it's also
slow. Often, to mitigate this, the hypervisor will cache recent
translations itself (e.g., Disco's L2TLB), but now we have to
emulate removals as well (again, straight-forward, but something
to consider nontheless). And when we consider nested
virtualization (that is, a hypervisor running under a
hypervisor) this overhead gets even worse.
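
In pseudo-C, the bounce looks roughly like this; every name is an
invented stand-in (the Disco-style L2TLB cache is the one piece taken
from the description above):

#include <stdint.h>

struct vcpu;                          /* opaque per-vCPU state */
struct gtlbe { uint64_t gva, gpa, attrs; };

/* Assumed helpers, not real APIs: */
int  l2tlb_lookup(struct vcpu *v, uint64_t gva, struct gtlbe *out);
void install_hw_tlbe(struct vcpu *v, const struct gtlbe *e);
void reflect_tlb_miss_to_guest(struct vcpu *v, uint64_t gva);

void handle_guest_tlb_miss(struct vcpu *v, uint64_t gva)
{
    struct gtlbe e;

    /* Fast path: a fill we have already seen (the "L2TLB");
     * the gpa->hpa translation is applied at install time. */
    if (l2tlb_lookup(v, gva, &e)) {
        install_hw_tlbe(v, &e);
        return;
    }

    /* Slow path: bounce into the guest's own miss handler; its
     * eventual TLB-load instruction traps back to us, where we
     * translate gpa->hpa, fill the L2TLB, and install the entry. */
    reflect_tlb_miss_to_guest(v, gva);
}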

L2PTs like the EPT and NPT are wins here; even in the nested
VM case, where we have to resort to shadow paging techniques, we
can handle L2 page faults in the top-level hypervisor.

There's a reason soft-TLBs have basically disappeared. :-)

- Dan C.

Scott Lurndal

May 22, 2023, 4:45:33 PM
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>In article <ltNaM.3348574$9sn9.2...@fx17.iad>,
>Scott Lurndal <sl...@pacbell.net> wrote:
>>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>[snip]
>>>It's still not entirely clear to me how the BSP/BSC is supposed
>>>to boot, however. If the world starts in 64-bit mode, and that
>>>still requires paging to be enabled, then who sets up the page
>>>tables that the BSP starts up on?
>>
>>I haven't dug into it, but perhaps they come up in some funky
>>identity mode when the PT root pointer (CR3?) hasn't been programmed.
>
>Now that would genuinely be a useful change.

The document describes IA32_SIPI_ENTRY_STRUCT as containing:

- A bit that selects startup or shutdown (+63 filler bits)
- The RIP for the AP to start executing at
- The CR3 value for the AP to use when starting
- The CR0 value
- The CR4 value

This is used to start all the secondary processors.

The bootstrap processor:

"The CPU starts executing in 64-bit paged mode after reset. The
Firmware Interface Table (FIT) contains a reset state structure
containing RIP and CR3 that defines the initial execution state of
the CPU."

I'm presuming that a management processor provided by the
mainboard vendor will initialize the FIT out-of-band before
releasing the BSP from reset sufficient to execute the UEFI
firmware and boot loader.


>
>>>>>I don't see how virtio can give a user-application pass-through
>>>>>access to programmed IO, but I appreciate an argument that says
>>>>>that there can be a uioring sort of thing to communicate IO
>>>>>requests from userspace to the kernel without a trap.
>>>>
>>>>We do that all the time on our processors. Applications like DPDK
>>>>and Open Data Plane (ODP) rely on user-mode access to the
>>>>device MMIO (often using SR-IOV virtual functions) space and direct
>>>>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.
>>>
>>>Ok, sure, but that's not PIO.
>>
>>By PIO are you referring to 'in' and 'out' instructions that have
>>been obsolete for three decades except for a few legacy devices
>>like the UART
>
>Well, yes. (The context was the removal of both ring 3 port
>access instructions, as well as the IOPL from TSS.)

Ok. Coming from the Unix/Linux world, ring 3 access to those
has generally not been allowed, and I don't see removal of that
capability as a loss, but rather a gain.

>
>>(and access to pci config space, although PCI
>>express defines the memory mapped ECAM as an alternative which
>>is used on non-intel/amd systems)?
>
>I try to blot that out of my mind.
>
>I believe that PCI express deprecates the port-based access
>method to config space; MMIO _must_ be supported and in
>particular, is the only way to get access to the extended
>capability space.

Intel cheated and used four unused bits in the high bits
of cf8 for the extended config space on some of the southbridge
chipsets in the early PCIe days. I've not checked recently
to see if newer chipsets/SoCs still support that.

> So from that perspective we're not losing
>anything. Certainly, I've used memory-mapped IO for dealing
>with PCI config space on x86_64. The port-based access method
>is really only for compatibility with legacy systems at this
>point.

For ARM, the standard is ECAM (See Server Base System Architecture
document).

https://developer.arm.com/documentation/den0029/latest/
>
>>> Unprivileged access to the PIO
>>>space seems like it's just going away. I think that's probably
>>>fine as almost all high-speed devices are memory-mapped anyway,
>>>so we're just left with legacy things like the UART or PS/2
>>>keyboard controller or whatever.
>>
>>Plus, with PCI, a "io space" bar can be programmed to sit anywhere
>>in the physical address space. With most modern devices either
>>being PCI or providing PCI configuration space semantics, one can
>>still use PIO even on ARM processors via IO BAR. Not that there really are
>>any modern PCI/PCIe devices that use anything other than "memory space"
>>bars.
>
>Yup. It really seems like the only devices that demand access
>via port IO are the legacy "PC" devices; if the 8159A is going
>away, what's left? The RTC, UART and keyboard controller? Is
>the PIT dual-wired to an IOAPIC for interrupt generation?

Don't they have an architected high precision timer (HPET) that
is used instead of the PIT in these modern times?

>
>>>>Interrupts are still mediated by the OS (virt-io provides these
>>>>capabilities), although DPDK/ODP generally poll completion rings
>>>>rather than use interrupts.
>>>
>>>Really? Even with SR-IOV and the interrupt remapping tables in
>>>the IOMMU? Are you running in VMX non-root mode? Why not use
>>>posted interrupts?
>>
>>Hmm. I do seem to recall some mechanisms for interrupt virtualization
>>in the IOMMU, but I've been, as noted above, in the ARMv8 world for a while now.
>
>Is this the point where I express my jealousy? :-D

I've quite enjoyed the last decade working on a
significant architectural upgrade from ARMv7.

Watching the architecture grow from initial early
release documents and modeling the Processor, SMMU, and
Interrupt Controller has been educational and fun.


>>Speaking for ARM systems, the guts of the interrupt controller
>>(including the interrupt acknowledge registers) are privileged. There
>>is no way to segregate user-mode-visible interrupts from all others
>>which is needed to ensure that a user-mode program can't royally screw
>>up the system, the kernel must accept and end the interrupt.
>
>I'm not sure I understand; I thought the GIC was memory mapped,
>including for the banked per-CPU registers?

That was GICv2. That interface only supported 8 cores/threads,
so they designed GICv3 for ARMv8. That uses CPU system registers
to interface rather than the former memory mapped CPU interface.
Much cleaner and easily accommodates many thousands of CPUs; also
expanded to handle large numbers (2^22) of interrupts, including
software generated interrupts for IPI, per-processor local interrupts
for things like timers, profiling, debugging, wired interrupts
(level or edge) and message signaled interrupts (edge only).
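
To make the contrast with the memory-mapped GICv2 interface concrete,
here is a minimal sketch of the GICv3 acknowledge/EOI path from EL1,
using the architectural S-form encodings of ICC_IAR1_EL1 and
ICC_EOIR1_EL1 (so no special assembler support is needed):

  #include <stdint.h>

  static inline uint64_t gic_ack(void)
  {
      uint64_t intid;
      /* ICC_IAR1_EL1: read to acknowledge the highest-priority
         pending interrupt */
      __asm__ volatile("mrs %0, S3_0_C12_C12_0" : "=r"(intid));
      return intid & 0xffffff;    /* 24-bit INTID */
  }

  static inline void gic_eoi(uint64_t intid)
  {
      /* ICC_EOIR1_EL1: signal end-of-interrupt */
      __asm__ volatile("msr S3_0_C12_C12_1, %0" :: "r"(intid));
  }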

The GICv3 CPU interface also includes special hypervisor support. In
GICv3.0, the hypervisor could inject interrupts in the guest but still
was required to handle all physical interrupts itself (and pass them
to the guest as required). With GICv4.0, a mechanism was added such
that interrupts (message signaled) could be directly injected into
the guest by the hardware without hypervisor intervention. GICv4.1
added support for hardware injected virtual software generated
interrupts (vSGI), so an SMP guest could send IPIs to other cores
assigned to it, using virtual core numbers mapped by the GIC into
an interrupt to be injected (if the guest was actively scheduled on
the core, otherwise it would be recorded and delivered when the
guest virtual CPU is next scheduled on that core).




>
>That is correct; perhaps I'm misinterpreting what you meant
>earlier: I think I gather now that you're talking about
>overloading functionality meant for virtualization to provide
>unprivileged access to devices in a host. That is, allocate a
>virtual function and pass that through to a userspace process,
>but don't enter a virtualized CPU context?

Yes, that's the basic use pattern for the DPDK. Usermode drivers
directly access the networking hardware without operating system
intervention. Virtualized[*] _or_ bare-metal. Interrupts are the
complicating factor since most processors do not have the capability
to deliver interrupts to user-mode handlers directly. Thus DPDK
and ODP poll completion queues on the network hardware rather than
waiting for interrupts.

https://www.dpdk.org/
https://opendataplane.org/

[*] Requires the SMMU/IOMMU to do both stages of translation, i.e.
va->guestpa and guestpa->machinepa.

BGB

unread,
May 22, 2023, 9:38:07 PM5/22/23
to
I had gone with a software-managed TLB, but in my case this was partly
because my ISA had evolved out of SuperH and similar, and the approach
seemed to fit a "make the hardware simple and cheap" philosophy.


Similarly, some features, like B-Tree based page-tables, or applying ACL
checks to virtual-memory pages (as opposed to a more conventional
protection-ring scheme), wouldn't really be quite as practical with a
hardware page-table walker.


But, with it being handled in software, one can basically do whatever
they want...


If the guest OS wants to use a page-table and pretend there is a
page-table walker, this is easy enough to pull off as well.

Granted, emulating a TLB on top of page-tables is more difficult, but
this is mostly because page-tables are less flexible in this case.


...



BGB

unread,
May 22, 2023, 10:00:47 PM5/22/23
to
IME, it isn't really all that difficult in practice.

Granted, things like removing a page from the TLB were generally
handled by loading an empty TLBE (for a given virtual address) roughly 8
times in a row. With a set-associative TLB, this was enough to make sure
the page was evicted (and should also be easy enough to detect
and handle).
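
Roughly, the idiom looks like the following sketch, where ldtlb() is a
hypothetical intrinsic wrapping the LDTLB instruction (loading an
invalid entry for the same VA once per way pushes the old translation
out of each way of the set in turn):

  static void tlb_evict_page(uint64_t va)
  {
      uint64_t hi = va & ~0x3fffULL;   /* VA page (16K pages), no ASID */
      for (int i = 0; i < 8; i++)
          ldtlb(hi, 0);                /* load an empty TLBE for this VA */
  }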


> L2PT's like the EPT and NPT are wins here; even in the nested
> VM case, where we have to resort to shadow paging techniques, we
> can handle L2 page faults in the top-level hypervisor.
>

But, if one uses SW TLB, then NPT (as a concept) has no reason to need
to exist...


> There's a reason soft-TLBs have basically disappeared. :-)
>

Probably depends some on how the software-managed TLB is implemented.

In my case, TLB miss triggers an interrupt, and there is an "LDTLB"
instruction which basically means "Take the TLBE from these two
registers and shove it into the TLB at the appropriate place".

In this case, there is no way for the program to directly view or modify
the contents of the TLB.
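
So the miss path amounts to something like the following sketch
(get_fault_va, page_lookup, and friends are hypothetical placeholders
for the OS-side lookup, whatever structure that happens to use):

  void tlb_miss_isr(void)
  {
      uint64_t va = get_fault_va();        /* faulting virtual address */
      tlbe_t e;
      if (page_lookup(current_as, va, &e))
          ldtlb(e.hi, e.lo);               /* insert into the HW TLB */
      else
          page_fault(va);                  /* unmapped: a real fault */
  }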


As can be noted in my testing, TLB miss rate is typically low enough
(with 16K pages and a 256x 4-way TLB) that the performance impact of
handling TLB misses in software doesn't really have all that much effect
on the overall performance of the program.

As can be noted:
TLB miss rate increases significantly with 4K pages vs 16K pages;
Dropping to 64x 4-way also increases miss rate;
Miss rate is "pretty bad" at 16x 4-way.

Generally, stable operation of the TLB seems to require at least 4-way
associativity for the L2 TLB (an L1 TLB can get by more effectively with
1-way assuming a modulo indexing scheme; So, 16x 1-way for the L1 TLB).

Note that fully associative TLBs aren't really viable on an FPGA.


> - Dan C.
>

Dan Cross

unread,
May 23, 2023, 8:20:01 AM5/23/23
to
In article <u4h5b9$2afd7$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>[snip]
>Granted, emulating a TLB on top of page-tables is more difficult, but
>this is mostly because page-tables are less flexible in this case.

No, emulating the _TLB_ itself is easier, but overall memory
management is more complex, particularly in recursive VM
situations.

- Dan C.

muta...@gmail.com

unread,
May 23, 2023, 9:27:29 AM5/23/23
to
On Monday, May 22, 2023 at 1:15:00 AM UTC+8, Scott Lurndal wrote:

> Yes, it will basically put Paul and PDOS out of business, but he can
> always run in emulation.

I already have a nominally 64-bit version of PDOS that
runs under 64-bit UEFI.

I'm still waiting for compiler support to effectively force
it back to 32-bit.

For PDOS/386, as of yesterday, the entire toolchain except the
compiler is public domain, and all the assembler is masm
syntax.

ie we now have a public domain assembler that is sufficiently
masm-compatible for my purposes.

I haven't yet proven that the entire PDOS can be built with
Visual Studio, now that the language has been switched.

I've proven it with Watcom though.

If I dumb down the source base to SubC then I should be able
to have a completely public domain solution, but I am holding
out for Octogram C.

There is currently work being done to convert the public domain
toolchain to 64-bit, but direction, and even definition, is still being
negotiated.

BFN. Paul.

muta...@gmail.com

unread,
May 23, 2023, 11:11:51 AM5/23/23
to
On Tuesday, May 23, 2023 at 9:27:29 PM UTC+8, muta...@gmail.com wrote:

> I haven't yet proven that the entire PDOS can be built with
> Visual Studio, now that the language has been switched.

I just received the tool I needed to complete this (convert
a PE executable into a binary by inserting a jmp and
populating BSS), and ... it works!

So now PDOS can be built with professional Microsoft tools
instead of having to take your chances with jackasses on
the internet.

BFN. Paul.

Dan Cross

unread,
May 23, 2023, 12:24:31 PM5/23/23
to
In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/22/2023 3:10 PM, Dan Cross wrote:
>[snip]
>> L2PT's like the EPT and NPT are wins here; even in the nested
>> VM case, where we have to resort to shadow paging techniques, we
>> can handle L2 page faults in the top-level hypervisor.
>>
>
>But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>to exist...

Yes, at great expense.

>> There's a reason soft-TLBs have basically disappeared. :-)
>
>Probably depends some on how the software-managed TLB is implemented.

Not really; the design issues and the impact are both
well-known. Think through how a nested guest (note, not a
nested page table, but a recursive instance of a hypervisor)
would be handled.

>In my case, TLB miss triggers an interrupt, and there is an "LDTLB"
>instruction which basically means "Take the TLBE from these two
>registers and shove it into the TLB at the appropriate place".

That's pretty much the way they all work, yes.

- Dan C.

Dan Cross

unread,
May 23, 2023, 12:31:05 PM5/23/23
to
In article <u4ip84$hpp$1...@reader2.panix.com>,
Dan Cross <cr...@spitfire.i.gajendra.net> wrote:
>In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>On 5/22/2023 3:10 PM, Dan Cross wrote:
>>[snip]
>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>> VM case, where we have to resort to shadow paging techniques, we
>>> can handle L2 page faults in the top-level hypervisor.
>>>
>>
>>But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>>to exist...
>
>Yes, at great expense.
>
>>> There's a reason soft-TLBs have basically disappeared. :-)
>>
>>Probably depends some on how the software-managed TLB is implemented.
>
>Not really; the design issues and the impact are both
>well-known. Think through how a nested guest (note, not a
>nested page table, but a recursive instance of a hypervisor)
>would be handled.

Another thing to consider in a virtualized context with a
soft-TLB: suppose the host and guest want to occupy the same
region of virtual memory. How does the host wrest control
back from the guest, if the guest has usurped the host's
mappings? On MIPS, you have KSEGs, which is one approach
here, but note that under (say) Disco you had to modify the
guest kernel as a result.

- Dan C.

Dan Cross

unread,
May 23, 2023, 1:11:22 PM5/23/23
to
In article <KvQaM.2005263$t5W7.1...@fx13.iad>,
Scott Lurndal <sl...@pacbell.net> wrote:
>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>In article <ltNaM.3348574$9sn9.2...@fx17.iad>,
>>Scott Lurndal <sl...@pacbell.net> wrote:
>>>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>>[snip]
>>>>It's still not entirely clear to me how the BSP/BSC is supposed
>>>>to boot, however. If the world starts in 64-bit mode, and that
>>>>still requires paging to be enabled, then who sets up the page
>>>>tables that the BSP starts up on?
>>>
>>>I haven't dug into it, but perhaps they come up in some funky
>>>identity mode when the PT root pointer (CR3?) hasn't been programmed.
>>
>>Now that would genuinely be a useful change.
>
>The document describes IA32_SIPI_ENTRY_STRUCT as containing:
>
> - A bit that selects startup or shutdown (+63 filler bits)
> - The RIP for the AP to start executing at
> - The CR3 value for the AP to use when starting
> - The CR0 value
> - The CR4 value
>
>This is used to start all the secondary processors.

Yeah, the AP (secondary processor) case was pretty clear.
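
In C terms, something like the following sketch; the field order
follows the quoted description, but the names and packing here are
assumptions:

  #include <stdint.h>

  struct sipi_entry {
      uint64_t command; /* bit 0: 1 = startup, 0 = shutdown; rest filler */
      uint64_t rip;     /* initial %rip for the AP */
      uint64_t cr3;     /* initial page-table root */
      uint64_t cr0;
      uint64_t cr4;
  };

  /* The physical address of this structure is written to the new
     MSR (IA32_SIPI_ENTRY_STRUCT); a SIPI then brings the AP up
     directly in 64-bit paged mode. */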

>The bootstrap processor:
>
> "The CPU starts executing in 64-bit paged mode after reset. The
> Firmware Interface Table (FIT) contains a reset state structure
> containing RIP and CR3 that defines the initial execution state of
> the CPU."
>
>I'm presuming that a management processor provided by the
>mainboard vendor will initialize the FIT out-of-band before
>releasing the BSP from reset sufficient to execute the UEFI
>firmware and boot loader.

Ah, the AMD PSP approach, aka, how to start the engine on a
ship. I guess the days of DRAM training from the BSP are over,
which isn't necessarily a bad thing.

I imagine that they'll just embed this into the SoC complex
directly.

>>>>>We do that all the time on our processors. Applications like DPDK
>>>>>and Open Data Plane (ODP) rely on user-mode access to the
>>>>>device MMIO (often using SR-IOV virtual functions) space and direct
>>>>>DMA (facilitated by an IOMMU/SMMU) initiated by usermode code.
>>>>
>>>>Ok, sure, but that's not PIO.
>>>
>>>By PIO are you referring to 'in' and 'out' instructions that have
>>>been obsolete for three decades except for a few legacy devices
>>>like the UART
>>
>>Well, yes. (The context was the removal of both ring 3 port
>>access instructions, as well as the IOPL from TSS.)
>
>Ok. Coming from the unix/linux world, ring 3 access to those
>has generally not been allowed and I don't see removal of that
>capability as a loss, but rather a gain.

ioperm(2) and iopl(2)? :-)

>>>(and access to pci config space, although PCI
>>>express defines the memory mapped ECAM as an alternative which
>>>is used on non-intel/amd systems)?
>>
>>I try to blot that out of my mind.
>>
>>I believe that PCI express deprecates the port-based access
>>method to config space; MMIO _must_ be supported and in
>>particular, is the only way to get access to the extended
>>capability space.
>
>Intel cheated and used four unused bits in the high bits
>of cf8 for the extended config space on some of the southbridge
>chipsets in the early PCIe days. I've not checked recently
>to see if newer chipsets/SoCs still support that.

Sigh. Actually, I checked the PCIe spec last night and I think
I was just wrong. It looks like you can still do it, but
address/data register pairs are problematic, so I always just
use ECAM.

I think port IO can be useful at very early boot for writing
debugging data to the UART, or for supporting legacy "PC"
devices; beyond that, I find it annoying. We did use it once in
an experimental hypervisor to execute the equivalent of a VMCALL
without first having to trap into (guest) kernel mode.
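
For the early-boot case, it's hard to beat the simplicity; a sketch of
the usual polled write to the legacy COM1 UART at 0x3f8:

  #include <stdint.h>

  static inline void outb(uint16_t port, uint8_t v)
  {
      __asm__ volatile("outb %0, %1" :: "a"(v), "Nd"(port));
  }

  static inline uint8_t inb(uint16_t port)
  {
      uint8_t v;
      __asm__ volatile("inb %1, %0" : "=a"(v) : "Nd"(port));
      return v;
  }

  static void putc_com1(char c)
  {
      while ((inb(0x3f8 + 5) & 0x20) == 0)  /* LSR: THR empty? */
          ;
      outb(0x3f8, (uint8_t)c);
  }

No mappings, no discovery, no state: exactly what you want when the
MMU isn't up yet.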

>> So from that perspective we're not losing
>>anything. Certainly, I've used memory-mapped IO for dealing
>>with PCI config space on x86_64. The port-based access method
>>is really only for compatibility with legacy systems at this
>>point.
>
>For ARM, the standard is ECAM (See Server Base System Architecture
>document).
>
>https://developer.arm.com/documentation/den0029/latest/

ARM has always been saner than x86. :-) It makes sense that
they'd start with and stick to ECAM, since (AFAIK) they never
had programmed IO instructions. How else _would_ you do it?
(That's a rhetorical question, btw.)

>>Yup. It really seems like the only devices that demand access
>>via port IO are the legacy "PC" devices; if the 8259A is going
>>away, what's left? The RTC, UART and keyboard controller? Is
>>the PIT dual-wired to an IOAPIC for interrupt generation?
>
>Don't they have an architected high-precision timer (HPET) that
>is used instead of the PIT in these modern times?

It does, but the HPET has weird problems of its own and is more
rarely used than one might otherwise expect. It's a really
annoying device in a lot of ways.

>>>>>Interrupts are still mediated by the OS (virt-io provides these
>>>>>capabilities), although DPDK/ODP generally poll completion rings
>>>>>rather than use interrupts.
>>>>
>>>>Really? Even with SR-IOV and the interrupt remapping tables in
>>>>the IOMMU? Are you running in VMX non-root mode? Why not use
>>>>posted interrupts?
>>>
>>>Hmm. I do seem to recall some mechanisms for interrupt virtualization
>>>in the IOMMU, but I've been, as noted above, in the ARMv8 world for a while now.
>>
>>Is this the point where I express my jealousy? :-D
>
>I've quite enjoyed the last decade working on a
>significant architectural upgrade from ARMv7.
>
>Watching the architecture grow from initial early
>release documents and modeling the Processor, SMMU, and
>Interrupt Controller has been educational and fun.

Definitely some really cool things have happened in that space
while those of us slumming it with x86 have looked on with envy.
This actually sounds remarkably like what one gets with the
x2APIC and APIC virtualization these days, though perhaps more
coherent than on x86: one interacts with the LAPIC in x2 mode
via MSRs (though the IOMMU still uses memory-mapped address/data
register pairs). MSI delivery coupled with posted interrupts
and the remapping tables in the IOMMU have cooperated to make it
easy to inject interrupts into a guest without hypervisor
intervention. AMD's AVIC supports functionality that lets guest
VCPUs send IPIs amongst themselves without hypervisor
intervention, but there are all kinds of problems with it. It
sounds like GICv4 centralizes all of this, whereas on the x86
platforms it gets scattered across a variety of components; and
ARM gives you the bare-metal passthrough functionality.
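
For anyone who hasn't seen it, the x2APIC side really is just MSRs;
sending a fixed IPI is a single WRMSR to the ICR (MSR 0x830), as in
this sketch:

  #include <stdint.h>

  static inline void wrmsr(uint32_t msr, uint64_t v)
  {
      __asm__ volatile("wrmsr"
          :: "c"(msr), "a"((uint32_t)v), "d"((uint32_t)(v >> 32)));
  }

  #define MSR_X2APIC_ICR 0x830u

  static void send_ipi(uint32_t apic_id, uint8_t vector)
  {
      /* fixed delivery, physical destination, level=assert */
      wrmsr(MSR_X2APIC_ICR,
            ((uint64_t)apic_id << 32) | (1u << 14) | vector);
  }

No more splitting the destination across ICR high/low writes the way
xAPIC mode required.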

>>That is correct; perhaps I'm misinterpreting what you meant
>>earlier: I think I gather now that you're talking about
>>overloading functionality meant for virtualization to provide
>>unprivileged access to devices in a host. That is, allocate a
>>virtual function and pass that through to a userspace process,
>>but don't enter a virtualized CPU context?
>
>Yes, that's the basic use pattern for the DPDK. Usermode drivers
>directly access the networking hardware without operating system
>intervention. Virtualized[*] _or_ bare-metal. Interrupts are the
>complicating factor since most processors do not have the capability
>to deliver interrupts to user-mode handlers directly. Thus DPDK
>and ODP poll completion queues on the network hardware rather than
>waiting for interrupts.

Tracking. That's good stuff. It strikes me that DoE did
similar things with NIX as part of the FastOS work.

>https://www.dpdk.org/
>https://opendataplane.org/
>
>[*] Requires the SMMU/IOMMU to do both stages of translation, i.e.
> va->guestpa and guestpa->machinepa.

_nod_

- Dan C.

BGB

unread,
May 23, 2023, 2:18:57 PM5/23/23
to
On 5/23/2023 11:22 AM, Dan Cross wrote:
> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>> [snip]
>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>> VM case, where we have to resort to shadow paging techniques, we
>>> can handle L2 page faults in the top-level hypervisor.
>>>
>>
>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>> to exist...
>
> Yes, at great expense.
>

Doesn't seem all that expensive.


In terms of LUTs, a soft TLB uses far less than a page walker.

And, the TLB doesn't need to have a mechanism to send memory requests
and handle memory responses, ...

It uses some Block-RAM's for the TLB, but those aren't too expensive.


In terms of performance, it is generally around 1.5 kilocycle per TLB
miss (*1), but as-is these typically happen roughly 50 or 100 times per
second or so.

On a 50 MHz core, only about 0.2% of the CPU time is going into handling
TLB misses.


Note that a page-fault (saving a memory page to an SD card and loading a
different page) is around 1 megacycle.


*1: Much of this goes into the cost of saving and restoring all the
GPRs, where my ISA has 64x 64-bit GPRs. The per-interrupt cost could be
reduced significantly via register banking, but then one pays a lot more
for registers which are only ever used during interrupt handling.


>>> There's a reason soft-TLBs have basically disappeared. :-)
>>
>> Probably depends some on how the software-managed TLB is implemented.
>
> Not really; the design issues and the impact are both
> well-known. Think through how a nested guest (note, not a
> nested page table, but a recursive instance of a hypervisor)
> would be handled.
>

The emulators for my ISA use SW TLB, and I don't imagine a hypervisor
would be that much different, except that they would likely use TLB ->
TLB remapping, rather than abstracting the whole memory subsystem.

One could also have the guest OS use page-tables FWIW.


I had originally intended to use firmware managed TLB with the OS using
page-tables, but this switched to plain software TLB mostly because I
ran out of space in the 32K Boot ROM (mostly due to things like
boot-time CPU sanity testing, *).

*: Idea being that during boot, the CPU tests many of the core ISA
features to verify they are working as intended (say, to detect things
like if a change to the Verilog broke the ALU or similar, ...).


Besides the sanity testing, the Boot ROM also contains a FAT filesystem
interface and PE/COFF / PEL4 loader (well, and also technically an ELF
loader, but I am mostly using PEL4).


Where PEL4 is:
PE/COFF but without the MZ stub;
Compresses most of the image using LZ4.
Decompressing LZ4 being faster than reading in more data.

The LZ4 compression seems to work well with binary code vs my own RP2
compression (which works better for general data, but not as well for
machine-code). Both formats being byte-oriented LZ variants (but they
differ in terms of how LZ matches are encoded and similar).

Have observed that LZ4 decompression tends to be slightly faster on
conventional machines (like x86-64), but on my ISA, RP2 is a little faster.

Note that Deflate can give slightly better compression, but is around an
order of magnitude slower.
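
For reference, the whole LZ4 block decode loop is small enough to
show; a minimal sketch (block format only, no frame header, bounds
checks elided for brevity):

  #include <stddef.h>
  #include <stdint.h>

  size_t lz4_decode(const uint8_t *src, size_t slen, uint8_t *dst)
  {
      const uint8_t *s = src, *send = src + slen;
      uint8_t *d = dst;
      while (s < send) {
          unsigned tok = *s++, b;
          size_t n = tok >> 4;              /* literal run length */
          if (n == 15)
              do { b = *s++; n += b; } while (b == 255);
          while (n--) *d++ = *s++;          /* copy literals */
          if (s >= send) break;             /* last sequence: no match */
          size_t off = s[0] | (s[1] << 8); s += 2;
          size_t m = (tok & 15) + 4;        /* match length, min 4 */
          if ((tok & 15) == 15)
              do { b = *s++; m += b; } while (b == 255);
          const uint8_t *r = d - off;       /* back-ref; may overlap, */
          while (m--) *d++ = *r++;          /* so copy bytewise */
      }
      return (size_t)(d - dst);
  }

This is why it tends to beat reading more data off an SD card: the
inner loops are just byte copies.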


Generally, in PEL4, the file headers are left in an uncompressed state,
but all of the section data and similar is LZ compressed.
Where, header magic:
PE\0\0: Uncompressed
PEL0: Also uncompressed (similar to PE\0\0)
PEL3: RP2 Compression (Not generally used)
PEL4: LZ4 Compression
PEL6: LZ4LLB (Modified LZ4, Length-Limited Encoding)

If the header is 'MZ', it checks for an offset to the start of the PE
header, but then assumes normal (uncompressed) PE/COFF.


Also PEL4 uses a different checksum algorithm from normal PE/COFF, as
the original checksum algorithm sucked and could not detect some of the
main types of corruption that result from LZ screw-ups.

The "linear sum with carry-folding" was instead replaced with a "linear
sum and sum-of-linear-sums with carry-folding XORed together". It is
significantly faster than something like Adler32 (or CRC32), while still
providing many of the same benefits (namely, better error detection than
the original checksums).

Checksum is verified after the whole image is loaded/decompressed into RAM.
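
As a sketch of the general shape (the actual PEL4 word size, folding
points, and how the two sums are combined are not spelled out above,
so treat the details here as assumptions):

  #include <stddef.h>
  #include <stdint.h>

  uint32_t sum_fold_xor(const uint16_t *p, size_t nwords)
  {
      uint64_t s1 = 0, s2 = 0;   /* linear sum, sum-of-sums */
      for (size_t i = 0; i < nwords; i++) {
          s1 += p[i];
          s2 += s1;
          if ((i & 0xfff) == 0xfff) {   /* fold periodically; folding
                                           is congruent mod 0xffff */
              s1 = (s1 & 0xffff) + (s1 >> 16);
              s2 = (s2 & 0xffff) + (s2 >> 16);
          }
      }
      while (s1 >> 16) s1 = (s1 & 0xffff) + (s1 >> 16);
      while (s2 >> 16) s2 = (s2 & 0xffff) + (s2 >> 16);
      return (uint32_t)((s2 << 16) ^ s1);  /* combine the two sums */
  }

As with Fletcher/Adler, the second sum makes the result sensitive to
byte order, which a plain linear sum is not.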


For my ABI, the "Global Pointer" entry in the Data directory was
repurposed into handling a floating "data section" which may be loaded
at a different address from ".text" and friends (so multiple program or
DLL instances can share the same copy of the ".text" and similar), with
the base-relocation table being internally split in this area (there is
a GBR register which points to the start of ".data", which in turn
points to a table which can be used for the program or DLLs to reload
their own corresponding data section into GBR; for "simple case" images,
this is simply a self-pointer).

Some sections, like the resource section, were effectively replaced (the
resource section now uses a format resembling the "Quake WAD2" format,
just with a different header and the offsets in terms of RVA's). Things
like "resource lumps" could then be identified with a 16-chacracter name
(typically uncompressed, apart from any compression due to the PEL4
compression, with bitmap images typically stored in the DIB/BMP format,
audio using RIFF/WAVE, ...).


Otherwise, the format is mostly similar to normal PE/COFF.


>> In my case, TLB miss triggers an interrupt, and there is an "LDTLB"
>> instruction which basically means "Take the TLBE from these two
>> registers and shove it into the TLB at the appropriate place".
>
> That's pretty much the way they all work, yes.
>

I think there were some that exposed the TLB as MMIO or similar, and the
ISR handler would then be expected to write the new TLBE into an MMIO array.

The SH-4 ISA also had something like this (in addition to the LDTLB
instruction), but I didn't keep this feature, and from what I could
tell, the existing OS's (such as the Linux kernel) didn't appear to use
it...

They also used a fully-associative TLB, which is absurdly expensive, so
I dropped to a 4-way set-associative TLB (while also making the TLB a
bit larger).


They had used a 64-entry fully-associative array; I ended up switching
to 256x 4-way, which is a total of around 1024 TLBEs.

So, in this case, the main TLB ends up as roughly half the size of an L1
cache (in terms of Block RAM), but uses less LUTs than an L1 cache.


As-is, a 16K L1 needs roughly 32K of Block-RAM (roughly half the space
eaten by tagging metadata with a 16-byte line size; while a larger
cache-line size would more efficiently use the BRAM's, it would also
result in a significant increase in LUT cost).


Dan Cross

unread,
May 23, 2023, 2:26:43 PM5/23/23
to
In article <u4ivui$2lenq$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/23/2023 11:22 AM, Dan Cross wrote:
>> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>>> [snip]
>>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>>> VM case, where we have to resort to shadow paging techniques, we
>>>> can handle L2 page faults in the top-level hypervisor.
>>>>
>>>
>>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>>> to exist...
>>
>> Yes, at great expense.
>>
>
>Doesn't seem all that expensive.
>
>
>In terms of LUTs, a soft TLB uses far less than a page walker.

You're thinking in terms of hardware, not software or
performance.

>And, the TLB doesn't need to have a mechanism to send memory requests
>and handle memory responses, ...
>
>It uses some Block-RAM's for the TLB, but those aren't too expensive.
>
>
>In terms of performance, it is generally around 1.5 kilocycle per TLB
>miss (*1), but as-is these typically happen roughly 50 or 100 times per
>second or so.
>
>On a 50 MHz core, only about 0.2% of the CPU time is going into handling
>TLB misses.

That's not the issue.

The hypervisor has to invoke the guest's
TLB miss handler, which will have to fault _again_ once it tries
to write to the TLB to insert an entry; this can lead to several
round-trips, bouncing between the host and guest several times.
With nested VMs, this gets significantly worse.

> [snip]
>One could also have the guest OS use page-tables FWIW.

How does the hypervisor know the format of the guest's page
tables, in general?

- Dan C.

Scott Lurndal

unread,
May 23, 2023, 2:42:14 PM5/23/23
to
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>In article <KvQaM.2005263$t5W7.1...@fx13.iad>,
>Scott Lurndal <sl...@pacbell.net> wrote:
>>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>>In article <ltNaM.3348574$9sn9.2...@fx17.iad>,
>>>Scott Lurndal <sl...@pacbell.net> wrote:
>>>>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>>>[snip]
>>>>>It's still not entirely clear to me how the BSP/BSC is supposed
>>>>>to boot, however. If the world starts in 64-bit mode, and that
>>>>>still requires paging to be enabled, then who sets up the page
>>>>>tables that the BSP starts up on?

>>The boostrap processor:
>>
>> "The CPU starts executing in 64-bit paged mode after reset. The
>> Firmware Interface Table (FIT) contains a reset state structure
>> containing RIP and CR3 that defines the initial execution state of
>> the CPU."
>>
>>I'm presuming that a management processor provided by the
>>mainboard vendor will initialize the FIT out-of-band before
>>releasing the BSP from reset sufficient to execute the UEFI
>>firmware and boot loader.
>
>Ah, the AMD PSP approach, aka, how to start the engine on a
>ship. I guess the days of DRAM training from the BSP are over,
>which isn't necessarily a bad thing.

That doesn't necessarily follow. It is possible to run the
training code from the L1/L2 caches with careful coding.


>>
>>Ok. Coming from the unix/linux world, ring 3 access to those
>>has generally not been allowed and I don't see removal of that
>>capability as a loss, but rather a gain.
>
>ioperm(2) and iopl(2)? :-)

Hence "generally". Can't recall ever seeing it used.

>
>>>>(and access to pci config space, although PCI
>>>>express defines the memory mapped ECAM as an alternative which
>>>>is used on non-intel/amd systems)?
>>>
>>>I try to blot that out of my mind.
>>>
>>>I believe that PCI express deprecates the port-based access
>>>method to config space; MMIO _must_ be supported and in
>>>particular, is the only way to get access to the extended
>>>capability space.
>>
>>Intel cheated and used four unused bits in the high bits
>>of cf8 for the extended config space on some of the southbridge
>>chipsets in the early PCIe days. I've not checked recently
>>to see if newer chipsets/SoCs still support that.
>
>Sigh. Actually, I checked the PCIe spec last night and I think
>I was just wrong. It looks like you can still do it, but
>address/data register pairs are problematic, so I always just
>use ECAM.

Yeah, cf8/cfc was always an Intel-specific mechanism anyway;
I don't recall it being defined in the PCI spec (checking my 1993
copy, which states "System dependent issues ... such as mapping
various PCI address spaces (config, memory, i/o) into host
CPU address spaces, ordering rules, etc are described in the
PCI System Design Guide" - which I don't have a copy of handy)

>
>I think port IO can be useful at very early boot for writing
>debugging data to the UART, or for supporting legacy "PC"
>devices; beyond that, I find it annoying.

Yes, there's no doubt about that. On the AArch64 chips I
work with, the UART is memory mapped (compatible with the
ARM PL011 UART), and has a PCI configuration space; so
the early boot code needs to scan the PCI bus, read the
BAR and use MMIO through that bar (the PCI Enhanced
Allocation capability is present, so the BARs are
fixed rather than programable). Interrupts are handled
via MSI-X vectors (and being level sensitive, a pair
of vectors are used, one to assert and one to deassert
the interrupt).


> We did use it once in
>an experimental hypervisor to execute the equivalent of a VMCALL
>without first having to trap into (guest) kernel mode.

It was useful to use the back door to "mask" NMI as well as
supporting legacy devices in the 3Leaf distributed hypervisor,
and for port 0x80 debugging (we had a PCI card with a pair of
seven segment displays which captured/displayed port 0x80).


>>For ARM, the standard is ECAM (See Server Base System Architecture
>>document).
>>
>>https://developer.arm.com/documentation/den0029/latest/
>
>ARM has always been saner than x86. :-)

We had settled on using ECAM in our ARMv8 chip and suggested
it during the early SBSA discussions with ARM and partners.

Granted ECAM didn't exist (and there wasn't really sufficient
address space in a 32-bit system for a full multi segment ECAM anyway) when
Intel 'invented' cf8/cfc.

> It makes sense that
>they'd start with and stick to ECAM, since (AFAIK) they never
>had programmed IO instructions. How else _would_ you do it?
>(That's a rhetorical question, btw.)

Well, existing art included various peek/poke backdoors similar
to cf8/cfc using MMIO registers at the time. I was on the architecture
team when we started the ARMv8 processors and pushed using ECAM in our
implementation; our MIPS chips had used peek/poke registers in the
PCI controller (onboard devices didn't look like PCI). In the
ARM chip, for discovery purposes, we chose to make all the on-chip
devices and accelerators look like PCI devices and thus something
like the ECAM became necessary.



>>
>>Watching the architecture grow from initial early
>>release documents and modeling the Processor, SMMU, and
>>Interrupt Controller has been educational and fun.
>
>Definitely some really cool things have happened in that space
>while those of us slumming it with x86 have looked on with envy.

I've been fortunate to have worked near the bleeding edge:
from new mainframe architecture (updating a 20 Y.O. Arch) in the early 80s to
uKernel based distributed Unix-like operating systems in the late 80s/early 90's
to a distributed version of IRIX in the late 90's transitioning to
linux (contributed KDB while at SGI) and hypervisors in late 90s
(somewhat in parallel with disco) and a distributed hypervisor
in the 2000's (3Leaf Systems) it's been quite a ride.

BGB

unread,
May 23, 2023, 3:42:15 PM5/23/23
to
On 5/23/2023 1:26 PM, Dan Cross wrote:
> In article <u4ivui$2lenq$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>> On 5/23/2023 11:22 AM, Dan Cross wrote:
>>> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>>>> [snip]
>>>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>>>> VM case, where we have to resort to shadow paging techniques, we
>>>>> can handle L2 page faults in the top-level hypervisor.
>>>>>
>>>>
>>>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>>>> to exist...
>>>
>>> Yes, at great expense.
>>>
>>
>> Doesn't seem all that expensive.
>>
>>
>> In terms of LUTs, a soft TLB uses far less than a page walker.
>
> You're thinking in terms of hardware, not software or
> performance.
>

Software cost is usually considered "virtually free" in comparison to
hardware. So what if the virtual memory subsystem needs a few more kB
due to a more complex hardware interface? ...

Would care more about performance if my benchmarks showed it "actually
mattered" in terms of macro-scale performance.


Spending a few orders of magnitude more clock-cycles on a TLB miss
doesn't matter if the TLB miss rate is low enough that it disappears in
the noise.

It is like making a big fuss over the clock-cycle budget of having a
1kHz clock-timer IRQ...

Yes, those clock IRQs eat clock-cycles, but mostly, there isn't too much
reason to care.


Well, except if one wants to do a 32kHz clock-timer like on the MSP430,
currently this is an area where the MSP430 wins (trying to do a 32kHz
timer IRQ basically eats the CPU...).



>> And, the TLB doesn't need to have a mechanism to send memory requests
>> and handle memory responses, ...
>>
>> It uses some Block-RAM's for the TLB, but those aren't too expensive.
>>
>>
>> In terms of performance, it is generally around 1.5 kilocycle per TLB
>> miss (*1), but as-is these typically happen roughly 50 or 100 times per
>> second or so.
>>
>> On a 50 MHz core, only about 0.2% of the CPU time is going into handling
>> TLB misses.
>
> That's not the issue.
>
> The hypervisor has to invoke the guest's
> TLB miss handler, which will have to fault _again_ once it tries
> to write to the TLB to insert an entry; this can lead to several
> round-trips, bouncing between the host and guest several times.
> With nested VMs, this gets significantly worse.
>

So?...

If it is maybe only happening 50 to 100 times a second or so, it doesn't
matter. Thousands or more per second, it does, but in the general case
it does not, provided the CPU has a reasonably sized TLB.

If it did start to be an issue (with programs with a fairly large
working set), one can make the TLB bigger (and/or go to 64K pages, but
this has its own drawbacks).


It maybe matters more if the OS also swaps page tables for multitasking
and if each page-table swap involves a TLB flush, but I am not doing it
that way (one could use ASIDs; in my case I am just using a huge
monolithic virtual address space).


Granted, the use of a monolithic address space does make it seriously
annoying to run RISC-V ELF objects on top of this, as GCC
apparently doesn't support either PIE or Shared Objects, ...

At least, my PEL4 binaries were designed to be able to deal with a
monolithic virtual address space (and also use in a NO-MMU environment).

But, in this case, there is also the fallback that I have a 96-bit
address-mode extension with a 65C816-like addressing scheme, which can
mimic having a number of 48-bit spaces within an otherwise monolithic
address space.

Though, it does currently have the limitation of effectively dropping the
TLB to 2-way associative when active.


>> [snip]
>> One could also have the guest OS use page-tables FWIW.
>
> How does the hypervisor know the format of the guest's page
> tables, in general?
>

They have designated registers and the tree formats are documented as
part of the ISA/ABI specs...


One could define it such that if page tables are used, one of the
defined formats, and the page is present, the hypervisor could be
allowed to translate the page itself and skip the TLB Miss ISR (falling
back to the ISR if the page-table is flagged as an unknown format).

Though, generally, things like ACL Miss ISR's would still need to be
forwarded to the guest, but these are much less common (it is generally
sufficient to use a 4 or 8 entry cache for ACL checks).


As-is, the defined formats are:
xxx: N-level Page-Table
3 levels for 48b address and 16K pages.
4 levels for 48b address and 4K pages.
Bit pattern encodes tree depth and format.
013: AVL Tree (Full Address)
113: B-Tree (Full Address)
213: Hybrid B-Tree (last-level page table)
313: Hybrid B-Tree (last 2-levels page table)


The B-Tree cases being mostly intended for 96-bit modes, since:
48-bit mode works fine with a conventional page-table;
As noted, 8 level page tables suck...

At present, most of the page-table formats assume 64-bit entries with a
48-bit physical address.
Low order bits are control flags;
Upper 16 bits are typically the ACLID.
The ACLID indirectly encoding "who can do what with this page".

There was an older VUGID system, but this system requires more bits to
encode (user/group/other, rwxrwxrwx). So, it has been partially
deprecated in favor of using ACL checks for everything.


> - Dan C.
>

Dan Cross

unread,
May 23, 2023, 4:18:55 PM5/23/23
to
In article <u4j4qu$2luoh$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/23/2023 1:26 PM, Dan Cross wrote:
>[snip]
>>> On a 50 MHz core, only about 0.2% of the CPU time is going into handling
>>> TLB misses.
>>
>> That's not the issue.
>>
>> The hypervisor has to invoke the guest's
>> TLB miss handler, which will have to fault _again_ once it tries
>> to write to the TLB to insert an entry; this can lead to several
>> round-trips, bouncing between the host and guest several times.
>> With nested VMs, this gets significantly worse.
>>
>
>So?...

I wonder: have you looked into why essentially every modern
architecture in common use today uses hardware page tables?
The hardware engineers working on them are not stupid, and they are
perfectly well aware of everything you said about e.g. larger
TLBs. Yet there is a reason they chose to implement things
the way essentially every extant modern architecture has.
Perhaps they are aware of something you would find illuminating.

The issues I'm talking about very much exist and very much
affect world-world designs. I'll take the slightly larger cost
in transistors over the disadvantages, including forcing
pipeline flushes, thrashing the icache to handle TLB fault
misses, and significantly more complex virtualization.

Besides....what do you do if a guest decides it wants to insert
a mapping covering part the hypervisor itself into the TLB?

> [snip]
>>> One could also have the guest OS use page-tables FWIW.
>>
>> How does the hypervisor know the format of the guest's page
>> tables, in general?
>>
>
>They have designated registers and the tree formats are documented as
>part of the ISA/ABI specs...

The point of a hypervisor is to provide a faithful emulation
of the _hardware_: it's up to the guest to decide what ABI it
uses. The hypervisor can't really force that onto the guest,
and sothere's no "ABI" as such in a non-paravirtualized
hypervisor. The whole point is that unmodified guests can run
without change and think that they're running directly on the
bare metal.

It's unclear what the point of an ISA-mandated page table format
would be in a system that doesn't use them. What prevents a
guest from just ignoring them and doing its own thing?

- Dan C.

Dan Cross

unread,
May 23, 2023, 4:36:35 PM5/23/23
to
In article <7O7bM.3161054$iS99.2...@fx16.iad>,
Are we assuming the external processor can write to those
caches? If so, perhaps, but it begs the question: why? I've
evidently got a perfectly capable processor running before the
x86 cores can even come out of reset, and it can mess with the
microarchitectural state of the x86 CPU anyway: why not just
let it train DRAM as well?

If it can't write to those caches, then I don't see how it could
initialize page tables that the CPU would start executing from,
unless it dumped them into a mandatory SRAM buffer or something.

>>>Ok. Coming from the unix/linux world, ring 3 access to those
>>>has generally not been allowed and I don't see removal of that
>>>capability as a loss, but rather a gain.
>>
>>ioperm(2) and iopl(2)? :-)
>
> Hence "generally". Can't recall ever seeing it used.

I suppose I misinterpreted "generally." A general-purpose
mechanism to expose the functionality exists, though I agree
that relatively few applications make use of it. A friend of
mine did a PhD back in the 90s and did use it pretty extensively
to take data from an ISA-bus device (no interrupts though; he
just burned CPU polling).

>[snip]
>>Sigh. Actually, I checked the PCIe spec last night and I think
>>I was just wrong. It looks like you can still do it, but
>>address/data register pairs are problematic, so I always just
>>use ECAM.
>
>Yeah, cf8/cfc was always an intel specific mechanism anyway;
>I don't recall it being defined in the PCI Spec (check's 1993
>copy, which states "System dependent issues ... such as mapping
>various PCI address spaces (config, memory, i/o) into host
>CPU address spaces, ordering rules, etc are described in the
>PCI System Design guide" - which I don't have a copy of handy)

I checked my copy of the 6.0 spec yesterday and it mentions
"io" in addition to config and MMIO, which I interpret to mean
port-space IO.

>>I think port IO can be useful at very early boot for writing
>>debugging data to the UART, or for supporting legacy "PC"
>>devices; beyond that, I find it annoying.
>
>Yes, there's no doubt about that. On the AArch64 chips I
>work with, the UART is memory mapped (compatible with the
>ARM PL011 UART), and has a PCI configuration space; so
>the early boot code needs to scan the PCI bus, read the
>BAR and use MMIO through that bar (the PCI Enhanced
>Allocation capability is present, so the BARs are
>fixed rather than programable). Interrupts are handled
>via MSI-X vectors (and being level sensitive, a pair
>of vectors are used, one to assert and one to deassert
>the interrupt).

Level sensitive? Huh.

I wrote a driver for a memory-mapped UART in an AMD SoC complex
a few months ago; we use it for early boot on our machines. It
is soon enough after coming out of reset that I don't bother
with interrupts; we just poll.

Having to go through config space to map a BAR to drive a UART
seems excessive to me.

>[snip]
>Well, existing art included various peek/poke backdoors similar
>to cf8/cfc using MMIO registers at the time. I was on the architecture
>team when we started the ARMv8 processors and pushed using ECAM in our
>implemention; our MIPS chips had used peek/poke registers in the
>PCI controller (onboard devices didn't look like PCI). In the
>ARM chip, for discovery purposes, we chose make all the on-chip
>devices and accelerators look like PCI devices and thus something
>like the ECAM became necessary.

Oh of course; memory-mapped address/data register pairs would
give the same effect.

Making everything look like PCI certainly makes things regular.

>>Definitely some really cool things have happened in that space
>>while those of us slumming it with x86 have looked on with envy.
>
>I've been fortunate to have worked near the bleeding edge:
>from new mainframe architecture (updating a 20 Y.O. Arch) in the early 80s to
>uKernel based distributed Unix-like operating systems in the late 80s/early 90's
>to a distributed version of IRIX in the late 90's transitioning to
>linux (contributed KDB while at SGI) and hypervisors in late 90s
>(somewhat in parallel with disco) and a distributed hypervisor
>in the 2000's (3Leaf Systems) it's been quite a ride.

Nice. 3Leaf sounds interesting; are there any papers on it
available?

- Dan C.

Scott Lurndal

unread,
May 23, 2023, 5:00:41 PM5/23/23
to
BGB <cr8...@gmail.com> writes:
>On 5/23/2023 1:26 PM, Dan Cross wrote:
>> In article <u4ivui$2lenq$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>> On 5/23/2023 11:22 AM, Dan Cross wrote:
>>>> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>>>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>>>>> [snip]
>>>>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>>>>> VM case, where we have to resort to shadow paging techniques, we
>>>>>> can handle L2 page faults in the top-level hypervisor.
>>>>>>
>>>>>
>>>>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>>>>> to exist...
>>>>
>>>> Yes, at great expense.
>>>>
>>>
>>> Doesn't seem all that expensive.
>>>
>>>
>>> In terms of LUTs, a soft TLB uses far less than a page walker.
>>
>> You're thinking in terms of hardware, not software or
>> performance.
>>
>
>Software cost is usually considered "virtually free" in comparison to
>hardware.

That's not my experience. Software has a cost. Substantial even.

And for something so integrally associated with performance,
TLB refills are never free, and table walks aren't zero cost;
hardware or software.

Consider, for example, that a table walk for a guest access
requires up to 22 discrete translation table accesses to fill
a TLB. Hardware walkers often cache intermediate results to
reduce the cost for subsequent walks. In fast internal caches.

Software can't compete with that.



>Would care more about performance if my benchmarks showed it "actually
>mattered" in terms of macro-scale performance.

How representative of real-world workloads are your benchmarks?

>
>
>Spending a few orders of magnitude more clock-cycles on a TLB miss
>doesn't matter if the TLB miss rate is low enough that it disappears in
>the noise.

Good luck with that.

Scott Lurndal

unread,
May 23, 2023, 5:17:19 PM5/23/23
to
cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>In article <7O7bM.3161054$iS99.2...@fx16.iad>,
>Scott Lurndal <sl...@pacbell.net> wrote:

>>>Ah, the AMD PSP approach, aka, how to start the engine on a
>>>ship. I guess the days of DRAM training from the BSP are over,
>>>which isn't necessarily a bad thing.
>>
>>That doesn't necessarily follow. It is possible to run the
>>training code from the L1/L2 caches with careful coding.
>
>Are we assuming the external processor can write to those
>caches? If so, perhaps, but it begs the question: why? I've
>evidently got a perfectly capable processor running before the
>x86 cores can even come out of reset, and it can mess with the
>microarchitectural state of the x86 CPU anyway: why not just
>let it train DRAM as well?

In my experience, training the controllers when you have a dozen
or more DRAM channels requires more oomph than the little microcontrollers
are capable of. And the microcontroller may not have access to
the internal bus (mesh/ring) structure required to program
the controllers, read the SPDs etc.

>
>If it can't write to those caches, then I don't see how it could
>initialize page tables that the CPU would start executing from,
>unless it dumped them into a mandatory SRAM buffer or something.

A small page table in part of the LLC should suffice. On the
ARM chips, we come up with paging disabled, push some code into the
LLC, and take the BSP out of reset to initialize the DRAM controllers.

With respect to x86-S, it's all speculation anyway :-)
A PCI endpoint BAR can be designated an I/O BAR, in which case
it slots into the Intel 64K I/O (PIO) space (in and out instructions).
VERY few devices advertise I/O bars; it's basically deprecated.

An endpoint BAR designated as a memory BAR slots into the physical
address space.

The third PCI address space is the config space and the access
mechanism is up to the host. (peek/poke, ecam).

>
>>>I think port IO can be useful at very early boot for writing
>>>debugging data to the UART, or for supporting legacy "PC"
>>>devices; beyond that, I find it annoying.
>>
>>Yes, there's no doubt about that. On the AArch64 chips I
>>work with, the UART is memory mapped (compatible with the
>>ARM PL011 UART), and has a PCI configuration space; so
>>the early boot code needs to scan the PCI bus, read the
>>BAR and use MMIO through that bar (the PCI Enhanced
>>Allocation capability is present, so the BARs are
>>fixed rather than programable). Interrupts are handled
>>via MSI-X vectors (and being level sensitive, a pair
>>of vectors are used, one to assert and one to deassert
>>the interrupt).
>
>Level sensitive? Huh.

Heritage of the pl011.


>Having to go through config space to map a BAR to drive a UART
>seems excessive to me.

Ah, but it leverages the common OS discovery code. Granted it
makes boot software slightly more complicated, but the flexibility
is worth it.



>>
>>I've been fortunate to have worked near the bleeding edge:
>>from new mainframe architecture (updating a 20 Y.O. Arch) in the early 80s to
>>uKernel based distributed Unix-like operating systems in the late 80s/early 90's
>>to a distributed version of IRIX in the late 90's transitioning to
>>linux (contributed KDB while at SGI) and hypervisors in late 90s
>>(somewhat in parallel with disco) and a distributed hypervisor
>>in the 2000's (3Leaf Systems) it's been quite a ride.
>
>Nice. 3Leaf sounds interesting; are there any papers on it
>available?

There are a couple of patents (one granted, one abandoned)
at the USPTO. Not much else out in the public other than
the various trade mag stuff from that time period.

Basically we built an ASIC that extends the coherency domain
across infiniband (or 10G Ethernet) to allow creation of
large shared-memory cache-coherent systems from 1u or 2u
building blocks. The ASIC talked HyperTransport for AMD
Opteron CPUs and QuickPath for Intel CPUs.

Had a 32-node, 64-processor system at LLNL for evaluation
before the bottom dropped out of the markets, an acquisition
fell through and we shut down.

Much the same as CXL-Cache today. We even considered PCIe
as the transport, but the switching latencies for IB were
less than 100ns.

muta...@gmail.com

unread,
May 23, 2023, 7:02:20 PM5/23/23
to
On Wednesday, May 24, 2023 at 5:17:19 AM UTC+8, Scott Lurndal wrote:

> >Nice. 3Leaf sounds interesting; are there any papers on it
> >available?
> There are a couple of patents (one granted, one abandoned)
> at the USPTO. Not much else out in the public other than
> the various trade mag stuff from that time period.
>
> Basically we built an ASIC that extends the coherency domain
> across infiniband (or 10G Ethernet) to allow creation of
> large shared-memory cache-coherent systems from 1u or 2u
> building blocks. The ASIC talked HyperTransport for AMD
> Opteron CPUs and QuickPath for Intel CPUs.
>
> Had a 32-node, 64-processor system at LLNL for evaluation
> before the bottom dropped out of the markets, an acquisition
> fell through and we shut down.

What market dropped, when and why?

Thanks. Paul.

BGB

unread,
May 23, 2023, 9:08:43 PM5/23/23
to
I think a lot of this is making a big fuss over nothing, FWIW.

But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
along reasonably well with software-managed TLB.

Their downfall wasn't related to them spending an extra fraction of a
percent of CPU time on handling TLB Miss ISRs.

Similarly, this also wasn't what caused Itanium to fail (nor was it due
to it being VLIW based, etc).

And, likewise, the IBM POWER ISA is still around, ...

...



> Besides....what do you do if a guest decides it wants to insert
> a mapping covering part the hypervisor itself into the TLB?
>

There is no reason for the guest to be able to put something
into the TLB which would somehow circumvent the host; since anything the
guest tries to load into the TLB will need to first get translated
through the host.

This is like asking why a program running in virtual memory can't just
create a pointer into memory inside the kernel:
The application doesn't have access to the kernel's address space to
begin with.

Or, stated another way, the entire "physical address" space for the
guest would itself be a virtual memory space running in user-mode.



>> [snip]
>>>> One could also have the guest OS use page-tables FWIW.
>>>
>>> How does the hypervisor know the format of the guest's page
>>> tables, in general?
>>>
>>
>> They have designated registers and the tree formats are documented as
>> part of the ISA/ABI specs...
>
> The point of a hypervisor is to provide a faithful emulation
> of the _hardware_: it's up to the guest to decide what ABI it
> uses. The hypervisor can't really force that onto the guest,
> and sothere's no "ABI" as such in a non-paravirtualized
> hypervisor. The whole point is that unmodified guests can run
> without change and think that they're running directly on the
> bare metal.
>
> It's unclear what the point of an ISA-mandated page table format
> would be in a system that doesn't use them. What prevents a
> guest from just ignoring them and doing its own thing?
>

You can have either accurate hardware level emulation, or slightly
better performance, and make a tradeoff there.

If the OS wants its own page-table format, it can specify that it is
using its own encoding easily enough via the tag bits in the TTB
register or similar.

And, if it claims to be using a standard table format but is doing
something different, and crashes as a result, well, that is its problem;
and/or one adds a flag or similar to the emulator to disable any
"faster" page translation.


Not like it is likely to matter all that much.
Hence why I was using B-Trees for the 96-bit mode...


One can note that fetching something from a B-Tree is not exactly a fast
operation, but still roughly 3 orders of magnitude faster than swapping
a page in the pagefile.

...


> - Dan C.
>

BGB

unread,
May 24, 2023, 1:12:42 AM5/24/23
to
From my experience thus far, the relative slowness of the mechanism
doesn't really seem to matter too much in practice.


The main case it would matter is if one flushes the TLB every time one
switches from one task to another, but there is a simpler solution:
Don't do this...

If one needs multiple address spaces, they can use ASIDs rather than
flushing every time (so, by the time a task regains focus, most of its
pages are still in the TLB). In my case, the ASID is 16 bits, so there
isn't too much of a shortage here.


Eg:

TLB Entry (TLBE):
* (127:112) = VUGID / ACLID
* (111: 76) = Virtual Address Page
* ( 75: 64) = VUGID KRR Access
* ( 63: 48) = ASID
* ( 47: 12) = Physical Address Page
* ( 11: 0) = Page Control bits

For 48-bit VA's.

X2TLB Entry (High):
* (127:112) = Reserved
* (111: 64) = Virtual Address Page (High 48 Bits)
* ( 63: 16) = Physical Address Page (High 48 Bits)
* ( 11: 0) = Page Control bits

Also used for 96-bit VAs, which are essentially handled by
double-pumping the LDTLB...
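
The context-switch path under that scheme is then basically just
updating the current-ASID field; a sketch with hypothetical names
(set_current_asid standing in for whatever control-register write the
ISA provides):

  struct task { uint16_t asid; /* ... */ };

  static void switch_address_space(struct task *next)
  {
      /* No TLB flush: entries tagged with other ASIDs simply stop
         matching, and are still warm when their task runs again. */
      set_current_asid(next->asid);
  }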


There are cases where it could get bad, such as TLB flushing on every
task switch, or tasks with a fairly large working set (so nearly the
entire TLB gets cycled out for each task).

But, thus far, this doesn't really seem to be too much of an issue.


>
>
>> Would care more about performance if my benchmarks showed it "actually
>> mattered" in terms of macro-scale performance.
>
> How representative of real-world workloads are your benchmarks?
>

Mostly running stuff like Doom and Quake and similar.


The TLB miss rate for Doom is negligible though, as its working set fits
almost entirely in the TLB.

Things like Quake, or a small Minecraft like 3D engine of mine, have a
slightly bigger working set.


By the time the working set gets large enough to put any real strain on
the TLB, one is going to be doing an excessive amount of swapping, which
would be a much bigger bottleneck than the TLB misses.

As can be noted, the FPGA boards I am using are limited to 128 and 256
MB of RAM (and typically, the "core" of a program is working on a
somewhat smaller region of memory than its total memory footprint).



Like, ~ 1 kilocycle (for a TLB miss) vs ~ 1 megacycle (for a page
fault). Page faults are a much more serious issue...


Or, if it really mattered, I could write the TLB miss handler in ASM and
*not* use a B-Tree...

As-is, it is written in C and uses a B-Tree; I suspect I would notice if
this were wrecking the performance...

Namely, it would show up as a higher ranking in the profiler.



>>
>>
>> Spending a few orders of magnitude more clock-cycles on a TLB miss
>> doesn't matter if the TLB miss rate is low enough that it disappears in
>> the noise.
>
> Good luck with that.
>

This is my experience thus far.

I can note that my emulator has a built-in profiler, so I can see where
the clock-cycle budget is going.



But, as noted, much below around 500 TLB misses per second or so,
the cost tends to disappear in the noise for a 50MHz CPU.


Dan Cross

unread,
May 24, 2023, 7:20:15 AM5/24/23
to
In article <L1abM.364532$0XR7....@fx07.iad>,
Scott Lurndal <sl...@pacbell.net> wrote:
>cr...@spitfire.i.gajendra.net (Dan Cross) writes:
>>In article <7O7bM.3161054$iS99.2...@fx16.iad>,
>>Scott Lurndal <sl...@pacbell.net> wrote:
>
>>>>Ah, the AMD PSP approach, aka, how to start the engine on a
>>>>ship. I guess the days of DRAM training from the BSP are over,
>>>>which isn't necessarily a bad thing.
>>>
>>>That doesn't necessarily follow. It is possible to run the
>>>training code from the L1/L2 caches with careful coding.
>>
>>Are we assuming the external processor can write to those
>>caches? If so, perhaps, but it begs the question: why? I've
>>evidently got a perfectly capable processor running before the
>>x86 cores can even come out of reset, and it can mess with the
>>microarchitectural state of the x86 CPU anyway: why not just
>>let it train DRAM as well?
>
>In my experience, training the controllers when you have a dozen
>or more DRAM channels requires more oomph than the little microcontrollers
>are capable of. And the microcontroller may not have access to
>the internal bus (mesh/ring) structure required to program
>the controllers, read the SPDs etc.

The AMD PSP is a full-on application profile core complex
embedded in the SoC. We don't have a lot of insight into
what exactly runs on it, but it does DRAM training and loads
the "BIOS" into DRAM before the x86 cores come out of reset.

>>If it can't write to those caches, then I don't see how it could
>>initialize page tables that the CPU would start executing from,
>>unless it dumped them into a mandatory SRAM buffer or something.
>
>A small page table in part of the LLC should suffice. On the
>arm chips, we come with paging disabled, push some code into the
>LLC and take the bsp out of reset to initialize the DRAM controllers.
>
>With respect to x86-S, it's all speculation anyway :-)

Fair. If we're changing the architecture, though, I do wonder
why we can't just have an un-paged 64-bit execution mode. Start
me up in 64-bit mode with %cr3 cleared and the paging bit in
%cr0 cleared and let me run against physical memory until I am
ready to turn on paging.

>>>Yeah, cf8/cfc was always an intel specific mechanism anyway;
>>>I don't recall it being defined in the PCI Spec (check's 1993
>>>copy, which states "System dependent issues ... such as mapping
>>>various PCI address spaces (config, memory, i/o) into host
>>>CPU address spaces, ordering rules, etc are described in the
>>>PCI System Design guide" - which I don't have a copy of handy)
>>
>>I checked my copy of the 6.0 spec yesterday and it mentions
>>"io" in addition to config and MMIO, which I interpret to mean
>>port-space IO.
>
>A PCI Endpoint BAR can be designated an I/O bar, in which case
>it slots into the intel 64k I/O (pio) space (in and out instructions).
>VERY few devices advertise I/O bars; it's basically deprecated.

Ah, I checked again and what I remembered is that this only
appears to be true for a legacy endpoint.

>[snip]
>Ah, but it leverages the common OS discovery code. Granted it
>makes boot software slightly more complicated, but the flexibility
>is worth it.

Yeah, I can definitely see the attraction for that.

>[snip]
>>Nice. 3Leaf sounds interesting; are there any papers on it
>>available?
>
>There are a couple of patents (one granted, one abandoned)
>at the USPTO. Not much else out in the public other than
>the various trade mag stuff from that time period.
>
>Basically we built an ASIC that extends the coherency domain
>across infiniband (or 10G Ethernet) to allow creation of
>large shared-memory cache-coherent systems from 1u or 2u
>building blocks. The ASIC talked HyperTransport for AMD
>Opteron CPUs and QuickPath for Intel CPUs.
>
>Had a 32-node, 64-processor system at LLNL for evaluation
>before the bottom dropped out of the markets, an acquisition
>fell through and we shut down.
>
>Much the same as CXL-Cache today. We even considered PCIe
>as the transport, but the switching latencies for IB were
>less than 100ns.

Oh wow, very cool.

- Dan C.

Dan Cross

unread,
May 24, 2023, 7:33:23 AM5/24/23
to
You can think that all you want, but sadly, that doesn't mean
that these aren't actual problems for real-world systems.

>But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
>along reasonably well with software-managed TLB.

In a very different time, with very different demands on the
architecture.

>[snip]
>> Besides....what do you do if a guest decides it wants to insert
>> a mapping covering part the hypervisor itself into the TLB?
>
>There is no reason for the guest to be able to be able to put something
>into the TLB which would somehow circumvent the host; since anything the
>guest tries to load into the TLB will need to first get translated
>through the host.

Right. But the guest expects to run within a virtual address
space of its own construction. You have a single TLB at the top
level that is shared by both guest and host and must be
multiplexed between them; what do you do when they conflict?

>This is like asking why a program running in virtual memory can't just
>create a pointer into memory inside the kernel:
>The application doesn't have access to the kernel's address space to
>begin with.
>
>Or, stated another way, the entire "physical address" space for the
>guest would itself be a virtual memory space running in user-mode.

Sure, but this isn't about the physical address space; it's
about management of the virtual address space.

Or one can just use architecturally-defined page tables and a
hardware walker and have the best of both worlds. Waterman lays
out the trade-offs for RISC-V in his dissertation; it is only a
small amount of extra work.

>Not like it is likely to matter all that much.
>Hence, why I was using B-Trees for the 96-bit mode...
>
>
>One can note that fetching something from a B-Tree is not exactly a fast
>operation, but still roughly 3 orders of magnitude faster than swapping
>a page in the pagefile.

Swapping from secondary storage seems entirely irrelevant.

- Dan C.

Scott Lurndal

unread,
May 24, 2023, 9:45:55 AM5/24/23
to
The cool part was the software. A bare-metal hypervisor that
allowed virtual servers to be composed dynamically on the system;
We didn't multiplex guests on any given core, but rather assigned
cores and memory to each guest when the guest is created. If
the guest supported CPU and Memory hot plug/unplug (e.g. linux),
we could dynamically move processors and memory between guests
as demand required. I/O was virtualized and handled by one or
more virtual I/O servers over infiniband/10Ge, and as SR-IOV
and MR-IOV were starting to be deployed, we could grant direct
access to a VF to a guest. We also supported XEN-style virtual
I/O drivers in the guest. We worked with the XEN folks at
Cambridge in the early days, before SVM was available; the CTO of AMD
was on our advisory board, which helped with the HyperTransport
implementation on the ASIC.

A copy of the hypervisor (DVMM - Distributed Virtual Machine
Manager) ran on each node and they all communicated via shared
memory.

Dan Cross

unread,
May 24, 2023, 10:33:49 AM5/24/23
to
In article <WwobM.736280$5CY7....@fx46.iad>,
That IS cool.

- Dan C.

BGB

unread,
May 24, 2023, 1:32:17 PM5/24/23
to
Depends on what one wants.


I am mostly imagining an architecture for embedded-systems style
use-cases (but, more DSP-like than microcontroller-like).

Say, something that does real-time audio/video processing and can run
neural nets.

And, trying to optimize some things for a world where Moore's Law has
come to a halt.

So, for example, the design is an in-order VLIW, since it seems like
optimizing for OoO will become less attractive once Moore's Law ends
(say, if one wants more performance in less die area and less watts,
rather than maximum performance but throwing lots of die area and watts
at it).


But, say, one wants still better performance per clock than a
conventional in-order RISC design.


>> [snip]
>>> Besides....what do you do if a guest decides it wants to insert
>>> a mapping covering part the hypervisor itself into the TLB?
>>
>> There is no reason for the guest to be able to be able to put something
>> into the TLB which would somehow circumvent the host; since anything the
>> guest tries to load into the TLB will need to first get translated
>> through the host.
>
> Right. But the guest expects to run within a virtual address
> space of its own construction. You have a single TLB at the top
> level that is shared by both guest and host and must be
> multiplexed between them; what do you do when they conflict?
>

The guest doesn't push into the main TLB.
The whole thing is a "pull" model, not a "push" model.


Rather, it would be more like:
Host experiences a TLB miss, checks its structures;
If it is host memory, the host can translate it as appropriate;
  Checks the guest's TLB;
If it is found in the guest's TLB, pull it down into the host.
This performs any address translation.
If it is not found, trap into guest.
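
As a rough C sketch of that flow (every helper name here is
hypothetical, just to make the steps concrete):

  #include <stdint.h>

  typedef struct { uint64_t hi, lo; } tlbe_t;

  /* Hypothetical helpers: */
  extern int    is_host_asid(uint16_t asid);
  extern tlbe_t host_translate(uint64_t va, uint16_t asid);
  extern int    vtlb_lookup(tlbe_t *e, uint64_t va, uint16_t asid);
  extern tlbe_t s2_translate(tlbe_t e);   /* guest-phys -> host-phys */
  extern void   tlb_insert(tlbe_t e);
  extern void   guest_trap(uint64_t va);

  void host_tlb_miss(uint64_t va, uint16_t asid)
  {
      tlbe_t e;
      if (is_host_asid(asid)) {                /* host memory */
          tlb_insert(host_translate(va, asid));
      } else if (vtlb_lookup(&e, va, asid)) {  /* hit in guest's TLB */
          tlb_insert(s2_translate(e));         /* pull into real TLB */
      } else {
          guest_trap(va);  /* guest runs its own miss handler */
      }
  }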

The rate of guest TLB misses could be reduced by giving it a bigger TLB
than on the actual hardware, say, 1024x 4-way, as this part is
"basically invisible" to the guest OS (apart from reducing the number of
TLB misses).


As for virtual-address spaces having the same addresses:
You can use ASIDs.

Say, for example:
0000..3FFF: Represent host OS ASIDs
4000: Guest OS Physical Space
4xxx..7xxx: Mapped subset of guest ASIDs
Likely via a mapping table or similar.


Where, it can be noted that a TLBE will only "hit" if the TLBE ASID
matches the current TTB ASID. Otherwise, they are seen as different
address spaces, and the TLBE will be ignored.

Granted, pulling this off this way does mean that the hypervisor would
need to be hooked fairly directly into the virtual memory subsystem.

There are other ways to do it, granted, but they would likely be
lower-performance.


One other strategy being to remap things within the 96-bit space.

Say, guest thinks it is looking at:
0000_00000000_0000_0xxxxxxx
But, really, it is looking at:
1xxx_xxxxxxxx_0000_0xxxxxxx

Noting that the 96-bit space is far larger than the ASID space, and it
is unlikely that the guest will use all of it.


However, this would mean needing to trap and emulate any XMOV.x
instructions (which operate on 128-bit pointers).

Note that "Trap on XMOV.x" is currently the default behavior for
usermode code (can be configured per-task), mostly as the idea is that
normal usermode programs will be limited to a single 48-bit "quadrant".


Well, unless a sort of high-address-translation table were added
(possibly also keyed with an ASID, similar to the main TLB), effectively
treating the high 48-bits as a double-translated address.


>> This is like asking why a program running in virtual memory can't just
>> create a pointer into memory inside the kernel:
>> The application doesn't have access to the kernel's address space to
>> begin with.
>>
>> Or, stated another way, the entire "physical address" space for the
>> guest would itself be a virtual memory space running in user-mode.
>
> Sure, but this isn't about the physical address space; it's
> about management of the virtual address space.
>

The guest's physical space would be the virtual address space for the host.

Hardware page walking and plain page tables are overly limiting and
inflexible though.

With software-managed TLB, one can do all sorts of stuff that is
basically not possible (or at least practical) with a hardware
page-walker (and without a bunch of special case hardware support needed
to support differences in addressing or memory access behavior).


Even simple things, like being able to enforce access-rights checking on
pages:
Task A has RW access to a given page;
Task B has R- access to the same page;
Task C has no access;
...

Are more complicated (and limited) with page tables.

In "ye olde page tables", one needs a separate set of page tables for
each process (or each thread) which may have any differences in access
rights to a given region of memory.
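
As a sketch of what the software handler's side of this could look
like (hypothetical names and layout; just illustrating per-task rights
on a shared page):

  #include <stdint.h>

  #define ACL_R 1
  #define ACL_W 2

  /* One entry per (page, task) pair that has explicit rights. */
  typedef struct { uint64_t page; uint16_t task; uint8_t rights; } acl_ent;

  static int acl_allows(const acl_ent *tab, int n, uint64_t page,
                        uint16_t task, uint8_t need)
  {
      for (int i = 0; i < n; i++)
          if (tab[i].page == page && tab[i].task == task)
              return (tab[i].rights & need) == need;
      return 0;  /* default deny: Task C's "no access" case */
  }

So Task A gets an entry with ACL_R|ACL_W, Task B one with ACL_R, and
Task C simply has no entry.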

Conventional hardware is like:
Well, you have Ring 0/1/2/3 or User/Supervisor;
 But, this is lame: it only addresses the OS, not task-to-task memory access.

Or, "well, these threads have free access to most of the process; but
this other thread runs in a sandbox over here, and has to jump through
an execute-only page to access this other thing over there", ...


>> Not like it is likely to matter all that much.
>> Hence, why I was using B-Trees for the 96-bit mode...
>>
>>
>> One can note that fetching something from a B-Tree is not exactly a fast
>> operation, but still roughly 3 orders of magnitude faster than swapping
>> a page in the pagefile.
>
> Swapping from secondary storage seems entirely irrelevant.
>

It is for comparison...

This is the cost we actually have reason to care about, and it becomes
much more of an issue, much quicker (typically when one is running low
on RAM).


Dan Cross

unread,
May 24, 2023, 2:12:26 PM5/24/23
to
In article <u4lhjo$31dht$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/24/2023 6:33 AM, Dan Cross wrote:
>>> I think a lot of this is making a big fuss over nothing, FWIW.
>>
>> You can think that all you want, but sadly, that doesn't mean
>> that these aren't actual problems for real-world systems.
>>
>>> But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
>>> along reasonably well with software-managed TLB.
>>
>> In a very different time, with very different demands on the
>> architecture.
>
>Depends on what one wants.
>
>I am mostly imagining an architecture for embedded-systems style
>use-cases (but, more DSP-like than microcontroller-like).

This perhaps explains why you seem to be discounting the use
cases others are telling you are important in other application
domains.

>>> [snip]
>>>> Besides....what do you do if a guest decides it wants to insert
>>>> a mapping covering part the hypervisor itself into the TLB?
>>>
>>> There is no reason for the guest to be able to be able to put something
>>> into the TLB which would somehow circumvent the host; since anything the
>>> guest tries to load into the TLB will need to first get translated
>>> through the host.
>>
>> Right. But the guest expects to run within a virtual address
>> space of its own construction. You have a single TLB at the top
>> level that is shared by both guest and host and must be
>> multiplexed between them; what do you do when they conflict?
>
>The guest doesn't push into the main TLB.
>The whole thing is a "pull" model, not a "push" model.
>
>Rather, it would be more like:
> Host experiences a TLB miss, checks its structures;
> If it is host memory, the host can translate it as appropriate;
> Checks the guests TLB;
> If it is found in the guest's TLB, pull it down into the host.
> This performs any address translation.
> If it is not found, trap into guest.

This is not the scenario that I'm talking about here.

You have the host. The host has chosen to run itself at virtual
address 0xWhatever. The host is executing a guest, that has
also chosen to run itself at 0xWhatever. The type of fault you
receive here is actually a page protection fault (presumably,
you are forcing the guest to run in userspace while the
hypervisor runs in privileged mode), not a TLB miss; there is
already a TLB entry for the address in question, since you were
(presumably) just using it in the host. So what do you do?

There are several answers here, btw; the obvious one is trap and
emulate the entirety of the guest's access to this region of the
virtual address space, but that's a) complex, and b) expensive.

Another is to make the hypervisor relocatable, and only trap into
a small, position-independent trampoline stub that can, say, set
a base register and jump somewhere else. This will break down
if the guest uses too much of the virtual address space.

Yet another, since you control the hardware, is to have a
separate "guest" hardware TLB and an execution mode that uses
it, but that adds complexity to the hardware.

Another option is to locate the hypervisor somewhere random in
the virtual address space that is unlikely to conflict with a
guest and simply declare it off-limits by convention, but guests
don't necessarily need to obey convention.

None of these are particularly great options.

>The rate of guest TLB misses could be reduced by giving it a bigger TLB
>than on the actual hardware, say, 1024x 4-way, as this part is
>"basically invisible" to the guest OS (apart from reducing the number of
>TLB misses).

>>> Or, stated another way, the entire "physical address" space for the
>>> guest would itself be a virtual memory space running in user-mode.
>>
>> Sure, but this isn't about the physical address space; it's
>> about management of the virtual address space.
>
>The guest's physical space would be the virtual address space for the host.

Cool. So what provides the guest's virtual address space?

>Hardware page walking and plain page tables are overly limiting and
>inflexible though.
>
>[snip]

I see no evidence for that, and plenty of disconfirming
evidence.

- Dan C.

Scott Lurndal

unread,
May 24, 2023, 2:33:43 PM5/24/23
to
BGB <cr8...@gmail.com> writes:
>On 5/24/2023 6:33 AM, Dan Cross wrote:
>> In article <u4jnvk$2nsop$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:

>>
>>> But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
>>> along reasonably well with software-managed TLB.

Having much experience with MIPS (both at SGI and Cavium) I would
dispute that characterization.

>>
>> In a very different time, with very different demands on the
>> architecture.
>>
>
>Depends on what one wants.
>
>
>I am mostly imagining an architecture for embedded-systems style
>use-cases (but, more DSP-like than microcontroller-like).

So, computer games like Doom are not very good benchmark choices.

>
>Say, something that does real-time audio/video processing and can run
>neural nets.

>So, for example, the design is an in-order VLIW, since it seems like
>optimizing for OoO will become less attractive once Moore's Law ends
>(say, if one wants more performance in less die area and less watts,
>rather than maximum performance but throwing lots of die area and watts
>at it).

I suspect that parallelism is the answer to the purported end of
Moore's law. Note that the largest AMD supercomputer now has over
8 million cores.


>
>As for virtual-address spaces having the same addresses:
> You can use ASIDs.

Actually, you need both VMIDs (virtual machine ID) and ASIDs (address space ID).

All user-mode applications running under all virtual machines may be using identical
virtual addresses. Which means you need to tag the TLB entries with both VMID
and ASID (now you're up to 32 bits of tag if both are 16-bits).

And even 16-bits of ASID are insufficient on multiprocessor machines and
the OS needs a mechanism to invalidate all ASIDs and assign new ones
when unassigned processes are subsequently scheduled.

>
>Noting that the 96-bit space is far larger than the ASID space, and it
>is unlikely that the guest will use all of it.

You know what they say about assumptions.

BGB

unread,
May 25, 2023, 2:25:00 AM5/25/23
to
On 5/24/2023 1:12 PM, Dan Cross wrote:
> In article <u4lhjo$31dht$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>> On 5/24/2023 6:33 AM, Dan Cross wrote:
>>>> I think a lot of this is making a big fuss over nothing, FWIW.
>>>
>>> You can think that all you want, but sadly, that doesn't mean
>>> that these aren't actual problems for real-world systems.
>>>
>>>> But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
>>>> along reasonably well with software-managed TLB.
>>>
>>> In a very different time, with very different demands on the
>>> architecture.
>>
>> Depends on what one wants.
>>
>> I am mostly imagining an architecture for embedded-systems style
>> use-cases (but, more DSP-like than microcontroller-like).
>
> This perhaps explains why you seem to be discounting the use
> cases others are telling your are important in other application
> domains.
>

Possibly, I am not trying to design a CPU for desktop PCs or servers...


Granted, I had considered trying to use it for a CNC controller, but
this use-case is served reasonably well with something like an ARM-based
microcontroller (and having a full OS, or virtual memory, on a CNC
controller actually makes it worse).

This is where ASIDs come in:
They allow multiple address spaces to coexist in the TLB at the same
time, while being mutually invisible to each other.

So, say:
0123_456789AB with ASID=1234
And:
0123_456789AB with ASID=5678

Can both exist in the TLB at the same time.


This is provided one can give each thing its own ASID, which is
potentially limiting in that only 65536 ASIDs can be in use at a time.

As soon as one changes the value in the TTB(63:48), then whatever was in
the TLB before (that belongs to the other address space) is now ignored.

So, when one switches to the guest OS, they can switch TTB over to the
"guest page table" (possibly actually just a virtual TLB), with its own
ASID. When control moves back to the host, the host loads TTB with its
host page-table.
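
In C terms, the hit condition amounts to something like (made-up type
names, same idea):

  #include <stdint.h>

  typedef struct { uint64_t vpage; uint16_t asid; } tlb_tag;

  static int tag_hits(tlb_tag e, uint64_t vpage, uint16_t ttb_asid)
  {
      /* Both the VA page and the ASID must match the current TTB. */
      return e.vpage == vpage && e.asid == ttb_asid;
  }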


I can note that I am already using a mechanism like this to implement
system calls:
User process triggers a System Call;
SYSCALL ISR performs a task switch to the SYSCALL handler task;
Handler does its thing;
It invokes the SYSCALL ISR again, which transfers control back to the
caller task.

In this case, the user program can run in user-mode, whereas the syscall
handler task runs in supervisor mode.

It is not handled directly by the ISR, mostly because the ISRs run in a
special mode which has the MMU disabled and which can't handle interrupts
(and a fault here will cause the CPU to lock-up until a RESET signal is
received, such as from pressing an external reset button). It is
possible a flag could be added to auto-reboot the CPU though.



Translating things in 96-bit space is another option, but fully
generalizing this would likely require adding an additional translation
layer. But, this could potentially be used to sidestep the "only 64K
unique ASIDs" limitation.

Then one could have effectively a space of up 2^64 possible 48-bit
address spaces...



For now though, the number of PIDs+threads in my use-cases is small enough
that the 64K ASID limit isn't too much of an issue.

As-is, I would run out of both RAM and pagefile space well before I run
out of ASIDs, if I were using them this way.

Granted, for most normal tasks, I am currently running them in a shared
address space, and instead the idea is that ACL checks would be used to
keep one process from stomping on another process's memory (which
ironically, effectively creates sub-rings within User Mode).


> There are several answers here, btw; the obvious one is trap and
> emulate the entirety of the guest's access to this region of the
> virtual address space, but that's a) complex, and b) expensive.
>

And, unnecessary...


> Another is to make the hypervisor relocable, and only trap into
> a small, position-independent trampoline stub that can, say, set
> a base register and jump somewhere else. This will break down
> if the guest uses too much of the virtual address space.
>
> Yet another, since you control the hardware, is to have a
> separate "guest" hardware TLB and an execution mode that uses
> it, but that adds complexity to the hardware.
>

And, is also unnecessary, given one can have multiple address spaces
present in the same TLB at the same time without them conflicting with
each other (provided each has a different ASID).


> Another option is to locate the hypervisor somewhere random in
> the virtual address space that is unlikely to conflict with a
> guest and simply declare it off-limits by convention, but guests
> don't necessarily need to obey convention.
>
> None of these are particularly great options.
>

The latter is also possible.

Within the high 48 bits, there is plenty of space...

Theoretically, one could generate a good 48-bit random number and have
space that is shared between the host and guest, if needed...


>> The rate of guest TLB misses could be reduced by giving it a bigger TLB
>> than on the actual hardware, say, 1024x 4-way, as this part is
>> "basically invisible" to the guest OS (apart from reducing the number of
>> TLB misses).
>
>>>> Or, stated another way, the entire "physical address" space for the
>>>> guest would itself be a virtual memory space running in user-mode.
>>>
>>> Sure, but this isn't about the physical address space; it's
>>> about management of the virtual address space.
>>
>> The guest's physical space would be the virtual address space for the host.
>
> Cool. So what provide's the guests virtual address space?
>

Giving it its own ASIDs...

Probably using a Guest -> Host ASID remapping table.
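
A minimal sketch of such a table, assuming the 4xxx..7xxx host range
from earlier (names and policy are made up; guest ASID 0 is reserved
here as the free-slot marker):

  #include <stdint.h>

  #define GBASE  0x4000
  #define GSLOTS 0x4000

  static uint16_t slot_owner[GSLOTS];  /* slot -> guest ASID, 0 = free */

  uint16_t host_asid_for(uint16_t gasid)
  {
      uint32_t h = gasid % GSLOTS, i = h;
      do {
          if (slot_owner[i] == gasid)   /* already mapped */
              return GBASE + i;
          if (slot_owner[i] == 0) {     /* claim a free slot */
              slot_owner[i] = gasid;
              return GBASE + i;
          }
          i = (i + 1) % GSLOTS;         /* linear probe */
      } while (i != h);
      /* Table full: a real implementation would recycle a slot and
         flush that slot's TLB entries before reuse. */
      return 0;
  }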



>> Hardware page walking and plain page tables are overly limiting and
>> inflexible though.
>>
>> [snip]
>
> I see no evidence for that, and plenty of disconfirming
> evidence.
>

How many hardware page-walker implementations:
Support layouts other than an N-level page table?
Support adding per-thread access permissions to pages within a single
address space? (Say, similar to access permissions in a Unix-style
filesystem, or file-access permissions in NTFS?)
...

Or, if one were going to do so (without TLB Miss and ACL Miss
interrupts), how would they do so?...


Or, say, if you wanted to fake 32-bit x86 segmentation on top of the
MMU, how would one do so?
...


If you do it in hardware, every possibility also needs to be supported
in hardware, so you either have fewer possibilities, or the hardware
becomes needlessly complex.

It is like, the whole x86 TSS thing...

They handled this mechanism in hardware, but it doesn't need hardware...

One could instead have the ISR go through an arcane ritual to try to get
all of the CPU registers saved to and restored from memory without
(accidentally) changing the value of any of the registers in the process.

Granted, it is a PITA to figure out how to save and restore all of the
registers with no spare registers to use as scratch-pad, but it is
possible...

Originally, SuperH banked out some of the registers, but I got rid of
this as it was cheaper for the FPGA to not have any banked registers...
(albeit slower and more awkward for the ISR handlers).

So, usually the first priority is to get a few of the scratch registers
saved off and freed up so that it can use these to set up the ISR
stack-frame and get all the other registers saved off.

Though, there is a designated area of scratchpad RAM located around
0000_0000C000 .. 0000_0000DFFF (in the physical map) to help with some
of this (also used by the Boot ROM before switching over to external DRAM).


The actual mechanism itself being essentially: Update a few special
registers, change some processor mode flags, and perform a computed
branch relative to a special register.

Unlike either the 8086 or SH-2 mechanism, the ISR mechanism does not
need to access memory.

Ironically, RISC-V went to the other extreme, having effectively 3
copies of the register space (User, Supervisor, Machine). Rather than
just a single register space...


But, this means that a BJX2 core, despite having 64 GPRs (rather than
32), ends up having a smaller register space in practice than RISC-V's
"Privileged ISA" spec (for RV64IMAFD or similar).

...


BGB

unread,
May 25, 2023, 5:22:35 AM5/25/23
to
On 5/24/2023 1:31 PM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 5/24/2023 6:33 AM, Dan Cross wrote:
>>> In article <u4jnvk$2nsop$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>
>>>
>>>> But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
>>>> along reasonably well with software-managed TLB.
>
> Having much experience with MIPS (both at SGI and Cavium) I would
> dispute that characterization.
>

Possibly.

If it were that bad though, why would IA-64 have gone that route?...
Or, say, the people still using the Power ISA?...


>>>
>>> In a very different time, with very different demands on the
>>> architecture.
>>>
>>
>> Depends on what one wants.
>>
>>
>> I am mostly imagining an architecture for embedded-systems style
>> use-cases (but, more DSP-like than microcontroller-like).
>
> So, computer games like Doom are not very good benchmark choices.
>

Doom is sort of notable in that it is a poor fit to the ISA design, but
reflects a lot of "generic" coding practices (along with Quake and ROTT).


But, despite being a poor fit, in 320x200 mode, Doom can pull off around
20fps at 50MHz. Performance is a bit worse if trying to run it in a
640x400 or 800x600 GUI mode though (my attempt at a GUI not doing quite
so well in terms of performance).



If I only test the stuff that works well, this is only part of the picture:
I can run real-time software-rasterized OpenGL on the thing;
It is also "pretty formidable" at running neural nets expressed as SIMD
ops and similar;
...

But, like Software Quake pulls off an epic 2-4 FPS...
As-is, it is faster to run a modified GLQuake port on the thing.


I am generally getting a Dhrystone score of around 75000, which is
seemingly on-par with a lot of "retro" stats (comparable to 90s era
PowerPC machines relative to clock speed).

But, scores get a bit suspect if comparing against scores generated by
GCC or Clang, which seem to give "unreasonably fast" Dhrystone numbers
by default.

Stuff seems "less bad" if I compare against Dhrystone built with MSVC
though.


Not ported any standardized floating-point benchmarks though.

There is a risk they would get caught up on the "atrociously bad" FDIV
performance though (there is, sort-of, an instruction for this, but it
is generally faster to do Newton-Raphson iteration in software).
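
For reference, the software route is essentially the classic
Newton-Raphson reciprocal. A generic C sketch (not BJX2 code; no
special handling of 0/Inf/NaN, and the result is within an ulp or two
rather than exactly IEEE-rounded):

  #include <math.h>

  static double recip_nr(double d)
  {
      int e;
      double m = frexp(fabs(d), &e);        /* |d| = m * 2^e, m in [0.5,1) */
      double x = 48.0/17.0 - (32.0/17.0)*m; /* classic seed, rel err <= 1/17 */
      for (int i = 0; i < 4; i++)           /* error squares each step */
          x = x * (2.0 - m * x);
      x = ldexp(x, -e);                     /* undo the scaling */
      return d < 0 ? -x : x;
  }

  double soft_fdiv(double a, double b) { return a * recip_nr(b); }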


Similar sort of issue for integer divide (hardware integer divide was
required for the RISC-V mode to support the 'M' extension's
instructions; but not particularly fast).

Originally I had assumed skipping out on integer divide (and only
providing MUL and similar in hardware).

Though, my C compiler will (by default) use a C runtime call in the case
of integer divide or similar (with the integer divide being handled in
software).

But, neither is quite bad enough to fall into "boat anchor" territory.


>>
>> Say, something that does real-time audio/video processing and can run
>> neural nets.
>
>> So, for example, the design is an in-order VLIW, since it seems like
>> optimizing for OoO will become less attractive once Moore's Law ends
>> (say, if one wants more performance in less die area and less watts,
>> rather than maximum performance but throwing lots of die area and watts
>> at it).
>
> I suspect that parallelism is the answer to the purported end of
> Moore's law. Note that the largest AMD supercomuputer now has over
> 8 million cores.
>

There will be a limit to how many cores one can fit into a die with a
given power budget and a given micro-architecture.

x86 and OoO in general would likely become unfavorable:
x86 needs OoO to not perform like crap;
OoO needs a lot of die-space and power.


A simple RISC could fare a little better, but in-order superscalar is
fairly limited.

A VLIW can push a little closer to OoO performance, while still having
the "cheapness" of an in-order design. Cost being that one needs a more
complicated compiler, and that the compiler needs to be aware of the
pipeline width and behavior (and which combinations of features are
allowed on a given CPU core).


>
>>
>> As for virtual-address spaces having the same addresses:
>> You can use ASIDs.
>
> Actually, you need both VMIDs (virutal machine ID) and ASIDs (address space ID).
>
> All user-mode applications running under all virtual machines may be using identical
> virtual addresses. Which means you need to tag the TLB entries with both VMID
> and ASID (now you're up to 32 bits of tag if both are 16-bits).
>
> And even 16-bits of ASID are insufficent on multiprocessor machines and
> the OS needs a mechanism to invalidate all ASIDs and assign new ones
> when unassigned processes are subsequently scheduled.
>

ASIDs could be locally assigned in many cases.

Adding a VMID probably shouldn't be needed if the VMs use an ASID
remapping table or similar.

At least on moderately sized systems, it is likely one would run out of
memory before they run out of ASIDs.


>>
>> Noting that the 96-bit space is far larger than the ASID space, and it
>> is unlikely that the guest will use all of it.
>
> You know what they say about assumptions.
>

At least in the near-term, there is unlikely to be either enough RAM or
HDD space to make full use of such an address space.


Scott Lurndal

unread,
May 25, 2023, 9:26:00 AM5/25/23
to
BGB <cr8...@gmail.com> writes:
>On 5/24/2023 1:31 PM, Scott Lurndal wrote:
>> BGB <cr8...@gmail.com> writes:
>>> On 5/24/2023 6:33 AM, Dan Cross wrote:
>>>> In article <u4jnvk$2nsop$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>
>>>>
>>>>> But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
>>>>> along reasonably well with software-managed TLB.
>>
>> Having much experience with MIPS (both at SGI and Cavium) I would
>> dispute that characterization.
>>
>
>Possibly.
>
>If it were that bad though, why would IA-64 have gone that route?...
>Or, say, the people still using the Power ISA?...

IA-64 is obsolete. And it was designed almost three decades ago.

Times change.


>
>Doom is sort of notable in that it is a poor fit to the ISA design, but
>reflects a lot of "generic" coding practices (along with Quake and ROTT).

Does it?

>> I suspect that parallelism is the answer to the purported end of
>> Moore's law. Note that the largest AMD supercomuputer now has over
>> 8 million cores.
>>
>
>There will be a limit to how many cores one can fit into a die with a
>given power budget and a given micro-architecture.
>
>x86 and OoO in general would likely become unfavorable:
> x86 needs OoO to not perform like crap;
> OoO needs a lot of die-space and power.

Does it?


>>>
>>> Noting that the 96-bit space is far larger than the ASID space, and it
>>> is unlikely that the guest will use all of it.
>>
>> You know what they say about assumptions.
>>
>
>At least in the near-term, there is unlikely to be either enough RAM or
>HDD space to make full use of such an address space.

If you're making a microcontroller, then you don't need all the
fancy features. If you're designing a general purpose processor,
current processors are designed to support 52-bits VA and PA, and with CXL-Memory,
the need for full 64 bits is only a few years away.

Dan Cross

unread,
May 25, 2023, 3:09:58 PM5/25/23
to
This doesn't really solve the problem, but just moves the
goal-posts (because now, of course, you're sharing the ASID
space with the guest in the exact same way that you are sharing
the virtual address space). Now, you have to trap every guest
reference to an ASID and adjust it, pushing significant
complexity into the VMM, which was the original argument for why
a soft-TLB was "better."

- Dan C.

BGB

unread,
May 25, 2023, 8:12:22 PM5/25/23
to
You can adjust them in the handlers at basically the same time as when
doing the additional level of address translation; no additional
trapping needed.

The point is not "well, it requires some additional code", but rather
that it does not require additional hardware support for each feature
that is added.

Most would not likely consider needing to use a few additional lookup
tables or similar to be a significant issue.


So, as the guest sees it, it has one set of addresses, one set of ASIDs,
etc, and for the host, different addresses and ASIDs.

No big issue so long as one doesn't run out of ASIDs.


But, as noted, I am using 16-bit ASIDs here.
SuperH had used 8-bit ASIDs;
Some traditional RISCs had used 5-bit ASIDs.
A 16-bit space is going to last a lot longer than 5 or 8 bits.



If the mechanism were a little more flexible, one could also emulate
8086 real-mode addressing (or 286/386 style segmented addressing) via
the TLB, but currently this is not supported (could be added if a strong
use-case came up though). Would likely be added if I were trying to do
something along similar lines to DOSBox.

This would mostly require adding a mode where a low-order bias could be
added to the page during translation.

Otherwise, it would be more expensive to emulate segmented addressing in
cases where the segment base is not a multiple of the page size.


The main difficult/annoying part would be needing to come up with some
way to efficiently emulate x86 style flags / condition codes (my ISA
doesn't currently have any support for ALU flags or condition codes; and
they are expensive to emulate via bit twiddling when a significant part
of the ISA may potentially update the status flags).

Though, this would still be assuming some sort of dynamic
trace-translation mechanism.

Well, and then the GDT and LDT handling can basically be glued onto the
normal page-table handling code. Though, supporting "big real mode"
would make things a little more fiddly (since the state of the segment
is tied to the segment register itself, rather than the value held by
the segment register; and one wouldn't want to do a TLB flush due to
reloading a segment register or similar).

...


Dan Cross

unread,
May 25, 2023, 8:45:50 PM5/25/23
to
In article <u4oth2$3nc4d$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>On 5/25/2023 2:08 PM, Dan Cross wrote:
>[snip]
>You can adjust them in the handlers at basically the same time as when
>doing the additional level of address translation; no additional
>trapping needed.
>
>The point is not "well, it requires some additional code", but rather
>that it does not require additional hardware support for each feature
>that is added.
>
>Most would not likely consider needing to use a few additional lookup
>tables or similar to be a significant issue.

What if I told you that no modern architectures use soft-TLBs
_because_ "most" serious users found out through real world
experience that "needing to use a few additional lookup tables"
actually _is_ a "significant issue"?

>So, as the guest sees it, it has one set of addresses, one set of ASIDs,
>etc, and for the host, different addresses and ASIDs.
>
>No big issue so long as one doesn't run out of ASIDs.

Yeah, that "as long as..." is doing a lot of work for you there.

- Dan C.

BGB

unread,
May 26, 2023, 5:44:17 AM5/26/23
to
On 5/25/2023 8:24 AM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 5/24/2023 1:31 PM, Scott Lurndal wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> On 5/24/2023 6:33 AM, Dan Cross wrote:
>>>>> In article <u4jnvk$2nsop$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>>>
>>>>>
>>>>>> But, in any case, SuperH (along with PA-RISC, MIPS, SPARC, etc) got
>>>>>> along reasonably well with software-managed TLB.
>>>
>>> Having much experience with MIPS (both at SGI and Cavium) I would
>>> dispute that characterization.
>>>
>>
>> Possibly.
>>
>> If it were that bad though, why would IA-64 have gone that route?...
>> Or, say, the people still using the Power ISA?...
>
> IA-64 is obsolete. And it was designed almost three decades ago.
>
> Times change.
>

It is still newer than x86 and ARM...


But, yeah, IA-64's high-point (in terms of popularity) was back when I
was in high-school, which was sort of a while ago now.

The world ended up getting x86-64 instead.


However, the fundamentals of computing aren't really much different now
than they were 20 or 30 years ago.


The OS's we use now aren't *that* much different than I was using in
high-school, only now the CPUs are a little faster and we have a lot
more RAM...

At the time, IIRC, I was mostly dual-booting Win2K and Mandrake Linux;
WinXP had come out, but I didn't switch over until later (then got an
Athlon64 PC and switched to WinXP X64... and had "fun times with system
stability" for the next few years...).


Most programs are at least "sane" regarding memory use, apart from
Firefox always wanting to grow to consume all available RAM if given
enough time.



Ironically, if a HW page-walker were used, and the page-tables weren't
already in the L2 cache, one would possibly still be looking at around
400-600 clock cycles just for 3 memory accesses...

Meanwhile, the special scratch RAM for the ISRs (at 0000C000..0000DFFF)
has the property that it is in Block-RAM so it never has an L2 miss (can
slightly help speed up the "spill and restore all the GPRs" thing for
the ISRs).


Likely, a "better" option might be a giant RAM-backed TLB; since then
all the access is in adjacent cache lines (so could play better with the
interface between the L2 cache and DRAM).


>
>>
>> Doom is sort of notable in that it is a poor fit to the ISA design, but
>> reflects a lot of "generic" coding practices (along with Quake and ROTT).
>
> Does it?
>

Doom, Quake, and ROTT and similar seem to be good examples of "generic
programming practices".

Also a lot of small/tight loops that my CPU does crappy at...

Also a lot of L2 cache misses (a fair chunk of performance being lost
mostly to needing to wait for the L2 cache; Doom will otherwise run at
the 35 fps limiter if it were in a scenario where the L2 cache always
hits...).


On the current FPGA, I can use a 512K L2 cache, which at least
"partially" compensates for the relatively slow access to the external
RAM chip (has a 16-bit interface, running at 50MHz, with a 5-cycle
minimum CAS latency; and pulling off an "epic" ~ 18MB/sec for memcpy).

The XC7A200T isn't quite big enough for me to have a 1MB L2 cache though.




Bulky highly-unrolled loops tend to work better, but are not really
standard coding practice.



An interesting edge case is with some of my neural net tests, which can
turn into large blocks of hundreds of kB of straight-line SIMD code
(with no real looping whatsoever).

Can generally be speed competitive with an early 2000s laptop at this
task, though does hit the L1 I-Cache pretty hard...

This is likely more of a novelty than anything else though.

Otherwise, the laptop has 30x the clock-speed and 10x the DRAM memcpy
bandwidth (but only around 3x for small L1 local copies).


>>> I suspect that parallelism is the answer to the purported end of
>>> Moore's law. Note that the largest AMD supercomuputer now has over
>>> 8 million cores.
>>>
>>
>> There will be a limit to how many cores one can fit into a die with a
>> given power budget and a given micro-architecture.
>>
>> x86 and OoO in general would likely become unfavorable:
>> x86 needs OoO to not perform like crap;
>> OoO needs a lot of die-space and power.
>
> Does it?
>

Are you claiming that x86 on an in-order CPU wouldn't suck?...
They made the jump to OoO pretty much before everyone else (eg, "Pentium
Pro");
with the 486 and early Pentium as the last of the in-order chips (apart
from a short run with early versions of Atom).

Meanwhile, some other architectures (such as ARM) hold up much better
with in-order CPUs (such as the seemingly ubiquitous Cortex-A53 used in
most of the cell-phones I have had in recent years).


>
>>>>
>>>> Noting that the 96-bit space is far larger than the ASID space, and it
>>>> is unlikely that the guest will use all of it.
>>>
>>> You know what they say about assumptions.
>>>
>>
>> At least in the near-term, there is unlikely to be either enough RAM or
>> HDD space to make full use of such an address space.
>
> If you're making a microcontroller, then you don't need all the
> fancy features. If you designing a general purpose processor,
> current processors are designed to support 52-bits VA and PA, and with CXL-Memory,
> the need for full 64 bits is only a few years away.
>

52 or 64 bit VAs are still pretty massive overkill at present.

Like, most PCs have maybe 32GB or 64GB of RAM at present.

A cellphone has maybe 2GB or 4GB.


48-bit is plenty, we are nowhere close to the 256TB mark yet...

48-bit also leaves some bits free for dynamic type tags and similar,
which is a much more useful feature at present (also for encoding the
processor's instruction-set mode in function pointers and link-register
pointers, to allow for function pointers between instruction-set modes).

There was a considered feature to allow extending 64-bit pointers to 60
address bits (with only 4 bits for type-tag bits), but haven't done this
(not useful at present).


As noted, the full "extended address space" in BJX2 is 96-bit, but this
would be stupid levels of overkill for any normal application (and I had
started work on an ABI that used 128-bit pointers, but shelved it for
now for sake of it being massive overkill, and at present, not really
worth the effort needed to debug it).


Scott Lurndal

unread,
May 26, 2023, 10:18:58 AM5/26/23
to
BGB <cr8...@gmail.com> writes:
>On 5/25/2023 2:08 PM, Dan Cross wrote:

>> This doesn't really solve the problem, but just moves the
>> goal-posts (because now, of course, you're sharing the ASID
>> space with the guest in the exact same way that you are sharing
>> the virtual address space). Now, you have to trap every guest
>> reference to an ASID and adjust it, pushing significant
>> complexity into the VMM, which was the original argument for why
>> a soft-TLB was "better."
>>
>
>You can adjust them in the handlers at basically the same time as when
>doing the additional level of address translation; no additional
>trapping needed.

How many hypervisors have you written?

Scott Lurndal

unread,
May 26, 2023, 10:36:19 AM5/26/23
to
BGB <cr8...@gmail.com> writes:
>On 5/25/2023 8:24 AM, Scott Lurndal wrote:

>> IA-64 is obsolete. And it was designed almost three decades ago.
>>
>> Times change.
>>
>
>It is still newer than x86 and ARM...

ARMv8 is about 10 years old - a completely _new_ architecture.

AMD64 is about 20 years old, and post-dates IA64. And both AMD
and the derived x86_64 have made constant improvements to the
architecture over the last 20 years.



>
>
>But, yeah, IA-64's high-point (in terms of popularity) was back when I

Which was _never_ very high. A few very large-scale HP systems and
a couple of SGI machines.

I used to have P7 yellow books (which was the prior design for IA64
before Intel discarded P7 and joined the Merced project with HP that
resulted in itanic). P7 was to be the follow-on to P6 (Pentium Pro).

>>
>>>
>>> Doom is sort of notable in that it is a poor fit to the ISA design, but
>>> reflects a lot of "generic" coding practices (along with Quake and ROTT).
>>
>> Does it?
>>

> Doom, Quake, and ROTT and similar seem to be good examples of "generic
> programming practices".

Really? How many programs have you looked at? Databases? Web Servers?
Financial Software? ERP software? Productivity software? Compilers?
Interpreters? Operating Systems? Hypervisors?

Games aren't representative of
anything other than games. The vast majority of software doesn't use
graphics or have real-time constraints, for example.

> >> x86 and OoO in general would likely become unfavorable:
> >> x86 needs OoO to not perform like crap;
> >> OoO needs a lot of die-space and power.
> >
> > Does it?
> >
>
> Are you claiming that x86 on an in-order CPU wouldn't suck?...

Depends on the application. Smaller ARMv7 cores are in-order.

And "OoO" needs a lot of die space is disputable. The OoO
Neoverse N2 cores are quite small, for example.

> > If you're making a microcontroller, then you don't need all the
> > fancy features. If you designing a general purpose processor,
> > current processors are designed to support 52-bits VA and PA, and with CXL-Memory,
> > the need for full 64 bits is only a few years away.
> >
>
> 52 or 64 bit VAs are still pretty massive overkill at present.
>

No, they're actively being _used_ at present. Today. In production.


> Like, most PCs have maybe 32GB or 64GB of RAM at present.

Like, most real computers are _not_ PC's. They're servers
in the cloud, or enterprise systems running Oracle
or the various ERP packages.

BGB

unread,
May 26, 2023, 2:41:06 PM5/26/23
to
I have mostly written emulators.

I had figured that a hypervisor would be mostly like an emulator, just
with the instructions running "mostly" natively, and leveraging the
underlying memory map (rather than implementing all of the Load/Store
address translation in software).

Granted, the architecture for a hypervisor on BJX2 or similar would
likely be very different from one on x86, likely in some ways with more
in common with a more traditional emulator...

For running something like x86, there would likely be a "translated
trace cache" and similar, more like in my JIT-based emulator designs.

...

BGB

unread,
May 26, 2023, 2:51:20 PM5/26/23
to
On 5/26/2023 9:36 AM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 5/25/2023 8:24 AM, Scott Lurndal wrote:
>
>>> IA-64 is obsolete. And it was designed almost three decades ago.
>>>
>>> Times change.
>>>
>>
>> It is still newer than x86 and ARM...
>
> ARMv8 is about 10 years old - a completely _new_ architecture.
>
> AMD64 is about 20 years old, and post-dates IA64. And both AMD
> and the derived x86_64 have made constant improvements to the
> architecture over the last 20 years.
>
>
>
>>
>>
>> But, yeah, IA-64's high-point (in terms of popularity) was back when I
>
> Which was _never_ very high. A few very large-scale HP systems and
> a couple of SGI machines.
>
> I used to have P7 yellow books (which was the prior design for IA64
> before Intel discarded P7 and joined the Merced project with HP that
> resulted in itanic). P7 was to be the follow-on to P6 (Pentium Pro).
>
>>>
>>>>
>>>> Doom is sort of notable in that it is a poor fit to the ISA design, but
>>>> reflects a lot of "generic" coding practices (along with Quake and ROTT).
>>>
>>> Does it?
>>>
>
>> Doom, Quake, and ROTT and similar seem to be good examples of "generic
>> programming practices".
>
> Really? How many programs have you looked at? Databases? Web Servers?
> Financial Software? ERP software? Productivity software? Compilers?
> Interpreters? Operating Systems? Hypervisors?
>

Much of this stuff falls outside of the target domain of BJX2.
It is not intended for server or business computing...


I haven't really ported BGBCC to BJX2, but this is partly because BGBCC
is still a bit heavyweight in terms of RAM usage.

Ideally, would want a C compiler that could compile stuff in under
around 10MB of RAM.


> Games aren't representative of
> anything other than games. The vast majority of software doesn't use
> graphics or have real-time constraints, for example.
>

Graphics aside, both use loops and mostly scalar code.

As opposed to more SIMD heavy workloads.


>>>> x86 and OoO in general would likely become unfavorable:
>>>> x86 needs OoO to not perform like crap;
>>>> OoO needs a lot of die-space and power.
>>>
>>> Does it?
>>>
>>
>> Are you claiming that x86 on an in-order CPU wouldn't suck?...
>
> Depends on the application. Smaller ARMv7 cores are in-order.
>
> And "OoO" needs a lot of die space is disputable. The OoO
> Neoverse N2 cores are quite small, for example.
>
>>> If you're making a microcontroller, then you don't need all the
>>> fancy features. If you designing a general purpose processor,
>>> current processors are designed to support 52-bits VA and PA, and with CXL-Memory,
>>> the need for full 64 bits is only a few years away.
>>>
>>
>> 52 or 64 bit VAs are still pretty massive overkill at present.
>>
>
> No, they're actively being _used_ at present. Today. In production.
>
>
>> Like, most PCs have maybe 32GB or 64GB of RAM at present.
>
> Like, most real computers are _not_ PC's. They're servers
> in the cloud, or enterprise systems running Oracle
> or the various ERP packages.
>

I don't care about these, my focus is mostly on things on the smaller
end, but still bigger than microcontrollers.

Say, more like a processor one would stick in a small autonomous robot...


In this space, the main "competition" would be more things like the RasPi...


Scott Lurndal

unread,
May 26, 2023, 3:35:19 PM5/26/23
to
BGB <cr8...@gmail.com> writes:
>On 5/26/2023 9:17 AM, Scott Lurndal wrote:
>> BGB <cr8...@gmail.com> writes:
>>> On 5/25/2023 2:08 PM, Dan Cross wrote:
>>
>>>> This doesn't really solve the problem, but just moves the
>>>> goal-posts (because now, of course, you're sharing the ASID
>>>> space with the guest in the exact same way that you are sharing
>>>> the virtual address space). Now, you have to trap every guest
>>>> reference to an ASID and adjust it, pushing significant
>>>> complexity into the VMM, which was the original argument for why
>>>> a soft-TLB was "better."
>>>>
>>>
>>> You can adjust them in the handlers at basically the same time as
>>> when doing the additional level of address translation; no
>>> additional trapping needed.
>>
>> How many hypervisors have you written?
>>
>
>I have mostly written emulators.

While VMware required the ability to emulate parts of the instruction set,
as did XEN, prior to the advent of nested page tables, modern
hypervisors have very little need for emulation thanks to the
capabilities that AMD/ARM/Intel have added to the ISA to trap
various guest activities.

A hypervisor is more about resource management. Depending
on type (bare metal or integrated with a kernel like KVM),
the set of resources varies but generally includes I/O,
memory and guest scheduling (complicated if processing
resources are overcommited and you have multiprocessor guests)
and supporting hot plug resource APIs (e.g. via ACPI in the guest).

BGB-Alt

unread,
May 26, 2023, 6:27:15 PM5/26/23
to
On 5/26/2023 2:33 PM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 5/26/2023 9:17 AM, Scott Lurndal wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> On 5/25/2023 2:08 PM, Dan Cross wrote:
>>>
>>>>> This doesn't really solve the problem, but just moves the
>>>>> goal-posts (because now, of course, you're sharing the ASID
>>>>> space with the guest in the exact same way that you are sharing
>>>>> the virtual address space). Now, you have to trap every guest
>>>>> reference to an ASID and adjust it, pushing significant
>>>>> complexity into the VMM, which was the original argument for why
>>>>> a soft-TLB was "better."
>>>>>
>>>>
>>>> You can adjust them in the handlers at basically the same time as
>>>> when doing the additional level of address translation; no
>>>> additional trapping needed.
>>>
>>> How many hypervisors have you written?
>>>
>>
>> I have mostly written emulators.
>
> While VMware required the ability to emulate parts of the instruction set,
> as did XEN, prior to the advent of nested page tables, modern
> hypervisors have very little need for emulation thanks to the
> capabilities that AMD/ARM/Intel have added to the ISA to trap
> various guest activities.
>

So, unlike the sort of x86->x86 dynamic-translation / JIT thing done
in, say, DOSBox?...

This is admittedly a little closer to what I would have assumed, apart
from maybe being able to run a lot of the usermode stuff directly, and
then only switching to JIT in cases where supervisor mode was being used
(say, JIT translating the ISR handlers).

Well, except when the ISAs don't match, in which case, it would need to
be full JIT. Call threading would be slow though, so preferably avoided.


> A hypervisor is more about resource management. Depending
> on type (bare metal or integrated with a kernel like KVM),
> the set of resource vary but generally include I/O,
> memory and guest scheduling (complicated if processing
> resources are overcommited and you have multiprocessor guests)
> and supporting hot plug resource APIs (e.g. via ACPI in the guest).

OK.

Sorta figured all of the hardware interfaces would have been emulated in
software as well.


Have noted before, for example, that if one tries to run Win98 in
VirtualBox, it tends to blue-screen after a few minutes. Something like
QEMU holds up a little better, but "sucks a lot worse".

Had used VMware in the past, but it doesn't seem to work anymore (both
VMware and VirtualPC giving an error message and refusing to work).

...

Scott Lurndal

unread,
May 27, 2023, 10:43:52 AM5/27/23
to
Completely unlike. Apples and magma.

>This is admittedly a little closer to what I would have assumed, apart
>from maybe being able to run a lot of the usermode stuff directly, and
>then only switching to JIT in cases where supervisor mode was being used
>(say, JIT translating the ISR handlers).

The processor still runs in "ring 3" from the perspective of the
guest kernel. There's no emulation or jit, the processor just runs
the guest as if it were actually running in ring 3, but the
hypervisor has arranged for various activities to be trapped
to the hypervisor by the hardware when the guest executes them.

Today that arrangement is in the hardware via controls that the
processor allows the hypervisor to use. In Intel and AMD, there
is a vmenter instruction used to tell the processor that it is
running guest code. In the olden days, before hardware virtualization
support (leaving aside the IBM 370 which had it in the 1970s),
various mechanisms using shadow page tables, read-only or absent
pages, and run-time modification of the guest kernel OS code by
the hypervisor were used. They were troublesome, prone to error
and costly to performance, and needed intimate knowledge of the
guest OS internals.

The list of traps is large, and there are some differences
between the mechanisms used by AMD's Secure Virtual Machine and
Intel's VT-X, or ARM's integrated virtualization (where the
processor supports four rings: machine (firmware mainly, like SMM in x86),
hypervisor, OS, and application). Ring 2 is unused when no virtualization
is present, although Linux mostly runs in ring 2 whether or not
virtualization is being used to support KVM.

>
>Well, except when the ISAs don't match, in which case, it would need to
>be full JIT.

That's not part of the normal definition of this type of virtualization.


>Had used VMware in the past, but it doesn't seem to work anymore (both
>VMware and VirtualPC giving an error message and refusing to work).

I primarily use KVM these days, as it is built into Linux. I've never
used Windows for anything (other than TurboTax once a year on a cheap
laptop).

muta...@gmail.com

unread,
Jun 23, 2023, 6:35:57 AM6/23/23
to
On Tuesday, May 23, 2023 at 11:11:51 PM UTC+8, muta...@gmail.com wrote:

> So now PDOS can be built with professional Microsoft tools
> instead of having to take your chances with jackasses on
> the internet.

That was in reference to the 32-bit portion of PDOS/386,
which is almost everything.

And just now the 16-bit bootloader can be built with
Visual C++ 1.52, which I think can be obtained via
the MSDN (I instead got mine from ebay a few weeks ago),
and runs under modern Windows (although you get an out
of memory error if you don't specify "-f"). But I needed to develop
my own crude exe2bin since that was only ever supplied as
a DOS executable (Watcom has one too, but the version I
tried crashed, and I don't want to be dependent on copyrighted
freeware anyway).

The bootsector was already able to be built with masm.

BFN. Paul.