Proposal for virtualization without H mode


Paolo Bonzini

Aug 30, 2016, 7:01:40 AM
to RISC-V ISA Dev, sag...@eecs.berkeley.edu
The separation of S and H levels in RISC-V is repeating how PPC first,
and ARM later introduced virtualization extensions. As mentioned in the
privileged interface specification, this model suits Type-1 hypervisors
very well, but it can be a bit messy for Type-2 hypervisors such as KVM.

For Type-2 hypervisors, the same kernel (e.g., Linux) would have to be
able to run both in S and H mode. In one case it would use s* registers,
in the other it would use h* registers. It is possible to avoid this
through a small stub that sets up the h* CSRs so as to be able to use s*
registers for regular kernel operation. For example, all interrupts
are usually delegated. This works, but it makes the world switch code
very slow, as we've seen with KVM on ARMv7, because it needs to read
and write all the s* CSRs twice (once on entry, once on exit).

Even for Type-1 hypervisors, the separate H-mode makes it very slow to
run a hypervisor "nested" into another outer hypervisor. Each privileged
operation of the inner hypervisor is an H-mode instruction that requires
trap-and-emulate, and this is expensive. While nested virtualization
sounds weird it is actually very useful, at least as a debugging and
development tool.

None of these issues exist in the s390 SIE (Start Interpreted Execution)
mechanisms or in x86 virtualization extensions by either Intel or AMD.
So I am proposing a different mechanism for hypervisors that is based on
the s390 and x86 models. This model is strictly more powerful than the
ARM/PPC-like idea of an H-mode; Type-1 hypervisors can be easily written
in this model as shown by e.g. Xen on x86 hardware or z/VM on s390.

In this model, "hypervisor active" mode (HA) is orthogonal to whether the
machine is in S- or U-mode. There are effectively five processor modes:
M, S, U, HA/S (supervisor under hypervisor), HA/U (user under hypervisor).
HA is part of hstatus and mstatus, and enables additional interception
and memory translation mechanisms. The processor's h* CSRs are used by
S-mode with no active hypervisor, or can be virtualized by the hypervisor
when used in HA/S-mode. The same is true for the various H* fields in
mstatus, mip and mie.

This model can be implemented in a framework that reasonably fits
the draft privileged interface of RISC-V. The only changes required in
the spec refer to HPP, HRET, HPIE and HIE. While very similar
mechanisms exist in this proposal, they differ slightly so that backwards
compatibility would be awkward to preserve.

I will now present my proposal; if you would like to consider it, I would
suggest removing references to these bits and perhaps, for clarity and
future-proofing, to the H-mode bits in mie (HEIE/HTIE/HSIE) and mip
(HEIP/HTIP/HSIP).

---
1) Changes to mstatus

Because there is no separate H-mode, HPP only requires one bit of
storage. The higher bit of HPP is replaced by the HA bit.

The HA bit can be written in mstatus. It is always 0 in
hstatus, sstatus and ustatus.


2) Changes to the semantics of the h* CSRs

The hypervisor runs in S-mode with HA=0. Therefore:
- the h* CSRs are accessible to S-mode if HA=0, and of course to M-mode;
- accessing the h* CSRs causes a trap in U-mode, and in S-mode if HA=1.

h* registers do not affect operation of the processor if HA=0. In
particular:
- the HIE bit of mstatus behaves as if 1
- the HEIE, HTIE, HSIE bits of mie behave as if 1
- hideleg and hedeleg behave as if all one
- bits of mstatus covering second-level address translation behave as
if second-level address translation is disabled
- any (TBD) mechanism that the hypervisor can use to modify the delta
between mcycle and scycle is ignored

In other words, if HA=0:
- all interrupts are enabled (unless blocked by lower privilege levels
of course) and delegated to S-mode;
- all exceptions are delegated to S-mode;
- page translation is only driven by sptbr.
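
As an illustration only, the rules in this section can be modelled in a few lines of Python. This is a sketch under stated assumptions, not specification text: the `HControls` bundle, its field names, the 3-bit packing of HEIE/HTIE/HSIE and the 64-bit CSR width are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, replace

U, S, M = 0, 1, 3            # RISC-V privilege level encodings
ALL_ONES = (1 << 64) - 1     # assumption: 64-bit CSRs


@dataclass(frozen=True)
class HControls:
    """Hypothetical bundle of the h* state listed in this section."""
    hie: int = 0                # HIE bit of mstatus
    h_mie: int = 0              # HEIE/HTIE/HSIE packed into 3 bits (illustrative layout)
    hideleg: int = 0
    hedeleg: int = 0
    second_level: bool = False  # second-level address translation enabled?
    cycle_delta: int = 0        # stand-in for the TBD mcycle/scycle delta mechanism


def h_csr_access_traps(mode: int, ha: int) -> bool:
    """h* CSRs trap in U-mode, and in S-mode when a hypervisor is active."""
    return mode == U or (mode == S and ha == 1)


def effective_controls(ha: int, raw: HControls) -> HControls:
    """Return the values the processor acts on; raw h* state is ignored if HA=0."""
    if ha == 1:
        return raw
    return replace(raw, hie=1, h_mie=0b111, hideleg=ALL_ONES,
                   hedeleg=ALL_ONES, second_level=False, cycle_delta=0)
```

With HA=0, `effective_controls` returns "everything enabled and delegated, no second-level translation" regardless of what the h* CSRs hold, which is why a plain S-mode kernel is unaffected by them.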


3) New hp* CSRs (hpptbr - hypervisor previous page table base register)

A few selected pieces of state must be switched atomically when
activating/deactivating hypervisor mode. This resembles the
HPIE bit in mstatus, hence the guest values for this state are
stored in CSRs whose name begins with "hp".

Currently this is only the case for sptbr, hence only a new CSR
hpptbr is defined. It stores the guest's sptbr while HA=0, and the
host's sptbr while HA=1.

In the future, this could be extended to include e.g. the enabled/disabled
state for hardware breakpoints and watchpoints (sdbgctl/hpdbgctl).


4) New translation mechanisms

VM[4:3] defines several different translation schemes converting
supervisor physical addresses to hypervisor physical addresses, for
example "no translation" (sptbr only), Hv32, Hv39, Hv48. (The
current limit for supervisor physical addresses is 50 bits).

Second-level address translation is never used if HA=0. M-mode can also use
second-level address translation for loads and stores whenever HA=1, MPRV=1,
MPP=S or U.
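
A small sketch of when second-level translation applies, under the rules just stated. The function name, the `scheme` strings and the `is_fetch` flag are assumptions made for illustration; the actual VM[4:3] encodings are not defined in this proposal.

```python
U, S, M = 0, 1, 3  # RISC-V privilege level encodings


def second_level_applies(mode, ha, scheme, mprv=0, mpp=S, is_fetch=False):
    """Decide whether an access goes through second-level translation.

    scheme is "none" (sptbr only) or a name such as "Hv39"; these strings
    are placeholders for whatever VM[4:3] ends up encoding.
    """
    if ha == 0 or scheme == "none":
        return False            # never used without an active hypervisor
    if mode in (U, S):
        return True             # HA/U and HA/S always use it
    # M-mode: only loads and stores, and only with MPRV=1 and MPP=S or U
    return mode == M and not is_fetch and mprv == 1 and mpp in (U, S)
```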


5) New HRET logic

Similar to accessing h* CSRs, HRET can only be used in S-mode if HA=0,
as well as in M-mode. Executing HRET causes a trap in U-mode, and in
S-mode if HA=1.

HRET performs the following tasks:
- SIE is set to the value of HPIE
- the privilege mode is changed to HPP
- HPIE is set to 1
- HPP is set to 0
- the program counter is set to hepc
- HA is set to 1 **
- sptbr and hpptbr are swapped **

Apart from the two final steps, this is the same logic that is already
in the draft privileged interface specification. Future extensions may
add more hp* CSRs and swap them here at the same time as sptbr/hpptbr.

In order to enter the guest, the hypervisor restores the guest context
into general purpose registers and s* CSRs, except that the guest
sptbr is stored in hpptbr and the guest SIE is stored in HPIE. It then
executes HRET.
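
The HRET steps can be written down as a tiny executable model. This is only a sketch: the `Hart` class, its field names and the `Trap` exception are hypothetical, and the one-bit HPP is assumed to encode 0=U, 1=S.

```python
from dataclasses import dataclass

U, S, M = 0, 1, 3  # RISC-V privilege level encodings


class Trap(Exception):
    """Stand-in for whatever exception an illegal HRET raises."""


@dataclass
class Hart:
    mode: int = S
    ha: int = 0
    sie: int = 0
    hpie: int = 0
    hpp: int = 0      # one bit: 0 = U, 1 = S
    pc: int = 0
    hepc: int = 0
    sptbr: int = 0
    hpptbr: int = 0


def hret(h: Hart) -> None:
    # HRET is legal in M-mode, and in S-mode only when HA=0
    if h.mode == U or (h.mode == S and h.ha == 1):
        raise Trap("HRET not permitted in this mode")
    h.sie = h.hpie
    h.mode = h.hpp
    h.hpie = 1
    h.hpp = 0
    h.pc = h.hepc
    h.ha = 1                                   # enter "hypervisor active" mode
    h.sptbr, h.hpptbr = h.hpptbr, h.sptbr      # guest root in, host root out
```

Guest entry then amounts to loading `hpptbr`, `hpie`, `hpp` and `hepc` with the guest's values and calling `hret`; afterwards `hpptbr` holds the host's page table root, ready for the next trap.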


6) New hypervisor trap logic

Whenever HA=1, interrupts and traps delegated by medeleg/mideleg but
not by hedeleg/hideleg perform the following tasks:

- sptbr and hpptbr are swapped **
- HA is set to 0 **
- trap information is recorded in hcause and hbadaddr
- the program counter is recorded in hepc
- the program counter is set to htvec
- HPIE is set to the value of SIE
- SIE is set to 0
- HPP is set to the previous privilege mode
- the privilege mode is changed to S **

Apart from the three steps marked with **, this is the same logic as in
the draft privileged interface specification. Future extensions may
add more hp* CSRs and swap them here at the same time as sptbr/hpptbr.
The change in the last step is simply because H-mode does not exist.

If necessary, the entry point at htvec saves the guest context from
general purpose registers and s* CSRs, except that the guest sptbr is
saved from hpptbr and the guest SIE is saved from HPIE.
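
For illustration, the routing decision and the trap sequence can be modelled as follows. This is a sketch, not specification text: `GuestHart`, its field names and the helper functions are hypothetical, and the delegation CSRs are treated as plain bit masks indexed by cause number.

```python
from dataclasses import dataclass

U, S, M = 0, 1, 3  # RISC-V privilege level encodings


@dataclass
class GuestHart:
    mode: int = S
    ha: int = 1
    sie: int = 0
    hpie: int = 0
    hpp: int = 0
    pc: int = 0
    hepc: int = 0
    htvec: int = 0
    hcause: int = 0
    hbadaddr: int = 0
    sptbr: int = 0
    hpptbr: int = 0


def routes_to_hypervisor(ha: int, cause: int, mdeleg: int, hdeleg: int) -> bool:
    """Delegated by M (medeleg/mideleg) but not by H (hedeleg/hideleg), with HA=1."""
    bit = 1 << cause
    return ha == 1 and bool(mdeleg & bit) and not (hdeleg & bit)


def hypervisor_trap(h: GuestHart, cause: int, badaddr: int = 0) -> None:
    h.sptbr, h.hpptbr = h.hpptbr, h.sptbr   # host root in, guest root out
    h.ha = 0
    h.hcause, h.hbadaddr = cause, badaddr
    h.hepc = h.pc
    h.pc = h.htvec
    h.hpie = h.sie
    h.sie = 0
    h.hpp = h.mode                          # U=0 or S=1 fits the one-bit HPP
    h.mode = S                              # the hypervisor runs in S-mode
```

After `hypervisor_trap`, the guest's page table root sits in `hpptbr` and its interrupt-enable state in `hpie`, which is where the handler at htvec would pick them up.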


7) Other extensions

This proposal does not constitute a full hypervisor specification.
Most notably, a hypervisor needs to inject virtual interrupts that
_appear_ like external interrupts but are actually software interrupts.
While acknowledging interrupts can be done through the SBI, interrupt
injection must be part of the processor specification.

There are two parts in this. First, the processor must examine SIE,
sie and sip on guest entry (in addition to any other time when SIE/sie/sip
change) and inject the interrupt if appropriate. This is already covered
by the specification.

Second, the processor must support writing SEIP and UEIP into hstatus,
causing an injection on the next HRET. Likewise, the processor should
support writing HEIP/SEIP/UEIP into mstatus, which would cause an interrupt
injection on the next MRET.
---


Thanks,

Paolo

Samuel Falvo II

Aug 30, 2016, 10:53:28 AM
to Paolo Bonzini, RISC-V ISA Dev, sag...@eecs.berkeley.edu
On Tue, Aug 30, 2016 at 4:01 AM, Paolo Bonzini <bon...@gnu.org> wrote:
> None of these issues exist in the s390 SIE (Start Interpreted Execution)

I just read up on some basics of the SIE instruction. I like how it's
effectively a subroutine call which happens to atomically set other
control registers to transparently effect constraints on what the
called code can do.

I suppose one could consider HRET a more constrained equivalent of
SIE, provided the following invariant holds:

; ... hypervisor event handling here ...
HRET
; htvec points here, giving the illusion that HRET "looks like" a
; subroutine call.
; ... more hypervisor event handling here; eventually loops back ...

> So I am proposing a different mechanism for hypervisors that is based on
> the s390 and x86 models. This model is strictly more powerful than the
> ARM/PPC-like idea of an H-mode; Type-1 hypervisors can be easily written
> in this model as shown by e.g. Xen on x86 hardware or z/VM on s390.

Can you go into more detail on this? I don't see how your proposal is
strictly more powerful. It just seems different from my point of
view. Maybe I still don't understand the precise nature of the
problem that's trying to be solved.

--
Samuel A. Falvo II

Paolo Bonzini

Aug 30, 2016, 11:28:56 AM
to Samuel Falvo II, RISC-V ISA Dev, sag...@eecs.berkeley.edu


On 30/08/2016 16:53, Samuel Falvo II wrote:
> On Tue, Aug 30, 2016 at 4:01 AM, Paolo Bonzini <bon...@gnu.org> wrote:
>> None of these issues exist in the s390 SIE (Start Interpreted Execution)
>
> I just read up on some basics of the SIE instruction. I like how it's
> effectively a subroutine call which happens to atomically set other
> control registers to transparently effect constraints on what the
> called code can do.
>
> I suppose one could consider HRET a more constrained equivalent of
> SIE, provided the following invariant holds:
>
> ; ... hypervisor event handling here ...
> HRET
> ; htvec points here, giving the illusion that HRET "looks like" a
> ; subroutine call.
> ; ... more hypervisor event handling here; eventually loops back ...

Yes, this is also how KVM uses VMRESUME on x86. The "HOST_RIP" field is
set to the instruction right after VMRESUME.

It doesn't strictly have to be used that way, but it's pleasant indeed.

>> So I am proposing a different mechanism for hypervisors that is based on
>> the s390 and x86 models. This model is strictly more powerful than the
>> ARM/PPC-like idea of an H-mode; Type-1 hypervisors can be easily written
>> in this model as shown by e.g. Xen on x86 hardware or z/VM on s390.
>
> Can you go into more detail on this? I don't see how your proposal is
> strictly more powerful.

It doesn't have any practical disadvantage for Type-1 hypervisors, and
it is much faster for Type-2. For example on ARMv8 a KVM hypercall
costs ~6000 cycles versus ~400 for Xen; on x86 the cost is the same
(1300 cycles, but the x86 microcode and hypervisor code do a lot
more than on ARM).

For both kinds of hypervisor, in addition, nesting is easier and
higher-performance, because most hypervisor code runs in S-mode and thus
is naturally virtualized by the processor.

Last but not least, it's much easier to screw up the design of a
separate H-mode. On PPC, for nested KVM you end up with substantially
different code that runs the hypervisor in S-mode (and runs the guest
kernel in U-mode).

> It just seems different from my point of
> view. Maybe I still don't understand the precise nature of the
> problem that's trying to be solved.

Making KVM for RISC-V less of a headache than KVM for ARM, :) and
avoiding some of the issues that caused ARM to backtrack with the ARMv8.1
virtualization host extensions (VHE). See the ISCA 2016 paper
http://www.cs.columbia.edu/~cdall/pubs/isca2016-dall.pdf for information
on VHE.

I guess it's possible to design RISC-V hypervisor extensions right. But
in my opinion KVM/ARM has demonstrated that it's a bad idea to ignore
half of the design space for hypervisors when adding virtualization
extensions to a processor.

Paolo

Bharat Bhushan

Aug 30, 2016, 12:50:35 PM
to Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev, sag...@eecs.berkeley.edu
Hi Paolo,
ARMv8 execution modes are
- Hypervisor Stub in EL2
- Hypervisor and Host OS in EL1
- Guest OS in EL1

With ARMv8.1 it changes to:
- Hypervisor and Host OS in EL2
- Guest OS in EL1

Similarly I understand how it works on PowerPC ISA2.6. Can you
describe how it works on x86 to better understand the nested
virtualization problems?

Thanks
-Bharat


Paolo Bonzini

Aug 30, 2016, 2:26:11 PM
to Bharat Bhushan, Samuel Falvo II, RISC-V ISA Dev, sag...@eecs.berkeley.edu


On 30/08/2016 18:50, Bharat Bhushan wrote:

> ARMv8 execution modes are
> - Hypervisor Stub in EL2
> - Hypervisor and Host OS in EL1
> - Guest OS in EL1
>
> With ARMV8.1 it changes to:
> - Hypervisor and Host OS in EL2
> - Guest OS in EL1

Right. However even with v8.1 it's hard to virtualize EL2 because it
uses a very different set of privileged registers.

> Similarly i understand how it works on PowerPC ISA2.6.

Same as ARMv8 IIUC? Still has real mode code for the hypervisor.

> Can you
> describe how it works on X86 to betther understand the nested
> virtualization problems?

Exactly as in my proposal. The four modes are root S, root U, non-root
S, non-root U; root S uses the same privileged registers as S plus other
special instructions to control non-root operation (these correspond to
RISC-V h* CSRs in my proposal). HRET is the same as Intel's
VMLAUNCH/VMRESUME and AMD's VMRUN. To these four, RISC-V would add M.

Intel's SMM sits "on the side" compared to RISC-V's M and ARM's EL3,
but that's not a big deal. There is another thing I'd like in RISC-V's
M mode, but one thing at a time. ;)

Paolo

Andrew Waterman

Aug 30, 2016, 7:17:12 PM
to Paolo Bonzini, RISC-V ISA Dev, Sagar Karandikar
On Tue, Aug 30, 2016 at 4:01 AM, Paolo Bonzini <bon...@gnu.org> wrote:
> The separation of S and H levels in RISC-V is repeating how PPC first,
> and ARM later introduced virtualization extensions. As mentioned in the
> privileged interface specification, this model suits Type-1 hypervisors
> very well, but it can be a bit messy for Type-2 hypervisors such as KVM.
>
> For Type-2 hypervisors, the same kernel (e.g., Linux) would have to be
> able to run both in S and H mode. In one case it would use s* registers,
> in the other it would use h* registers. It is possible to avoid this
> through a small stub that sets up the h* CSRs so as to be able to use s*
> registers for regular kernel operation. For example, all interrupts
> are usually delegated. This works, but it makes the world switch code
> very slow, as we've seen with KVM on ARMv7, because it needs to read
> and write all the s* CSRs twice (once on entry, once on exit).

One thing that confuses me is why swapping the supervisor state is
so expensive. It takes roughly 30 instructions to swap out all the
supervisor CSRs and the PLIC state, none of which should require a
pipeline flush or egregious stalling. Swapping the integer/FP
registers should dominate.

Paolo Bonzini

Aug 31, 2016, 5:02:04 AM
to Andrew Waterman, RISC-V ISA Dev, Sagar Karandikar


On 31/08/2016 01:16, Andrew Waterman wrote:
> On Tue, Aug 30, 2016 at 4:01 AM, Paolo Bonzini <bon...@gnu.org> wrote:
>> The separation of S and H levels in RISC-V is repeating how PPC first,
>> and ARM later introduced virtualization extensions. As mentioned in the
>> privileged interface specification, this model suits Type-1 hypervisors
>> very well, but it can be a bit messy for Type-2 hypervisors such as KVM.
>>
>> For Type-2 hypervisors, the same kernel (e.g., Linux) would have to be
>> able to run both in S and H mode. In one case it would use s* registers,
>> in the other it would use h* registers. It is possible to avoid this
>> through a small stub that sets up the h* CSRs so as to be able to use s*
>> registers for regular kernel operation. For example, all interrupts
>> are usually delegated. This works, but it makes the world switch code
>> very slow, as we've seen with KVM on ARMv7, because it needs to read
>> and write all the s* CSRs twice (once on entry, once on exit).
>
> One thing that confuses me is why swapping the the supervisor state is
> so expensive. It takes roughly 30 instructions to swap out all the
> supervisor CSRs and the PLIC state, none of which should require a
> pipeline flush or egregious stalling. Swapping the integer/FP
> registers should dominate.

On ARM the culprit is mostly the GIC (see the ISCA 2016 article on VHE
-- http://www.cs.columbia.edu/~cdall/pubs/isca2016-dall.pdf); saving its
state takes 75% of the time on the microbenchmark.

I pointed out supervisor state because you do need to swap it twice.
GPRs only have to be swapped once by the stub, because the host kernel's
world switch code can be written to withstand clobbering of most
GPRs. Swapping FPRs can be avoided completely.

It depends on the processor architecture whether swapping the supervisor
CSRs requires pipeline flushes or not. However, HRET and trapping back
to H mode probably do require pipeline flushes, and why do those twice
rather than once?...

H mode and the H mode stub also complicate paging. There are basically
two choices:

- use physical addresses in H mode (as in PPC), which limits or at least
complicates the implementation of the hypervisor;

- keep H mode page tables synchronized with host kernel page tables;
possibly this may require defining two separate page table formats in
the hypervisor specification, one for H mode itself (possibly based on
the S mode format) and one for second-level address translation.

Without H mode, hypervisor page tables _are_ the host kernel page
tables, because the host kernel runs in S mode and uses sptbr as usual.

I just would like to avoid ending up with Linux in H mode five years
down the road (as in ARM VHE), because that is good for performance but
more complex and also even worse for nested virtualization support.
RISC-V has the advantage of starting from a clean slate, and the
s390/x86 approach is IMHO clearly superior.

Thanks,

Paolo

Andrew Waterman

Feb 23, 2017, 5:12:15 PM
to Paolo Bonzini, RISC-V ISA Dev, Sagar Karandikar
Hi Paolo,

I'm reviving this thread six months later to say that we've aligned
with your vision of hardware support for Type-2 hypervisors. While
the next UCB draft proposal (1.10) will not contain an H-mode, it is
likely that 1.11 will contain a virtualization proposal very similar
to yours. There will, of course, be ample opportunity for debate on
virtualization support before any attempt to ratify the privileged
architecture standard.

Andrew

On Tue, Aug 30, 2016 at 4:01 AM, Paolo Bonzini <bon...@gnu.org> wrote:

Paolo Bonzini

Feb 24, 2017, 3:57:23 AM
to Andrew Waterman, RISC-V ISA Dev, Sagar Karandikar


On 23/02/2017 23:11, Andrew Waterman wrote:
> Hi Paolo,
>
> I'm reviving this thread six months later to say that we've aligned
> with your vision of hardware support for Type-2 hypervisors. While
> the next UCB draft proposal (1.10) will not contain an H-mode, it is
> likely that 1.11 will contain a virtualization proposal very similar
> to yours. There will, of course, be ample opportunity for debate on
> virtualization support before any attempt to ratify the privileged
> architecture standard.

That's awesome! Thanks,

Paolo

Andrew Waterman

Feb 24, 2017, 4:20:39 AM
to Paolo Bonzini, RISC-V ISA Dev, Sagar Karandikar
Thank *you*.

Andrew

Ray Van De Walker

Feb 24, 2017, 3:09:52 PM
to RISC-V ISA Dev
I reviewed P. Bonzini's proposal again, and I think I might have understood it this time.
Does "user under hypervisor" mode have usefully significant differences from "user" mode?
In a like way, does "machine mode" have usefully significant differences from "supervisor" mode?
If these modes of P. Bonzini's proposal can be combined, the number of states falls from 5 to 3.
(I propose no change to "supervisor under hypervisor" mode.)
Therefore the number of state transitions (combinations taken two at a time) falls from 10 to 3.
Might it be easier to design and verify?
Or maybe this is a way to select verification cases?
(P. Bonzini's proposed implementation already looks pretty good.)
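
The mode-pair counts quoted above can be checked mechanically. The sketch below simply enumerates unordered pairs of modes; the merged mode names ("M+S", "U+HA/U") are labels invented for this example.

```python
from itertools import combinations

five_modes = ["M", "S", "U", "HA/S", "HA/U"]
three_modes = ["M+S", "U+HA/U", "HA/S"]   # after the two proposed merges


def unordered_pairs(modes):
    """All combinations of modes taken two at a time."""
    return list(combinations(modes, 2))


print(len(unordered_pairs(five_modes)), len(unordered_pairs(three_modes)))  # prints "10 3"
```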

-----Original Message-----
From: Andrew Waterman [mailto:and...@sifive.com]
Sent: Thursday, February 23, 2017 2:12 PM
To: Paolo Bonzini <bon...@gnu.org>
Cc: RISC-V ISA Dev <isa...@groups.riscv.org>;
Subject: Re: [isa-dev] Proposal for virtualization without H mode

Hi Paolo,

I'm reviving this thread six months later to say that we've aligned with your vision of hardware support for Type-2 hypervisors...
.

Paolo Bonzini

Feb 27, 2017, 6:16:09 AM
to Ray Van De Walker, RISC-V ISA Dev


On 24/02/2017 21:09, Ray Van De Walker wrote:
> I reviewed P. Bonzini's proposal again, and I think I might have understood it this time.
> Does "user under hypervisor" mode have usefully significant differences from "user" mode?

There are some differences but they are only caused by any additional
infrastructure introduced by hypervisor mode: two level paging,
additional deltas added to the cycle counter CSRs, etc. This is because
RISC-V is Popek-Goldberg virtualizable.

On x86, on the other hand, there are some unprivileged instructions that
behave differently in "user under hypervisor" mode, and some
unprivileged instructions that cause a hypervisor exit but would not
cause a supervisor exit.

Paolo

Po-wei Huang

Jun 13, 2017, 2:04:27 AM
to RISC-V ISA Dev, ray.van...@silergy.com, bon...@gnu.org
Hi Paolo and all,
Could I ask you why RISC-V is Popek-Goldberg virtualizable?
The CSR read/write instructions in the user spec could modify the timer or other system resources, so they look like sensitive instructions that do not trap.
If that's correct, why is RISC-V virtualizable?

Any reply would be appreciated.
Thanks,
Po-wei

On Monday, February 27, 2017 at 7:16:09 PM UTC+8, Paolo Bonzini wrote:

Andrew Waterman

Jun 13, 2017, 3:10:58 AM
to Po-wei Huang, RISC-V ISA Dev, bon...@gnu.org, ray.van...@silergy.com
Access to the timer/counters can be disabled by the higher privilege level (via scounteren/mcounteren), so that accessing them will cause a trap.

(Also, the U-mode counters are read-only; writes always trap.)
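
A minimal sketch of that rule, assuming a hart with both S- and U-mode and the usual bit positions CY=0, TM=1, IR=2 in mcounteren/scounteren. The function names are made up for illustration.

```python
U, S, M = 0, 1, 3      # RISC-V privilege level encodings
CY, TM, IR = 0, 1, 2   # bit positions in mcounteren / scounteren


def counter_read_traps(mode: int, bit: int, mcounteren: int, scounteren: int) -> bool:
    """True if reading cycle/time/instret at this privilege level traps."""
    if mode == M:
        return False
    if not (mcounteren >> bit) & 1:
        return True                      # M-mode has disabled it for S and U
    if mode == U and not (scounteren >> bit) & 1:
        return True                      # S-mode has disabled it for U
    return False


def counter_write_traps() -> bool:
    """The user-level counter CSRs are read-only, so any write traps."""
    return True
```

A hypervisor or OS that wants to virtualize the counters clears the relevant enable bit and emulates the read in its trap handler.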


abo junghichi

May 16, 2018, 1:50:52 PM
to RISC-V ISA Dev, sag...@eecs.berkeley.edu, bon...@gnu.org
I might be posting a silly question, but can we remove U mode (the one without "HA/") by using HA/S mode instead?

The role of a hypervisor is to let all users sharing the same machine each run their own OS, different from one another; call it "virtualization outside the OS".
But we have been doing virtualization inside the OS for decades.
We virtualized the number of CPUs with time sharing, memory with the memory management unit, the file system with mechanisms such as "chroot" and "jail", etc.
At this step, the OS virtualizes an MMU-less machine as a "Unix process", on which a DOS-like OS can run.
So I'm worried that OSes in the next decade will go further, to the step where an MMU-enabled machine is virtualized.
At that step, U-mode would be an obsolete feature kept only for compatibility.
Or perhaps we have already seen OSes leap to the MMU-enabled-machine step, which is what we call a hypervisor.

If we can replace U with HA/S, we can remove HA/U too.
With ray.vandewalker's proposal to merge M-mode and S-mode, the mode status can then be represented in 1 bit.

Michael Clark

May 16, 2018, 5:52:35 PM
to abo junghichi, RISC-V ISA Dev, sag...@eecs.berkeley.edu, bon...@gnu.org
I’m not sure I agree with everything you have suggested i.e. collapsing M-mode and S-mode. M-mode is an appropriate place to “patch hardware” that the Supervisor should be largely unaware of. Of course I see the problem. Now one wants to virtualise M-mode.

One observation is that making the mode implicit helps. i.e. exposing the mode field anywhere leads to problems. The mode one is in must be largely implicit. Of course there are practical cases where the “mode” abstraction helps.

If we model this problem using capabilities, essentially we have a context with a set of hardware capability bits and the capability “for each capability” to grant that capability to another “context”. For convenience, multiple capabilities end up being grouped together which is where the concept of a “mode” arises, perhaps out of convenience.

We use that 1-bit (well several bits) to create multiple contexts with varying capabilities, including the ability to create new contexts, and to control address translation for these contexts. We’ve now even added this context to the MMU in the form of an ASID, such that an OS can map its process context identifier (pid) in the MMU. This changes the MMU from having “1-bit” to many, avoiding the cost of “context switching”.

Perhaps a more salient question is how to make two levels of address translation perform well enough such that one would choose to run a TLB intensive workload such as a database, in a virtual machine using two levels of address translation.

I think hardware second level address translation is likely here to stay; however, realising that it can, in a large proportion of cases, be linearised to a single level of address translation, one questions what is the best way to ameliorate its overheads.

It seems that second level address translation lowers context switch cost at the expense of higher steady state costs, i.e. TLB misses and page faults. I don’t think this is yet a completely solved problem in hardware. Given that TLB CAMs can use on the order of ~20% of energy for memory intensive workloads, it’s certainly a problem that needs to be solved.

I’m personally interested in the use of smaller TCAMs (ternary content-addressable memories) and more linearisation in software to reduce the depth of the CAM structure in hardware, which is what leads to high energy usage. The problem, however, is that the current mode of page-based first and second level address translation is very well entrenched. We need to support 4K, 2M, 1G and 512G page sizes, but there is no reason why we could not explore a model that lets us use fewer hardware comparators for address spans that are inefficiently expressed with these sizes as building blocks.

Food for thought.

Michael