software-maintained A and D bits in the page table


Christoph Hellwig

Nov 29, 2017, 7:47:16 PM
to isa...@groups.riscv.org
Section 4.3.1 of the privileged ISA spec contains this gem:

"Two schemes to manage the A and D bits are permitted:
* ... <hardware page table maintenance>
* When a virtual page is accessed and the A bit is clear, or is written
and the D bit is clear, a page-fault exception is raised."

The way this is worded forces every OS to support software maintenance
of the A and D bits, but because hardware is free to implement either
scheme, the OS needs scheme-specific handling for each, and even worse,
the scheme in use can't be discovered before causing a potential access
or dirty fault.

This has three issues:

- First, OSes don't know at boot time which scheme is in use and can't
easily reject CPUs that don't handle the A and D bits in hardware;
instead, they will fail silently the first time they encounter such a CPU.
- Second, without access to such a CPU, the OS code can't be tested,
as there is no way to select how the page table updates are handled.
- Third, the OS can't easily install special page fault handlers that
contain the A and D bit handling only when needed, bloating the page
fault path even for CPUs that do all the work in hardware.

Are there any real plans for CPUs that want to support page-based
virtual memory but do not support the A and D bits in hardware? If not,
it would be best to just drop the second option. But if it is supported,
there needs to be a way to discover it without causing page faults.
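
For concreteness, a minimal sketch (assuming the RISC-V PTE layout, with
V = bit 0, A = bit 6, D = bit 7) of what the second scheme obliges every
OS to carry in its trap path; find_pte(), real_page_fault() and
sfence_vma() are hypothetical placeholders for the OS's own walker and
trap plumbing:

#define PTE_V (1UL << 0)
#define PTE_A (1UL << 6)
#define PTE_D (1UL << 7)

void handle_ad_fault(unsigned long vaddr, int is_write)
{
    unsigned long *pte = find_pte(vaddr);
    unsigned long want = PTE_A | (is_write ? PTE_D : 0);

    if (!pte || !(*pte & PTE_V) || (*pte & want) == want) {
        real_page_fault(vaddr, is_write);  /* not an A/D-only fault */
        return;
    }
    __atomic_fetch_or(pte, want, __ATOMIC_RELAXED);  /* mark accessed/dirty */
    sfence_vma(vaddr);  /* drop any stale TLB entry for this page */
}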

Andrew Waterman

Nov 29, 2017, 7:58:29 PM
to Christoph Hellwig, RISC-V ISA Dev
No HW implementations I'm aware of manage the A/D bits in HW.

At least in Linux, managing A/D in SW uses a code path that exists for
other reasons, so it doesn't bloat the page-fault handler at all.
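
For readers unfamiliar with that path: the generic Linux fault code
already calls pte_mkyoung()/pte_mkdirty() for its own accounting, so an
architecture only has to supply bit helpers of roughly this shape (a
simplified sketch; _PAGE_ACCESSED and _PAGE_DIRTY stand in for the
arch's A/D masks):

static inline pte_t pte_mkyoung(pte_t pte)
{
        return __pte(pte_val(pte) | _PAGE_ACCESSED);
}

static inline pte_t pte_mkdirty(pte_t pte)
{
        return __pte(pte_val(pte) | _PAGE_DIRTY);
}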

Christoph Hellwig

Nov 30, 2017, 9:31:54 AM
to Andrew Waterman, Christoph Hellwig, RISC-V ISA Dev
On Wed, Nov 29, 2017 at 04:58:08PM -0800, Andrew Waterman wrote:
> No HW implementations I'm aware of manage the A/D bits in HW.
>
> At least in Linux, managing A/D in SW uses a code path that exists for
> other reasons, so it doesn't bloat the page-fault handler at all.

Pretty much all points still stand with that. Once you have hardware and
software updating the bits separately, you need to coordinate the two,
or you get into fun situations like the one handled here:

http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355769.html

In short: discoverability is important.

Albert Cahalan

Dec 1, 2017, 8:47:45 AM
to Andrew Waterman, RISC-V ISA Dev
On 11/29/17, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:

> No HW implementations I'm aware of manage the A/D bits in HW.

On the one hand, this is horrifying from a performance standpoint.
On the other hand, it offers an opportunity to change the specification.

Right now, the theoretical hardware would only set the bits in leaf
PTEs. When scanning page tables to find these bits, an OS would be
unable to skip portions of the address space that have not been
accessed. This is quite a bit of overhead; it was not a mistake for x86
to update these bits at every level.
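
A sketch of the scan in question, assuming hypothetical helpers and the
RISC-V Sv39 layout (512 PTEs per table, V = bit 0, A = bit 6): with an A
bit maintained on non-leaf entries, a clear bit prunes a whole subtree;
without it, every leaf PTE must be visited.

#define PTES_PER_TABLE 512
#define PTE_V (1UL << 0)
#define PTE_A (1UL << 6)

/* pte_is_leaf(), pte_to_table() and note_accessed() are hypothetical. */
void scan_accessed(unsigned long *table, int level)
{
    for (int i = 0; i < PTES_PER_TABLE; i++) {
        unsigned long pte = table[i];
        if (!(pte & PTE_V) || !(pte & PTE_A))
            continue;                /* invalid or untouched: prune here */
        if (level > 0 && !pte_is_leaf(pte))
            scan_accessed(pte_to_table(pte), level - 1);
        else
            note_accessed(pte);      /* leaf: record for page aging */
        table[i] = pte & ~PTE_A;     /* re-arm; needs a TLB flush later */
    }
}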

The privileged-ISA specification doesn't even allow nicely for hardware
that would update the bits: it asks software to set them to zero, a
state in which they ought to either cause faults (on CPUs without
hardware updates) or cause updates.

I also see that the specification says very little about when the bits
might be updated. For example, if an OS clears the bits while there is a
TLB entry for the page, would the hardware set the bits on the next access?
Generally, if the hardware does not snoop page tables and does not
have an exact architecturally defined specification for when TLB entries
get cast out, some types of virtualization become impossible. This is
particularly important for exact execution replay. Consider a hypervisor
that records nondeterministic things like interrupts during execution,
taking note of an instructions-retired counter. It then restores memory
and begins execution again, supplying interrupts in all the right places.
Nondeterminism for A and D bits becomes a problem; memory state
fails to match what is expected. The replay fails.

David Chisnall

Dec 1, 2017, 9:43:09 AM
to Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
On 1 Dec 2017, at 13:47, Albert Cahalan <acah...@gmail.com> wrote:
>
> Right now, the theoretical hardware would only set bits in leaf pages.
> When scanning page tables to find these bits, an OS would be unable
> to skip portions of address space that have not been accessed. This is
> quite a bit of overhead; it was not a mistake for x86 to update these bits.

This doesn't sound feasible to do in hardware without assuming that each leaf page table entry is owned by precisely one parent. This is not the case, for example, on XNU, where shared libraries share leaf page table entries, and the savings in cache pressure when satisfying TLB misses are quite noticeable.

David

Albert Cahalan

Dec 1, 2017, 9:54:57 AM
to David Chisnall, Andrew Waterman, RISC-V ISA Dev
It works fine in hardware, updating the parent that was actually used.
If XNU has no use for this, it can ignore the bits. It could set them
from the start.

Jonas Oberhauser

Dec 1, 2017, 10:00:06 AM
to RISC-V ISA Dev, wate...@eecs.berkeley.edu


On Friday, December 1, 2017 at 14:47:45 UTC+1, acahalan wrote:
On 11/29/17, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:

> No HW implementations I'm aware of manage the A/D bits in HW.

I also see that the specification has very little about when the bits might
be updated. For example, if an OS clears the bits while there is a TLB
entry for the page, would the hardware set the bits on the next access?

The bits probably have to be set speculatively, even if no access ever occurs. This could happen if, between the MMU's memory access that sets the bits and the memory access for which the MMU does the translation, an interrupt arrives and the latter is never executed. It doesn't seem feasible in general to provide atomicity here. I don't think it makes sense for the hardware to reset the bits while there is a TLB entry. I haven't yet read the TLB format, but I would assume the bits are buffered there as well.
 
Generally, if the hardware does not snoop page tables and does not
have an exact architecturally defined specification for when TLB entries
get cast out, some types of virtualization become impossible. This is
particularly important for exact execution replay. Consider a hypervisor
that records nondeterministic things like interrupts during execution,
taking note of an instructions-retired counter. It then restores memory
and begins execution again, supplying interrupts in all the right places.
Nondeterminism for A and D bits becomes a problem; memory state
fails to match what is expected. The replay fails.

Yes... If the OS that you are trying to replay has subtle race conditions like this one, you cannot replay it unless you have control and logging capabilities for the MMU. But isn't the same already true if the OS or its users have ill-synchronized self-modifying code and the hardware does not snoop for code modifications, or if the OS races with the MMU in other ways (making a page table entry invalid without a flush), or if you have a parallel OS?

Albert Cahalan

Dec 1, 2017, 12:27:54 PM
to Jonas Oberhauser, RISC-V ISA Dev, wate...@eecs.berkeley.edu
On 12/1/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
> On Friday, December 1, 2017 at 14:47:45 UTC+1, acahalan wrote:
>> On 11/29/17, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:

>>> No HW implementations I'm aware of manage the A/D bits in HW.
>>
>> I also see that the specification has very little about when
>> the bits might be updated. For example, if an OS clears
>> the bits while there is a TLB entry for the page, would the
>> hardware set the bits on the next access?
>
> The bits probably have to be set speculatively, even if no
> access ever occurs. This could happen if between the memory
> access of the MMU that sets the bits and the memory access
> for which the MMU does the translation, there is an interrupt
> and the latter is never executed. It doesn't seem feasible in
> general to provide atomicity here. I don't think it makes sense
> for the hardware to reset the bits in case there is a TLB entry.
> I haven't yet read the TLB format, but I would assume that bits
> are buffered as well.

As long as the hypervisor can record the location of that interrupt
unambiguously and then later insert it exactly, there is no problem.

I don't see a need to take interrupts in the middle of all this.
Once a TLB victim is cast out and/or an A/D bit is committed
to being written out, the memory operation ought to complete.

>> Generally, if the hardware does not snoop page tables and
>> does not have an exact architecturally defined specification
>> for when TLB entries get cast out, some types of virtualization
>> become impossible. This is particularly important for exact
>> execution replay. Consider a hypervisor that records
>> nondeterministic things like interrupts during execution,
>> taking note of an instructions-retired counter. It then restores
>> memory and begins execution again, supplying interrupts in
>> all the right places.
>>
>> Nondeterminism for A and D bits becomes a problem;
>> memory state fails to match what is expected. The replay fails.
>
> Yes... If the OS that you are trying to replay has subtle race
> conditions like this one, you can not replay it unless you have
> control&logging capabilities for the MMU. But isn't the same
> already true in case the OS or its users have ill-synchronized
> self-modifying code and the hardware does not snoop for code
> modifications, or if the OS races in other ways with the MMU
> (make a page table entry invalid without flush), or if you have a
> parallel OS?

Much of this can normally be handled by a hypervisor that uses
its own page tables to block troublesome accesses. Self-modifying
code is hopefully uncommon, so the performance impact can
be tolerated. If the OS is parallel, the same can apply as long as
the OS is decently optimized for NUMA hardware.

Nondeterministic MMU troubles are more serious. If the hypervisor
must block things like A/D updates, then performance is killed.
The whole point of hypervisor-related features in the MMU is to
make a hypervisor go acceptably fast, but those features couldn't
be used. Depending on the degree of badness, one might have to use
a JIT or even a pure emulator.

Andrew Waterman

Dec 1, 2017, 2:40:53 PM
to Jonas Oberhauser, RISC-V ISA Dev
On Fri, Dec 1, 2017 at 7:00 AM, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>
>
> On Friday, December 1, 2017 at 14:47:45 UTC+1, acahalan wrote:
>>
>> On 11/29/17, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
>>
>> > No HW implementations I'm aware of manage the A/D bits in HW.
>>
>> I also see that the specification has very little about when the bits
>> might
>> be updated. For example, if an OS clears the bits while there is a TLB
>> entry for the page, would the hardware set the bits on the next access?
>
>
> The bits probably have to be set speculatively, even if no access ever
> occurs. This could happen if between the memory access of the MMU that sets
> the bits and the memory access for which the MMU does the translation, there
> is an interrupt and the latter is never executed. It doesn't seem feasible
> in general to provide atomicity here. I don't think it makes sense for the
> hardware to reset the bits in case there is a TLB entry. I haven't yet read
> the TLB format, but I would assume that bits are buffered as well.

"The PTE update must be exact (i.e., not speculative), and observed in
program order by the local hart."

Andrew Waterman

Dec 1, 2017, 2:41:43 PM
to Albert Cahalan, RISC-V ISA Dev
On Fri, Dec 1, 2017 at 5:47 AM, Albert Cahalan <acah...@gmail.com> wrote:
> On 11/29/17, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
>
>> No HW implementations I'm aware of manage the A/D bits in HW.
>
> On the one hand, this is horrifying from a performance standpoint.
> On the other hand, it offers an opportunity to change the specification.
>
> Right now, the theoretical hardware would only set bits in leaf pages.
> When scanning page tables to find these bits, an OS would be unable
> to skip portions of address space that have not been accessed. This is
> quite a bit of overhead; it was not a mistake for x86 to update these bits.
>
> The privileged-isa specification doesn't even nicely allow for hardware
> that would update the bits. It asks software to set them to zero. That is
> a state in which they ought to cause faults (CPUs w/o hardware updates)
> or cause updates.

Other way around... the spec recommends that software set them to 1 if
they aren't needed, for the reason you suggested.
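
In other words, an OS that never consumes A/D information can preset
both bits when it creates a PTE, so neither management scheme ever
faults or writes back. A hedged sketch (make_pte() is hypothetical; the
PPN occupies bits 10 and up in a RISC-V PTE):

#define PTE_V (1UL << 0)
#define PTE_A (1UL << 6)
#define PTE_D (1UL << 7)

static inline unsigned long make_pte(unsigned long ppn, unsigned long perms)
{
    return (ppn << 10) | perms | PTE_A | PTE_D | PTE_V;
}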

>
> I also see that the specification has very little about when the bits might
> be updated. For example, if an OS clears the bits while there is a TLB
> entry for the page, would the hardware set the bits on the next access?

"Executing an SFENCE.VMA instruction guarantees that any stores in the
instruction stream prior to the SFENCE.VMA are ordered before all
implicit references subsequent to the SFENCE.VMA."

Andrew Waterman

Dec 1, 2017, 2:42:57 PM
to Christoph Hellwig, RISC-V ISA Dev
I agree on the discovery point; the devicetree for a hart should
indicate if this feature is present.

Christoph Hellwig

Dec 1, 2017, 2:54:41 PM
to Andrew Waterman, Christoph Hellwig, RISC-V ISA Dev
On Fri, Dec 01, 2017 at 11:42:34AM -0800, Andrew Waterman wrote:
> I agree on the discovery point; the devicetree for a hart should
> indicate if this feature is present.

In general I would prefer the ISA to make this self-discoverable,
e.g. the equivalent of a CPUID leaf on x86. But devicetree or a
devicetree equivalent (UEFI *shudder*) is better than nothing.
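
A sketch of what the devicetree route could look like on the Linux side;
the property name "riscv,hw-ad-update" is invented here purely for
illustration (no such binding existed at the time), while
of_property_read_bool() is the standard kernel OF helper:

#include <linux/of.h>

static bool cpu_has_hw_ad(struct device_node *cpu)
{
        return of_property_read_bool(cpu, "riscv,hw-ad-update");
}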

Albert Cahalan

Dec 1, 2017, 3:36:13 PM
to Andrew Waterman, RISC-V ISA Dev
On 12/1/17, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
> On Fri, Dec 1, 2017 at 5:47 AM, Albert Cahalan <acah...@gmail.com> wrote:
>> On 11/29/17, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:

>>> No HW implementations I'm aware of manage the A/D bits in HW.
>>
>> On the one hand, this is horrifying from a performance standpoint.
>> On the other hand, it offers an opportunity to change the specification.
>>
>> Right now, the theoretical hardware would only set bits in leaf pages.
>> When scanning page tables to find these bits, an OS would be unable
>> to skip portions of address space that have not been accessed. This is
>> quite a bit of overhead; it was not a mistake for x86 to update these
>> bits.
>>
>> The privileged-isa specification doesn't even nicely allow for hardware
>> that would update the bits. It asks software to set them to zero. That is
>> a state in which they ought to cause faults (CPUs w/o hardware updates)
>> or cause updates.
>
> Other way around... the spec recommends that software set them to 1 if
> they aren't needed, for the reason you suggested.

It does so only for leaf page tables. For non-leaf, it recommends zero.

Supporting A/D bits in hardware on non-leaf page table entries is
important for performance. Lacking this, an OS will need to scan
through many uninteresting page table entries to find these bits.

>> I also see that the specification has very little about when the bits
>> might
>> be updated. For example, if an OS clears the bits while there is a TLB
>> entry for the page, would the hardware set the bits on the next access?
>
> "Executing an SFENCE.VMA instruction guarantees that any stores in the
> instruction stream prior to the SFENCE.VMA are ordered before all
> implicit references subsequent to the SFENCE.VMA."

Suppose an OS does not do this, and I want to run the OS in a
hypervisor with exact replay. It looks like proper replay might not
be possible.

For example:

The hardware sets these bits on a TLB load. The OS sweeps through
the entries from time to time, clearing the A bits (virtual memory
aging). TLB entries can be evicted via an undocumented, nondeterministic
mechanism at any time, and then recreated as needed, so the bits get
written by the hardware at random-ish times. When the hypervisor
replays the execution, TLB entries get evicted and recreated differently.
This causes different state in memory, which leads to the OS seeing
different bits, which leads to different swapping, different disk
interrupts, and different process scheduling... and the whole
replay falls apart.

Jonas Oberhauser

Dec 3, 2017, 11:52:02 AM
to Albert Cahalan, Andrew Waterman, RISC-V ISA Dev

On 12/1/17, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
"The PTE update must be exact (i.e., not speculative), and observed in
program order by the local hart."

2017-12-01 21:36 GMT+01:00 Albert Cahalan <acah...@gmail.com>:
It does only for leaf page tables. For non-leaf, it recommends zero.
Supporting A/D bits in hardware on non-leaf page table entries is
important for performance. Lacking this, an OS will need to scan
through many uninteresting page table entries to find these bits.

If I understand correctly, taking both of these together turns TLB misses into strong memory barriers and turns SFENCE.VMAs into NOPs.

This can be implemented in two ways. To ensure program order between preceding memory operations (MOPs) and the MMU step, either:
1) we need forwarding/stalling hardware between the MMU and the reorder buffers that scans for races between the MMU and the buffered operations and ensures local order (by forwarding) or even global order (by stalling), or
2) a TLB miss acts as a very strong memory barrier: *all* memory operations before it in program order have to complete before the MMU operation.
To ensure program order between the MMU step and subsequent MOPs, we need to stall all of the latter, because in a multi-level translation we usually cannot predict the remaining addresses.

Also, if you do begin to translate that oldest memory operation, you have to delay incoming interrupts until the memory operation and all preceding instructions have completed. This means you probably want to complete all preceding memory operations before you begin the translation.

So the MMU barrier (SFENCE.VMA) can typically be implemented as a NOP if the MMU sets A/D bits: MMUs already have to ensure ordering with preceding stores to the portion of memory that contains the A/D bits, so they can either take as a software condition that the OS must only change a page table entry if it also clears the A/D bits, or simply extend the race-detection hardware to check the whole page table entry.

Without the restriction Andrew just quoted, you would only need all loads to be translated before a store is translated; you wouldn't need to stall translations or forward loads and stores to the MMU, and you wouldn't need to translate memory operations before a speculative load (which you can quash if you find out you just violated program order).

As long as only leaf page tables have A/D bits, the parent translations can be done speculatively and there is a little bit less stalling/forwarding to be done.

I don't know if all of this makes a big performance difference, but it doesn't feel right. Personally, I'd much rather have A/D bits on all levels and accept speculative A/D bits than have A/D bits only on the leaf but exact.

Jacob Bachmeyer

Dec 3, 2017, 11:27:38 PM
to Jonas Oberhauser, Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
Jonas Oberhauser wrote:
> So the MMU barrier (SFENCE.VMA) can typically be implemented as a NOP
> if the MMU sets A/D bits: MMUs already must ensure ordering with
> preceding stores to the portion of memory that contains A/D bits, so
> they can either take as a software condition that the OS must only
> change a page table entry if they also clear A/D bits or they can
> simply extend the race detection hardware to check the whole page
> table entry.

Could this be simplified to "MMU hardware snoops all memory writes and
invalidates TLB entries loaded from pages to which writes are
observed"? This adds TLB CAM columns, since each leaf TLB entry needs
to record LEVELS physical page numbers to invalidate itself when needed,
but would make SFENCE.VMA a NOP.

> As long as only leaf page tables have A/D bits, one can do the parent
> translations speculatively and there is a little bit less of
> stalling/forwarding that needs to be done.
>
> I don't know if all of this makes a big performance difference, but it
> doesn't feel right. Personally I'd much rather have A/D bits on all
> levels and take speculative A/D bits than to have A/D bits only on the
> leaf but exact.

I may be showing my ignorance here, but could we have speculative A/D
bits on non-leaf PTEs and exact A/D bits on leaf PTEs? Is there a way
we could reduce the hardware A/D bit management to AMOOR in physical
address space? Physical address space is also M-mode address space;
should we require the monitor to handle A/D bit updates on
implementations that do not perform them in hardware, as we do for
unaligned accesses?

I understand that the key benefit of having A/D bits on non-leaf PTEs is
allowing supervisor software to quickly observe summary information --
if the A bit on a non-leaf PTE is clear, /nothing/ in that region of the
address space has been accessed and scanning further page tables is
unnecessary. Likewise for the D bit. Since the cost of a "false
positive" on a non-leaf PTE is merely a wasted scan of next-level page
tables (a cost which is unavoidable currently due to the lack of A/D
bits on non-leaf PTEs), speculatively setting A/D on non-leaf PTEs could
be useful.

On another note, if the A/D updates can be reduced to AMOOR, why not
simply attach speculative AMOOR operations to speculative memory
accesses? The AMOORs that set the A/D bits would commit if, only if,
and when the parent MOP commits. This requires attaching up to LEVELS
AMOOR ops to each memory access, however.
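
As a software-level illustration of that reduction (a sketch, not the
hardware): the whole update is a single atomic OR on the PTE's physical
address, and because OR-ing the same mask twice is a no-op, a duplicated
or replayed commit is harmless:

#include <stdatomic.h>

#define PTE_A (1UL << 6)
#define PTE_D (1UL << 7)

/* One AMOOR per page-table level; idempotent, so repeating it is safe. */
static void commit_ad(_Atomic unsigned long *pte, int is_store)
{
    unsigned long mask = PTE_A | (is_store ? PTE_D : 0);
    atomic_fetch_or_explicit(pte, mask, memory_order_relaxed);
}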



-- Jacob

Albert Cahalan

Dec 4, 2017, 2:08:59 AM
to jcb6...@gmail.com, Jonas Oberhauser, Andrew Waterman, RISC-V ISA Dev
On 12/3/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Could this be simplified to "MMU hardware snoops all memory writes and
> invalidates TLB entries loaded from pages to which writes are
> observed"? This adds TLB CAM columns, since each leaf TLB entry needs
> to record LEVELS physical page numbers to invalidate itself when needed,
> but would make SFENCE.VMA a NOP.

This would be tremendously helpful for hypervisors with exact replay.
It may make the difference between being able to directly let the OS
use page tables and not being able to do so. This is a huge performance
difference, possibly enough to determine if the OS can run or not.

> I may be showing my ignorance here, but could we have speculative A/D
> bits on non-leaf PTEs and exact A/D bits on leaf PTEs?

If unneeded speculative stuff doesn't always get undone before being
seen by the OS, then exact replay won't work.

Jonas Oberhauser

Dec 4, 2017, 4:29:22 AM
to Albert Cahalan, Jacob Bachmeyer, Andrew Waterman, RISC-V ISA Dev


On Dec 4, 2017 08:08, "Albert Cahalan" <acah...@gmail.com> wrote:
On 12/3/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Could this be simplified to "MMU hardware snoops all memory writes and
> invalidates TLB entries loaded from pages to which writes are
> observed"?  This adds TLB CAM columns, since each leaf TLB entry needs
> to record LEVELS physical page numbers to invalidate itself when needed,
> but would make SFENCE.VMA a NOP.

This would be tremendously helpful for hypervisors with exact replay.
It may make the difference between being able to directly let the OS
use page tables and not being able to do so. This is a huge performance
difference, possibly enough to determine if the OS can run or not.

We would also have to order stores to a PTE and memory operations that use translations from that PTE.
If I look at a simple pipelined processor, I want to fetch before the preceding stores have been decoded (since before decoding I don't even know that an instruction is a store, I would otherwise have to wait between fetches until instructions are decoded).
But if a preceding store manipulates a (non-leaf) PTE that is used to translate the current instruction address, I am in hot water:
the MMU may or may not have just translated and set A bits for PTEs it thought would be used for the fetch, but which will not be used because I pointed the PTE somewhere else.
How can I be sure that on replay I will get the same A bits?

Note also that this cannot be the only place where TLB entries are evicted, because the TLB is finite.
So you probably also need "the algorithm that passively evicts TLB entries is deterministic", in the sense that it only runs as a side effect of something deterministic, like a TLB miss, and that if the spec-TLB has the same set of records, the same entries are evicted.


> I may be showing my ignorance here, but could we have speculative A/D
> bits on non-leaf PTEs and exact A/D bits on leaf PTEs?

If unneeded speculative stuff doesn't always get undone before being
seen by the OS, then exact replay won't work.

Only if the speculation is nondeterministic in the ISA.
Hardware in the end is usually deterministic (if you don't rely on sensors, which a hypervisor doing exact replay may fudge), but the question is which factors can play a role (e.g., cache state could play a role). As a result, this model of deterministic speculative execution would probably have to go all the way down (if it is possible to implement it at all -- I'm not sure yet).

It is possible to specify in the ISA that there is a deterministic scheduler in each hart that looks at processor state (including TLB content) plus the current ROB, but that the exact scheduler is opaque (i.e., software has to work for all allowed* schedulers). This scheduler tells us, for example: "fetch the next instruction; compute the address of that memory operation; translate this, translate that; evaluate this branch; speculate this branch as 'taken'; ..."
(*) A scheduler is allowed if it obeys some restrictions. For example, it should only issue a translation for something that is currently in the ROB (or about to be fetched) and for which there is no TLB hit, and it should only execute a store operation if all preceding stores will not modify any PTE used by that store.

In this model you can translate completely out of order; you do get speculative A/D bits, but you also know that on replay you will get the same speculative A/D bits.


Again I'm not sure this can be implemented, but if it can, I think it might make all of us happy?

Jonas Oberhauser

Dec 4, 2017, 9:34:49 AM
to RISC-V ISA Dev, s9jo...@gmail.com, acah...@gmail.com, wate...@eecs.berkeley.edu, jcb6...@gmail.com


On Monday, December 4, 2017 at 05:27:38 UTC+1, Jacob Bachmeyer wrote:
On another note, if the A/D updates can be reduced to AMOOR, why not
simply attach speculative AMOOR operations to speculative memory
accesses?  The AMOORs that set the A/D bits would commit if, only if,
and when the parent MOP commits.  This requires attaching up to LEVELS
AMOOR ops to each memory access, however.


Not completely. You'd probably also want to make sure that the PTE has not changed in the meantime (e.g., by another thread that saw that the A/D bits are not set and swapped the page out). Otherwise you will set the A/D bits on the new PTE rather than the old one. In case the PTE has changed, you'd have to retranslate.
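
In C11-atomic terms, that check could look like the following sketch:
commit the bits only if the PTE still holds the value the walk observed,
and make the walker retranslate otherwise (names hypothetical):

#include <stdatomic.h>
#include <stdbool.h>

/* Returns false if the PTE changed since 'seen' was read (e.g. another
 * hart swapped the page out); the caller must then retranslate. */
static bool commit_ad_checked(_Atomic unsigned long *pte,
                              unsigned long seen, unsigned long mask)
{
    unsigned long want = seen | mask;
    return atomic_compare_exchange_strong_explicit(
            pte, &seen, want,
            memory_order_release, memory_order_relaxed);
}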

Jacob Bachmeyer

Dec 5, 2017, 12:40:44 AM
to Jonas Oberhauser, RISC-V ISA Dev, acah...@gmail.com, wate...@eecs.berkeley.edu
Until SFENCE.VMA is executed (or an implementation-specific
hardware-assisted remote TLB shootdown is performed), there is no guarantee
that this hart will see a PTE update. If SFENCE.VMA is executed, then
affected translations must be flushed from the TLB, which means any
speculative memory accesses (that are somehow still pending after an IPI
arrives... we must be using hardware remote TLB shootdown) using those
translations must be discarded.


-- Jacob

Jacob Bachmeyer

Dec 5, 2017, 12:51:01 AM
to Albert Cahalan, Jonas Oberhauser, Andrew Waterman, RISC-V ISA Dev
By "speculative A/D bits on non-leaf PTEs", I meant that A/D bits on
non-leaf PTEs are set whenever that path through the page tables is
traversed, even if that traversal ultimately ends at a page fault
instead of a successful translation. I have probably misused the term
"speculative" here, showing ignorance.

Unfortunately, this leaves the question of what happens if the page
tables are traversed in a context where a page fault does not cause a
trap, such as a speculated instruction that has not yet committed or a
prefetch instruction that ignores page faults. For the latter case, I
would say that the bits should be updated -- user code has walked this
path, after all. For the former case, I would argue that instructions
should in general be atomic -- either the instruction commits or no sign
that it was ever decoded leaves the processor.


-- Jacob

Jacob Bachmeyer

Dec 5, 2017, 1:07:23 AM
to Jonas Oberhauser, Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Dec 4, 2017 08:08, "Albert Cahalan" <acah...@gmail.com> wrote:
>
> On 12/3/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> > Could this be simplified to "MMU hardware snoops all memory writes and
> > invalidates TLB entries loaded from pages to which writes are
> > observed"? This adds TLB CAM columns, since each leaf TLB entry needs
> > to record LEVELS physical page numbers to invalidate itself when needed,
> > but would make SFENCE.VMA a NOP.
>
> This would be tremendously helpful for hypervisors with exact replay.
> It may make the difference between being able to directly let the OS
> use page tables and not being able to do so. This is a huge performance
> difference, possibly enough to determine if the OS can run or not.
>
>
> We would also have to order stores to a PTE and memory operations that
> use translations from that PTE.

RISC-V has the SFENCE.VMA instruction for this. PTE updates by software
are not guaranteed to be visible to the MMU until SFENCE.VMA is
executed. SFENCE.VMA would still be a NOP with respect to the TLBs, but
it may need to flush the pipeline.

> If I look at a simple pipelined processor, I want to fetch before the
> preceding stores have been decoded (since before decoding I don't even
> know that it is a store, I would have to wait between fetching until
> instructions are decoded).
> But if a preceding store manipulates a (non-leaf) PTE that is used to
> translate the current instruction address, I am in hot water:
> the MMU may have or may not have just translated and set A bits for
> PTEs it thought will be used for fetch, but will not be used because I
> pointed the PTE somewhere else.

Each instruction can carry the physical page numbers from the ITLB entry
that provided its translation; a match flushes the pipeline. For an
in-order pipeline this could be implemented by adding a counter to each
ITLB row that is set when that entry is used to fetch an instruction and
decremented towards zero as each instruction is retired. If an ITLB
entry with a non-zero counter is invalidated, the pipeline is flushed.
Similarly, the needed A bits for the instruction fetch need to be
carried in the pipeline and committed when the instruction retires.

> Note also that this can not be the only place where TLB entries are
> evicted, because the TLB is finite.

Of course not, I assume that other events (like adding a translation to
a full TLB) can evict TLB entries, but suggested that MMU hardware
observing a write to a physical page that had been traversed to produce
a TLB entry could always invalidate that TLB entry for a moderate cost
in hardware.


-- Jacob

Jonas Oberhauser

Dec 5, 2017, 2:16:46 AM
to Jacob Bachmeyer, Albert Cahalan, Andrew Waterman, RISC-V ISA Dev


On Dec 5, 2017 7:07 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
Jonas Oberhauser wrote:
On Dec 4, 2017 08:08, "Albert Cahalan" <acah...@gmail.com> wrote:

    On 12/3/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

    > Could this be simplified to "MMU hardware snoops all memory writes and
    > invalidates TLB entries loaded from pages to which writes are
    > observed"? This adds TLB CAM columns, since each leaf TLB entry needs
    > to record LEVELS physical page numbers to invalidate itself when needed,
    > but would make SFENCE.VMA a NOP.

    This would be tremendously helpful for hypervisors with exact replay.
    It may make the difference between being able to directly let the OS
    use page tables and not being able to do so. This is a huge performance
    difference, possibly enough to determine if the OS can run or not.

We would also have to order stores to a PTE and memory operations that use translations from that PTE.

RISC-V has the SFENCE.VMA instruction for this. PTE updates by software are not guaranteed to be visible to the MMU until SFENCE.VMA is executed. SFENCE.VMA would still be a NOP with respect to the TLBs, but it may need to flush the pipeline.

I'm afraid there was a misunderstanding. The problem is not that the last write is not visible (it is). The problem is that A bits might be set accidentally before the CPU realizes that the PTE is out of date. This could happen (in a 7-stage pipeline) before the store and the sfence are decoded, so an sfence does not help. Carrying the A bits (as you suggest) would indeed solve this problem, but I'm not sure it's a good idea (see below).



If I look at a simple pipelined processor, I want to fetch before the preceding stores have been decoded (since before decoding I don't even know that it is a store, I would have to wait between fetching until instructions are decoded).
But if a preceding store manipulates a (non-leaf) PTE that is used to translate the current instruction address, I am in hot water:
the MMU may have or may not have just translated and set A bits for PTEs it thought will be used for fetch, but will not be used because I pointed the PTE somewhere else.

Each instruction can carry the physical page numbers from the ITLB entry that provided its translation; a match flushes the pipeline.  For an in-order pipeline this could be implemented by adding a counter to each ITLB row that is set when that entry is used to fetch an instruction and decremented towards zero as each instruction is retired.  If an ITLB entry with a non-zero counter is invalidated, the pipeline is flushed. 

By flushed you mean rolled-back?

Similarly, the needed A bits for the instruction fetch need to be carried in the pipeline and committed when the instruction retires.

As mentioned above, I'm not sure carrying A bits in the pipeline is a good idea, since you need to double-check before committing these bits (which you can do once the preceding instructions have been decoded) that the PTE has not been changed in the meantime by others, and possibly roll back.

As a side note, couldn't this theoretically lead to a starved processor that never gets to execute an instruction because the fetch translation always happens to be bad? Imagine two cores A and B that send each other into an infinite miss A -> page fault handler A -> miss B -> page fault handler B loop. I'm not sure it's a real risk, since the page fault handler is presumably quite slow, but it still worries me a bit.



Note also that this can not be the only place where TLB entries are evicted, because the TLB is finite.

Of course not, I assume that other events (like adding a translation to a full TLB) can evict TLB entries, but suggested that MMU hardware observing a write to a physical page that had been traversed to produce a TLB entry could always invalidate that TLB entry for a moderate cost in hardware.

I know, and I think it's a neat idea :) I was only responding to Albert, observing that this alone does not guarantee a deterministic TLB for exact replay.



-- Jacob

Jacob Bachmeyer

Dec 5, 2017, 11:47:50 PM
to Jonas Oberhauser, Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Dec 5, 2017 7:07 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> [...]
Ah, so the PTE has been replaced, but that replacement is not visible to
the local MMU yet and the MMU ends up setting A on the new PTE without
walking the page tables again. This cannot happen when the PTE
replacement invalidates the TLB entry. (Although now DTLB entries also
need "pipeline clearance" counters.)

> If I look at a simple pipelined processor, I want to fetch before
> the preceding stores have been decoded (since before decoding I
> don't even know that it is a store, I would have to wait between
> fetching until instructions are decoded).
> But if a preceding store manipulates a (non-leaf) PTE that is used
> to translate the current instruction address, I am in hot water:
> the MMU may have or may not have just translated and set A bits
> for PTEs it thought will be used for fetch, but will not be used
> because I pointed the PTE somewhere else.
>
>
> Each instruction can carry the physical page numbers from the ITLB
> entry that provided its translation; a match flushes the
> pipeline. For an in-order pipeline this could be implemented by
> adding a counter to each ITLB row that is set when that entry is
> used to fetch an instruction and decremented towards zero as each
> instruction is retired. If an ITLB entry with a non-zero counter
> is invalidated, the pipeline is flushed.
>
>
> By flushed you mean rolled-back?

I was assuming that instructions commit at the end of the pipeline, such
that flushing the pipeline cancels all instructions currently
"in-flight", yes. (Fetching instructions to feed the pipeline is
speculative in that the program counter is not actually updated to point
after an instruction until that instruction is retired.)

> Similarly, the needed A bits for the instruction fetch need to be
> carried in the pipeline and committed when the instruction retires.
>
>
> As mentioned above, I'm not sure carrying A bits in the pipeline is a
> good idea, since you need to double check before committing these bits
> (which you can do once preceding instructions have been decoded) that
> the PTE is not changed in the meantime by others and possibly roll-back.

Changing the PTE invalidates the TLB entries that were loaded from that
PTE (on the same (memory) cycle as the PTE write in the approach I
suggest) and flushes the pipeline if any "in-flight" instructions depend
on those TLB entries. The A bits are only set if the instruction
actually reaches the end of the pipeline. What to do with the primary
result if committing an instruction and its A bits spans cycles and the
invalidation occurs in the middle of the commit is an interesting
question. Cancel it entirely? Let it "sneak under the wire"?

> As a side note, couldn't this theoretically lead to a starved
> processor which never gets to execute an instruction because the fetch
> translation always happens to be bad? Imagine two cores A,B that send
> each other into an infinite miss A -> pagefault handler A -> miss B ->
> pagefault handler B loop. I'm not sure it's a real risk since the
> pagefault handler is presumably quite slow, but it's still worrying me
> a bit.

If the mapping for fetching the next instruction keeps changing, how
does the processor ever correctly execute an instruction? (Your example
of livelock is correct behavior as I see it. The solution is "do not do
that".)



-- Jacob

Paolo Bonzini

Dec 6, 2017, 3:14:45 AM
to Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
On 01/12/2017 21:36, Albert Cahalan wrote:
>> Other way around... the spec recommends that software set them to 1 if
>> they aren't needed, for the reason you suggested.
> It does only for leaf page tables. For non-leaf, it recommends zero.
>
> Supporting A/D bits in hardware on non-leaf page table entries is
> important for performance. Lacking this, an OS will need to scan
> through many uninteresting page table entries to find these bits.

Is it? Suppose you use absent or write-protected pages to implement A/D
bits on non-leaf page tables; that is, a mix of software-maintained A/D
bits on non-leaves and hardware-maintained A/D bits on leaves.

For non-leaf page tables, a single entry covers 2 MB of memory or more.
If you take 4000 clock cycles to process such a page fault, the cost is
about 8 clock cycles per 4 KiB page covered (4000 cycles amortized over
the 512 pages under a 2 MB entry), and proportionally less for larger
regions. (I actually think 4000 clock cycles is quite generous, but it
makes a round number :)).

You could either place the A/D bits in separate memory (which possibly
gives you more cache-friendly scans when you have to look at the bits),
or ignore the recommendation and place the A/D bits in the page tables.
In either case, having hardware non-leaf A/D bits is less important than
having hardware leaf A/D bits.
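
A sketch of the "separate memory" option (sizes and names hypothetical):
dense bitmaps indexed by physical frame number, so one 64-byte cacheline
of the accessed map summarizes 512 pages during a scan:

#define BITS_PER_LONG 64
#define NR_FRAMES (1UL << 20)   /* e.g. 4 GiB of 4 KiB frames */

static unsigned long accessed_map[NR_FRAMES / BITS_PER_LONG];
static unsigned long dirty_map[NR_FRAMES / BITS_PER_LONG];

/* Called from the software A/D fault handler instead of touching PTEs. */
static inline void mark_frame(unsigned long pfn, int is_write)
{
    accessed_map[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
    if (is_write)
        dirty_map[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
}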

That said, the spec _should_ recommend setting the bits to 1 even for
non-leaf page tables.

Paolo

Jonas Oberhauser

Dec 6, 2017, 5:20:10 AM
to Jacob Bachmeyer, Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
2017-12-06 5:47 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
On Dec 5, 2017 7:07 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
I'm afraid there was a misunderstanding. The problem is not that the last write is not visible (it is). The problem is that A bits might be set accidentally before the CPU realizes that the PTE is out of date. This could happen (in a 7 stage pipeline) before the store and the sfence are decoded, so an sfence does not help. Carrying (as you suggest) A bits would indeed solve this problem, but I'm not sure it's a good idea (see below).

Ah, so the PTE has been replaced, but that replacement is not visible to the local MMU yet and the MMU ends up setting A on the new PTE without walking the page tables again.  This cannot happen when the PTE replacement invalidates the TLB entry.  (Although now DTLB entries also need "pipeline clearance" counters.)

I feel we are still not in sync. Here is an execution (pteX(va, ptea) generates the X-th level of translation for virtual address va using page table entry address ptea):

CPU wants to fetch for va
MMU sets A bit for pte0(va, root ptea), generates ptea_1
MMU sets A bit for pte1(va, ptea_1), generates ptea_2
MMU sets A bit for (leaf) pte2(va, ptea_2), generates pa
CPU changes pte0(va, root ptea); the translation above is aborted and invalidated in the TLB
MMU sets A bit for pte0(va, root ptea), generates ptea_1'
MMU sets A bit for pte1(va, ptea_1'), generates ptea_2'
MMU sets A bit for (leaf) pte2(va, ptea_2'), generates pa'
CPU fetches from pa' (translated through root ptea -> ptea_1' -> ptea_2')

In this execution we never access anything that uses the PTEs at ptea_1 and ptea_2, but their A bits are set. Therefore the A bits are now speculative (which violates the spec, makes perfect replay impossible unless we add determinism, etc.).

Note that the MMU 1) walks the page tables again, 2) sets A bits for the new PTEs, 3) uses the new PTEs. All of this happens because the PTE replacement invalidates the TLB entry, and it is perfectly fine.
The problem is that A bits on the *old* PTEs are set even though those PTEs are never used.

Now I agree that delaying the A bits until retirement solves this problem, but it has its own problems.


By flushed you mean rolled-back?

I was assuming that instructions commit at the end of the pipeline, such that flushing the pipeline cancels all instructions currently "in-flight", yes.  (Fetching instructions to feed the pipeline is speculative in that the program counter is not actually updated to point after an instruction until that instruction is retired.)

Is that how you implement it? In our implementations, the program counter is updated, but the old program counters are carried in the pipeline so that in case of a roll-back they can be restored to their original values. Otherwise the circuit for computing the instruction address becomes long (you'll have to forward the effect of several jumps & normal instructions), which requires long cycle times.
The program counter is updated either speculatively directly after fetch (with branch prediction) or once the branch target is known, in which case the instruction address is forwarded using delayed branch semantics.

 
    Similarly, the needed A bits for the instruction fetch need to be
    carried in the pipeline and committed when the instruction retires.


As mentioned above, I'm not sure carrying A bits in the pipeline is a good idea, since you need to double check before committing these bits (which you can do once preceding instructions have been decoded) that the PTE is not changed in the meantime by others and possibly roll-back.

The A bits are only set if the instruction actually reaches the end of the pipeline. 

Yes, but like I said, this means that you need to look for changes made to the PTE by *others*, not just within your pipeline. In particular, you have to atomically:
1) read all the PTEs that you used and compare them with the original values (which you need to store); in case they changed, abort and roll back the pipeline;
2) set the A bits on all of them;
3) fetch.
Then you may have to do an analogous thing if you are a memory operation.
You can do this long before you retire, but only after all previous stores and their addresses are known.
You can do this long before you retire but after all previous stores and their addresses are known.


Now of course you could instead snoop for changes made by others. This is a bit complicated.
My suggestion would be that the local cache used by the TLB needs to
1) listen to global writes to PTEs used by current TLB entries (obviously)
2) but also listen to global reads (!) of PTEs used by the current TLB, and respond with "you are not the only one who has it, but I don't have it cached" (i.e., no data intervention). This prevents the other cache from going into a local mode where it can write without using the memory bus, which would cause us to miss these messages.

The TLB's cache can then stay invalid.
On a write to a buffered PTE, the cache has to send invalidation messages to the TLB.

In other words, the TLB needs to sit on the memory bus (and act like a special cache) or needs its own special cache; it cannot just share a normal cache with the CPU.


None of this feels particularly easy or standard to me. Which doesn't have to be a bad thing. But I know my boss would be angry if I tried suggesting such a spec ;) (we once had something similar where a page fault would have to be rechecked after draining the pipeline and before raising an interrupt -- he decided we would simply put the faulty translation speculatively into the TLB and raise the page fault even if the fault had disappeared in the meantime).
 
What to do with the primary result if committing an instruction and its A bits spans cycles and the invalidation occurs in the middle of the commit is an interesting question.  Cancel it entirely?  Let it "sneak under the wire"?

In-pipeline you can probably prevent this using some stalling -- only begin one access if no conflicting access is ongoing.

By others you somehow need to trick the cache system into preventing it, e.g., by reserving all of the PTEs and setting all A/D bits together.

As a side note, couldn't this theoretically lead to a starved processor which never gets to execute an instruction because the fetch translation always happens to be bad? Imagine two cores A,B that send each other into an infinite miss A -> pagefault handler A -> miss B -> pagefault handler B loop. I'm not sure it's a real risk since the pagefault handler is presumably quite slow, but it's still worrying me a bit.

If the mapping for fetching the next instruction keeps changing, how does the processor ever correctly execute an instruction?  (Your example of livelock is correct behavior as I see it.  The solution is "do not do that".)

The mapping would not change if the A bits were set immediately on fetch, rather than when the instruction is ready to retire. Wouldn't it be nice if you could simply take the first unaccessed page table and swap it out?
I think you now need to be a bit smart about how you build your thread scheduler and your page eviction scheduler to really prevent the above situation; namely, there must be no symmetry (eventually the next page to be swapped out must belong to a different thread than the next one to be scheduled).
PS: The situation with the misses/page faults reminds me a bit of trying to pass someone in a store aisle: I make a step left, he makes a step right; I make a step right, he makes a step left. This goes on for a couple of iterations... ;)

Jacob Bachmeyer

Dec 6, 2017, 8:04:01 PM
to Jonas Oberhauser, Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
Jonas Oberhauser wrote:
> 2017-12-06 5:47 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> On Dec 5, 2017 7:07 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> I'm afraid there was a misunderstanding. The problem is not
> that the last write is not visible (it is). The problem is
> that A bits might be set accidentally before the CPU realizes
> that the PTE is out of date. This could happen (in a 7 stage
> pipeline) before the store and the sfence are decoded, so an
> sfence does not help. Carrying (as you suggest) A bits would
> indeed solve this problem, but I'm not sure it's a good idea
> (see below).
>
>
> Ah, so the PTE has been replaced, but that replacement is not
> visible to the local MMU yet and the MMU ends up setting A on the
> new PTE without walking the page tables again. This cannot happen
> when the PTE replacement invalidates the TLB entry. (Although now
> DTLB entries also need "pipeline clearance" counters.)
>
>
> I feel we are still not in sync. Here is an execution: (pteX(va,ptea)
> generates the X-th level of translation for virtual address va using
> page table entry address ptea))
[numbers added]
> CPU[1] wants to fetch for va
> MMU[1] sets A bit for pte0(va, root ptea), generates ptea_1
> MMU[1] sets A bit for pte1(va, ptea_1), generates ptea_2
> MMU[1] sets A bit for (leaf) pte2(va, ptea_2), generates pa
> CPU[2] changes pte0(va, root ptea); the translation above is aborted and
> invalidated in [MMU1] TLB
> MMU[1] sets A bit for pte0(va, root ptea), generates ptea_1'
> MMU[1] sets A bit for pte1(va, ptea_1'), generates ptea_2'
> MMU[1] sets A bit for (leaf) pte2(va, ptea_2'), generates pa'
> CPU[1] fetches from pa' (translated through root ptea -> ptea_1' ->
> ptea_2')
>
> In this execution we never access anything that uses the PTEs at
> ptea_1 and ptea_2, but their A bits are set. Therefore the A bits are
> now speculative (which violates the spec, makes perfect replay
> impossible unless we add determinism, etc.).
>
> Note that the MMU 1) walks the page tables again, 2) sets A bits for
> the new PTEs, 3) uses the new PTEs. All of this happens because the PTE
> replacement invalidates the TLB entry, and it is perfectly fine.
> The problem is that A bits on the *old* PTEs are set even though those
> PTEs are never used.
>
> Now I agree that delaying the A bits until retirement solves this
> problem, but it has its own problems.

I had assumed that delaying setting the A bits until the instruction
completes was the only way to prevent exactly this problem. (The only
quibble I have is that CPU1 and MMU1 are doing the translation while
CPU2 changes the page tables -- CPU1 cannot change the translation,
because it is stalled while resolving the translation. This issue only
happens in multi-processor systems.)

> By flushed you mean rolled-back?
>
>
> I was assuming that instructions commit at the end of the
> pipeline, such that flushing the pipeline cancels all instructions
> currently "in-flight", yes. (Fetching instructions to feed the
> pipeline is speculative in that the program counter is not
> actually updated to point after an instruction until that
> instruction is retired.)
>
>
> Is that how you implement it? In our implementations, the program
> counter is updated, but the old program counters are carried in the
> pipeline so that in case of a roll-back they can be restored to their
> original values. Otherwise the circuit for computing the instruction
> address becomes long (you'll have to forward the effect of several
> jumps & normal instructions), which requires long cycle times.
> The program counter is updated either speculatively directly after
> fetch (with branch prediction) or once the branch target is known, in
> which case the instruction address is forwarded using delayed branch
> semantics.

You seem to be using an undo log, while I was thinking of a redo log:
the "real" program counter value is updated as instructions retire, but
instruction fetch has its own speculative program counter. A jump or
predicted branch simply changes the speculative program counter
(possibly changing a few speculatively-fetched instructions in the
pipeline into NOPs); a branch misprediction flushes the pipeline. There
are probably many ways to do this; I just needed an existence proof that
it is possible. (In practice, both ways probably end up sending the
fetch-program-counter value down the pipeline -- AUIPC needs it in the ALU.)

> Similarly, the needed A bits for the instruction fetch
> need to be
> carried in the pipeline and committed when the instruction
> retires.
>
>
> As mentioned above, I'm not sure carrying A bits in the
> pipeline is a good idea, since you need to double check before
> committing these bits (which you can do once preceding
> instructions have been decoded) that the PTE is not changed in
> the meantime by others and possibly roll-back.
>
>
> The A bits are only set if the instruction actually reaches the
> end of the pipeline.
>
>
> Yes, but like I said this means that you need to look for changes by
> *others* made to the PTE, not just within your pipeline. In
> particular, you have to atomically:
> 1) read all the PTEs that you used and compare with the original value
> (which you need to store). In case they changed, abort and rollback
> the pipeline.
> 2) set the A bits to all of them
> 3) fetch
> Then you may have to do an analogous thing if you are a memory operation.
> You can do this long before you retire but after all previous stores
> and their addresses are known.

That would be multiple cycles to retire each instruction -- avoiding
this is why I suggested having the MMU snoop the memory bus. The MMU is
parallel hardware and can monitor for invalidating writes while the CPU
is executing instructions.

> Now of course you could instead snoop for changes made by others. This
> is a bit complicated.
> My suggestion would be that the local cache used by the TLB needs to
> 1) listen to global writes to PTEs used by current TLB entries (obviously)
> 2) but also listen to global reads (!) to PTEs used by the current
> TLB, respond with a "you are not the only one who has it, but I don't
> have it cached" (i.e., no data intervention). This prevents the other
> cache from going into a local mode where it can write without using
> the memory bus, which would cause us to miss these messages.

A single wired-OR data line is sufficient for this and would indicate
"you are reading from a live page table"; I suggested that
TLB-invalidation-on-PTE-write could have page granularity -- PTE updates
are generally fairly rare, and more so for active translations. Any
write to a page table invalidates all TLB entries derived from that page
table.

> The TLB's cache can then stay invalid.
> On a write to a buffered PTE, the cache has to send invalidation
> messages to the TLB.
>
> In other words, the TLB needs to sit on the memory bus (and act like a
> special cache) or needs its own special cache, it cannot just share a
> normal cache with the CPU.

Sharing "close-in" data caches between CPU core and TLB is a waste of
limited L1D cachelines in my view. Outer, shared caches can hold both
(and can thus forward CPU PTE writes to MMU TLB reads without going
through main memory in all cases), but yes, this is what I meant by "the
MMU snoops the memory bus".

> None of this feels particularly easy or standard to me. Which doesn't
> have to be a bad thing. But I know my boss would be angry if I tried
> suggesting such a spec ;) (we once had something similar where a page
> fault would have to be rechecked after draining the pipeline and
> before raising an interrupt -- he decided we simply put the faulty
> translation speculatively in the TLB and raise the page fault even if
> in the meantime the fault has disappeared).

Synchronization in high performance hardware is never easy; simpler
implementations can use page faults and software to manage A and D.

>
> What to do with the primary result if committing an instruction
> and its A bits spans cycles and the invalidation occurs in the
> middle of the commit is an interesting question. Cancel it
> entirely? Let it "sneak under the wire"?
>
>
> In-pipeline you can probably prevent this using some stalling -- only
> begin one access if no conflicting access is ongoing.

Fortunately, AMOOR is idempotent, so no conflicting access for
instructions in the pipeline is possible -- an instruction cannot
conflict with itself, and translation conflicts with later instructions
cause "busy" TLB entries to be invalidated and the pipeline to be flushed.

> By others you somehow need to trick the cache system into preventing
> it, e.g., by reserving all of the PTEs and setting all A/D bits together.

This is essentially the idea -- the MMU holds quasi-LR reservations on
all PTEs that affect currently "in-flight" instructions. Writes (other
than A/D updates) break those reservations, which invalidates TLB
entries, which forces a pipeline flush.
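A toy software model of that reservation rule, just to pin down the invariant (everything here is a modeling aid, not RTL):

#include <stdint.h>
#include <stdbool.h>

#define MAX_RESV 16   /* PTEs backing in-flight instructions */

static uint64_t resv[MAX_RESV];  /* physical addresses of reserved PTEs */
static int nresv;

extern void tlb_invalidate_by_pte(uint64_t pte_pa);
extern void pipeline_flush(void);

void mmu_reserve_pte(uint64_t pte_pa)   /* at translation time */
{
    if (nresv < MAX_RESV)
        resv[nresv++] = pte_pa;
}

void mmu_release_all(void)              /* at instruction retirement */
{
    nresv = 0;
}

/* Called for every write snooped on the memory bus. */
void snoop_write(uint64_t pa, bool is_ad_update)
{
    if (is_ad_update)
        return;              /* A/D updates do not break reservations */
    for (int i = 0; i < nresv; i++) {
        if (resv[i] == pa) {
            tlb_invalidate_by_pte(pa);  /* entry derived from this PTE dies */
            pipeline_flush();           /* canceled instructions will retry */
        }
    }
}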

> As a side note, couldn't this theoretically lead to a starved
> processor which never gets to execute an instruction because
> the fetch translation always happens to be bad? Imagine two
> cores A,B that send each other into an infinite miss A ->
> pagefault handler A -> miss B -> pagefault handler B loop. I'm
> not sure it's a real risk since the pagefault handler is
> presumably quite slow, but it's still worrying me a bit.
>
>
> If the mapping for fetching the next instruction keeps changing,
> how does the processor ever correctly execute an instruction?
> (Your example of livelock is correct behavior as I see it. The
> solution is "do not do that".)
>
>
> The mapping would not change if A bits would be set immediately on
> fetch, rather than when the instruction is ready to retire. Wouldn't
> it be nice if you could simply take the first unaccessed page table
> and swap it out?

In that case, what we really need is a global "TLB probe" instruction
that answers "is any hart relying on this page table?" which can be
answered by attempting to read from the table and seeing if the "you are
reading from a live page table" signal is asserted. Still, the window
for this is very narrow -- the A bits get set after one instruction latency.

> I think now you need to be a bit smart on how you build your thread
> scheduler and your page eviction scheduler to really prevent the
> above situation, namely there must be no symmetry (eventually the next
> page to be swapped out must be from another thread than the next one
> to be scheduled).

A scheduler that swaps out pages belonging to runnable tasks has a
pretty serious bug anyway -- that is an amazingly good recipe for VM
swap thrashing. :-)

> PS: The situation with the miss/pagefaults reminds me a bit of when I
> try to pass by someone in the aisles of a store; I make a step left,
> he makes a step right. I make a step right, he makes a step left. This
> goes on for a couple of iterations... ;)

I think that relative speeds between processor core and memory bus is
part of the trick here -- the processor can execute several instructions
before the next memory bus cycle can start, so as long as the processor
can complete fetching *one* cacheline worth of instructions, it can make
forward progress by executing from that cacheline before another
processor can initiate yet another write. Multiple harts "ganging up"
to bounce PTEs and starve one hart of memory access cycles might be
possible, but I see that as a pathological case.


-- Jacob

Jonas Oberhauser

unread,
Dec 7, 2017, 5:17:58 AM12/7/17
to Jacob Bachmeyer, Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
On Dec 7, 2017 02:04, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
Jonas Oberhauser wrote:
2017-12-06 5:47 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:


    Jonas Oberhauser wrote:

        On Dec 5, 2017 7:07 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
        I'm afraid there was a misunderstanding. The problem is not
        that the last write is not visible (it is). The problem is
        that A bits might be set accidentally before the CPU realizes
        that the PTE is out of date. This could happen (in a 7 stage
        pipeline) before the store and the sfence are decoded, so an
        sfence does not help. Carrying (as you suggest) A bits would
        indeed solve this problem, but I'm not sure it's a good idea
        (see below).


    Ah, so the PTE has been replaced, but that replacement is not
    visible to the local MMU yet and the MMU ends up setting A on the
    new PTE without walking the page tables again.  This cannot happen
    when the PTE replacement invalidates the TLB entry.  (Although now
    DTLB entries also need "pipeline clearance" counters.)


I feel we are still not in sync. Here is an execution: (pteX(va,ptea) generates the X-th level of translation for virtual address va using page table entry address ptea)
[numbers added]
In this execution we never accessed something that uses the PTE with ptea_1&ptea_2, but the A bit is set. Therefore the A bit is now speculative (violates spec, makes perfect replay impossible unless we add determinism, etc.).

Note that the MMU 1) walks page tables again, 2) sets A bits for the new PTE, 3) uses the new PTE. All of this is because PTE replacement invalidates the TLB entry, and is perfectly fine.
The problem is that A bits on the *old* PTE are set even though the PTE is never used.

Now I agree that delaying the A bits until retirement solves this problem, but it has its own problems.

I had assumed that delaying setting A bits until the instruction completes was the only way, to prevent exactly this problem.  (The only quibble I have is that CPU1 and MMU1 are doing the translation, while CPU2 changes the page tables -- CPU1 cannot change the translation because it is stalled while resolving the translation.  This issue only happens in multi-processor systems.)

This issue happens in single core systems. I'll add a few extra steps for a 7 stage pipeline.

CPU in stage 2 begins fetching instruction word I
CPU in stage 1 wants to fetch for va, whole CPU is stalled
MMU sets A bit for pte0(va, root ptea), generates ptea_1
MMU sets A bit for pte1(va, ptea_1), generates ptea_2
MMU sets A bit for (leaf) pte2(va, ptea_2), generates pa, CPU resumes
CPU finishes fetching I, clocks it into instruction decoder stage 3
CPU finishes decoding I, it is a store. It is clocked into the ALU stage 4 to compute the effective address of the store
CPU Computes: physical ea of I is pte0(va,root ptea), translation above is aborted and invalidated in TLB
MMU sets A bit for (updated) pte0(va, root ptea), generates ptea_1'
MMU sets A bit for pte1(va, ptea_1'), generates ptea_2'
MMU sets A bit for (leaf) pte2(va, ptea_2'), generates pa'
CPU fetches from pa' (translated through root ptea -> ptea_1' -> ptea_2')

(note that stalling the whole CPU sounds slow and did not help).
Or did you mean the pipeline is drained before the translation begins? If the whole pipeline is drained on a TLB miss, that would help. I don't know what the performance cost is.



You seem to be using an undo log, while I was thinking of a redo log:  the "real" program counter value is updated as instructions retire, but instruction fetch has its own speculative program counter.  A jump or predicted branch simply changes the speculative program counter (possibly changing a few speculatively-fetched instructions in the pipeline into NOPs); a branch misprediction flushes the pipeline.  There are probably many ways to do this; I just needed an existence proof that it is possible. 

I think they are the same in HW. Once you have a speculative pc register up there you no longer need an extra real pc register downstairs, which is never used.


(In practice, both ways probably end up sending the fetch-program-counter value down the pipeline -- AUIPC needs it in the ALU.)

You don't need an explicit PC pipeline for that; you can use the operand pipeline.

The explicit pc pipeline is needed for interrupts of type repeat (like page faults).




Now of course you could instead snoop for changes made by others. This is a bit complicated.
My suggestion would be that the local cache used by the TLB needs to
1) listen to global writes to PTEs used by current TLB entries (obviously)
2) but also listen to global reads (!) to PTEs used by the current TLB, respond with a "you are not the only one who has it, but I don't have it cached" (i.e., no data intervention). This prevents the other cache from going into a local mode where it can write without using the memory bus, which would cause us to miss these messages.

A single wired-OR data line is sufficient for this and would indicate "you are reading from a live page table"
At first I was going to say:
You typically don't need an extra wire since such wires are already part of most cache protocols. For example, in the caches implemented in "a pipelined multicore MIPS machine", you would put a 1 on the Ca bus and a 0 on the di bus.
This would allow one to use "normal" caches for the most part.

However once you want to do more with the line than a normal cache would -- e.g., reserve it except for MMUs -- you do need the extra wire.


I suggested that TLB-invalidation-on-PTE-write could have page granularity -- PTE updates are generally fairly rare, and more so for active translations.  Any write to a page table invalidates all TLB entries derived from that page table.

Sure. I don't think it makes a big difference in HW.

Outer, shared caches can hold both (and can thus forward CPU PTE writes to MMU TLB reads without going through main memory in all cases), but yes, this what I meant by "the MMU snoops the memory bus".

You are now talking about CPUi and MMUj, right?

Synchronization in high performance hardware is never easy; simpler implementations can use page faults and software to manage A and D.

I agree, but what I don't understand is why it is necessary to promise such strong synchronization in the first place. In my view it is perfectly acceptable to have speculative out-of-order translation.
I believe: if you also have that xRET synchronizes the memory operations of CPUi and MMUi, and PTE changes (whether made in HW or in SW) invalidate affected entries, then any software that does not modify the PTEs for its own accesses will never see stale translations, and if a page has no A/D bits set, it was never accessed, not even by "live" translations (though the converse is not true).





    What to do with the primary result if committing an instruction
    and its A bits spans cycles and the invalidation occurs in the
    middle of the commit is an interesting question.  Cancel it
    entirely?  Let it "sneak under the wire"?


In-pipeline you can probably prevent this using some stalling -- only begin one access if no conflicting access is ongoing.

Fortunately, AMOOR is idempotent, so no conflicting access for instructions in the pipeline is possible -- an instruction cannot conflict with itself, and translation conflicts with later instructions cause "busy" TLB entries to be invalidated and the pipeline to be flushed.

As mentioned above, this also requires that you drain the pipeline on a TLB miss, since you can otherwise have conflicts and discover them late.



By others you somehow need  to trick the cache system into preventing it, e.g., by reserving all of the PTEs and setting all A/D bits together.

This is essentially the idea -- the MMU holds quasi-LR reservations on all PTEs that affect currently "in-flight" instructions.  Writes (other than A/D updates) break those reservations, which invalidates TLB entries, which forces a pipeline flush.

Are we agreed that writes by non-MMUs should not be allowed if an MMU holds the reservations, to prevent exactly the situation you mentioned (memory access during PTE invalidation)?

        As a side note, couldn't this theoretically lead to a starved
        processor which never gets to execute an instruction because
        the fetch translation always happens to be bad? Imagine two
        cores A,B that send each other into an infinite miss A ->
        pagefault handler A -> miss B -> pagefault handler B loop. I'm
        not sure it's a real risk since the pagefault handler is
        presumably quite slow, but it's still worrying me a bit.


    If the mapping for fetching the next instruction keeps changing,
    how does the processor ever correctly execute an instruction? (Your example of livelock is correct behavior as I see it. The
    solution is "do not do that".)


The mapping would not change if A bits would be set immediately on fetch, rather than when the instruction is ready to retire. Wouldn't it be nice if you could simply take the first unaccessed page table and swap it out?

In that case, what we really need is a global "TLB probe" instruction that answers "is any hart relying on this page table?" which can be answered by attempting to read from the table and seeing if the "you are reading from a live page table" signal is asserted.

The purpose of this is to figure out whether the PTE is used but A/D bits have not yet been stored to it.

Such a TLB probe would need to reserve the PTE and the subsequent PTE write has to be conditional. But if the MMU already reserves PTEs between the first access and setting A/D bits, such an instruction is not necessary.


Still, the window for this is very narrow -- the A bits get set after one instruction latency.

But this latency can be big. Many cycles can occur between translation and retirement, including several memory bus cycles. I don't know how fast memory will be in typical RISC-V implementations relative to the CPU, but I can imagine that LEVELS PTE accesses plus a fetch take a lot of cycles.

I think now you need to be a bit smart on how you build your thread scheduler and your page eviction scheduler to really  prevent the above situation, namely there must be no symmetry (eventually the next page to be swapped out must be from another thread than the next one to be scheduled).

A scheduler that swaps out pages belonging to runnable tasks has a pretty serious bug anyway -- that is an amazingly good recipe for VM swap thrashing.  :-)

You may have no other choice if you only have a few tasks and they are all runnable. I take it from you that this never happens because there is always a huge number of tasks?


PS: The situation with the miss/pagefaults reminds me a bit of when I try to pass by someone in the aisles of a store; I make a step left, he makes a step right. I make a step right, he makes a step left. This goes on for a couple of iterations... ;)

I think that relative speeds between processor core and memory bus is part of the trick here -- the processor can execute several instructions before the next memory bus cycle can start, so as long as the processor can complete fetching *one* cacheline worth of instructions, it can make forward progress by executing from that cacheline before another processor can initiate yet another write.

No, because to begin executing instructions you need to translate, which takes several cycles and could be aborted when the first write starts, at which point you get a pf. 

Of course if like you suggested above we guarantee for the whole time that the PTEs are reserved for the MMUs, the fetch translation will progress and the cacheline can be fetched (and subsequently be executed, even if in the meantime the PTEs get invalidated -- yay for separate fetch and execute). My worries will as we Germans say "evaporate into a cloud of pleasure" (it sounds better in German, I promise).

Jacob Bachmeyer

unread,
Dec 7, 2017, 11:15:10 PM12/7/17
to Jonas Oberhauser, Albert Cahalan, Andrew Waterman, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Dec 7, 2017 02:04, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
I think I see it now: the pipeline must avoid translation hazards.

> Now of course you could instead snoop for changes made by
> others. This is a bit complicated.
> My suggestion would be that the local cache used by the TLB
> needs to
> 1) listen to global writes to PTEs used by current TLB entries
> (obviously)
> 2) but also listen to global reads (!) to PTEs used by the
> current TLB, respond with a "you are not the only one who has
> it, but I don't have it cached" (i.e., no data intervention).
> This prevents the other cache from going into a local mode
> where it can write without using the memory bus, which would
> cause us to miss these messages.
>
>
> A single wired-OR data line is sufficient for this and would
> indicate "you are reading from a live page table"
>
> At first I was going to say:
> You typically don't need an extra wire since such wires are already
> part of most cache protocols. For example in the caches implemented in
> "a pipelined multicore MIPS machine" , you would put a 1 on the Ca bus
> and a 0 on the di bus.
> This would allow one to use "normal" caches for the most part.
>
> However once you want to do more with the line than a normal cache
> would -- e.g., reserve it except for MMUs -- you do need the extra wire.

Reads never conflict, but a CPU reading a line held by an MMU must take
the line as "shared" rather than "owned". In the simple case, where we
do not have a special "global TLB probe" instruction to implement, any
TLB entry is effectively in "shared" cache state. When a CPU writes to
a page table, it must acquire "exclusive" cache state for that line,
which causes any TLB entries dependent on that line to be invalidated.

If we do have a "global TLB probe" (for finding unused regions to swap
out), then we need a special "shared-by-TLB" state that "global TLB
probe" can observe.

> I suggested that TLB-invalidation-on-PTE-write could have page
> granularity -- PTE updates are generally fairly rare, and more so
> for active translations. Any write to a page table invalidates
> all TLB entries derived from that page table.
>
>
> Sure. I don't think it makes a big difference in HW.

I have since realized that cacheline granularity or even individual PTE
granularity would be much preferable -- if page granularity is used, any
write to the root page table flushes the entire TLB!

> Outer, shared caches can hold both (and can thus forward CPU PTE
> writes to MMU TLB reads without going through main memory in all
> cases), but yes, this what I meant by "the MMU snoops the memory bus".
>
>
> You are now talking about CPUi and MMUj, right?

Also CPUi and MMUi -- in normal user execution, the MMU will access page
tables, but the CPU will access user data. Those categories are
disjoint, so the innermost caches should be distinct for CPU and MMU.
Going a step farther, the MMU can use multi-level TLBs instead of
MMU-only caches. A TLB is functionally equivalent to a cache. At some
point, MMU paths to memory and CPU paths to memory join, but whether
distinct harts share CPU-only caches before CPU and MMU paths join is an
implementation choice.

Caches beyond the point where CPU and MMU paths join can forward CPU PTE
writes to MMU TLB reads, but that point may be just above a single hart
and its MMU or farther out in the cache hierarchy. At that point, an
inclusive cache is needed -- that cache must be able to respond with
data when a TLB indicates "I share this line", since the TLB does not
have the entire cacheline.

If we add a "shared-by-TLB" state to the cache coherency protocol, then
we can also add "set A" and "set D" messages that TLBs can emit to
update other copies of PTEs that they map. If we allow non-leaf A to be
set even when a translation is ultimately unsuccessful, only leaf PTEs
can ever need such an update, as non-leaf PTEs are updated when the TLB
is loaded, and the "shared-by-TLB" state indicates that a TLB holds a
mapping, therefore all non-leaf A bits have been set on that path.
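To make the proposal concrete, here is a toy C model of how an agent holding a PTE line might react under a MESI-like protocol extended with that state and those messages (illustrative only; the state names and probe plumbing are assumptions):

#include <stdint.h>

#define PTE_A (1u << 6)
#define PTE_D (1u << 7)

enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED, SHARED_BY_TLB };
enum bus_msg    { READ, READ_EXCL, SET_A, SET_D, TLB_PROBE };

struct line { enum line_state st; uint64_t pte; };

/* How an agent holding a PTE line reacts to a snooped message. */
void snoop(struct line *l, enum bus_msg m, int *live_wire)
{
    switch (m) {
    case READ:                    /* plain reads never conflict */
        break;
    case READ_EXCL:               /* a CPU wants to rewrite the PTE */
        l->st = INVALID;          /* TLB entries derived from it die too */
        break;
    case SET_A:                   /* another TLB set the A bit... */
        l->pte |= PTE_A;          /* ...update in place, no invalidation */
        break;
    case SET_D:
        l->pte |= PTE_D;
        break;
    case TLB_PROBE:               /* "is any hart relying on this PTE?" */
        if (l->st == SHARED_BY_TLB)
            *live_wire = 1;       /* assert the wired-OR "live" line */
        break;
    }
}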

> Synchronization in high performance hardware is never easy;
> simpler implementations can use page faults and software to manage
> A and D.
>
>
> I agree but what I don't understand is why it is necessary to promise
> such strong synchronization in the first place. In my view it is
> perfectly acceptable to have speculative out of order translation.

I think that someone wanted the ability to do exact replays?

> I believe: If you also have that xRET synchronizes mops of CPUi and
> MMUi, and PTE changes either in HW or in SW invalidates affected
> entries, then any software that does not modify the PTEs for its own
> accesses will never see stale translations, and if a page has no A/D
> bits it was never accessed not even by "live" translations (though the
> converse is not true).

What really complicates this is the need for "busy" translations -- the
first instruction to read or write that page (and therefore set the A or
D bit) has performed its translation, but has not yet committed. I
believe the hazard we are trying to avoid is setting the A/D bit after
another hart has replaced the PTE while an instruction was "in-flight"
that uses that PTE. The "in-flight" instruction then commits and the
A/D bit is set on the *new* PTE, incorrectly.

> What to do with the primary result if committing an
> instruction
> and its A bits spans cycles and the invalidation occurs in the
> middle of the commit is an interesting question. Cancel it
> entirely? Let it "sneak under the wire"?
>
>
> In-pipeline you can probably prevent this using some stalling
> -- only begin one access if no conflicting access is ongoing.
>
>
>
> Fortunately, AMOOR is idempotent, so no conflicting access for
> instructions in the pipeline is possible -- an instructions cannot
> conflict with itself, and translation conflicts with later
> instructions cause "busy" TLB entries to be invalidated and the
> pipeline to be flushed.
>
>
> As mentioned above, this also needs that you drain pipeline on TLB
> miss, since you can otherwise have conflicts and discover them late.

Will a page table walk ever complete quickly enough to *not* drain the
pipeline while the MMU is busy loading a TLB entry? I believe that
loading a TLB entry will have enough latency that all preceding
instructions will complete before the new translation is ready. If a
preceding instruction commits and replaces a PTE that the MMU just read
while walking the page tables, then the MMU cancels its partial
translation and starts over. If some other PTE is overwritten, the
relevant TLB entry is invalidated, but the MMU continues. If that TLB
invalidation affects a pending instruction, then the pipeline is flushed
(while the MMU is still translating) and the pipeline resumes by
retrying the instruction that caused the TLB miss or the instruction
that was canceled, whichever came earlier in the program.

An MMU write to the page tables *never* invalidates a TLB entry, to be
clear.
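A C sketch of that restart rule for the walker (LEVELS, read_pte(), next_level(), and pte_overwritten() are all hypothetical stand-ins for hardware):

#include <stdint.h>
#include <stdbool.h>

#define LEVELS 3   /* e.g. Sv39 */

extern uint64_t read_pte(uint64_t pte_pa);
extern uint64_t next_level(uint64_t pte, uint64_t va, int level);
extern bool pte_overwritten(uint64_t pte_pa, uint64_t seen_value);

/* Returns the leaf PTE for va, restarting on any mid-walk overwrite. */
uint64_t walk_translate(uint64_t root_pa, uint64_t va)
{
restart:;
    uint64_t used_pa[LEVELS], used_val[LEVELS];
    uint64_t pa = root_pa;
    for (int lvl = 0; lvl < LEVELS; lvl++) {
        uint64_t pte = read_pte(pa);
        used_pa[lvl]  = pa;
        used_val[lvl] = pte;
        /* Recheck every PTE consumed so far; a write to any of them
         * cancels the partial translation and starts the walk over. */
        for (int i = 0; i <= lvl; i++)
            if (pte_overwritten(used_pa[i], used_val[i]))
                goto restart;
        if (lvl == LEVELS - 1)
            return pte;                     /* leaf PTE */
        pa = next_level(pte, va, lvl);      /* descend one level */
    }
    return 0;                               /* not reached */
}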

> By others you somehow need to trick the cache system into
> preventing it, e.g., by reserving all of the PTEs and setting
> all A/D bits together.
>
>
> This is essentially the idea -- the MMU holds quasi-LR
> reservations on all PTEs that affect currently "in-flight"
> instructions. Writes (other than A/D updates) break those
> reservations, which invalidates TLB entries, which forces a
> pipeline flush.
>
>
> Are we agreed that writes by non-MMUs should not be allowed if an MMU
> holds the reservations, to prevent exactly the situation you mentioned
> (memory access during PTE invalidation)?

A CPU write to a PTE is allowed at any time, but forces all "readers"
("in-flight" instructions dependent on a translation through that PTE)
to be canceled and re-tried when their TLB entries are invalidated.
Since this is expected to be a rare conflict, flushing the pipeline and
starting over from just after the last instruction that actually
committed seems reasonable to me. An instruction that has already been
committed simply completed before the PTE was written.
The "global TLB probe" instruction effectively asks "will the A/D bits
on this mapping be set 'soon'?" The MMU reservations are "soft" -- they
do not inhibit a CPU from writing to a PTE, only ensure that the MMU is
notified of such writes.

> Still, the window for this is very narrow -- the A bits get set
> after one instruction latency.
>
>
> But this latency can be big. Many cycles can occur between translation
> and retirement, including several memory bus cycles. I don't know how
> fast memory will be in typical RISC-V implementations relative to the
> CPU, but I can imagine that LEVELS PTE accesses plus a fetch take a lot of
> cycles.

The ITLB entry will be in this state the longest -- the translation must
be resolved before the instruction can be fetched, and a memory access
instruction may run a DTLB entry through the same state. Then, after
the instruction is executed, up to LEVELS A bits for the ITLB and up to
LEVELS A/D bits for the DTLB must be set, closing the window. Perhaps
the hart should be able to hold off other writes to the page tables
while setting the A/D bits? These are a short burst of accesses, and
the hazard we are trying to avoid is setting an A/D bit on a replaced PTE.

> I think now you need to be a bit smart on how you build your
> thread scheduler and your page eviction scheduler to really
> prevent the above situation, namely there must be no symmetry
> (eventually the next page to be swapped out must be from
> another thread than the next one to be scheduled).
>
>
> A scheduler that swaps out pages belonging to runnable tasks has a
> pretty serious bug anyway -- that is an amazingly good recipe for
> VM swap thrashing. :-)
>
>
> You may have no other choice if you only have a few tasks and they are
> all runnable. I take it from you that this never ever happens because
> there are a huge amount of tasks always?

In my experience (mostly desktop and server GNU/Linux), yes, ~100 tasks
and usually <10 runnable. In an RTOS, there is no swap. :-)
Admittedly, if your runnable tasks are all memory hogs, and your working
set does not fit in RAM, you are toast no matter what.

> PS: The situation with the miss/pagefaults reminds me a bit of
> when I try to pass by someone in the aisles of a store; I make
> a step left, he makes a step right. I make a step right, he
> makes a step left. This goes on for a couple of iterations... ;)
>
>
> I think that relative speeds between processor core and memory bus
> is part of the trick here -- the processor can execute several
> instructions before the next memory bus cycle can start, so as
> long as the processor can complete fetching *one* cacheline worth
> of instructions, it can make forward progress by executing from
> that cacheline before another processor can initiate yet another
> write.
>
>
> No, because to begin executing instructions you need to translate,
> which takes several cycles and could be aborted when the first write
> starts, at which point you get a pf.

Fetching the cacheline requires a translation to know which physical
cacheline to pull from main memory. Once that is done, execution from
that cacheline can begin. Of course, if the instruction to be executed
is a memory access, another translation is needed, and there is an
opportunity to replace a PTE that was used in fetching that cacheline,
and we go round all over again. I believe that this requires the memory
bus be almost saturated with repeated PTE replacements, which is a
pathological scenario.

Forward progress is possible, assuming that the processor can complete a
translation at all. In other words, we are not talking about starving
instruction fetch, but starving the MMU.

> Of course if like you suggested above we guarantee for the whole time
> that the PTEs are reserved for the MMUs, the fetch translation will
> progress and the cacheline can be fetched (and subsequently be
> executed, even if in the meantime the PTEs get invalidated -- yay for
> separate fetch and execute). My worries will as we Germans say
> "evaporate into a cloud of pleasure" (it sounds better in German, I
> promise).

So we need a memory controller that (globally) prioritizes MMU PTE fetch
above CPU writes and will permit an MMU to complete a translation before
allowing other nodes to use the bus again. If we also assign the next
memory cycle to the processor that just performed a translation and give
the same processor priority for one more translation and one more memory
bus cycle before giving the next processor priority (round-robin), I
*think* that we can guarantee forward progress unconditionally. This is
starting to sound like the dining philosophers problem: an RV32
processor needs three or four memory cycles to begin execution (fetch
PTE from root page table, fetch PTE from intermediate page table, fetch
instruction, execute instruction) with no intervening writes to the page
tables.

I still suspect that this MMU starvation is an edge case that can only
happen with programs specially written to cause it.
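A behavioral C model of that arbitration policy, just to pin down the intent (the request encoding and hart count are made up):

#include <stdbool.h>

#define NHARTS 4

enum req_kind { REQ_NONE, REQ_PTE_FETCH, REQ_IFETCH, REQ_CPU_WRITE };

static int rr_next;           /* round-robin pointer */
static int bonus_hart = -1;   /* hart owed a follow-up cycle */

/* Pick which hart's request gets the next memory cycle. */
int arbitrate(enum req_kind req[NHARTS])
{
    /* 1. A hart that just won a PTE fetch keeps priority for one more
     *    cycle, so it can finish translating and then fetch. */
    if (bonus_hart >= 0 && req[bonus_hart] != REQ_NONE) {
        int h = bonus_hart;
        bonus_hart = -1;
        return h;
    }
    /* 2. MMU PTE fetches beat everything else, round-robin among harts. */
    for (int i = 0; i < NHARTS; i++) {
        int h = (rr_next + i) % NHARTS;
        if (req[h] == REQ_PTE_FETCH) {
            bonus_hart = h;
            rr_next = (h + 1) % NHARTS;
            return h;
        }
    }
    /* 3. Otherwise plain round-robin, with writes served last so they
     *    can never starve a translation in progress. */
    for (int pass = 0; pass < 2; pass++)
        for (int i = 0; i < NHARTS; i++) {
            int h = (rr_next + i) % NHARTS;
            if ((pass == 0 && req[h] == REQ_IFETCH) ||
                (pass == 1 && req[h] == REQ_CPU_WRITE)) {
                rr_next = (h + 1) % NHARTS;
                return h;
            }
        }
    return -1;   /* bus idle this cycle */
}

The "bonus" cycle is what breaks the step-left/step-right symmetry in the aisle analogy above.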


-- Jacob

Albert Cahalan

unread,
Dec 11, 2017, 2:17:11 AM12/11/17
to jcb6...@gmail.com, Jonas Oberhauser, Andrew Waterman, RISC-V ISA Dev
On 12/7/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Jonas Oberhauser wrote:

>> I agree but what I don't understand is why it is necessary to promise
>> such strong synchronization in the first place. In my view it is
>> perfectly acceptable to have speculative out of order translation.
>
> I think that someone wanted the ability to do exact replays?

That would be me. I work on software that does exact replays
for a different architecture. It's amazing software.

I'd love to see it support RISC-V. Without reproducible memory
content, virtualization becomes unusable for this. Emulation is
often unable to run fast enough.

Allen J. Baum

unread,
Dec 11, 2017, 4:11:20 AM12/11/17
to Albert Cahalan, jcb6...@gmail.com, Jonas Oberhauser, Andrew Waterman, RISC-V ISA Dev
I'm not sure I have quite the context of your assertion - maybe you've talked about it before and I missed it.
I can see why speculative O-O-O translation could mess up exact replay.
But your rationale (beyond the fact that you supply SW for exact replay) is a bit hazy - sort of a circular argument. Is this for debugging highly parallel code? Or for Tandem-like fault tolerance? That is - why is it so amazing?


--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Jonas Oberhauser

unread,
Dec 11, 2017, 4:11:56 AM12/11/17
to Albert Cahalan, Jacob Bachmeyer, RISC-V ISA Dev
I understand that. What I was suggesting instead of non-speculative in-order translation was "deterministic speculative out-of-order translation".

Is the deterministic scheduler known to the programmer? Of course not. 
Is the deterministic scheduler the same between two implementations? Of course not. 
So software, just as is the case now, has to be programmed against  all possible deterministic schedulers.

But if you run the same program on the same hardware starting in the same state, the deterministic scheduler will make sure that the program will see the same memory states, use the same translations, etc., but if you run it on a newer CPU (as a trivial example) it might not (because the new CPU uses a different scheduler).
 

The questions are 
1) is  that good enough for you
2) can we guarantee that the scheduler only uses parts of the state that the hypervisor (HV) can control and "make the same" during the replay, so that the OS, user programs, MMU will be scheduled exactly the same way on replay?
For example, the HV during replay may take a few extra cycles to load some state.
If the TLB entries in HW "age" in each cycle, the fact that the HV takes a few extra cycles might cause TLB entries that were used in the original execution to be evicted a little sooner -- bye bye replay.
To counter that you would need to guarantee that the deterministic scheduler only bases its decision to schedule a TLB eviction on, for example, the ISA TLB (where the aging is not visible).

So again: the question is whether in HW the deterministic scheduler depends on "hidden state" which is only stored in HW  and might be different  between two executions of the HV.


As an aside I wonder if you are familiar with Dijkstra's opinion on exact replay (https://www.cs.utexas.edu/users/EWD/transcriptions/EWD13xx/EWD1303.html). I quote -- and this in no way reflects my own opinion on the matter --:
"In order to make errors reproducible they equipped the IBM/360 with a special hardware monitor; in the recording mode it would capture precisely where in the computation each interrupt took place, in the controlling mode it would force the computer to replay exactly the recorded computation, thus making intermediate results available for inspection. It struck me as a crude and silly way of proceeding, and I remember being grateful that lack of money had protected me from making that methodological mistake. I felt grateful and superior, and when a few years later it became clear that IBM couldn't get its OS/360 right, I was not amazed." 

Albert Cahalan

unread,
Dec 11, 2017, 3:38:07 PM12/11/17
to Allen J. Baum, jcb6...@gmail.com, Jonas Oberhauser, Andrew Waterman, RISC-V ISA Dev
On 12/11/17, Allen J. Baum <allen...@esperantotech.com> wrote:
> I'm not sure I have quite the context of your assertion - maybe you've
> talked about it before and I missed it..
> I can see why speculative O-O-O translation could mess up exact replay.
> But your rationale (beyond the fact that you supply SW for exact replay) is
> a bit hazy - sort of a circular argument. Is this for debugging highly
> parallel code? Or for Tandem-like fault tolerance? That is - why is it so
> amazing?

I didn't give a rationale.

In my case, it is debugging of problems that are difficult to reproduce.
Catching the bug just once means that it can be carefully reviewed.
The bug can be archived away for later study, perhaps on a different
machine. Multiple people can work on it at the same time. Debugging
can be handled at a different location, perhaps by a different company.
If there is a program that has difficult-to-debug crashes that only seem
to occur at one customer site (maybe: submarine, space station, or the
south pole), now you don't have to go there to debug it.

It is also nice to be able to uncrash something, and I could see that
being automated to provide Tandem-like fault tolerance.

There are probably other uses. Accurate profiling might be one.

My main point was this: exact replay is a real thing. It actually exists
and is useful. Making it impossible on RISC-V would be bad.

An excellent hardware situation would allow for exact execution replay
as a hypervisor (hardware acceleration) and also with emulation. While
running as a hypervisor, there would be the ability to count instructions
and to stop execution after a specified number have executed.

Albert Cahalan

unread,
Dec 11, 2017, 4:00:57 PM12/11/17
to Jonas Oberhauser, Jacob Bachmeyer, RISC-V ISA Dev
On 12/11/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:

[about exact replay]

> I understand that. What I was suggesting instead of
> non-speculative in-order translation was "deterministic
> speculative out-of-order translation".
>
> Is the deterministic scheduler known to the programmer?
> Of course not. Is the deterministic scheduler the same
> between two implementations? Of course not. So software,
> just as is the case now, has to be programmed against all
> possible deterministic schedulers.
>
> But if you run the same program on the same hardware starting
> in the same state, the deterministic scheduler will make sure
> that the program will see the same memory states, use the same
> translations, etc., but if you run it on a newer CPU (as a trivial
> example) it might not (because the new CPU uses a different
> scheduler).
>
> The questions are
> 1) is that good enough for you

It's OKish. It covers many use cases. It may mean that people who
use this to debug things would need to have many CPU versions.

> 2) can we guarantee that the scheduler only uses parts of the
> state that the hypervisor (HV) can control and "make the same"
> during the replay, so that the OS, user programs, MMU will be
> scheduled exactly the same way on replay?

Well, that's the idea. With varying degrees of difficulty with performance
and ill-behaved operating systems, it works today on non-RISC-V.
Hardware can make this easy, difficult, or not worth even trying.

> As an aside I wonder if you are familiar with Dijkstra's opinion
> on exact replay

No, but generally I find his opinions to have a certain... arrogance.
He would have all software developers work in a mathematical way,
with proof of correctness to prevent bugs. That's a nice ideal. In his
world there would be no bugs created, so why support debugging?

Jacob Bachmeyer

unread,
Dec 11, 2017, 7:15:50 PM12/11/17
to Albert Cahalan, Jonas Oberhauser, Andrew Waterman, RISC-V ISA Dev
Albert Cahalan wrote:
> On 12/7/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Jonas Oberhauser wrote:
>>
>>> I agree but what I don't understand is why it is necessary to promise
>>> such strong synchronization in the first place. In my view it is
>>> perfectly acceptable to have speculative out of order translation.
>>>
>> I think that someone wanted the ability to do exact replays?
>>
>
> That would be me. I work on software that does exact replays
> for a different architecture. It's amazing software.
>

Is this something similar to Xen Remus?

> I'd love to see it support RISC-V. Without reproducible memory
> content, virtualization becomes unusable for this. Emulation is
> often unable to run fast enough.
>

Since updates to non-leaf PTEs would need to be optional in any case,
could adding a "deterministic memory" flag to hstatus solve your
concern? Systems that do not care about deterministic memory might be
able to gain performance by not setting that flag, while systems that
want to perform replays could set that flag. One possible effect could
be disabling hardware A/D bit management and falling back to software.
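To illustrate, a hypervisor might use the proposed flag like this -- stressing that the flag, and certainly its bit position, are purely hypothetical here:

#include <stdint.h>

/* HYPOTHETICAL: neither this flag nor its bit position exists in any
 * ratified spec; it is the proposal above, written down. */
#define HSTATUS_DETMEM (1ul << 9)

static inline uint64_t read_hstatus(void)
{
    uint64_t v;
    __asm__ volatile("csrr %0, hstatus" : "=r"(v));
    return v;
}

static inline void write_hstatus(uint64_t v)
{
    __asm__ volatile("csrw hstatus, %0" : : "r"(v));
}

/* Called before recording a guest for replay; an implementation might
 * react by disabling hardware A/D management for that guest. */
void enable_deterministic_memory(void)
{
    write_hstatus(read_hstatus() | HSTATUS_DETMEM);
}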


-- Jacob

Albert Cahalan

unread,
Dec 12, 2017, 12:19:08 AM12/12/17
to jcb6...@gmail.com, Jonas Oberhauser, Andrew Waterman, RISC-V ISA Dev
On 12/11/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Albert Cahalan wrote:

>> That would be me. I work on software that does exact replays
>> for a different architecture. It's amazing software.
>
> Is this something similar to Xen Remus?

My stuff isn't. It's more of a debugger. Xen Remus is
however another type of software that would benefit
from the same sort of better hardware support.

Xen Remus currently requires a modified OS:
https://wiki.xen.org/wiki/Remus_PV_domU_requirements

The hardware is capable of handling an arbitrary OS,
but performance would be greatly reduced. Hardware
limitations similar to your proposal below are the cause
of the reduced performance.

>> I'd love to see it support RISC-V. Without reproducible memory
>> content, virtualization becomes unusable for this. Emulation is
>> often unable to run fast enough.
>
> Since updates to non-leaf PTEs would need to be optional in
> any case, could adding a "deterministic memory" flag to hstatus
> solve your concern? Systems that do not care about deterministic
> memory might be able to gain performance by not setting that flag,
> while systems that want to perform replays could set that flag.
> One possible effect could be disabling hardware A/D bit
> management and falling back to software.

If the hardware A/D bit setting is disabled, then the hypervisor has
to emulate the hardware A/D bit setting. It would likely make the
page table pages inaccessible most of the time, faulting them in to
the OS (and invalidating associated TLB entries) as needed. This
is normally faster than emulating every instruction, but slower than
being able to run with the hardware A/D bit setting.

So sort of yes, it works, but it is far from ideal.
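A sketch of that fault-and-emulate path in C (every helper name here is hypothetical):

#include <stdint.h>
#include <stdbool.h>

extern bool is_guest_pagetable_page(uint64_t gpa);
extern void apply_guest_write(uint64_t gpa, uint64_t val);
extern void hfence_gvma(void);   /* flush guest translations */

/* Stage-2 store-fault handler: guest page-table pages are mapped
 * read-only at stage 2, so any guest write to them lands here. */
bool handle_stage2_store_fault(uint64_t gpa, uint64_t val)
{
    if (!is_guest_pagetable_page(gpa))
        return false;            /* not ours; take the normal fault path */

    apply_guest_write(gpa, val); /* emulate the PTE write */
    hfence_gvma();               /* drop TLB entries derived from it */
    return true;                 /* resume the guest after the store */
}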

Jonas Oberhauser

unread,
Dec 12, 2017, 5:49:23 AM12/12/17
to Albert Cahalan, Jacob Bachmeyer, RISC-V ISA Dev
> As an aside I wonder if you are familiar with Dijkstra's opinion
> on exact replay

No, but generally I find his opinions to have a certain... arrogance.

Haha well I think he realized this himself :) I think at a certain point professors need to be a little bit full of themselves. Why try to improve the status quo if you don't think your way is the best?
 
He would have all software developers work in a mathematical way,
with proof of correctness to prevent bugs. That's a nice ideal. In his
world there would be no bugs created, so why support debugging? 

Yeah... I for one am happy though that we got past debugging buildings, not without the help of the mathematics of structural analysis, I may add.
But even in one of our recent CPUs with a correctness proof, we found a few minor bugs after the first (pencil-and-paper) proof by implementing the machine on an FPGA. Mechanized proofs will probably prevent such things in the future, but until then, I'm very happy for debugging tools :)

Jonas Oberhauser

unread,
Dec 12, 2017, 5:56:35 AM12/12/17
to Albert Cahalan, Jacob Bachmeyer, RISC-V ISA Dev
2017-12-11 22:00 GMT+01:00 Albert Cahalan <acah...@gmail.com>:
On 12/11/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:

[about exact replay]

> I understand that. What I was suggesting instead of
> non-speculative in-order translation was "deterministic
> speculative out-of-order translation".
>
> Is the deterministic scheduler known to the programmer?
> Of course not. Is the deterministic scheduler the same
> between two implementations? Of course not. So software,
> just as is the case now, has to be programmed against all
> possible deterministic schedulers.
>
> But if you run the same program on the same hardware starting
> in the same state, the deterministic scheduler will make sure
> that the program will see the same memory states, use the same
> translations, etc., but if you run it on a newer CPU (as a trivial
> example) it might not (because the new CPU uses a different
> scheduler).
>
> The questions are
> 1) is  that good enough for you

It's OKish. It covers many use cases. It may mean that people who
use this to debug things would need to have many CPU versions.

Yes, but is this not already the case, because with a different RAM your memory cycles can be quite different, resulting in different behavior further down the line?
For example, a thread may be delayed by the slow memory, thus a store down the line is executed a little later, and therefore suddenly pops up after a load to the same address rather than before.

The behavior of the program changes on different configurations (RAM, CPU, disk...) and that cannot be prevented (and that is already the case for other architectures).
But if you at least have the guarantee that on a replay on the same configuration everything is the same, it sounds good.
 

Jonas Oberhauser

unread,
Dec 12, 2017, 5:57:17 AM12/12/17
to Albert Cahalan, Jacob Bachmeyer, Andrew Waterman, RISC-V ISA Dev
2017-12-12 6:19 GMT+01:00 Albert Cahalan <acah...@gmail.com>:
On 12/11/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Albert Cahalan wrote:

>> That would be me. I work on software that does exact replays
>> for a different architecture. It's amazing software.
>
> Is this something similar to Xen Remus?

My stuff isn't. It's more of a debugger. Xen Remus is
however another type of software that would benefit
from the same sort of better hardware support.

Xen Remus currently requires a modified OS:
https://wiki.xen.org/wiki/Remus_PV_domU_requirements

The hardware is capable of handling an arbitrary OS,
but performance would be greatly reduced. Hardware
limitations similar to your proposal below are the cause
of the reduced performance.


Wait, do you have a hypervisor or an OS or what?

Jonas Oberhauser

unread,
Dec 12, 2017, 7:31:56 AM12/12/17
to Albert Cahalan, Allen J. Baum, Jacob Bachmeyer, Andrew Waterman, RISC-V ISA Dev


On Dec 11, 2017 21:38, "Albert Cahalan" <acah...@gmail.com> wrote:
My main point was this: exact replay is a real thing. It actually exists
and is useful. Making it impossible on RISC-V would be bad.

An excellent hardware situation would allow for exact execution replay
as a hypervisor (hardware acceleration) and also with emulation. While running as a hypervisor, there would be the ability to count instructions
and to stop execution after a specified number have executed.

(disregard my previous mail asking what it is. I thought it's a HV but got confused by some of the comments on remus. Of course it is a HV.)

RISC-V can count instructions, but what about stopping execution after reaching a given count?

This is not yet a RISC-V feature, is it? It is probably not possible to add this ad-hoc via a peripheral (which could not count instructions). The only thing I see that you are guaranteed to be able to trigger on - and I don't know with which precision - is "time", which is useless.

You need it to replay PLIC interrupts from peripherals, right?

So in any case the CPU on which you do replay has a custom spec. You probably also need to be able to manipulate mtime, minstret and so on, since you otherwise have to emulate accesses even in normal execution (slow).

Do you think it would make sense to have high-performance recordability as a profile?

In this case the chip guarantees to be deterministic in state that can be restored by the HV; on the matching replay chip you have the same scheduler, and time etc. can be restored. Consequently, you can run the software like before but without peripherals, timing the interrupts, and for the OS and its users everything looks the same.

If your HV detects that the HW does not implement high-performance recordability, you would either give up or trap everything that you cannot possibly restore (like TLB misses...).



Jacob Bachmeyer

unread,
Dec 12, 2017, 7:54:55 PM12/12/17
to Albert Cahalan, Jonas Oberhauser, Andrew Waterman, RISC-V ISA Dev
No, the hypervisor would not emulate the hardware A/D bit management --
hardware A/D bit management is optional in RISC-V, and I believe that no
current implementations have it. Instead, access to pages without the A
bit set faults, and the supervisor is responsible for setting the A
bit. Since visible speculative page faults are definitely not allowed,
this provides a deterministic option.


-- Jacob

Albert Cahalan

unread,
Dec 12, 2017, 9:32:16 PM12/12/17
to Jonas Oberhauser, Jacob Bachmeyer, RISC-V ISA Dev
On 12/12/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
> 2017-12-11 22:00 GMT+01:00 Albert Cahalan <acah...@gmail.com>:
>> On 12/11/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:

>>> But if you run the same program on the same hardware starting
>>> in the same state, the deterministic scheduler will make sure
>>> that the program will see the same memory states, use the same
>>> translations, etc., but if you run it on a newer CPU (as a trivial
>>> example) it might not (because the new CPU uses a different
>>> scheduler).
>>>
>>> The questions are
>>> 1) is that good enough for you
>>
>> It's OKish. It covers many use cases. It may mean that people who
>> use this to debug things would need to have many CPU versions.
>
> Yes, but is this not already the case, because with a different RAM
> your memory cycles can be quite different, resulting in different
> behavior down the way?

Oh no. It works fine.

> For example, a thread may be delayed by the slow memory, thus a
> store down the line is executed a little later, and therefore suddenly
> pops up after a load to the same address rather than before.

A thread sees its own explicit actions in order. If doing SMP,
the problem becomes harder; the hypervisor must prevent
any uncontrolled interaction.

> The behavior of the program changes on different configurations
> (RAM,CPU,disk...) and that can not be prevented (and that is
> already the case for other architectures).

Disk is not a problem at all. The hypervisor can supply interrupts
in the correct locations. For example, an "instructions retired"
counter could be used to determine where an interrupt must fire.
On replay, this counter is set to cause an exception when the
desired instruction is reached.

Michael Chapman

unread,
Dec 13, 2017, 4:03:51 AM12/13/17
to isa...@groups.riscv.org


On 13-Dec-17 03:32, Albert Cahalan wrote:
> On 12/12/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>> 2017-12-11 22:00 GMT+01:00 Albert Cahalan <acah...@gmail.com>:
>>> On 12/11/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>>>> But if you run the same program on the same hardware starting
>>>> in the same state, the deterministic scheduler will make sure
>>>> that the program will see the same memory states, use the same
>>>> translations, etc., but if you run it on a newer CPU (as a trivial
>>>> example) it might not (because the new CPU uses a different
>>>> scheduler).
>>>>
>>>> The questions are
>>>> 1) is that good enough for you
>>> It's OKish. It covers many use cases. It may mean that people who
>>> use this to debug things would need to have many CPU versions.
>> Yes, but is this not already the case, because with a different RAM
>> your memory cycles can be quite different, resulting in different
>> behavior down the way?
> Oh no. It works fine.
>
>> For example, a thread may be delayed by the slow memory, thus a
>> store down the line is executed a little later, and therefore suddenly
>> pops up after a load to the same address rather than before.
> A thread sees its own explicit actions in order. If doing SMP,
> the problem becomes harder; the hypervisor must prevent
> any uncontrolled interaction.

So this means that supervisor and user mode code cannot use LR and SC
instructions (and other A extension instructions as well), because the
result, and the number of times a locking loop executes, may change from
execution to execution, i.e., the instruction count will then no longer
match.

So how would this be handled in practice? Modification of the guest OS
and all user mode code to remove those instructions and replace with
some kind of hypervisor call?

Or should the hypervisor have the option of making those instructions
privileged to be able to trap them (ISA spec change)?

Jonas Oberhauser

unread,
Dec 13, 2017, 4:27:31 AM12/13/17
to Michael Chapman, RISC-V ISA Dev
If you have a deterministic scheduler this is not a problem, but this is unlikely to be the case in high-performance implementations.

If you only debug programs run on single-core CPUs this is also not a problem.

In a multicore system the problem is not only with LR and SC.

All programs that have races are problematic.
Thread1:
LOAD x
8 integer instructions
STORE y 

Thread2:
14 integer instructions
LOAD y


Due to subtle timing issues such as memory latency changing with temperature, you can have different outcomes, e.g.,

cycle 0: LOAD x begins
cycle 4: LOAD x ends
cycle 12 (4+8): STORE y begins
cycle 14: LOAD y begins, is stalled
cycle 16: STORE y ends, LOAD y ends and sees new value


on replay, temp is a little higher, memory access times increase
cycle 0: LOAD x begins
cycle 8: LOAD x ends
cycle 14: LOAD y begins
cycle 16 (8+8): STORE y begins, is stalled (!)
cycle 22: LOAD y ends and sees old value (!!!)

If you have to trap every load and store, you can forget performance.
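For concreteness, the race above as a deliberately racy C11-threads program -- whether the final load sees 0 or 8 is decided purely by timing, which is exactly what a replay system would have to reproduce:

#include <stdio.h>
#include <threads.h>

static volatile int x = 1, y = 0;   /* deliberately racy: no synchronization */

static int thread1(void *arg)
{
    (void)arg;
    int t = x;                      /* LOAD x */
    t = t * 3 + 5;                  /* ...some integer work... */
    y = t;                          /* STORE y (new value: 8) */
    return 0;
}

static int thread2(void *arg)
{
    (void)arg;
    int acc = 0;
    for (int i = 0; i < 14; i++)    /* roughly "14 integer instructions" */
        acc += i;
    printf("LOAD y saw %d\n", y);   /* old 0 or new 8, timing decides */
    return 0;
}

int main(void)
{
    thrd_t a, b;
    thrd_create(&a, thread1, NULL);
    thrd_create(&b, thread2, NULL);
    thrd_join(a, NULL);
    thrd_join(b, NULL);
    return 0;
}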



Michael Clark

unread,
Dec 13, 2017, 4:40:09 AM12/13/17
to Michael Chapman, isa...@groups.riscv.org
What is the source of entropy if the initial state is determined (memory, caches, etc.)? Obviously thermal throttling would need to be disabled, as that would introduce entropy; but assuming clocks are precise, cycle counters can be recorded for precise replay of interrupts and IO bus responses, as IO is the essential source of non-determinism.

The multi-threaded deterministic replay problem has been solved by simulating multiple cores on a single core via multiplexing. I believe the mozilla rr reverse debugger [1] handles multiple threads in this way, i.e., it is limited to running all threads on a single core. I'm not sure how undo.io [2] handles multiple threads.

I guess the problem with recording multiple threads on multiple cores is whether an arbiter in hardware will break a tie in the same direction each time or whether circuit meta-stability issues will introduce entropy that results in replay non-determinism.

Interesting problem!

Jonas Oberhauser

unread,
Dec 13, 2017, 5:38:16 AM12/13/17
to Michael Clark, Michael Chapman, RISC-V ISA Dev
2017-12-13 10:39 GMT+01:00 Michael Clark <michae...@mac.com>:
What is the source of entropy if the initial state is determined (memory, caches, etc)? Obviously thermal throttling would need to be disabled as that would introduce entropy, but assuming clocks are precise, i.e. cycle counters recorded for precise replay of interrupts and IO bus responses as IO is the essential source of non-determinism.

Other problems are e.g. decay of TLB lines (and other caches). If you count cycles in some way, you get similar problems -- but now the "entropy" comes from the HV:
on record, it takes A cycles to store the context,
on replay, it takes B cycles to restore the context.

A TLB entry now gets marked for eviction on replay but not in the original run.

A memory operation that previously took the buffered translation now has to be retranslated, resulting in the best case only in additional latency (if the PTE has not changed) and in the worst case 1) additional A/D bits and 2) the memory access going to a different address.
(I consider having a TLB entry for which a PTE has been changed to be a bug in the OS, and we have been discussing to prevent it in HW).
 
The multi-threaded deterministic replay problem has been solved by simulating multiple cores on a single core via multiplexing. I believe the mozilla rr reverse debugger [1] handles multiple threads in this way. i.e. is limited to running all threads on a single core. I’m not sure how undo.io [2] handles multiple threads.

Yes, but many RISC-V processors will not be single-core. That is why I suggested either having replay guarantees only for single-core, or as an ISA extension, which the HV can detect.
If we make guarantees for single-core, we already need to prohibit behaviors like the above, so we need a rule like (for example) "TLB eviction only happens on a TLB miss, and if the virtual address of the miss is the same and the visible ISA TLB is the same, then the same TLB entries are evicted."

Furthermore, you need to prevent speculative loads (including instruction fetches) that require translations to be made, and so on.

The questions to me are: what is the performance hit of this? Are vendors willing and able to check compliance with such a spec? If it is an ISA extension, will there be any vendors who will actually implement it? Maybe custom-built SoCs on a submarine or a spaceship?

I guess the problem with recording multiple threads on multiple cores is whether an arbiter in hardware will break a tie in the same direction each time or whether circuit meta-stability issues will introduce entropy that results in replay non-determinism. 

It's not just circuit metastability; it's also internal invisible state (like TLB "age") or reliance on environmental conditions that the HV cannot control (like temperature).

On the one hand, you have these high-performance CPUs that cannot do without heat throttling (or rather, cold-boosting :) ).
On the other hand, our own multi-core MIPS86 processor is completely deterministic in visible state (modulo interference, which I am setting aside for the time being), so we might be able to give such a guarantee (if we added a few extra registers and instructions).

Interesting problem!

Definitely :) 

Albert Cahalan

unread,
Dec 13, 2017, 11:11:35 AM12/13/17
to Michael Chapman, isa...@groups.riscv.org
On 12/13/17, Michael Chapman <michael.c...@gmail.com> wrote:
> On 13-Dec-17 03:32, Albert Cahalan wrote:

>> A thread sees its own explicit actions in order. If doing SMP,
>> the problem becomes harder; the hypervisor must prevent
>> any uncontrolled interaction.
>
> So this means that supervisor and user mode code cannot use
> LR and SC instructions (and other A extension instructions as well)
> because the result, and the number of times a locking loop is
> executed, may change from execution to execution, i.e. the
> instruction count will then no longer match.
>
> So how would this be handled in practice? Modification of the guest OS
> and all user mode code to remove those instructions and replace with
> some kind of hypervisor call?

The hypervisor needs to deal with other memory ordering anyway,
along with interrupts. Is there anything else that could make LR/SC
be nondeterministic? Memory ordering can be simply "no SMP" or
it can be something more complicated, resembling a distributed
shared memory implementation based on faulting pages.

Modification of the guest is highly undesirable. RISC-V can have
variable-width instructions, so jumping into the middle of an
instruction is possible, and modifications to one instruction may
affect another instruction.

> Or should the hypervisor have the option of making those instructions
> privileged to be able to trap them (ISA spec change)?

I don't think it is needed in this case, but generally that is a very good
ability to have for any instruction that is weird and for any group of
instructions that might behave differently on different hardware.

Weird instructions could include: processor identification, timekeeping,
random number generation, query of privilege level...

Instructions that behave differently on different hardware would
include any instruction set extensions, including bits that are hardwired
in some implementations but not in others. Lack of ability to downgrade
the processor means that newer hardware can not run replays that
were created on older hardware. Whether this matters or not would
depend on the use case.

Albert Cahalan

unread,
Dec 13, 2017, 11:26:22 AM12/13/17
to Jonas Oberhauser, Michael Chapman, RISC-V ISA Dev
On 12/13/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:

> Due to subtle timing issues such as memory latency changing based on temp,
> you can have different outcomes, e.g.,
>
> cycle 0: LOAD x begins
> cycle 4: LOAD x ends

Dealing in terms of cycles is not ideal. One would have to assume that
some error occurs. Things can still be made to work if the timing is
"close", since the hypervisor has the option to retry any failure.
If memory state is incorrect, back up and try again. Maybe undershoot
on purpose, then single-step to the desired point in execution.

One would much rather deal in terms of an "instructions retired" counter.
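
To make that concrete, the replay side might look roughly like this (a
sketch in C with invented helper names -- program_instret_trap,
run_guest_until_trap, read_guest_instret, single_step_guest -- not any
real hypervisor API):

#include <stdint.h>

#define UNDERSHOOT_MARGIN 64    /* stop this many instructions early */

extern void program_instret_trap(uint64_t value);
extern void run_guest_until_trap(void);
extern uint64_t read_guest_instret(void);
extern void single_step_guest(void);

/* Position the guest exactly at the recorded event. */
void replay_to(uint64_t target_instret)
{
    /* Run at full speed until shortly before the target... */
    program_instret_trap(target_instret - UNDERSHOOT_MARGIN);
    run_guest_until_trap();

    /* ...then single-step the rest of the way. */
    while (read_guest_instret() < target_instret)
        single_step_guest();
}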

> If you have to trap every load and store, you can forget performance.

Sure. In that case, you'd JIT or emulate instead.

Note that memory operations normally have locality. For a guest OS
that is reasonably optimized, a single trap can lead to numerous
memory operations before the hypervisor has to take the page away.
It's as if the pages are 4096-byte cache lines bouncing between cores.

Michael Chapman

unread,
Dec 13, 2017, 11:39:13 AM12/13/17
to Albert Cahalan, Jonas Oberhauser, RISC-V ISA Dev
So what is needed today in RISC-V which is not there at the moment?

As far as I can tell what is needed is

1) the possibility to write the instret counter when in machine and
hypervisor modes (to be able to replay) and the possibility to precisely
trap to the hypervisor when the instret counter equals a certain value
(to be able to replay traps/interrupts)

2) the possibility to prohibit any memory state changes due to
speculation (e.g. changing A/D bits in PTEs)

Have I missed anything?

Jonas Oberhauser

unread,
Dec 13, 2017, 1:02:23 PM12/13/17
to Michael Chapman, Albert Cahalan, RISC-V ISA Dev
TLB evictions have to be deterministic.
Translations have to be deterministic. 

With the discussed meaning of the word "deterministic".

You can make stronger guarantees, but they may have performance implications.

Jonas Oberhauser

unread,
Dec 13, 2017, 2:08:59 PM12/13/17
to Albert Cahalan, Michael Chapman, RISC-V ISA Dev
I'm not suggesting that the HV count cycles. As you say, that would be foolish.

I'm saying that if you want to replay two threads on a parallel machine, the number of cycles that the memory takes for some operation -- which you cannot influence as the HV -- makes a difference, destroying replayability.


On Dec 13, 2017 5:26 PM, "Albert Cahalan" <acah...@gmail.com> wrote:


> If you have to trap every load and store, you can forget performance.

Sure. In that case, you'd JIT or emulate instead.

Note that memory operations normally have locality. For a guest OS
that is reasonably optimized, a single trap can lead to numerous
memory operations before the hypervisor has to take the page away.
It's as if the pages are 4096-byte cache lines bouncing between cores.

If I understand you correctly, you would like to cause page faults in the program I posted, so that you can record the order of memory accesses and replay them correctly?
How exactly can you implement that? My understanding is that the OS can share the page between its users and there is little you can do about it.

Jonas Oberhauser

unread,
Dec 13, 2017, 5:09:14 PM12/13/17
to Michael Chapman, Albert Cahalan, RISC-V ISA Dev
Actually I now think this is not enough if the MMU can see memory operations of the CPU out of order (which is what the memory model working group believes). You could get the same problem as with the loads/stores if the software can race with its own MMU, i.e., the ordering between a PTE update by SW and a PTE fetch by the MMU could differ between record and replay.

I have reread the MMU spec with more attention to detail, and I see now that the TLB is invisible in the spec, in the sense that the virtual address is always translated with the up-to-date PTEs (at the point of the access). That does not fix the problem, although it implies that TLB evictions and translations are deterministic "enough".

I think the intention of the spec is to have the memory translation take place as-if directly before the memory access in the global memory order. The global memory order can be changed by timing issues, causing the problem above.

One way to have exact replay with this is to share the memory view between MMU and CPU. In other words, MMU accesses are considered to be in program order right before the CPU access that uses them.
This comes with the costs discussed elsewhere (TLB misses basically require a pipeline drain, SFENCE is a NOP). 

Another way might be to make the TLB visible in the spec (and deterministic), and (during recording) to interrupt on each TLB miss by forcing page faults (which is slow: you have to endure LEVELS page faults, then execute one SV/user instruction, then make the pages unavailable again for the next TLB miss).
In this case the TLB miss tells you the order, and the interrupt plus (possibly) fences enforce program order on MMU accesses.
This way you can add replayability.

I currently gravitate towards the former suggestion, since it will at least not break existing SW.

Albert Cahalan

unread,
Dec 13, 2017, 8:46:54 PM12/13/17
to Jonas Oberhauser, Michael Chapman, RISC-V ISA Dev
On 12/13/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:

> I'm saying that if you want to replay two threads in a parallel
> machine, the number of cycles that the memory takes on
> some operation -- which you can not influence as HV -- makes
> a difference, destroying replayability.
...
> If I understand you correctly, you would like to cause page
> faults in the program I posted, so that you can record the
> order of memory accesses and replay them correctly?
> How exactly can you implement that? My understanding is
> that the OS can share the page between its users and there
> is little you can do about it.

In your example, I don't know if x and y are on the same page.

In any case, the hypervisor implements a cache coherency
protocol within itself. The guest OS may set a page to be
read-write for two threads on two different cores, but that
won't be what happens if the hypervisor steps in. If the page
is writable, the hypervisor will make it inaccessible on other
cores. When one of those other cores accesses the page,
it faults into the hypervisor. The hypervisor can then take the
page away from the first core. These actions taken by the
hypervisor are of course recorded for proper replay. If a core
runs unusually fast during replay, it will reach the instructions
early and then be made to wait for other cores to catch up.
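
In rough C, the per-page bookkeeping might look like this (a sketch;
every name is invented for illustration, not taken from RISC-V or any
real hypervisor):

enum page_state {
    PAGE_EXCLUSIVE,             /* writable on exactly one core */
    PAGE_SHARED,                /* read-only, any number of cores */
    PAGE_UNAVAILABLE            /* mapped nowhere */
};

struct hv_page {
    enum page_state state;
    int owner;                  /* valid when PAGE_EXCLUSIVE */
};

extern void revoke_on_all_cores(struct hv_page *pg);
extern void map_on_core(struct hv_page *pg, int core, int writable);
extern void log_transfer(struct hv_page *pg, int core, int writable);

/* Called from the hypervisor's second-level page-fault handler. */
void on_guest_fault(struct hv_page *pg, int core, int is_write)
{
    if (is_write) {
        /* A writer needs exclusivity: take the page from everyone. */
        revoke_on_all_cores(pg);
        pg->state = PAGE_EXCLUSIVE;
        pg->owner = core;
    } else if (pg->state != PAGE_SHARED) {
        /* A reader: downgrade a writable copy or bring the page back. */
        revoke_on_all_cores(pg);
        pg->state = PAGE_SHARED;
    }
    map_on_core(pg, core, is_write);
    log_transfer(pg, core, is_write);   /* recorded for replay */
}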

It can be fast if the guest OS is normal. If the guest does the
sort of things that would kill performance on large NUMA
systems, then things won't go so well.

Good hardware will translate guest OS memory addresses
twice, once as desired by the guest OS and once as desired
by the hypervisor. Lacking this, the hypervisor can create
page tables that perform the translation in one step, but then
it will have to detect and act upon any access the guest OS
makes to the page tables within the guest.

Jonas Oberhauser

unread,
Dec 14, 2017, 1:59:05 AM12/14/17
to Albert Cahalan, Michael Chapman, RISC-V ISA Dev


On Dec 14, 2017 2:46 AM, "Albert Cahalan" <acah...@gmail.com> wrote:
On 12/13/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:

In any case, the hypervisor implements a cache coherency
protocol within itself. The guest OS may set a page to be
read-write for two threads on two different cores, but that
won't be what happens if the hypervisor steps in. If the page
is writable, the hypervisor will make it inaccessible on other
cores.

How?



Alex Elsayed

unread,
Dec 14, 2017, 3:15:32 AM12/14/17
to RISC-V ISA Dev
The vNUMA approach[1] is one option: disjoint memory, explicitly messaged. Scales pretty nicely for virtualizing huge systems, too.

Albert Cahalan

unread,
Dec 14, 2017, 3:40:10 AM12/14/17
to Jonas Oberhauser, Michael Chapman, RISC-V ISA Dev
The details depend on the hypervisor support provided
by the hardware, but it's a combination of these two things:

https://en.wikipedia.org/wiki/Distributed_shared_memory
https://en.wikipedia.org/wiki/Second_Level_Address_Translation

An example with hardware supporting nested page tables:

Page translation first goes via page tables that the guest OS
is aware of. The resulting "physical" address isn't really physical.
It is a virtual address from the perspective of the hypervisor.
The hypervisor maintains page tables that the guest is unaware
of. The CPU does a second translation in these tables to get the
real physical address. The hypervisor maintains a distinct set
of page tables on each core. When the second translation fails,
it is resolved by the hypervisor. Permissions in the second set
of page tables are set to ensure that each page is in one of the
following states: writable and exclusive to a single core, shared
by all cores and not writable, unavailable to all cores.

An example without hardware support for nested page tables:

The guest OS maintains page tables, but they are never used
for page translation. A different set of page tables, maintained
by the hypervisor, is used instead. This set of page tables acts
as a combination of both sets in the prior example, combining
both translations into one step. When the guest OS attempts
to modify a page table, that access faults into the hypervisor.
The hypervisor then: disables the corresponding page table
maintained by the hypervisor, removes any TLB entries for it,
makes the guest OS's page table accessible to the guest, and
returns into the guest. When the guest tries to access memory
that would be mapped by such a page table, it traps back into
the hypervisor. The hypervisor then: generates a hypervisor
page table that corresponds to the one modified by the guest,
makes the one modified by the guest inaccessible to the guest,
and returns into the guest.
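
As a sketch, the two trap paths might look like this (helper names
invented; "shadow" means the hypervisor's combined page table, and
real code would also have to handle the SMP and TLB details):

#include <stdint.h>

extern void disable_shadow_for(uint64_t guest_pt_page);
extern void flush_tlb_for(uint64_t guest_pt_page);
extern void make_guest_page_writable(uint64_t guest_pt_page);
extern void make_guest_page_inaccessible(uint64_t guest_pt_page);
extern void rebuild_shadow_from(uint64_t guest_pt_page);

/* The guest wrote to a page we know holds one of its page tables. */
void on_guest_pt_write(uint64_t guest_pt_page)
{
    disable_shadow_for(guest_pt_page);        /* stop using the stale shadow */
    flush_tlb_for(guest_pt_page);             /* drop its TLB entries */
    make_guest_page_writable(guest_pt_page);  /* let the guest edit it */
    /* return into the guest */
}

/* The guest touched memory mapped by a guest page table whose shadow
 * is currently disabled. */
void on_guest_access(uint64_t guest_pt_page)
{
    rebuild_shadow_from(guest_pt_page);           /* fold both levels */
    make_guest_page_inaccessible(guest_pt_page);  /* trap future edits */
    /* return into the guest */
}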

Clearly, it isn't trivial software. SMP makes things much harder.
Hardware bugs and misfeatures make things much harder too.
Hardware features determine whether performance lands at "don't
bother", "barely OK", or "good". For example, A/D bit troubles can
make nested page table support unusable or difficult to use.

Jonas Oberhauser

unread,
Dec 14, 2017, 3:41:37 AM12/14/17
to Alex Elsayed, RISC-V ISA Dev
I meant, how do you implement this on RISC-V?
I guess you could put a different atp value on each core. This way you would get page faults when a core accesses a new page, and could then shoot down the mapping on all other cores that currently have that page present.


Jonas Oberhauser

unread,
Dec 14, 2017, 3:46:54 AM12/14/17
to Albert Cahalan, Michael Chapman, RISC-V ISA Dev
2017-12-14 9:40 GMT+01:00 Albert Cahalan <acah...@gmail.com>:
On 12/14/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
> On Dec 14, 2017 2:46 AM, "Albert Cahalan" <acah...@gmail.com> wrote:

>> In any case, the hypervisor implements a cache coherency
>> protocol within itself. The guest OS may set a page to be
>> read-write for two threads on two different cores, but that
>> won't be what happens if the hypervisor steps in. If the page
>> is writable, the hypervisor will make it inaccessible on other
>> cores.
>
> How?
 
The hypervisor maintains a distinct set
of page tables on each core. When the second translation fails,
it is resolved by the hypervisor. Permissions in the second set
of page tables are set to ensure that each page is in one of the
following states: writable and exclusive to a single core, shared
by all cores and not writable, unavailable to all cores.

Yep, I see that now.

The problem is that if you have the MMU and CPU on the same core racing each other due to, e.g., weak memory ordering/out-of-order execution in a pipeline, you cannot use this approach.

Clearly, it isn't trivial software. SMP makes things much harder.
Hardware bugs and misfeatures make things much harder too.
Hardware features affect speed: don't bother, barely OK, good.
For example, A/D bit troubles can make nested page table
support unusable or difficult to use.

I get it. You already mentioned that with the "page exclusive" approach above, you can still record and replay reasonable multi-core programs -- did I get that right?
 

Michael Chapman

unread,
Dec 14, 2017, 4:03:38 AM12/14/17
to Albert Cahalan, Jonas Oberhauser, RISC-V ISA Dev
So what are the required and desirable (for performance reasons)
features to make all this possible?

So far I have seen just two:-

1) Possibility to write instret (machine mode, hypervisor) and to trap
into the hypervisor on it reaching a certain value

2) Prohibiting any kind of speculative update of memory by the HW (as far
as I am aware, this only applies to PTE A/D bits)

Is the weak memory model as defined by the working group OK for
replay? Or is something stronger needed?

Ensuring that replay is possible without throttling performance seems
very desirable. A "free" replay possibility in production hypervisors
seems a great feature.

Albert Cahalan

unread,
Dec 14, 2017, 4:37:47 AM12/14/17
to Michael Chapman, Jonas Oberhauser, RISC-V ISA Dev
On 12/14/17, Michael Chapman <michael.c...@gmail.com> wrote:

> So what are the required and desirable (for performance reasons)
> features to make all this possible?
>
> So far I have seen just two:-
>
> 1) Possibility to write instret (machine mode, hypervisor) and to trap
> into the hypervisor on it reaching a certain value

This also has to not bleed over from one privilege level to another.
It has to freeze when going from guest to hypervisor.

> 2) Prohibiting any kind of speculative update of memory by the HW (as far
> as I am aware, this only applies to PTE A/D bits)
>
> Is the weak memory model as defined by the working group OK for
> replay? Or is something stronger needed?

I looked over the memory model stuff. I nervously believe that
the weak model could work. I don't recall seeing anything about
privilege changes causing full synchronization, but that would
be needed.

> Ensuring that replay is possible without throttling performance seems
> very desirable. A "free" replay possibility in production hypervisors
> seems a great feature.

Nested page tables are mighty useful:
https://en.wikipedia.org/wiki/Second_Level_Address_Translation

For every weird register (cycle counter, randomness source,
performance monitoring registers, etc.) the hypervisor needs a
way to force the correct values.

To replay on newer hardware, the hypervisor must be able to disable
any newer instructions. (they must fault as they did on the older hardware)
To replay on older hardware, missing instructions must fault to the
hypervisor for emulation.

Generally, the hypervisor should be able to catch faults of various
types, including ones the guest expects to handle.
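
For the weird registers, the trap-and-emulate path might look roughly
like this (a sketch; replaying, log_next, log_append and read_real are
invented names):

#include <stdint.h>
#include <stdbool.h>

extern bool replaying;
extern uint64_t log_next(int which);
extern void log_append(int which, uint64_t v);
extern uint64_t read_real(int which);   /* rdcycle, RNG MMIO, hpm... */

/* Trap handler for a guest read of a nondeterministic source. */
uint64_t emulate_weird_read(int which)
{
    uint64_t v;

    if (replaying) {
        v = log_next(which);        /* force the recorded value */
    } else {
        v = read_real(which);
        log_append(which, v);       /* record it for later replay */
    }
    return v;                       /* goes into the guest's rd */
}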

Jonas Oberhauser

unread,
Dec 14, 2017, 10:55:08 AM12/14/17
to Michael Chapman, Albert Cahalan, RISC-V ISA Dev
2017-12-14 10:03 GMT+01:00 Michael Chapman <michael.c...@gmail.com>:
So what are the required and desirable (for performance reasons)
features to make all this possible?

So far I have seen just two:-

1) Possibility to write instret (machine mode, hypervisor) and to trap
into the hypervisor on it reaching a certain value

2) Any kind of speculative update of memory by the HW (as far as I am
aware, this only applies to PTE A/D bits)

We currently do not have speculative A/D bits (and we currently put A/D bits into the local program order). So this is not something that has to be changed; it is something that needs to be kept the way it is.

 
Ensuring that replay is possible without throttling performance seems
very desirable. A "free" replay possibility in production hypervisors
seems a great feature.

I'd like to point out that recording is *not* possible currently without throttling performance, but according to Albert the performance penalty is small enough to still be able to run reasonable/most software.
The throttling currently consists of:
1) not allowing the guest/users to directly modify page tables
2) bringing the memory model down to sequential consistency, in particular by making each writable page exclusive to one core at a time and having memory barriers when transferring a page to another core.

Jacob Bachmeyer

unread,
Dec 14, 2017, 7:28:25 PM12/14/17
to Albert Cahalan, Jonas Oberhauser, Michael Chapman, RISC-V ISA Dev
Albert Cahalan wrote:
> On 12/14/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>
>> On Dec 14, 2017 2:46 AM, "Albert Cahalan" <acah...@gmail.com> wrote:
>>
>>> In any case, the hypervisor implements a cache coherency
>>> protocol within itself. The guest OS may set a page to be
>>> read-write for two threads on two different cores, but that
>>> won't be what happens if the hypervisor steps in. If the page
>>> is writable, the hypervisor will make it inaccessible on other
>>> cores.
>>>
>> How?
>>
>
> The details depend on the hypervisor support provided
> by the hardware, but it's a combination of these two things:
>
> https://en.wikipedia.org/wiki/Distributed_shared_memory
> https://en.wikipedia.org/wiki/Second_Level_Address_Translation
>

The current proposed RISC-V Hypervisor support uses two-level address
translation. It is Chapter 5 in the privileged ISA spec draft. How
does RVH compare to what your application needs?


-- Jacob



Jacob Bachmeyer

unread,
Dec 14, 2017, 7:40:36 PM12/14/17
to Albert Cahalan, Michael Chapman, Jonas Oberhauser, RISC-V ISA Dev
Albert Cahalan wrote:
> On 12/14/17, Michael Chapman <michael.c...@gmail.com> wrote:
>
>
>> So what are the required and desirable (for performance reasons)
>> features to make all this possible?
>>
>> So far I have seen just two:-
>>
>> 1) Possibility to write instret (machine mode, hypervisor) and to trap
>> into the hypervisor on it reaching a certain value
>>
>
> This also has to not bleed over from one privilege level to another.
> It has to freeze when going from guest to hypervisor.
>

Earlier drafts of the privileged ISA spec had delta registers that each
privilege level could use to set offsets for {h,s,u}{instret,cycle,...}
that would be applied to lower privilege levels. Those were removed and
replaced with *counteren CSRs. Would allowing independent writable
counters help here? I am suggesting having sinstret, for example, only
count in S-mode and U-mode and be writable in H-mode and M-mode. The
*counteren CSRs would then change to instead control whether each
counter counts in more-privileged modes or is frozen while lower
implementation levels run. (M-mode == highest privilege == lowest
implementation)
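
To make the proposed semantics concrete, here is a toy simulator model
(my own sketch; the names and encodings are placeholders, not spec text):

#include <stdint.h>

enum priv { U_MODE = 0, S_MODE = 1, H_MODE = 2, M_MODE = 3 };

struct hart {
    enum priv cur_priv;
    uint64_t sinstret;      /* counts only while in S-mode or U-mode */
};

extern void raise_illegal_instruction(void);

void on_instruction_retired(struct hart *h)
{
    if (h->cur_priv <= S_MODE)
        h->sinstret++;      /* frozen while H-mode or M-mode runs */
}

void csr_write_sinstret(struct hart *h, uint64_t value)
{
    if (h->cur_priv >= H_MODE)
        h->sinstret = value;    /* writable from more-privileged modes */
    else
        raise_illegal_instruction();
}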

>> Ensuring that replay is possible without throttling performance seems
>> very desirable. A "free" replay possibility in production hypervisors
>> seems a great feature.
>>
>
> Nested page tables are mighty useful:
> https://en.wikipedia.org/wiki/Second_Level_Address_Translation
>

The current RVH draft specifies these.

> For every weird register (cycle counter, randomness source,
> performance monitoring registers, etc.) the hypervisor needs a
> way to force the correct values.
>

Would what I suggested above work for the counters? (A randomness
source would be a non-idempotent MMIO read in RISC-V, so memory
protection handles that.)

> To replay on newer hardware, the hypervisor must be able to disable
> any newer instructions. (they must fault as they did on the older hardware)
>

This could be a problem. On the one hand, we have writable misa now,
but on the other hand, that only applies to entire extensions. This
level of fine-grained backwards compatibility will end up needing a
very large number of flags, and I am unsure whether the benefits
outweigh the costs. Perhaps an hisa CSR that can disable extensions in the guest
S-mode to permit coarse-grained trap-and-emulate combined with the
freezable counters could help?

> To replay on older hardware, missing instructions must fault to the
> hypervisor for emulation.
>

This always happens in RISC-V.

> Generally, the hypervisor should be able to catch faults of various
> types, including ones the guest expects to handle.
>

This is also part of the current RISC-V baseline -- trap delegation can
be performed in hardware or software.



-- Jacob

Albert Cahalan

unread,
Dec 14, 2017, 11:27:57 PM12/14/17
to jcb6...@gmail.com, Jonas Oberhauser, Michael Chapman, RISC-V ISA Dev
On 12/14/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> The current proposed RISC-V Hypervisor support uses two-level address
> translation. It is Chapter 5 in the privileged ISA spec draft. How
> does RVH compare to what your application needs?

It's a complete failure, given that the chapter currently reads:


Chapter 5
Hypervisor Extensions, Version 0.0

This chapter is a placeholder for RISC-V hypervisor support with an
extended S-mode.

The privileged architecture is designed to simplify the use
of classic virtualization techniques, where a guest OS is run
at user-level, as the few privileged instructions can be
easily detected and trapped.

Albert Cahalan

unread,
Dec 15, 2017, 12:12:32 AM12/15/17
to jcb6...@gmail.com, Michael Chapman, Jonas Oberhauser, RISC-V ISA Dev
On 12/14/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Albert Cahalan wrote:
>> On 12/14/17, Michael Chapman <michael.c...@gmail.com> wrote:

>>> 1) Possibility to write instret (machine mode, hypervisor) and
>>> to trap into the hypervisor on it reaching a certain value
>>
>> This also has to not bleed over from one privilege level to
>> another. It has to freeze when going from guest to hypervisor.
>
> Earlier drafts of the privileged ISA spec had delta registers that each
> privilege level could use to set offsets for {h,s,u}{instret,cycle,...}
> that would be applied to lower privilege levels. Those were removed and
> replaced with *counteren CSRs. Would allowing independent writable
> counters help here? I am suggesting having sinstret, for example, only
> count in S-mode and U-mode and be writable in H-mode and M-mode. The
> *counteren CSRs would then change to instead control whether each
> counter counts in more-privileged modes or is frozen while lower
> implementation levels run. (M-mode == highest privilege == lowest
> implementation)

I think the spec has an inconsistency:

3.1.17 Counter-Enable Registers ([m|h|s]counteren)
Figure 3.18: Counter-enable registers (mcounteren and scounteren).

The hcounteren register is both there and not there. Hey, quantum! :-)

It looks like the enable registers let a hypervisor emulate the use
of these registers for less-privileged levels via trapping. This is
good. The deltas would also be good, and having both would give
the hypervisor a choice.

That only covers providing such resources for the less-privileged level.
I don't see anything that would stop or capture the count upon trapping
into the hypervisor. (with deltas, they could begin incrementing)
I suppose very careful counting of the instruction path could make things
workable, but that would be fragile code to maintain. The register would
be read by the hypervisor as soon as possible, and then a known value
corresponding to the hypervisor instructions already executed could be
subtracted off. For a hypervisor that loads as a kernel module, this may
require some seriously invasive hot-patching of the host OS kernel.
The same can apply on the path out of the hypervisor, depending on
how that is done. Although baroque in operation, Intel's vmrun opcode
nicely lets the hypervisor code be all in one spot where it can be carefully
maintained assembly code instead of spread out across trap handlers.

I don't see a way to trap on an arbitrary value. If the less-privileged level
can see the current value, then the more-privileged level needs the
ability to trap on an arbitrary value. Otherwise, with the less-privileged
code unable to see what is going on, it is sufficient to trap on a single
fixed value (such as 0 or -1 or wrap-around) and let the hypervisor start
from a value that will trap after the desired count.
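
The arithmetic of the fixed-value variant is simple (a sketch; the CSR
accessor is invented, and I assume the trap fires at wrap-around to zero):

#include <stdint.h>

extern void csr_write_guest_instret(uint64_t value);

/* Arrange for a trap after exactly N more guest instructions. */
void arm_trap_after(uint64_t n_instructions)
{
    /* 2^64 - N: the counter wraps to zero at the desired count. */
    csr_write_guest_instret(0ULL - n_instructions);
}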

>> Nested page tables are mighty useful:
>> https://en.wikipedia.org/wiki/Second_Level_Address_Translation
>
> The current RVH draft specifies these.

Chapter 5 is an empty placeholder.

>> To replay on newer hardware, the hypervisor must be able
>> to disable any newer instructions. (they must fault as they
>> did on the older hardware)
>
> This could be a problem. On the one hand, we have writable misa
> now, but on the other hand, that only applies to entire extensions.
> This level of fine granularity backwards compatibility will end up
> needing a very large number of flags and I am unsure if the costs
> outweigh the benefits. Perhaps an hisa CSR that can disable
> extensions in the guest S-mode to permit coarse-grained
> trap-and-emulate combined with the freezable counters could help?

The finer control the better, but I understand that gets annoying.
Instructions could be grouped by extension and year of introduction.

>> To replay on older hardware, missing instructions must fault to the
>> hypervisor for emulation.
>
> This always happens in RISC-V.

Good. An example of trouble is the L bit in the PowerPC cmp
instruction. This chooses 32-bit or 64-bit comparison. It is an
invalid bit on 32-bit hardware, but some hardware doesn't check
and some broken programs have the bit improperly set. This is
trouble beyond fancy hypervisors of course; it can block an
ordinary program from running on upgraded hardware.

Jacob Bachmeyer

unread,
Dec 15, 2017, 9:06:10 PM12/15/17
to Albert Cahalan, Michael Chapman, Jonas Oberhauser, RISC-V ISA Dev
Albert Cahalan wrote:
> On 12/14/17, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Albert Cahalan wrote:
>>
>>> On 12/14/17, Michael Chapman <michael.c...@gmail.com> wrote:
>>>
>>>> 1) Possibility to write instret (machine mode, hypervisor) and
>>>> to trap into the hypervisor on it reaching a certain value
>>>>
>>> This also has to not bleed over from one privilege level to
>>> another. It has to freeze when going from guest to hypervisor.
>>>
>> Earlier drafts of the privileged ISA spec had delta registers that each
>> privilege level could use to set offsets for {h,s,u}{instret,cycle,...}
>> that would be applied to lower privilege levels. Those were removed and
>> replaced with *counteren CSRs. Would allowing independent writable
>> counters help here? I am suggesting having sinstret, for example, only
>> count in S-mode and U-mode and be writable in H-mode and M-mode. The
>> *counteren CSRs would then change to instead control whether each
>> counter counts in more-privileged modes or is frozen while lower
>> implementation levels run. (M-mode == highest privilege == lowest
>> implementation)
>>
>
> I think the spec has an inconsistency:
>
> 3.1.17 Counter-Enable Registers ([m|h|s]counteren)
> Figure 3.18: Counter-enable registers (mcounteren and scounteren).
>
> The hcounteren register is both there and not there. Hey, quantum! :-)
>

That was removed in commit ce6437f9 a few days ago.

> It looks like the enable registers let a hypervisor emulate the use
> of these registers for less-privileged levels via trapping. This is
> good. The deltas would also be good, and having both would give
> the hypervisor a choice.
>
> That only covers providing such resources for the less-privileged level.
> I don't see anything that would stop or capture the count upon trapping
> into the hypervisor. (with deltas, they could begin incrementing)
> I suppose very careful counting of the instruction path could make things
> workable, but that would be fragile code to maintain. The register would
> be read by the hypervisor as soon as possible, and then a known value
> corresponding to the hypervisor instructions already executed could be
> subtracted off. For a hypervisor that loads as a kernel module, this may
> require some seriously invasive hot-patching of the host OS kernel.
> The same can apply on the path out of the hypervisor, depending on
> how that is done. Although baroque in operation, Intel's vmrun opcode
> nicely lets the hypervisor code be all in one spot where it can be carefully
> maintained assembly code instead of spread out across trap handlers.
>

I am proposing changing the meaning of *counteren, so that a "disabled"
counter remains accessible to less-privileged modes and counts in
less-privileged modes, but is frozen and writable in more-privileged
modes. There would be no way to trap on counter access, but lower
implementation levels can load the counters with any desired values.

> I don't see a way to trap on an arbitrary value. If the less-privileged level
> can see the current value, then the more-privileged level needs the
> ability to trap on an arbitrary value. Otherwise, with the less-privileged
> code unable to see what is going on, it is sufficient to trap on a single
> fixed value (such as 0 or -1 or wrap-around) and let the hypervisor start
> from a value that will trap after the desired count.
>

This will require a new semi-timer interrupt using new *instretcmp CSRs.
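
By analogy with mtime/mtimecmp, the semantics might be modeled like
this (speculation on my part, not draft-spec text):

#include <stdint.h>
#include <stdbool.h>

struct hart {
    uint64_t instret;
    uint64_t instretcmp;
    bool instret_irq_pending;   /* the "semi-timer" interrupt */
};

void on_retire(struct hart *h)
{
    h->instret++;
    /* Pend the interrupt once the counter reaches the compare value,
     * just as a timer interrupt is keyed to mtime >= mtimecmp. */
    if (h->instret >= h->instretcmp)
        h->instret_irq_pending = true;
}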

>>> Nested page tables are mighty useful:
>>> https://en.wikipedia.org/wiki/Second_Level_Address_Translation
>>>
>> The current RVH draft specifies these.
>>
>
> Chapter 5 is an empty placeholder.
>

Try the Git repository at
<URL:https://github.com/riscv/riscv-isa-manual>. You will need LaTeX to
build the draft spec.

>>> To replay on newer hardware, the hypervisor must be able
>>> to disable any newer instructions. (they must fault as they
>>> did on the older hardware)
>>>
>> This could be a problem. On the one hand, we have writable misa
>> now, but on the other hand, that only applies to entire extensions.
>> This level of fine granularity backwards compatibility will end up
>> needing a very large number of flags and I am unsure if the costs
>> outweigh the benefits. Perhaps an hisa CSR that can disable
>> extensions in the guest S-mode to permit coarse-grained
>> trap-and-emulate combined with the freezable counters could help?
>>
>
> The finer control the better, but I understand that gets annoying.
> Instructions could be grouped by extension and year of introduction.
>

There are two problems with finer-grained control: the proliferation of
control bits and the complexity of instruction decoding that honors
them. Both of these could grow without bound if every little change can
be undone.


-- Jacob

Jacob Bachmeyer

unread,
Dec 15, 2017, 9:08:08 PM12/15/17
to Albert Cahalan, Jonas Oberhauser, Michael Chapman, RISC-V ISA Dev
You must be looking at an obsolete draft. Try the Git repository at
<URL:https://github.com/riscv/riscv-isa-manual>. :-)


-- Jacob
