I think I see it now: the pipeline must avoid translation hazards.
> Now of course you could instead snoop for changes made by
> others. This is a bit complicated.
> My suggestion would be that the local cache used by the TLB
> needs to
> 1) listen to global writes to PTEs used by current TLB entries
> (obviously)
> 2) but also listen to global reads (!) to PTEs used by the
> current TLB, respond with a "you are not the only one who has
> it, but I don't have it cached" (i.e., no data intervention).
> This prevents the other cache from going into a local mode
> where it can write without using the memory bus, which would
> cause us to miss these messages.
>
>
> A single wired-OR data line is sufficient for this and would
> indicate "you are reading from a live page table"
>
> At first I was going to say:
> You typically don't need an extra wire since such wires are already
> part of most cache protocols. For example in the caches implemented in
> "a pipelined multicore MIPS machine" , you would put a 1 on the Ca bus
> and a 0 on the di bus.
> This would allow one to use "normal" caches for the most part.
>
> However once you want to do more with the line than a normal cache
> would -- e.g., reserve it except for MMUs -- you do need the extra wire.
Reads never conflict, but a CPU reading a line held by an MMU must take
the line as "shared" rather than "owned". In the simple case, where we
do not have a special "global TLB probe" instruction to implement, any
TLB entry is effectively in "shared" cache state. When a CPU writes to
a page table, it must acquire "exclusive" cache state for that line,
which causes any TLB entries dependent on that line to be invalidated.
If we do have a "global TLB probe" (for finding unused regions to swap
out), then we need a special "shared-by-TLB" state that "global TLB
probe" can observe.
> I suggested that TLB-invalidation-on-PTE-write could have page
> granularity -- PTE updates are generally fairly rare, and more so
> for active translations. Any write to a page table invalidates
> all TLB entries derived from that page table.
>
>
> Sure. I don't think it makes a big difference in HW.
I have since realized that cacheline granularity or even individual PTE
granularity would be much preferable -- if page granularity is used, any
write to the root page table flushes the entire TLB!
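A sketch of PTE-granularity invalidation, assuming each TLB entry
records the physical address of every PTE it was derived from
(structure and names are mine):

    #include <stdint.h>
    #include <stddef.h>

    #define LEVELS 2  /* Sv32; Sv39 would use 3 */

    struct tlb_entry {
        int valid;
        uint64_t pte_pa[LEVELS];  /* PTE address used at each level */
        /* ...translation fields omitted... */
    };

    /* Drop only the entries actually derived from the written PTE.
       Page granularity would compare (write_pa >> 12) instead, so a
       write anywhere in the root page table would match -- and
       flush -- every entry. */
    void snoop_pte_write(struct tlb_entry *tlb, size_t n,
                         uint64_t write_pa)
    {
        for (size_t i = 0; i < n; i++)
            for (int lvl = 0; lvl < LEVELS; lvl++)
                if (tlb[i].valid && tlb[i].pte_pa[lvl] == write_pa)
                    tlb[i].valid = 0;
    }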
> Outer, shared caches can hold both (and can thus forward CPU PTE
> writes to MMU TLB reads without going through main memory in all
> cases), but yes, this is what I meant by "the MMU snoops the memory bus".
>
>
> You are now talking about CPUi and MMUj, right?
Also CPUi and MMUi -- in normal user execution, the MMU will access page
tables, but the CPU will access user data. Those categories are
disjoint, so the innermost caches should be distinct for CPU and MMU.
Going a step further, the MMU can use multi-level TLBs instead of
MMU-only caches. A TLB is functionally equivalent to a cache. At some
point, MMU paths to memory and CPU paths to memory join, but whether
distinct harts share CPU-only caches before CPU and MMU paths join is an
implementation choice.
Caches beyond the point where CPU and MMU paths join can forward CPU PTE
writes to MMU TLB reads, but that point may be just above a single hart
and its MMU or farther out in the cache hierarchy. At that point, an
inclusive cache is needed -- that cache must be able to respond with
data when a TLB indicates "I share this line", since the TLB does not
have the entire cacheline.
If we add a "shared-by-TLB" state to the cache coherency protocol, then
we can also add "set A" and "set D" messages that TLBs can emit to
update other copies of PTEs that they map. If we allow non-leaf A to be
set even when a translation is ultimately unsuccessful, only leaf PTEs
can ever need such an update: non-leaf PTEs are updated when the TLB
entry is loaded, and the "shared-by-TLB" state indicates that a TLB
holds a mapping, so all non-leaf A bits on that path have already been
set.
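In sketch form, using RISC-V's A at bit 6 and D at bit 7 (the message
names and interface are my own invention):

    #include <stdint.h>

    #define PTE_A (UINT32_C(1) << 6)  /* RISC-V accessed bit */
    #define PTE_D (UINT32_C(1) << 7)  /* RISC-V dirty bit */

    enum tlb_bus_msg { MSG_SET_A, MSG_SET_D };

    /* Apply a "set A"/"set D" message to a cached copy of a leaf PTE.
       The OR is idempotent, so a duplicated or replayed message is
       harmless -- the same property that makes AMOOR safe below. */
    uint32_t apply_tlb_msg(uint32_t pte, enum tlb_bus_msg m)
    {
        return pte | (m == MSG_SET_A ? PTE_A : PTE_D);
    }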
> Synchronization in high performance hardware is never easy;
> simpler implementations can use page faults and software to manage
> A and D.
>
>
> I agree but what I don't understand is why it is necessary to promise
> such strong synchronization in the first place. In my view it is
> perfectly acceptable to have speculative out of order translation.
I think that someone wanted the ability to do exact replays?
> I believe: If you also have that xRET synchronizes mops of CPUi and
> MMUi, and PTE changes either in HW or in SW invalidates affected
> entries, then any software that does not modify the PTEs for its own
> accesses will never see stale translations, and if a page has no A/D
> bits, it was never accessed, not even by "live" translations (though the
> converse is not true).
What really complicates this is the need for "busy" translations -- the
first instruction to read or write that page (and therefore set the A or
D bit) has performed its translation, but has not yet committed. I
believe the hazard we are trying to avoid is setting the A/D bit after
another hart has replaced the PTE while an instruction that uses that
PTE was "in-flight". The "in-flight" instruction then commits and the
A/D bit is set on the *new* PTE, incorrectly.
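One way to close that hazard is a generation check at commit time -- a
sketch, with invented names; a real pipeline would do this in hardware:

    #include <stdint.h>
    #include <stdbool.h>

    struct tlb_entry {
        bool valid;
        uint64_t gen;  /* bumped whenever this slot is invalidated */
        /* ...translation fields omitted... */
    };

    struct inflight_op {
        struct tlb_entry *xlate;  /* entry used at translation time */
        uint64_t gen;             /* its generation back then */
    };

    /* Set A/D bits at commit only if the entry is still the one we
       translated through; otherwise the PTE may have been replaced
       in flight and the instruction must be canceled and retried. */
    bool may_set_ad_bits(const struct inflight_op *op)
    {
        return op->xlate->valid && op->xlate->gen == op->gen;
    }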
> What to do with the primary result if committing an
> instruction
> and its A bits spans cycles and the invalidation occurs in the
> middle of the commit is an interesting question. Cancel it
> entirely? Let it "sneak under the wire"?
>
>
> In-pipeline you can probably prevent this using some stalling
> -- only begin one access if no conflicting access is ongoing.
>
>
>
> Fortunately, AMOOR is idempotent, so no conflicting access for
> instructions in the pipeline is possible -- an instruction cannot
> conflict with itself, and translation conflicts with later
> instructions cause "busy" TLB entries to be invalidated and the
> pipeline to be flushed.
>
>
> As mentioned above, this also needs that you drain pipeline on TLB
> miss, since you can otherwise have conflicts and discover them late.
Will a page table walk ever complete quickly enough to *not* drain the
pipeline while the MMU is busy loading a TLB entry? I believe that
loading a TLB entry will have enough latency that all preceding
instructions will complete before the new translation is ready. If a
preceding instruction commits and replaces a PTE that the MMU just read
while walking the page tables, then the MMU cancels its partial
translation and starts over. If some other PTE is overwritten, the
relevant TLB entry is invalidated, but the MMU continues. If that TLB
invalidation affects a pending instruction, then the pipeline is flushed
(while the MMU is still translating) and the pipeline resumes by
retrying the instruction that caused the TLB miss or the instruction
that was canceled, whichever came earlier in the program.
To be clear, an MMU write to the page tables (an A/D update) *never*
invalidates a TLB entry.
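The walker's snoop behavior described above might look roughly like
this (a sketch; names are mine):

    #include <stdint.h>
    #include <stdbool.h>

    #define LEVELS 2

    struct walker {
        bool active;
        int level;                /* how many PTEs read so far */
        uint64_t pte_pa[LEVELS];  /* their physical addresses */
    };

    enum snoop_action { WALK_RESTART, WALK_CONTINUE };

    /* A write to a PTE the walker has already read makes the partial
       translation stale: cancel and start over from the root. Any
       other PTE write only invalidates matching TLB entries (and
       flushes the pipeline if a pending instruction depended on
       them); the walk itself continues. */
    enum snoop_action walker_snoop_pte_write(struct walker *w,
                                             uint64_t write_pa)
    {
        if (w->active)
            for (int i = 0; i < w->level; i++)
                if (w->pte_pa[i] == write_pa) {
                    w->level = 0;
                    return WALK_RESTART;
                }
        return WALK_CONTINUE;
    }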
> By others you somehow need to trick the cache system into
> preventing it, e.g., by reserving all of the PTEs and setting
> all A/D bits together.
>
>
> This is essentially the idea -- the MMU holds quasi-LR
> reservations on all PTEs that affect currently "in-flight"
> instructions. Writes (other than A/D updates) break those
> reservations, which invalidates TLB entries, which forces a
> pipeline flush.
>
>
> Are we agreed that writes by non-MMUs should not be allowed if an MMU
> holds the reservations, to prevent exactly the situation you mentioned
> (memory access during PTE invalidation)?
A CPU write to a PTE is allowed at any time, but forces all "readers"
("in-flight" instructions dependent on a translation through that PTE)
to be canceled and re-tried when their TLB entries are invalidated.
Since this is expected to be a rare conflict, flushing the pipeline and
starting over from just after the last instruction that actually
committed seems reasonable to me. An instruction that has already been
committed simply completed before the PTE was written.
The "global TLB probe" instruction effectively asks "will the A/D bits
on this mapping be set 'soon'?" The MMU reservations are "soft" -- they
do not inhibit a CPU from writing to a PTE, only ensure that the MMU is
notified of such writes.
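In sketch form (bus_read_observe_shared is a hypothetical stand-in for
the actual bus transaction, not a real primitive):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical: issue a read and report whether the wired-OR
       "shared" line was asserted by any snooper. */
    extern bool bus_read_observe_shared(uint64_t pa);

    /* A mapping is "live" iff some TLB answers the probe read with
       shared-by-TLB. Nothing here blocks a CPU write to the PTE;
       the soft reservation only guarantees the MMU will see it. */
    bool global_tlb_probe(uint64_t pte_pa)
    {
        return bus_read_observe_shared(pte_pa);
    }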
> Still, the window for this is very narrow -- the A bits get set
> after one instruction latency.
>
>
> But this latency can be big. Many cycles can occur between translation
> and retirement, including several memory bus cycles. I don't know how
> fast memory will be in typical RISC-V implementations relative to the
> CPU, but I can imagine that LEVELS PTE accesses + a fetch take a lot of
> cycles.
The ITLB entry will be in this state the longest -- the translation must
be resolved before the instruction can be fetched, and a memory access
instruction may run a DTLB entry through the same state. Then, after
the instruction is executed, up to LEVELS A bits for the ITLB and up to
LEVELS A/D bits for the DTLB must be set, closing the window. Perhaps
the hart should be able to hold off other writes to the page tables
while setting the A/D bits? These are a short burst of accesses, and
the hazard we are trying to avoid is setting an A/D bit on a replaced PTE.
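A sketch of such a commit-time burst, with hypothetical
bus_hold/bus_release and amo_or primitives standing in for the
hardware:

    #include <stdint.h>
    #include <stdbool.h>

    #define LEVELS 2
    #define PTE_A (UINT32_C(1) << 6)
    #define PTE_D (UINT32_C(1) << 7)

    extern void bus_hold(void);     /* hold off other PTE writers */
    extern void bus_release(void);
    extern void amo_or(uint64_t pa, uint32_t bits);  /* AMOOR to memory */

    /* Up to LEVELS A bits for the ITLB walk plus up to LEVELS A/D
       bits for the DTLB walk, issued back-to-back while the hart
       holds the bus, closing the window in which a replaced PTE
       could receive a stale A/D update. */
    void commit_ad_burst(const uint64_t ipte_pa[LEVELS],
                         const uint64_t dpte_pa[LEVELS], bool is_store)
    {
        bus_hold();
        for (int i = 0; i < LEVELS; i++)
            amo_or(ipte_pa[i], PTE_A);
        for (int i = 0; i < LEVELS; i++) {
            uint32_t bits = PTE_A;
            if (is_store && i == LEVELS - 1)
                bits |= PTE_D;  /* D is only meaningful on the leaf */
            amo_or(dpte_pa[i], bits);
        }
        bus_release();
    }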
> I think now you need to be a bit smart on how you build your
> thread scheduler and your page eviction scheduler to really
> prevent the above situation, namely there must be no symmetry
> (eventually the next page to be swapped out must be from
> another thread than the next one to be scheduled).
>
>
> A scheduler that swaps out pages belonging to runnable tasks has a
> pretty serious bug anyway -- that is an amazingly good recipe for
> VM swap thrashing. :-)
>
>
> You may have no other choice if you only have a few tasks and they are
> all runnable. I take it from you that this never ever happens because
> there are a huge amount of tasks always?
In my experience (mostly desktop and server GNU/Linux), yes, ~100 tasks
and usually <10 runnable. In an RTOS, there is no swap. :-)
Admittedly, if your runnable tasks are all memory hogs, and your working
set does not fit in RAM, you are toast no matter what.
> PS: The situation with the miss/pagefaults reminds me a bit of
> when I try to pass by someone in the aisles of a store; I make
> a step left, he makes a step right. I make a step right, he
> makes a step left. This goes on for a couple of iterations... ;)
>
>
> I think that relative speeds between processor core and memory bus
> is part of the trick here -- the processor can execute several
> instructions before the next memory bus cycle can start, so as
> long as the processor can complete fetching *one* cacheline worth
> of instructions, it can make forward progress by executing from
> that cacheline before another processor can initiate yet another
> write.
>
>
> No, because to begin executing instructions you need to translate,
> which takes several cycles and could be aborted when the first write
> starts, at which point you get a pf.
Fetching the cacheline requires a translation to know which physical
cacheline to pull from main memory. Once that is done, execution from
that cacheline can begin. Of course, if the instruction to be executed
is a memory access, another translation is needed, and there is an
opportunity to replace a PTE that was used in fetching that cacheline,
and we go round all over again. I believe that this requires that the
memory bus be almost saturated with repeated PTE replacements, which is a
pathological scenario.
Forward progress is possible, assuming that the processor can complete a
translation at all. In other words, we are not talking about starving
instruction fetch, but starving the MMU.
> Of course if like you suggested above we guarantee for the whole time
> that the PTEs are reserved for the MMUs, the fetch translation will
> progress and the cacheline can be fetched (and subsequently be
> executed, even if in the meantime the PTEs get invalidated -- yay for
> separate fetch and execute). My worries will as we Germans say
> "evaporate into a cloud of pleasure" (it sounds better in German, I
> promise).
So we need a memory controller that (globally) prioritizes MMU PTE fetch
above CPU writes and will permit an MMU to complete a translation before
allowing other nodes to use the bus again. If we also assign the next
memory cycle to the processor that just performed a translation and give
the same processor priority for one more translation and one more memory
bus cycle before giving the next processor priority (round-robin), I
*think* that we can guarantee forward progress unconditionally. This is
starting to sound like the dining philosophers problem: an RV32
processor needs three or four memory cycles to begin execution (fetch
PTE from root page table, fetch PTE from intermediate page table, fetch
instruction, execute instruction) with no intervening writes to the page
tables.
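A sketch of that arbitration rule (NHARTS, the "boosted" field, and
all names are mine; the controller sets boosted to the hart that just
completed a translation and clears it once used):

    #include <stdbool.h>

    #define NHARTS 4

    struct arbiter {
        int rr_next;  /* round-robin pointer */
        int boosted;  /* hart that just completed a translation, or -1 */
    };

    /* Pick the next bus master. mmu_walking[i] is true while hart i's
       MMU has a page table walk in progress; such a walk completes
       before any other agent -- in particular a PTE writer -- gets
       the bus. */
    int arbitrate(struct arbiter *a, const bool req[NHARTS],
                  const bool mmu_walking[NHARTS])
    {
        for (int i = 0; i < NHARTS; i++)  /* 1. finish walks first */
            if (mmu_walking[i] && req[i])
                return i;
        if (a->boosted >= 0 && req[a->boosted])
            return a->boosted;            /* 2. post-translation boost */
        for (int k = 0; k < NHARTS; k++) {
            int i = (a->rr_next + k) % NHARTS;
            if (req[i]) {                 /* 3. plain round-robin */
                a->rr_next = (i + 1) % NHARTS;
                return i;
            }
        }
        return -1;                        /* bus idle */
    }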
I still suspect that this MMU starvation is an edge case that can only
happen with programs specially written to cause it.
-- Jacob