Robert O'Callahan wrote:
> I'm the lead developer of rr, a low overhead record-and-replay-based
> debugger for arbitrary user-space processes on Linux/x86. People use
> it to debug applications like Firefox and QEMU with reverse execution.
> Details: https://arxiv.org/abs/1705.05937
>
> Out of curiosity I had a quick look to see whether rr would be
> implementable on RISC-V. It generally looks delightful but there are a
> few obvious issues for rr:
>
> The cycle/time/instret CSRs are exposed to user-space unconditionally.
> I see no way to force them to trap, so it would be impossible to
> record and replay code that reads these. Compare to x86, where you can
> configure the processor to trap on RDTSC and friends.
The [ms]counteren CSRs (mcounteren and scounteren) are what you seek.
They are defined in the privileged ISA spec, in the "Counter-Enable
Registers" sections. Clearing a counter's bit there makes reads of that
counter from less-privileged modes trap as illegal instructions.
Previously, there were *delta CSRs that permitted setting offsets that
would be applied to those counters. I have also suggested simply making
the counters writable to privileged code.
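
As a rough illustration of the counter-enable check, here is a minimal C
model of the decision for a user-mode read, assuming a hart that
implements both M-mode and S-mode. The helper name umode_read_traps is
made up for this sketch; only the bit positions (CY, TM, IR) come from
the privileged spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions shared by mcounteren and scounteren:
 * CY gates cycle, TM gates time, IR gates instret. */
enum { CY_BIT = 0, TM_BIT = 1, IR_BIT = 2 };

/*
 * Hypothetical helper (not part of any real API): returns true if a
 * U-mode read of the counter selected by `bit` raises an
 * illegal-instruction exception. Assumes S-mode is implemented, so the
 * bit must be set in both registers for the read to be allowed.
 */
static bool umode_read_traps(uint32_t mcounteren, uint32_t scounteren,
                             int bit)
{
    uint32_t mask = UINT32_C(1) << bit;
    bool allowed = (mcounteren & mask) && (scounteren & mask);
    return !allowed;
}
```

Under this model, a supervisor that wants to intercept instret reads
would clear IR in scounteren, so umode_read_traps(0x7, 0x3, IR_BIT) is
true while cycle and time remain readable.
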
> It's not clear whether the instret counter is precise. Intel's x86 PMU
> has an instructions-retired counter, but it's not precise enough for
> our use; for example, instructions that are interrupted and restarted
> (e.g. due to a page fault) are counted twice. rr needs the property
> that if you retire N instructions *as seen by user space* then the
> counter advances by exactly N. On x86 rr instead uses the
> conditional-branches-retired counter and some crazy hacks, but it
> would be much better to have a usable instructions-retired counter.
My guess is that the precision of instret is implementation-defined,
like so much else in RISC-V. (I would support making it precise if it
is not currently -- an instruction restarted due to a trap never
actually retired in the first place!)
> RISC-V uses LL/SC rather than CAS. This is a problem for rr because
> LL/SC, on ARM at least, is prone to unpredictable fail-and-retry, e.g.
> because a hardware interrupt occurred inside the LL/SC region. Thus,
> the count of instructions or even conditional branches retired may
> incur unpredictable extra increments, making it unreliable for our
> purposes. The ability to trap on a failed SC would probably be enough
> to let rr work.
RISC-V LR/SC has the same problem: an interrupt taken inside the LR/SC
sequence can cause the SC to fail. Multi-hart systems with a PLIC do
have the ability to steer most interrupts away from a particular hart.
On the other hand, RISC-V LR/SC provides a forward-progress guarantee.
Could that forward-progress guarantee be strengthened to include holding
off interrupts for a limited time? In other words, if the LR/SC *can*
succeed on the first try (no page fault, no race with other harts,
etc.), it *must* succeed on the first try?
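
To make the counting concern concrete, here is a small C model of how
failed SCs change what a counter would see. It is an illustration only,
not a measurement: the four-instruction retry loop in the comment is an
assumed example, and lrsc_loop_retired is a made-up helper.

```c
#include <stdint.h>

/*
 * Assumed retry loop (an atomic add built from LR/SC): 4 instructions
 * and 1 conditional branch retire per attempt.
 *
 *   retry: lr.w  t0, (a0)
 *          add   t0, t0, a1
 *          sc.w  t1, t0, (a0)    # t1 == 0 on success
 *          bnez  t1, retry
 */
enum { INSNS_PER_ATTEMPT = 4, BRANCHES_PER_ATTEMPT = 1 };

struct retired {
    uint64_t insns;          /* instructions retired by the loop */
    uint64_t cond_branches;  /* conditional branches retired by the loop */
};

/* Hypothetical helper: counts retired when the SC fails `failed_scs`
 * times before finally succeeding. */
static struct retired lrsc_loop_retired(uint64_t failed_scs)
{
    uint64_t attempts = failed_scs + 1;
    struct retired r = {
        attempts * INSNS_PER_ATTEMPT,
        attempts * BRANCHES_PER_ATTEMPT,
    };
    return r;
}
```

In this model both counts depend on the number of failed SCs, so a
recording with zero failures and a replay with one failure disagree by
four instructions and one conditional branch.
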
-- Jacob