RISC-V and deterministic record/replay tools (i.e. rr)

250 views
Skip to first unread message

Robert O'Callahan

unread,
Mar 15, 2018, 1:27:18 AM3/15/18
to isa...@groups.riscv.org
I'm the lead developer of rr, a low overhead record-and-replay-based debugger for arbitrary user-space processes on Linux/x86. People use it to debug applications like Firefox and QEMU with reverse execution. Details: https://arxiv.org/abs/1705.05937

Out of curiosity I had a quick look to see whether rr would be implementable on RISC-V. It generally looks delightful but there are a few obvious issues for rr:

The cycle/time/instret CSRs are exposed to user-space unconditionally. I see no way to force them to trap, so it would be impossible to record and replay code that reads these. Compare to x86, where you can configure the processor to trap on RDTSC and friends.

It's not clear whether the instret counter is precise. Intel's x86 PMU has an instructions-retired counter, but it's not precise enough for our use; for example, instructions that are interrupted and restarted (e.g. due to a page fault) are counted twice. rr needs the property that if you retire N instructions *as seen by user space* then the counter advances by exactly N. On x86 rr instead uses the conditional-branches-retired counter and some crazy hacks, but it would be much better to have a usable instructions-retired counter.

RISC-V uses LL/SC rather than CAS. This is a problem for rr because LL/SC, on ARM at least, is prone to unpredictable fail-and-retry, e.g. because a hardware interrupt occurred inside the LL/SC region. Thus, the count of instructions or even conditional branches retired may incur unpredictable extra increments, making it unreliable for our purposes. The ability to trap on a failed SC would probably be enough to let rr work.

Just FYI,
Rob
--
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw, draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.

Jacob Bachmeyer

unread,
Mar 15, 2018, 1:48:02 AM3/15/18
to rob...@ocallahan.org, isa...@groups.riscv.org
Robert O'Callahan wrote:
> I'm the lead developer of rr, a low overhead record-and-replay-based
> debugger for arbitrary user-space processes on Linux/x86. People use
> it to debug applications like Firefox and QEMU with reverse execution.
> Details: https://arxiv.org/abs/1705.05937
>
> Out of curiosity I had a quick look to see whether rr would be
> implementable on RISC-V. It generally looks delightful but there are a
> few obvious issues for rr:
>
> The cycle/time/instret CSRs are exposed to user-space unconditionally.
> I see no way to force them to trap, so it would be impossible to
> record and replay code that reads these. Compare to x86, where you can
> configure the processor to trap on RDTSC and friends.

The [ms]counteren CSRs are what you seek. They are defined in the
privileged ISA spec, in the "Counter-Enable Registers" sections.
Previously, there were *delta CSRs that permitted setting offsets that
would be applied to those counters. I have also suggested simply making
the counters writable to privileged code.

> It's not clear whether the instret counter is precise. Intel's x86 PMU
> has an instructions-retired counter, but it's not precise enough for
> our use; for example, instructions that are interrupted and restarted
> (e.g. due to a page fault) are counted twice. rr needs the property
> that if you retire N instructions *as seen by user space* then the
> counter advances by exactly N. On x86 rr instead uses the
> conditional-branches-retired counter and some crazy hacks, but it
> would be much better to have a usable instructions-retired counter.

My guess is that the precision of instret is implementation-defined,
like so much else in RISC-V. (I would support making it precise if it
is not currently -- an instruction restarted due to a trap never
actually retired in the first place!)

> RISC-V uses LL/SC rather than CAS. This is a problem for rr because
> LL/SC, on ARM at least, is prone to unpredictable fail-and-retry, e.g.
> because a hardware interrupt occurred inside the LL/SC region. Thus,
> the count of instructions or even conditional branches retired may
> incur unpredictable extra increments, making it unreliable for our
> purposes. The ability to trap on a failed SC would probably be enough
> to let rr work.

RISC-V will do exactly that, although multi-hart systems with a PLIC
have the ability to steer most interrupts away from a particular hart.
On the other hand, RISC-V LR/SC provides a forward-progress guarantee.
Could that forward-progress guarantee be strengthened to include holding
off interrupts for a limited time? In other words, if the LR/SC *can*
succeed on the first try (no page fault, no race with other harts,
etc.), it *must* succeed on the first try?


-- Jacob

Allen Baum

unread,
Mar 15, 2018, 2:23:48 AM3/15/18
to jcb6...@gmail.com, rob...@ocallahan.org, isa...@groups.riscv.org


-Allen

> On Mar 14, 2018, at 10:47 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Robert O'Callahan wrote:
>> I'm the lead developer of rr, a low overhead record-and-replay-based debugger for arbitrary user-space processes on Linux/x86.
...
>> It's not clear whether the instret counter is precise.
>
> My guess is that the precision of instret is implementation-defined, like so much else in RISC-V. (I would support making it precise if it is not currently -- an instruction restarted due to a trap never actually retired in the first place!)

If I didn’t agree before I started with compliance WG, I certainly do now.
Having lots of implementation defined architecturally visible results will make it hard to evolve a SW ecosystem, and that will kill RISCV in the marketplace.

In any case, I don’t see why a precise instruction retire counter should be difficult for any implementation.
>
>> RISC-V uses LL/SC rather than CAS. This is a problem for rr because LL/SC, on ARM at least, is prone to unpredictable fail-and-retry, e.g. because a hardware interrupt occurred inside the LL/SC region.

That is a difficult problem to solve.
Note that the A-extension has other atomic ops which might be usable (though DCAS is often what you want)

Robert O'Callahan

unread,
Mar 15, 2018, 3:09:45 AM3/15/18
to jcb6...@gmail.com, isa...@groups.riscv.org
On Thu, Mar 15, 2018 at 6:47 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
The [ms]counteren CSRs are what you seek.  They are defined in the privileged ISA spec, in the "Counter-Enable Registers" sections.

Aha! Fantastic. Sorry I missed that.

Robert O'Callahan wrote:
RISC-V uses LL/SC rather than CAS. This is a problem for rr because LL/SC, on ARM at least, is prone to unpredictable fail-and-retry, e.g. because a hardware interrupt occurred inside the LL/SC region. Thus, the count of instructions or even conditional branches retired may incur unpredictable extra increments, making it unreliable for our purposes. The ability to trap on a failed SC would probably be enough to let rr work.

RISC-V will do exactly that, although multi-hart systems with a PLIC have the ability to steer most interrupts away from a particular hart.  On the other hand, RISC-V LR/SC provides a forward-progress guarantee.  Could that forward-progress guarantee be strengthened to include holding off interrupts for a limited time?  In other words, if the LR/SC *can* succeed on the first try (no page fault, no race with other harts, etc.), it *must* succeed on the first try?

That wouldn't help rr because we have no way to be sure there isn't a page fault.

One other thing I forgot: rr replay uses the x86 PMU feature to trigger an interrupt when a counter reaches a specified threshold. These interrupts are not precise, which is OK as long as the imprecision is bounded in practice. AFAICT the RISC-V PMU spec doesn't have interrupts. PMU interrupts are useful for sample-based profiling so my guess is you'll want them.

Thanks,

Robert O'Callahan

unread,
Mar 15, 2018, 3:16:45 AM3/15/18
to Allen Baum, jcb6...@gmail.com, isa...@groups.riscv.org
rr needs to handle whatever instructions applications use in practice, while avoiding pervasive binary modification/instrumentation ... so applications using LL/SC block rr even if the ISA has alternatives. FWIW LL/SC blocks us from supporting ARM.

Rob

Jacob Bachmeyer

unread,
Mar 16, 2018, 12:46:29 AM3/16/18
to rob...@ocallahan.org, isa...@groups.riscv.org
Robert O'Callahan wrote:
> On Thu, Mar 15, 2018 at 6:47 PM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Robert O'Callahan wrote:
>
> RISC-V uses LL/SC rather than CAS. This is a problem for rr
> because LL/SC, on ARM at least, is prone to unpredictable
> fail-and-retry, e.g. because a hardware interrupt occurred
> inside the LL/SC region. Thus, the count of instructions or
> even conditional branches retired may incur unpredictable
> extra increments, making it unreliable for our purposes. The
> ability to trap on a failed SC would probably be enough to let
> rr work.
>
>
> RISC-V will do exactly that, although multi-hart systems with a
> PLIC have the ability to steer most interrupts away from a
> particular hart. On the other hand, RISC-V LR/SC provides a
> forward-progress guarantee. Could that forward-progress guarantee
> be strengthened to include holding off interrupts for a limited
> time? In other words, if the LR/SC *can* succeed on the first try
> (no page fault, no race with other harts, etc.), it *must* succeed
> on the first try?
>
>
> That wouldn't help rr because we have no way to be sure there isn't a
> page fault.

Can you include page faults in the replay? Such that if it faults on
the first try, it will fault again on the replay?


-- Jacob


Albert Cahalan

unread,
Mar 16, 2018, 3:00:21 AM3/16/18
to Allen Baum, jcb6...@gmail.com, rob...@ocallahan.org, isa...@groups.riscv.org
On 3/15/18, Allen Baum <allen...@esperantotech.com> wrote:
>> On Mar 14, 2018, at 10:47 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>> Robert O'Callahan wrote:

>>> I'm the lead developer of rr, a low overhead record-and-replay-based
>>> debugger for arbitrary user-space processes on Linux/x86.
> ...
>>> It's not clear whether the instret counter is precise.
>>
>> My guess is that the precision of instret is implementation-defined, like
>> so much else in RISC-V. (I would support making it precise if it is not
>> currently -- an instruction restarted due to a trap never actually retired
>> in the first place!)
>
> If I didn’t agree before I started with compliance WG, I certainly do now.
> Having lots of implementation defined architecturally visible results will
> make it hard to evolve a SW ecosystem, and that will kill RISCV in the
> marketplace.
>
> In any case, I don’t see why a precise instruction retire counter should be
> difficult for any implementation.

I'm doing whole-system record and replay, and I have almost exactly the
same needs as he does. I'm using nearly the same hacks on Intel chips.
It's a miserable mess, which I suspect to be why VMWare gave up.

I really hope that a precise instruction retire counter is something that
can be done for RISC-V. This isn't something Intel can manage, but then
you might say that x86-64 is a tad more complicated, eh? Adding string.h
and a task switcher to an architecture has consequences.

>>> RISC-V uses LL/SC rather than CAS. This is a problem for rr because
>>> LL/SC, on ARM at least, is prone to unpredictable fail-and-retry, e.g.
>>> because a hardware interrupt occurred inside the LL/SC region.
>
> That is a difficult problem to solve.
> Note that the A-extension has other atomic ops which might be
> usable (though DCAS is often what you want)

LL/SC is also worrisome to me. Page faults are no trouble in my case,
but M-mode stuff will cause a disaster if it breaks LL/SC status.

Transactional memory looks like trouble. Hardware updates of MMU PTE
bits, such as "accessed" and "dirty", look like trouble. (this being exposure
of non-reproducible TLB state) Another source of trouble is that processors
with different hardware revisions may have different behavior when they
encounter instructions that have been introduced between the revisions.
Being able to disable newer instructions, including redefined bits, would help.

Robert O'Callahan

unread,
Mar 16, 2018, 7:44:04 PM3/16/18
to Jacob Bachmeyer, isa...@groups.riscv.org
Hmm. How would that work? We'd have to get the OS to notify us of page faults asynchronously (synchronous would be a performance killer), collecting data like the PC and the instret counter for us. Then for each one we'd have to figure out if it caused an SC failure --- by parsing the instruction stream at the PC? Which hopefully hasn't changed since the fault occurred... And this still wouldn't work for sequences that don't get the forward-progress guarantee.

Even "holding off interrupts" is still a hardware change. Would that actually be simpler than just exposing an option to trap after failed SCs? I'm not a hardware person but the latter sounds relatively simple to me. Unfortunately I can't think of any use for it other than for rr so my expectations for seeing this implemented in any architecture aren't high :-).

Cesar Eduardo Barros

unread,
Mar 16, 2018, 8:22:22 PM3/16/18
to rob...@ocallahan.org, Jacob Bachmeyer, isa...@groups.riscv.org
Em 16-03-2018 20:44, Robert O'Callahan escreveu:
> Hmm. How would that work? We'd have to get the OS to notify us of page
> faults asynchronously (synchronous would be a performance killer),
> collecting data like the PC and the instret counter for us. Then for
> each one we'd have to figure out if it caused an SC failure --- by
> parsing the instruction stream at the PC? Which hopefully hasn't changed
> since the fault occurred... And this still wouldn't work for sequences
> that don't get the forward-progress guarantee.
>
> Even "holding off interrupts" is still a hardware change. Would that
> actually be simpler than just exposing an option to trap after failed
> SCs? I'm not a hardware person but the latter sounds relatively simple
> to me. Unfortunately I can't think of any use for it other than for rr
> so my expectations for seeing this implemented in any architecture
> aren't high :-).

Sorry if this is a naive idea, but how about (ab)using breakpoints? Most
SC instructions will be immediately followed by a conditional jump. You
could scan a few instructions ahead to find this conditional jump, and
put a breakpoint on its failure side; if the code is too complex
(playing games with the SC result), just put the breakpoint immediately
after the SC and record the result there.

Of course, this has the disadvantage that you have to scan the
instruction stream to find where to put the breakpoints.

As for a hardware solution, this is actually a great moment to raise
that idea. RISC-V currently does not have the sort of advanced
performance counters we're used to seeing the "perf" tool use; sooner or
later, a standard for RISC-V performance counters will probably be
proposed. You could lobby for a "trap on failed SC" to go together with
the "trap on performance counter overflow" which will probably be part
of that proposal when it comes up (think of it as a "failed SC counter"
which is one hit away from overflowing, and you see how the same
mechanism could be used for both).

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Jacob Bachmeyer

unread,
Mar 16, 2018, 9:38:25 PM3/16/18
to Albert Cahalan, Allen Baum, rob...@ocallahan.org, isa...@groups.riscv.org
Could a "replay mode" flag that causes transactional memory (including
LR/SC) to always trap to allow emulation help here?


-- Jacob

Jacob Bachmeyer

unread,
Mar 16, 2018, 10:00:10 PM3/16/18
to rob...@ocallahan.org, isa...@groups.riscv.org
Robert O'Callahan wrote:
> On Fri, Mar 16, 2018 at 5:46 PM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Robert O'Callahan wrote:
>
> On Thu, Mar 15, 2018 at 6:47 PM, Jacob Bachmeyer
> <jcb6...@gmail.com <mailto:jcb6...@gmail.com>
I probably need to carefully read the paper mentioned at the start of
this thread, but presumably, during replay, the same memory state,
including page-present and permissions could be replicated? This would
require that the supervisor provide a trace of memory actions in between
checkpoints. It appears that the userfaultfd(2) mechanism might be able
to do this in Linux.


-- Jacob

Robert O'Callahan

unread,
Mar 16, 2018, 10:33:23 PM3/16/18
to Cesar Eduardo Barros, Jacob Bachmeyer, isa...@groups.riscv.org
On Sat, Mar 17, 2018 at 1:22 PM, Cesar Eduardo Barros <ces...@cesarb.eti.br> wrote:
Em 16-03-2018 20:44, Robert O'Callahan escreveu:
Hmm. How would that work? We'd have to get the OS to notify us of page faults asynchronously (synchronous would be a performance killer), collecting data like the PC and the instret counter for us. Then for each one we'd have to figure out if it caused an SC failure --- by parsing the instruction stream at the PC? Which hopefully hasn't changed since the fault occurred... And this still wouldn't work for sequences that don't get the forward-progress guarantee.

Even "holding off interrupts" is still a hardware change. Would that actually be simpler than just exposing an option to trap after failed SCs? I'm not a hardware person but the latter sounds relatively simple to me. Unfortunately I can't think of any use for it other than for rr so my expectations for seeing this implemented in any architecture aren't high :-).

Sorry if this is a naive idea, but how about (ab)using breakpoints? Most SC instructions will be immediately followed by a conditional jump. You could scan a few instructions ahead to find this conditional jump, and put a breakpoint on its failure side; if the code is too complex (playing games with the SC result), just put the breakpoint immediately after the SC and record the result there.

Of course, this has the disadvantage that you have to scan the instruction stream to find where to put the breakpoints.

This is definitely solvable if you're willing to scan the entire application instruction stream and add instrumentation around SCs. However, avoiding pervasive binary instrumentation is a design principle of rr. Pervasive binary instrumentation imposes significant complexity and performance overhead, especially on very large applications and applications that use dynamic code generation and self-modifying code (e.g. JITs for dynamic languages).

As for a hardware solution, this is actually a great moment to raise that idea. RISC-V currently does not have the sort of advanced performance counters we're used to seeing the "perf" tool use; sooner or later, a standard for RISC-V performance counters will probably be proposed. You could lobby for a "trap on failed SC" to go together with the "trap on performance counter overflow" which will probably be part of that proposal when it comes up (think of it as a "failed SC counter" which is one hit away from overflowing, and you see how the same mechanism could be used for both).

Good point. A "failed SC" counter would be useful for profiling, and the ability to precisely trap when it overflows would satisfy rr's needs.

Robert O'Callahan

unread,
Mar 16, 2018, 10:35:33 PM3/16/18
to Jacob Bachmeyer, Albert Cahalan, Allen Baum, isa...@groups.riscv.org
On Sat, Mar 17, 2018 at 2:38 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
Could a "replay mode" flag that causes transactional memory (including LR/SC) to always trap to allow emulation help here?

Trapping on every atomic operation would be unacceptable overhead for rr, and (I imagine) for any other record-replay tool that aspires to be used in practice.

Robert O'Callahan

unread,
Mar 16, 2018, 10:46:14 PM3/16/18
to Jacob Bachmeyer, isa...@groups.riscv.org
On Sat, Mar 17, 2018 at 3:00 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
I probably need to carefully read the paper mentioned at the start of this thread, but presumably, during replay, the same memory state, including page-present and permissions could be replicated?  This would require that the supervisor provide a trace of memory actions in between checkpoints.  It appears that the userfaultfd(2) mechanism might be able to do this in Linux.

You can't remove pages with userfaultfd. If you extended the kernel with all the necessary interfaces to support replicating page-presence during replay, it would be pretty complicated (e.g. requiring page pinning during replay) and it would add a lot of otherwise unnecessary overhead. I can't see this being workable.

Albert Cahalan

unread,
Mar 17, 2018, 12:18:28 AM3/17/18
to jcb6...@gmail.com, isa...@groups.riscv.org
On 3/15/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> On the other hand, RISC-V LR/SC provides a forward-progress guarantee.
> Could that forward-progress guarantee be strengthened to include holding
> off interrupts for a limited time? In other words, if the LR/SC *can*
> succeed on the first try (no page fault, no race with other harts,
> etc.), it *must* succeed on the first try?

I think that is useful for unrelated problems. It doesn't help record/replay,
but it could improve throughput for locking and interrupt handling.

Crude proposal:

Each interrupt source gets a pair of adjustable latency settings. One is for
when there is an active LL/SC pair, and the other is for when there is not.
When there is reason to handle an interrupt, this is delayed until one of
the following:

a. There is an active LL/SC and the latency for this condition is exceeded.
b. There is no LL/SC active, and the latency for this condition is exceeded.
c. Any other interrupt is handled. (so they get batched up)

Additionally, there are latency settings for proceeding into a LL/SC pair.
One of them stops the processor at the LL until all other processors are
known to not be in a LL/SC pair, and the other stops the processor at the
LL until it is not known that any other processor is certainly in a LL/SC pair.
(the difference being how uncertainty is treated) If the latency limit is hit,
then execution continues on past the LL or perhaps (configurable?) faults.

So for a system with 32 interrupt sources, there would be 66 latency settings.
That is (32+1)*2 or 32*2+2 depending on how you like to view it.

Jacob Bachmeyer

unread,
Mar 17, 2018, 6:54:16 PM3/17/18
to rob...@ocallahan.org, Cesar Eduardo Barros, isa...@groups.riscv.org
Robert O'Callahan wrote:
> On Sat, Mar 17, 2018 at 1:22 PM, Cesar Eduardo Barros
> <ces...@cesarb.eti.br <mailto:ces...@cesarb.eti.br>> wrote:
>
> As for a hardware solution, this is actually a great moment to
> raise that idea. RISC-V currently does not have the sort of
> advanced performance counters we're used to seeing the "perf" tool
> use; sooner or later, a standard for RISC-V performance counters
> will probably be proposed. You could lobby for a "trap on failed
> SC" to go together with the "trap on performance counter overflow"
> which will probably be part of that proposal when it comes up
> (think of it as a "failed SC counter" which is one hit away from
> overflowing, and you see how the same mechanism could be used for
> both).
>
>
> Good point. A "failed SC" counter would be useful for profiling, and
> the ability to precisely trap when it overflows would satisfy rr's needs.

I would like to suggest adding a privilege level field to the hpmevent*
CSRs. Each counter then would count only events occurring at that
privilege level or lower and would be writable at that privilege level
or higher. The privilege level field in hpmevent* can only be written
with the current privilege level or lower. Events codes that duplicate
instret and cycles should be assigned, to permit counting only
instructions retired in U-mode.

I would also like to suggest adding either a set of hpmthreshold* CSRs
(if the counters are not made writable) or an overflow trap field to the
hpmevent* CSRs. Reaching the threshold or overflowing a counter would
cause either a breakpoint trap (distinguishable because *epc is not
EBREAK) or an exception with a new cause code.


-- Jacob

Reply all
Reply to author
Forward
0 new messages