On 1/4/23, 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> On Wednesday, January 4, 2023 at 3:11:37 PM UTC-6 acahalan wrote:
>> On 1/4/23, Ruslan Nikolaev <nrusl...@gmail.com> wrote:
>> No, because this "completeness" prevents replay debuggers
>> from working. Simply having LL/SC is an architectural flaw.
>> The instructions should not exist at all.
>
> We (architects) are caught between a rock and a hard place::
>
> Either a) we invent and implement new ATOMIC instructions of
> increasing complexity every new generation,
I don't really think so. Compare and swap is pretty solid.
You can vary the width, or support multiple pointers, or
decide on a status to provide... but none of that changes
the operation in any deep or fundamental way.
>>> 2. CAS is worthy of discussion. To guarantee strict lock-freedom,
>>> let alone wait-freedom, we still need to have a single instruction
>>> which cannot be interrupted somewhere in the middle.
>>
>> Replay requires that all instructions are this way.
>
> Alternatives exist: while it is true that one cannot single step through
> an ATOMIC event and have any semblance of ATOMICity; it is false
> that one cannot suppress interrupts for the duration of an ATOMIC
> event (subject to timeouts and having a plan to deal with exceptions).
>
> So, for example, say you had something like LL/SC and that the
> core would suppress interrupts between LL and SC. Your single-stepping
> debugger would single-step up to the LL, and then one more step
> would land at the instruction following SC. Here, at least, the
> ATOMICity of the event is not harmed by the "debugging" of the
> event.
Oh no, not a single-stepping debugger. The idea is to let the
processor run at full speed, pipelines all going as they should,
until a certain count of executed instructions is reached or an
interrupt happens. If an interrupt happens, then it should be
possible to note the number of instructions that had run, so
that execution can be restarted. After the desired number of
instructions is completed, the processor should be in a fully
determined state, having executed the exact instructions that
were desired.
For example, consider a replay debugger built as a hypervisor.
A guest OS is booted to a GUI, and the user interacts with it.
The hypervisor records information about interrupts and
about input from things like the mouse. It is then possible to
replay the whole thing at relatively normal speed, stopping at
any desired point in time. The sequence of instructions does
not vary during the replay. Register content, memory content,
and everything else is fully reproducible. The user can jump
to any point in time, or fork the history at that point and
continue execution interactively, the only trouble being that
network connections to other systems will be rejected by
those systems.
> For this to work, however, there must be a place of control that receives
> control if an exception is thrown before the exception control transfer is
> performed. Also, there must be a way that SW can query why the event
> failed.
When the hypervisor gains control due to the exception,
it takes note of how many instructions have completed.
It can subtract that from the desired number, then restart
the guest with the new value as the number to execute.
Maybe you wouldn't call it CAS, but otherwise it works fine.
You just need more than one pointer for the instruction, and
of course you hit the usual complaints about complexity.
Example: 6 consecutive register pairs to hold comparison values,
6 consecutive register pairs to hold replacement values, and
another 6 consecutive registers to hold pointers.
> When there is no interference observed during the event, the event
> takes exactly the same number of cycles as if none of the semantic
> decorations (above) existed.
And if there is interference...?
Taking a different number of cycles is fine. Taking a different
number of instructions is trouble. Branching, including loops,
must not be unpredictable.
> My conditional branch instruction has the ability to ask the cache-miss
> buffers of a core if any interference has been detected.
The answer to that question must be deterministic,
or it will have to be recorded and then recreated.
Having to record things like this frequently is very bad;
the worst case degenerates into single-stepping, and the
performance of single-stepping is useless.
>> LL/SC must
>> not exist. They are an architectural flaw. It should be possible
>> to specify a maximum number of instructions to execute and to
>> measure how many instructions did execute, and there shouldn't
>> be any hidden strangeness like a LL/SC loop running a different
>> number of times.
>
> LL/SC runs different numbers of times BECAUSE there are other
> threads attempting to do some kind of access on the same data
> this thread is attempting. Smashing LL/SC into a single instruction
> just bounds the amount of interference that can happen (but does
> nothing to suppress that interference.) Your test-and-set <typical>
> primitive can take a cache miss and be subject to all sorts of
> latency that allows for other 3rd party threads to touch the same
> data, and ultimately cause the T&S to fail--even when the instantaneous
> state of memory at the instant of T&S decode was in a state where it
> could succeed. The general rule of thumb, with a multi-core chip
> implementation, is "whoever gets to the memory controller first"
> will end up "getting" the lock and there is nothing a core can do about
> that.
Unpredictable latency is fine. Unpredictable instruction count
is a problem. Failure should not be happening... and I do not
consider instruction roll-back for an interrupt to be failure.