RVV: what happens when LOAD / STORE crosses a Virtual Memory page?


Luke Kenneth Casson Leighton

Apr 17, 2018, 10:22:16 PM
to RISC-V ISA Dev
I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
riscv-isa-manual in order to work out how to re-map RVV onto the
standard ISA, and came across some interesting comments at the bottom of
pages 75 and 76:

" A common mechanism used in other ISAs to further reduce
save/restore code size is load-
multiple and store-multiple instructions. "

Fascinatingly, due to Simple-V proposing to use the *standard*
register file, both C.LOAD / C.STORE *and* LOAD / STORE would in
effect be exactly that: load-multiple and store-multiple instructions.
Which brings us on to this comment:

"For virtual memory systems, some data accesses could be resident in
physical memory and
some could not, which requires a new restart mechanism for partially
executed instructions."

Which then of course brings us to the interesting question: how does
RVV cope with the scenario where, particularly with LD.X (Indexed /
indirect loads), a page fault occurs part-way through the load?

Has this been noted or discussed before?

Is RVV designed in any way to be re-entrant?

What would the implications be for instructions that were in a FIFO at
the time, in out-of-order and VLIW implementations, where partial
decode had taken place?

Would it be reasonable at least to say *bypass* (and freeze) the
instruction FIFO (drop down to a single-issue execution model
temporarily) for the purposes of executing the instructions in the
interrupt (whilst setting up the VM page), then re-continue the
instruction with all state intact?

Or would it be better to switch to an entirely separate secondary
hyperthread context?

Does anyone have any ideas or know if there is any academic literature
on solutions to this problem?

Greatly appreciated,

l.

Andrew Waterman

Apr 18, 2018, 12:01:50 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev
On Tue, Apr 17, 2018 at 7:21 PM, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
riscv-isa-manual in order to work out how to re-map RVV onto the
standard ISA, and came across some interesting comments at the bottom of
pages 75 and 76:

"   A common mechanism used in other ISAs to further reduce
save/restore code size is load-
multiple and store-multiple instructions. "

Fascinatingly, due to Simple-V proposing to use the *standard*
register file, both C.LOAD / C.STORE *and* LOAD / STORE would in
effect be exactly that: load-multiple and store-multiple instructions.
Which brings us on to this comment:

"For virtual memory systems, some data accesses could be resident in
physical memory and
  some could not, which requires a new restart mechanism for partially
executed instructions."

Which then of course brings us to the interesting question: how does
RVV cope with the scenario where, particularly with LD.X (Indexed /
indirect loads), a page fault occurs part-way through the load?

Has this been noted or discussed before?

For applications-class platforms, the RVV exception model is element-precise (that is, if an exception occurs on element j of a vector instruction, elements 0..j-1 have completed execution and elements j+1..vl-1 have not executed).

Certain classes of embedded platforms where exceptions are always fatal might choose to offer resumable/swappable interrupts but not precise exceptions.
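
To make "element-precise" concrete, here is a minimal C sketch of the model (illustrative only; vstate_t, resume_el and load_element are placeholder names, not architectural state):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        size_t vl;        /* active vector length */
        size_t resume_el; /* first element not yet executed */
    } vstate_t;

    /* returns false if element i faults (e.g. page not resident) */
    extern bool load_element(double *dst, const double *base, size_t i);

    bool vector_unit_load(double *dst, const double *base, vstate_t *vs)
    {
        for (size_t i = vs->resume_el; i < vs->vl; i++) {
            if (!load_element(&dst[i], base, i)) {
                vs->resume_el = i;  /* elements 0..i-1 retired; i..vl-1 untouched */
                return false;       /* take the trap */
            }
        }
        vs->resume_el = 0;          /* completed: next instruction starts at 0 */
        return true;
    }

The trap handler fixes up the fault (e.g. pages in the missing translation) and re-dispatches; execution then resumes at resume_el rather than at element 0.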


Is RVV designed in any way to be re-entrant?

Yes.


What would the implications be for instructions that were in a FIFO at
the time, in out-of-order and VLIW implementations, where partial
decode had taken place?

The usual bag of tricks for maintaining precise exceptions applies to vector machines as well.  Register renaming makes the job easier, and it's relatively cheaper for vectors, since the control cost is amortized over longer registers.
 

Would it be reasonable at least to say *bypass* (and freeze) the
instruction FIFO (drop down to a single-issue execution model
temporarily) for the purposes of executing the instructions in the
interrupt (whilst setting up the VM page), then re-continue the
instruction with all state intact?

This approach has been done successfully, but it's desirable to be able to swap out the vector unit state to support context switches on exceptions that result in long-latency I/O.


Or would it be better to switch to an entirely separate secondary
hyperthread context?

Does anyone have any ideas or know if there is any academic literature
on solutions to this problem?

The Vector VAX offered imprecise but restartable and swappable exceptions: http://mprc.pku.edu.cn/~liuxianhua/chn/corpus/Notes/articles/isca/1990/VAX%20vector%20architecture.pdf

Sec. 4.6 of Krste's dissertation assesses some of the tradeoffs and references a bunch of related work: http://people.eecs.berkeley.edu/~krste/thesis.pdf


Greatly appreciated,

l.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAPweEDyUE7znMRUMuM4hn9wEo%2B5ehNnmkz6dG3eKC%3DAidwCyjA%40mail.gmail.com.

Luke Kenneth Casson Leighton

Apr 18, 2018, 12:20:54 AM
to Andrew Waterman, RISC-V ISA Dev
hi andrew, great reply! really informative and helpful, thank you.
need time to absorb and research (and work out implications) e.g. one
implication must be that RVV has a means of restarting vector
operations from a specific index.

On Wed, Apr 18, 2018 at 5:01 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> On Tue, Apr 17, 2018 at 7:21 PM, Luke Kenneth Casson Leighton
> <lk...@lkcl.net> wrote:

>> Which then of course brings us to the interesting question: how does
>> RVV cope with the scenario when, particularly with LD.X (Indexed /
>> indirect loads), part-way through the loading a page fault occurs?
>>
>> Has this been noted or discussed before?
>
>
> For applications-class platforms, the RVV exception model is element-precise
> (that is, if an exception occurs on element j of a vector instruction,
> elements 0..j-1 have completed execution and elements j+1..vl-1 have not
> executed).
>
> [....]

lk...@lkcl.net

Apr 18, 2018, 1:31:06 AM
to RISC-V ISA Dev, wate...@eecs.berkeley.edu


On Wednesday, April 18, 2018 at 5:20:54 AM UTC+1, lk...@lkcl.net wrote:
hi andrew, great reply!  really informative and helpful, thank you.
need time to absorb and research (and work out implications) e.g. one
implication must be that RVV has a means of restarting vector
operations from a specific index.

 oh!  I just had a thought.  Started reading section 4.6 of Krste's thesis, noted the "IEEE 754 FP exceptions" and thought, "hmmm that could go into a CSR, must re-read the section on FP state CSRs in RVV 0.4-Draft again"; then i suddenly thought, "ah ha!  what if memory exceptions, instead of being thrown immediately, were simply stored in a type of predication bit-field with a flag saying "error: this element failed"?

 Then, *after* the vector load (or store, or even arithmetic operation) was performed, you could *then* raise an exception, at which point it would be possible (yes, in software... I know....) to go "hmmm, these indexed operations didn't work, let's get them into memory by triggering page-loads", then *re-run the entire instruction*, but this time with a "memory-predication CSR" that stops the already-performed operations (whether they be loads, stores or arithmetic / FP operations) from being carried out a second time.

 This could theoretically end up being done multiple times in an SMP environment, and for LD.X there would also be the remote (and annoying) possibility that an indexed memory address could end up being modified in between.

 The key property would be that the order of execution need not be sequential, which potentially could have some big advantages.  Am still thinking through the implications, as any dependent operations (particularly ones already decoded and moved into the execution FIFO) would still be there (and stalled).  hmmm.
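
 a rough C sketch of what I mean (purely illustrative: fault_mask, done_mask and try_load_element are made-up names, nothing from the actual RVV draft):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    extern bool try_load_element(double *dst, const uint64_t *addrs, size_t i);

    /* one sweep over all elements; faults are recorded, not trapped on.
     * done_mask marks elements that already succeeded on an earlier pass. */
    uint64_t vector_load_sweep(double *dst, const uint64_t *addrs,
                               size_t vl, uint64_t done_mask)
    {
        uint64_t fault_mask = 0;
        for (size_t i = 0; i < vl; i++) {
            if (done_mask & (1ULL << i))
                continue;                /* "memory-predication": don't redo it */
            if (!try_load_element(dst, addrs, i))
                fault_mask |= 1ULL << i; /* defer the exception */
        }
        return fault_mask;  /* raise a single exception afterwards if nonzero */
    }

 the exception handler would page in everything marked in fault_mask, then the instruction would be re-run with done_mask set to the elements that already succeeded.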

l.

Andrew Waterman

Apr 18, 2018, 2:42:43 AM
to lk...@lkcl.net, RISC-V ISA Dev
On Tue, Apr 17, 2018 at 10:31 PM, lk...@lkcl.net <lk...@lkcl.net> wrote:


On Wednesday, April 18, 2018 at 5:20:54 AM UTC+1, lk...@lkcl.net wrote:
hi andrew, great reply!  really informative and helpful, thank you.
need time to absorb and research (and work out implications) e.g. one
implication must be that RVV has a means of restarting vector
operations from a specific index.

 oh!  I just had a thought.  Started reading section 4.6 of Krste's thesis, noted the "IEEE 754 FP exceptions" and thought, "hmmm that could go into a CSR, must re-read the section on FP state CSRs in RVV 0.4-Draft again"; then i suddenly thought, "ah ha!  what if memory exceptions, instead of being thrown immediately, were simply stored in a type of predication bit-field with a flag saying "error: this element failed"?

 Then, *after* the vector load (or store, or even arithmetic operation) was performed, you could *then* raise an exception, at which point it would be possible (yes, in software... I know....) to go "hmmm, these indexed operations didn't work, let's get them into memory by triggering page-loads", then *re-run the entire instruction*, but this time with a "memory-predication CSR" that stops the already-performed operations (whether they be loads, stores or arithmetic / FP operations) from being carried out a second time.

We've contemplated a model along these lines to support software-speculative vectorization.  You'd still want the regular vector loads and stores that trigger exceptions, though.  For general-purpose autovectorized code, any load or store can potentially fault.  Manually checking for exceptions after every memory access or even basic block would result in considerable overhead.

Luke Kenneth Casson Leighton

Apr 18, 2018, 4:28:52 AM
to Andrew Waterman, RISC-V ISA Dev
On Wed, Apr 18, 2018 at 7:42 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> On Tue, Apr 17, 2018 at 10:31 PM, lk...@lkcl.net <lk...@lkcl.net> wrote:

>> again" then i suddenly thought, "ah ha! what if the memory exceptions were,
>> instead of having an immediate exception thrown, were simply stored in a
>> type of predication bit-field with a flag "error this element failed"?
>> [...]

> We've contemplated a model along these lines to support software-speculative
> vectorization. You'd still want the regular vector loads and stores that
> trigger exceptions, though. For general-purpose autovectorized code, any
> load or store can potentially fault. Manually checking for exceptions after
> every memory access or even basic block would result in considerable
> overhead.

Turns out that if I'd read even the next page (65) before posting I
would have spotted that the scheme is identical (except applying to
IEEE 754 FP rather than LOAD/STORE).

I don't see that it would be _that_ much extra work: the actual
number of bits in the "parallel exception flags" (plural, because
there would be one set for IEEE 754 FP and one for LOAD/STORE) would
match MAXVECTORLEN, or possibly (if the sequential nature of RVV was
kept) simply match the number of actually-permitted parallel execution
units (which would be one, I presume, in the case of the
implementation I am guessing is being developed?)

If that were the case, and MAXVECTORLEN were say 4 in a particular
implementation, the number of LOAD page-faults to deal with would
also be at most 4.

oh - actually it would in turn mean that it potentially would be
possible to optimise (amalgamate) the page-fault handling!

Best case you might find (particularly with sequential loads) that one
page-fault covers them all. If it turns out to be several page-faults
that's great because it's then possible to issue multiple page-loads
at once, then *wait* for them all and only then return from the
Exception to repeat the instruction.

Right now, with only one element being executed at a time, it
basically means that several successive page-faults would occur and
that's clearly far more expensive, particularly on LD.X which is the
sparse case that, if
https://en.wikipedia.org/wiki/Translation_lookaside_buffer is to be
believed, would be a huge 20-40% miss rate on the TLB. Being able to
issue *multiple* page-loads before continuing the instruction would
seem to me to be infinitely preferable.


There's also potentially another reason which makes the case for
allowing more than one vector element operation to be executed at the
same time, and that's when Vector takes over from SIMD.

SIMD's "big advantage" is that the parallelism is in the ALU. XTensa
was bought out right around the time when they developed
massively-wide DSP-SIMD instructions for low-power LTE handsets
(trading width for clock rate, width doubling and power reducing on a
square-law, their implementation was much more power-efficient than
anything else on the market at the time).

Fitting the SIMD paradigm into the Vector one therefore requires two steps:

(1) splitting the (128-bit? 256-bit? 512-bit?) SIMD register down
into smaller blocks (possibly as far as 16-bit if that's the base size
of the SIMD element) and then
(2) permitting the core to issue *multiple* elements as actual
hardware-parallel operations to get back to the same level of
throughput *as* the SIMD operation being replaced, before it was
"split".

Now if that's *not possible* because RVV is *designed* to permit one
and only one element to be executed at once, the opportunity is
definitely lost for Vector to be recommended over SIMD (with the
subsequent O(N^5) opcode proliferation that will result).  Yes, really
O(N^5): I did the analysis a couple of days ago!  No wonder SIMD
multimedia architectures end up with thousands of unique opcodes.

Would you agree that that would slightly change the dynamics and trade-offs?

The only thing I'm not sure about is the impact on performance
of "multi-flag VM exception handling" in all the different cases.
According to the privileged spec, TLB cache-misses are permitted to be
handled in either software *or* hardware, as an implementor's choice,
and page-faults obviously and clearly get handled in software.  Whilst
there's a clear benefit to handling multiple page-faults (being able
to issue multiple simultaneous page-loads and *wait* for *all* of them
to be loaded before returning out of the exception), the other cases
I've not yet got a handle on.

l.

Andrew Waterman

Apr 18, 2018, 4:57:05 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev
In either model, the OS has the flexibility to examine the architectural state and decide to fault in multiple pages at once.  But the absolute perf. is so low if you're faulting all the time that this probably doesn't matter too much.


Right now, with only one element being executed at a time, it
basically means that several successive page-faults would occur and
that's clearly far more expensive, particularly on LD.X which is the
sparse case that, if
https://en.wikipedia.org/wiki/Translation_lookaside_buffer is to be
believed, would be a huge 20-40% miss rate on the TLB.  Being able to
issue *multiple* page-loads before continuing the instruction would
seem to me to be infinitely preferable.

There's little inherent connection between TLB miss rates (even of that magnitude) and page-fault rates.  Vector units optimized for scatter-gather have large TLBs with hardware page-table walkers, so exceptions on these workloads will be exceedingly rare.



 There's also potentially another reason which makes the case for
allowing more than one vector element operation to be executed at the
same time, and that's when Vector takes over from SIMD.

 SIMD's "big advantage" is that the parallelism is in the ALU.  XTensa
was bought out right around the time when they developed
massively-wide DSP-SIMD instructions for low-power LTE handsets
(trading width for clock rate, width doubling and power reducing on a
square-law, their implementation was much more power-efficient than
anything else on the market at the time).

 Fitting the SIMD paradigm into the Vector one therefore requires two steps:

 (1) splitting the (128-bit? 256-bit? 512-bit?) SIMD register down
into smaller blocks (possibly as far as 16-bit if that's the base size
of the SIMD element) and then
 (2) permitting the core to issue *multiple* elements as actual
hardware-parallel operations to get back to the same level of
throughput *as* the SIMD operation being replaced, before it was
"split".

Now if that's *not possible* because RVV is *designed* to permit one
and only one element to be executed at once, the opportunity is
definitely lost for Vector to be recommended over SIMD (with the
subsequent O(N^5) opcode proliferation that will result).  Yes, really
O(N^5): I did the analysis a couple of days ago!  No wonder SIMD
multimedia architectures end up with thousands of unique opcodes.

This conflates the sequential ISA semantics with the microarchitecture.  It is straightforward to build vector machines that can process many elements in parallel, despite the precise-exception model.

That page faults are ordinarily handled sequentially shouldn't manifest as a first-order performance concern.

lk...@lkcl.net

Apr 18, 2018, 8:09:00 AM
to RISC-V ISA Dev, lk...@lkcl.net


On Wednesday, April 18, 2018 at 9:57:05 AM UTC+1, waterman wrote:

On Wed, Apr 18, 2018 at 1:28 AM, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
Best case you might find (particularly with sequential loads) that one
page-fault covers them all.  If it turns out to be several page-faults
that's great because it's then possible to issue multiple page-loads
at once, then *wait* for them all and only then return from the
Exception to repeat the instruction.

In either model, the OS has the flexibility to examine the architectural state and decide to fault in multiple pages at once.  But the absolute perf. is so low if you're faulting all the time that this probably doesn't matter too much.

 Yeh good point.  Which is where I wasn't sure if one would be better than the other in the TLB-miss case (noted below about vector arches having huge TLBs).
 
There's little inherent connection between TLB miss rates (even of that magnitude) and page-fault rates.  Vector units optimized for scatter-gather have large TLBs with hardware page-table walkers, so exceptions on these workloads will be exceedingly rare.

 ok.  that sounds perfectly reasonable, to increase TLB sizes to match the workload.  Wikipedia, fount-of-all-knowledge, clearly lacking in the "Vector Architectural Insights" department... :)

 

This conflates the sequential ISA semantics with the microarchitecture.  It is straightforward to build vector machines that can process many elements in parallel, despite the precise-exception model.

 so, analysing how i got that mistaken impression... i must have thought that "sequential RVV semantics implies *only one* element may be executed at a time".

 ok, so thinking this (revised understanding) through:

* an RVV implementation may issue multiple (sequential) vector LD (or arithmetic) operations to the underlying hardware.
* it will therefore need to keep state (parallel state) on each because (simultaneously) any one of those operations (LD or F.P) may trigger an exception
* if there are multiple parallel engines generating exceptions, then there's *going* to be multiple simultaneous exceptions received on any given clock cycle

so in effect it's kinda identical to the idea of having an explicit bit-array of exceptions, except that it's either an implementation detail (not explicitly part of the standard), or it's part of the state and has to be *derived* from complex / comprehensive state (by software analysis of the parallel-vector state information, in the exception handler).

so as I understand it, the difference between what I proposed and how RVV is implemented is that:

* RVV implementations would expect to generate an exception presumably the moment it hits the first of any elements that happen, sequentially, to have an error condition and that
* if there *happens* to be another one (sequentially after it) that *also* has an error condition (likely in LD and LD.S but not so likely in LD.X) then that gets a notification too
* (presumably?) any results *AFTER* the exception(s) have to be stopped / stalled / thrown away?  otherwise it breaks the "sequential" semantics.  For LD that doesn't matter, but for F.P. operations it does.
* at the point the exception(s) occur(s), the trap handler can, if it's a hardware TLB, deal with that quite easily; if it's a software TLB, there's a bit of awkwardness in having to decode the vector state
* if there are no TLB misses, return to the (sequential, parallel) execution and carry on, knowing that there's at least one unit that needs to be woken up.

whereas:

* (Model) goes through *all* elements (up to the VSETL length), leaving in its wake a bit-field of "success / exception" data of up to MAXVECTORLEN bits, which in effect will be treated as "extra predication bits", later.
* some results succeed and some don't; successful ones are stored in registers (not thrown away as in RVV?)
* unsuccessful ones generate the exception; handling it with a hardware TLB is easy enough, and with a software TLB it is *also* reasonably easy, as there's the bit-field to walk (saves a few cycles)
* on return there's now a new "predication mask" (based on which units *previously succeeded*) which kicks off only those elements that previously failed.

oh.  I can think of a potential benefit of the latter model.  As it's effectively hooking into the predication system, and given that the other units were successful in completing, the Vector Unit is now in effect a "clean slate" into which to issue a full set of *new* Vector Elements (assuming a micro-architecture whose internal ALU parallelism is less than MAXVECTORLEN).

 Whereas in the RVV case, I *believe* that the consequences are that the Vector Elements which were previously successful (the ones BEFORE the first exception) can't _necessarily_ be re-used... unless there's not a direct correlation between the underlying internal ALU allocation and the actual instruction issue.  gaah, this stuff's got so many rabbit-holes to explore! :)

So that would just leave the question of what's done with the RVV results that are sequentially numbered higher than those which had exceptions.  Are those indeed thrown away?  Or stored as part of the state information?

# assume internal parallelism of 8 and MAXVECTORLEN of 8
VSETL r0, 8
FADD x1, x2, x3

x3[0]: ok
x3[1]: exception
x3[2]: ok
...
...
x3[7]: ok

what happens to result elements 2-7?  those may be *big* results (RV128) or in the RVV-Extended may be arbitrary bit-widths far greater.

l.

p.s. i found this btw (expired patent, filed 1996, just after 1995 when US law changed to "filing" priority date not *published* priority date, so it's *not* a submarine patent): https://patentimages.storage.googleapis.com/fc/f6/e2/2cbee92fcd8743/US5895501.pdf -  It describes three schemes (corresponding to unit, stride and indexed) where the vector execution gives "hints" to the hardware-TLB RTL, ultimately generating page-faults (when needed) *in advance* via a hardware signal line.  kinda neat.

Andrew Waterman

Apr 18, 2018, 8:48:37 AM
to lk...@lkcl.net, RISC-V ISA Dev
Thrown away.

But keep that cost in context: Unix-like systems do their best to make this an uncommon case. Even when it does happen, the scalar unit needs to execute >100 instructions of OS code in the best case, and quite a bit more on average.  The energy spent handling the page fault will dominate the energy to re-execute the partially executed vector operations.


l.

p.s. i found this btw (expired patent, filed 1996, just after 1995 when US law changed to "filing" priority date not *published* priority date, so it's *not* a submarine patent): https://patentimages.storage.googleapis.com/fc/f6/e2/2cbee92fcd8743/US5895501.pdf -  It describes three schemes (corresponding to unit, stride and indexed) where the vector execution gives "hints" to the hardware-TLB RTL, ultimately generating page-faults (when needed) *in advance* via a hardware signal line.  kinda neat.


Jacob Bachmeyer

Apr 18, 2018, 10:33:21 PM
to Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev
Andrew Waterman wrote:
> On Wed, Apr 18, 2018 at 1:28 AM, Luke Kenneth Casson Leighton
> <lk...@lkcl.net> wrote:
>
> [...]
>
> oh - actually it would in turn mean that it potentially would be
> possible to optimise (amalgamate) the page-fault handling!
>
> Best case you might find (particularly with sequential loads) that one
> page-fault covers them all. If it turns out to be several page-faults
> that's great because it's then possible to issue multiple page-loads
> at once, then *wait* for them all and only then return from the
> Exception to repeat the instruction.
>
>
> In either model, the OS has the flexibility to examine the
> architectural state and decide to fault in multiple pages at once.
> But the absolute perf. is so low if you're faulting all the time that
> this probably doesn't matter too much.

This suggests an interesting possibility to me: an "svbadaddr" "vector
CSR" that stores, on a vector LOAD or STORE that causes page fault, the
set of faulting addresses. Each element would be either zero or a
faulting address. The "svbadaddr" vector would need to be inaccessible
to U-mode to prevent the same problem that length-speculative loads
introduced. On the other hand, a supervisor could work this out by
disassembling the faulting vector-load instruction and extracting the
addresses in software. An "SPROBE.VM" instruction that attempts to load
a page mapping and reports either a physical address or an indication
that a page fault would occur (but does not actually take the nested
page fault trap) might be useful here to avoid walking page tables in
software.
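
As a hedged sketch of how a supervisor might consume such a vector (all names here are hypothetical, since "svbadaddr" is only a proposal):

    #include <stddef.h>
    #include <stdint.h>

    #define MAXVECTORLEN 64

    /* hypothetical primitives: read the proposed svbadaddr vector into a
     * buffer (returning the element count), and page in one address */
    extern size_t svbadaddr_read(uint64_t out[MAXVECTORLEN]);
    extern void page_in(uint64_t vaddr);

    void handle_vector_page_fault(void)
    {
        uint64_t bad[MAXVECTORLEN];
        size_t n = svbadaddr_read(bad);

        for (size_t i = 0; i < n; i++)
            if (bad[i] != 0)      /* zero means this element did not fault */
                page_in(bad[i]);  /* batch all page-ins before returning */
        /* on return, the faulting vector instruction re-executes */
    }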

Are vector page faults expected to be frequent enough for such batch
processing to be worthwhile? Is the software implementation of
"svbadaddr" "good enough" in the sense that it will be dominated by the
overhead of a single page fault? Or are we giving supervisors an
uncomfortable trade-off between batch processing of vector page faults
and efficient processing of scalar page faults?


-- Jacob

Luke Kenneth Casson Leighton

Apr 18, 2018, 11:48:10 PM
to Andrew Waterman, RISC-V ISA Dev
On Wed, Apr 18, 2018 at 1:48 PM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> On Wed, Apr 18, 2018 at 5:09 AM, lk...@lkcl.net <lk...@lkcl.net> wrote:
>> So that would just leave the question of what's done with the RVV results
>> that are sequentially numbered higher than those which had exceptions. Are
>> those indeed thrown away? Or stored as part of the state information?
>>
>> # assume internal parallelism of 8 and MAXVECTORLEN of 8
>> VSETL r0, 8
>> FADD x1, x2, x3
>>
>> x3[0]: ok
>> x3[1]: exception
>> x3[2]: ok
>> ...
>> ...
>> x3[7]: ok
>>
>> what happens to result elements 2-7? those may be *big* results (RV128)
>> or in the RVV-Extended may be arbitrary bit-widths far greater.
>
>
> Thrown away.

ok.

> But keep that cost in context: Unix-like systems do their best to make this
> an uncommon case. Even when it does happen, the scalar unit needs to execute
>>100 instructions of OS code in the best case, and quite a bit more on
> average. The energy spent handling the page fault will dominate the energy
> to re-execute the partially executed vector operations.

it's... it's a compelling argument (an application of Amdahl's Law): I
just can't help feeling that it's missing an opportunity, in cases where
there are architectural decisions that are slightly different (each of
which would need an entire research project of their own to assess):
software-walked TLBs, embedded 3D Graphics Engines (as opposed to
high-end ones more akin to past Cray Supercomputers) with reduced TLB
cache sizes, and many other aspects.

I think what I'm trying to say is - and this is unfortunately hard to
"justify" given the high financial and time cost of doing the research
- that the RVV sequential element-execution semantics might be
assuming certain architectural decisions that would place barriers on
adoption, given that implementors would be effectively *required* to
follow pretty much the same architectural decisions that went into the
(amazingly comprehensive) RVV design.

Now ironically we could say that that would actually save
implementors from making costly mistakes.... we could also equally say
that it restricts *experienced* implementors (or future ones) from
coming up with compelling improvements.

I'm recognising the value of that work (man-decades of research!)
whilst at the same time asking if it would be ok to make it an
implementor's decision, by providing the bit-field of CSR-exceptions
and allowing them, on *top* of that, the freedom to choose whether to
implement sequential RVV semantics or whether to implement parallel
ones.

The reason I ask is because in the design that I'm planning, it *may*
end up, for cost/area reasons, with software-walked TLBs and not quite
so large TLB cache sizes... but I don't know yet.

What do you think?

l.

lk...@lkcl.net

Apr 19, 2018, 12:17:28 AM
to RISC-V ISA Dev, wate...@eecs.berkeley.edu, lk...@lkcl.net, jcb6...@gmail.com


On Thursday, April 19, 2018 at 3:33:21 AM UTC+1, Jacob Bachmeyer wrote:
Andrew Waterman wrote:
> In either model, the OS has the flexibility to examine the
> architectural state and decide to fault in multiple pages at once.  
> But the absolute perf. is so low if you're faulting all the time that
> this probably doesn't matter too much.

This suggests an interesting possibility to me:  an "svbadaddr" "vector
CSR" that stores, on a vector LOAD or STORE that causes page fault, the
set of faulting addresses.  Each element would be either zero or a
faulting address.  

Yes: that would be the next logical progression.  The information's right there, determined already by the hardware, so why not make it available (and make software-TLB implementors' jobs easier)?
 
The "svbadaddr" vector would need to be inaccessible
to U-mode to prevent the same problem that length-speculative loads
introduced.  On the other hand, a supervisor could work this out by
disassembling the faulting vector-load instruction and extracting the
addresses in software.  An "SPROBE.VM" instruction that attempts to load
a page mapping and reports either a physical address or an indication
that a page fault would occur (but does not actually take the nested
page fault trap) might be useful here to avoid walking page tables in
software.


interesting idea!
 
Are vector page faults expected to be frequent enough for such batch
processing to be worthwhile?  

this area is entirely new to me, so someone with experience would have to pitch in... and did :)  bit of cross-over, Jacob.  Andrew points out that it's not very common; *however* I do have to point out (respectfully, and with apologies for referring to you in the 3rd person) that the context Andrew may have been describing was one in which hardware-TLB-walking is carried out, and the TLB cache size is assumed to be absolutely enormous, precisely because that's what's *normally* done for a high-performance Vector Engine.

What we don't know is (and this is the use-case that I'm interested in) what happens at levels nearly an order of magnitude (or even nearly two) below high-performance (NVidia 1080) 3D Graphics scenarios: comparable to the GC800, or the older (first) MALI400, not the 400MP, and so on.  Even reaching say 5 GFLOPS, 30 million triangles per second and 200 Mega-pixels per second (and not consuming more than around 1.25 watts in 28nm when doing so) would be amazing as a first iteration.

Nvidia 1080s do around ten BILLION triangles per second in comparison, and over 100 BILLION pixels per second, and the shader's measured in *Teraflops* (6 TF/s) which is just insane.  Mind you each 1080 Card is of the order of 150 watts.  So all of its numbers - power consumption and performance - are approximately-linearly *two* orders of magnitude higher than the embedded 3D scenario that I'm targetting.

Now, given the greatly reduced power budget, Vectorisation is *essential*.  As Jeff Bush's Nyuzi-3D work illustrated graphically (pun intended) to me, it's *getting data through the caches* that is by far the most disproportionately large power cost; hence, once the data's there, you might as well make damn sure you do as much work as possible before pushing it back down through the cache(s).

Answer, Jacob: we don't know, for all possible potential (future) scenarios, and can't know (by definition).  We only know for *some* certain high-performance scenarios.

l.


Andrew Waterman

Apr 19, 2018, 1:49:08 AM
to jcb6...@gmail.com, Luke Kenneth Casson Leighton, RISC-V ISA Dev
Not historically. And as NVM moves closer into the memory hierarchy, rather than being treated like I/O, it should become even less relevant.

It remains necessary to support this case correctly, but attempting to accelerate it seems like spending cleverness beans in the wrong place.

Allen Baum

Apr 19, 2018, 2:13:05 AM
to Andrew Waterman, jcb6...@gmail.com, Luke Kenneth Casson Leighton, RISC-V ISA Dev
If you have any spare cleverness beans, I’ll buy them from you.

Have you had a chance to look at my slides yet?

-Allen

lk...@lkcl.net

Apr 19, 2018, 2:20:33 AM
to RISC-V ISA Dev, jcb6...@gmail.com, lk...@lkcl.net


On Thursday, April 19, 2018 at 6:49:08 AM UTC+1, waterman wrote:

It remains necessary to support this case correctly, but attempting to accelerate it seems like spending cleverness beans in the wrong place.


i'll remember that phrase.  "cleverness beans" :)

 

Andrew Waterman

Apr 19, 2018, 2:28:58 AM
to lk...@lkcl.net, RISC-V ISA Dev, Jacob Bachmeyer
Patterson deserves credit for that one.


 


Jacob Bachmeyer

Apr 19, 2018, 11:31:00 PM
to Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev
Andrew Waterman wrote:
> On Wed, Apr 18, 2018 at 7:33 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Andrew Waterman wrote:
> > On Wed, Apr 18, 2018 at 1:28 AM, Luke Kenneth Casson Leighton
> > <lk...@lkcl.net> wrote: [...]
> Not historically. And as NVM moves closer into the memory hierarchy,
> rather than being treated like I/O, it should become even less relevant.

The concern that I have here is that, in my experience, as data access
becomes faster, data volume grows even faster still. As an example,
over the years as hard disks have grown in both size and speed, in my
experience e2fsck(8) has become *slower* overall -- the expansions in
data volume outpaced the increases in performance. There is and will
still be significant benefit to be had from optimizing I/O.

> It remains necessary to support this case correctly, but attempting to
> accelerate it seems like spending cleverness beans in the wrong place.

Could the acceleration for this case also provide other benefits? (OK,
so "svbadaddr" is over-optimization, especially since there are
currently no similar CSR-like vector registers.) Could implementing
SPROBE.VM (small revision: virtual address -> leaf-most PTE entry; lack
of PTE V flag or relevant permission in result indicates page fault
would occur on actual access) also benefit other supervisor functions?
How often does an environment (at any level, since SPROBE.VM would be
available in higher modes as well) end up walking page tables in software?

[Also, in open development, is it really spending cleverness beans when
contributors bring their own? Most of mine are not exactly
general-purpose. :-) ]


-- Jacob

Allen Baum

Apr 20, 2018, 9:46:09 AM
to Jacob Bachmeyer, Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev
If a vector load encounters lots of faults, you're doing something wrong.
A structure that is large enough to generate multiple faults  probably should be using superpages - and the whole problem is avoided.
That's how you accelerate this particular problem - no  complicated state machines required. KISS principle. Cleverness beans saved for something more important.

Having said that: you don't want to have to re-execute a vector load with a single fault from the beginning, as opposed to where it faulted. There is a "probing" version that sets a predicate register bit on a fault.
That actually sounds backwards; wouldn't you want to initialize the predicate to all 1s, and have a predicated version that loads vector elements that have set predicate bits, and have it clear the bits that don't fault?
I could be misunderstanding how that is supposed to work...

In any case, there is still the issue with dealing with page faults in the non-speculative version (live with re-execution?) .








Andrew Waterman

Apr 20, 2018, 2:00:54 PM
to Allen Baum, Jacob Bachmeyer, Luke Kenneth Casson Leighton, RISC-V ISA Dev
On Fri, Apr 20, 2018 at 6:46 AM Allen Baum <allen...@esperantotech.com> wrote:
If a vector load encounters lots of faults, you're doing something wrong.
A structure that is large enough to generate multiple faults  probably should be using superpages - and the whole problem is avoided.
That's how you accelerate this particular problem - no  complicated state machines required. KISS principle. Cleverness beans saved for something more important.

Having said that: you don't want to have to re-execute a vector load with a single fault from the beginning, as opposed to where it faulted. There is a "probing" version that sets a predicate register bit on a fault.
That actually sounds backwards; wouldn't you want to initialize the predicate to all 1s, and have a predicated version that loads vector elements that have set predicate bits, and have it clear the bits that don't fault?
I could be misunderstanding how that is supposed to work...

In any case, there is still the issue with dealing with page faults in the non-speculative version (live with re-execution?) .

The proposal is to have a CSR that holds the faulting element number; this is also used to specify which element to resume execution at. (A successfully executed vector instruction resets this CSR to 0 so the next vector instruction starts at the beginning.)

Note that supporting starting in the middle of a vector isn’t any harder than having adjustable VL.
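
As an illustrative sketch of the handler flow (the CSR accessor names below are placeholders, not spec names):

    #include <stdint.h>

    /* hypothetical CSR accessors */
    extern uint64_t csr_read_faulting_element(void); /* elements before this retired */
    extern uint64_t csr_read_badaddr(void);          /* faulting virtual address */
    extern void page_in(uint64_t vaddr);

    void handle_vector_fault(void)
    {
        uint64_t elt  = csr_read_faulting_element();
        uint64_t addr = csr_read_badaddr();

        page_in(addr);
        /* the OS could also use 'elt' and the instruction encoding to
         * prefault subsequent pages before returning */
        (void)elt;
        /* trap return: hardware re-dispatches the instruction, resuming
         * at element 'elt'; on successful completion the CSR resets to 0 */
    }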

Luke Kenneth Casson Leighton

Apr 20, 2018, 7:00:45 PM
to Andrew Waterman, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
On Fri, Apr 20, 2018 at 7:00 PM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:
>
> On Fri, Apr 20, 2018 at 6:46 AM Allen Baum <allen...@esperantotech.com>
> wrote:
>>
>> If a vector load encounters lots of faults, you're doing something wrong.
>> A structure that is large enough to generate multiple faults probably
>> should be using superpages - and the whole problem is avoided.
>> That's how you accelerate this particular problem - no complicated state
>> machines required. KISS principle. Cleverness beans saved for something more
>> important.
>>
>> Having said that: you don't want to have to re-execute a vector load with
>> a single fault from the beginning, as opposed to where it faulted. There is
>> a "probing" version that sets a predicate register bit on a fault.
>> That actually sounds backwards; wouldn't you want to initialize the
>> predicate to all 1s, and have a predicated version that loads vector elements
>> that have set predicate bits, and have it clear bits that don't fault?
>> I could be misunderstanding how that is supposed to work...

Jeff Bush explored (and documented in an easy-to-follow fashion) the
issues he encountered.
https://jbush001.github.io/2015/11/03/lost-in-translation.html

Towards the end of that is a table where he outlines the number of
TLB cache entries, the number of TLB misses *and* also the number of
hardware threads.  What I find particularly fascinating is that he
learned that the number of hardware threads (the amount of internal
parallelism) has as drastic an effect on TLB misses as the number of
cache entries does.

>> In any case, there is still the issue with dealing with page faults in the
>> non-speculative version (live with re-execution?) .
>
>
> The proposal is to have a CSR that holds the faulting element number; this
> is also used to specify which element to resume execution at. (A
> successfully executed vector instruction resets this CSR to 0 so the next
> vector instruction starts at the beginning.)
>
> Note that supporting starting in the middle of a vector isn’t any harder
> than having adjustable VL.

Indeed. Or having predication (except that for large VL with
Hybrid/Virtual parallelism and a predication mask that's mostly zeros
at the beginning you have quite a lot of entries to skip before
getting to actual work).

Andrew, I can't help feeling that the RVV "sequential semantics" (which,
as we discussed and you kindly outlined, include throwing away data
after the first contiguous sequence of LOAD/STORE exceptions is
detected) are an implementation-specific detail rather than, how to
put it, a "necessary for correctness because there's no other possible
way to get this right" detail.

Bear with me: it is incredibly challenging to get an unambiguous
context here.  Yes, I'm mindful of what you said about Amdahl's Law
(however see Jeff's table above: there may be a legitimate,
cost/performance-related reason why an implementor chooses to go with a
reduced TLB size and instead chooses a greater number of hardware
threads, choosing to optimise for specific workloads at the expense of
certain carefully-evaluated *corner-case* performance penalties).

Context:

* Simple-V abstracts the entire RVV paradigm topologically and
directly concept-for-concept onto RV* (Base) using the standard
register file, which is "virtually" re-ordered according to
(additional) CSRs specifying how many "real" registers are to be
allocated to a "virtual" vector-register, example outlined here:
http://libre-riscv.org/simple_v_extension/#register_reordering

* Thus implicitly an ADD (standard opcode ADD) results in *multiple*
instructions being added into the instruction FIFO:
http://libre-riscv.org/simple_v_extension/#example_translation

* Thus assuming a pre-existing VLIW or Out-of-Order Architecture,
actually... adding Vector parallelism isn't conceptually that
much of a leap (which is a primary reason why the Simple-V proposal
exists in the first place)

* Thus on a *per-element* basis, register renaming could take place,
splitting out the related elements in the case where there are
multiple vectorised operations becomes... just part of the mundane
"normal" part of an OoO Architecture's job.

So - and this is where the RVV "sequential semantics" concern comes
in. If we may view the implementation as *purely* being "issuing
multiple regular contiguous instructions into the FIFO on a Standard
OoO Architecture", having the "sequential semantics" may be viewed as
an *artificial* architectural limitation that prevents and prohibits
an OoO architecture from being able to best optimise the workload.

Consider the case where a Vectorised LD.X ADD ST.X takes place. The
LD.X results in (register-renamed) operations that are on different
registers (in the main register file). Now let's imagine that
(arbitrarily) 50% of those result in TLB misses (because an
implementor chose to go with a lower-sized TLB and higher number of
threads for whatever legitimate reason).

If as we discussed previously the results are thrown away, the
opportunity is lost for an OoO architecture to *go ahead* with the ADD
*and the ST.X* in the element positions where the *LD.X succeeded*.

Which, due to register renaming on each element in the (virtual)
Vector, would, I believe, be perfectly fine to do.... if it wasn't
for the (arbitrary?) restriction of sequential semantics.

I appreciate that this is quite involved: it's really *really* hard
to get a precise and unambiguous context without nailing everything
down in hard detail. Is the above assessment reasonable, do you
think, and are there any errors in the logical reasoning that would
make it invalid?

l.

Andrew Waterman

Apr 20, 2018, 7:11:13 PM
to Luke Kenneth Casson Leighton, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
You're conflating the ISA-level semantics and the microarchitecture.  Nothing prevents the vector microarchitecture from executing in an arbitrarily pipelined, parallel, or out-of-order fashion.

It's no different than the scalar case, where precise exceptions also impose sequential semantics, yet OOO microarchitectures can still speculatively reorder instructions.

Luke Kenneth Casson Leighton

Apr 20, 2018, 7:38:07 PM
to Andrew Waterman, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
On Sat, Apr 21, 2018 at 12:10 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:
>
>
> You're conflating the ISA-level semantics and the microarchitecture.

If you recall we went over that, and established a common frame of
reference in which that conflation as a possible source of confusion
was eliminated. Rather than repeat that (because it took both of us
some time to establish) can we assume the same context? (reminder
below of the discussion)

> Nothing prevents the vector microarchitecture from executing in an
> arbitrarily pipelined, parallel, or out-of-order fashion.

That would be a good summary of the context I wished to establish.

> It's no different than the scalar case, where precise exceptions also impose
> sequential semantics,

Yes! because Simple-V effectively and directly "remaps" Vector
operations *into* the scalar space.

> yet OOO microarchitectures can still speculatively
> reorder instructions.

... except... when RVV mandates "sequential semantics", would you
agree that it places artificial limitations on what the OOO
microarchitecture can and cannot do?

In other words, where an OOO microarchitecture would *normally* go
ahead with an entire sequence of operations (that would, crucially,
result in other parallel ALUs being freed up and used whilst
hardware-level TLB lookup is taking place for the stalled/exceptioned
LD.X elements), the RVV "sequential semantics" *forces* those
operations to not just be thrown away but every operation that depends
on the (legitimate) results as well.

l.

-----

> # assume internal parallelism of 8 and MAXVECTORLEN of 8
> VSETL r0, 8
> FADD x1, x2, x3

> x3[0]: ok
> x3[1]: exception
> x3[2]: ok
> ...
> ...
> x3[7]: ok

> what happens to result elements 2-7? those may be *big* results (RV128) or in the RVV-Extended may be arbitrary bit-widths far greater.

(you replied:)

Thrown away.

Andrew Waterman

Apr 20, 2018, 7:43:17 PM
to Luke Kenneth Casson Leighton, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
On Fri, Apr 20, 2018 at 4:37 PM, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
On Sat, Apr 21, 2018 at 12:10 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:
>
>
> You're conflating the ISA-level semantics and the microarchitecture.

 If you recall we went over that, and established a common frame of
reference in which that conflation as a possible source of confusion
was eliminated.  Rather than repeat that (because it took both of us
some time to establish) can we assume the same context?  (reminder
below of the discussion)

> Nothing prevents the vector microarchitecture from executing in an
> arbitrarily pipelined, parallel, or out-of-order fashion.

 That would be a good summary of the context I wished to establish.

> It's no different than the scalar case, where precise exceptions also impose
> sequential semantics,

 Yes!  because Simple-V effectively and directly "remaps" Vector
operations *into* the scalar space.

> yet OOO microarchitectures can still speculatively
> reorder instructions.

 ... except... when RVV mandates "sequential semantics", would you
agree that it places artificial limitations on what the OOO
microarchitecture can and cannot do?

No.

Michael Clark

Apr 20, 2018, 7:55:27 PM
to Allen Baum, Jacob Bachmeyer, Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev


On 21/04/2018, at 1:46 AM, Allen Baum <allen...@esperantotech.com> wrote:

If a vector load encounters lots of faults, you're doing something wrong.
A structure that is large enough to generate multiple faults  probably should be using superpages - and the whole problem is avoided.
That's how you accelerate this particular problem - no  complicated state machines required. KISS principle. Cleverness beans saved for something more important.

Exactly.

GPUs typically have flat address spaces and the IOMMU on the host is used to create the illusion of unified virtual memory (OpenCL 2.0). Accelerated rendering pipelines create all of their buffers in advance and there is no page faulting in the fast path. The cache hierarchy is explicit and GPUs often have multiple separate load and store instructions, for constant, local, shared and global memory (of course only load instructions for constant memory). It’s only recently that larger implicit caches are being added for global memory access (similar to regular loads and stores on CPUs). Having separate load and store instructions makes the bus decode simpler for accessing cluster local memory which is not part of the global address space. Compute kernels are typically written with this explicit memory model in mind with passes to explicitly gather data from global memory to local or shared memory (CUDA, OpenCL).

This is the same reason we are going to have separate vector instructions so we don’t complicate the pipeline for dispatch of scalar instructions vs vector instructions. A given micro-architecture can decide to share functional units but it shouldn’t be baked into the scalar ISA. The Base ISA should be kept simple. Complicating instruction decode for scalar vs vector ops in the scalar ISA by adding extra implicit state to the decoder basically complicates an OoO as each instruction needs a copy of how the register is/was configured in the pipeline at that point in time. This basically complicates the scalar execution pipeline (someone hasn’t thought about the re-ordering of scalar instructions that have implicit external state set by another instruction, which needs to be in some rename table). Sorry, distracted a bit, was thinking about the recent Complex-V discussions.

Also note the TLB can be in the order of 20% of energy in some memory intensive workloads (CAMs light up all rows for matches), so you are unlikely to have a TLB in the fast path if you are focused on performance. A custom accelerator for AI could conceivably run without paging enabled, as machine learning workloads tend to work on large fixed size matrices that don’t require any allocations during their main loops.

One of the principles of the RISC-V design is modularity. i.e. “V” is a distinct extension with its own encoding which doesn’t complicate “I”. One could for example have 64 OoO scalar cores with paging and TLBs (rv64imafdcsu) running scalar code with offload of compute kernels to say 4096 in-order single address space vector optimised cores with simpler instruction decode (rv64imafdv). Just imagining a possible architecture... ;-)

Apologies for going off on a tangent on this thread, but i’m thinking it’s more appropriate to be discussing the “V” extension on this list given the Base ISA is frozen. If someone has an alternative non-standard extension they should probably set up a mailing list to discuss it so that isa-dev can stay focused on the extensions that are outlined in the current specifications, versus somewhat radical deviations.

Apologies for hijacking your reply Allen for responding implicitly to recent discussions on isa-dev. I hope we can close them as off-topic and move on. I think the “pivot point” is the fact that the semantics of the Base ISA are frozen, and have been for quite some time, perhaps aside from formal verification and compliance.

The V working group is the right place to discuss the vector extensions. It also seems fair and on topic to discuss the draft V extension on the isa-dev list.

Again apologies for hijacking your reply to vent my opinions.

Luke Kenneth Casson Leighton

Apr 20, 2018, 7:59:12 PM
to Andrew Waterman, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
On Sat, Apr 21, 2018 at 12:42 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

>> > yet OOO microarchitectures can still speculatively
>> > reorder instructions.
>>
>> ... except... when RVV mandates "sequential semantics", would you
>> agree that it places artificial limitations on what the OOO
>> microarchitecture can and cannot do?
>
>
> No.

hmmm, it seems to me that it would... thinking about how you would
believe otherwise... one possible explanation is that we may be
mis-communicating regarding the context (continuing below, for a more
accurate context)

>> > # assume internal parallelism of 8 and MAXVECTORLEN of 8
>> > VSETL r0, 8
>> > FADD x1, x2, x3
>>
>> > x3[0]: ok
>> > x3[1]: exception
>> > x3[2]: ok
>> > ...
>> > ...
>> > x3[7]: ok
>>
>> > what happens to result elements 2-7? those may be *big* results (RV128)
>> > or in the RVV-Extended may be arbitrary bit-widths far greater.
>>
>> (you replied:)
>>
>> Thrown away.

in this context, where the results are thrown away (which we
previously established are due to the RVV "sequential semantics"),
could an Out-of-Order microarchitecture have potentially *continued*
with operations on elements 2-7, along with the dependent operations
of other instructions *also* referencing in elements 2-7?

Sorry it's taking several rounds to get this complex issue nailed
down, really appreciate you taking the time to reply, Andrew.

l.

Andrew Waterman

Apr 20, 2018, 8:25:06 PM
to Luke Kenneth Casson Leighton, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
By "no" I meant I don't view the limitations as artificial, as they are the semantics that software wants and they are not more onerous than the restrictions imposed by scalar ISAs.

You could certainly design a vector ISA that specified an imprecise exception model, and allowed partially completed vector operations to be exposed to software across exceptions.  The Vector VAX and Hwacha are two examples of processors that did this in a limited fashion.  These processors didn't employ this approach for higher performance; they did so to simplify the hardware: it removes the need to rename registers and the need to buffer speculative stores.  These were in-order machines, so they didn't already have those facilities.  The point is moot if you're planning to build an OOO machine: you need these facilities, anyway.

The costs of the imprecise-exception model are greater than the benefit.  Software doesn't want to cope with it.  It's hard to debug.  You can't migrate state between different microarchitectures--unless you force all implementations to support the same imprecise-exception model, which would greatly limit implementation flexibility.  (Less important, but still relevant, is that the imprecise model increases the size of the context structure, as the microarchitectural guts have to be spilled to memory.)



Michael Clark

unread,
Apr 20, 2018, 8:55:47 PM4/20/18
to RISC-V ISA Dev, Jacob Bachmeyer, Andrew Waterman, Luke Kenneth Casson Leighton, Allen Baum


> On 21/04/2018, at 11:55 AM, Michael Clark <michae...@mac.com> wrote:
>
>
>
> On 21/04/2018, at 1:46 AM, Allen Baum <allen...@esperantotech.com> wrote:
>
>> If a vector load encounters lots of faults, you're doing something wrong.
>> A structure that is large enough to generate multiple faults probably should be using superpages - and the whole problem is avoided.
>> That's how you accelerate this particular problem - no complicated state machines required. KISS principle. Cleverness beans saved for something more important.
>
> Exactly.
>
> GPUs typically have flat address spaces and the IOMMU on the host is used to create the illusion of unified virtual memory (OpenCL 2.0). Accelerated rendering pipelines create all of their buffers in advance and there is no page faulting in the fast path. The cache hierarchy is explicit and GPUs often have multiple separate load and store instructions, for constant, local, shared and global memory (of course only load instructions for constant memory). It’s only recently that larger implicit caches are being added for global memory access (similar to regular loads and stores on CPUs). Having separate load and store instructions makes the bus decode simpler for accessing cluster local memory which is not part of the global address space. Compute kernels are typically written with this explicit memory model in mind with passes to explicitly gather data from global memory to local or shared memory (CUDA, OpenCL).
>
> This is the same reason we are going to have separate vector instructions so we don’t complicate the pipeline for dispatch of scalar instructions vs vector instructions. A given micro-architecture can decide to share functional units, but it shouldn’t be baked into the scalar ISA. The Base ISA should be kept simple. Complicating instruction decode for scalar vs vector ops in the scalar ISA by adding extra implicit state to the decoder basically complicates an OoO design, as each instruction needs a copy of how the register is/was configured in the pipeline at that point in time. It also complicates the scalar execution pipeline (someone hasn’t thought about re-ordering of scalar instructions that have implicit external state set by another instruction, which needs to be in some rename table). Sorry, distracted a bit - was thinking about the recent Complex-V discussions.
>
> Also note the TLB can be on the order of 20% of energy in some memory-intensive workloads (CAMs light up all rows for matches), so you are unlikely to have a TLB in the fast path if you are focused on performance. A custom accelerator for AI could conceivably run without paging enabled, as machine learning workloads tend to work on large fixed-size matrices that don’t require any allocations during their main loops.
>
> One of the principles of the RISC-V design is modularity. i.e. “V” is a distinct extension with its own encoding which doesn’t complicate “I”. One could for example have 64 OoO scalar cores with paging and TLBs (rv64imafdcsu) running scalar code with offload of compute kernels to say 4096 in-order single address space vector optimised cores with simpler instruction decode (rv64imafdv). Just imagining a possible architecture... ;-)
>
> Apologies for going off on a tangent on this thread, but i’m thinking it’s more appropriate to be discussing the “V” extension on this list given the Base ISA is frozen. If someone has an alternative non-standard extension they should probably set up a mailing list to discuss it so that isa-dev can stay focused on the extensions that are outlined in the current specifications, versus somewhat radical deviations.
>
> [...]
> Again apologies for hijacking your reply to vent my opinions.

Nevertheless, this thread is on-topic, so I don’t want to be unreasonable or unfair, and we should keep the discussions here technical and avoid opinion.

I think it is reasonable to defer partial vector load/store faulting, and leave this up to the micro-architecture, such that VL (vector-length) operations either complete or fault.

Choosing the appropriate vector length and alignment is the responsibility of the compiler. I think it’s reasonable to expect one vector instruction to either fault or complete and leave it at that. Page faults shouldn’t be frequent. I guess the cost becomes higher when the vector length is larger, i.e. 16 or 32 lanes/elements.
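
To make the complete-or-fault model concrete from the software side,
here is a minimal strip-mining sketch; vec_setvl() and vec_fadd() are
hypothetical placeholders, not draft-V intrinsics. If the vector op
faults, the kernel maps the page and returns to the *same* instruction,
which re-executes from element 0 - no partial-progress state to save.

#include <stddef.h>

/* hypothetical placeholders for the vector unit */
size_t vec_setvl(size_t remaining);   /* hardware caps vl at MAXVECTORLEN */
void   vec_fadd(double *d, const double *a, const double *b, size_t vl);

void fadd_array(double *d, const double *a, const double *b, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t vl = vec_setvl(n - i);
        vec_fadd(&d[i], &a[i], &b[i], vl);  /* completes or faults whole */
        i += vl;
    }
}

The trade-off noted above is visible here: the re-execution cost on a
fault is bounded by vl, so it grows on 16- or 32-lane implementations.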

Luke Kenneth Casson Leighton

unread,
Apr 20, 2018, 9:48:50 PM4/20/18
to Andrew Waterman, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
ok, understood - really appreciate the clarification: see below.

> You could certainly design a vector ISA that specified an imprecise
> exception model, and allowed partially completed vector operations to be
> exposed to software across exceptions. The Vector VAX and Hwacha are two
> examples of processors that did this in a limited fashion. These processors
> didn't employ this approach for higher performance; they did so to simplify
> the hardware: it removes the need to rename registers and the need to buffer
> speculative stores. These were in-order machines, so they didn't already
> have those facilities. The point is moot if you're planning to build an OOO
> machine: you need these facilities, anyway.

appreciated.

> The costs of the imprecise-exception model are greater than the benefit.
> Software doesn't want to cope with it. It's hard to debug. You can't
> migrate state between different microarchitectures--unless you force all
> implementations to support the same imprecise-exception model, which would
> greatly limit implementation flexibility. (Less important, but still
> relevant, is that the imprecise model increases the size of the context
> structure, as the microarchitectural guts have to be spilled to memory.)

understood... and extremely valuable insights which an implementor
should take into account when making a decision, so if it's ok with
you I'm going to pretty much add this verbatim to the document I'm
writing, as a "warning" - or something to be aware of.

my point thus becomes: should RVV impose "sequential semantics" *even
though* the cost of the imprecise-exception model may be high? What if
an implementor comes up with an innovative solution to that problem?
Or, if in their particular specially-targeted use-case, it doesn't
matter? Examples of that could include proprietary 3D GPUs where
there's a proprietary software stack that comes alongside the
hardware implementation. As far as a licensee is concerned it's "not
their problem!".

Basically I'm attempting to make the case that RVV "sequential
semantics" are - were - an artefact of architectural decisions made in
the Reference Implementations (Hwacha) that could lead to restrictions
on future innovation, along-side albeit *extremely carefully*
thought-through implications, right down to the software stack and
debugging environment (which, now that I think about it, apply to
scalar OoO microarchitectures just as well)

By removing the restriction there is indeed plenty of room for
foot-shooting (another technical term like "cleverness beans"), at the
same time permitting innovation for those implementors with an excess
in the "bean" department.

Am I making sense?

l.

p.s. Andrew: going into "reflection" mode, this is in part what I was
referring to a few days ago: whilst it's extremely important to
recognise the huge value of what's been achieved, *and* also recognise
the huge value of the absolutely enormous amount of research that's
gone into the decision-making process, I also feel that it's very very
important that the Standard(s) *not* place restrictions on
implementors which prevent them from exploring and innovating. The
above is one such area; I apologise it's taken several rounds to get
to the point - inevitable as it is, with our completely different
backgrounds, to establish a common frame of reference.

Andrew Waterman

unread,
Apr 20, 2018, 11:02:42 PM4/20/18
to Luke Kenneth Casson Leighton, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
As I think I mentioned in an earlier email, providing precise exceptions is a platform issue, not just an ISA issue.  In some platforms, memory-protection exceptions are inherently fatal (e.g., simple embedded systems).  Mandating precise exceptions there doesn't benefit SW much, so perhaps the platform shouldn't mandate that the HW provide that abstraction.

For applications-/server-class processors, where precise exceptions benefit the most, it remains legal to punt to M-mode to help out with HW corner cases.  If the implementation provides enough information to M-mode so that it can recover precise state and present it to S/U-mode, it can still be considered conformant.  (It's not altogether different from the RISC-V philosophy of misaligned memory accesses; it's just more complicated.)


 Basically I'm attempting to make the case that RVV "sequential
semantics" are - were - an artefact of architectural decisions made in
the Reference Implementations (Hwacha) that could lead to restrictions
on future innovation, along-side albeit *extremely carefully*
thought-through implications, right down to the software stack and
debugging environment (which, now that I think about it, apply to
scalar OoO microarchitectures just as well)

If a hypothetical SW system no longer needs this hardware feature, then a hypothetically superior HW platform will eventually cater to it.

But this same debate has already happened, decades ago, back in the early days of scalar ILP.  After much wrangling, precise exceptions arose as the clearly superior choice, in no small part because microarchitectural innovation closed the performance gap enough that the software pain could no longer be justified.  We now have the benefit of all that hindsight.


 By removing the restriction there is indeed plenty of room for
foot-shooting (another technical term like "cleverness beans"), at the
same time permitting innovation for those implementors with an excess
in the "bean" department.

 Am I making sense?

Yes, with the caveat that others are entitled to expressly disdain the time this mailing list spends on foot-shooting.

Jacob Bachmeyer

unread,
Apr 20, 2018, 11:49:05 PM4/20/18
to Allen Baum, Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev
Allen Baum wrote:
> If a vector load encounters lots of faults, you're doing something wrong.
> A structure that is large enough to generate multiple faults probably
> should be using superpages - and the whole problem is avoided.

There is a small catch: a vector-gather LOAD can potentially span the
address space, so superpages are only a partial solution.


-- Jacob
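
Jacob's catch is easy to quantify with a back-of-the-envelope sketch
(the functions and worst-case framing are illustrative, not from any
spec): for a unit-stride access the pages touched shrink as the page
size grows, but for a gather the *indices* choose the pages, so no page
size bounds the count below vl.

#include <stddef.h>

/* contiguous access of vl elements: at most ceil(bytes/page_sz) pages,
 * plus one if the base straddles a boundary - with 2 MiB superpages
 * this is almost always 1 or 2 */
size_t pages_unit_stride_worst(size_t vl, size_t elem_sz, size_t page_sz)
{
    size_t bytes = vl * elem_sz;
    return (bytes + page_sz - 1) / page_sz + 1;
}

/* gather: adversarial (or merely unlucky) indices can land every
 * element on a different page, regardless of page size */
size_t pages_gather_worst(size_t vl)
{
    return vl;
}

With vl = 16 and 8-byte elements, a 2 MiB superpage bounds the
unit-stride case at two translations; a gather can still need 16.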

Jacob Bachmeyer

unread,
Apr 20, 2018, 11:53:41 PM4/20/18
to Andrew Waterman, Allen Baum, Luke Kenneth Casson Leighton, RISC-V ISA Dev
Andrew Waterman wrote:
> On Fri, Apr 20, 2018 at 6:46 AM Allen Baum
> <allen...@esperantotech.com <mailto:allen...@esperantotech.com>>
> wrote:
>
> If a vector load encounters lots of faults, you're doing something
> wrong.
> A structure that is large enough to generate multiple faults
> probably should be using superpages - and the whole problem is
> avoided.
> That's how you accelerate this particular problem - no
> complicated state machines required. KISS principle. Cleverness
> beans saved for something more important.
>
> Having said that: you don't want to have to re-execute a vector
> load with a single fault from the beginning, as opposed to where
> it faulted. There is "probing" version that sets a predicate
> register bit on a fault.
> That actually sounds backwards; wouldn't you want to initialize
> the predicate to all1, and have a predicated version that loads
> vector elements that have set predicate bits, and have it clear
> bits that don't fault?
> I could be misunderstanding how that is supposed to work...
>
> In any case, there is still the issue with dealing with page
> faults in the non-speculative version (live with re-execution?) .
>
>
> The proposal is to have a CSR that holds the faulting element number;
> this is also used to specify which element to resume execution at. (A
> successfully executed vector instruction resets this CSR to 0 so the
> next vector instruction starts at the beginning.)

Will this CSR also be available as a read-only "vector progress
indicator" for user code reporting progress on pending vector
operations? (Is that even feasible?)


-- Jacob
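
A sketch of the supervisor side of the proposal quoted above, under
loudly-stated assumptions: the CSR is called vstart here, and the
csr_read_*/map_page() helpers are invented, not a definitive interface.
The key property is that sepc still points at the vector instruction;
the CSR names the element to resume from, and a successfully completed
vector instruction resets it to 0.

#include <stdint.h>

/* invented helpers standing in for CSR reads and VM bookkeeping */
uint64_t csr_read_vstart(void);   /* faulting/resume element number */
uint64_t csr_read_stval(void);    /* faulting virtual address       */
void     map_page(uint64_t va);   /* resolve the page fault         */

void vector_page_fault_handler(void)
{
    uint64_t elem = csr_read_vstart();  /* element j that faulted */
    map_page(csr_read_stval());
    (void)elem;  /* nothing to fix up: elements 0..j-1 already committed */
    /* sret returns to sepc - the same vector instruction - and the
     * hardware resumes at element j because the CSR still holds j */
}

This also bears on Jacob's question above: on a context switch the pair
(sepc, vstart-analogue) must be saved and restored together, which is
Andrew's point later in the thread.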

Jacob Bachmeyer

unread,
Apr 21, 2018, 12:02:49 AM4/21/18
to Luke Kenneth Casson Leighton, Andrew Waterman, Allen Baum, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
> [...]
>
> If as we discussed previously the results are thrown away, the
> opportunity is lost for an OoO architecture to *go ahead* with the ADD
> *and the ST.X* in the element positions where the *LD.X succeeded*.
>

Exposing that parallelism to user space makes page faults visible, and
might permit an exploit to succeed despite causing a fatal page fault.
In other words, here is a "vector Meltdown" scenario: consider a
hypothetical ultra-high-performance system with extensive MMIO
(including network) mapped all the way to user-space. ASLR is in use,
but a software error leads to a possible exploit, normally mitigated by
ASLR. The attacker supplies an exploit payload that uses a
vector-gather LOAD and linear vector STORE to guess a large number of
possible locations for some small piece of valuable data, like the
private key for a server's TLS certificate, and copy them all to the NIC
outgoing packet buffer. One element is LOADed successfully and written
to the outgoing packet buffer by the speculative STORE, *then* the page
fault trap is taken. By the time the supervisor detects the error, the key
has already been sent to the attacker.

Speculative memory access is *VERY* dangerous. Computing with
speculated values makes Spectre attacks work.


-- Jacob
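
The ordering at the heart of this scenario can be written down
schematically (all names, the buffer and the loop rendering of the
vector ops are invented; this is C-syntax pseudocode of the hazard, not
a working exploit):

#include <stdint.h>
#define VL 64

/* one vector-gather LOAD feeding one vector STORE to an MMIO packet
 * buffer.  Most attacker-chosen guesses fault; one hits real data.
 * Element-precise semantics mean nothing past the *first* faulting
 * element ever becomes architecturally visible, so the lucky element's
 * store is suppressed.  Letting later elements complete despite an
 * earlier fault - the "continue with elements 2-7" idea - is exactly
 * what would let the one good element reach the NIC buffer before the
 * trap is delivered. */
void gather_then_store(volatile uint8_t *nic_buf,
                       const uint8_t *const guess[VL])
{
    for (int k = 0; k < VL; k++) {
        uint8_t v = *guess[k];   /* gather element k: may fault       */
        nic_buf[k] = v;          /* store element k to the out-buffer */
    }
}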

Luke Kenneth Casson Leighton

unread,
Apr 21, 2018, 12:45:49 AM4/21/18
to Andrew Waterman, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Sat, Apr 21, 2018 at 4:02 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

>> Basically I'm attempting to make the case that RVV "sequential
>> semantics" are - were - an artefact of architectural decisions made in
>> the Reference Implementations (Hwacha) that could lead to restrictions
>> on future innovation, along-side albeit *extremely carefully*
>> thought-through implications, right down to the software stack and
>> debugging environment (which, now that I think about it, apply to
>> scalar OoO microarchitectures just as well)
>
>
> If a hypothetical SW system no longer needs this hardware feature, then a
> hypothetically superior HW platform will eventually cater to it.
>
> But this same debate has already happened, decades ago, back in the early
> days of scalar ILP. After much wrangling, precise exceptions arose as the
> clearly superior choice, in no small part because microarchitectural
> innovation closed the performance gap enough that the software pain could no
> longer be justified. We now have the benefit of all that hindsight.

... so two (logically-chained) things leave me still puzzled.

* scalar OoO implementations exist, and have [i *hope*] solved this,
satisfactorily. today we don't see people *not* implementing OoO in
scalar architectures because exceptions are problematic.

* Simple-V is in effect, by dropping vectors onto *standard* scalar
register files, just an "instruction generator". A compact way to put
3 (or more) instructions with regularly-increasing register indices
into the exact same *Scalar* OoO-microarchitected execution engine.

Combine those two and I honestly and genuinely don't see the problem.
Scalar OoO has it "solved", so why should
Scalar-OoO-with-a-multi-instruction-generator-on-the-front-end be any
different? no blindfolds issued, no loaded pistols being waved about
with the safety off.
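
For concreteness, the "instruction generator" reading of Simple-V can
be sketched in a few lines - a model under stated assumptions, not a
definitive decode design; uop_t and issue_scalar_uop() are invented
front-end hooks. One vectorised op with VL=N becomes N ordinary scalar
ops over consecutive register indices, and the unmodified scalar OoO
back-end renames, reorders and (on a fault) cancels them exactly as it
would hand-written scalar code.

/* invented front-end hook: hand one scalar micro-op to the OoO core */
typedef struct { int opcode, rd, rs1, rs2; } uop_t;
void issue_scalar_uop(uop_t u);

/* expand a vectorised instruction into vl scalar uops with
 * regularly-increasing register indices */
void simplev_expand(int opcode, int rd, int rs1, int rs2, int vl)
{
    for (int i = 0; i < vl; i++) {
        uop_t u = { opcode, rd + i, rs1 + i, rs2 + i };
        issue_scalar_uop(u);
    }
}

e.g. a vectorised FADD with base registers x8, x16, x24 and vl = 3
issues three independent scalar FADDs, which is why the scalar
machinery's precise-exception handling would carry over unchanged.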

please do give me some time to think about this - no need to reply.


>> By removing the restriction there is indeed plenty of room for
>> foot-shooting (another technical term like "cleverness beans"), at the
>> same time permitting innovation for those implementors with an excess
>> in the "bean" department.
>>
>> Am I making sense?
>
> Yes, with the caveat that others are entitled to expressly disdain the time
> this mailing list spends on foot-shooting.

we all gotta learn somehow! and an effort in foot-shooting might
actually end up hypothetically winning the 1,000 yard rifle
competition completely by accident. million monkeys 'an'all... :)

but seriously: one of the things about history repeating itself is
that the hard lessons tend to sink in a bit better than if they were
avoided.

thanks for taking the time to go over this, Andrew, I'm happy
(fiiinally) to have been able to successfully get across something
that I feel is important.

l.

Jacob Bachmeyer

unread,
Apr 21, 2018, 1:06:17 AM4/21/18
to Luke Kenneth Casson Leighton, Andrew Waterman, Allen Baum, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
>
> On Sat, Apr 21, 2018 at 4:02 AM, Andrew Waterman
> <wate...@eecs.berkeley.edu> wrote:
>>> Basically I'm attempting to make the case that RVV "sequential
>>> semantics" are - were - an artefact of architectural decisions made in
>>> the Reference Implementations (Hwacha) that could lead to restrictions
>>> on future innovation, along-side albeit *extremely carefully*
>>> thought-through implications, right down to the software stack and
>>> debugging environment (which, now that I think about it, apply to
>>> scalar OoO microarchitectures just as well)
>>>
>> If a hypothetical SW system no longer needs this hardware feature, then a
>> hypothetically superior HW platform will eventually cater to it.
>>
>> But this same debate has already happened, decades ago, back in the early
>> days of scalar ILP. After much wrangling, precise exceptions arose as the
>> clearly superior choice, in no small part because microarchitectural
>> innovation closed the performance gap enough that the software pain could no
>> longer be justified. We now have the benefit of all that hindsight.
>>
>
> ... so two (logically-chained) things leave me still puzzled.
>
> * scalar OoO implementations exist, and have [i *hope*] solved this,
> satisfactorily. today we don't see people *not* implementing OoO in
> scalar architectures because exceptions are problematic.
>

Correct; we get Meltdown when people get OoO exceptions *wrong* in
scalar architectures.

> * Simple-V is in effect, by dropping vectors onto *standard* scalar
> register files, just an "instruction generator". A compact way to put
> 3 (or more) instructions with regularly-increasing register indices
> into the exact same *Scalar* OoO-microarchitected execution engine.
>

That is the problem: is that "instruction generator" worth its
associated costs?

> Combine those two and I honestly and genuinely don't see the problem.
> Scalar OoO has it "solved", so why should
> Scalar-OoO-with-a-multi-instruction-generator-on-the-front-end be any
> different? no blindfolds issued, no loaded pistols being waved about
> with the safety off.
>

The big problem is that instruction decode can be on the critical path
and Simple-V makes instruction decode more complex -- the processor must
extract the register numbers and then "look up" whether to use the
Simple-V instruction generator or simply execute the instruction.
RISC-V keeps the register numbers in specific places across all
instructions, and ISTR (I seem to recall) commentary in the spec that
this shortens a critical path in some/most implementations.


-- Jacob

Andrew Waterman

unread,
Apr 21, 2018, 4:06:12 AM4/21/18
to Jacob Bachmeyer, Allen Baum, Luke Kenneth Casson Leighton, RISC-V ISA Dev
On Fri, Apr 20, 2018 at 8:53 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Will this CSR also be available as a read-only "vector progress
> indicator" for user code reporting progress on pending vector
> operations? (Is that even feasible?)

It's basically an extension of epc, so anywhere you'd need epc, you'd need "vprogress."  So it necessarily becomes part of the user context.

Allen Baum

unread,
Apr 21, 2018, 5:25:22 AM4/21/18
to jcb6...@gmail.com, Andrew Waterman, Luke Kenneth Casson Leighton, RISC-V ISA Dev
A wise person once said:
If it hurts when you do that, don’t do that.

Yes, it is possible to cover the entire address space with a single vector load.

And if you’re doing that, you’re doing something wrong also. You will have horrible performance. Fix your code - don’t expect HW to solve your problem.

-Allen

Andrew Waterman

unread,
Apr 21, 2018, 6:42:40 AM4/21/18
to Luke Kenneth Casson Leighton, Allen Baum, Jacob Bachmeyer, RISC-V ISA Dev
The Simple-V concept can be implemented in a performant way for some microarchitectural styles.

Yet, there are two important use cases where it isn't a good idea:
- The tried-and-true simple control processor with decoupled vector unit
- The Intel-style OOO-AVX unit

Although the microarchitectural reasoning is different between the two, the explanation is the same: neither machine wants to share the scalar register file with the vector register file.

The Simple-V concept has merit for systems constrained primarily by regfile capacity.  So I'd view it as a competitor to P.


please do give me some time to think about this - no need to reply.


>>  By removing the restriction there is indeed plenty of room for
>> foot-shooting (another technical term like "cleverness beans"), at the
>> same time permitting innovation for those implementors with an excess
>> in the "bean" department.
>>
>>  Am I making sense?
>
> Yes, with the caveat that others are entitled to expressly disdain the time
> this mailing list spends on foot-shooting.

 we all gotta learn somehow!  and an effort in foot-shooting might
actually end up hypothetically winning the 1,000 yard rifle
competition completely by accident.  million monkeys 'an'all... :)

 but seriously: one of the things about history repeating itself is
that the hard lessons tend to sink in a bit better than if they were
avoided.

 thanks for taking the time to go over this, Andrew, I'm happy
(fiiinally) to have been able to successfully get across something
that I feel is important.

Thank you.



Luke Kenneth Casson Leighton

unread,
Apr 22, 2018, 10:59:09 PM4/22/18
to Jacob Bachmeyer, Andrew Waterman, Allen Baum, RISC-V ISA Dev
On Sat, Apr 21, 2018 at 6:06 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> The big problem is that instruction decode can be on the critical path and
> Simple-V makes instruction decode more complex --

with MISA switching off instructions, it's already more complex

> the processor must extract
> the register numbers and then "look up" whether to use the Simple-V
> instruction generator or simply execute the instruction. RISC-V keeps the
> register numbers in specific places across all instructions and ISTR
> commentary in the spec that this shortens a critical path in some/most
> implementations.

state (as low as one bit per register) can be generated *from* the
relevant CSRs and held closer to the decode engine.

l.
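
Luke's closing point can be sketched concretely (field positions are
the base encoding's; the tag register, its width and the hook names are
assumptions for illustration): the CSR-configured state is flattened
into a one-bit-per-register map, regenerated on CSR writes and latched
next to the decoder, so the "look up" Jacob describes costs one
shift/AND per register field rather than a CSR walk.

#include <stdint.h>
#include <stdbool.h>

static uint32_t vec_tag;  /* bit r set => register r is vectorised */

/* regenerate the tag whenever the relevant Simple-V CSR is written */
void on_simplev_csr_write(unsigned reg, bool vectorised)
{
    if (vectorised) vec_tag |=  (1u << reg);
    else            vec_tag &= ~(1u << reg);
}

/* decode stage: register fields sit in fixed positions (rd [11:7],
 * rs1 [19:15], rs2 [24:20]), so extraction is unchanged; the only
 * addition is the tag lookup */
bool needs_expansion(uint32_t insn)
{
    unsigned rd  = (insn >> 7)  & 0x1f;
    unsigned rs1 = (insn >> 15) & 0x1f;
    unsigned rs2 = (insn >> 20) & 0x1f;
    return (((vec_tag >> rd) | (vec_tag >> rs1) | (vec_tag >> rs2)) & 1u) != 0;
}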