Re: [isa-dev] do we need an architectural speculation barrier?


Jose Renau

unread,
Jan 29, 2019, 1:55:30 AM1/29/19
to Luke Kenneth Casson Leighton, Andrew Waterman, Daniel Lustig, Michael Clark, Jim Wilson, Christopher Celio, RISC-V ISA Dev

My main "requirement" is to allow hardware to handle this spectre leaks efficiently. I mean,
if we add instructions, hardware that does not leak should be able to perform them as NOPs,
and try to avoid Gazzilions of NOPs.

My other "want" is being able to mark "time domains" or when is possible to have a time leak
between two groups and when it is not. E.g: it is OK to time like between threads in a PARSEC
application, but not OK between threads in a web browser. There should be some RISCV way to
mark this efficiently.

Jose Renau
Professor, Computer Science & Engineering
On Jan 28 2019, at 8:12 pm, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
On Fri, Aug 3, 2018 at 3:54 PM Jose Renau <re...@ucsc.edu> wrote:

You can do speculation and be Spectre safe as long as there are no side effects. Current cores did not do it because they did not care.

Check the last RISC-V workshop talk from Chris. It shows how to solve the issue.

Or check this longer talk

so it's amazing how much difference a few months can make, as, thanks
to mitch alsup and others on comp.arch i did an intensive in-depth
study of out-of-order systems. this is an extremely useful
presentation, jose, as it helps categorise the domains in which timing
attacks occur.

the long and short of it is: OoO is f*****d, big-time. yet, we can't
"give up" and go back to in-order single-issue as the benefits of the
increased performance dramatically outweigh the security risks in the
majority of use-cases.

the area that is hardest to protect against *in hardware only* is the
same-process one (which is covered by the "APIs" category in your
presentation, jose).

inter-core and across kernelspace-userspace boundaries can be dealt
with in hardware. inter-core: instead of a shared DIV unit, have one
per core. across kernelspace-userspace boundaries (exceptions,
basically), the processor has an atomic hardware event at which the
engine may be paused until it reaches a quiescent uniform state.

same-process timing attacks ("API" category) however simply cannot be
dealt with - they cannot even be detected - without an *actual*
instruction being called which tells the hardware "this is an API
call, we have to quiesce the internal speculation state, right now".

these instructions need to be done as hints, because they also need
to be ignored by in-order systems.

*we need consensus on what to do* as a group, here. this cannot be
left for just one RISC-V implementor to proceed without a discussion,
as it involves modifying software right across the board, in userspace
*and* kernelspace *and* firmware *and* bootloaders.

who is going to step forward and take responsibility for leading the
discussion?

l.

Jose Renau

unread,
Jan 29, 2019, 10:09:20 AM1/29/19
to Luke Kenneth Casson Leighton, Andrew Waterman, Daniel Lustig, Michael Clark, Jim Wilson, Christopher Celio, RISC-V ISA Dev

See inline

Jose Renau
Professor, Computer Science & Engineering
On Jan 29 2019, at 12:40 am, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
On Tue, Jan 29, 2019 at 6:55 AM 'Jose Renau' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:

My main "requirement" is to allow hardware to handle this spectre leaks efficiently. I mean,
if we add instructions, hardware that does not leak should be able to perform them as NOPs,
and try to avoid Gazzilions of NOPs.


that's why i advocate them to be "hints".  they need to be the first "official" hints, otherwise the first implementor that uses "custom" hints for public distribution (public world-wide release of upstream patches to gcc and other software that requires the hints) will cause fragmentation of the RISC-V ecosystem.
 
My other "want" is being able to mark "time domains" or when is possible to have a time leak
between two groups and when it is not.

 that's same-process, right?  do you envisage that to still result in quiescing of the internal state, except perhaps each time domain having the equivalent of an ASID? (an identifier per time-domain?)

 
E.g.: it is OK to have a time leak between threads in a PARSEC
application, but not OK between threads in a web browser.

would that be based on an assumption that web browsers cannot trigger spectre-style timing attacks?  or have i misunderstood?  the reason i ask is that it is known that javascript may be used to trigger spectre-style timing attacks:

I meant the opposite. In Javascript we need protection between different threads ("Time domains" == threads that need time side channel isolation) because threads do not share data. In PARSEC, we have threads, but we do not need protection because they share pointers and data.

I agree that extending the ASIDs may be a way to deal with it. Currently, different ASIDs are used for different processes; if we have different ASIDs for different threads, we could use this to mark time domains.

A time domain is a collection of threads that can share data and for which it is OK to have time leaks within the same time domain. It is not OK to have time leaks across time domains.

 
There should be some RISC-V way to
mark this efficiently.


 hints [hints are operations that have no effect, that one microarchitecture may take to mean "something", whilst they are NOPs on others].
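
purely as an illustrative sketch (no hint encoding has been allocated for this - the choice below is invented), a compiler wrapper for such a hint could use one of the integer computational encodings that write to x0, which have no architectural effect and therefore execute as plain NOPs on cores that do not recognise them:

/* hypothetical speculation-fence hint.  "addi x0, a0, 1" writes to x0 and so
 * has no architectural effect: an implementation that does not recognise it
 * executes a NOP, one that does may quiesce its speculative state.  the
 * encoding is invented purely for illustration. */
static inline void speculation_fence_hint(void)
{
#if defined(__riscv)
    __asm__ volatile ("addi x0, a0, 1" ::: "memory");
#endif
}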

l.

Jose Renau

unread,
Jan 29, 2019, 12:36:36 PM1/29/19
to Luke Kenneth Casson Leighton, Andrew Waterman, Daniel Lustig, Michael Clark, Jim Wilson, Christopher Celio, RISC-V ISA Dev

I think that the Time Domain ID can be fairly low overhead.

E.g.: if we use bits 62 to 53 of the physical address space (10 bits), we can have 1024 different IDs at the same time. The hypervisor can assign different Time Domain IDs (TDIDs) and use the upper bits of the physical address as the ID. It should be transparent to software. Only software that wants to use it needs to make sure it has different upper physical bits (a request to the OS, and the OS maps to available time domain IDs).
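
As a rough sketch of the packing (bit positions as in the example above; the helper names are made up):

#include <stdint.h>

/* Pack/unpack a Time Domain ID in bits 62:53 of a physical address
   (10 bits, 1024 IDs).  Illustrative only; the real bit positions and
   plumbing would be an implementation/hypervisor choice. */
#define TDID_SHIFT 53
#define TDID_BITS  10
#define TDID_MASK  ((((uint64_t)1 << TDID_BITS) - 1) << TDID_SHIFT)

static inline uint64_t pa_with_tdid(uint64_t pa, uint64_t tdid)
{
    return (pa & ~TDID_MASK) | ((tdid << TDID_SHIFT) & TDID_MASK);
}

static inline uint64_t tdid_of_pa(uint64_t pa)
{
    return (pa & TDID_MASK) >> TDID_SHIFT;
}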

I think it should be possible by extending the ASID concept, but I have not gone over the details.

Jose Renau
Professor, Computer Science & Engineering
On Jan 29 2019, at 8:29 am, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:


On Tuesday, January 29, 2019, 'Jose Renau' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:

See inline

Ack. 

I meant the opposite. In Javascript we need protection between different threads ("Time domains" == threads that need time side channel isolation) because threads do not share data. In PARSEC, we have threads, but we do not need protection because they share pointers and data.


Right got it. I did wonder :)

In SE/Linux the security boundary is exec (not even fork, because fork can share sockets). With JS and other interpreters (python, java) there will be assumptions involving mutexes and so on...

This is a massive deal: it's a huge paradigm shift in how programming needs to be done. Just as people are taught how to do sockets, knowing when to call the time-speculation fence hint will need to become just as prevalent, with tutorials and example code online.

dang.
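
to make it concrete, the sort of pattern every tutorial would end up teaching is roughly this (speculation_fence() is a stand-in for whatever hint eventually gets standardised - it does not exist today):

/* stand-in for the proposed hint: on real hardware it would expand to the
 * (not yet allocated) speculation-fence encoding, and to a NOP elsewhere. */
static inline void speculation_fence(void) { }

typedef long (*untrusted_fn)(long arg);

/* calling across a trust boundary: fence on entry and on exit, exactly like
 * taking and releasing a lock around a critical section. */
long run_untrusted(untrusted_fn fn, long arg)
{
    speculation_fence();    /* quiesce before entering the untrusted domain */
    long result = fn(arg);
    speculation_fence();    /* and again before returning to trusted code   */
    return result;
}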

I agree that extending the ASIDs may be a way to deal with it. Currently, different ASIDs are used for different processes; if we have different ASIDs for different threads, we could use this to mark time domains.

That in turn implies that, realistically, there needs to be a "start of domain" hint, not just a "transition" hint.

Effectively similar to how mutexes work.  Or how the Win32 calls EnterCriticalSection and LeaveCriticalSection work.

Without both a start and end hint to mark the critical section, the risk is that the program may continue to run after a transition, thinking that it is in a given time domain when in fact it is not.

Also... deep breath: the TDID (time domain id) needs to be saved and restored on context switch.

Ah... is the TDID actually something that needs to be pushed on the stack? Argh, I think it is.  A function call may need to be temporarily in one time domain, and switch back to the former on exit.
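
something like this, purely as a sketch of the save/restore discipline (the TDID register is hypothetical - a plain variable stands in here for whatever per-hart register or CSR it would actually be):

#include <stdint.h>

/* stand-in for a hypothetical per-hart TDID register; on real hardware this
   would be a register/CSR write, plus the speculation fence at each change. */
static uint64_t current_tdid;

static inline uint64_t tdid_enter(uint64_t new_tdid)
{
    uint64_t saved = current_tdid;   /* "push": remember the caller's domain  */
    current_tdid = new_tdid;         /* switch into the callee's time domain  */
    return saved;
}

static inline void tdid_exit(uint64_t saved)
{
    current_tdid = saved;            /* "pop": restore the caller's domain    */
}

/* a function that must temporarily run in its own time domain */
void sensitive_call(uint64_t my_tdid)
{
    uint64_t saved = tdid_enter(my_tdid);
    /* ... work that must not leak timing into the caller's domain ... */
    tdid_exit(saved);
}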

Are TDIDs to be shared across processes? Don't know the answer to that one. Doesn't sound to me like they should be.  To make them unique, therefore, at the hardware level they would need to be concatenated with the ASID from the TLB.

Are TDIDs to be shared across cores? No, standard hardware spectre timing mitigation is supposed to take care of that boundary.

This is really very involved, and there is not a lot of choice in the matter if OoO and speculation (and the associated performance) are to be kept.


A time domain is a collection of threads that can share data and that it is OK to have time leaks within the same time domain. It is not OK to have time leaks across time domains.

Concur. 

L.



--
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

lk...@lkcl.net

unread,
Jan 31, 2019, 12:50:53 AM1/31/19
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, dlu...@nvidia.com, m...@sifive.com, ji...@sifive.com, ce...@berkeley.edu
[apologies to cc recipients who may have already received it: this message has not shown up on the isa-dev mailing list: reposting]

On Tue, Jan 29, 2019 at 5:36 PM 'Jose Renau' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:

I think that the Time Domain ID can be fairly low overhead.

E.g.: if we use bits 62 to 53 of the physical address space (10 bits), we can have 1024 different IDs at the same time. The hypervisor can assign different Time Domain IDs (TDIDs) and use the upper bits of the physical address as the ID. It should be transparent to software. Only software that wants to use it needs to make sure it has different upper physical bits (a request to the OS, and the OS maps to available time domain IDs).

I think it should be possible by extending the ASID concept, but I have not gone over the details.

allow me to take a step back, make an assertion, and then (haha) do some speculative branch-prediction of the conversation.

the hypothesis is that the TDID is not needed (nor the invasive paradigm shift in computing), on the basis that a uniform quiescence of the OoO engine to a known state is all that is needed when switching from one time domain to another.

would you agree with that?  if not, please do ignore the branch-predicted path of the conversation that follows :)

some background:

a way to state spectre-based timing attacks is that a given (untrusted) instruction may affect the completion time of past *or future* instructions, the timing being potentially affected through shared resource bottlenecks of numerous different types.

in-order systems are [typically] immune to timing attacks precisely because they are specifically designed never to stall the pipeline(s).  any given instruction *always* [typically] completes in a fixed time independent of past [and future] instructions, because that's just the way that the pipelines, register ports, caches (and TLB?) are set up.

put over-simply: in an in-order system there *is* no speculation by which a stall may be caused (which by definition *is* a timing attack).

so this is why (in other threads) i described that OoO systems may be made immune to timing attacks by *massively* over-resourcing the number of ports on the register file, as well as the bandwidth on the operand forwarding bus (if one exists), massively over-resourcing the number of Function Units (if a scoreboard design is utilised), and backing down the amount of branch-prediction and instruction-issue to the point where it can be *formally proven* that all and any given OoO instructions *will* complete in a guaranteed time.

augmentations to that include permitting resource-consuming speculation on the proviso that if the resources being used for speculation are required for a *non*-speculative instruction, the non-speculative instruction takes absolute guaranteed precedence *in the same instruction cycle*.

basically, it's hell to implement and takes up huge numbers of gates, hence the need for the alternative solutions.


so the idea is that as long as one group of instructions (in one time domain) has no way to determine any information from a group of instructions in another time domain, we're ok.

my point is: that does *not* necessarily mean that it is necessary to assign an ID *to* any given Time Domain.  we *only* need to guarantee a means of separation *between* them.

now, if it were the case that there was some sort of special instruction usage (a restricted subset of instructions or features of instructions) that would guarantee that certain *TYPES* of spectre-style timing attacks were known NEVER to occur (across any given Time Domain transition), THEN it would be useful to assign TDIDs to groups of instructions, and, in a similar fashion to memory FENCE instructions, use the change of TDID to identify which spectre-related resources needed to be quiesced, thus, we reason, reducing latency i.e. the amount of time needed to wait for the processor to quiesce to a known-good (uniform) state.

an example would be that it was known (guaranteed and formally declared by the application writer) that a given Time Domain was not going to use any DIV instructions.  thus, the TDID-FENCE instruction could declare "This TDID does not use DIV", and, consequently, on switching from one TDID to another, if during the transition there happen to be some outstanding DIV operations, they need not be quiesced.

clearly, if the Time Domain violated that constraint (by then actually using a DIV operation when it had formally declared that it was not going to), an exception would need to be raised.

which means in turn that one of the primary advantages for having Time Domains is even more complex than formerly envisaged.
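
purely for concreteness (and note this is exactly the idea i am about to argue against), such a declaration could be as simple as a bitmask of resource classes carried alongside the TDID - everything below is hypothetical:

#include <stdint.h>

/* hypothetical resource classes a time domain could declare it never uses,
 * so that a TDID-FENCE need not quiesce the corresponding units. */
enum {
    TD_USES_DIV  = 1u << 0,    /* integer divide units          */
    TD_USES_FPU  = 1u << 1,    /* floating-point pipelines      */
    TD_USES_LRSC = 1u << 2,    /* load-reserved / store-cond.   */
};

/* the declaration the fence would carry: "this domain only ever uses ..." */
typedef struct {
    uint32_t tdid;
    uint32_t declared_resources;   /* using anything else -> raise an exception */
} td_fence_decl;

/* the check the hardware would (conceptually) make on every issued instruction */
static inline int td_violation(const td_fence_decl *d, uint32_t resource_bit)
{
    return (d->declared_resources & resource_bit) == 0;
}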

my assertion is: in the case of spectre-style timing attacks, unlike memory FENCE instructions, i do not believe that there *are* any (safe) subdivisions of the types of attacks.  the whole basis of immunity against spectre *is* that the processor returns to a known-good quiescent state in which it is *guaranteed* that no instruction to be executed in the immediate future will be short of resources due to past ones still within the system.

or, more to the point: it is far, far too early and too little research has yet been done to be able to deploy such fine-grained Time-Domain-related security strategies.

which leaves a blanket, uniform "*everything* is quiesced" speculative fence instruction as the safest, simplest, most pragmatic option.

in other words, it is fortunate that a uniform quiescent state is what's needed, and it so happens that it doesn't matter what the domain is: all that matters *is* that the internal state is quiesced [1] at the transition point.

does that sound reasonable?

l.

[1] full quiescence may *or may not* be required.  remember that the actual requirement is that the subsequent instructions have 100% available resources such that they are guaranteed not to be affected by the past instructions already in the system.  so whilst on first analysis it may appear that a full commit to the register file is needed, a full cancellation of all speculative operations, etc. etc., this may not actually be the case.  it is however up to the architectural implementor to determine that, *not* the specification of the proposed FENCE hint itself.

Jacob Lifshay

unread,
Jan 31, 2019, 1:26:32 AM1/31/19
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Andrew Waterman, dlu...@nvidia.com, m...@sifive.com, ji...@sifive.com, ce...@berkeley.edu
On Wed, Jan 30, 2019, 21:50 lk...@lkcl.net <lk...@lkcl.net> wrote:
[apologies to cc recipients who may have already received it: this message has not shown up on the isa-dev mailing list: reposting]

On Tue, Jan 29, 2019 at 5:36 PM 'Jose Renau' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:

I think that the Time Domain ID can be fairly low overhead.

E.g.: if we use bits 62 to 53 of the physical address space (10 bits), we can have 1024 different IDs at the same time. The hypervisor can assign different Time Domain IDs (TDIDs) and use the upper bits of the physical address as the ID. It should be transparent to software. Only software that wants to use it needs to make sure it has different upper physical bits (a request to the OS, and the OS maps to available time domain IDs).

I think it should be possible by extending the ASID concept, but I have not gone over the details.

allow me to take a step back, make an assertion, and then (haha) do some speculative branch-prediction of the conversation.

the hypothesis is that the TDID is not needed (nor the invasive paradigm shift in computing), on the basis that a uniform quiescence of the OoO engine to a known state is all that is needed when switching from one time domain to another.

would you agree with that?  if not, please do ignore the branch-predicted path of the conversation that follows :)
That's part of it. The other part is ensuring that the speculation can't change any long-lived state that can be later observed, such as loading the cache on a cache miss with a variable address. Later, even if you force all speculative instructions to be killed, even if you've been executing nothing but nops for 10ms, you can still detect which rows were loaded into the cache by detecting how long it takes to access that address.
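
Roughly, the probe side of that looks like the following (a minimal sketch: the threshold is made up and has to be calibrated for the actual implementation, and rdcycle may not be available to user mode on every platform):

/* Time one load with the cycle counter; a "fast" access means the line was
   already in the cache, i.e. an earlier speculative load brought it in. */
#define CACHE_HIT_THRESHOLD 80   /* illustrative value only */

static inline unsigned long rdcycle(void)
{
    unsigned long c;
    __asm__ volatile ("rdcycle %0" : "=r"(c) : : "memory");
    return c;
}

int probe_was_cached(volatile unsigned char *addr)
{
    unsigned long t0 = rdcycle();
    (void)*addr;                      /* the timed access */
    unsigned long t1 = rdcycle();
    return (t1 - t0) < CACHE_HIT_THRESHOLD;
}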

some background:

a way to state spectre-based timing attacks is that a given (untrusted) instruction may affect the completion time of past *or future* instructions, the timing being potentially affected through shared resource bottlenecks of numerous different types.

in-order systems are [typically] immune to timing attacks precisely because they are specifically designed never to stall the pipeline(s).  any given instruction *always* [typically] completes in a fixed time independent of past [and future] instructions, because that's just the way that the pipelines, register ports, caches (and TLB?) are set up.

put over-simply: in an in-order system there *is* no speculation by which a stall may be caused (which by definition *is* a timing attack).

so this is why (in other threads) i described that OoO systems may be made immune to timing attacks by *massively* over-resourcing the number of ports on the register file, as well as the bandwidth on the operand forwarding bus (if one exists), massively over-resourcing the number of Function Units (if a scoreboard design is utilised), and backing down the amount of branch-prediction and instruction-issue to the point where it can be *formally proven* that all and any given OoO instructions *will* complete in a guaranteed time.

augmentations to that include permitting resource-consuming speculation on the proviso that if the resources being used for speculation are required for a *non*-speculative instruction, the non-speculative instruction takes absolute guaranteed precedence *in the same instruction cycle*.

basically, it's hell to implement and takes up huge numbers of gates, hence the need for the alternative solutions.


so the idea is that as long as one group of instructions (in one time domain) has no way to determine any information from a group of instructions in another time domain, we're ok.

my point is: that does *not* necessarily mean that it is necessary to assign an ID *to* any given Time Domain.  we *only* need to guarantee a means of separation *between* them.

now, if it were the case that there was some sort of special instruction usage (a restricted subset of instructions or features of instructions) that would guarantee that certain *TYPES* of spectre-style timing attacks were known NEVER to occur (across any given Time Domain transition), THEN it would be useful to assign TDIDs to groups of instructions, and, in a similar fashion to memory FENCE instructions, use the change of TDID to identify which spectre-related resources needed to be quiesced, thus, we reason, reducing latency i.e. the amount of time needed to wait for the processor to quiesce to a known-good (uniform) state.

an example would be that it was known (guaranteed and formally declared by the application writer) that a given Time Domain was not going to use any DIV instructions.  thus, the TDID-FENCE instruction could declare "This TDID does not use DIV", and, consequently, on switching from one TDID to another, if during the transition there happen to be some outstanding DIV operations, they need not be quiesced.

clearly, if the Time Domain violated that constraint (by then actually using a DIV operation when it had formally declared that it was not going to), an exception would need to be raised.

which means in turn that one of the primary advantages for having Time Domains is even more complex than formerly envisaged.

my assertion is: in the case of spectre-style timing attacks, unlike memory FENCE instructions, i do not believe that there *are* any (safe) subdivisions of the types of attacks.  the whole basis of immunity against spectre *is* that the processor returns to a known-good quiescent state in which it is *guaranteed* that no instruction to be executed in the immediate future will be short of resources due to past ones still within the system.

or, more to the point: it is far, far too early and too little research has yet been done to be able to deploy such fine-grained Time-Domain-related security strategies.

which leaves a blanket, uniform "*everything* is quiesced" speculative fence instruction as the safest, simplest, most pragmatic option.

in other words, it is fortunate that a uniform quiescent state is what's needed, and it so happens that it doesn't matter what the domain is: all that matters *is* that the internal state is quiesced [1] at the transition point.

does that sound reasonable?
sounds good to me, assuming all long-lived state is unaffected by previous speculation after the speculation fence finishes executing.

l.

[1] full quiescence may *or may not* be required.  remember that the actual requirement is that the subsequent instructions have 100% available resources such that they are guaranteed not to be affected by the past instructions already in the system.  so whilst on first analysis it may appear that a full commit to the register file is needed, a full cancellation of all speculative operations, etc. etc., this may not actually be the case.  it is however up to the architectural implementor to determine that, *not* the specification of the proposed FENCE hint itself.

Jacob Lifshay

lkcl

unread,
Feb 3, 2019, 4:01:52 AM2/3/19
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, dlu...@nvidia.com, m...@sifive.com, ji...@sifive.com, ce...@berkeley.edu
On Thursday, January 31, 2019 at 2:26:32 PM UTC+8, Jacob Lifshay wrote:

> > would you agree with that?  if not, please do ignore the branch-predicted path of the conversation that follows :)
> That's part of it. The other part is ensuring that the speculation can't change any long-lived state that can be later observed, such as loading the cache on a cache miss with a variable address. Later, even if you force all speculative instructions to be killed, even if you've been executing nothing but nops for 10ms, you can still detect which rows were loaded into the cache by detecting how long it takes to access that address.

Ok. Good point. And good advice to implementors. Not so necessary as a specification detail.

>
>

> > in other words, it is fortunate that a uniform quiescent state is what's needed, and it so happens that it doesn't matter what the domain is: all that matters *is* that the internal state is quiesced [1] at the transition point.
>
>
> > does that sound reasonable?
> sounds good to me, assuming all long-lived state is unaffected by previous speculation after the speculation fence finishes executing.

Ok. So this brings us back to the question, what is the next step?

More specifically, why is there no response on this absolutely critical area that affects absolutely every single out-of-order RISCV implementor?

It makes absolutely no sense that no one from the RISCV Foundation is responding to take responsibility for what is clearly an absolutely critical need for allocation of speculation hints, as well as raising awareness within the RISCV community as to the severity of the problem.

Bruce Hoult

unread,
Feb 3, 2019, 5:15:30 AM2/3/19
to lkcl, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Andrew Waterman, dlu...@nvidia.com, Michael Clark, Jim Wilson, ce...@berkeley.edu
On Sun, Feb 3, 2019 at 1:01 AM lkcl <luke.l...@gmail.com> wrote:
> More specifically, why is there no response on this absolutely critical area that affects absolutely every single out-of-order RISCV implementor?
>
> It makes absolutely no sense that no one from the RISCV Foundation is responding to take responsibility for what is clearly an absolutely critical need for allocation of speculation hints, as well as raising awareness within the RISCV community as to the severity of the problem.

Luke, would you kindly cease making such categoric, inflammatory, and
uninformed statements?

Even something as simple as following the RISC-V youtube channel or
viewing conference presentation slides would help you to stay current.

https://www.youtube.com/watch?v=yvaFpNNLkzw
https://www.youtube.com/watch?v=-xswp7uUd88
https://www.youtube.com/watch?v=uIbPt1v6QKE

Not to mention that the RISC-V Foundation *has* a Standing Committee
for security, with the people in the preceding videos on it

https://riscv.org/2018/07/risc-v-foundation-announces-security-standing-committee-calls-industry-to-join-in-efforts/

lkcl

unread,
Feb 3, 2019, 6:09:00 AM2/3/19
to Bruce Hoult, RISC-V ISA Dev, Andrew Waterman, Daniel Lustig, Michael Clark, Jim Wilson, Christopher Celio
On Sun, Feb 3, 2019 at 10:15 AM Bruce Hoult <bruce...@sifive.com> wrote:
>
> On Sun, Feb 3, 2019 at 1:01 AM lkcl <luke.l...@gmail.com> wrote:
> > More specifically, why is there no response on this absolutely critical area that affects absolutely every single out-of-order RISCV implementor?
> >
> > It makes absolutely no sense that no one from the RISCV Foundation is responding to take responsibility for what is clearly an absolutely critical need for allocation of speculation hints, as well as raising awareness within the RISCV community as to the severity of the problem.
>
> Luke, would you kindly cease making such categoric, inflammatory, and
> uninformed statements?

i asked five times. yours is the first response, and it's extremely rude.

you don't own, pay or control me. please cease ordering me about.

> Even something as simple as following the RISC-V youtube channel or
> viewing conference presentation slides would help you to stay current.

there was absolutely no need for you to be so rude. all you had to
do was say, "are you aware that there is a youtube channel and that
there are conference slides on this topic".

Bruce Hoult

unread,
Feb 3, 2019, 6:17:34 AM2/3/19
to lkcl, RISC-V ISA Dev, Andrew Waterman, Daniel Lustig, Michael Clark, Jim Wilson, Christopher Celio
Luke, I haven't seen you ask. I've seen you make false statements that
may mislead others. If you *asked* it would be different.

Are you aware that there is a riscv.org web page?
Are you aware that every page on that site has a bar at the top with
links to various riscv.org communications channels such as youtube,
twitter, and an RSS feed?

lkcl

unread,
Feb 3, 2019, 6:22:28 AM2/3/19
to Bruce Hoult, RISC-V ISA Dev, Andrew Waterman, Daniel Lustig, Michael Clark, Jim Wilson, Christopher Celio
On Sun, Feb 3, 2019 at 11:17 AM Bruce Hoult <bruce...@sifive.com> wrote:
>
> Luke, I haven't seen you ask. I've seen you make false statements that
> may mislead others.

please stop deliberately spreading misinformation about what i say and do.

lkcl

unread,
Feb 3, 2019, 6:47:11 AM2/3/19
to Bruce Hoult, RISC-V ISA Dev, Andrew Waterman, Daniel Lustig, Michael Clark, Jim Wilson, Christopher Celio
On Sun, Feb 3, 2019 at 10:15 AM Bruce Hoult <bruce...@sifive.com> wrote:

> https://www.youtube.com/watch?v=yvaFpNNLkzw

the video is very similar to jose's excellent slides, quoted in this thread.

at 1:52 this video talks about *not* modifying the ISA. it does not
cover the critical case that jose and i discussed a couple of days
ago, which is the in-process case (in jose's presentations'
terminology, this is the "API" case).

at 14:40 a slide is shown which advises that this is too important to
get wrong, however it makes no recommendations.

the topic of this thread - the discussion and implementation of
architectural speculation barriers under the control of the ISA - is
not discussed or raised (as best i can make out).

> https://www.youtube.com/watch?v=-xswp7uUd88

this talk appears to be about a specially-designed microarchitecture,
not an out-of-order architecture. it is an extremely good in-depth
*hardware* solution.

at 10:00 the idea of a "transparent hardware-protection" layer is
proposed, where software may be written in an "unprotected" fashion.

at 16:46 (summary) it says that they implemented a "DPA-hardened"
RISC-V core, confirming that the security is *transparent* to the
software.

it does not bear any relevance to this discussion (aside from
containing excellent descriptions of techniques to *detect*
side-channel leakage)

> https://www.youtube.com/watch?v=uIbPt1v6QKE

at 16:40 the key to this talk is given, that the techniques being
described are not "fixes" for spectre/meltdown, rather they are a way
to *validate* if purported fixes do the job.

at 17:31 a work-in-progress architectural approach is mentioned,
however no details are given.

again, there is no mention of anything related to the topic of this
thread, which is the discussion and implementation of architectural
speculation barriers, through augmentation of the ISA.


> Not to mention that the RISC-V Foundation *has* a Standing Committee
> for security, with the people in the preceding videos on it

where is the public mailing list for that, so that people here may
review the discussions and refer to anything relevant, publicly?

l.

MitchAlsup

unread,
Mar 20, 2019, 4:43:56 PM3/20/19
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, dlu...@nvidia.com, m...@sifive.com, ji...@sifive.com, ce...@berkeley.edu


On Tuesday, January 29, 2019 at 12:55:30 AM UTC-6, Jose Renau wrote:

My main "requirement" is to allow hardware to handle this spectre leaks efficiently. I mean,
if we add instructions, hardware that does not leak should be able to perform them as NOPs,
and try to avoid Gazzilions of NOPs.

My other "want" is being able to mark "time domains" or when is possible to have a time leak
between two groups and when it is not. E.g: it is OK to time like between threads in a PARSEC
application, but not OK between threads in a web browser. There should be some RISCV way to
mark this efficiently.

It is time to start designing microarchitectures that are not subject to Spectre and Meltdown style
of attacks. These attacks observe microarchitectural state that is not defined at the architecture
level.

The first requirement to avoid the attacks is not to allow microarchitectural state to leak into
architectural state--and the prime way of doing this is to completely avoid modification of
any support structure until the Write stage of the pipeline (ROB in OoO design points). This
includes I and D Caches, I and D TLBs, I and D tablewalkers (when present), along with CR
writes and register writes. Doing this costs fairly little and utilizes already existent HW features,
but may add some additional pressure on those resources.

At a microarchitectural level, one is going to have to tag data with an "I'm not real" bit and
not allow a subsequent calculation to use an operand so tagged. This is the integer and AGEN
version of the FP NaN, but it needs to use a bit that is not part of the operand/result.
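
In simulator form the idea is just one extra bit carried alongside every value, outside the architectural operand/result, which gates whether a consumer is allowed to fire (names invented for illustration):

#include <stdint.h>
#include <stdbool.h>

/* A value in flight: the payload plus an "I'm not real" bit, the integer/AGEN
   analogue of an FP NaN but carried outside the operand/result itself. */
typedef struct {
    uint64_t value;
    bool     not_real;  /* set while the producing instruction is speculative */
} tagged_t;

/* A dependent calculation is not allowed to use a tagged operand; it waits
   until every source has been resolved. */
static bool operands_usable(const tagged_t *srcs, int n)
{
    for (int i = 0; i < n; i++)
        if (srcs[i].not_real)
            return false;
    return true;
}

/* When the producer reaches the Write stage (ROB commit), the tag clears and
   consumers - including cache/TLB-touching loads - are free to proceed. */
static void mark_real(tagged_t *v)
{
    v->not_real = false;
}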

Multithreading is replete with leaks from one thread to its neighboring threads. MT probably
has to die, especially if there is access to an accurate real time clock.

RISC-V is still in design, let us not repeat mistakes of the past in the future.

Bruce Hoult

unread,
Mar 20, 2019, 9:40:08 PM3/20/19
to MitchAlsup, RISC-V ISA Dev, Luke Kenneth Casson Leighton, Andrew Waterman, Daniel Lustig, Michael Clark, Jim Wilson, Christopher Celio
On Wed, Mar 20, 2019 at 1:43 PM 'MitchAlsup' via RISC-V ISA Dev
<isa...@groups.riscv.org> wrote:
> On Tuesday, January 29, 2019 at 12:55:30 AM UTC-6, Jose Renau wrote:
>>
>>
>> My main "requirement" is to allow hardware to handle these Spectre leaks efficiently. I mean,
>> if we add instructions, hardware that does not leak should be able to perform them as NOPs,
>> and we should try to avoid gazillions of NOPs.
>>
>> My other "want" is being able to mark "time domains", i.e. when it is possible to have a time leak
>> between two groups and when it is not. E.g.: it is OK to have a time leak between threads in a PARSEC
>> application, but not OK between threads in a web browser. There should be some RISC-V way to
>> mark this efficiently.
>
>
> It is time to start designing microarchitectures that are not subject to Spectre and Meltdown style
> of attacks. These attacks observe microarchitectural state that is not defined at the architecture
> level.
>
> The first requirement to avoid the attacks is not to allow microarchitectural state to leak into
> architectural state--and the prime way of doing this is to completely avoid modification of
> any support structure until the Write stage of the pipeline (ROB in OoO design points). This
> includes I and D Caches, I and D TLBs, I and D tablewalkers (when present), along with CR
> writes and register writes. Doing this costs fairly little and utilizes already existent HW features,
> but may add some additional pressure on those resources.

I agree with that, and others such as Chris Celio seem to have come to
the same conclusion.

> Multithreading is replete with leaks from one thread to its neighboring threads. MT probably
> has to die, especially if there is access to an accurate real time clock.

Unless you go to a full "barrel processor" and have (up to) N active
threads on each processor and each thread gets a chance to start an
instruction exactly every N clock cycles, no matter what the other
threads are doing. There is a long history of these, starting maybe
from the CDC 6000 series I/O processor, to the Tera MTA, the Xerox
Alto, to the modern XMOS xCore microcontrollers which I understand are
quite popular in certain application domains. e.g.
https://nz.element14.com/xmos/xs1-u8a-128-fb217-c10/mcu-32bit-xcore-500mhz-fbga-217/dp/2424398

> RISC-V is still in design, let us not repeat mistakes of the past in the future.

Agreed.

lk...@lkcl.net

unread,
Mar 23, 2019, 8:33:43 AM3/23/19
to RISC-V ISA Dev, lk...@lkcl.net, wate...@eecs.berkeley.edu, dlu...@nvidia.com, m...@sifive.com, ji...@sifive.com, ce...@berkeley.edu


On Wednesday, March 20, 2019 at 8:43:56 PM UTC, MitchAlsup wrote:


On Tuesday, January 29, 2019 at 12:55:30 AM UTC-6, Jose Renau wrote:

My main "requirement" is to allow hardware to handle this spectre leaks efficiently. I mean,
if we add instructions, hardware that does not leak should be able to perform them as NOPs,
and try to avoid Gazzilions of NOPs.

My other "want" is being able to mark "time domains" or when is possible to have a time leak
between two groups and when it is not. E.g: it is OK to time like between threads in a PARSEC
application, but not OK between threads in a web browser. There should be some RISCV way to
mark this efficiently.

It is time to start designing microarchitectures that are not subject to Spectre and Meltdown style
of attacks. These attacks observe microarchitectural state that is not defined at the architecture
level.

The first requirement to avoid the attacks is not to allow microarchitectural state to leak into
architectural state--and the prime way of doing this is to completely avoid modification of
any support structure until the Write stage of the pipeline (ROB in OoO design points). This
includes I and D Caches, I and D TLBs, I and D tablewalkers (when present), along with CR
writes and register writes. Doing this costs fairly little and utilizes already existent HW features,
but may add some additional pressure on those resources.

this seems to progress logically from the definition of a timing attack, which is that past and future instructions' completion time must be unaffected by the present instruction, in respect of both resources and state.

one area where the above recommendation might go wrong is if the allocation of instructions to the pipeline causes future allocations to have to be stalled.  and if those future allocations are dependent on the data *in* a present instruction in the pipeline, you're *really* hosed.

for example: i was in the process of designing an early-out IEEE754 pipeline, where the "special cases" phase (zero plus zero, NaN plus NaN and so on) could exit from a side-pipe, instead of having to go through the main pipeline stages (without computation occurring).  if there were insufficient numbers of FPUs, latches and internal buses such that the allocation of multi-issued instructions could potentially jam up (and stall) due to patterns in the incoming FP data, that would by definition be a timing / state vulnerability.



At a microarchitectural level, one is going to have to tag data with an "I'm not real" bit and
not allow a subsequent calculation to use an operand so tagged. This is the integer and AGEN
version of the FP NaN, but it needs to use a bit that is not part of the operand/result.

so... very similar to how the Mill Architecture marks data as "no longer valid", where if any one of the operands is invalid then so is the result (in a chain), and, once results come to the end of the "Belt" (commit time), if any are marked "invalid" they're discarded rather than committed?
 
Multithreading is replete with leaks from one thread to its neighboring threads. MT probably
has to die, especially if there is access to an accurate real time clock.
RISC-V is still in design, let us not repeat mistakes of the past in the future.

unfortunately, like the popup buttons on an old radio, that's such a stark requirement that it's going to continue to be denied that it is the "only solution".  surely (pop) there must be a way in hardware (pop) that this can be solved such that performance (pop) is not affected, or that software (pop) does not need such drastic rewriting, or surely we can only do away with hyperthreading (pop) and surely we can still keep virtualisation secure (pop), and surely we don't have to abandon SMP (pop) and go back to single-core systems??

people are now so used to the high performance that's available today, that to even suggest that they have to e.g. stop using FastCGI / WSGI (a single-process web server gateway API) or that they have to stop using apache's worker thread model, and suffer the performance degradation and increases in latency that results, they'll go into total denial rather than accept reality.

btw after seeing this https://arxiv.org/pdf/1902.05178.pdf and learning of TLBleed i don't believe that even in-order systems are immune from timing attacks (only certain drastically-simplified classes of in-order systems).

RISC-V is still in design, let us not repeat mistakes of the past in the future.

as to how much is down to the ISA and how much to the microarchitectural decisions of implementors, i believe those to be primarily separate.

however from this paper, https://arxiv.org/pdf/1902.05178.pdf it's clear that separate processes - a context switch - mitigates timing attacks and state leakage - CAVEAT: *with the right microarchitectural design*.

so that had me thinking: whilst the authors of that paper recommend switching from one process to another, why not switch from one process *and directly back to the same process* as a means to mitigate timing attacks *within* that process?

and if that can be done, then surely it is obvious that a "hint" instruction which achieves the same effect (without the unnecessary software overhead of the actual context-switch-and-back-again) would be a desirable addition to the RISC-V Instruction Set.

l.


lk...@lkcl.net

unread,
Mar 23, 2019, 8:53:37 AM3/23/19
to RISC-V ISA Dev, Mitch...@aol.com, lk...@lkcl.net, wate...@eecs.berkeley.edu, dlu...@nvidia.com, m...@sifive.com, ji...@sifive.com, ce...@berkeley.edu
i took a look a couple weeks ago at the concept of the barrel processor, after jacob alerted me to its existence [1]

it appears that barrel processors are primarily used to implement I/O.  i.e. yes it's "Bit-Banging" [2] (however if it's a *dedicated* processor that's doing the "banging", can it actually be *called* bit-banging, especially if its software is in ROM)?

so, as it would be absolutely critical for a bus to have a clock that's regular, and for that SPI data to be read *on* the incoming clock pulse, and for the timing to be absolutely rock-solid, no exceptions, no interrupts, NOTHING that could potentially disrupt the occurrence of instructions, on time EVERY time, a requirement that instructions be executed in a predictable and guaranteed strict real-time fashion is clearly paramount.

therefore, it's not that barrel processors are *designed* to be immune to spectre and other timing attacks, it's just that their immunity is a side-effect of the absolute strict and inviolate requirement to have instructions executed in the absolute strictest time-conformant manner, in order to process and generate timing-accurate I/O.

meeting this strict timing requirement unfortunately has some very drastic side-effects.  from what i've seen of that wikipedia page, caches are out (because a cache miss could result in missing the timing window for delivering a synchronised clock or I/O pulse), TLBs are out (for the same reason), pipeline stalling is out (likewise), and multi-issue is out (far too complex and unpredictable).  i'm not sure if exceptions or even interrupts are allowed.

in addition, the threading capability is provided by having copies of the *entire* register file and associated control state, and for each "thread", the performance drops.  a 200mhz processor with 4 barrel threads *actually* only executes each "thread" at 50mhz.

by the time things like multi-issue have been added in, and L1 and L2 caches and TLBs, it's no longer called a barrel processor, it's called Hyper-Threading.

bottom line: my understanding is that if you need an I/O processor microarchitecture, a barrel processor design is an excellent choice.  however for general-purpose high-performance usage (modern general-purpose computing workloads) the stringent design requirements would be so restrictive that it would be extremely unlikely to be a successful product anywhere other than niche markets.

l.

Samuel Falvo II

unread,
Mar 23, 2019, 11:53:07 AM3/23/19
to lk...@lkcl.net, RISC-V ISA Dev, Mitch...@aol.com, Andrew Waterman, dlu...@nvidia.com, m...@sifive.com, Jim Wilson, Christopher Celio
On Sat, Mar 23, 2019 at 5:53 AM lk...@lkcl.net <lk...@lkcl.net> wrote:
> meeting this strict timing requirement unfortunately has some very drastic side-effects. from what i've seen of that wikipedia page, caches are out (because a cache miss could result in missing the timing window for delivering a synchronised clock or I/O pulse), TLBs are out (for the same reason), pipeline stalling is out (likewise), and multi-issue is out (far too complex and unpredictable). i'm not sure if exceptions or even interrupts are allowed.

I'm going to define some arbitrary terms as a matter of convenience:
if a barrel processor has N "threads" (which are really virtual CPUs),
then each virtual CPU is an IOP (I/O Processor), picked arbitrarily
because of CDC's prior art.

One of the defining characteristics of I/O versus plain memory is that
you frequently have registers which cause the processor to wait. For
instance, in the Atari 8-bit computers, there's explicitly a register
which injects wait-states to the CPU until the next vertical sync
period. For serial ports, it can be the case that you block the
processor until the transmit queue becomes free again before receiving
the next byte. For video chips like the TMS9918A, your I/O accesses
will block until the VDP chip has finished its current memory fetch
operation. And so on. Put simply, when it comes to I/O operations,
*expect* that externally-influenced wait conditions are going to be
the norm, not the exception.

Given that condition, then, if a load or store instruction running on
an IOP has to block on something (it's a matter of when, not if),
clearly the barrel will not wait; however, the individual IOP *can*
wait; it just waits N more clock cycles. When that IOP's context
becomes active in the memory pipeline stage again, it can look at that
IOP's memory port to see if it's allowed to continue or not. In other
words, even a barrel processor does not mitigate timing-based attacks;
in fact, it might even exacerbate them because your latencies are all
multiplied by a factor of N.

Regarding interrupts, each IOP is, in essence, just a normal processor
whose guts are multiplexed amongst other contexts. Thus, each IOP is
capable of managing its own set of interrupts independently of other
IOPs. So, unless your IOPs are explicitly designed to not support
interrupts, this will be an additional consideration as well.

> by the time things like multi-issue have been added in, and L1 and L2 caches and TLBs, it's no longer called a barrel processor, it's called Hyper-Threading.

I'm showing my ignorance here -- I thought hyperthreading was
opportunistic -- e.g., switch contexts only when the current
hyperthread is to block on something. A barrel design is rigid in its
timing. It's like the difference between common Ethernet and Sonet:
both rely on time-division multiplexing, but the former is
opportunistic (I can use the network as long as I don't hear anyone
else using it) while the latter is rigidly defined by atomic clocks
(I'm not allowed to transmit this next fixed-size buffer until 125
microseconds from ..... now).

--
Samuel A. Falvo II

MitchAlsup

unread,
Mar 23, 2019, 1:16:18 PM3/23/19
to RISC-V ISA Dev


On Saturday, March 23, 2019 at 10:53:07 AM UTC-5, Samuel Falvo II wrote:

> by the time things like multi-issue have been added in, and L1 and L2 caches and TLBs, it's no longer called a barrel processor, it's called Hyper-Threading.

I'm showing my ignorance here -- I thought hyperthreading was
opportunistic -- e.g., switch contexts only when the current
hyperthread is to block on something.  A barrel design is rigid in its
timing.  It's like the difference between common Ethernet and Sonet:
both rely on time-division multiplexing, but the former is
opportunistic (I can use the network as long as I don't hear anyone
else using it) while the latter is rigidly defined by atomic clocks
(I'm not allowed to transmit this next fixed-size buffer until 125
microseconds from ..... now).

There are at least a 1/2 dozen ways to do multithreading CPUs.

It is all based on decisions made at/around FETCH time and
at/around DECODE time in the pipeline.

A Barrel processor has a strict DIV-N timing. Thread K gets a FETCH
cycle or a DECODE cycle when clock MOD N = k.
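
In toy form (no particular machine):

/* Toy model of barrel scheduling: on every clock exactly one hart owns the
   FETCH slot, determined only by the clock value, never by what the other
   harts are doing. */
#define N_THREADS 4

static unsigned fetch_owner(unsigned long clk)
{
    return clk % N_THREADS;      /* thread K fetches when clock MOD N == K */
}

void run(unsigned long cycles)
{
    for (unsigned long clk = 0; clk < cycles; clk++) {
        unsigned k = fetch_owner(clk);
        (void)k;  /* fetch_one_instruction(k): each thread gets exactly 1/N of
                     the cycles, so its timing is independent of its neighbours */
    }
}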

A power saving MT design might change threads only when it sees
a Cache/TLB miss and switch to another thread to occupy the pipe.

A higher perf design might have several threads fetch and decode
instructions and when the master thread cannot use a function unit,
some other thread lobs an instruction in its direction; thus keeping
the function units busy. {Have not seen this one implemented.}

In the guise of the immediately above, one could have multiple
fetch and decode units that are BW limited into their register file
(each not en-massé). As long as the aggregate RF BW is sufficient
perf remains good.

One could imagine multiple fetch/decode machines feeding a 
common reservation station (or scoreboard) which then drives
the calculation units.

But as long as SW remains incapable of utilizing more than a handful
of threads, it's all moot anyway.

In general, function units are no longer so costly and routing data
to them is at least as costly; the balance between keeping them
busy or replicating them has shifted towards replication and shorter
wires.

lk...@lkcl.net

unread,
Mar 24, 2019, 6:52:24 AM3/24/19
to RISC-V ISA Dev


On Saturday, March 23, 2019 at 5:16:18 PM UTC, MitchAlsup wrote:



There are at least a 1/2 dozen ways to do multithreading CPUs.

It is all based on decisions made at/around FETCH time and
at/around DECODE time in the pipeline.

A Barrel processor has a strict DIV-N timing. Thread K gets a FETCH
cycle or a DECODE cycle when clock MOD N = k.


... which would be why it is immune to timing attacks [as an accidental side-effect of the strict design requirements of I/O processing]
 
A power saving MT design might change threads only when it sees
a Cache/TLB miss and switch to another thread to occupy the pipe.

... which would be potentially why such designs would *not* be immune to side-channel timing attacks [as an accidental side-effect of the pipe being a shared timing-influenceable resource / bottleneck]

 

A higher perf design might have several threads fetch and decode
instructions and when the master thread cannot use a function unit,
some other thread lobs an instruction in its direction; thus keeping
the function units busy. {Have not seen this one implemented.}


an interesting idea.  particularly if the master thread (so designated because it has been specifically marked as "not to be influenced by other work") could order the cancellation of allocation of other instructions to function units from non-master threads.
 
In the guise of the immediately above, one could have multiple
fetch and decode units that are BW limited into their register file
(each not en-massé). As long as the aggregate RF BW is sufficient
perf remains good.

which again underscores that whenever a resource such as a register file becomes a bottleneck (due to insufficient port bandwidth), that's the definition of a timing attack opportunity.

an idea occurred to me a few weeks ago that it might be a good idea to allow high-security programs control over the instruction issue and resource allocation.  following from the definition of a timing attack, being that it is resource contention (leaving state out of the equation for now) that results in timing attacks, if a high security application can *CONTROL* the allocation of instructions [slow them down], backing down to single-issue where execution is normally multi-issue, and utilising the massive resources of an otherwise multi-issue execution engine to *GUARANTEE* uniform uninterrupted execution of the high security program, we might have a workable compromise solution.

where security is irrelevant (you know what i mean: where performance matters) multi-issue execution may proceed at the maximum rate: resource bottlenecks, even side-channel attacks would be *declared* by the application as "not relevant here".

this is one of the disastrous things about the current wave of "fixes".  slashing performance by up to 30% is unacceptable to many power users of GNU/Linux OSes: they're not *running* a high-security server, they don't give a damn about Spectre or Meltdown or whatever the hell is going on, they just want maximum performance, yet they're being forced into a situation of having to step outside of the distribution and compile up their own kernel with spectre-etc mitigation switched *off*, because the decision was made without consulting them.

providing the *dynamic* option as part of the *hardware* to mitigate or not-mitigate would stop the tug-of-war between those people who want performance and those who need high security.

l.

Allen Baum

unread,
Mar 25, 2019, 12:41:10 PM3/25/19
to lk...@lkcl.net, RISC-V ISA Dev
The basis of timing attacks is shared resources. 
Multi-threaded processors share all sorts of resources: caches, pipelines, etc.
Barrel processors are not immune if timing is ever variable.
You would think that making them completely separate processors would fix that problem - but it won't.
Sharing an address space in a coherent world is probably enough to effect timing attacks (e.g. you can force TLB entries in another core to be evicted...). 
But there is another source of sharing that is less visible: DRAM memory controllers, and the DRAM itself.
If multiple cores share DRAM (and therefore their controllers), you're open to timing attacks (e.g. forcing row closure, bumping prefetches out of limited buffer space) - not to mention rowhammer attacks.
It isn't all about the core....

The idea of labelling sensitive areas of code to inhibit speculative behavior in order to prevent timing attacks (just as labelling sections of code as a critical section to avoid broken concurrency bugs) has been mentioned previously (probably by Jose Renau) - I don't see any other way to prevent timing attacks. Otherwise, you're merely making it more difficult, but if the return on investment is sufficient, attackers will spend the resources to succeed.


Samuel Falvo II

unread,
Mar 25, 2019, 12:57:54 PM3/25/19
to Allen Baum, lk...@lkcl.net, RISC-V ISA Dev
Would an approach that introduces a mode bit into the CPU which, if
enabled, introduces a random delay per instruction work? The
resulting machine performance will take a major hit, but it should be
sufficient to bury intelligence gleaned from timing information in
random noise.

Allen Baum

unread,
Mar 25, 2019, 1:26:37 PM3/25/19
to Samuel Falvo II, lk...@lkcl.net, RISC-V ISA Dev
That sounds like the fuzzing approach used in GPS to decrease accuracy in non-military receivers.
I'm not sure how well that works in practice; it does degrade timing attacks, but I suspect it doesn't eliminate them.
The tradeoff is performance vs. effort to attack, and the amount of random delay that a user can tolerate is likely not large (not to mention the effort to validate a design with fundamental randomness, or to attempt to benchmark such systems. Ouch.)
In any case, I am taking a wild-ass guess that it won't eliminate attacks, just make them take longer, and the economics are usually in favor of the attacker. A mode bit that designates code as sensitive and makes (or prohibits) actions that contribute to timing attacks seems safer than one introducing random delays.

MitchAlsup

unread,
Mar 25, 2019, 4:30:14 PM3/25/19
to RISC-V ISA Dev, sam....@gmail.com, lk...@lkcl.net


On Monday, March 25, 2019 at 12:26:37 PM UTC-5, Allen Baum wrote:
That sounds like the fuzzing approach used in GPS to decrease accuracy in non-military receivers.

While at AMD circa 2005, we discussed a mask on the real time clock that would randomize some of the lower order bits.
In the end, this only decreases the BW of the side channel and does not in any way eliminate it. 
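
A back-of-the-envelope illustration of why (toy numbers, nothing AMD-specific): if the attacker can repeat the measurement, simple averaging washes the injected randomness out.

#include <stdio.h>
#include <stdlib.h>

/* One "measurement": a true, secret-dependent latency gap of 40 cycles
   between a cache hit and a miss, plus +/- 64 cycles of uniform jitter
   standing in for the randomized low-order clock bits. */
static long measure(int is_hit)
{
    long base   = is_hit ? 100 : 140;
    long jitter = (rand() % 129) - 64;
    return base + jitter;
}

int main(void)
{
    const int samples = 10000;
    long hit_sum = 0, miss_sum = 0;

    for (int i = 0; i < samples; i++) {
        hit_sum  += measure(1);
        miss_sum += measure(0);
    }
    /* Averaging recovers the 40-cycle gap: the jitter only means the attacker
       needs more samples (lower channel bandwidth); it does not hide the signal. */
    printf("hit avg  = %ld\n", hit_sum  / samples);
    printf("miss avg = %ld\n", miss_sum / samples);
    return 0;
}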

Allen Baum

unread,
Mar 25, 2019, 5:45:55 PM3/25/19
to MitchAlsup, RISC-V ISA Dev, Samuel Falvo II, Luke Kenneth Casson Leighton
exactly.
