Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[PATCH 00/10] x86/xsaves: Fix XSAVES known issues

56 views
Skip to first unread message

Yu-cheng Yu

unread,
Feb 22, 2016, 1:58:52 PM2/22/16
to x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Dave Hansen, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu, Yu-cheng Yu
XSAVES is a kernel-mode instruction. It offers a compacted format and
memory-write optimization. These patches fix known issues in the first
implementation. They are intended for discussion and getting feedback
before actually getting applied.

Patch 1, 2, and 4 are for converting between kernel-mode xstate area and
signal frames.

Patch 8 is for converting between kernel-mode xstate area and ptrace
frames.

Patch 3 and 7 fix optimization issues introduced by XSAVES to the buffer
init_fpstate.

Patch 5 and 6 are related to xstate component offsets.

Patch 9 fixes xstate area print out.

Patch 10 re-enables XSAVES.

Yu-cheng Yu (10):
x86/xsaves: Define and use user_xstate_size for xstate size in signal
context
x86/xsaves: Rename xstate_size to kernel_xstate_size to explicitly
distinguish xstate size in kernel from user space
x86/xsaves: Keep init_fpstate.xsave.header.xfeatures as zero for init
optimization
x86/xsaves: Introduce a new check that allows correct xstates copy
from kernel to user directly
x86/xsaves: Align xstate components according to CPUID
x86/xsaves: Supervisor state component offset
x86/xsaves: Fix init_fpstate.header.xcomp_bv
x86/xsaves: Fix PTRACE frames for XSAVES
x86/xsaves: Fix XSTATE component offset print out
x86/xsaves: Re-enable XSAVES

arch/x86/include/asm/fpu/types.h | 2 +
arch/x86/include/asm/fpu/xstate.h | 8 +-
arch/x86/include/asm/processor.h | 3 +-
arch/x86/kernel/fpu/core.c | 6 +-
arch/x86/kernel/fpu/init.c | 32 +---
arch/x86/kernel/fpu/regset.c | 56 ++++--
arch/x86/kernel/fpu/signal.c | 69 ++++++-
arch/x86/kernel/fpu/xstate.c | 388 +++++++++++++++++++++++++++++---------
8 files changed, 425 insertions(+), 139 deletions(-)

--
1.9.1

Ingo Molnar

unread,
Feb 24, 2016, 3:43:47 AM2/24/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Dave Hansen, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Small housekeeping request: could you please make sure your series is properly
threaded for email clients? This submission was sent as singular patches, without
any References header.

I don't even know how you coaxed git-send-email into doing that - did you use
--no-thread?

The recommended flags are --thread --no-chain-reply-to.

Thanks!

Ingo

Yu-cheng Yu

unread,
Feb 24, 2016, 11:30:21 AM2/24/16
to Ingo Molnar, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Dave Hansen, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Wed, Feb 24, 2016 at 09:43:36AM +0100, Ingo Molnar wrote:
>
> Small housekeeping request: could you please make sure your series is properly
> threaded for email clients? This submission was sent as singular patches, without
> any References header.
>
> I don't even know how you coaxed git-send-email into doing that - did you use
> --no-thread?
>
> The recommended flags are --thread --no-chain-reply-to.
>
> Thanks!
>
> Ingo

I will fix it when sending out the next version.
Thanks!

-- Yu-cheng

Yu-cheng Yu

unread,
Feb 25, 2016, 3:29:26 PM2/25/16
to x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Dave Hansen, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu, Yu-cheng Yu
XSAVES is a kernel-mode instruction. It offers a compacted format and
memory-write optimization. These patches fix known issues in the first
implementation. They are intended for discussion and getting feedback
before actually getting applied.

Version 2 fixes a mistake in handling supervisor states of ptrace function
copyin_to_xsaves() and some coding style/naming issues in the first version.
It also limits XSAVES only to 64-bit kernel in patch 9.

Patch 1, 2, and 4 are for converting between kernel-mode xstate area and
signal frames.

Patch 3 fixes optimization issues introduced by XSAVES to the buffer
init_fpstate.

Patch 5 and 6 are related to xstate component offsets.

Patch 7 is for converting between kernel-mode xstate area and ptrace
frames.

Patch 8 fixes xstate area print out.

Patch 9 re-enables XSAVES.

Yu-cheng Yu (9):
x86/xsaves: Define and use user_xstate_size for xstate size in signal
context
x86/xsaves: Rename xstate_size to kernel_xstate_size to explicitly
distinguish xstate size in kernel from user space
x86/xsaves: Keep init_fpstate.xsave.header.xfeatures as zero for init
optimization
x86/xsaves: Introduce a new check that allows correct xstates copy
from kernel to user directly
x86/xsaves: Align xstate components according to CPUID
x86/xsaves: Supervisor state component offset
x86/xsaves: Fix PTRACE frames for XSAVES
x86/xsaves: Fix XSTATE component offset print out
x86/xsaves: Re-enable XSAVES

arch/x86/include/asm/fpu/types.h | 2 +
arch/x86/include/asm/fpu/xstate.h | 8 +-
arch/x86/include/asm/processor.h | 3 +-
arch/x86/kernel/fpu/core.c | 6 +-
arch/x86/kernel/fpu/init.c | 35 ++--
arch/x86/kernel/fpu/regset.c | 56 ++++--
arch/x86/kernel/fpu/signal.c | 69 ++++++-
arch/x86/kernel/fpu/xstate.c | 392 +++++++++++++++++++++++++++++---------
8 files changed, 435 insertions(+), 136 deletions(-)

--
1.9.1

Dave Hansen

unread,
Feb 25, 2016, 3:35:11 PM2/25/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 02/25/2016 12:26 PM, Yu-cheng Yu wrote:
> - * ( This is easy to backport while we are fixing
> - * XSAVES* support. )
> + * Most recent CPUs supporting XSAVES can run 64-bit mode.
> + * Enable XSAVES for 64-bit.
> */
> - setup_clear_cpu_cap(X86_FEATURE_XSAVES);
> + if (!config_enabled(CONFIG_X86_64))
> + setup_clear_cpu_cap(X86_FEATURE_XSAVES);
> }

Is this intended to be permanent? If so, we would not be able to
support any future supervisor features on 32-bit.

Dave Hansen

unread,
Feb 25, 2016, 3:57:36 PM2/25/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 02/25/2016 12:26 PM, Yu-cheng Yu wrote:
> Patch 9 re-enables XSAVES.

Could we also add one more thing to this: A big, fat warning that
supervisor states are not supported? We might get that from an eventual
use of xfeature_uncompacted_offset(), but we need something that's very
clear.

I just don't want somebody coming along and shoving a supervisor state
in to XCR0 and expecting it to work just because we have XSAVES support
itself. That might happen in-tree or out-of-tree as things get prototyped.

Yu-cheng Yu

unread,
Feb 25, 2016, 4:05:34 PM2/25/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
I was thinking of the 64-bit-only change, but then decided to put it out
for discussion.

Yu-cheng

Yu-cheng Yu

unread,
Feb 25, 2016, 4:09:58 PM2/25/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
I will do that.

Yu-cheng

Yu-cheng Yu

unread,
Feb 29, 2016, 12:46:22 PM2/29/16
to x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Dave Hansen, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu, Yu-cheng Yu
XSAVES is a kernel-mode instruction. It offers a compacted format and
memory-write optimization. These patches fix known issues in the first
implementation. They are intended for discussion and getting feedback
before actually getting applied.

Version 3 adds a WARN_ONCE() in patch 9 to make it clear XSAVES supervisor
states are not yet supported.
arch/x86/kernel/fpu/xstate.c | 400 +++++++++++++++++++++++++++++---------
8 files changed, 443 insertions(+), 136 deletions(-)

--
1.9.1

Yu-cheng Yu

unread,
Mar 1, 2016, 7:37:30 PM3/1/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Tue, Mar 01, 2016 at 03:56:12PM -0800, Dave Hansen wrote:
> On 02/29/2016 09:42 AM, Yu-cheng Yu wrote:
> > /*
> > - * Quirk: we don't yet handle the XSAVES* instructions
> > - * correctly, as we don't correctly convert between
> > - * standard and compacted format when interfacing
> > - * with user-space - so disable it for now.
> > - *
> > - * The difference is small: with recent CPUs the
> > - * compacted format is only marginally smaller than
> > - * the standard FPU state format.
> > - *
> > - * ( This is easy to backport while we are fixing
> > - * XSAVES* support. )
> > + * Most recent CPUs supporting XSAVES can run 64-bit mode.
> > + * Enable XSAVES for 64-bit.
> > */
> > - setup_clear_cpu_cap(X86_FEATURE_XSAVES);
> > + if (!config_enabled(CONFIG_X86_64))
> > + setup_clear_cpu_cap(X86_FEATURE_XSAVES);
> > }
>
> I think we need a much better explanation of this for posterity. Why
> are we not supporting this now, and what would someone have to do in the
> future in order to enable it?
>
If anyone is using this newer feature, then that user is most likely using
a 64-bit capable processor and a 64-bit kernel. The intention here is to
take out the complexity and any potential of error. If the user removes
the restriction and builds a private kernel, it should work but we have
not checked all possible combinations. I will put these in the comments.

> > /*
> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > index 2e80d6f..cb2a484 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -204,6 +204,14 @@ void fpu__init_cpu_xstate(void)
> > if (!cpu_has_xsave || !xfeatures_mask)
> > return;
> >
> > + /*
> > + * Make it clear that XSAVES supervisor states are not yet
> > + * implemented should anyone expect it to work by changing
> > + * bits in XFEATURE_MASK_* macros and XCR0.
> > + */
> > + WARN_ONCE((xfeatures_mask & XFEATURE_MASK_SUPERVISOR),
> > + "x86/fpu: XSAVES supervisor states are not yet implemented.\n");
> > +
> > cr4_set_bits(X86_CR4_OSXSAVE);
> > xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
> > }
>
> Let's also do a:
>
> xfeatures_mask &= ~XFEATURE_MASK_SUPERVISOR;
>
> Otherwise, we have a broken system at the moment.
>
Currently, if anyone sets any supervisor state in xfeatures_mask, the
kernel prints out the warning then goes into a protection fault.
That is a very strong indication to the user. Do we want to mute it?

Yu-cheng

Dave Hansen

unread,
Mar 1, 2016, 7:47:01 PM3/1/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 03/01/2016 04:34 PM, Yu-cheng Yu wrote:
> On Tue, Mar 01, 2016 at 03:56:12PM -0800, Dave Hansen wrote:
>> On 02/29/2016 09:42 AM, Yu-cheng Yu wrote:
>>> - setup_clear_cpu_cap(X86_FEATURE_XSAVES);
>>> + if (!config_enabled(CONFIG_X86_64))
>>> + setup_clear_cpu_cap(X86_FEATURE_XSAVES);
>>> }
>>
>> I think we need a much better explanation of this for posterity. Why
>> are we not supporting this now, and what would someone have to do in the
>> future in order to enable it?
>>
> If anyone is using this newer feature, then that user is most likely using
> a 64-bit capable processor and a 64-bit kernel. The intention here is to
> take out the complexity and any potential of error. If the user removes
> the restriction and builds a private kernel, it should work but we have
> not checked all possible combinations. I will put these in the comments.

A user can go download a 32-bit version of Ubuntu or Debian and install
it on a 64-bit processor today. It's a very easy mistake to make when
downloading the install CD.

In any case, I don't have a _problem_ with leaving i386 in the dust
here. I just want us to be very explicit about what we are doing.

>>> + /*
>>> + * Make it clear that XSAVES supervisor states are not yet
>>> + * implemented should anyone expect it to work by changing
>>> + * bits in XFEATURE_MASK_* macros and XCR0.
>>> + */
>>> + WARN_ONCE((xfeatures_mask & XFEATURE_MASK_SUPERVISOR),
>>> + "x86/fpu: XSAVES supervisor states are not yet implemented.\n");
>>> +
>>> cr4_set_bits(X86_CR4_OSXSAVE);
>>> xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
>>> }
>>
>> Let's also do a:
>>
>> xfeatures_mask &= ~XFEATURE_MASK_SUPERVISOR;
>>
>> Otherwise, we have a broken system at the moment.
>>
> Currently, if anyone sets any supervisor state in xfeatures_mask, the
> kernel prints out the warning then goes into a protection fault.
> That is a very strong indication to the user. Do we want to mute it?

By "goes into a protection fault", do you mean that it doesn't boot?

I'd just rather we put the kernel in a known-safe configuration (masking
supervisor state out of xfeatures_mask) rather than rely on the general
protection fault continuing to be generated by whatever is generating it.


Yu-cheng Yu

unread,
Mar 1, 2016, 7:51:17 PM3/1/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Tue, Mar 01, 2016 at 04:45:41PM -0800, Dave Hansen wrote:
> >>> + WARN_ONCE((xfeatures_mask & XFEATURE_MASK_SUPERVISOR),
> >>> + "x86/fpu: XSAVES supervisor states are not yet implemented.\n");
> >>> +
> >>> cr4_set_bits(X86_CR4_OSXSAVE);
> >>> xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
> >>> }
> >>
> >> Let's also do a:
> >>
> >> xfeatures_mask &= ~XFEATURE_MASK_SUPERVISOR;
> >>
> >> Otherwise, we have a broken system at the moment.
> >>
> > Currently, if anyone sets any supervisor state in xfeatures_mask, the
> > kernel prints out the warning then goes into a protection fault.
> > That is a very strong indication to the user. Do we want to mute it?
>
> By "goes into a protection fault", do you mean that it doesn't boot?
>
> I'd just rather we put the kernel in a known-safe configuration (masking
> supervisor state out of xfeatures_mask) rather than rely on the general
> protection fault continuing to be generated by whatever is generating it.
>
Ok.

Yu-cheng

H. Peter Anvin

unread,
Mar 1, 2016, 7:54:23 PM3/1/16
to Dave Hansen, Yu-cheng Yu, x...@kernel.org, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Differences between i386 and x86-64 generally add problems, so unless this requires significant 32-bit-specific code we should not exclude i386 just because.
--
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

Dave Hansen

unread,
Mar 1, 2016, 7:57:36 PM3/1/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 02/29/2016 09:42 AM, Yu-cheng Yu wrote:
> /*
> - * Quirk: we don't yet handle the XSAVES* instructions
> - * correctly, as we don't correctly convert between
> - * standard and compacted format when interfacing
> - * with user-space - so disable it for now.
> - *
> - * The difference is small: with recent CPUs the
> - * compacted format is only marginally smaller than
> - * the standard FPU state format.
> - *
> - * ( This is easy to backport while we are fixing
> - * XSAVES* support. )
> + * Most recent CPUs supporting XSAVES can run 64-bit mode.
> + * Enable XSAVES for 64-bit.
> */
> - setup_clear_cpu_cap(X86_FEATURE_XSAVES);
> + if (!config_enabled(CONFIG_X86_64))
> + setup_clear_cpu_cap(X86_FEATURE_XSAVES);
> }

I think we need a much better explanation of this for posterity. Why
are we not supporting this now, and what would someone have to do in the
future in order to enable it?

> /*
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 2e80d6f..cb2a484 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -204,6 +204,14 @@ void fpu__init_cpu_xstate(void)
> if (!cpu_has_xsave || !xfeatures_mask)
> return;
>

Yu-cheng Yu

unread,
Mar 1, 2016, 8:01:00 PM3/1/16
to H. Peter Anvin, Dave Hansen, x...@kernel.org, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Tue, Mar 01, 2016 at 04:53:53PM -0800, H. Peter Anvin wrote:
> Differences between i386 and x86-64 generally add problems, so unless this requires significant 32-bit-specific code we should not exclude i386 just because.

I have not seen any issues with 32-bit code, but will do some tests.
Thanks.

Yu-cheng

Yu-cheng Yu

unread,
Mar 4, 2016, 1:15:56 PM3/4/16
to x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Dave Hansen, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu, Yu-cheng Yu
XSAVES is a kernel-mode instruction. It offers a compacted format and
memory-write optimization. These patches fix known issues in the first
implementation. They are intended for discussion and getting feedback
before actually getting applied.

Version 4 adds a new patch (#9) to fix missing XSAVE legacy region
offset/size values.

Patch 1, 2, and 4 are for converting between kernel-mode xstate area and
signal frames.

Patch 3 fixes optimization issues introduced by XSAVES to the buffer
init_fpstate.

Patch 5 and 6 are related to xstate component offsets.

Patch 7 is for converting between kernel-mode xstate area and ptrace
frames.

Patch 8 fixes xstate area print out.

Patch 9 fixes missing offset/size values for XSAVE legacy region.

Patch 10 re-enables XSAVES.

Yu-cheng Yu (10):
x86/xsaves: Define and use user_xstate_size for xstate size in signal
context
x86/xsaves: Rename xstate_size to kernel_xstate_size to explicitly
distinguish xstate size in kernel from user space
x86/xsaves: Keep init_fpstate.xsave.header.xfeatures as zero for init
optimization
x86/xsaves: Introduce a new check that allows correct xstates copy
from kernel to user directly
x86/xsaves: Align xstate components according to CPUID
x86/xsaves: Supervisor state component offset
x86/xsaves: Fix PTRACE frames for XSAVES
x86/xsaves: Fix XSTATE component offset print out
x86/xsaves: Fix xstate_offsets, xstate_sizes for non-extended states
x86/xsaves: Re-enable XSAVES

arch/x86/include/asm/fpu/types.h | 2 +
arch/x86/include/asm/fpu/xstate.h | 8 +-
arch/x86/include/asm/processor.h | 3 +-
arch/x86/kernel/fpu/core.c | 6 +-
arch/x86/kernel/fpu/init.c | 32 +--
arch/x86/kernel/fpu/regset.c | 56 ++++--
arch/x86/kernel/fpu/signal.c | 69 ++++++-
arch/x86/kernel/fpu/xstate.c | 411 ++++++++++++++++++++++++++++++--------
8 files changed, 449 insertions(+), 138 deletions(-)

--
1.9.1

Dave Hansen

unread,
Apr 29, 2016, 2:09:32 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Hi Folks,

I've heard through the grapevine that there's some concern that we
should not be bothering to enable XSAVES because there's not a
sufficient use case for it. Maybe it's meager today, but I still think
we should do it.

I'll try to lay out why.

Today, on every Skylake system, this patch saves 128 bytes in each
task_struct. If there were an Atom system with XSAVES it would save 384
bytes since there is no AVX support on Atom. If there were a future
processor which has an xstate _past_ AVX-512, but that does not have
AVX-512 itself, that savings goes up to 2048+384 bytes. I believe it is
*inevitable* that the savings will become substantial.

Plus, if the processors ever start supporting a supervisor state that we
_need_ in Linux, we have to XSAVES support anyway.

It's inevitable that we _will_ need it.

Why do it today?

Now that Skylake is out, we _can_ get reasonable testing of this feature
from early adopters in the wild. If we turn this on today, and it
breaks, we break a relatively modest number of Skylake systems (1%? 2%?
0.1%?). Let's say we wait $X years when the benefits are greater. We
turn it on, and something breaks. We'll break 50% (or 40% or whatever)
of the systems in production.

Once we *HAVE* XSAVES support, it also opens up the possibilities for
doing things like dynamic XSAVE buffer allocation. For instance, let
threads that are not _using_ AVX-512 not waste the 2k of space for it.

So why wait?

Ingo Molnar

unread,
Apr 29, 2016, 3:43:52 PM4/29/16
to Dave Hansen, Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu

* Dave Hansen <dave....@linux.intel.com> wrote:

> Hi Folks,
>
> I've heard through the grapevine that there's some concern that we
> should not be bothering to enable XSAVES because there's not a
> sufficient use case for it. [...]

So I have no fundamental objections against this series - I didn't apply it back
in March because not all patches had your Reviewed-by tag. Basically after you
sorted out all the XSAVE dynamic feature detection/sizing issues I was a happy
camper and have no objection against XSAVES.

Could you please send a refreshed version against the latestest tip:master?

Thanks,

Ingo

Yu-cheng Yu

unread,
Apr 29, 2016, 4:01:55 PM4/29/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 11:09:23AM -0700, Dave Hansen wrote:

> Once we *HAVE* XSAVES support, it also opens up the possibilities for
> doing things like dynamic XSAVE buffer allocation. For instance, let
> threads that are not _using_ AVX-512 not waste the 2k of space for it.

If we can somehow modify exec* system call to scan the executable binary (in user space) and pass along a bitmask containing xfeatures used in the binary, and XSAVES is enabled in the kernel, we can easily save a lot of memory. The kernel only needs to allocate space for tasks that actually use xstates; most of them do not.

Dave Hansen

unread,
Apr 29, 2016, 4:03:52 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
That's not feasible. Think of dynamic libraries or just-in-time
compilers. What instruction set does /usr/bin/java use, for instance? :)

Dave Hansen

unread,
Apr 29, 2016, 4:09:17 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
> index 0fbf60c..09945f1 100644
> --- a/arch/x86/kernel/fpu/signal.c
> +++ b/arch/x86/kernel/fpu/signal.c
> @@ -130,6 +130,45 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
> return err;
> }
>
> +static int may_copy_fpregs_to_sigframe(void)
> +{
> + /*
> + * In signal handling path, the kernel already checks if
> + * FPU instructions have been used before it calls
> + * copy_fpstate_to_sigframe(). We check this here again
> + * to detect any potential mis-use and saving invalid
> + * register values directly to a signal frame.
> + */
> + WARN_ONCE(!current->thread.fpu.fpstate_active,
> + "direct FPU save with no math use\n");

This is probably an OK check for this _particular_ context (since this
context is all ready to copy_to_user() the fpu state). But is it good
generally? Why couldn't you have a !fpstate_active thread that _was_
fpregs_active?

Such a thread _could_ do a direct XSAVE with no issues.

Yu-cheng Yu

unread,
Apr 29, 2016, 4:11:39 PM4/29/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 01:03:43PM -0700, Dave Hansen wrote:
> That's not feasible. Think of dynamic libraries or just-in-time
> compilers. What instruction set does /usr/bin/java use, for instance? :)

The java argument is true. In that case or when the bitmask is missing, we can allocate for all supported features.

Dave Hansen

unread,
Apr 29, 2016, 4:12:11 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> CPUID function 0x0d, sub function (i, i > 1) returns in ecx[1] the
> alignment requirement of component i when the compacted format is used.
>
> If ecx[1] is 0, component i is located immediately following the preceding
> component. If ecx[1] is 1, component i is located on the next 64-byte
> boundary following the preceding component.

Reviewed-by: Dave Hansen <dave....@intel.com>

Dave Hansen

unread,
Apr 29, 2016, 4:17:11 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> + if (boot_cpu_has(X86_FEATURE_XSAVES)) {
> + ret = copyout_from_xsaves(pos, count, kbuf, ubuf, xsave);

On a higher level, we really should stop using
"boot_cpu_has(X86_FEATURE_XSAVES)" as a proxy for "the kernel XSAVE
buffer is in the XSAVES format". I think our use of the _hardware_
CPUID bit is confusing at least the KVM folks.

We probably want a software X86_FEATURE_OS_XSAVES or something.

Andy Lutomirski

unread,
Apr 29, 2016, 4:25:39 PM4/29/16
to Yu-cheng Yu, Dave Hansen, X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
I actually want to see us moving in the direction of unconditionally
allocating everything on process startup. If we can stop using CR0.TS
entirely, I think everything will be better.

--Andy

Dave Hansen

unread,
Apr 29, 2016, 4:25:44 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> + for (i = 0; i < XFEATURE_MAX; i++) {
> + /*
> + * Copy only in-use xstates.
> + */
> + if (((header.xfeatures >> i) & 1) && xfeature_enabled(i)) {
> + void *src = get_xsave_addr_no_check(xsave, i);

How could a bit in header.xfeatures get set if it is not set in
xfeature_enabled() aka xfeatures_mask aka XCR0?

..
> +int copyin_to_xsaves(const void *kbuf, const void __user *ubuf,
> + struct xregs_state *xsave)
> +{
> + unsigned int offset, size;
> + int i;
> + u64 xfeatures;
> +
> + offset = offsetof(struct xregs_state, header);
> + size = sizeof(xfeatures);
> +
> + if (kbuf)
> + memcpy(&xfeatures, kbuf + offset, size);
> + else if (__copy_from_user(&xfeatures, ubuf + offset, size))
> + return -EFAULT;
> +
> + /*
> + * Reject if the user tries to set any supervisor xstates.
> + */
> + if (xfeatures & XFEATURE_MASK_SUPERVISOR)
> + return -EINVAL;
> +
> + for (i = 0; i < XFEATURE_MAX; i++) {
> + u64 mask = ((u64)1 << i);
> +
> + if ((xfeatures & mask) && xfeature_enabled(i)) {
> + void *dst = get_xsave_addr_no_check(xsave, i);
> +
> + offset = xstate_offsets[i];
> + size = xstate_sizes[i];
> +
> + if (kbuf)
> + memcpy(dst, kbuf + offset, size);
> + else if (__copy_from_user(dst, ubuf + offset, size))
> + return -EFAULT;
> + }
> + }

If a caller tries to pass a non-enabled xfeature in, we appear to just
silently drop it and return success. Is that really what we want to do
or do we want to error out?

Dave Hansen

unread,
Apr 29, 2016, 4:27:01 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> Component offset print out was incorrect for XSAVES. Correct it and move
> to a separate function.

Reviewed-by: Dave Hansen <dave....@intel.com>

Dave Hansen

unread,
Apr 29, 2016, 4:28:41 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> The arrays xstate_offsets[] and xstate_sizes[] record XSAVE area
> offsets and sizes. Values for legacy components i387 and XMMs were
> not initialized. Fix it.

Is this just a completeness thing or does it actually break something?

In any case:

Reviewed-by: Dave Hansen <dave....@intel.com>

Dave Hansen

unread,
Apr 29, 2016, 4:32:25 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> We did not handle XSAVES* instructions correctly. There were issues in
> converting between standard and compacted format when interfacing with
> user-space. These issues have been corrected.
>
> Add a WARN_ONCE() to make it clear that XSAVES supervisor states are not
> yet implemented.

The reason I haven't acked this patch is that I want to be _sure_ that
we've audited all of the call paths that access the XSAVE buffer to
ensure that they can all either handle the XSAVES format *or* don't care
for whatever reason.

Could you share the steps that you've taken to assure yourself that all
of the call paths are handled and we don't have more bugs?

FWIW, this was the single biggest lesson I learned from the failure the
last time this got turned on: we simply didn't go look for all the
places that the new format had to be handled. Let's be sure we don't
repeat that.

If we get this *wrong* in another user/kernel interface like we did for
ptrace and the signal save/restore and we ever enable a supervisor state
we've got an almost certain immediate root hole of some kind. I think
we need to exercise some serious caution here. Thank $DEITY we don't
have any supported supervisor states at the moment.

Dave Hansen

unread,
Apr 29, 2016, 4:37:32 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Remember, execve() doesn't replace the task_struct. How do we resize
the task_struct at execve() time? If /bin/bash doesn't use AVX, then
fork()s and execve()s an AVX-using program, what do we do?

Borislav Petkov

unread,
Apr 29, 2016, 4:38:57 PM4/29/16
to Dave Hansen, Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 01:16:59PM -0700, Dave Hansen wrote:
> We probably want a software X86_FEATURE_OS_XSAVES or something.

.. or a simple variable.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

Dave Hansen

unread,
Apr 29, 2016, 4:40:18 PM4/29/16
to Andy Lutomirski, Yu-cheng Yu, X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
We can absolutely allocate the worst-case XSAVE buffer at task startup
for folks that never want to see a latency spike in the life of the app
no matter what.

But I also think it would be pretty nice if 'ls' didn't pay the 2k cost
to have AVX-512 state if it's not using AVX-512. We also don't have to
do this with CR0.TS. We'd actually use a combination of out-of-line
(not appended to task_struct) XSAVE buffers and XGETBV1 to check the
size of our XSAVE buffer before we call XSAVE* and resize it when needed.

Maybe nobody will ever care enough about 2kbytes/thread, though.

Dave Hansen

unread,
Apr 29, 2016, 4:41:03 PM4/29/16
to Borislav Petkov, Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 04/29/2016 01:38 PM, Borislav Petkov wrote:
> On Fri, Apr 29, 2016 at 01:16:59PM -0700, Dave Hansen wrote:
>> > We probably want a software X86_FEATURE_OS_XSAVES or something.
> ... or a simple variable.

I think we do some instruction patching based on it, so I was just
suggesting the software X86_FEATURE because it would plug in to the
existing scheme easier.

Borislav Petkov

unread,
Apr 29, 2016, 4:47:09 PM4/29/16
to Dave Hansen, Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 01:40:51PM -0700, Dave Hansen wrote:
> I think we do some instruction patching based on it, so I was just
> suggesting the software X86_FEATURE because it would plug in to the
> existing scheme easier.

Ah ok, that's fine then.

Andy Lutomirski

unread,
Apr 29, 2016, 4:49:56 PM4/29/16
to Dave Hansen, Yu-cheng Yu, X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
I suspect we're so far about 2k/thread that no one cares.

That being said, when I wrote this email, I wasn't thinking about
compacted form at all. I think we should allocate a viable xstate
area of some sort on startup and use saves/xrstors/xsaveopt/whatever
without fiddling with TS and eagerly save and restore even if no
extended state whatsoever has been used. I'm certainly okay in
principle with reallocating.

However, what do we do if we run out when memory when trying to reallocate?

--
Andy Lutomirski
AMA Capital Management, LLC

Yu-cheng Yu

unread,
Apr 29, 2016, 6:11:43 PM4/29/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 01:28:31PM -0700, Dave Hansen wrote:
> On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> > The arrays xstate_offsets[] and xstate_sizes[] record XSAVE area
> > offsets and sizes. Values for legacy components i387 and XMMs were
> > not initialized. Fix it.
>
> Is this just a completeness thing or does it actually break something?

PTRACE format conversion needs them.

Dave Hansen

unread,
Apr 29, 2016, 6:13:42 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
If you respin these, can you please note that in the changelog and/or
comments?

Yu-cheng Yu

unread,
Apr 29, 2016, 6:20:11 PM4/29/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Do you mean when it is rebased? I will do that.

Yu-cheng Yu

unread,
Apr 29, 2016, 6:35:04 PM4/29/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 01:25:34PM -0700, Dave Hansen wrote:
> On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
> > + for (i = 0; i < XFEATURE_MAX; i++) {
> > + /*
> > + * Copy only in-use xstates.
> > + */
> > + if (((header.xfeatures >> i) & 1) && xfeature_enabled(i)) {
> > + void *src = get_xsave_addr_no_check(xsave, i);
>
> How could a bit in header.xfeatures get set if it is not set in
> xfeature_enabled() aka xfeatures_mask aka XCR0?

Do you mean, we should test xfeature_enabled(i) first, like,

if (xfeature_enabled(i) && ((header.xfeatures >> i) & 1)) ?

The result will be the same, like you said, if XCR0[i] is not set,
hader.xfeatures[i] cannot be set. But if XCR0[i] is set,
header.xfeatures[i] can be cleared.

>
> If a caller tries to pass a non-enabled xfeature in, we appear to just
> silently drop it and return success. Is that really what we want to do
> or do we want to error out?

Let it fail. I will chage it.

Dave Hansen

unread,
Apr 29, 2016, 6:36:24 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 04/29/2016 03:30 PM, Yu-cheng Yu wrote:
> On Fri, Apr 29, 2016 at 01:25:34PM -0700, Dave Hansen wrote:
>> On 03/04/2016 10:12 AM, Yu-cheng Yu wrote:
>>> + for (i = 0; i < XFEATURE_MAX; i++) {
>>> + /*
>>> + * Copy only in-use xstates.
>>> + */
>>> + if (((header.xfeatures >> i) & 1) && xfeature_enabled(i)) {
>>> + void *src = get_xsave_addr_no_check(xsave, i);
>>
>> How could a bit in header.xfeatures get set if it is not set in
>> xfeature_enabled() aka xfeatures_mask aka XCR0?
>
> Do you mean, we should test xfeature_enabled(i) first, like,
>
> if (xfeature_enabled(i) && ((header.xfeatures >> i) & 1)) ?
>
> The result will be the same, like you said, if XCR0[i] is not set,
> hader.xfeatures[i] cannot be set. But if XCR0[i] is set,
> header.xfeatures[i] can be cleared.

I think the xfeature_enabled(i) is probably redundant. Does it serve
any actual purpose?


Dave Hansen

unread,
Apr 29, 2016, 6:42:22 PM4/29/16
to Andy Lutomirski, Yu-cheng Yu, X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 04/29/2016 01:49 PM, Andy Lutomirski wrote:
>> >
>> > But I also think it would be pretty nice if 'ls' didn't pay the 2k cost
>> > to have AVX-512 state if it's not using AVX-512. We also don't have to
>> > do this with CR0.TS. We'd actually use a combination of out-of-line
>> > (not appended to task_struct) XSAVE buffers and XGETBV1 to check the
>> > size of our XSAVE buffer before we call XSAVE* and resize it when needed.
>> >
>> > Maybe nobody will ever care enough about 2kbytes/thread, though.
> I suspect we're so far about 2k/thread that no one cares.
>
..
> However, what do we do if we run out when memory when trying to reallocate?

The thread has to die a horrible death. We can switch away from it, but
can never switch back to it. Well we can switch to it, we just can't
return to userspace.

Actually, though... On my laptop 1/3 of the task_struct is XSAVE state
(it's ~3k). With AVX-512, ~3/5 of it will be XSAVE, and it will be ~5k
(>1 page). Breaking it in to two pieces makes it overall less likely
that we'd have to fail an allocation.

We'd be in a situation where we probably can't fork *anyway* if that
happened.

Yu-cheng Yu

unread,
Apr 29, 2016, 6:43:22 PM4/29/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Got it.

Yu-cheng Yu

unread,
Apr 29, 2016, 6:47:52 PM4/29/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
But it won't come to this function unless fpstate_active is ture?

Yu-cheng Yu

unread,
Apr 29, 2016, 7:17:07 PM4/29/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 01:32:15PM -0700, Dave Hansen wrote:
> The reason I haven't acked this patch is that I want to be _sure_ that
> we've audited all of the call paths that access the XSAVE buffer to
> ensure that they can all either handle the XSAVES format *or* don't care
> for whatever reason.
>
> Could you share the steps that you've taken to assure yourself that all
> of the call paths are handled and we don't have more bugs?
>

We tested for signal, ptrace, context switch, avx, and mpx. We also run
these tests with your audit patch to detect any format mis-match.
That said, I cannot be sure there are no more bugs. As you said, we want
to get this feature tested in the field and find potential issues early.

Dave Hansen

unread,
Apr 29, 2016, 8:36:57 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
If may_copy_fpregs_to_sigframe() were called from a slightly different
context, or if we change the call-site, what breaks?

In other words. if we can still "may_copy_fpregs_to_sigframe()" no
matter the state of fpu.fpstate_active, then I don't think we should be
checking it in may_copy_fpregs_to_sigframe().

Dave Hansen

unread,
Apr 29, 2016, 8:40:52 PM4/29/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
That's better than what we had before, but it relies entirely on testing
coverage and runtime checks.

Is it too much to ask that you also take a look and audit all the places
the XSAVE buffer is accessed in the kernel and ensure that they either
have code to handle standard vs. compacted/supervisor or don't care for
some reason?

I did such an audit once upon a time, but I think it would be a good
exercise to repeat both by a second set of eyes and because some time
has passed.

Ingo Molnar

unread,
Apr 30, 2016, 3:54:02 AM4/30/16
to Dave Hansen, Andy Lutomirski, Yu-cheng Yu, X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu

* Dave Hansen <dave....@linux.intel.com> wrote:

> But I also think it would be pretty nice if 'ls' didn't pay the 2k cost to have
> AVX-512 state if it's not using AVX-512. [...]

A C library might decide to use AVX-512 memset(). RAM is cheap, while allocation
complexity, especially in the kernel, has various other costs.

I mean, we should not worry about per thread allocation sizes that can be compared
to the kernel stack size.

We can still use the compacted area handling instructions, because presumably
those are the fastest and are also the most optimized ones? But I wouldn't use
them to do dynamic allocation: just allocate the maximum possible FPU save area at
task creation time and never again worry about that detail.

Ok?

Thanks,

Ingo

Yu-cheng Yu

unread,
May 2, 2016, 12:02:24 PM5/2/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 05:36:48PM -0700, Dave Hansen wrote:
> If may_copy_fpregs_to_sigframe() were called from a slightly different
> context, or if we change the call-site, what breaks?
>
> In other words. if we can still "may_copy_fpregs_to_sigframe()" no
> matter the state of fpu.fpstate_active, then I don't think we should be
> checking it in may_copy_fpregs_to_sigframe().

Do you mean, don't check fpu.fpstate_active here?

Dave Hansen

unread,
May 2, 2016, 12:06:57 PM5/2/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Not really. I'm asking *why* the check is there.

Yu-cheng Yu

unread,
May 2, 2016, 12:15:58 PM5/2/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Fri, Apr 29, 2016 at 05:40:44PM -0700, Dave Hansen wrote:
> That's better than what we had before, but it relies entirely on testing
> coverage and runtime checks.
>
> Is it too much to ask that you also take a look and audit all the places
> the XSAVE buffer is accessed in the kernel and ensure that they either
> have code to handle standard vs. compacted/supervisor or don't care for
> some reason?
>
> I did such an audit once upon a time, but I think it would be a good
> exercise to repeat both by a second set of eyes and because some time
> has passed.

I think there are 12 files that can be directly impacted by XSAVES.

arch/x86/include/asm/fpu/internal.h
arch/x86/include/asm/fpu/types.h
arch/x86/include/asm/fpu/xstate.h
arch/x86/include/uapi/asm/sigcontext.h
arch/x86/kernel/traps.c
arch/x86/kernel/fpu/core.c
arch/x86/kernel/fpu/init.c
arch/x86/kernel/fpu/regset.c
arch/x86/kernel/fpu/signal.c
arch/x86/kernel/fpu/xstate.c
arch/x86/kvm/x86.c
arch/x86/mm/mpx.c

They have been reviewed from the perspective of the compacted format.
Please let me know anything else.

Dave Hansen

unread,
May 2, 2016, 12:28:23 PM5/2/16
to Ingo Molnar, Andy Lutomirski, Yu-cheng Yu, X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 04/30/2016 12:53 AM, Ingo Molnar wrote:
> We can still use the compacted area handling instructions, because presumably
> those are the fastest and are also the most optimized ones? But I wouldn't use
> them to do dynamic allocation: just allocate the maximum possible FPU save area at
> task creation time and never again worry about that detail.
>
> Ok?

Sounds sane to me.

BTW, I hacked up your "fpu performance" to compare XSAVE vs. XSAVES:

> [ 0.048347] x86/fpu: Cost of: XSAVE insn : 127 cycles
> [ 0.049134] x86/fpu: Cost of: XSAVES insn : 113 cycles
> [ 0.048492] x86/fpu: Cost of: XRSTOR insn : 120 cycles
> [ 0.049267] x86/fpu: Cost of: XRSTORS insn : 102 cycles

So I guess we can add that to the list of things that XSAVES is good
for. Granted, the real-world benefit is probably hard to measure
because the cache residency of the XSAVE buffer isn't as good when
_actually_ context switching, but this at least shows a small
theoretical advantage for XSAVES.

Yu-cheng Yu

unread,
May 2, 2016, 12:38:34 PM5/2/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
If (fpu.fpstate_active == 0), then the task does not use FPU; we don't
want to save these registers, right?

Dave Hansen

unread,
May 2, 2016, 12:44:35 PM5/2/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
No. It's possible to have fpstate_active=0 while fpregs_active=1. Such
a task uses the FPU, but just hasn't done an XSAVE* to save the register
content to the fpstate buffer.

Note, this is just theoretical, and does not happen in this particular
call path today.

Yu-cheng Yu

unread,
May 2, 2016, 1:23:59 PM5/2/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Mon, May 02, 2016 at 09:43:47AM -0700, Dave Hansen wrote:
> > If (fpu.fpstate_active == 0), then the task does not use FPU; we don't
> > want to save these registers, right?
>
> No. It's possible to have fpstate_active=0 while fpregs_active=1. Such
> a task uses the FPU, but just hasn't done an XSAVE* to save the register
> content to the fpstate buffer.
>
> Note, this is just theoretical, and does not happen in this particular
> call path today.

What about...

static int may_copy_fpregs_to_sigframe(void)
{
if (fpregs_active())
return 1;


WARN_ONCE(!current->thread.fpu.fpstate_active,
"direct FPU save with no math use\n");

if (boot_cpu_has(X86_FEATURE_XSAVES))
return 1;

return 0;
}

Dave Hansen

unread,
May 2, 2016, 1:33:25 PM5/2/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
I don't think that changes anything. We still have a check in there
that has no purpose. You've changed the ordering so that the specific
example that I pointed out no longer triggers it. But, the underlying
issue remains.

Ingo Molnar

unread,
May 2, 2016, 2:32:36 PM5/2/16
to Dave Hansen, Andy Lutomirski, Yu-cheng Yu, X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu

* Dave Hansen <dave....@linux.intel.com> wrote:

> On 04/30/2016 12:53 AM, Ingo Molnar wrote:
> > We can still use the compacted area handling instructions, because presumably
> > those are the fastest and are also the most optimized ones? But I wouldn't use
> > them to do dynamic allocation: just allocate the maximum possible FPU save area at
> > task creation time and never again worry about that detail.
> >
> > Ok?
>
> Sounds sane to me.
>
> BTW, I hacked up your "fpu performance" to compare XSAVE vs. XSAVES:
>
> > [ 0.048347] x86/fpu: Cost of: XSAVE insn : 127 cycles
> > [ 0.049134] x86/fpu: Cost of: XSAVES insn : 113 cycles
> > [ 0.048492] x86/fpu: Cost of: XRSTOR insn : 120 cycles
> > [ 0.049267] x86/fpu: Cost of: XRSTORS insn : 102 cycles
>
> So I guess we can add that to the list of things that XSAVES is good for.

Absolutely!

> [...] Granted, the real-world benefit is probably hard to measure because the
> cache residency of the XSAVE buffer isn't as good when _actually_ context
> switching, but this at least shows a small theoretical advantage for XSAVES.

Yeah, and anything that was measured for real is far from being theoretical. It's
simply a best-case microbenchmark figure, but it's still a nice 10+ cycles
improvement overall - which might become bigger in future CPU generations.

Thanks,

Ingo

Yu-cheng Yu

unread,
May 2, 2016, 5:22:41 PM5/2/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Before Linux gets into copy_fpstate_to_sigframe(),
current->thread.fpu.fpstate_active must be true.
For eagerfpu, fpregs_active() must also be true.
For lazyfpu, once we try to do FSAVE/FXSAVE/XSAVE,
fpregs_active() will become true as well.

We should have not based on boot_cpu_has(X86_FEATURE_XSAVES)
at all.

Why don't we make it simple and always copy_fpregs_to_signal_frame()?
Or, only for the lazy case, i.e. !fpregs_active(), we do __copy_to_user().

Anyway, I think we can just replace may_copy_fpregs_to_sigframe() with
!fpregs_active().

Comments?

Yu-cheng Yu

unread,
May 2, 2016, 5:29:20 PM5/2/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Mon, May 02, 2016 at 02:18:17PM -0700, Yu-cheng Yu wrote:
> Before Linux gets into copy_fpstate_to_sigframe(),
> current->thread.fpu.fpstate_active must be true.
> For eagerfpu, fpregs_active() must also be true.
> For lazyfpu, once we try to do FSAVE/FXSAVE/XSAVE,
> fpregs_active() will become true as well.
>
> We should have not based on boot_cpu_has(X86_FEATURE_XSAVES)
> at all.
>
> Why don't we make it simple and always copy_fpregs_to_signal_frame()?
> Or, only for the lazy case, i.e. !fpregs_active(), we do __copy_to_user().

For (lazy && not XSAVES) actually!

Dave Hansen

unread,
May 2, 2016, 5:32:33 PM5/2/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
I think we're off in the weeds here.

Please just send an updated patch for what you want to do here.

Yu-cheng Yu

unread,
May 2, 2016, 6:22:20 PM5/2/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On Mon, May 02, 2016 at 02:32:14PM -0700, Dave Hansen wrote:
>
> I think we're off in the weeds here.
>
> Please just send an updated patch for what you want to do here.

From 43134a773d23ae8bab9f158d143c5cfb76bc0e9c Mon Sep 17 00:00:00 2001
From: Yu-cheng Yu <yu-ch...@intel.com>
Date: Sat, 14 Nov 2015 16:59:45 -0800
Subject: [PATCH] x86/xsaves: Introduce a new check that allows correct xstates
copy from kernel to user directly

XSAVES is a kernel instruction and uses a compacted format. When working with user space, the kernel should provide
standard-format, non-supervisor state data. We cannot do __copy_to_user() from a compacted- format kernel xstate area to a
signal frame.

Dave Hansen proposes this method to simplify copy xstate directly to user.

Signed-off-by: Fenghua Yu <fengh...@intel.com>
Signed-off by: Yu-cheng Yu <yu-ch...@intel.com>
---
arch/x86/include/asm/fpu/xstate.h | 1 +
arch/x86/kernel/fpu/signal.c | 3 ++-
arch/x86/kernel/fpu/xstate.c | 2 +-
3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 16df2c4..d812cf3 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -47,5 +47,6 @@ extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);
void fpu__xstate_clear_all_cpu_caps(void);
void *get_xsave_addr(struct xregs_state *xsave, int xstate);
const void *get_xsave_field_ptr(int xstate_field);
+int using_compacted_format(void);

#endif
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 0fbf60c..d7fdd8c 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -8,6 +8,7 @@
#include <asm/fpu/internal.h>
#include <asm/fpu/signal.h>
#include <asm/fpu/regset.h>
+#include <asm/fpu/xstate.h>

#include <asm/sigframe.h>

@@ -167,7 +168,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
sizeof(struct user_i387_ia32_struct), NULL,
(struct _fpstate_32 __user *) buf) ? -1 : 1;

- if (fpregs_active()) {
+ if (fpregs_active() || using_compacted_format()) {
/* Save the live register state to the user directly. */
if (copy_fpregs_to_sigframe(buf_fx))
return -1;
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 170c164..2b59bd7 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -415,7 +415,7 @@ static int xfeature_size(int xfeature_nr)
* that it is obvious which aspect of 'XSAVES' is being handled
* by the calling code.
*/
-static int using_compacted_format(void)
+int using_compacted_format(void)
{
return cpu_has_xsaves;
}
--
1.9.1

Dave Hansen

unread,
May 2, 2016, 6:38:00 PM5/2/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 05/02/2016 03:17 PM, Yu-cheng Yu wrote:
> @@ -167,7 +168,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
> sizeof(struct user_i387_ia32_struct), NULL,
> (struct _fpstate_32 __user *) buf) ? -1 : 1;
>
> - if (fpregs_active()) {
> + if (fpregs_active() || using_compacted_format()) {
> /* Save the live register state to the user directly. */
> if (copy_fpregs_to_sigframe(buf_fx))
> return -1;

So, compared to the first patch, you move the fpregs_active() check out
to the caller of may_copy_fpregs_to_sigframe() (good) and removed a
bunch of comments explaining what was going on (bad).

Do we really want all those comments to die?

Dave Hansen

unread,
May 4, 2016, 6:15:46 PM5/4/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Can you double-check that nothing has changed in mainline (or tip for
that matter) since you first did these checks?

Yu-cheng Yu

unread,
May 4, 2016, 6:25:54 PM5/4/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
Yes. I am also doing some tests now.

Dave Hansen

unread,
May 4, 2016, 6:41:58 PM5/4/16
to Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
It's my fault, but you also need to go update

fpu__xfeature_set_state()
and
__raw_xsave_addr()

The theoretical problem is that you might ask for a __raw_xsave_addr()
of a component which has been compacted out of an XSAVES buffer and thus
has no address. We could work around this by doing a memmove() and
moving the components "up" after the one we are trying to set in order
to make space.

But, since we *always* call XSAVES with an instruction mask of -1 and
end up with a requested feature bitmap (RFBM) equal to XCR0, I think we
can do a shortcut because we'll practically *always* have an
xcomp_bv==RFBM==XCR0, which means that all (present) components will
always have an address.

So, the alternative to doing the memmove() is to add some WARN_ON_FPU()
checks to enforce xcomp_bv==RFBM==XCR0 in places where we call
XSAVES/XRSTORS and __raw_xsave_addr(), maybe more.

Dave Hansen

unread,
May 4, 2016, 6:46:56 PM5/4/16
to Dave Hansen, Yu-cheng Yu, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
On 05/04/2016 03:41 PM, Dave Hansen wrote:
> But, since we *always* call XSAVES with an instruction mask of -1 and
> end up with a requested feature bitmap (RFBM) equal to XCR0, I think we
> can do a shortcut because we'll practically *always* have an
> xcomp_bv==RFBM==XCR0, which means that all (present) components will
> always have an address.

Actually, we depend on that anyway. xstate_comp_offsets[] is completely
bogus if we don't have xcomp_bv==RFBM==XCR0.

Perhaps we need to add those checks anyway.

Yu-cheng Yu

unread,
May 4, 2016, 6:50:51 PM5/4/16
to Dave Hansen, x...@kernel.org, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-...@vger.kernel.org, Andy Lutomirski, Borislav Petkov, Sai Praneeth Prakhya, Ravi V. Shankar, Fenghua Yu
In the coming version 5 patches, we are going to have one additional
patch for fixing __fpu_restore_sig() for the compacted format. I changed
my existing patch a little and run into some problems. Fixing it now.

Our ptrace tests went OK before, but are failing now. It might be relating
to what you are saying? I will check it.

0 new messages