Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

[PATCH 9/9] x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)

87 views

Skip to first unread message

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:01 AM11/29/11

This patch extends the xsave structure to support the LWP state. The
xstate feature bit for LWP is added to XCNTXT_NONLAZY, thereby enabling
kernel support for saving/restoring LWP state. The LWP state is also
saved/restored on signal entry/return, just like all other xstates. LWP
state needs to be reset (disabled) when entering a signal handler.

v2:
When copying xstates for fork/clone, disable LWP in the new copy so that
new tasks start with LWP disabled as required by the LWP specification.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---
arch/x86/include/asm/i387.h | 4 ++++
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/processor.h | 12 ++++++++++++
arch/x86/include/asm/sigcontext.h | 12 ++++++++++++
arch/x86/include/asm/xsave.h | 3 ++-
arch/x86/kernel/xsave.c | 5 +++++
6 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 6efe38a..c56cb2b 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -326,6 +326,10 @@ static inline void fpu_free(struct fpu *fpu)
static inline void fpu_copy(struct fpu *dst, struct fpu *src)
{
memcpy(dst->state, src->state, xstate_size);
+
+ /* disable LWP in the copy */
+ if (pcntxt_mask & XSTATE_LWP)
+ dst->state->xsave.xsave_hdr.xstate_bv &= ~XSTATE_LWP;
}

extern void fpu_finit(struct fpu *fpu);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d52609a..2d9cf3c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -136,6 +136,7 @@
#define MSR_AMD64_IBSDCPHYSAD 0xc0011039
#define MSR_AMD64_IBSCTL 0xc001103a
#define MSR_AMD64_IBSBRTARGET 0xc001103b
+#define MSR_AMD64_LWP_CBADDR 0xc0000106

/* Fam 15h MSRs */
#define MSR_F15H_PERF_CTL 0xc0010200
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 0d1171c..bb31ab6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -353,6 +353,17 @@ struct ymmh_struct {
u32 ymmh_space[64];
};

+struct lwp_struct {
+ u64 lwpcb_addr;
+ u32 flags;
+ u32 buf_head_offset;
+ u64 buf_base;
+ u32 buf_size;
+ u32 filters;
+ u64 saved_event_record[4];
+ u32 event_counter[16];
+};
+
struct xsave_hdr_struct {
u64 xstate_bv;
u64 reserved1[2];
@@ -363,6 +374,7 @@ struct xsave_struct {
struct i387_fxsave_struct i387;
struct xsave_hdr_struct xsave_hdr;
struct ymmh_struct ymmh;
+ struct lwp_struct lwp;
/* new processor state extensions will go here */
} __attribute__ ((packed, aligned (64)));

diff --git a/arch/x86/include/asm/sigcontext.h b/arch/x86/include/asm/sigcontext.h
index 04459d2..0a58b82 100644
--- a/arch/x86/include/asm/sigcontext.h
+++ b/arch/x86/include/asm/sigcontext.h
@@ -274,6 +274,17 @@ struct _ymmh_state {
__u32 ymmh_space[64];
};

+struct _lwp_state {
+ __u64 lwpcb_addr;
+ __u32 flags;
+ __u32 buf_head_offset;
+ __u64 buf_base;
+ __u32 buf_size;
+ __u32 filters;
+ __u64 saved_event_record[4];
+ __u32 event_counter[16];
+};
+
/*
* Extended state pointed by the fpstate pointer in the sigcontext.
* In addition to the fpstate, information encoded in the xstate_hdr
@@ -284,6 +295,7 @@ struct _xstate {
struct _fpstate fpstate;
struct _xsave_hdr xstate_hdr;
struct _ymmh_state ymmh;
+ struct _lwp_state lwp;
/* new processor state extensions go here */
};

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 02f1e1d..d61c87f 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -9,6 +9,7 @@
#define XSTATE_FP 0x1
#define XSTATE_SSE 0x2
#define XSTATE_YMM 0x4
+#define XSTATE_LWP (1ULL << 62)

#define XSTATE_FPSSE (XSTATE_FP | XSTATE_SSE)

@@ -24,7 +25,7 @@
* These are the features that the OS can handle currently.
*/
#define XCNTXT_LAZY (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
-#define XCNTXT_NONLAZY 0
+#define XCNTXT_NONLAZY (XSTATE_LWP)

#define XCNTXT_MASK (XCNTXT_LAZY | XCNTXT_NONLAZY)

diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 988bdef..5d886ec 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -249,6 +249,11 @@ int save_xstates_sigframe(void __user *buf, unsigned int size)
return err;
}

+ if (pcntxt_mask & XSTATE_LWP) {
+ xsave->xsave_hdr.xstate_bv &= ~XSTATE_LWP;
+ wrmsrl(MSR_AMD64_LWP_CBADDR, 0);
+ }
+
return 1;
}

--
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:01 AM11/29/11

Non-lazy xstates are, as the name suggests, extended states that cannot
be saved or restored lazily. The state for AMDs LWP feature is a
non-lazy state.

This patch adds support for this kind of xstates. If any such states are
present and supported on the running system, they will always be enabled
in xstate_mask so that they are always restored in switch_to.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/xsave.h | 5 +++--
arch/x86/kernel/xsave.c | 6 +++++-
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 12793b6..02f1e1d 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -23,9 +23,10 @@
/*

* These are the features that the OS can handle currently.
*/

-#define XCNTXT_MASK (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
+#define XCNTXT_LAZY (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
+#define XCNTXT_NONLAZY 0

-#define XCNTXT_LAZY XCNTXT_MASK
+#define XCNTXT_MASK (XCNTXT_LAZY | XCNTXT_NONLAZY)

#ifdef CONFIG_X86_64
#define REX_PREFIX "0x48, "
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 1a11291..988bdef 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -16,6 +16,7 @@
* Supported feature mask by the CPU and the kernel.
*/
u64 pcntxt_mask;
+EXPORT_SYMBOL(pcntxt_mask);

/*
* Represents init state for the supported extended state.
@@ -261,7 +262,7 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)
struct task_struct *tsk = current;
struct _fpstate_ia32 __user *fp = buf;
struct xsave_struct *xsave;
- u64 xstate_mask = 0;
+ u64 xstate_mask = pcntxt_mask & XCNTXT_NONLAZY;

if (!buf) {
if (used_math()) {
@@ -478,6 +479,9 @@ static void __init xstate_enable_boot_cpu(void)
printk(KERN_INFO "xsave/xrstor: enabled xstate_bv 0x%llx, "
"cntxt size 0x%x\n",
pcntxt_mask, xstate_size);
+
+ if (pcntxt_mask & XCNTXT_NONLAZY)
+ task_thread_info(&init_task)->xstate_mask |= XCNTXT_NONLAZY;
}

/*

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:01 AM11/29/11

The patches to rework the fpu/xsave handling and signal frame setup have
made a lot of code unused. This patch removes all this now useless stuff.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/i387.h | 167 --------------------------------
arch/x86/include/asm/xsave.h | 51 ----------
arch/x86/kernel/i387.c | 219 ------------------------------------------
arch/x86/kernel/traps.c | 22 ----
arch/x86/kernel/xsave.c | 163 -------------------------------
5 files changed, 0 insertions(+), 622 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index daab77f..687e550 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -29,8 +29,6 @@
# include <asm/sigcontext32.h>
# include <asm/user32.h>
#else
-# define save_i387_xstate_ia32 save_i387_xstate
-# define restore_i387_xstate_ia32 restore_i387_xstate
# define _fpstate_ia32 _fpstate
# define _xstate_ia32 _xstate
# define sig_xstate_ia32_size sig_xstate_size
@@ -44,7 +42,6 @@ extern void fpu_init(void);
extern void mxcsr_feature_mask_init(void);
extern int init_fpu(struct task_struct *child);
extern asmlinkage void math_state_restore(void);
-extern void __math_state_restore(void);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

extern void convert_from_fxsr(struct user_i387_ia32_struct *, struct task_struct *);
@@ -108,35 +105,6 @@ static inline void sanitize_i387_state(struct task_struct *tsk)
}

#ifdef CONFIG_X86_64
-static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
-{
- int err;
-
- /* See comment in fxsave() below. */
-#ifdef CONFIG_AS_FXSAVEQ
- asm volatile("1: fxrstorq %[fx]\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err)
- : [fx] "m" (*fx), "0" (0));
-#else
- asm volatile("1: rex64/fxrstor (%[fx])\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err)
- : [fx] "R" (fx), "m" (*fx), "0" (0));
-#endif
- return err;
-}
-
static inline void fxrstor(struct i387_fxsave_struct *fx)
{
/* See comment in fxsave() below. */
@@ -149,48 +117,6 @@ static inline void fxrstor(struct i387_fxsave_struct *fx)
#endif
}

-static inline int fxsave_user(struct i387_fxsave_struct __user *fx)
-{
- int err;
-
- /*
- * Clear the bytes not touched by the fxsave and reserved
- * for the SW usage.
- */
- err = __clear_user(&fx->sw_reserved,
- sizeof(struct _fpx_sw_bytes));
- if (unlikely(err))
- return -EFAULT;
-
- /* See comment in fxsave() below. */
-#ifdef CONFIG_AS_FXSAVEQ
- asm volatile("1: fxsaveq %[fx]\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err), [fx] "=m" (*fx)
- : "0" (0));
-#else
- asm volatile("1: rex64/fxsave (%[fx])\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err), "=m" (*fx)
- : [fx] "R" (fx), "0" (0));
-#endif
- if (unlikely(err) &&
- __clear_user(fx, sizeof(struct i387_fxsave_struct)))
- err = -EFAULT;
- /* No need to clear here because the caller clears USED_MATH */
- return err;
-}
-
static inline void fpu_fxsave(struct fpu *fpu)
{
/* Using "rex64; fxsave %0" is broken because, if the memory operand
@@ -221,21 +147,6 @@ static inline void fpu_fxsave(struct fpu *fpu)
#else /* CONFIG_X86_32 */

/* perform fxrstor iff the processor has extended states, otherwise frstor */
-static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
-{
- /*
- * The "nop" is needed to make the instructions the same
- * length.
- */
- alternative_input(
- "nop ; frstor %1",
- "fxrstor %1",
- X86_FEATURE_FXSR,
- "m" (*fx));
-
- return 0;
-}
-
static inline void fxrstor(struct i387_fxsave_struct *fx)
{
/*
@@ -303,69 +214,6 @@ static inline void fpu_clean(struct fpu *fpu)
[addr] "m" (safe_address));
}

-static inline void fpu_save_init(struct fpu *fpu)
-{
- if (use_xsave()) {
- struct xsave_struct *xstate = &fpu->state->xsave;
-
- fpu_xsave(xstate, -1);
-
- /*
- * xsave header may indicate the init state of the FP.
- */
- if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP))
- return;
- } else if (use_fxsr()) {
- fpu_fxsave(fpu);
- } else {
- asm volatile("fnsave %[fx]; fwait"
- : [fx] "=m" (fpu->state->fsave));
- return;
- }
-
- if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES))
- asm volatile("fnclex");
-
- /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
- is pending. Clear the x87 state here by setting it to fixed
- values. safe_address is a random variable that should be in L1 */
- alternative_input(
- ASM_NOP8 ASM_NOP2,
- "emms\n\t" /* clear stack tags */
- "fildl %P[addr]", /* set F?P to defined value */
- X86_FEATURE_FXSAVE_LEAK,
- [addr] "m" (safe_address));
-}
-
-static inline void __save_init_fpu(struct task_struct *tsk)
-{
- fpu_save_init(&tsk->thread.fpu);
- task_thread_info(tsk)->status &= ~TS_USEDFPU;
-}
-
-static inline int fpu_restore_checking(struct fpu *fpu)
-{
- if (use_xsave())
- return xrstor_checking(&fpu->state->xsave, -1);
- else
- return fxrstor_checking(&fpu->state->fxsave);
-}
-
-/*
- * Signal frame handlers...
- */
-extern int save_i387_xstate(void __user *buf);
-extern int restore_i387_xstate(void __user *buf);
-
-static inline void __unlazy_fpu(struct task_struct *tsk)
-{
- if (task_thread_info(tsk)->status & TS_USEDFPU) {
- __save_init_fpu(tsk);
- stts();
- } else
- tsk->fpu_counter = 0;
-}
-
static inline void __clear_fpu(struct task_struct *tsk)
{
if (task_thread_info(tsk)->status & TS_USEDFPU) {
@@ -434,21 +282,6 @@ static inline void irq_ts_restore(int TS_state)
/*
* These disable preemption on their own and are safe
*/
-static inline void save_init_fpu(struct task_struct *tsk)
-{
- preempt_disable();
- __save_init_fpu(tsk);
- stts();
- preempt_enable();
-}
-
-static inline void unlazy_fpu(struct task_struct *tsk)
-{
- preempt_disable();
- __unlazy_fpu(tsk);
- preempt_enable();
-}
-
static inline void clear_fpu(struct task_struct *tsk)
{
preempt_disable();
diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 42d02b9..8d5bb0e 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -65,26 +65,6 @@ static inline void restore_xstates(struct task_struct *tsk, u64 mask)
preempt_enable();
}

-static inline int xrstor_checking(struct xsave_struct *fx, u64 mask)
-{
- int err;
- u32 lmask = mask;
- u32 hmask = mask >> 32;
-
- asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err)
- : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask), "0" (0)
- : "memory");
-
- return err;
-}
-
static inline void xrstor_state(struct xsave_struct *fx, u64 mask)
{
u32 lmask = mask;
@@ -95,37 +75,6 @@ static inline void xrstor_state(struct xsave_struct *fx, u64 mask)
: "memory");
}

-static inline int xsave_checking(struct xsave_struct __user *buf)
-{
- int err;
-
- /*
- * Clear the xsave header first, so that reserved fields are
- * initialized to zero.
- */
- err = __clear_user(&buf->xsave_hdr,
- sizeof(struct xsave_hdr_struct));
- if (unlikely(err))
- return -EFAULT;
-
- asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b,3b)
- : [err] "=r" (err)
- : "D" (buf), "a" (-1), "d" (-1), "0" (0)
- : "memory");
-
- if (unlikely(err) && __clear_user(buf, xstate_size))
- err = -EFAULT;
-
- /* No need to clear here because the caller clears USED_MATH */
- return err;
-}
-
static inline void xsave_state(struct xsave_struct *fx, u64 mask)
{
u32 lmask = mask;
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index e1b8a42..49d23a5 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -487,225 +487,6 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
}

/*
- * Signal frame handlers.
- */
-
-static inline int save_i387_fsave(struct _fpstate_ia32 __user *buf)
-{
- struct task_struct *tsk = current;
- struct i387_fsave_struct *fp = &tsk->thread.fpu.state->fsave;
-
- fp->status = fp->swd;
- if (__copy_to_user(buf, fp, sizeof(struct i387_fsave_struct)))
- return -1;
- return 1;
-}
-
-static int save_i387_fxsave(struct _fpstate_ia32 __user *buf)
-{
- struct task_struct *tsk = current;
- struct i387_fxsave_struct *fx = &tsk->thread.fpu.state->fxsave;
- struct user_i387_ia32_struct env;
- int err = 0;
-
- convert_from_fxsr(&env, tsk);
- if (__copy_to_user(buf, &env, sizeof(env)))
- return -1;
-
- err |= __put_user(fx->swd, &buf->status);
- err |= __put_user(X86_FXSR_MAGIC, &buf->magic);
- if (err)
- return -1;
-
- if (__copy_to_user(&buf->_fxsr_env[0], fx, xstate_size))
- return -1;
- return 1;
-}
-
-static int save_i387_xsave(void __user *buf)
-{
- struct task_struct *tsk = current;
- struct _fpstate_ia32 __user *fx = buf;
- int err = 0;
-
-
- sanitize_i387_state(tsk);
-
- /*
- * For legacy compatible, we always set FP/SSE bits in the bit
- * vector while saving the state to the user context.
- * This will enable us capturing any changes(during sigreturn) to
- * the FP/SSE bits by the legacy applications which don't touch
- * xstate_bv in the xsave header.
- *
- * xsave aware applications can change the xstate_bv in the xsave
- * header as well as change any contents in the memory layout.
- * xrestore as part of sigreturn will capture all the changes.
- */
- tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv |= XSTATE_FPSSE;
-
- if (save_i387_fxsave(fx) < 0)
- return -1;
-
- err = __copy_to_user(&fx->sw_reserved, &fx_sw_reserved_ia32,
- sizeof(struct _fpx_sw_bytes));
- err |= __put_user(FP_XSTATE_MAGIC2,
- (__u32 __user *) (buf + sig_xstate_ia32_size
- - FP_XSTATE_MAGIC2_SIZE));
- if (err)
- return -1;
-
- return 1;
-}
-
-int save_i387_xstate_ia32(void __user *buf)
-{
- struct _fpstate_ia32 __user *fp = (struct _fpstate_ia32 __user *) buf;
- struct task_struct *tsk = current;
-
- if (!used_math())
- return 0;
-
- if (!access_ok(VERIFY_WRITE, buf, sig_xstate_ia32_size))
- return -EACCES;
- /*
- * This will cause a "finit" to be triggered by the next
- * attempted FPU operation by the 'current' process.
- */
- clear_used_math();
-
- if (!HAVE_HWFP) {
- return fpregs_soft_get(current, NULL,
- 0, sizeof(struct user_i387_ia32_struct),
- NULL, fp) ? -1 : 1;
- }
-
- unlazy_fpu(tsk);
-
- if (cpu_has_xsave)
- return save_i387_xsave(fp);
- if (cpu_has_fxsr)
- return save_i387_fxsave(fp);
- else
- return save_i387_fsave(fp);
-}
-
-static inline int restore_i387_fsave(struct _fpstate_ia32 __user *buf)
-{
- struct task_struct *tsk = current;
-
- return __copy_from_user(&tsk->thread.fpu.state->fsave, buf,
- sizeof(struct i387_fsave_struct));
-}
-
-static int restore_i387_fxsave(struct _fpstate_ia32 __user *buf,
- unsigned int size)
-{
- struct task_struct *tsk = current;
- struct user_i387_ia32_struct env;
- int err;
-
- err = __copy_from_user(&tsk->thread.fpu.state->fxsave, &buf->_fxsr_env[0],
- size);
- /* mxcsr reserved bits must be masked to zero for security reasons */
- tsk->thread.fpu.state->fxsave.mxcsr &= mxcsr_feature_mask;
- if (err || __copy_from_user(&env, buf, sizeof(env)))
- return 1;
- convert_to_fxsr(tsk, &env);
-
- return 0;
-}
-
-static int restore_i387_xsave(void __user *buf)
-{
- struct _fpx_sw_bytes fx_sw_user;
- struct _fpstate_ia32 __user *fx_user =
- ((struct _fpstate_ia32 __user *) buf);
- struct i387_fxsave_struct __user *fx =
- (struct i387_fxsave_struct __user *) &fx_user->_fxsr_env[0];
- struct xsave_hdr_struct *xsave_hdr =
- &current->thread.fpu.state->xsave.xsave_hdr;
- u64 mask;
- int err;
-
- if (check_for_xstate(fx, sig_xstate_ia32_size -
- offsetof(struct _fpstate_ia32, _fxsr_env),
- &fx_sw_user))
- goto fx_only;
-
- mask = fx_sw_user.xstate_bv;
-
- err = restore_i387_fxsave(buf, fx_sw_user.xstate_size);
-
- xsave_hdr->xstate_bv &= pcntxt_mask;
- /*
- * These bits must be zero.
- */
- xsave_hdr->reserved1[0] = xsave_hdr->reserved1[1] = 0;
-
- /*
- * Init the state that is not present in the memory layout
- * and enabled by the OS.
- */
- mask = ~(pcntxt_mask & ~mask);
- xsave_hdr->xstate_bv &= mask;
-
- return err;
-fx_only:
- /*
- * Couldn't find the extended state information in the memory
- * layout. Restore the FP/SSE and init the other extended state
- * enabled by the OS.
- */
- xsave_hdr->xstate_bv = XSTATE_FPSSE;
- return restore_i387_fxsave(buf, sizeof(struct i387_fxsave_struct));
-}
-
-int restore_i387_xstate_ia32(void __user *buf)
-{
- int err;
- struct task_struct *tsk = current;
- struct _fpstate_ia32 __user *fp = (struct _fpstate_ia32 __user *) buf;
-
- if (HAVE_HWFP)
- clear_fpu(tsk);
-
- if (!buf) {
- if (used_math()) {
- clear_fpu(tsk);
- clear_used_math();
- }
-
- return 0;
- } else
- if (!access_ok(VERIFY_READ, buf, sig_xstate_ia32_size))
- return -EACCES;
-
- if (!used_math()) {
- err = init_fpu(tsk);
- if (err)
- return err;
- }
-
- if (HAVE_HWFP) {
- if (cpu_has_xsave)
- err = restore_i387_xsave(buf);
- else if (cpu_has_fxsr)
- err = restore_i387_fxsave(fp, sizeof(struct
- i387_fxsave_struct));
- else
- err = restore_i387_fsave(fp);
- } else {
- err = fpregs_soft_set(current, NULL,
- 0, sizeof(struct user_i387_ia32_struct),
- NULL, fp) != 0;
- }
- set_used_math();
-
- return err;
-}
-
-/*
* FPU state for core dumps.
* This is only used for a.out dumps now.
* It is declared generically using elf_fpregset_t (which is
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 95ae50d..f0946aa 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -719,28 +719,6 @@ asmlinkage void __attribute__((weak)) smp_threshold_interrupt(void)
}

/*
- * __math_state_restore assumes that cr0.TS is already clear and the
- * fpu state is all ready for use. Used during context switch.
- */
-void __math_state_restore(void)
-{
- struct thread_info *thread = current_thread_info();
- struct task_struct *tsk = thread->task;
-
- /*
- * Paranoid restore. send a SIGSEGV if we fail to restore the state.
- */
- if (unlikely(fpu_restore_checking(&tsk->thread.fpu))) {
- stts();
- force_sig(SIGSEGV, tsk);
- return;
- }
-
- thread->status |= TS_USEDFPU; /* So we fnsave on switch_to() */
- tsk->fpu_counter++;
-}
-
-/*
* 'math_state_restore()' saves the current math information in the
* old math state array, and gets the new ones from the current task
*
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index d35adf8..ca5812c 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -251,169 +251,6 @@ int save_xstates_sigframe(void __user *buf, unsigned int size)
return 1;
}

-#ifdef CONFIG_X86_64
-int save_i387_xstate(void __user *buf)
-{
- struct task_struct *tsk = current;
- int err = 0;
-
- if (!access_ok(VERIFY_WRITE, buf, sig_xstate_size))
- return -EACCES;
-
- BUG_ON(sig_xstate_size < xstate_size);
-
- if ((unsigned long)buf % 64)
- printk("save_i387_xstate: bad fpstate %p\n", buf);
-
- if (!used_math())
- return 0;
-
- if (task_thread_info(tsk)->status & TS_USEDFPU) {
- if (use_xsave())
- err = xsave_checking(buf);
- else
- err = fxsave_user(buf);
-
- if (err)
- return err;
- task_thread_info(tsk)->status &= ~TS_USEDFPU;
- stts();
- } else {
- sanitize_i387_state(tsk);
- if (__copy_to_user(buf, &tsk->thread.fpu.state->fxsave,
- xstate_size))
- return -1;
- }
-
- clear_used_math(); /* trigger finit */
-
- if (use_xsave()) {
- struct _fpstate __user *fx = buf;
- struct _xstate __user *x = buf;
- u64 xstate_bv;
-
- err = __copy_to_user(&fx->sw_reserved, &fx_sw_reserved,
- sizeof(struct _fpx_sw_bytes));
-
- err |= __put_user(FP_XSTATE_MAGIC2,
- (__u32 __user *) (buf + sig_xstate_size
- - FP_XSTATE_MAGIC2_SIZE));
-
- /*
- * Read the xstate_bv which we copied (directly from the cpu or
- * from the state in task struct) to the user buffers and
- * set the FP/SSE bits.
- */
- err |= __get_user(xstate_bv, &x->xstate_hdr.xstate_bv);
-
- /*
- * For legacy compatible, we always set FP/SSE bits in the bit
- * vector while saving the state to the user context. This will
- * enable us capturing any changes(during sigreturn) to
- * the FP/SSE bits by the legacy applications which don't touch
- * xstate_bv in the xsave header.
- *
- * xsave aware apps can change the xstate_bv in the xsave
- * header as well as change any contents in the memory layout.
- * xrestore as part of sigreturn will capture all the changes.
- */
- xstate_bv |= XSTATE_FPSSE;
-
- err |= __put_user(xstate_bv, &x->xstate_hdr.xstate_bv);
-
- if (err)
- return err;
- }
-
- return 1;
-}
-
-/*
- * Restore the extended state if present. Otherwise, restore the FP/SSE
- * state.
- */
-static int restore_user_xstate(void __user *buf)
-{
- struct _fpx_sw_bytes fx_sw_user;
- u64 mask;
- int err;
-
- if (((unsigned long)buf % 64) ||
- check_for_xstate(buf, sig_xstate_size, &fx_sw_user))
- goto fx_only;
-
- mask = fx_sw_user.xstate_bv;
-
- /*
- * restore the state passed by the user.
- */
- err = xrstor_checking((__force struct xsave_struct *)buf, mask);
- if (err)
- return err;
-
- /*
- * init the state skipped by the user.
- */
- mask = pcntxt_mask & ~mask;
- if (unlikely(mask))
- xrstor_state(init_xstate_buf, mask);
-
- return 0;
-
-fx_only:
- /*
- * couldn't find the extended state information in the
- * memory layout. Restore just the FP/SSE and init all
- * the other extended state.
- */
- xrstor_state(init_xstate_buf, pcntxt_mask & ~XSTATE_FPSSE);
- return fxrstor_checking((__force struct i387_fxsave_struct *)buf);
-}
-
-/*
- * This restores directly out of user space. Exceptions are handled.
- */
-int restore_i387_xstate(void __user *buf)
-{
- struct task_struct *tsk = current;
- int err = 0;
-
- if (!buf) {
- if (used_math())
- goto clear;
- return 0;
- } else
- if (!access_ok(VERIFY_READ, buf, sig_xstate_size))
- return -EACCES;
-
- if (!used_math()) {
- err = init_fpu(tsk);
- if (err)
- return err;
- }
-
- if (!(task_thread_info(current)->status & TS_USEDFPU)) {
- clts();
- task_thread_info(current)->status |= TS_USEDFPU;
- }
- if (use_xsave())
- err = restore_user_xstate(buf);
- else
- err = fxrstor_checking((__force struct i387_fxsave_struct *)
- buf);
- if (unlikely(err)) {
- /*
- * Encountered an error while doing the restore from the
- * user buffer, clear the fpu state.
- */
-clear:
- clear_fpu(tsk);
- clear_used_math();
- }
- return err;
-}
-#endif
-

int restore_xstates_sigframe(void __user *buf, unsigned int size)

{
#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:02 AM11/29/11

This patch set is a general cleanup and rework of the code related to
handling of FPU and other extended states. All handling of extended
states, including the FPU state, is now handled by xsave/xrstor wrappers
that fall back to fxsave/fxrstor, or even fsave/frstor, if hardware
support for those features is lacking. The code handling xstates in
signal frames has been unified and cleaned up.

The lazy allocation of the xstate area has been removed. The support for
extended states that cannot be saved/restored lazily, like AMD's LWP,
need this. Since optimized library functions using SSE etc. are widely
used today, most processes would have an xstate area anyway, making the
memory overhead negligible.

Changes since the last RFC:
* added patch to catch #NM exceptions caused by the kernel
* reordered the patches in a way that seemed more logical, with the
side-effect of reducing the size of some of the patches
* two bugfixes in the preallocation of the xstate area
* explicitly disable LWP in new tasks (required by the LWP spec)

These patches were built and tested against 3.1. The older RFC patches
that have been lingering in tip/x86/xsave for the last few months should
be removed.

Hans Rosenfeld (9):
x86, xsave: warn on #NM exceptions caused by the kernel
x86, xsave: cleanup fpu/xsave support
x86, xsave: cleanup fpu/xsave signal frame setup
x86, xsave: rework fpu/xsave support
x86, xsave: remove unused code
x86, xsave: more cleanups
x86, xsave: remove lazy allocation of xstate area
x86, xsave: add support for non-lazy xstates
x86, xsave: add kernel support for AMDs Lightweight Profiling (LWP)

arch/x86/ia32/ia32_signal.c | 4 +-
arch/x86/include/asm/i387.h | 251 ++++++++--------------------

arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/processor.h | 12 ++

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:02 AM11/29/11

Removed some declarations from headers that weren't used.

Retired TS_USEDFPU, it has been replaced by the XCNTXT_* bits in
xstate_mask.

There is no reason functions like fpu_fxsave() etc. need to know or
handle anything else than a buffer to save/restore their stuff to/from.

Sanitize_i387_state() is extra work that is only needed when xsaveopt is
used. There is no point in hiding this in an inline function, adding
extra code lines just to save a single if() in the five places it is
used. Also, it is obscuring a fact that might well be interesting to
whoever is reading the code, but it is not gaining anything.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/i387.h | 66 ++++++++++++-----------------------
arch/x86/include/asm/thread_info.h | 2 -
arch/x86/include/asm/xsave.h | 14 +++----
arch/x86/kernel/i387.c | 12 ++++--
arch/x86/kernel/xsave.c | 35 ++++++++-----------
arch/x86/kvm/vmx.c | 2 +-
arch/x86/kvm/x86.c | 4 +-
7 files changed, 55 insertions(+), 80 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 687e550..3474267 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -59,15 +59,10 @@ extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
*/
#define xstateregs_active fpregs_active

-extern struct _fpx_sw_bytes fx_sw_reserved;
extern unsigned int mxcsr_feature_mask;
+
#ifdef CONFIG_IA32_EMULATION
extern unsigned int sig_xstate_ia32_size;
-extern struct _fpx_sw_bytes fx_sw_reserved_ia32;
-struct _fpstate_ia32;
-struct _xstate_ia32;
-extern int save_i387_xstate_ia32(void __user *buf);
-extern int restore_i387_xstate_ia32(void __user *buf);
#endif

#ifdef CONFIG_MATH_EMULATION
@@ -75,7 +70,7 @@ extern int restore_i387_xstate_ia32(void __user *buf);
extern void finit_soft_fpu(struct i387_soft_struct *soft);
#else
# define HAVE_HWFP 1
-static inline void finit_soft_fpu(struct i387_soft_struct *soft) {}
+# define finit_soft_fpu(x)
#endif

#define X87_FSW_ES (1 << 7) /* Exception Summary */
@@ -95,15 +90,6 @@ static __always_inline __pure bool use_fxsr(void)
return static_cpu_has(X86_FEATURE_FXSR);
}

-extern void __sanitize_i387_state(struct task_struct *);
-
-static inline void sanitize_i387_state(struct task_struct *tsk)
-{
- if (!use_xsaveopt())
- return;
- __sanitize_i387_state(tsk);
-}
-
#ifdef CONFIG_X86_64

static inline void fxrstor(struct i387_fxsave_struct *fx)
{

@@ -117,7 +103,7 @@ static inline void fxrstor(struct i387_fxsave_struct *fx)
#endif
}

-static inline void fpu_fxsave(struct fpu *fpu)
+static inline void fpu_fxsave(struct i387_fxsave_struct *fx)

{
/* Using "rex64; fxsave %0" is broken because, if the memory operand

uses any extended registers for addressing, a second REX prefix
@@ -128,7 +114,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
/* Using "fxsaveq %0" would be the ideal choice, but is only supported
starting with gas 2.16. */
__asm__ __volatile__("fxsaveq %0"
- : "=m" (fpu->state->fxsave));
+ : "=m" (*fx));
#else
/* Using, as a workaround, the properly prefixed form below isn't
accepted by any binutils version so far released, complaining that
@@ -139,8 +125,8 @@ static inline void fpu_fxsave(struct fpu *fpu)
This, however, we can work around by forcing the compiler to select
an addressing mode that doesn't require extended registers. */
asm volatile("rex64/fxsave (%[fx])"
- : "=m" (fpu->state->fxsave)
- : [fx] "R" (&fpu->state->fxsave));
+ : "=m" (*fx)
+ : [fx] "R" (fx));
#endif
}

@@ -160,10 +146,10 @@ static inline void fxrstor(struct i387_fxsave_struct *fx)
"m" (*fx));
}

-static inline void fpu_fxsave(struct fpu *fpu)
+static inline void fpu_fxsave(struct i387_fxsave_struct *fx)
{
asm volatile("fxsave %[fx]"
- : [fx] "=m" (fpu->state->fxsave));
+ : [fx] "=m" (*fx));
}

#endif /* CONFIG_X86_64 */
@@ -180,25 +166,25 @@ static inline void fpu_fxsave(struct fpu *fpu)
/*
* These must be called with preempt disabled
*/
-static inline void fpu_restore(struct fpu *fpu)
+static inline void fpu_restore(struct i387_fxsave_struct *fx)
{
- fxrstor(&fpu->state->fxsave);
+ fxrstor(fx);
}

-static inline void fpu_save(struct fpu *fpu)
+static inline void fpu_save(struct i387_fxsave_struct *fx)
{
if (use_fxsr()) {
- fpu_fxsave(fpu);
+ fpu_fxsave(fx);
} else {
asm volatile("fsave %[fx]; fwait"

- : [fx] "=m" (fpu->state->fsave));

+ : [fx] "=m" (*fx));
}
}

-static inline void fpu_clean(struct fpu *fpu)
+static inline void fpu_clean(struct i387_fxsave_struct *fx)
{
u32 swd = (use_fxsr() || use_xsave()) ?
- fpu->state->fxsave.swd : fpu->state->fsave.swd;
+ fx->swd : ((struct i387_fsave_struct *)fx)->swd;

if (unlikely(swd & X87_FSW_ES))
asm volatile("fnclex");
@@ -214,19 +200,6 @@ static inline void fpu_clean(struct fpu *fpu)
[addr] "m" (safe_address));
}

-static inline void __clear_fpu(struct task_struct *tsk)

-{
- if (task_thread_info(tsk)->status & TS_USEDFPU) {

- /* Ignore delayed exceptions from user space */
- asm volatile("1: fwait\n"
- "2:\n"
- _ASM_EXTABLE(1b, 2b));

- task_thread_info(tsk)->status &= ~TS_USEDFPU;

- task_thread_info(tsk)->xstate_mask &= ~XCNTXT_LAZY;
- stts();
- }
-}
-
static inline void kernel_fpu_begin(void)
{
preempt_disable();
@@ -285,7 +258,14 @@ static inline void irq_ts_restore(int TS_state)

static inline void clear_fpu(struct task_struct *tsk)
{
preempt_disable();

- __clear_fpu(tsk);
+ if (task_thread_info(tsk)->xstate_mask & XCNTXT_LAZY) {
+ /* Ignore delayed exceptions from user space */
+ asm volatile("1: fwait\n"
+ "2:\n"
+ _ASM_EXTABLE(1b, 2b));
+ task_thread_info(tsk)->xstate_mask &= ~XCNTXT_LAZY;
+ stts();
+ }
preempt_enable();
}

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 6652c9b..02112a7 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -244,8 +244,6 @@ static inline struct thread_info *current_thread_info(void)
* ever touches our thread-synchronous status, so we don't
* have to worry about atomic accesses.
*/
-#define TS_USEDFPU 0x0001 /* FPU was used by this task
- this quantum (SMP) */
#define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
#define TS_POLLING 0x0004 /* idle task polling need_resched,
skip sending interrupt */
diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 8d5bb0e..12793b6 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -37,8 +37,8 @@ extern unsigned int xstate_size;
extern u64 pcntxt_mask;
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];

-extern void xsave(struct fpu *, u64);
-extern void xrstor(struct fpu *, u64);
+extern void xsave(struct xsave_struct *, u64);
+extern void xrstor(struct xsave_struct *, u64);
extern void __save_xstates(struct task_struct *);
extern void __restore_xstates(struct task_struct *, u64);
extern int save_xstates_sigframe(void __user *, unsigned int);
@@ -46,10 +46,7 @@ extern int restore_xstates_sigframe(void __user *, unsigned int);

extern void xsave_init(void);
extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);
-extern int init_fpu(struct task_struct *child);
-extern int check_for_xstate(struct i387_fxsave_struct __user *buf,
- unsigned int size,
- struct _fpx_sw_bytes *sw);
+extern void sanitize_i387_state(struct task_struct *);

static inline void save_xstates(struct task_struct *tsk)
{
@@ -85,7 +82,7 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask)
: "memory");
}

-static inline void fpu_xsave(struct xsave_struct *fx, u64 mask)
+static inline void xsaveopt_state(struct xsave_struct *fx, u64 mask)
{
u32 lmask = mask;

u32 hmask = mask >> 32;

@@ -96,7 +93,8 @@ static inline void fpu_xsave(struct xsave_struct *fx, u64 mask)
".byte " REX_PREFIX "0x0f,0xae,0x27",
".byte " REX_PREFIX "0x0f,0xae,0x37",
X86_FEATURE_XSAVEOPT,
- [fx] "D" (fx), "a" (lmask), "d" (hmask) :
+ "D" (fx), "a" (lmask), "d" (hmask) :
"memory");
}
+
#endif
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 49d23a5..a1f4bc0 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -179,7 +179,8 @@ int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
if (ret)
return ret;

- sanitize_i387_state(target);
+ if (use_xsaveopt())
+ sanitize_i387_state(target);

return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
&target->thread.fpu.state->fxsave, 0, -1);
@@ -198,7 +199,8 @@ int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
if (ret)
return ret;

- sanitize_i387_state(target);
+ if (use_xsaveopt())
+ sanitize_i387_state(target);

ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
&target->thread.fpu.state->fxsave, 0, -1);
@@ -437,7 +439,8 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
-1);
}

- sanitize_i387_state(target);
+ if (use_xsaveopt())
+ sanitize_i387_state(target);

if (kbuf && pos == 0 && count == sizeof(env)) {
convert_from_fxsr(kbuf, target);
@@ -460,7 +463,8 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
if (ret)
return ret;

- sanitize_i387_state(target);
+ if (use_xsaveopt())
+ sanitize_i387_state(target);

if (!HAVE_HWFP)
return fpregs_soft_set(target, regset, pos, count, kbuf, ubuf);
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index ca5812c..9d95d2f 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -39,7 +39,7 @@ static unsigned int *xstate_offsets, *xstate_sizes, xstate_features;
* that the user doesn't see some stale state in the memory layout during
* signal handling, debugging etc.
*/
-void __sanitize_i387_state(struct task_struct *tsk)
+void sanitize_i387_state(struct task_struct *tsk)
{
u64 xstate_bv;
int feature_bit = 0x2;
@@ -49,7 +49,6 @@ void __sanitize_i387_state(struct task_struct *tsk)
return;

BUG_ON(task_thread_info(tsk)->xstate_mask & XCNTXT_LAZY);
- BUG_ON(task_thread_info(tsk)->status & TS_USEDFPU);

xstate_bv = tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv;

@@ -104,8 +103,8 @@ void __sanitize_i387_state(struct task_struct *tsk)
* Check for the presence of extended state information in the
* user fpstate pointer in the sigcontext.
*/
-int check_for_xstate(struct i387_fxsave_struct __user *buf, unsigned int size,
- struct _fpx_sw_bytes *fx_sw_user)
+static int check_for_xstate(struct i387_fxsave_struct __user *buf, unsigned int size,
+ struct _fpx_sw_bytes *fx_sw_user)
{
int min_xstate_size = sizeof(struct i387_fxsave_struct) +
sizeof(struct xsave_hdr_struct);
@@ -178,7 +177,8 @@ int save_xstates_sigframe(void __user *buf, unsigned int size)

tsk->fpu_counter = 0;
save_xstates(tsk);
- sanitize_i387_state(tsk);
+ if (use_xsaveopt())
+ sanitize_i387_state(tsk);

#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
if (ia32) {
@@ -299,7 +299,6 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)

preempt_disable();
stts();

- task_thread_info(tsk)->status &= ~TS_USEDFPU;

task_thread_info(tsk)->xstate_mask = 0;
preempt_enable();

@@ -505,17 +504,17 @@ void __cpuinit xsave_init(void)
this_func();
}

-void xsave(struct fpu *fpu, u64 mask)
+void xsave(struct xsave_struct *x, u64 mask)
{
clts();

if (use_xsave())
- fpu_xsave(&fpu->state->xsave, mask);
+ xsaveopt_state(x, mask);
else if (mask & XCNTXT_LAZY)
- fpu_save(fpu);
+ fpu_save(&x->i387);

if (mask & XCNTXT_LAZY)
- fpu_clean(fpu);
+ fpu_clean(&x->i387);

stts();
}
@@ -528,7 +527,7 @@ void __save_xstates(struct task_struct *tsk)
if (!fpu_allocated(&tsk->thread.fpu))
return;

- xsave(&tsk->thread.fpu, ti->xstate_mask);
+ xsave(&tsk->thread.fpu.state->xsave, ti->xstate_mask);

if (!(ti->xstate_mask & XCNTXT_LAZY))
tsk->fpu_counter = 0;
@@ -540,19 +539,17 @@ void __save_xstates(struct task_struct *tsk)
*/
if (tsk->fpu_counter < 5)
ti->xstate_mask &= ~XCNTXT_LAZY;
-
- ti->status &= ~TS_USEDFPU;
}
EXPORT_SYMBOL(save_xstates);

-void xrstor(struct fpu *fpu, u64 mask)
+void xrstor(struct xsave_struct *x, u64 mask)
{
clts();

if (use_xsave())
- xrstor_state(&fpu->state->xsave, mask);
+ xrstor_state(x, mask);
else if (mask & XCNTXT_LAZY)
- fpu_restore(fpu);
+ fpu_restore(&x->i387);

if (!(mask & XCNTXT_LAZY))
stts();
@@ -566,12 +563,10 @@ void __restore_xstates(struct task_struct *tsk, u64 mask)
if (!fpu_allocated(&tsk->thread.fpu))
return;

- xrstor(&tsk->thread.fpu, mask);
+ xrstor(&tsk->thread.fpu.state->xsave, mask);

ti->xstate_mask |= mask;
- if (mask & XCNTXT_LAZY) {
- ti->status |= TS_USEDFPU;
+ if (mask & XCNTXT_LAZY)
tsk->fpu_counter++;
- }
}
EXPORT_SYMBOL(__restore_xstates);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e65a158..e41385f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1402,7 +1402,7 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
#ifdef CONFIG_X86_64
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
#endif
- if (current_thread_info()->status & TS_USEDFPU)
+ if (current_thread_info()->xstate_mask & XCNTXT_LAZY)
clts();
load_gdt(&__get_cpu_var(host_gdt));
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d14fd8c..36bc63d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6259,7 +6259,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
kvm_put_guest_xcr0(vcpu);
vcpu->guest_fpu_loaded = 1;
save_xstates(current);
- xrstor(&vcpu->arch.guest_fpu, -1);
+ xrstor(&vcpu->arch.guest_fpu.state->xsave, -1);
trace_kvm_fpu(1);
}

@@ -6271,7 +6271,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
return;

vcpu->guest_fpu_loaded = 0;
- xsave(&vcpu->arch.guest_fpu, -1);
+ xsave(&vcpu->arch.guest_fpu.state->xsave, -1);
++vcpu->stat.fpu_reload;
kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
trace_kvm_fpu(0);

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:02 AM11/29/11

This is a complete rework of the code that handles FPU and related
extended states. Since FPU, XMM and YMM states are just variants of what
xsave handles, all of the old FPU-specific state handling code will be
hidden behind a set of functions that resemble xsave and xrstor. For
hardware that does not support xsave, the code falls back to
fxsave/fxrstor or even fsave/frstor.

A xstate_mask member will be added to the thread_info structure that
will control which states are to be saved by xsave. It is set to include
all "lazy" states (that is, all states currently supported: FPU, XMM and
YMM) by the #NM handler when a lazy restore is triggered or by
switch_to() when the tasks FPU context is preloaded. Xstate_mask is
intended to completely replace TS_USEDFPU in a later cleanup patch.

v2:
Add a wrapper function for save_xstates/restore_xstates to avoid
explicitly wrapping it in preempt_disable/preempt_enable everywhere.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/i387.h | 69 ++++++++++++++++++++++++++---
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/include/asm/xsave.h | 27 ++++++++++-
arch/x86/kernel/i387.c | 2 +-
arch/x86/kernel/process_32.c | 25 +++--------
arch/x86/kernel/process_64.c | 24 ++--------
arch/x86/kernel/traps.c | 8 ++--
arch/x86/kernel/xsave.c | 86 ++++++++++++++++++++++++++++++++++--
arch/x86/kvm/x86.c | 7 ++-
drivers/lguest/x86/core.c | 2 +-
10 files changed, 194 insertions(+), 58 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index f0c605d..daab77f 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -137,6 +137,18 @@ static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
return err;
}

+static inline void fxrstor(struct i387_fxsave_struct *fx)
+{
+ /* See comment in fxsave() below. */
+#ifdef CONFIG_AS_FXSAVEQ
+ asm volatile("fxrstorq %[fx]\n\t"
+ : : [fx] "m" (*fx));
+#else
+ asm volatile("rex64/fxrstor (%[fx])\n\t"
+ : : [fx] "R" (fx), "m" (*fx));
+#endif
+}
+

static inline int fxsave_user(struct i387_fxsave_struct __user *fx)

{
int err;
@@ -224,6 +236,19 @@ static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
return 0;
}

+static inline void fxrstor(struct i387_fxsave_struct *fx)
+{
+ /*
+ * The "nop" is needed to make the instructions the same
+ * length.
+ */
+ alternative_input(
+ "nop ; frstor %1",
+ "fxrstor %1",
+ X86_FEATURE_FXSR,
+ "m" (*fx));
+}
+

static inline void fpu_fxsave(struct fpu *fpu)

{
asm volatile("fxsave %[fx]"

@@ -244,12 +269,46 @@ static inline void fpu_fxsave(struct fpu *fpu)

/*
* These must be called with preempt disabled
*/

+static inline void fpu_restore(struct fpu *fpu)
+{
+ fxrstor(&fpu->state->fxsave);
+}
+
+static inline void fpu_save(struct fpu *fpu)
+{
+ if (use_fxsr()) {
+ fpu_fxsave(fpu);
+ } else {
+ asm volatile("fsave %[fx]; fwait"
+ : [fx] "=m" (fpu->state->fsave));
+ }
+}
+
+static inline void fpu_clean(struct fpu *fpu)
+{
+ u32 swd = (use_fxsr() || use_xsave()) ?
+ fpu->state->fxsave.swd : fpu->state->fsave.swd;
+
+ if (unlikely(swd & X87_FSW_ES))
+ asm volatile("fnclex");
+
+ /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
+ is pending. Clear the x87 state here by setting it to fixed
+ values. safe_address is a random variable that should be in L1 */
+ alternative_input(
+ ASM_NOP8 ASM_NOP2,
+ "emms\n\t" /* clear stack tags */
+ "fildl %P[addr]", /* set F?P to defined value */
+ X86_FEATURE_FXSAVE_LEAK,
+ [addr] "m" (safe_address));
+}
+

static inline void fpu_save_init(struct fpu *fpu)

{
if (use_xsave()) {

struct xsave_struct *xstate = &fpu->state->xsave;

- fpu_xsave(xstate);
+ fpu_xsave(xstate, -1);

/*

* xsave header may indicate the init state of the FP.

@@ -315,18 +374,16 @@ static inline void __clear_fpu(struct task_struct *tsk)
"2:\n"
_ASM_EXTABLE(1b, 2b));
task_thread_info(tsk)->status &= ~TS_USEDFPU;

+ task_thread_info(tsk)->xstate_mask &= ~XCNTXT_LAZY;

stts();
}
}

static inline void kernel_fpu_begin(void)
{
- struct thread_info *me = current_thread_info();
preempt_disable();
- if (me->status & TS_USEDFPU)
- __save_init_fpu(me->task);
- else
- clts();
+ __save_xstates(current_thread_info()->task);
+ clts();
}

static inline void kernel_fpu_end(void)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index a1fe5c1..6652c9b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -26,6 +26,7 @@ struct exec_domain;
struct thread_info {
struct task_struct *task; /* main task structure */
struct exec_domain *exec_domain; /* execution domain */
+ __u64 xstate_mask; /* xstates in use */
__u32 flags; /* low level flags */
__u32 status; /* thread synchronous flags */
__u32 cpu; /* current CPU */
@@ -47,6 +48,7 @@ struct thread_info {
{ \
.task = &tsk, \
.exec_domain = &default_exec_domain, \
+ .xstate_mask = 0, \
.flags = 0, \
.cpu = 0, \
.preempt_count = INIT_PREEMPT_COUNT, \
diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 8632ddb..42d02b9 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -25,6 +25,8 @@
*/

#define XCNTXT_MASK (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)

+#define XCNTXT_LAZY XCNTXT_MASK
+

#ifdef CONFIG_X86_64
#define REX_PREFIX "0x48, "

#else
@@ -35,6 +37,10 @@ extern unsigned int xstate_size;

extern u64 pcntxt_mask;
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];

+extern void xsave(struct fpu *, u64);
+extern void xrstor(struct fpu *, u64);
+extern void __save_xstates(struct task_struct *);
+extern void __restore_xstates(struct task_struct *, u64);

extern int save_xstates_sigframe(void __user *, unsigned int);

extern int restore_xstates_sigframe(void __user *, unsigned int);

@@ -45,6 +51,20 @@ extern int check_for_xstate(struct i387_fxsave_struct __user *buf,
unsigned int size,
struct _fpx_sw_bytes *sw);

+static inline void save_xstates(struct task_struct *tsk)
+{
+ preempt_disable();
+ __save_xstates(tsk);
+ preempt_enable();
+}
+
+static inline void restore_xstates(struct task_struct *tsk, u64 mask)
+{
+ preempt_disable();
+ __restore_xstates(tsk, mask);
+ preempt_enable();
+}
+

static inline int xrstor_checking(struct xsave_struct *fx, u64 mask)

{
int err;
@@ -116,15 +136,18 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask)
: "memory");
}

-static inline void fpu_xsave(struct xsave_struct *fx)
+static inline void fpu_xsave(struct xsave_struct *fx, u64 mask)
{
+ u32 lmask = mask;
+ u32 hmask = mask >> 32;
+
/* This, however, we can work around by forcing the compiler to select

an addressing mode that doesn't require extended registers. */

alternative_input(

".byte " REX_PREFIX "0x0f,0xae,0x27",
".byte " REX_PREFIX "0x0f,0xae,0x37",
X86_FEATURE_XSAVEOPT,

- [fx] "D" (fx), "a" (-1), "d" (-1) :
+ [fx] "D" (fx), "a" (lmask), "d" (hmask) :
"memory");

}
#endif
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c

index ae799e2..e1b8a42 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -133,7 +133,7 @@ int init_fpu(struct task_struct *tsk)

if (tsk_used_math(tsk)) {
if (HAVE_HWFP && tsk == current)
- unlazy_fpu(tsk);
+ save_xstates(tsk);
return 0;
}

diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 7a3b651..22d2bac 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -187,7 +187,7 @@ void release_thread(struct task_struct *dead_task)
*/
void prepare_to_copy(struct task_struct *tsk)
{
- unlazy_fpu(tsk);
+ save_xstates(tsk);
}

int copy_thread(unsigned long clone_flags, unsigned long sp,
@@ -295,21 +295,13 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
*next = &next_p->thread;
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(init_tss, cpu);
- bool preload_fpu;

/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

- /*
- * If the task has used fpu the last 5 timeslices, just do a full
- * restore of the math state immediately to avoid the trap; the
- * chances of needing FPU soon are obviously high now
- */
- preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
-
- __unlazy_fpu(prev_p);
+ __save_xstates(prev_p);

/* we're going to use this soon, after a few expensive things */
- if (preload_fpu)
+ if (task_thread_info(next_p)->xstate_mask)
prefetch(next->fpu.state);

/*
@@ -350,11 +342,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
__switch_to_xtra(prev_p, next_p, tss);

- /* If we're going to preload the fpu context, make sure clts
- is run while we're batching the cpu state updates. */
- if (preload_fpu)
- clts();
-
/*
* Leave lazy mode, flushing any hypercalls made here.
* This must be done before restoring TLS segments so
@@ -364,8 +351,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
*/
arch_end_context_switch(next_p);

- if (preload_fpu)
- __math_state_restore();
+ /*
+ * Restore enabled extended states for the task.
+ */
+ __restore_xstates(next_p, task_thread_info(next_p)->xstate_mask);

/*
* Restore %gs if needed (which is common)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index f693e44..2d1745c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -251,7 +251,7 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls)
*/
void prepare_to_copy(struct task_struct *tsk)
{
- unlazy_fpu(tsk);
+ save_xstates(tsk);
}

int copy_thread(unsigned long clone_flags, unsigned long sp,
@@ -379,17 +379,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(init_tss, cpu);
unsigned fsindex, gsindex;
- bool preload_fpu;
-
- /*
- * If the task has used fpu the last 5 timeslices, just do a full
- * restore of the math state immediately to avoid the trap; the
- * chances of needing FPU soon are obviously high now
- */
- preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;

/* we're going to use this soon, after a few expensive things */
- if (preload_fpu)
+ if (task_thread_info(next_p)->xstate_mask)
prefetch(next->fpu.state);

/*
@@ -421,11 +413,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
load_TLS(next, cpu);

/* Must be after DS reload */
- __unlazy_fpu(prev_p);
-
- /* Make sure cpu is ready for new context */
- if (preload_fpu)
- clts();
+ __save_xstates(prev_p);

/*
* Leave lazy mode, flushing any hypercalls made here.
@@ -486,11 +474,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
__switch_to_xtra(prev_p, next_p, tss);

/*
- * Preload the FPU context, now that we've determined that the
- * task is likely to be using it.
+ * Restore enabled extended states for the task.
*/
- if (preload_fpu)
- __math_state_restore();
+ __restore_xstates(next_p, task_thread_info(next_p)->xstate_mask);

return prev_p;
}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8e924d1..95ae50d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -625,7 +625,9 @@ void math_error(struct pt_regs *regs, int error_code, int trapnr)
/*
* Save the info for the exception handler and clear the error.
*/
- save_init_fpu(task);
+ task->fpu_counter = 0;
+ save_xstates(task);
+
task->thread.trap_no = trapnr;
task->thread.error_code = error_code;
info.si_signo = SIGFPE;
@@ -768,9 +770,7 @@ asmlinkage void math_state_restore(void)
local_irq_disable();
}

- clts(); /* Allow maths ops (or we recurse) */
-
- __math_state_restore();
+ __restore_xstates(tsk, XCNTXT_LAZY);
}
EXPORT_SYMBOL_GPL(math_state_restore);

diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 349e835..d35adf8 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -5,6 +5,7 @@
*/
#include <linux/bootmem.h>
#include <linux/compat.h>
+#include <linux/module.h>
#include <asm/i387.h>
#ifdef CONFIG_IA32_EMULATION
#include <asm/sigcontext32.h>
@@ -47,6 +48,7 @@ void __sanitize_i387_state(struct task_struct *tsk)
if (!fx)
return;

+ BUG_ON(task_thread_info(tsk)->xstate_mask & XCNTXT_LAZY);

BUG_ON(task_thread_info(tsk)->status & TS_USEDFPU);

xstate_bv = tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv;

@@ -174,7 +176,8 @@ int save_xstates_sigframe(void __user *buf, unsigned int size)
sizeof(struct user_i387_ia32_struct), NULL,
(struct _fpstate_ia32 __user *) buf) ? -1 : 1;

- unlazy_fpu(tsk);
+ tsk->fpu_counter = 0;
+ save_xstates(tsk);
sanitize_i387_state(tsk);

#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
@@ -421,6 +424,7 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)

struct task_struct *tsk = current;

struct _fpstate_ia32 __user *fp = buf;
struct xsave_struct *xsave;
+ u64 xstate_mask = 0;
int err;

if (!buf) {
@@ -459,6 +463,7 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)
preempt_disable();
stts();
task_thread_info(tsk)->status &= ~TS_USEDFPU;
+ task_thread_info(tsk)->xstate_mask = 0;
preempt_enable();

xsave = &tsk->thread.fpu.state->xsave;
@@ -479,8 +484,12 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)
else
*xstate_bv &= pcntxt_mask & fx_sw_user.xstate_bv;

+ xstate_mask |= *xstate_bv;
+
xsave->xsave_hdr.reserved1[0] =
xsave->xsave_hdr.reserved1[1] = 0;
+ } else {
+ xstate_mask |= XCNTXT_LAZY;
}

if (use_xsave() || use_fxsr()) {
@@ -492,10 +501,8 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)
#endif
}

- preempt_disable();
set_used_math();
- math_state_restore();
- preempt_enable();
+ restore_xstates(tsk, xstate_mask);

return 0;
}
@@ -660,3 +667,74 @@ void __cpuinit xsave_init(void)
next_func = xstate_enable;
this_func();
}
+
+void xsave(struct fpu *fpu, u64 mask)
+{
+ clts();
+
+ if (use_xsave())
+ fpu_xsave(&fpu->state->xsave, mask);
+ else if (mask & XCNTXT_LAZY)
+ fpu_save(fpu);
+

+ if (mask & XCNTXT_LAZY)

+ fpu_clean(fpu);
+
+ stts();
+}
+EXPORT_SYMBOL(xsave);
+
+void __save_xstates(struct task_struct *tsk)
+{
+ struct thread_info *ti = task_thread_info(tsk);
+
+ if (!fpu_allocated(&tsk->thread.fpu))
+ return;
+
+ xsave(&tsk->thread.fpu, ti->xstate_mask);
+
+ if (!(ti->xstate_mask & XCNTXT_LAZY))
+ tsk->fpu_counter = 0;
+
+ /*
+ * If the task hasn't used the fpu the last 5 timeslices,
+ * force a lazy restore of the math states by clearing them
+ * from xstate_mask.
+ */
+ if (tsk->fpu_counter < 5)
+ ti->xstate_mask &= ~XCNTXT_LAZY;
+
+ ti->status &= ~TS_USEDFPU;
+}
+EXPORT_SYMBOL(save_xstates);
+
+void xrstor(struct fpu *fpu, u64 mask)
+{
+ clts();
+
+ if (use_xsave())
+ xrstor_state(&fpu->state->xsave, mask);
+ else if (mask & XCNTXT_LAZY)
+ fpu_restore(fpu);
+
+ if (!(mask & XCNTXT_LAZY))
+ stts();
+}
+EXPORT_SYMBOL(xrstor);
+
+void __restore_xstates(struct task_struct *tsk, u64 mask)
+{
+ struct thread_info *ti = task_thread_info(tsk);
+
+ if (!fpu_allocated(&tsk->thread.fpu))
+ return;
+
+ xrstor(&tsk->thread.fpu, mask);
+
+ ti->xstate_mask |= mask;
+ if (mask & XCNTXT_LAZY) {
+ ti->status |= TS_USEDFPU;
+ tsk->fpu_counter++;
+ }
+}
+EXPORT_SYMBOL(__restore_xstates);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 84a28ea..d14fd8c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -58,6 +58,7 @@
#include <asm/xcr.h>
#include <asm/pvclock.h>
#include <asm/div64.h>
+#include <asm/xsave.h>

#define MAX_IO_MSRS 256
#define KVM_MAX_MCE_BANKS 32
@@ -6257,8 +6258,8 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
*/

kvm_put_guest_xcr0(vcpu);
vcpu->guest_fpu_loaded = 1;

- unlazy_fpu(current);
- fpu_restore_checking(&vcpu->arch.guest_fpu);
+ save_xstates(current);
+ xrstor(&vcpu->arch.guest_fpu, -1);
trace_kvm_fpu(1);
}

@@ -6270,7 +6271,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)

return;

vcpu->guest_fpu_loaded = 0;

- fpu_save_init(&vcpu->arch.guest_fpu);
+ xsave(&vcpu->arch.guest_fpu, -1);

++vcpu->stat.fpu_reload;
kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
trace_kvm_fpu(0);

diff --git a/drivers/lguest/x86/core.c b/drivers/lguest/x86/core.c
index 65af42f..6d617f1 100644
--- a/drivers/lguest/x86/core.c
+++ b/drivers/lguest/x86/core.c
@@ -204,7 +204,7 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
* uses the FPU.
*/
if (cpu->ts)
- unlazy_fpu(current);
+ save_xstates(current);

/*
* SYSENTER is an optimized way of doing system calls. We can't allow

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:02 AM11/29/11

This patch removes lazy allocation of the xstate area. All user tasks
will always have an xstate area preallocated. The size of the xsave area
ranges from 112 to 960 bytes, depending on the xstates present and
enabled. Since it is common to use SSE etc. for optimization, the actual
overhead is expected to negligible.

This greatly simplifies some code paths by removing the allocation code
from init_fpu(), the check for presence of the xstate area or the need
to check the init_fpu() return value.

v2:
A single static xsave area just for init is not enough, since there are
more user processes that are directly spawned by kernel threads. Add a
call to a new arch-specific function to flush_old_exec(), which will in
turn call fpu_alloc() to allocate the xsave area if necessary.

v3:
The new xsave area has to be cleared to avoid xrstor errors.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/i387.h | 9 +++++++-
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/kernel/i387.c | 41 ++++++++----------------------------
arch/x86/kernel/process.c | 14 ++++++++++++
arch/x86/kernel/process_32.c | 4 +-
arch/x86/kernel/process_64.c | 4 +-
arch/x86/kernel/traps.c | 16 +------------
arch/x86/kernel/xsave.c | 8 +-----
arch/x86/kvm/x86.c | 4 +-
arch/x86/math-emu/fpu_entry.c | 8 +-----
fs/exec.c | 8 +++++++
11 files changed, 53 insertions(+), 65 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 3474267..6efe38a 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -40,7 +40,7 @@
extern unsigned int sig_xstate_size;

extern void fpu_init(void);
extern void mxcsr_feature_mask_init(void);

-extern int init_fpu(struct task_struct *child);

+extern void init_fpu(struct task_struct *child);
extern asmlinkage void math_state_restore(void);

extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

@@ -330,6 +330,13 @@ static inline void fpu_copy(struct fpu *dst, struct fpu *src)

extern void fpu_finit(struct fpu *fpu);

+static inline void fpu_clear(struct fpu *fpu)
+{
+ memset(fpu->state, 0, xstate_size);
+ fpu_finit(fpu);
+ set_used_math();
+}
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_I387_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 02112a7..b886a47 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -265,6 +265,8 @@ static inline void set_restore_sigmask(void)
extern void arch_task_cache_init(void);
extern void free_thread_info(struct thread_info *ti);
extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
+extern int arch_prealloc_fpu(struct task_struct *tsk);
+#define arch_prealloc_fpu arch_prealloc_fpu
#define arch_task_cache_init arch_task_cache_init
#endif
#endif /* _ASM_X86_THREAD_INFO_H */
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index a1f4bc0..7f0004c 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -127,27 +127,19 @@ EXPORT_SYMBOL_GPL(fpu_finit);
* value at reset if we support XMM instructions and then
* remember the current task has used the FPU.
*/
-int init_fpu(struct task_struct *tsk)
+void init_fpu(struct task_struct *tsk)
{
- int ret;
+ BUG_ON(tsk->flags & PF_KTHREAD);

if (tsk_used_math(tsk)) {
if (HAVE_HWFP && tsk == current)

save_xstates(tsk);
- return 0;
+ return;
}

- /*
- * Memory allocation at the first usage of the FPU and other state.
- */
- ret = fpu_alloc(&tsk->thread.fpu);
- if (ret)
- return ret;
-
fpu_finit(&tsk->thread.fpu);

set_stopped_child_used_math(tsk);
- return 0;
}
EXPORT_SYMBOL_GPL(init_fpu);

@@ -170,14 +162,10 @@ int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
unsigned int pos, unsigned int count,
void *kbuf, void __user *ubuf)
{
- int ret;
-
if (!cpu_has_fxsr)
return -ENODEV;

- ret = init_fpu(target);
- if (ret)
- return ret;
+ init_fpu(target);

if (use_xsaveopt())
sanitize_i387_state(target);
@@ -195,9 +183,7 @@ int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
if (!cpu_has_fxsr)
return -ENODEV;

- ret = init_fpu(target);
- if (ret)
- return ret;
+ init_fpu(target);

if (use_xsaveopt())
sanitize_i387_state(target);
@@ -229,9 +215,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
if (!cpu_has_xsave)
return -ENODEV;

- ret = init_fpu(target);
- if (ret)
- return ret;
+ init_fpu(target);

/*
* Copy the 48bytes defined by the software first into the xstate
@@ -259,9 +243,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
if (!cpu_has_xsave)
return -ENODEV;

- ret = init_fpu(target);
- if (ret)
- return ret;
+ init_fpu(target);

ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,

&target->thread.fpu.state->xsave, 0, -1);
@@ -424,11 +406,8 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
void *kbuf, void __user *ubuf)
{
struct user_i387_ia32_struct env;
- int ret;

- ret = init_fpu(target);
- if (ret)
- return ret;
+ init_fpu(target);

if (!HAVE_HWFP)
return fpregs_soft_get(target, regset, pos, count, kbuf, ubuf);
@@ -459,9 +438,7 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
struct user_i387_ia32_struct env;
int ret;

- ret = init_fpu(target);
- if (ret)
- return ret;
+ init_fpu(target);

if (use_xsaveopt())
sanitize_i387_state(target);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e7e3b01..20c4e1c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -26,6 +26,20 @@
struct kmem_cache *task_xstate_cachep;
EXPORT_SYMBOL_GPL(task_xstate_cachep);

+int arch_prealloc_fpu(struct task_struct *tsk)

+{
+ if (!fpu_allocated(&tsk->thread.fpu)) {

+ int err = fpu_alloc(&tsk->thread.fpu);
+
+ if (err)
+ return err;
+
+ fpu_clear(&tsk->thread.fpu);
+ }
+
+ return 0;
+}
+
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
int ret;
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 22d2bac..22d46f6 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -254,9 +254,9 @@ start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
regs->ip = new_ip;
regs->sp = new_sp;
/*
- * Free the old FP and other extended state
+ * Clear the old FP and other extended state
*/
- free_thread_xstate(current);
+ fpu_clear(&current->thread.fpu);
}
EXPORT_SYMBOL_GPL(start_thread);

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 2d1745c..436ed82 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -341,9 +341,9 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
regs->ss = _ss;
regs->flags = X86_EFLAGS_IF;
/*
- * Free the old FP and other extended state
+ * Clear the old FP and other extended state
*/
- free_thread_xstate(current);
+ fpu_clear(&current->thread.fpu);
}

void
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f0946aa..1f04c70 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -733,20 +733,8 @@ asmlinkage void math_state_restore(void)

struct thread_info *thread = current_thread_info();

struct task_struct *tsk = thread->task;

- if (!tsk_used_math(tsk)) {
- local_irq_enable();
- /*
- * does a slab alloc which can sleep
- */
- if (init_fpu(tsk)) {
- /*
- * ran out of memory!
- */
- do_group_exit(SIGKILL);
- return;
- }
- local_irq_disable();
- }
+ if (!tsk_used_math(tsk))
+ init_fpu(tsk);

__restore_xstates(tsk, XCNTXT_LAZY);
}
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 9d95d2f..1a11291 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -262,7 +262,6 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)

struct _fpstate_ia32 __user *fp = buf;
struct xsave_struct *xsave;

u64 xstate_mask = 0;
- int err;

if (!buf) {
if (used_math()) {
@@ -275,11 +274,8 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)
if (!access_ok(VERIFY_READ, buf, size))
return -EACCES;

- if (!used_math()) {
- err = init_fpu(tsk);
- if (err)
- return err;
- }

+ if (!used_math())
+ init_fpu(tsk);

if (!HAVE_HWFP) {
set_used_math();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 36bc63d..b1abb1a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5836,8 +5836,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
int r;
sigset_t sigsaved;

- if (!tsk_used_math(current) && init_fpu(current))
- return -ENOMEM;
+ if (!tsk_used_math(current))
+ init_fpu(current);

if (vcpu->sigset_active)
sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
diff --git a/arch/x86/math-emu/fpu_entry.c b/arch/x86/math-emu/fpu_entry.c
index 7718541..472e2b9 100644
--- a/arch/x86/math-emu/fpu_entry.c
+++ b/arch/x86/math-emu/fpu_entry.c
@@ -147,12 +147,8 @@ void math_emulate(struct math_emu_info *info)
unsigned long code_limit = 0; /* Initialized to stop compiler warnings */
struct desc_struct code_descriptor;

- if (!used_math()) {
- if (init_fpu(current)) {
- do_group_exit(SIGKILL);
- return;
- }
- }
+ if (!used_math())
+ init_fpu(current);

#ifdef RE_ENTRANT_CHECKING
if (emulating) {
diff --git a/fs/exec.c b/fs/exec.c
index 25dcbe5..af33562 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1071,10 +1071,18 @@ void set_task_comm(struct task_struct *tsk, char *buf)
perf_event_comm(tsk);
}

+#if !defined(arch_prealloc_fpu)
+#define arch_prealloc_fpu(tsk) (0)
+#endif
+
int flush_old_exec(struct linux_binprm * bprm)
{
int retval;

+ retval = arch_prealloc_fpu(current);
+ if (retval)
+ goto out;
+
/*
* Make sure we have a private signal table and that
* we are unassociated from the previous thread group.

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:02 AM11/29/11

There are currently two code paths that handle the fpu/xsave context in
a signal frame for 32bit and 64bit tasks. These two code paths differ
only in that they have or lack certain micro-optimizations or do some
additional work (fsave compatibility for 32bit). The code is complex,
mostly duplicate and hard to understand and maintain.

This patch creates a set of two new, unified and cleaned up functions to
replace them. Besides avoiding the duplicate code, it is now obvious
what is done in which situations. The micro-optimization w.r.t xsave
(saving and restoring directly from the user buffer) is gone, and with
it the headaches it caused about validating the buffer alignment and
contents and catching possible xsave/xrstor faults.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/ia32/ia32_signal.c | 4 +-
arch/x86/include/asm/i387.h | 20 ++++
arch/x86/include/asm/xsave.h | 5 +-
arch/x86/kernel/i387.c | 32 ++-----
arch/x86/kernel/signal.c | 4 +-
arch/x86/kernel/xsave.c | 198 ++++++++++++++++++++++++++++++++++++++++--
6 files changed, 227 insertions(+), 36 deletions(-)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 6557769..e641379 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -257,7 +257,7 @@ static int ia32_restore_sigcontext(struct pt_regs *regs,

get_user_ex(tmp, &sc->fpstate);
buf = compat_ptr(tmp);
- err |= restore_i387_xstate_ia32(buf);
+ err |= restore_xstates_sigframe(buf, sig_xstate_ia32_size);

get_user_ex(*pax, &sc->ax);
} get_user_catch(err);
@@ -392,7 +392,7 @@ static void __user *get_sigframe(struct k_sigaction *ka, struct pt_regs *regs,
if (used_math()) {
sp = sp - sig_xstate_ia32_size;
*fpstate = (struct _fpstate_ia32 *) sp;
- if (save_i387_xstate_ia32(*fpstate) < 0)
+ if (save_xstates_sigframe(*fpstate, sig_xstate_ia32_size) < 0)
return (void __user *) -1L;
}

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index a1c0d38..f0c605d 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -25,6 +25,20 @@
#include <asm/uaccess.h>
#include <asm/xsave.h>

+#ifdef CONFIG_X86_64
+# include <asm/sigcontext32.h>
+# include <asm/user32.h>
+#else
+# define save_i387_xstate_ia32 save_i387_xstate
+# define restore_i387_xstate_ia32 restore_i387_xstate
+# define _fpstate_ia32 _fpstate
+# define _xstate_ia32 _xstate
+# define sig_xstate_ia32_size sig_xstate_size
+# define fx_sw_reserved_ia32 fx_sw_reserved
+# define user_i387_ia32_struct user_i387_struct
+# define user32_fxsr_struct user_fxsr_struct
+#endif
+

extern unsigned int sig_xstate_size;
extern void fpu_init(void);
extern void mxcsr_feature_mask_init(void);

@@ -33,6 +47,9 @@ extern asmlinkage void math_state_restore(void);
extern void __math_state_restore(void);

extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

+extern void convert_from_fxsr(struct user_i387_ia32_struct *, struct task_struct *);
+extern void convert_to_fxsr(struct task_struct *, const struct user_i387_ia32_struct *);
+
extern user_regset_active_fn fpregs_active, xfpregs_active;
extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
xstateregs_get;
@@ -46,6 +63,7 @@ extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
#define xstateregs_active fpregs_active

extern struct _fpx_sw_bytes fx_sw_reserved;
+extern unsigned int mxcsr_feature_mask;

#ifdef CONFIG_IA32_EMULATION
extern unsigned int sig_xstate_ia32_size;

extern struct _fpx_sw_bytes fx_sw_reserved_ia32;
@@ -56,8 +74,10 @@ extern int restore_i387_xstate_ia32(void __user *buf);
#endif

#ifdef CONFIG_MATH_EMULATION
+# define HAVE_HWFP (boot_cpu_data.hard_math)

extern void finit_soft_fpu(struct i387_soft_struct *soft);
#else

+# define HAVE_HWFP 1

static inline void finit_soft_fpu(struct i387_soft_struct *soft) {}

#endif

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 8bcbbce..8632ddb 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -35,11 +35,14 @@ extern unsigned int xstate_size;

extern u64 pcntxt_mask;
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];

+extern int save_xstates_sigframe(void __user *, unsigned int);
+extern int restore_xstates_sigframe(void __user *, unsigned int);
+

extern void xsave_init(void);
extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);

extern int init_fpu(struct task_struct *child);

extern int check_for_xstate(struct i387_fxsave_struct __user *buf,

- void __user *fpstate,
+ unsigned int size,
struct _fpx_sw_bytes *sw);

static inline int xrstor_checking(struct xsave_struct *fx, u64 mask)

diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 739d859..ae799e2 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -18,27 +18,7 @@
#include <asm/i387.h>
#include <asm/user.h>

-#ifdef CONFIG_X86_64
-# include <asm/sigcontext32.h>
-# include <asm/user32.h>
-#else

-# define save_i387_xstate_ia32 save_i387_xstate
-# define restore_i387_xstate_ia32 restore_i387_xstate

-# define _fpstate_ia32 _fpstate
-# define _xstate_ia32 _xstate
-# define sig_xstate_ia32_size sig_xstate_size
-# define fx_sw_reserved_ia32 fx_sw_reserved
-# define user_i387_ia32_struct user_i387_struct
-# define user32_fxsr_struct user_fxsr_struct
-#endif
-
-#ifdef CONFIG_MATH_EMULATION
-# define HAVE_HWFP (boot_cpu_data.hard_math)
-#else
-# define HAVE_HWFP 1
-#endif
-
-static unsigned int mxcsr_feature_mask __read_mostly = 0xffffffffu;
+unsigned int mxcsr_feature_mask __read_mostly = 0xffffffffu;
unsigned int xstate_size;
EXPORT_SYMBOL_GPL(xstate_size);
unsigned int sig_xstate_ia32_size = sizeof(struct _fpstate_ia32);
@@ -372,7 +352,7 @@ static inline u32 twd_fxsr_to_i387(struct i387_fxsave_struct *fxsave)
* FXSR floating point environment conversions.
*/

-static void
+void
convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
{
struct i387_fxsave_struct *fxsave = &tsk->thread.fpu.state->fxsave;
@@ -409,8 +389,8 @@ convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
memcpy(&to[i], &from[i], sizeof(to[0]));
}

-static void convert_to_fxsr(struct task_struct *tsk,
- const struct user_i387_ia32_struct *env)
+void convert_to_fxsr(struct task_struct *tsk,
+ const struct user_i387_ia32_struct *env)

{
struct i387_fxsave_struct *fxsave = &tsk->thread.fpu.state->fxsave;
@@ -648,7 +628,9 @@ static int restore_i387_xsave(void __user *buf)
u64 mask;
int err;

- if (check_for_xstate(fx, buf, &fx_sw_user))
+ if (check_for_xstate(fx, sig_xstate_ia32_size -
+ offsetof(struct _fpstate_ia32, _fxsr_env),
+ &fx_sw_user))
goto fx_only;

mask = fx_sw_user.xstate_bv;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 54ddaeb2..3798fbb 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -117,7 +117,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc,
regs->orig_ax = -1; /* disable syscall checks */

get_user_ex(buf, &sc->fpstate);
- err |= restore_i387_xstate(buf);
+ err |= restore_xstates_sigframe(buf, sig_xstate_size);

get_user_ex(*pax, &sc->ax);
} get_user_catch(err);
@@ -252,7 +252,7 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
return (void __user *)-1L;

/* save i387 state */
- if (used_math() && save_i387_xstate(*fpstate) < 0)
+ if (used_math() && save_xstates_sigframe(*fpstate, sig_xstate_size) < 0)
return (void __user *)-1L;

return (void __user *)sp;
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 6b063d7..349e835 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -102,8 +102,7 @@ void __sanitize_i387_state(struct task_struct *tsk)

* Check for the presence of extended state information in the
* user fpstate pointer in the sigcontext.
*/
-int check_for_xstate(struct i387_fxsave_struct __user *buf,

- void __user *fpstate,
+int check_for_xstate(struct i387_fxsave_struct __user *buf, unsigned int size,

struct _fpx_sw_bytes *fx_sw_user)
{
int min_xstate_size = sizeof(struct i387_fxsave_struct) +

@@ -130,11 +129,11 @@ int check_for_xstate(struct i387_fxsave_struct __user *buf,
fx_sw_user->xstate_size > fx_sw_user->extended_size)
return -EINVAL;

- err = __get_user(magic2, (__u32 *) (((void *)fpstate) +
- fx_sw_user->extended_size -
+ err = __get_user(magic2, (__u32 *) (((void *)buf) + size -
FP_XSTATE_MAGIC2_SIZE));
if (err)
return err;
+
/*
* Check for the presence of second magic word at the end of memory
* layout. This detects the case where the user just copied the legacy
@@ -147,11 +146,109 @@ int check_for_xstate(struct i387_fxsave_struct __user *buf,
return 0;
}

-#ifdef CONFIG_X86_64
/*
* Signal frame handlers.
*/
+int save_xstates_sigframe(void __user *buf, unsigned int size)
+{
+ void __user *buf_fxsave = buf;
+ struct task_struct *tsk = current;
+ struct xsave_struct *xsave = &tsk->thread.fpu.state->xsave;
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+ int ia32 = size == sig_xstate_ia32_size;
+#endif
+ int err;
+
+ if (!access_ok(VERIFY_WRITE, buf, size))
+ return -EACCES;
+
+ BUG_ON(size < xstate_size);
+
+ if (!used_math())
+ return 0;
+
+ clear_used_math(); /* trigger finit */
+
+ if (!HAVE_HWFP)
+ return fpregs_soft_get(current, NULL, 0,
+ sizeof(struct user_i387_ia32_struct), NULL,
+ (struct _fpstate_ia32 __user *) buf) ? -1 : 1;
+
+ unlazy_fpu(tsk);
+ sanitize_i387_state(tsk);
+
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+ if (ia32) {
+ if (use_xsave() || use_fxsr()) {
+ struct user_i387_ia32_struct env;
+ struct _fpstate_ia32 __user *fp = buf;
+
+ convert_from_fxsr(&env, tsk);
+ if (__copy_to_user(buf, &env, sizeof(env)))
+ return -1;
+
+ err = __put_user(xsave->i387.swd, &fp->status);
+ err |= __put_user(X86_FXSR_MAGIC, &fp->magic);
+
+ if (err)
+ return -1;
+
+ buf_fxsave = fp->_fxsr_env;
+ size -= offsetof(struct _fpstate_ia32, _fxsr_env);
+#if defined(CONFIG_X86_64)
+ buf = buf_fxsave;
+#endif
+ } else {
+ struct i387_fsave_struct *fsave =
+ &tsk->thread.fpu.state->fsave;
+
+ fsave->status = fsave->swd;
+ }
+ }
+#endif

+ if (__copy_to_user(buf_fxsave, xsave, size))
+ return -1;
+
+ if (use_xsave()) {
+ struct _fpstate __user *fp = buf;
+ struct _xstate __user *x = buf;
+ u64 xstate_bv = xsave->xsave_hdr.xstate_bv;
+
+ err = __copy_to_user(&fp->sw_reserved,
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+ ia32 ? &fx_sw_reserved_ia32 :
+#endif
+ &fx_sw_reserved,
+ sizeof (struct _fpx_sw_bytes));
+
+ err |= __put_user(FP_XSTATE_MAGIC2,
+ (__u32 __user *) (buf_fxsave + size
+ - FP_XSTATE_MAGIC2_SIZE));
+
+ /*
+ * For legacy compatible, we always set FP/SSE bits in the bit
+ * vector while saving the state to the user context. This will
+ * enable us capturing any changes(during sigreturn) to
+ * the FP/SSE bits by the legacy applications which don't touch
+ * xstate_bv in the xsave header.
+ *
+ * xsave aware apps can change the xstate_bv in the xsave
+ * header as well as change any contents in the memory layout.
+ * xrestore as part of sigreturn will capture all the changes.
+ */
+ xstate_bv |= XSTATE_FPSSE;
+
+ err |= __put_user(xstate_bv, &x->xstate_hdr.xstate_bv);

+
+ if (err)
+ return err;
+ }
+

+ return 1;
+}
+
+#ifdef CONFIG_X86_64
int save_i387_xstate(void __user *buf)
{

struct task_struct *tsk = current;

@@ -239,7 +336,7 @@ static int restore_user_xstate(void __user *buf)
int err;

if (((unsigned long)buf % 64) ||

- check_for_xstate(buf, buf, &fx_sw_user))
+ check_for_xstate(buf, sig_xstate_size, &fx_sw_user))
goto fx_only;

mask = fx_sw_user.xstate_bv;
@@ -314,6 +411,95 @@ clear:
}
#endif

+int restore_xstates_sigframe(void __user *buf, unsigned int size)
+{
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+ struct user_i387_ia32_struct env;
+ int ia32 = size == sig_xstate_ia32_size;
+#endif
+ struct _fpx_sw_bytes fx_sw_user;
+ struct task_struct *tsk = current;
+ struct _fpstate_ia32 __user *fp = buf;
+ struct xsave_struct *xsave;
+ int err;
+
+ if (!buf) {
+ if (used_math()) {
+ clear_fpu(tsk);
+ clear_used_math();

+ }
+ return 0;
+ }
+

+ if (!access_ok(VERIFY_READ, buf, size))
+ return -EACCES;
+
+ if (!used_math()) {
+ err = init_fpu(tsk);

+ if (err)
+ return err;
+ }
+

+ if (!HAVE_HWFP) {
+ set_used_math();
+ return fpregs_soft_set(current, NULL,
+ 0, sizeof(struct user_i387_ia32_struct),
+ NULL, fp) != 0;
+ }
+
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+ if (ia32 && (use_xsave() || use_fxsr())) {
+ if (__copy_from_user(&env, buf, sizeof(env)))
+ return -1;
+ buf = fp->_fxsr_env;
+ size -= offsetof(struct _fpstate_ia32, _fxsr_env);
+ }
+#endif
+
+ preempt_disable();
+ stts();
+ task_thread_info(tsk)->status &= ~TS_USEDFPU;
+ preempt_enable();
+
+ xsave = &tsk->thread.fpu.state->xsave;
+ if (__copy_from_user(xsave, buf, xstate_size))
+ return -1;
+
+ if (use_xsave()) {
+ u64 *xstate_bv = &xsave->xsave_hdr.xstate_bv;
+
+ /*
+ * If this is no valid xstate, disable all extended states.
+ *
+ * For valid xstates, clear any illegal bits and any bits
+ * that have been cleared in fx_sw_user.xstate_bv.
+ */
+ if (check_for_xstate(buf, size, &fx_sw_user))
+ *xstate_bv = XSTATE_FPSSE;
+ else
+ *xstate_bv &= pcntxt_mask & fx_sw_user.xstate_bv;
+

+ xsave->xsave_hdr.reserved1[0] =

+ xsave->xsave_hdr.reserved1[1] = 0;
+ }
+
+ if (use_xsave() || use_fxsr()) {
+ xsave->i387.mxcsr &= mxcsr_feature_mask;
+
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+ if (ia32)
+ convert_to_fxsr(tsk, &env);
+#endif
+ }
+
+ preempt_disable();
+ set_used_math();
+ math_state_restore();
+ preempt_enable();

+
+ return 0;
+}
+

/*
* Prepare the SW reserved portion of the fxsave memory layout, indicating
* the presence of the extended state information in the memory layout

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:02 AM11/29/11

The kernel code handling the FPU states should handle the TS bit in such
a way that #NM exceptions cannot happen inside the kernel. Any other
kernel code using FPU features should use kernel_fpu_begin() and
kernel_fpu_end(), which handles the TS bit and disallows preemption.

So if a #NM exception ever comes from kernel mode, it would indicate a
serious bug. Trapping this with WARN_ON_ONCE() could prove helpful in
finding and eliminating such bugs.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/kernel/traps.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6913369..953d5f6 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -777,6 +777,8 @@ EXPORT_SYMBOL_GPL(math_state_restore);
dotraplinkage void __kprobes
do_device_not_available(struct pt_regs *regs, long error_code)
{
+ WARN_ON_ONCE(!user_mode_vm(regs));
+
#ifdef CONFIG_MATH_EMULATION
if (read_cr0() & X86_CR0_EM) {
struct math_emu_info info = { };

Hans Rosenfeld

unread,

Nov 29, 2011, 7:50:02 AM11/29/11

Removed the functions fpu_fxrstor_checking() and restore_fpu_checking()
because they weren't doing anything. Removed redundant xsave/xrstor
implementations. Since xsave/xrstor is not specific to the FPU, and also
for consistency, all xsave/xrstor functions now take a xsave_struct
argument.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
Link: http://lkml.kernel.org/r/1302018656-586370-2-git-...@amd.com
Signed-off-by: H. Peter Anvin <h...@linux.intel.com>
---
arch/x86/include/asm/i387.h | 20 +++-------
arch/x86/include/asm/xsave.h | 81 +++++++++++++++---------------------------
arch/x86/kernel/traps.c | 2 +-
arch/x86/kernel/xsave.c | 4 +-
4 files changed, 38 insertions(+), 69 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index c9e09ea..a1c0d38 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -227,12 +227,14 @@ static inline void fpu_fxsave(struct fpu *fpu)

static inline void fpu_save_init(struct fpu *fpu)
{
if (use_xsave()) {

- fpu_xsave(fpu);
+ struct xsave_struct *xstate = &fpu->state->xsave;
+
+ fpu_xsave(xstate);

/*
* xsave header may indicate the init state of the FP.

*/
- if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP))
+ if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP))
return;
} else if (use_fxsr()) {
fpu_fxsave(fpu);
@@ -262,22 +264,12 @@ static inline void __save_init_fpu(struct task_struct *tsk)
task_thread_info(tsk)->status &= ~TS_USEDFPU;
}

-static inline int fpu_fxrstor_checking(struct fpu *fpu)
-{

- return fxrstor_checking(&fpu->state->fxsave);
-}
-

static inline int fpu_restore_checking(struct fpu *fpu)

{
if (use_xsave())
- return fpu_xrstor_checking(fpu);
+ return xrstor_checking(&fpu->state->xsave, -1);
else
- return fpu_fxrstor_checking(fpu);
-}
-
-static inline int restore_fpu_checking(struct task_struct *tsk)
-{
- return fpu_restore_checking(&tsk->thread.fpu);
+ return fxrstor_checking(&fpu->state->fxsave);
}

/*
diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index c6ce245..8bcbbce 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -42,10 +42,11 @@ extern int check_for_xstate(struct i387_fxsave_struct __user *buf,
void __user *fpstate,
struct _fpx_sw_bytes *sw);

-static inline int fpu_xrstor_checking(struct fpu *fpu)
+static inline int xrstor_checking(struct xsave_struct *fx, u64 mask)
{
- struct xsave_struct *fx = &fpu->state->xsave;
int err;

+ u32 lmask = mask;
+ u32 hmask = mask >> 32;

asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n\t"

"2:\n"
@@ -55,13 +56,23 @@ static inline int fpu_xrstor_checking(struct fpu *fpu)
".previous\n"
_ASM_EXTABLE(1b, 3b)
: [err] "=r" (err)
- : "D" (fx), "m" (*fx), "a" (-1), "d" (-1), "0" (0)
+ : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask), "0" (0)
: "memory");

return err;
}

-static inline int xsave_user(struct xsave_struct __user *buf)
+static inline void xrstor_state(struct xsave_struct *fx, u64 mask)
+{

+ u32 lmask = mask;
+ u32 hmask = mask >> 32;
+

+ asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t"
+ : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask)
+ : "memory");
+}
+
+static inline int xsave_checking(struct xsave_struct __user *buf)
{
int err;

@@ -74,58 +85,24 @@ static inline int xsave_user(struct xsave_struct __user *buf)
if (unlikely(err))
return -EFAULT;

- __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n"
- "2:\n"

- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"

- ".section __ex_table,\"a\"\n"
- _ASM_ALIGN "\n"
- _ASM_PTR "1b,3b\n"
- ".previous"

- : [err] "=r" (err)
- : "D" (buf), "a" (-1), "d" (-1), "0" (0)
- : "memory");

+ asm volatile("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl $-1,%[err]\n"
+ " jmp 2b\n"
+ ".previous\n"
+ _ASM_EXTABLE(1b,3b)
+ : [err] "=r" (err)
+ : "D" (buf), "a" (-1), "d" (-1), "0" (0)
+ : "memory");
+

if (unlikely(err) && __clear_user(buf, xstate_size))

err = -EFAULT;
- /* No need to clear here because the caller clears USED_MATH */
- return err;
-}
-

-static inline int xrestore_user(struct xsave_struct __user *buf, u64 mask)
-{
- int err;
- struct xsave_struct *xstate = ((__force struct xsave_struct *)buf);

- u32 lmask = mask;

- u32 hmask = mask >> 32;

- __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n"
- "2:\n"

- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"

- ".section __ex_table,\"a\"\n"
- _ASM_ALIGN "\n"
- _ASM_PTR "1b,3b\n"
- ".previous"
- : [err] "=r" (err)
- : "D" (xstate), "a" (lmask), "d" (hmask), "0" (0)
- : "memory"); /* memory required? */
+ /* No need to clear here because the caller clears USED_MATH */
return err;
}

-static inline void xrstor_state(struct xsave_struct *fx, u64 mask)
-{

- u32 lmask = mask;

- u32 hmask = mask >> 32;
-
- asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t"
- : : "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask)

- : "memory");
-}
-

static inline void xsave_state(struct xsave_struct *fx, u64 mask)

{
u32 lmask = mask;
@@ -136,7 +113,7 @@ static inline void xsave_state(struct xsave_struct *fx, u64 mask)
: "memory");
}

-static inline void fpu_xsave(struct fpu *fpu)
+static inline void fpu_xsave(struct xsave_struct *fx)
{

/* This, however, we can work around by forcing the compiler to select
an addressing mode that doesn't require extended registers. */

@@ -144,7 +121,7 @@ static inline void fpu_xsave(struct fpu *fpu)

".byte " REX_PREFIX "0x0f,0xae,0x27",
".byte " REX_PREFIX "0x0f,0xae,0x37",
X86_FEATURE_XSAVEOPT,

- [fx] "D" (&fpu->state->xsave), "a" (-1), "d" (-1) :
+ [fx] "D" (fx), "a" (-1), "d" (-1) :
"memory");
}
#endif
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 953d5f6..8e924d1 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -728,7 +728,7 @@ void __math_state_restore(void)
/*

* Paranoid restore. send a SIGSEGV if we fail to restore the state.

*/
- if (unlikely(restore_fpu_checking(tsk))) {
+ if (unlikely(fpu_restore_checking(&tsk->thread.fpu))) {
stts();
force_sig(SIGSEGV, tsk);
return;
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index a391134..6b063d7 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -170,7 +170,7 @@ int save_i387_xstate(void __user *buf)

if (task_thread_info(tsk)->status & TS_USEDFPU) {

if (use_xsave())
- err = xsave_user(buf);
+ err = xsave_checking(buf);
else
err = fxsave_user(buf);

@@ -247,7 +247,7 @@ static int restore_user_xstate(void __user *buf)
/*

* restore the state passed by the user.

*/
- err = xrestore_user(buf, mask);
+ err = xrstor_checking((__force struct xsave_struct *)buf, mask);
if (err)
return err;

Andi Kleen

unread,

Nov 29, 2011, 4:40:03 PM11/29/11

Hans Rosenfeld <hans.ro...@amd.com> writes:
>
> The lazy allocation of the xstate area has been removed. The support for
> extended states that cannot be saved/restored lazily, like AMD's LWP,
> need this. Since optimized library functions using SSE etc. are widely
> used today, most processes would have an xstate area anyway, making the
> memory overhead negligible.

Do you have any data on that? It sounds dubious for specialized
workloads.

-Andi

--
a...@linux.intel.com -- Speaking for myself only

Hans Rosenfeld

unread,

Nov 30, 2011, 12:40:02 PM11/30/11

On Tue, Nov 29, 2011 at 01:31:09PM -0800, Andi Kleen wrote:
> Hans Rosenfeld <hans.ro...@amd.com> writes:
> >
> > The lazy allocation of the xstate area has been removed. The support for
> > extended states that cannot be saved/restored lazily, like AMD's LWP,
> > need this. Since optimized library functions using SSE etc. are widely
> > used today, most processes would have an xstate area anyway, making the
> > memory overhead negligible.
>
> Do you have any data on that? It sounds dubious for specialized
> workloads.

What kind of specialized workload do you mean?

Hans

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

Andi Kleen

unread,

Nov 30, 2011, 5:00:02 PM11/30/11

On Wed, Nov 30, 2011 at 06:37:46PM +0100, Hans Rosenfeld wrote:
> On Tue, Nov 29, 2011 at 01:31:09PM -0800, Andi Kleen wrote:
> > Hans Rosenfeld <hans.ro...@amd.com> writes:
> > >
> > > The lazy allocation of the xstate area has been removed. The support for
> > > extended states that cannot be saved/restored lazily, like AMD's LWP,
> > > need this. Since optimized library functions using SSE etc. are widely
> > > used today, most processes would have an xstate area anyway, making the
> > > memory overhead negligible.
> >
> > Do you have any data on that? It sounds dubious for specialized
> > workloads.
>
> What kind of specialized workload do you mean?

Anything that doesn't do large memcpys/memsets: glibc only uses SSE
when you pass large buffers. And then doesn't use the FPU. And possibly
has lots of processes.

Some older glibc did an unconditional FPU initialization at start,
but I believe that's long gone.

-Andi

Hans Rosenfeld

unread,

Dec 1, 2011, 3:40:03 PM12/1/11

On Wed, Nov 30, 2011 at 10:52:00PM +0100, Andi Kleen wrote:
> On Wed, Nov 30, 2011 at 06:37:46PM +0100, Hans Rosenfeld wrote:
> > On Tue, Nov 29, 2011 at 01:31:09PM -0800, Andi Kleen wrote:
> > > Hans Rosenfeld <hans.ro...@amd.com> writes:
> > > >
> > > > The lazy allocation of the xstate area has been removed. The support for
> > > > extended states that cannot be saved/restored lazily, like AMD's LWP,
> > > > need this. Since optimized library functions using SSE etc. are widely
> > > > used today, most processes would have an xstate area anyway, making the
> > > > memory overhead negligible.
> > >
> > > Do you have any data on that? It sounds dubious for specialized
> > > workloads.
> >
> > What kind of specialized workload do you mean?
>
> Anything that doesn't do large memcpys/memsets: glibc only uses SSE
> when you pass large buffers. And then doesn't use the FPU. And possibly
> has lots of processes.
>
> Some older glibc did an unconditional FPU initialization at start,
> but I believe that's long gone.

Well, I can't comment on which glibc version does what exactly. But on
the 64bit systems that I observed, _all_ processes had an xstate area
allocated. That was not the case on 32bit, but I'd suspect that the
32bit distributions just aren't optimized for modern hardware.

So I assume, if you have 10000s of processes on a legacy 32bit system
that never do any FPU stuff or SSE optimizations, you might indeed waste
a couple of megabytes. I don't think thats very realistic, but that's
just my opinion.

Hans

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

H. Peter Anvin

unread,

Dec 1, 2011, 9:10:01 PM12/1/11

On 12/01/2011 12:36 PM, Hans Rosenfeld wrote:
>
> So I assume, if you have 10000s of processes on a legacy 32bit system
> that never do any FPU stuff or SSE optimizations, you might indeed waste
> a couple of megabytes. I don't think thats very realistic, but that's
> just my opinion.
>

A couple of megabytes of *lowmem*...

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Hans Rosenfeld

unread,

Dec 2, 2011, 6:30:02 AM12/2/11

On Thu, Dec 01, 2011 at 06:01:27PM -0800, H. Peter Anvin wrote:
> On 12/01/2011 12:36 PM, Hans Rosenfeld wrote:
> >
> > So I assume, if you have 10000s of processes on a legacy 32bit system
> > that never do any FPU stuff or SSE optimizations, you might indeed waste
> > a couple of megabytes. I don't think thats very realistic, but that's
> > just my opinion.
> >
>
> A couple of megabytes of *lowmem*...

Ok, I'll rework that part, so that preallocation only happens on systems
that support non-lazy states. That means patch #7 is going away and
patch #8 is getting slightly bigger. This may take a few days as I have
to test that again.

Meanwhile, could you please take a look at patches #1 to #6?

Hans

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

Ingo Molnar

unread,

Dec 5, 2011, 5:30:02 AM12/5/11

* Hans Rosenfeld <hans.ro...@amd.com> wrote:

> These patches were built and tested against 3.1. The older RFC
> patches that have been lingering in tip/x86/xsave for the last
> few months should be removed.

They have been lingering because of negative review feedback i
have given to you about LWP. I'm not convinced about the current
form of abstraction that this patch-set offers.

See this past discussion from half a year ago:

Subject: Re: [RFC v3 0/8] x86, xsave: rework of extended state handling, LWP support

We can and should do better than that.

Thanks,

Ingo

Hans Rosenfeld

unread,

Dec 7, 2011, 3:00:02 PM12/7/11

Hi,

On Fri, Dec 02, 2011 at 12:20:50PM +0100, Hans Rosenfeld wrote:
> On Thu, Dec 01, 2011 at 06:01:27PM -0800, H. Peter Anvin wrote:
> > On 12/01/2011 12:36 PM, Hans Rosenfeld wrote:
> > >
> > > So I assume, if you have 10000s of processes on a legacy 32bit system
> > > that never do any FPU stuff or SSE optimizations, you might indeed waste
> > > a couple of megabytes. I don't think thats very realistic, but that's
> > > just my opinion.
> > >
> >
> > A couple of megabytes of *lowmem*...
>
> Ok, I'll rework that part, so that preallocation only happens on systems
> that support non-lazy states. That means patch #7 is going away and
> patch #8 is getting slightly bigger. This may take a few days as I have
> to test that again.

The reworked LWP patches are ready and will follow shortly. I also added
a Kconfig option to completely disable support for non-lazy states,
allowing to completely avoid preallocation of the xstate area if
required.

Did you look at the other patches already?

Hans Rosenfeld

unread,

Dec 7, 2011, 3:10:01 PM12/7/11

in xstate_mask so that they are always restored in switch_to. Also, all
processes will always have to have a xstate area preallocated, lazy
allocation will not work when non-lazy states are present.

v2:
A single static xsave area just for init is not enough, since there are
more user processes that are directly spawned by kernel threads. Add a
call to a new arch-specific function to flush_old_exec(), which will in
turn call fpu_alloc() to allocate the xsave area if necessary.

v3:
The new xsave area has to be cleared to avoid xrstor errors.

v4:
Add Kconfig option to disable support for non-lazy states.

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/Kconfig | 18 ++++++++++++++++++
arch/x86/include/asm/i387.h | 11 +++++++++++
arch/x86/include/asm/thread_info.h | 2 ++
arch/x86/include/asm/xsave.h | 9 +++++++--
arch/x86/kernel/process.c | 17 +++++++++++++++++
arch/x86/kernel/process_32.c | 4 ++--
arch/x86/kernel/process_64.c | 4 ++--
arch/x86/kernel/xsave.c | 3 ++-
fs/exec.c | 8 ++++++++
9 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6a47bb2..1f4d706 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1367,6 +1367,24 @@ config MATH_EMULATION
If you are not sure, say Y; apart from resulting in a 66 KB bigger
kernel, it won't hurt.

+config NONLAZY_XSTATES
+ def_bool y
+ prompt "Non-lazy extented process states "
+ ---help---
+ Non-lazy extended process states differ from other extended
+ process states (such as FPU and SIMD states) in that they
+ cannot be saved or restored lazily. The state for AMDs
+ Lightweight Profiling (LWP) is currently the only such state.
+
+ On systems that support non-lazy states, the kernel has to
+ pre-allocate the extended state buffer for each user task.
+ This implies that tasks that do not use the FPU or any SIMD
+ optimizations will still use about 1kB of kernel memory for
+ the extended state buffer.
+
+ If this is unacceptable for your workload, say N to disable
+ all support for non-lazy extended states.
+
config MTRR
def_bool y
prompt "MTRR (Memory Type Range Register) support" if EXPERT
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 3474267..7812d55 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -330,6 +330,17 @@ static inline void fpu_copy(struct fpu *dst, struct fpu *src)

extern void fpu_finit(struct fpu *fpu);

+static inline void fpu_clear(struct fpu *fpu)
+{

+ if (pcntxt_mask & XCNTXT_NONLAZY) {

+ memset(fpu->state, 0, xstate_size);
+ fpu_finit(fpu);
+ set_used_math();

+ } else {
+ fpu_free(fpu);
+ }

+}
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_I387_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 02112a7..b886a47 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -265,6 +265,8 @@ static inline void set_restore_sigmask(void)
extern void arch_task_cache_init(void);
extern void free_thread_info(struct thread_info *ti);
extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
+extern int arch_prealloc_fpu(struct task_struct *tsk);
+#define arch_prealloc_fpu arch_prealloc_fpu
#define arch_task_cache_init arch_task_cache_init
#endif
#endif /* _ASM_X86_THREAD_INFO_H */

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 12793b6..10e0e45 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -23,9 +23,14 @@

/*
* These are the features that the OS can handle currently.
*/
-#define XCNTXT_MASK (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
+#define XCNTXT_LAZY (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
+#define XCNTXT_NONLAZY 0

-#define XCNTXT_LAZY XCNTXT_MASK

+#ifdef CONFIG_NONLAZY_XSTATES
+#define XCNTXT_MASK (XCNTXT_LAZY | XCNTXT_NONLAZY)
+#else
+#define XCNTXT_MASK (XCNTXT_LAZY)
+#endif

#ifdef CONFIG_X86_64
#define REX_PREFIX "0x48, "

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e7e3b01..b43522d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -26,6 +26,23 @@

struct kmem_cache *task_xstate_cachep;
EXPORT_SYMBOL_GPL(task_xstate_cachep);

+int arch_prealloc_fpu(struct task_struct *tsk)
+{

+ if ((pcntxt_mask & XCNTXT_NONLAZY) &&
+ !fpu_allocated(&tsk->thread.fpu)) {

+ int err = fpu_alloc(&tsk->thread.fpu);

+
+ if (err)
+ return err;
+

+ fpu_clear(&tsk->thread.fpu);
+
+ task_thread_info(tsk)->xstate_mask |= (pcntxt_mask & XCNTXT_NONLAZY);
+ }

+
+ return 0;
+}
+

diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 9d95d2f..ce329ff 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c

@@ -16,6 +16,7 @@
* Supported feature mask by the CPU and the kernel.
*/
u64 pcntxt_mask;
+EXPORT_SYMBOL(pcntxt_mask);

/*
* Represents init state for the supported extended state.

@@ -261,7 +262,7 @@ int restore_xstates_sigframe(void __user *buf, unsigned int size)

struct task_struct *tsk = current;

struct _fpstate_ia32 __user *fp = buf;

struct xsave_struct *xsave;
- u64 xstate_mask = 0;
+ u64 xstate_mask = pcntxt_mask & XCNTXT_NONLAZY;

int err;

if (!buf) {

diff --git a/fs/exec.c b/fs/exec.c
index 25dcbe5..af33562 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1071,10 +1071,18 @@ void set_task_comm(struct task_struct *tsk, char *buf)
perf_event_comm(tsk);
}

+#if !defined(arch_prealloc_fpu)
+#define arch_prealloc_fpu(tsk) (0)
+#endif
+
int flush_old_exec(struct linux_binprm * bprm)
{
int retval;

+ retval = arch_prealloc_fpu(current);
+ if (retval)
+ goto out;
+
/*
* Make sure we have a private signal table and that
* we are unassociated from the previous thread group.
--
1.7.5.4

Hans Rosenfeld

unread,

Dec 7, 2011, 3:10:02 PM12/7/11

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/i387.h | 4 ++++
arch/x86/include/asm/msr-index.h | 1 +

arch/x86/include/asm/processor.h | 12 ++++++++++++
arch/x86/include/asm/sigcontext.h | 12 ++++++++++++
arch/x86/include/asm/xsave.h | 3 ++-
arch/x86/kernel/xsave.c | 5 +++++
6 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 7812d55..cfd0be9 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h

@@ -326,6 +326,10 @@ static inline void fpu_free(struct fpu *fpu)

static inline void fpu_copy(struct fpu *dst, struct fpu *src)

{
memcpy(dst->state, src->state, xstate_size);
+
+ /* disable LWP in the copy */
+ if (pcntxt_mask & XSTATE_LWP)
+ dst->state->xsave.xsave_hdr.xstate_bv &= ~XSTATE_LWP;
}

extern void fpu_finit(struct fpu *fpu);

/* new processor state extensions go here */
};

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 10e0e45..789a140 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h

@@ -9,6 +9,7 @@
#define XSTATE_FP 0x1
#define XSTATE_SSE 0x2
#define XSTATE_YMM 0x4
+#define XSTATE_LWP (1ULL << 62)

#define XSTATE_FPSSE (XSTATE_FP | XSTATE_SSE)

@@ -24,7 +25,7 @@

* These are the features that the OS can handle currently.
*/

#define XCNTXT_LAZY (XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
-#define XCNTXT_NONLAZY 0
+#define XCNTXT_NONLAZY (XSTATE_LWP)

#ifdef CONFIG_NONLAZY_XSTATES
#define XCNTXT_MASK (XCNTXT_LAZY | XCNTXT_NONLAZY)
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index ce329ff..2022069 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c

@@ -249,6 +249,11 @@ int save_xstates_sigframe(void __user *buf, unsigned int size)
return err;
}

+ if (pcntxt_mask & XSTATE_LWP) {
+ xsave->xsave_hdr.xstate_bv &= ~XSTATE_LWP;
+ wrmsrl(MSR_AMD64_LWP_CBADDR, 0);
+ }
+
return 1;
}

Hans Rosenfeld

unread,

Dec 16, 2011, 11:10:02 AM12/16/11

On Mon, Dec 05, 2011 at 11:22:23AM +0100, Ingo Molnar wrote:
>
> * Hans Rosenfeld <hans.ro...@amd.com> wrote:
>
> > These patches were built and tested against 3.1. The older RFC
> > patches that have been lingering in tip/x86/xsave for the last
> > few months should be removed.
>
> They have been lingering because of negative review feedback i
> have given to you about LWP. I'm not convinced about the current
> form of abstraction that this patch-set offers.
>
> See this past discussion from half a year ago:
>
> Subject: Re: [RFC v3 0/8] x86, xsave: rework of extended state handling, LWP support
>
> We can and should do better than that.

We had an intern, Benjamin Block, working on perf support for LWP until
a few weeks ago. Because of the fundamental problem that it is not
reasonably possible to allocate virtual memory for a process from a
different process' context, the code only supports self-monitoring.

This allows a process to control LWP through the perf syscalls. Instead
of using malloc() and the LLWPCB instruction itself, it can use the perf
syscall to have the perf kernel code do it. It can also use perf to get
at the raw LWP samples instead of just reading them from the LWP buffer.

I'll send you Benjamins code as an RFC patch set, so please take a look
at it and tell me what you think about it. I admit I don't understand
every detail of it yet as I've only recently started to work my way
through it, and I have no prior knowledge about perf.

Hans

Hans Rosenfeld

unread,

Dec 16, 2011, 11:20:02 AM12/16/11

From: Benjamin Block <benjami...@amd.com>

Adds a new call to the PMU-functions, which can be used by PMUs who
need information about the task and/or CPU for which the event shall
be initialized.

Signed-off-by: Benjamin Block <benjami...@amd.com>

Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

include/linux/perf_event.h | 2 ++
kernel/events/core.c | 17 +++++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 81807bd..0c6fae6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -630,6 +630,8 @@ struct pmu {
* Should return -ENOENT when the @event doesn't match this PMU.
*/
int (*event_init) (struct perf_event *event);
+ int (*event_init_for) (struct perf_event *event, int cpu,
+ struct task_struct *task);

#define PERF_EF_START 0x01 /* start the counter when adding */
#define PERF_EF_RELOAD 0x02 /* reload the counter when starting */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index faf52f7..fd18d70 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5746,7 +5746,8 @@ void perf_pmu_unregister(struct pmu *pmu)
free_pmu_context(pmu);
}

-struct pmu *perf_init_event(struct perf_event *event)
+struct pmu *perf_init_event(struct perf_event *event, int cpu,
+ struct task_struct *task)
{
struct pmu *pmu = NULL;
int idx;
@@ -5758,14 +5759,22 @@ struct pmu *perf_init_event(struct perf_event *event)
pmu = idr_find(&pmu_idr, event->attr.type);
rcu_read_unlock();
if (pmu) {
- ret = pmu->event_init(event);
+ if (pmu->event_init_for)
+ ret = pmu->event_init_for(event, cpu, task);
+ else
+ ret = pmu->event_init(event);
+
if (ret)
pmu = ERR_PTR(ret);
goto unlock;
}

list_for_each_entry_rcu(pmu, &pmus, entry) {
- ret = pmu->event_init(event);
+ if (pmu->event_init_for)
+ ret = pmu->event_init_for(event, cpu, task);
+ else
+ ret = pmu->event_init(event);
+
if (!ret)
goto unlock;

@@ -5875,7 +5884,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
goto done;

- pmu = perf_init_event(event);
+ pmu = perf_init_event(event, cpu, task);

done:
err = 0;
--
1.7.7

Hans Rosenfeld

unread,

Dec 16, 2011, 11:20:02 AM12/16/11

From: Benjamin Block <benjami...@amd.com>

Adds a prototype for a new perf-event context-type for permanent events.
This new context shall later handle events of pmus which don't have to
be enabled or disabled by the scheduler (i.e. lwp).

The current implementation doesn't prevent the scheduler from scheduling
events in this context, it only adds the necessary enum-value and some
checks to prevent that other events than permanent events get added to
this context.

Signed-off-by: Benjamin Block <benjami...@amd.com>
Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

include/linux/perf_event.h | 21 +++++++++++++++++++-
include/linux/sched.h | 15 ++++++++++++++
kernel/events/core.c | 46 ++++++++++++++++++++++++++++++++++++-------
3 files changed, 73 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c816075..81807bd 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1015,11 +1015,30 @@ static inline bool is_sampling_event(struct perf_event *event)
}

/*
+ * Returns the type of context, the event is in (hardware, software, permanent).
+ */
+static inline int context_type(struct perf_event *event)
+{
+ return event->pmu->task_ctx_nr;
+}
+
+/*
* Return 1 for a software event, 0 for a hardware event
*/
static inline int is_software_event(struct perf_event *event)
{
- return event->pmu->task_ctx_nr == perf_sw_context;
+ return context_type(event) == perf_sw_context;
+}
+
+static inline int is_permanent_event(struct perf_event *event)
+{
+ return context_type(event) == perf_permanent_context;
+}
+
+static inline int event_context_differ(struct perf_event *event1,
+ struct perf_event *event2)
+{
+ return context_type(event1) != context_type(event2);
}

extern struct jump_label_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 41d0237..16b771d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1214,6 +1214,21 @@ enum perf_event_task_context {
perf_invalid_context = -1,
perf_hw_context = 0,
perf_sw_context,
+
+ /*
+ * perf_permanent_context should be used for events which run on PMUs
+ * that do the context-switching without the scheduler. The
+ * corresponding PMU has to ensure that the events are only running if
+ * the task is.
+ * Like with software-events, the PMU should not have limitations that
+ * could cause a event not to be activated. Because there is no
+ * interaction with the scheduler, these limitations could not be
+ * balanced by perf.
+ * No other event should be in this context.
+ *
+ * TODO: implement scheduler-exceptions
+ */
+ perf_permanent_context,
perf_nr_task_contexts,
};

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0f85778..faf52f7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6140,26 +6140,56 @@ SYSCALL_DEFINE5(perf_event_open,
*/
pmu = event->pmu;

- if (group_leader &&
- (is_software_event(event) != is_software_event(group_leader))) {
- if (is_software_event(event)) {
+ if (group_leader && event_context_differ(event, group_leader)) {
+ if (is_permanent_event(event) ||
+ is_permanent_event(group_leader)) {
/*
- * If event and group_leader are not both a software
- * event, and event is, then group leader is not.
+ * Permanent events are not scheduled by the scheduler-
+ * perf-hooks and thus no other event should ever be
+ * moved in their context, they also should not be
+ * moved in an other context, because that would cause
+ * unnecessary scheduler-overhead
+ *
+ * TODO: implement scheduler-exceptions
+ */
+ err = -EINVAL;
+ goto err_alloc;
+ }
+
+ /*
+ * In the case that neither the event nor the group_leader is a
+ * permanent event, we have do decide whether it is feasible to
+ * move the event to the context of the group_leader or
+ * vis-a-vis.
+ * Background is: a software-event can be grouped together with
+ * hardware-events, because they never fail to be scheduled. A
+ * hardware-event otherwise can fail to be scheduled and thus
+ * should not be added to a software-context, because that could
+ * lead to wrong decisions
+ */
+ switch (context_type(event)) {
+ case perf_sw_context:
+ /*
+ * If event and group_leader differ in the
+ * event_context_type and none of them is a permanent
+ * event, and event is a SW-event, then group_leader has
+ * to be a HW-event
*
* Allow the addition of software events to !software
* groups, this is safe because software events never
* fail to schedule.
*/
pmu = group_leader->pmu;
- } else if (is_software_event(group_leader) &&
- (group_leader->group_flags & PERF_GROUP_SOFTWARE)) {
+ break;
+ case perf_hw_context:
/*
* In case the group is a pure software group, and we
* try to add a hardware event, move the whole group to
* the hardware context.
*/
- move_group = 1;
+ if (group_leader->group_flags & PERF_GROUP_SOFTWARE)
+ move_group = 1;
+ break;
}
}

--
1.7.7

Hans Rosenfeld

unread,

Dec 16, 2011, 11:20:02 AM12/16/11

From: Benjamin Block <benjami...@amd.com>

Adds activation of lwp for ring3-software. With this patch and
scheduler-support (xsave-patches done by Hans Rosenfeld) a
userspace-application can use the lwp-instructions (llwpcb,
slwpcb, ..) to activate lwp.

Support for the threshold-interrupt is not given and thus the
corresponding bits are not set. It is therefore unavailable for
userspace-applications.

Signed-off-by: Benjamin Block <benjami...@amd.com>
Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event_amd_lwp.c | 138 ++++++++++++++++++++++++++++++
3 files changed, 140 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_amd_lwp.c

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2d9cf3c..557a6fd 100644

--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -136,6 +136,7 @@
#define MSR_AMD64_IBSDCPHYSAD 0xc0011039
#define MSR_AMD64_IBSCTL 0xc001103a
#define MSR_AMD64_IBSBRTARGET 0xc001103b

+#define MSR_AMD64_LWP_CFG 0xc0000105

#define MSR_AMD64_LWP_CBADDR 0xc0000106

/* Fam 15h MSRs */

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 6042981..9973465 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -20,7 +20,7 @@ obj-$(CONFIG_X86_32) += bugs.o
obj-$(CONFIG_X86_64) += bugs_64.o

obj-$(CONFIG_CPU_SUP_INTEL) += intel.o
-obj-$(CONFIG_CPU_SUP_AMD) += amd.o
+obj-$(CONFIG_CPU_SUP_AMD) += amd.o perf_event_amd_lwp.o
obj-$(CONFIG_CPU_SUP_CYRIX_32) += cyrix.o
obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
diff --git a/arch/x86/kernel/cpu/perf_event_amd_lwp.c b/arch/x86/kernel/cpu/perf_event_amd_lwp.c
new file mode 100644
index 0000000..9aa9a91
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_amd_lwp.c
@@ -0,0 +1,138 @@
+#include <linux/perf_event.h>
+#include <linux/module.h>
+
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+
+/* masks only the events as of spec r3.08 (lwp v1) */
+#define LWP_EVENT_MASK 0x7E
+
+struct lwp_capabilities {
+#define LWP_CAPS_LWP 0
+#define LWP_CAPS_THRESHOLD 31
+ unsigned long supported_events;
+ unsigned long available_events;
+
+ u8 size_lwpcb;
+ u8 size_event;
+ u8 size_max_event_id;
+ u8 size_event_offset;
+
+#define LWP_CAPS_FILTER_BRANCH 28
+#define LWP_CAPS_FILTER_IP 29
+#define LWP_CAPS_FILTER_CACHE_LVL 30
+#define LWP_CAPS_FILTER_CACHE_LAT 31
+ unsigned long features;
+};
+
+union lwp_cfg_msr {
+ struct {
+ u32 allowed_events;
+ u8 core_id;
+ u8 interrupt_vector;
+ u16 reserved;
+ } cfg;
+ u64 msr_value;
+};
+
+static struct lwp_capabilities lwp_caps;
+
+static void get_lwp_caps(struct lwp_capabilities *caps)
+{
+ u32 sizes;
+
+ memset(caps, 0, sizeof(*caps));
+ cpuid(0x8000001C, (u32 *) &caps->available_events, &sizes,
+ (u32 *) &caps->features,
+ (u32 *) &caps->supported_events);
+
+ caps->size_lwpcb = (u8) sizes;
+ caps->size_event = (u8) (sizes >> 0x8);
+ caps->size_max_event_id = (u8) (sizes >> 0x10);
+ caps->size_event_offset = (u8) (sizes >> 0x18);
+}
+
+static void lwp_start_cpu(void *c)
+{
+ struct lwp_capabilities *caps = (struct lwp_capabilities *) c;
+ union lwp_cfg_msr msr;
+
+ msr.msr_value = 0;
+ /* allow supported events of lwpv1 [1..6] */
+ msr.cfg.allowed_events |= ((u32) caps->supported_events) &
+ LWP_EVENT_MASK;
+ /*
+ * The value showing up in the core-id field of a event-record.
+ * I currently only 8 bits wide.
+ */
+ msr.cfg.core_id = (u8) smp_processor_id();
+
+ /*
+ * We currently do not support the threshold-interrupt so
+ * bit 31 and [40..47] of msr.msr_value keep 0
+ *
+ * msr.cfg.allowed_events |= (1U << 31);
+ * msr.cfg.interrupt_vector = xxx;
+ */
+
+ wrmsrl(MSR_AMD64_LWP_CFG, msr.msr_value);
+}
+
+static int __cpuinit
+lwp_cpu_notifier(struct notifier_block *self, unsigned long action,
+ void *hcpu)
+{
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_STARTING:
+ lwp_start_cpu(&lwp_caps);
+ break;
+ default:
+ return NOTIFY_DONE;
+ }
+
+ return NOTIFY_OK;
+}
+
+static __init int amd_lwp_init(void)
+{
+ if (!static_cpu_has(X86_FEATURE_LWP))
+ return -ENODEV;
+
+ /* read the _supported_ events */
+ get_lwp_caps(&lwp_caps);
+
+ /* we currently only support lwp v1; spec r3.08 */
+ if (!test_bit(LWP_CAPS_LWP, &lwp_caps.supported_events) ||
+ (((lwp_caps.features >> 9) & 0x7F) != 1))
+ return -ENODEV;
+
+ if (!test_bit(LWP_CAPS_THRESHOLD, &lwp_caps.supported_events))
+ return -ENODEV;
+
+ get_online_cpus();
+
+ barrier();
+
+ perf_cpu_notifier(lwp_cpu_notifier);
+ smp_call_function(lwp_start_cpu, &lwp_caps, 1);
+
+ put_online_cpus();
+
+ /*
+ * The values returned by cpuid are corresponding to the values in
+ * MSR_AMD64_LWP_CFG and determine what events are available. As we
+ * have just changed MSR_AMD64_LWP_CFG, we have to re-read the lwp_caps.
+ */
+ get_lwp_caps(&lwp_caps);
+
+ printk(KERN_INFO "perf: AMD LWP caps: "
+ "[%#lx],[%#hhx|%#hhx|%#hhx|%#hhx],[%#lx],[%#lx]",
+ lwp_caps.available_events, lwp_caps.size_lwpcb,
+ lwp_caps.size_event, lwp_caps.size_max_event_id,
+ lwp_caps.size_event_offset, lwp_caps.features,
+ lwp_caps.supported_events);

+
+ return 0;
+}
+

+device_initcall(amd_lwp_init);
--
1.7.7

Hans Rosenfeld

unread,

Dec 16, 2011, 11:20:02 AM12/16/11

From: Benjamin Block <benjami...@amd.com>

Implements a basic integration of LWP into perf. Permits a way to create
a perf-event that will be backed by LWP. The PMU creates the required
structures and userspace-memories. The PMU also collects the samples
from the ring-buffer, but as there is currently no interrupt- and
overflow-implementation, they are not reported (TODO).

Signed-off-by: Benjamin Block <benjami...@amd.com>
Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/processor.h | 4 +-
arch/x86/kernel/cpu/perf_event_amd_lwp.c | 1179 +++++++++++++++++++++++++++++-
include/linux/perf_event.h | 5 +
kernel/events/core.c | 28 +
4 files changed, 1213 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bb31ab6..d5240e7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -353,7 +353,7 @@ struct ymmh_struct {
u32 ymmh_space[64];
};

-struct lwp_struct {
+struct lwp_state {
u64 lwpcb_addr;
u32 flags;
u32 buf_head_offset;
@@ -374,7 +374,7 @@ struct xsave_struct {

struct i387_fxsave_struct i387;
struct xsave_hdr_struct xsave_hdr;
struct ymmh_struct ymmh;

- struct lwp_struct lwp;
+ struct lwp_state lwp;

/* new processor state extensions will go here */
} __attribute__ ((packed, aligned (64)));

diff --git a/arch/x86/kernel/cpu/perf_event_amd_lwp.c b/arch/x86/kernel/cpu/perf_event_amd_lwp.c
index 9aa9a91..afc6c8d 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_lwp.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_lwp.c
@@ -1,12 +1,94 @@
#include <linux/perf_event.h>
#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/kref.h>
+#include <linux/mm_types.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/mman.h>
+#include <linux/hardirq.h>
+#include <linux/highmem.h>
+#include <linux/bitops.h>

+#include <asm/xsave.h>
#include <asm/cpufeature.h>
#include <asm/processor.h>

+/*
+ * The perf-config-vector (u64) contains 2 informations:
+ * * the event-id of the event that should be activated
+ * * filters for this class of event (lwp doesn't provide filters for
+ * individual events)
+ *
+ * Event-ID: lwp_config_event_get(perf-config)
+ * Filters: lwp_config_filter_get(perf-config)
+ *
+ * Each event-class has its own filter-config.
+ * for each class Filters contain:
+ * Bit 0: IP Filter Invert
+ * 1: IP Filter
+ * though it is possible to set the bits (for later implementations)
+ * the current implementation does not support ip-filtering (see
+ * get_filter_mask_for())
+ * for branch retired:
+ * Bit 2: No Mispredicted Branches
+ * 3: No Predicted Branches
+ * 4: No Absolute Branches
+ * 5: No Conditional Branches
+ * 6: No Unconditional Branches
+ * for dcache misses:
+ * Bit 2-9: MinLatency
+ * 10: Northbridge
+ * 11: Remote
+ * 12: Dram
+ * 13: Other
+ */
+#define LWP_CONFIG_EVENT_MASK 0x000000000000001FULL
+#define LWP_CONFIG_FILTER_MASK 0xFFFFF00000000000ULL
+#define LWP_CONFIG_MASK (LWP_CONFIG_EVENT_MASK \
+ | LWP_CONFIG_FILTER_MASK)
+
+static inline int lwp_config_event_get(u64 config)
+{
+ return (config) & LWP_CONFIG_EVENT_MASK;
+}
+
+static inline int lwp_config_filter_get(u64 config)
+{
+ return ((config) & LWP_CONFIG_FILTER_MASK) >> 44;
+}
+

/* masks only the events as of spec r3.08 (lwp v1) */

#define LWP_EVENT_MASK 0x7E

+enum lwp_event_nr {
+ LWP_EVENT_INVALID = 0,
+ LWP_EVENT_INSERT = 1,
+ LWP_EVENT_INSTRURET,
+ LWP_EVENT_BRANCHRET,
+ LWP_EVENT_DCACHEMISS,
+ LWP_EVENT_CPUCLOCK,
+ LWP_EVENT_CPURCLOCK,
+ LWP_EVENT_MAX,
+ LWP_EVENT_PROGRAMMED = 255 /* This is no mistake */
+};
+
+enum lwp_filter_nr {
+ LWP_FILTER_MIN_LATENCY = 0,
+ LWP_FILTER_CACHE_LEVEL = 8,
+ LWP_FILTER_CACHE_NORTHBRIDGE = 9,
+ LWP_FILTER_CACHE_REMOTE = 10,
+ LWP_FILTER_CACHE_DRAM = 11,
+ LWP_FILTER_CACHE_OTHER = 12,
+ LWP_FILTER_BRANCH_MISPREDICT = 25,
+ LWP_FILTER_BRANCH_PREDICT = 26,
+ LWP_FILTER_BRANCH_ABSOLUTE = 27,
+ LWP_FILTER_BRANCH_COND = 28,
+ LWP_FILTER_BRANCH_UNCOND = 29,
+ LWP_FILTER_IP_FILTER_INV = 30,
+ LWP_FILTER_IP_FILTER = 31
+};
+
struct lwp_capabilities {
#define LWP_CAPS_LWP 0
#define LWP_CAPS_THRESHOLD 31
@@ -35,7 +117,1096 @@ union lwp_cfg_msr {
u64 msr_value;
};

+struct lwp_event {
+ /*
+ * event id
+ * 0 - Reserved - Invalid
+ * 1 - Programmed value sample
+ * 2 - Instructions retired
+ * 3 - Branches retired
+ * 4 - DCache misses
+ * 5 - CPU clocks not halted
+ * 6 - CPU reference clocks not halted
+ * 255 - Programmed event
+ */
+ u8 event_id;
+ u8 core_id;
+ u16 flags; /* per-event flags; see spec. */
+ u32 data1;
+ u64 inst_adr;
+ u64 data2;
+ u64 data3;
+} __attribute__((packed));
+
+struct lwpcb_head {
+ u32 flags;
+ u32 buffer_size : 28;
+
+ /*
+ * If set, lwp-HW will randomize every event-interval by making
+ * the first 'random' bits random.
+ * Could be used to prevent fixed event-pattern.
+ */
+ u32 random : 4;
+ u64 buffer_base; /* has to be a userspace effective address */
+
+ /*
+ * buffer_head_offset is held by HW and must never changed by SW.
+ * It can be updated by executing slwpcb. <wiki:Circular_buffer>
+ */
+ u32 buffer_head_offset;
+ u32 reserved_1;
+ u64 missed_events; /* increases if buffer is full */
+
+ /*
+ * If the threshold-interrupt is active this size is evaluated as:
+ * threshold >= (buffer_head_offset - buffer_tail_offset) % buffer_size
+ * Should be a multiple of event_size, if not it is rounded down by HW.
+ */
+ u32 threshold;
+ u32 filters;
+
+ /*
+ * base_ip and limit_ip are only validated if instruction-pointer-filter
+ * is active.
+ */
+ u64 base_ip;
+ u64 limit_ip;
+ u64 reserved_2;
+
+ /*
+ * The tail-pointer of the ringbuffer, should point to the oldest event
+ * and has to be maintained by the software.
+ * If bto > buffer_size; then bto = 0; fi
+ */
+ u32 buffer_tail_offset;
+ u32 reserved_3;
+ u64 software_data_1; /* can be used by software */
+ u64 software_data_2;
+} __attribute__((packed));
+
+/*
+ * Between lwpcb_head and lwpcb_event is a undefined space
+ * which has to be read from hardware before allocating it.
+ * LwpEventOffset tells the startpoint of the events.
+ * lwpcb_event can be attached several times after that point.
+ */
+struct lwpcb_event {
+ s32 interval : 26;
+ u32 reserved_1 : 6;
+ s32 counter : 26;
+ u32 reserved_2 : 6;
+} __attribute__((packed));
+
+/* everything above is treated as 0 */
+#define LWP_EVENT_MAX_PERIOD 0x1FFFFFFULL
+/* we need a reasonable minimum as a to small value could start a intrp-strom */
+#define LWP_EVENT_MIN_PERIOD 0xFULL
+
+struct lwp_userspace {
+ void __user *addr;
+ struct page **pages;
+ size_t length; /* in pages */
+};
+
+struct lwp_struct {
+ struct { /* lwpcb */
+ void *lwpcb_base;
+
+ /*
+ * The actual size of the lwpcb.
+ * At least:
+ * sizeof(lwpcb_head) + lwp_caps.max_event_id *
+ * sizeof(lwpcb_event)
+ * But the hardware can request more,
+ * so better use lwp_caps.size_lwpcb * 8
+ */
+ size_t size;
+
+ struct lwpcb_head *head;
+ struct lwpcb_event *events;
+ } lwpcb;
+
+ /* the ringbuffer used by lwp to store the event_records */
+ struct { /* buffer */
+ void *buffer_base;
+ size_t size;
+ } buffer;
+
+ struct { /* userspace mappings */
+ struct mm_struct *mm;
+
+ /* both should be PAGE_ALIGNED or at least 64 bit aligned */
+ struct lwp_userspace lwpcb;
+ struct lwp_userspace buffer;
+ } userspace;
+
+ struct task_struct *owner;
+
+ /* This reflects caps.size_event at the time of creation */
+ size_t eventsize;
+ /* Max event_id supported by this lwp-instance */
+ size_t eventmax;
+
+ /* Cached events that have been read from buffer */
+ u64 *event_counter;
+ /*
+ * Cached xsave-values, to prevent lose of already counted but not
+ * submitted events.
+ */
+ u32 xstate_counter[LWP_EVENT_MAX-1];
+
+ u8 active;
+
+ struct kref ref_count;
+ raw_spinlock_t lock;
+};
+
+static inline int vector_test(unsigned int bit_nr, u32 vector)
+{
+ return (1U << bit_nr) & vector;
+}
+
static struct lwp_capabilities lwp_caps;
+static struct pmu perf_lwp_pmu;
+
+static u16 get_filter_mask_for(u32 eventnr)
+{
+ /*
+ * IP-filtering is currently not supported by this PMU,
+ * as it would cause every active event to be affected
+ *
+ * if (test_bit(LWP_FILTER_IP, &lwp_caps.features))
+ * u32 mask = 0x3;
+ */
+ u32 mask = 0x0;
+
+ switch (eventnr) {
+ case LWP_EVENT_BRANCHRET:
+ mask |= 0x70U;
+ if (test_bit(LWP_CAPS_FILTER_BRANCH, &lwp_caps.features))
+ mask |= 0xCU;
+ break;
+ case LWP_EVENT_DCACHEMISS:
+ if (test_bit(LWP_CAPS_FILTER_CACHE_LAT, &lwp_caps.features))
+ mask |= 0x3FCU;
+ if (test_bit(LWP_CAPS_FILTER_CACHE_LVL, &lwp_caps.features))
+ mask |= 0x3C00U;
+ break;
+ default:
+ break;
+ }
+
+ return mask;
+}
+
+static u32 get_filter_vector(u32 eventnr, u16 filter)
+{
+ u32 vector = 0;
+
+ filter &= get_filter_mask_for(eventnr);
+ if (!filter)
+ return 0;
+
+ /*
+ * ugly... but we have to use the given perf-config-fields
+ * maybe I will integrate this into a bitfield or enum
+ */
+ switch (eventnr) {
+ case LWP_EVENT_BRANCHRET:
+ /* branch-filter start at position 25 */
+ vector |= (filter << 23);
+ /* the following combinations would prevent any event */
+ if (vector_test(LWP_FILTER_BRANCH_MISPREDICT, vector) &&
+ vector_test(LWP_FILTER_BRANCH_PREDICT, vector))
+ return 0;
+ if (vector_test(LWP_FILTER_BRANCH_ABSOLUTE, vector) &&
+ vector_test(LWP_FILTER_BRANCH_COND, vector) &&
+ vector_test(LWP_FILTER_BRANCH_UNCOND, vector))
+ return 0;
+ break;
+ case LWP_EVENT_DCACHEMISS:
+ if (filter & 0x3C00)
+ vector |= (((filter & 0x3C00) >> 2) | 0x100);
+ vector |= ((filter & 0x3FC) >> 2);
+ break;
+ default:
+ break;
+ }
+
+ return vector;
+}
+
+static int
+get_userspace_mapping(struct lwp_userspace *l, struct mm_struct *mm,
+ size_t size)
+{
+ int err = 0,
+ pages = 0;
+
+ l->length = PAGE_ALIGN(size) / PAGE_SIZE;
+ if (l->length <= 0) {
+ err = -EFAULT;
+ goto err;
+ }
+
+ l->pages = kmalloc(l->length * sizeof(*l->pages), GFP_ATOMIC);
+ if (!l->pages) {
+ err = -ENOMEM;
+ goto err;
+ }
+
+ down_write(&mm->mmap_sem);
+
+ l->addr = (void __user *) do_mmap(NULL, 0, l->length * PAGE_SIZE,
+ PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0);
+ if (IS_ERR(l->addr)) {
+ err = -ENOMEM;
+ goto err_sem;
+ }
+
+ WARN_ON(!IS_ALIGNED((unsigned long) l->addr, PAGE_SIZE));
+
+ pages = get_user_pages(current, mm, (unsigned long) l->addr, l->length,
+ 1, 0, l->pages, NULL);
+ if (pages != l->length) {
+ err = -EFAULT;
+ goto err_mmap;
+ }
+
+ up_write(&mm->mmap_sem);
+
+ return 0;
+err_mmap:
+ do_munmap(mm, (unsigned long) l->addr, l->length * PAGE_SIZE);
+err_sem:
+ up_write(&mm->mmap_sem);
+ kfree(l->pages);
+err:

+ return err;
+}
+

+static int free_userspace_mapping(struct lwp_userspace *l, struct mm_struct *mm)
+{
+ int err = 0, i;
+
+ for (i = 0; i < l->length; i++)
+ put_page(l->pages[i]);
+
+ kfree(l->pages);
+
+ down_write(&mm->mmap_sem);
+ err = do_munmap(mm, (unsigned long) l->addr, l->length * PAGE_SIZE);
+ if (err)
+ goto err_sem;
+ up_write(&mm->mmap_sem);
+
+ return 0;
+err_sem:
+ up_write(&mm->mmap_sem);

+ return err;
+}
+

+static int userspace_write(struct page **dest, void *source, size_t length)
+{
+ int page;
+ size_t chk;
+ void *addr;
+ char *src = source;
+
+ for (page = 0, chk = 0; length > 0; page++, length -= chk) {
+ addr = __kmap_atomic(dest[page]);
+ if (!addr)
+ return -EFAULT;
+
+ chk = min(length, PAGE_SIZE);
+
+ memcpy(addr, src, chk);
+ src += chk;
+
+ __kunmap_atomic(addr);
+ }

+
+ return 0;
+}
+

+static int userwrite_lwpcb(struct lwp_struct *l)
+{
+ BUG_ON(l->active);
+ return userspace_write(l->userspace.lwpcb.pages, l->lwpcb.lwpcb_base,
+ l->lwpcb.size);
+}
+
+static int userwrite_buffer(struct lwp_struct *l)
+{
+ BUG_ON(l->active);
+ return userspace_write(l->userspace.buffer.pages,
+ l->buffer.buffer_base,
+ l->buffer.size);
+}
+
+static int userread_buffer(struct lwp_struct *l, u32 start_offset, u32 end_offset)
+{
+ int page;
+ size_t run, page_offset, length, chk;
+ size_t size = l->buffer.size;
+ char *kern_buf = l->buffer.buffer_base;
+ char *user_buf;
+ size_t page_count = l->userspace.buffer.length; /* in pages */
+ struct page **pages = l->userspace.buffer.pages;
+
+ /* start == end means that the interval is empty */
+ if (start_offset == end_offset)
+ return 0;
+
+ /*
+ * The first case is the 'usual', but since this is a ringbuffer, the
+ * end-Pointer could be below the start-Pointer. In this case we have
+ * to read from start to ringbuffer-end and then from ringbuffer-start
+ * to end.
+ */
+ if(start_offset < end_offset)
+ length = end_offset - start_offset;
+ else
+ length = (size - start_offset) + end_offset;
+
+ /* end_offset points to the start of the last event */
+ length = min(length + l->eventsize, size);
+
+ run = start_offset;
+ /* region between start_offset and the its containing page */
+ page_offset = start_offset - rounddown(start_offset, PAGE_SIZE);
+
+ for (page = start_offset / PAGE_SIZE; length > 0;
+ length -= chk,
+ page = (page + 1) % page_count,
+ run = (run + chk) % size) {
+ user_buf = __kmap_atomic(pages[page]);
+ if (!user_buf)
+ return -EFAULT;
+
+ chk = min3(size - run, PAGE_SIZE - page_offset, length);
+ memcpy(kern_buf + run, user_buf + page_offset, chk);
+
+ /* after the first round we don't need the offset anymore */
+ page_offset ^= page_offset;
+
+ __kunmap_atomic(user_buf);
+ }

+
+ return 0;
+}
+

+static int userwrite_buffer_tail_offset(struct lwp_struct *l)
+{
+ struct lwpcb_head *head;
+
+ head = (struct lwpcb_head *)
+ kmap_atomic(l->userspace.lwpcb.pages[0], KM_USER0);
+
+ if (!head)
+ return -EFAULT;
+
+ head->buffer_tail_offset = l->lwpcb.head->buffer_tail_offset;
+
+ kunmap_atomic((void *) head, KM_USER0);

+
+ return 0;
+}
+

+static int lwp_active(struct lwp_struct *l)
+{
+ u64 lwpcb_addr;
+ rdmsrl(MSR_AMD64_LWP_CBADDR, lwpcb_addr);
+
+ if (lwpcb_addr) {
+ if (lwpcb_addr == (u64) l->userspace.lwpcb.addr)
+ return 1;
+ else
+ return -1;

+ }
+ return 0;
+}
+

+static int lwp_xsave_check(struct lwp_struct *l)
+{
+ struct lwp_state *xlwp = &current->thread.fpu.state->xsave.lwp;
+
+ /* TODO: correct conversion */
+ if (xlwp->lwpcb_addr &&
+ (xlwp->lwpcb_addr != (u64) l->userspace.lwpcb.addr))
+ return 1;

+
+ return 0;
+}
+

+static int lwp_read_head_offset(struct lwp_struct *l, u32 *bufferHeadOffset)
+{
+ int rc;
+
+ rc = lwp_active(l);
+ if (rc < 0) {
+ return 1;
+ } else if (rc > 0) {
+ /* flush hw-states */
+ save_xstates(current);
+ } else if (lwp_xsave_check(l)) {

+ return 1;
+ }
+

+ *bufferHeadOffset =
+ current->thread.fpu.state->xsave.lwp.buf_head_offset;

+
+ return 0;
+}
+

+static int lwp_stop(struct lwp_struct *l)
+{
+ int rc, i;
+ struct lwp_state *xlwp;
+
+ xlwp = &current->thread.fpu.state->xsave.lwp;
+
+ /*
+ * this might set xsave_hdr.xstate_bv[62] to 0, which should be 1 if
+ * we want to restore the area later (with new values or not)
+ *
+ * saves all states into the xstate area
+ */
+ rc = lwp_active(l);
+ if (rc < 0) {
+ return 1;
+ } else if (rc > 0) {
+ save_xstates(current);
+ /* turns lwp off immediately */
+ wrmsrl(MSR_AMD64_LWP_CBADDR, 0);
+
+ for (i = 0; i < l->eventmax; i++) {
+ if (vector_test(i+1, xlwp->flags))
+ l->xstate_counter[i] = xlwp->event_counter[i];
+ }
+ } else if (lwp_xsave_check(l)) {

+ return 1;
+ }
+

+ l->active = 0;

+
+ return 0;
+}
+

+static int lwp_clear(struct lwp_struct *l)
+{
+ struct lwp_state *xlwp;
+
+ if (lwp_stop(l))
+ return 1;
+
+ xlwp = &current->thread.fpu.state->xsave.lwp;
+ memset(xlwp, 0, sizeof(*xlwp));
+
+ /* indicate the lwp-xsave-area is no longer valid */
+ current->thread.fpu.state->xsave.xsave_hdr.xstate_bv &=
+ ~(((u64) 1) << 62);
+ restore_xstates(current, task_thread_info(current)->xstate_mask);

+
+ return 0;
+}
+

+static int lwp_start(struct lwp_struct *l, int update)
+{
+ int i;
+ struct lwp_state *xlwp;
+ struct lwpcb_head *head = l->lwpcb.head;
+
+ if (lwp_active(l))
+ return 1;
+
+ xlwp = &current->thread.fpu.state->xsave.lwp;
+
+ if (!xlwp->lwpcb_addr) {
+ xlwp->lwpcb_addr = (u64) l->userspace.lwpcb.addr;
+ xlwp->flags = head->flags & LWP_EVENT_MASK;
+ xlwp->buf_head_offset = head->buffer_head_offset;
+ xlwp->buf_base = head->buffer_base;
+ xlwp->buf_size = head->buffer_size;
+ xlwp->filters = head->filters;
+ memset(xlwp->saved_event_record, 0,
+ sizeof(xlwp->saved_event_record));
+ memset(xlwp->event_counter, 0,
+ sizeof(xlwp->event_counter));
+ } else {
+ if (update) {
+ xlwp->flags = head->flags & LWP_EVENT_MASK;
+ xlwp->filters = head->filters;
+ }
+ }
+
+ for (i = 0; i < l->eventmax; i++) {
+ if (vector_test(i+1, xlwp->flags))
+ xlwp->event_counter[i] = l->xstate_counter[i];
+ }
+
+ /*
+ * if we used lwp_stop without lwp being enabled
+ * ???: is xstate_bv used or is it just a copy of the last xsave?
+ */
+ current->thread.fpu.state->xsave.xsave_hdr.xstate_bv |=
+ ((u64) 1) << 62;
+ restore_xstates(current, task_thread_info(current)->xstate_mask);
+
+ l->active = 1;

+
+ return 0;
+}
+

+static int perf_lwp_event_init(struct perf_event *event)
+{
+ return -EINVAL;
+}
+
+static struct lwp_struct *lwpcb_get(struct perf_event *event)
+{
+ struct lwp_struct *lwpcb;
+
+ /* TODO: has to be locked in later cross-lwp-implementations */
+ lwpcb = (struct lwp_struct *) event->hw.config;
+ kref_get(&lwpcb->ref_count);
+
+ return lwpcb;
+}
+
+static struct lwp_struct *lwpcb_new(void)
+{
+ int err;
+ char *lwpcb_base;
+ struct lwp_struct *l;
+
+ l = kmalloc(sizeof(*l), GFP_ATOMIC);
+ if (!l)
+ return ERR_PTR(-ENOMEM);
+ memset(l, 0, sizeof(*l));
+
+ l->owner = current;
+ l->active = 0;
+
+ l->eventsize = lwp_caps.size_event;
+ l->eventmax = lwp_caps.size_max_event_id;
+
+ /* l->cap.size_lwpcb contains expected size in quadwords */
+ l->lwpcb.size = lwp_caps.size_lwpcb * 8;
+ kref_init(&l->ref_count);
+ raw_spin_lock_init(&l->lock);
+
+ /* the kernel-space is cloned into the per-task-user-space */
+ lwpcb_base = kmalloc(l->lwpcb.size, GFP_ATOMIC);
+ if (!lwpcb_base) {
+ err = -ENOMEM;
+ goto err_lwpcb_alloc;
+ }
+ memset(lwpcb_base, 0, l->lwpcb.size);
+
+ l->lwpcb.lwpcb_base = lwpcb_base;
+ l->lwpcb.head = (struct lwpcb_head *) lwpcb_base;
+ l->lwpcb.events = (struct lwpcb_event *)
+ (lwpcb_base + lwp_caps.size_event_offset);
+
+ /*
+ * the spec requires at least
+ * 32 * caps.size_buffer_min * l->eventsize
+ * we let 128 records be our minimum (1 Page)
+ * = 32KB (v1)
+ */
+ l->buffer.size = (32 * ((lwp_caps.features >> 16) & 0xFF));
+ if (l->buffer.size < 128)
+ l->buffer.size = 128;
+ l->buffer.size *= l->eventsize;
+ l->buffer.buffer_base = kmalloc(l->buffer.size, GFP_ATOMIC);
+ if (!l->buffer.buffer_base) {
+ err = -ENOMEM;
+ goto err_lwpcbspace_alloc;
+ }
+ memset(l->buffer.buffer_base, 0, l->buffer.size);
+
+ l->event_counter = kmalloc(l->eventmax * sizeof(*l->event_counter),
+ GFP_ATOMIC);
+ if(!l->event_counter) {
+ err = -ENOENT;
+ goto err_lwpcbbuffer_alloc;
+ }
+ memset(l->event_counter, 0, l->eventmax * sizeof(*l->event_counter));
+
+ l->userspace.mm = get_task_mm(current);
+
+ err = get_userspace_mapping(&l->userspace.lwpcb, l->userspace.mm,
+ l->lwpcb.size);
+ if (err)
+ goto err_mm;
+
+ err = get_userspace_mapping(&l->userspace.buffer, l->userspace.mm,
+ l->buffer.size);
+ if (err)
+ goto err_ulwpcb;
+
+ /* modified on event-start */
+ l->lwpcb.head->flags = 0;
+ l->lwpcb.head->buffer_size = l->buffer.size;
+ l->lwpcb.head->buffer_base = (u64) l->userspace.buffer.addr;
+ /* currently not supported by this pmu */
+ l->lwpcb.head->random = 0;
+ /* l->lwpcb.head->buffer_head_offset = 0;
+ * l->lwpcb.head->missed_events = 0; */
+ l->lwpcb.head->threshold = 1 * l->eventsize;
+ /* modified on event-start */
+ l->lwpcb.head->filters = 0;
+ /* l->lwpcb.head->base_ip = 0;
+ * l->lwpcb.head->limit_ip = 0; */
+ l->lwpcb.head->buffer_tail_offset = 0;
+
+ /* init userspace */
+ err = userwrite_lwpcb(l);
+ if (err)
+ goto err_ubuffer;
+
+ err = userwrite_buffer(l);
+ if (err)
+ goto err_ubuffer;
+
+ return l;
+err_ubuffer:
+ free_userspace_mapping(&l->userspace.buffer, l->userspace.mm);
+err_ulwpcb:
+ free_userspace_mapping(&l->userspace.lwpcb, l->userspace.mm);
+err_mm:
+ mmput(l->userspace.mm);
+
+ kfree(l->event_counter);
+err_lwpcbbuffer_alloc:
+ kfree(l->buffer.buffer_base);
+err_lwpcbspace_alloc:
+ kfree(l->lwpcb.lwpcb_base);
+err_lwpcb_alloc:
+ kfree(l);
+ return ERR_PTR(err);
+}
+
+static void lwpcb_destory(struct kref *kref)
+{
+ struct lwp_struct *l = container_of(kref, struct lwp_struct,
+ ref_count);
+
+ /*
+ * we are the last one still standing, no locking required
+ * (if we use kref correctly)
+ */
+
+ BUG_ON(l->active);
+ BUG_ON(in_interrupt());
+
+ if (lwp_clear(l))
+ BUG();
+
+ free_userspace_mapping(&l->userspace.lwpcb, l->userspace.mm);
+ free_userspace_mapping(&l->userspace.buffer, l->userspace.mm);
+ mmput(l->userspace.mm);
+
+ kfree(l->event_counter);
+ kfree(l->buffer.buffer_base);
+ kfree(l->lwpcb.lwpcb_base);
+ kfree(l);
+}
+
+static void lwpcb_add_event(struct lwp_struct *lwps, u32 eventnr, u16 filter,
+ u64 sample)
+{
+ struct lwpcb_head *head = lwps->lwpcb.head;
+ struct lwpcb_event *events = lwps->lwpcb.events;
+ u32 filters = head->filters;
+
+ WARN_ON(lwps->active);
+
+ if (filter)
+ filters |= get_filter_vector(eventnr, filter);
+
+ head->filters = filters;
+ events[eventnr-1].interval = sample;
+ events[eventnr-1].counter = 0;
+}
+
+static void lwpcb_remove_event(struct lwp_struct *lwps, u32 eventnr)
+{
+ WARN_ON(lwps->active);
+
+ lwps->lwpcb.events[eventnr-1].interval = 0;
+ lwps->lwpcb.events[eventnr-1].counter = 0;
+}
+
+static int lwpcb_read_buffer(struct lwp_struct *l)
+{
+ u32 bho, bto, bz;
+ int count, i;
+ char *buffer = l->buffer.buffer_base;
+ struct lwp_event *event;
+
+ bz = l->lwpcb.head->buffer_size;
+
+ bto = l->lwpcb.head->buffer_tail_offset;
+ buffer += bto;
+
+ /*
+ * the last two checks are to prevent user-manipulations that could
+ * cause damage
+ */
+ if (lwp_read_head_offset(l, &bho) || (bho > bz) || (bho % l->eventsize))
+ BUG();
+
+ count = (((bho - bto) % bz) / l->eventsize);
+ if(count <= 0)
+ return 0;
+
+ /* todo read only needed chunks */
+ if (userread_buffer(l, bto, bho))
+ BUG();
+
+ for (i = 0; i < count; i++) {
+ event = (struct lwp_event *) (buffer + bto);
+
+ /*
+ * The opposite COULD be a programmed lwp-event (id=255), but we
+ * ignore them for now.
+ */
+ if ((event->event_id > LWP_EVENT_INVALID) ||
+ (event->event_id < LWP_EVENT_MAX)) {
+ l->event_counter[event->event_id - 1] +=
+ l->lwpcb.events[event->event_id - 1].interval;
+ }
+
+ bto += l->eventsize;
+ if (bto >= bz)
+ bto = 0;
+ }
+
+ l->lwpcb.head->buffer_tail_offset = bto;
+
+ if (userwrite_buffer_tail_offset(l))
+ BUG();

+
+ return 0;
+}
+

+static void perf_lwp_event_destroy(struct perf_event *event)
+{
+ struct lwp_struct *l = (struct lwp_struct *) event->hw.config;
+ /* ???: is it possible to modify event->attr.config at runtime? */
+ u32 eventnr = lwp_config_event_get(event->attr.config);
+ unsigned long flags;
+
+ /* this event has already a valid copy of the lwpcb */
+
+ WARN_ON(!(event->hw.state & PERF_HES_STOPPED));
+ BUG_ON(current != l->owner);
+
+ raw_spin_lock_irqsave(&l->lock, flags);
+
+ if (lwp_stop(l))
+ BUG();
+
+ lwpcb_remove_event(l, eventnr);
+
+ if (userwrite_lwpcb(l))
+ BUG();
+
+ l->event_counter[eventnr-1] = 0;
+ l->xstate_counter[eventnr-1] = 0;
+
+ if ((l->lwpcb.head->flags & LWP_EVENT_MASK) && lwp_start(l, 1))
+ BUG();
+
+ raw_spin_unlock_irqrestore(&l->lock, flags);
+
+ /* for future with cross-lwp-creation this needs to be locked */
+ kref_put(&l->ref_count, lwpcb_destory);
+}
+
+static int
+perf_lwp_event_init_for(struct perf_event *event, int cpu,
+ struct task_struct *task)
+{
+ int err;
+ unsigned long flags;
+ struct hw_perf_event *hwc = &event->hw;
+ struct perf_event_attr *attr = &event->attr;
+ struct task_struct *target, *observer;
+ struct perf_event_context *ctx;
+ struct perf_event *e;
+ struct lwp_struct *lwpcb;
+ u32 eventnr;
+ u16 filter;
+
+ if (perf_lwp_pmu.type != event->attr.type)
+ return -ENOENT;
+
+ observer = current;
+
+ if (event->attach_state != PERF_ATTACH_TASK || event->cpu != -1)
+ return -EINVAL;
+
+ target = task;
+
+ /* current restriction, till the mmap-problem is solved */
+ if (target != observer)
+ return -EINVAL;
+
+ if (attr->config & ~LWP_CONFIG_MASK)
+ return -EINVAL;
+
+ eventnr = (u32) lwp_config_event_get(attr->config);
+ if ((eventnr <= LWP_EVENT_INVALID) || (eventnr >= LWP_EVENT_MAX) ||
+ (eventnr > lwp_caps.size_max_event_id) ||
+ (!test_bit(eventnr, &lwp_caps.available_events)))
+ return -EINVAL;
+
+ filter = lwp_config_filter_get(attr->config);
+ if (filter & get_filter_mask_for(eventnr))
+ return -EINVAL;
+
+ /* either to big (>26 Bit) or to small (<16) */
+ if ((hwc->sample_period < 0xF) || (hwc->sample_period >= 0x2000000))
+ return -EINVAL;
+
+ /*
+ * we need to check if there is already a lwp-event running for this
+ * task, if so, we don't need to create a new lwpcb, just update it
+ *
+ * to do so, first get the context of the task and lock it
+ */
+
+ ctx = perf_find_get_context(&perf_lwp_pmu, task, cpu);
+ /* strange but possible, most likely due to memory-shortage */
+ if (IS_ERR(ctx))
+ return (int) PTR_ERR(ctx);
+
+ /*
+ * now we have a valid context, lets lock the event-list so it can't be
+ * modified
+ */
+ mutex_lock(&ctx->mutex);
+ rcu_read_lock();
+
+ /* ok, lets look for a lwp-event */
+ list_for_each_entry_rcu(e, &ctx->event_list, event_entry) {
+ if (e->pmu == &perf_lwp_pmu)
+ break;
+ }
+
+ if (e->pmu != &perf_lwp_pmu) {
+ /* there is currently no running lwp-event */
+
+ /*
+ * TODO: for later implementation of cross-lwp-creation we need
+ * to introduce a lock here, to prevent other threads from
+ * racing the creation of the lwpcb
+ *
+ * maybe we would better introduce a lwp-field in the
+ * event-context to prevent two events racing this
+ */
+
+ rcu_read_unlock();
+
+ lwpcb = lwpcb_new();
+ if (IS_ERR(lwpcb)) {
+ err = -ENOMEM;
+ goto err_lwpcbnew_failed;
+ }
+ } else {
+ /* found a running lwp-event */
+
+ lwpcb = lwpcb_get(e);
+ rcu_read_unlock();
+ }
+
+ hwc->config = (u64) lwpcb;
+ hwc->state = PERF_HES_STOPPED;
+
+ raw_spin_lock_irqsave(&lwpcb->lock, flags);
+
+ if (lwpcb->lwpcb.events[eventnr-1].interval) {
+ err = -EINVAL;
+ goto err_add_failed;
+ }
+
+ if (lwp_stop(lwpcb)) {
+ err = -EFAULT;
+ goto err_add_failed;
+ }
+
+ lwpcb_add_event(lwpcb, eventnr, filter, hwc->sample_period);
+ if(userwrite_lwpcb(lwpcb))
+ BUG();
+
+ lwpcb->event_counter[eventnr-1] = 0;
+ lwpcb->xstate_counter[eventnr-1] = 0;
+
+ event->destroy = perf_lwp_event_destroy;
+
+ if ((lwpcb->lwpcb.head->flags & LWP_EVENT_MASK) && lwp_start(lwpcb, 1))
+ BUG();
+
+ raw_spin_unlock_irqrestore(&lwpcb->lock, flags);
+
+ mutex_unlock(&ctx->mutex);
+ perf_release_context(ctx);
+
+ return 0;
+err_add_failed:
+ raw_spin_unlock_irqrestore(&lwpcb->lock, flags);
+ perf_lwp_event_destroy(event);
+err_lwpcbnew_failed:
+ mutex_unlock(&ctx->mutex);
+ perf_release_context(ctx);
+

+ return err;
+}
+

+static void perf_lwp_start(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct lwp_struct *l = (struct lwp_struct *) event->hw.config;
+ u32 eventnr = lwp_config_event_get(event->attr.config);
+ u32 lwpflags;
+ unsigned long lockflags = 0;
+
+ /* update cached values, before updating freq */
+ raw_spin_lock_irqsave(&l->lock, lockflags);
+ lwpcb_read_buffer(l);
+ raw_spin_unlock_irqrestore(&l->lock, lockflags);
+
+ lockflags = 0;
+ raw_spin_lock_irqsave(&l->lock, lockflags);
+
+ /* TODO: need a good way to handle takeovers of lwp by current */
+ if (lwp_stop(l))
+ BUG();
+
+ hwc->state = 0;
+
+ /* counters get reloaded every lwp_start
+ if (flags & PERF_EF_RELOAD) { DEBUG("reload counter"); } */
+
+ /* This implies that we currently not support 64 Bit-Counter */
+ if (hwc->sample_period < LWP_EVENT_MIN_PERIOD) {
+ __WARN();
+ hwc->sample_period = LWP_EVENT_MIN_PERIOD;
+ } else if (hwc->sample_period > LWP_EVENT_MAX_PERIOD) {
+ __WARN();
+ hwc->sample_period = LWP_EVENT_MAX_PERIOD;
+ }
+ l->lwpcb.events[eventnr-1].interval = hwc->sample_period;
+
+ lwpflags = l->lwpcb.head->flags;
+ lwpflags |= (1U << eventnr);
+ l->lwpcb.head->flags = lwpflags;
+
+ /* TODO: need a good way to handle mm-changes by current */
+ if (userwrite_lwpcb(l))
+ BUG();
+
+ if (lwp_start(l, 1))
+ BUG();
+
+ raw_spin_unlock_irqrestore(&l->lock, lockflags);
+
+ perf_event_update_userpage(event);
+}
+
+static void perf_lwp_stop(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct lwp_struct *l = (struct lwp_struct *) event->hw.config;
+ u32 eventnr = lwp_config_event_get(event->attr.config);
+ u32 lwpflags;
+ unsigned long lockflags = 0;
+
+ raw_spin_lock_irqsave(&l->lock, lockflags);
+
+ if (lwp_stop(l))
+ BUG();
+
+ /* counter get updated every stop, for each active event */
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ lwpflags = l->lwpcb.head->flags;
+ lwpflags &= ~(1U << eventnr);
+ l->lwpcb.head->flags = lwpflags;
+
+ if (userwrite_lwpcb(l))
+ BUG();
+
+ if (lwpflags & LWP_EVENT_MASK) {
+ if (lwp_start(l, 1))
+ BUG();
+ }
+
+ raw_spin_unlock_irqrestore(&l->lock, lockflags);
+
+ /* update cached values */
+ lockflags = 0;
+ raw_spin_lock_irqsave(&l->lock, lockflags);
+ lwpcb_read_buffer(l);
+ raw_spin_unlock_irqrestore(&l->lock, lockflags);
+
+ perf_event_update_userpage(event);
+}
+
+static int perf_lwp_add(struct perf_event *event, int flags)
+{
+ if (flags & PERF_EF_START)
+ perf_lwp_start(event, flags);

+
+ return 0;
+}
+

+static void perf_lwp_del(struct perf_event *event, int flags)
+{
+ perf_lwp_stop(event, flags);
+}
+
+static void perf_lwp_read(struct perf_event *event)
+{
+ struct lwp_struct *l = (struct lwp_struct *) event->hw.config;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&l->lock, flags);
+
+ lwpcb_read_buffer(l);
+
+ raw_spin_unlock_irqrestore(&l->lock, flags);
+}
+
+static struct pmu perf_lwp_pmu = {
+ .task_ctx_nr = perf_permanent_context,
+
+ .event_init = perf_lwp_event_init,
+ .event_init_for = perf_lwp_event_init_for,
+ .add = perf_lwp_add,
+ .del = perf_lwp_del,
+ .start = perf_lwp_start,
+ .stop = perf_lwp_stop,
+ .read = perf_lwp_read,
+};
+
+static int perf_lwp_init_pmu(void)
+{
+ int ret;
+
+ ret = perf_pmu_register(&perf_lwp_pmu, "lwp", -1);
+ if (ret)
+ return ret;
+
+ printk(KERN_INFO "perf: registered LWP-PMU (type-id: %d)",
+ perf_lwp_pmu.type);

+
+ return 0;
+}

static void get_lwp_caps(struct lwp_capabilities *caps)
{
@@ -111,6 +1282,12 @@ static __init int amd_lwp_init(void)

get_online_cpus();

+ /*
+ * The global 'lwp_caps' has to be known to all functions after this.
+ *
+ * For the SMP-case we relay on the implicit fence of smp_call_function
+ * and in the none-SMP-case on the barrier afterwards.
+ */
barrier();

perf_cpu_notifier(lwp_cpu_notifier);
@@ -132,7 +1309,7 @@ static __init int amd_lwp_init(void)
lwp_caps.size_event_offset, lwp_caps.features,
lwp_caps.supported_events);

- return 0;
+ return perf_lwp_init_pmu();
}

device_initcall(amd_lwp_init);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0c6fae6..2539f6f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -971,6 +971,11 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
extern u64 perf_event_read_value(struct perf_event *event,
u64 *enabled, u64 *running);

+extern struct perf_event_context *
+perf_find_get_context(struct pmu *pmu, struct task_struct *task,
+ int cpu);
+extern void perf_release_context(struct perf_event_context *ctx);
+
struct perf_sample_data {
u64 type;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fd18d70..99715c0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2920,6 +2920,34 @@ errout:
return ERR_PTR(err);
}

+/*
+ * Returns a matching context with refcount and pincount incremented.
+ * Tries to find a matching context for the given combination of PMU, task
+ * and CPU, which is a tasks context if task is given and a CPU-context if
+ * not.
+ *
+ * If a matching context is found, the pin-count and the ref-count of the
+ * context will be incremented. You have to decrement them again, if you're
+ * done with the context.
+ * They both protect the context from being freed and from being swapped away
+ * from the task/cpu.
+ */
+struct perf_event_context *
+perf_find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
+{
+ return find_get_context(pmu, task, cpu);
+}
+
+/*
+ * Release your valid pointer to the context, it will be invalid afterwards!
+ */
+void
+perf_release_context(struct perf_event_context *ctx)
+{
+ perf_unpin_context(ctx);
+ put_ctx(ctx);
+}
+
static void perf_event_free_filter(struct perf_event *event);

static void free_event_rcu(struct rcu_head *head)

Hans Rosenfeld

unread,

Dec 16, 2011, 11:20:01 AM12/16/11

From: Benjamin Block <benjami...@amd.com>

This patch adds support for the LWP threshold-interrupt into the
LWP-integration into perf. For each LWP-event that is written into the
buffer a interrupt is generated and a overflow is reported to perf. If
requested, the LWP-event is also reported as raw-event.

The perf-sample_rate is used as interval for the corresponding
LWP-event. The current implementation restricts the sample_rate to be
between 0xF and 0x1FFFFFF, because we couldn't report raw-LWP-event for
each overflow if the sample_rate would be bigger (period-calculation
could cause a overflow although there was no interrupt).

The interrupt is currently only available to the kernel and not to
userland-software that wants to use LWP without the in-kernel
implementation.

Signed-off-by: Benjamin Block <benjami...@amd.com>
Signed-off-by: Hans Rosenfeld <hans.ro...@amd.com>
---

arch/x86/include/asm/irq_vectors.h | 8 +-
arch/x86/kernel/cpu/Makefile | 4 +-
arch/x86/kernel/cpu/perf_event_amd_lwp.c | 318 +++++++++++++++++++++++-------
arch/x86/kernel/entry_64.S | 2 +
4 files changed, 253 insertions(+), 79 deletions(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 7e50f06..c5447f5 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -119,6 +119,12 @@
*/
#define LOCAL_TIMER_VECTOR 0xef

+/*
+ * Vector-Nr. used by the threshold-interrupt.
+ * Has to be initialized before it is written to MSR_AMD64_LWP_CFG.
+ */
+#define LWP_THRESHOLD_VECTOR 0xee
+
/* up to 32 vectors used for spreading out TLB flushes: */
#if NR_CPUS <= 32
# define NUM_INVALIDATE_TLB_VECTORS (NR_CPUS)
@@ -126,7 +132,7 @@
# define NUM_INVALIDATE_TLB_VECTORS (32)
#endif

-#define INVALIDATE_TLB_VECTOR_END (0xee)
+#define INVALIDATE_TLB_VECTOR_END (0xed)
#define INVALIDATE_TLB_VECTOR_START \
(INVALIDATE_TLB_VECTOR_END-NUM_INVALIDATE_TLB_VECTORS+1)

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9973465..6d87bac 100644

--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -20,7 +20,7 @@ obj-$(CONFIG_X86_32) += bugs.o
obj-$(CONFIG_X86_64) += bugs_64.o

obj-$(CONFIG_CPU_SUP_INTEL) += intel.o

-obj-$(CONFIG_CPU_SUP_AMD) += amd.o perf_event_amd_lwp.o
+obj-$(CONFIG_CPU_SUP_AMD) += amd.o

obj-$(CONFIG_CPU_SUP_CYRIX_32) += cyrix.o
obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o

@@ -31,7 +31,7 @@ obj-$(CONFIG_PERF_EVENTS) += perf_event.o
obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/

-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o perf_event_amd_lwp.o

quiet_cmd_mkcapflags = MKCAP $@
cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
diff --git a/arch/x86/kernel/cpu/perf_event_amd_lwp.c b/arch/x86/kernel/cpu/perf_event_amd_lwp.c
index afc6c8d..205245d 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_lwp.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_lwp.c
@@ -10,6 +10,9 @@
#include <linux/highmem.h>
#include <linux/bitops.h>

+#include <asm/idle.h>
+#include <asm/desc.h>
+#include <asm/irq_vectors.h>

#include <asm/xsave.h>
#include <asm/cpufeature.h>
#include <asm/processor.h>

@@ -250,6 +253,7 @@ struct lwp_struct {

/* Cached events that have been read from buffer */

u64 *event_counter;
+ struct perf_event **registered_events;
/*

* Cached xsave-values, to prevent lose of already counted but not

* submitted events.
@@ -270,6 +274,8 @@ static inline int vector_test(unsigned int bit_nr, u32 vector)
static struct lwp_capabilities lwp_caps;
static struct pmu perf_lwp_pmu;

+static DEFINE_PER_CPU(struct lwp_struct *, active_lwp_struct) = 0;
+
static u16 get_filter_mask_for(u32 eventnr)
{
/*
@@ -735,6 +741,16 @@ static struct lwp_struct *lwpcb_new(void)

}
memset(l->event_counter, 0, l->eventmax * sizeof(*l->event_counter));

+ l->registered_events =
+ kmalloc(l->eventmax * sizeof(*l->registered_events),
+ GFP_ATOMIC);
+ if(!l->registered_events) {
+ err = -ENOENT;
+ goto err_event_counter_alloc;
+ }
+ memset(l->registered_events, 0,
+ l->eventmax * sizeof(*l->registered_events));

+
l->userspace.mm = get_task_mm(current);

err = get_userspace_mapping(&l->userspace.lwpcb, l->userspace.mm,

@@ -747,8 +763,11 @@ static struct lwp_struct *lwpcb_new(void)
if (err)
goto err_ulwpcb;

- /* modified on event-start */
- l->lwpcb.head->flags = 0;
+ /*
+ * Activate only the threshold interrupt,
+ * all other events are activated on pmu-start() off the specific event
+ */
+ l->lwpcb.head->flags = (1U << LWP_CAPS_THRESHOLD);

l->lwpcb.head->buffer_size = l->buffer.size;

l->lwpcb.head->buffer_base = (u64) l->userspace.buffer.addr;

/* currently not supported by this pmu */

@@ -779,6 +798,8 @@ err_ulwpcb:
err_mm:
mmput(l->userspace.mm);

+ kfree(l->registered_events);
+err_event_counter_alloc:
kfree(l->event_counter);
err_lwpcbbuffer_alloc:
kfree(l->buffer.buffer_base);
@@ -809,6 +830,7 @@ static void lwpcb_destory(struct kref *kref)
free_userspace_mapping(&l->userspace.buffer, l->userspace.mm);
mmput(l->userspace.mm);

+ kfree(l->registered_events);
kfree(l->event_counter);
kfree(l->buffer.buffer_base);
kfree(l->lwpcb.lwpcb_base);
@@ -840,57 +862,46 @@ static void lwpcb_remove_event(struct lwp_struct *lwps, u32 eventnr)

lwps->lwpcb.events[eventnr-1].counter = 0;
}

-static int lwpcb_read_buffer(struct lwp_struct *l)
+static int
+lwpcb_update_period(struct lwp_struct *lwps, struct perf_event *event,
+ u64 period, u64 new_period)
{
- u32 bho, bto, bz;
- int count, i;
- char *buffer = l->buffer.buffer_base;
- struct lwp_event *event;
-
- bz = l->lwpcb.head->buffer_size;
-
- bto = l->lwpcb.head->buffer_tail_offset;
- buffer += bto;
-
- /*
- * the last two checks are to prevent user-manipulations that could
- * cause damage
- */
- if (lwp_read_head_offset(l, &bho) || (bho > bz) || (bho % l->eventsize))
- BUG();
-
- count = (((bho - bto) % bz) / l->eventsize);
- if(count <= 0)
- return 0;
-
- /* todo read only needed chunks */
- if (userread_buffer(l, bto, bho))
- BUG();

+ struct hw_perf_event *hwc = &event->hw;

+ u32 event_idx = lwp_config_event_get(event->attr.config) - 1;
+ u64 sample_period = hwc->sample_period;
+ u64 last_period = period;
+ u64 left = local64_read(&hwc->period_left);
+ s64 sleft;
+ int overflow = 0;

- for (i = 0; i < count; i++) {
- event = (struct lwp_event *) (buffer + bto);
+ hwc->last_period = last_period;
+ sleft = (new_period - sample_period);

- /*
- * The opposite COULD be a programmed lwp-event (id=255), but we
- * ignore them for now.
- */
- if ((event->event_id > LWP_EVENT_INVALID) ||
- (event->event_id < LWP_EVENT_MAX)) {
- l->event_counter[event->event_id - 1] +=
- l->lwpcb.events[event->event_id - 1].interval;
- }
-
- bto += l->eventsize;
- if (bto >= bz)
- bto = 0;
+ /* lets test if the change was already enough to trigger a overflow */
+ if (left < -sleft) {
+ overflow = 1;
+ left = new_period + (left + sleft);
+ }
+ else {
+ left += sleft;
}

- l->lwpcb.head->buffer_tail_offset = bto;
+ if (left <= last_period) {
+ overflow = 1;
+ left = new_period + (left - last_period);
+ local64_set(&hwc->period_left, left);
+ } else {
+ left -= last_period;
+ local64_set(&hwc->period_left, left);
+ }

- if (userwrite_buffer_tail_offset(l))
- BUG();
+ /*
+ * if new_period != hwc->sample_period, then this change
+ * has also to be promoted to lwp via userwrite_lwpcb
+ */
+ lwps->lwpcb.events[event_idx].interval = new_period;

- return 0;
+ return overflow;
}

static void perf_lwp_event_destroy(struct perf_event *event)
@@ -907,6 +918,9 @@ static void perf_lwp_event_destroy(struct perf_event *event)

raw_spin_lock_irqsave(&l->lock, flags);

+ if(l->registered_events[eventnr-1] != event)
+ goto not_registered;
+
if (lwp_stop(l))
BUG();

@@ -917,10 +931,12 @@ static void perf_lwp_event_destroy(struct perf_event *event)

l->event_counter[eventnr-1] = 0;

l->xstate_counter[eventnr-1] = 0;

+ l->registered_events[eventnr-1] = 0;

if ((l->lwpcb.head->flags & LWP_EVENT_MASK) && lwp_start(l, 1))

BUG();

+not_registered:
raw_spin_unlock_irqrestore(&l->lock, flags);

/* for future with cross-lwp-creation this needs to be locked */

@@ -1009,7 +1025,6 @@ perf_lwp_event_init_for(struct perf_event *event, int cpu,

* maybe we would better introduce a lwp-field in the

* event-context to prevent two events racing this

*/
-
rcu_read_unlock();

lwpcb = lwpcb_new();
@@ -1029,7 +1044,7 @@ perf_lwp_event_init_for(struct perf_event *event, int cpu,

raw_spin_lock_irqsave(&lwpcb->lock, flags);

- if (lwpcb->lwpcb.events[eventnr-1].interval) {
+ if (lwpcb->registered_events[eventnr-1]) {
err = -EINVAL;
goto err_add_failed;
}
@@ -1045,6 +1060,7 @@ perf_lwp_event_init_for(struct perf_event *event, int cpu,

lwpcb->event_counter[eventnr-1] = 0;

lwpcb->xstate_counter[eventnr-1] = 0;

+ lwpcb->registered_events[eventnr-1] = event;

event->destroy = perf_lwp_event_destroy;

@@ -1073,25 +1089,15 @@ static void perf_lwp_start(struct perf_event *event, int flags)

struct lwp_struct *l = (struct lwp_struct *) event->hw.config;

u32 eventnr = lwp_config_event_get(event->attr.config);

u32 lwpflags;
+ int overflow;

unsigned long lockflags = 0;

- /* update cached values, before updating freq */
- raw_spin_lock_irqsave(&l->lock, lockflags);
- lwpcb_read_buffer(l);
- raw_spin_unlock_irqrestore(&l->lock, lockflags);
-
- lockflags = 0;
raw_spin_lock_irqsave(&l->lock, lockflags);

/* TODO: need a good way to handle takeovers of lwp by current */

if (lwp_stop(l))
BUG();

- hwc->state = 0;
-
- /* counters get reloaded every lwp_start
- if (flags & PERF_EF_RELOAD) { DEBUG("reload counter"); } */
-

/* This implies that we currently not support 64 Bit-Counter */

if (hwc->sample_period < LWP_EVENT_MIN_PERIOD) {
__WARN();
@@ -1100,7 +1106,24 @@ static void perf_lwp_start(struct perf_event *event, int flags)
__WARN();
hwc->sample_period = LWP_EVENT_MAX_PERIOD;
}
- l->lwpcb.events[eventnr-1].interval = hwc->sample_period;
+
+ /* Set the (maybe) new period.
+ *
+ * A Overflow is theo. possible, as the new sample_rate could be smaller
+ * than the old, and thus some already counted events can be enough the
+ * trigger an overflow.
+ * This would be difficult, because there is not lwp-event to report.
+ * We would have to wait for the next interrupt, which should trigger
+ * immediately after the start.
+ *
+ * (left_period + (new_period - old_period)) <= 0
+ */
+ overflow = lwpcb_update_period(l, event, 0, hwc->sample_period);

+
+ hwc->state = 0;
+
+ /* counters get reloaded every lwp_start

+ if (flags & PERF_EF_RELOAD) { } */

lwpflags = l->lwpcb.head->flags;

lwpflags |= (1U << eventnr);
@@ -1110,6 +1133,8 @@ static void perf_lwp_start(struct perf_event *event, int flags)
if (userwrite_lwpcb(l))
BUG();

+ percpu_write(active_lwp_struct, l);
+
if (lwp_start(l, 1))
BUG();

@@ -1138,22 +1163,31 @@ static void perf_lwp_stop(struct perf_event *event, int flags)
lwpflags &= ~(1U << eventnr);

l->lwpcb.head->flags = lwpflags;

+ /*
+ * We could/should update update the period here but in the case of a
+ * overflow we wouldn't have a lwp-event report to report.
+ * Also, there should be no sample_period-changed between start and
+ * stop, thus there are no overflows as in perf_lwp_start. All other
+ * overflows should have been reported already (by the interrupt).
+ *
+ * overflow = lwpcb_update_period(l, hwc, l->xstate_counter[eventnr-1],
+ * l->events[eventnr-1].interval);
+ *
+ * l->xstate_counter[eventnr-1] = 0;
+ */
+
if (userwrite_lwpcb(l))
BUG();

if (lwpflags & LWP_EVENT_MASK) {
if (lwp_start(l, 1))
BUG();
+ } else {
+ percpu_write(active_lwp_struct, 0);
}

raw_spin_unlock_irqrestore(&l->lock, lockflags);

- /* update cached values */
- lockflags = 0;
- raw_spin_lock_irqsave(&l->lock, lockflags);
- lwpcb_read_buffer(l);
- raw_spin_unlock_irqrestore(&l->lock, lockflags);
-
perf_event_update_userpage(event);
}

@@ -1170,16 +1204,148 @@ static void perf_lwp_del(struct perf_event *event, int flags)
perf_lwp_stop(event, flags);
}

+static int
+lwpcb_report_event(struct lwp_struct *lwps, struct lwp_event *lwp_event,
+ struct pt_regs *regs)
+{
+ u64 period;
+ int overflow, event_idx, ret = 0;
+ struct perf_event *perf_event;
+ struct perf_sample_data data;
+ struct perf_raw_record raw;
+
+ event_idx = lwp_event->event_id - 1;
+ perf_event = lwps->registered_events[event_idx];
+
+ /*

+ * The opposite COULD be a programmed lwp-event (id=255), but we
+ * ignore them for now.
+ */

+ if ((lwp_event->event_id <= LWP_EVENT_INVALID) ||
+ (lwp_event->event_id > lwps->eventmax) ||
+ (!perf_event))
+ return -EINVAL;
+
+ /* update lwps-event-counter */
+ period = lwps->lwpcb.events[event_idx].interval;
+ lwps->event_counter[event_idx] += period;
+
+ /* update sample_period */
+ overflow = lwpcb_update_period(lwps, perf_event, period, period);
+
+ if(overflow) {
+ memset(&data, 0, sizeof(data));
+ perf_sample_data_init(&data, lwp_event->inst_adr);
+
+ if (perf_event->attr.sample_type & PERF_SAMPLE_RAW) {
+ raw.size = sizeof(*lwp_event);
+ raw.data = lwp_event;
+ data.raw = &raw;
+ }
+
+ /* disable event eventually */
+ ret = perf_event_overflow(perf_event, &data, regs);
+ }
+
+ perf_event_update_userpage(perf_event);
+

+ return ret;
+}
+

+static int lwpcb_read_buffer(struct lwp_struct *lwps, struct pt_regs *regs)

+{
+ u32 bho, bto, bz;
+ int count, i;

+ char *buffer = lwps->buffer.buffer_base;
+ size_t eventsize = lwps->eventsize;
+ struct lwp_event *lwp_event;
+
+ bz = lwps->lwpcb.head->buffer_size;
+ bto = lwps->lwpcb.head->buffer_tail_offset;
+
+ /*

+ * the last two checks are to prevent user-manipulations that could
+ * cause damage
+ */

+ if (lwp_read_head_offset(lwps, &bho) || (bho > bz) || (bho % eventsize))
+ BUG();
+
+ count = (((bho - bto) % bz) / eventsize);
+
+ if (userread_buffer(lwps, bto, bho))

+ BUG();
+
+ for (i = 0; i < count; i++) {

+ lwp_event = (struct lwp_event *) (buffer + bto);
+
+ /*
+ * TODO: if lwpcb_report_event returns x > 0, then this event
+ * should be stopped. But this is difficult because we are in
+ * a interrupt. We would have to run perf_lwp_stop and this
+ * function uses xsave/xrestore and other expensive operations.
+ */
+ lwpcb_report_event(lwps, lwp_event, regs);
+
+ bto += eventsize;

+ if (bto >= bz)
+ bto = 0;
+ }
+

+ lwps->lwpcb.head->buffer_tail_offset = bto;
+
+ if (userwrite_buffer_tail_offset(lwps))
+ BUG();

+
+ return 0;
+}
+

static void perf_lwp_read(struct perf_event *event)
{
- struct lwp_struct *l = (struct lwp_struct *) event->hw.config;
- unsigned long flags;
+ /*
+ * TODO: report current counter-states.
+ *
+ * Could be difficult because in the case of a overflow we wouldn't
+ * have a lwp-event to report
+ */
+}

- raw_spin_lock_irqsave(&l->lock, flags);
+static void
+lwp_threshold_handler(struct lwp_struct *lwps, struct pt_regs *regs)
+{
+ unsigned long flags = 0;

- lwpcb_read_buffer(l);
+ raw_spin_lock_irqsave(&lwps->lock, flags);

- raw_spin_unlock_irqrestore(&l->lock, flags);
+ lwpcb_read_buffer(lwps, regs);
+
+ raw_spin_unlock_irqrestore(&lwps->lock, flags);
+}
+
+extern void lwp_threshold_intr1(void);
+
+void lwp_threshold_interrupt(struct pt_regs *regs)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+ struct lwp_struct *lwps = percpu_read(active_lwp_struct);
+
+ ack_APIC_irq();
+
+ exit_idle();
+
+ /* Has to be done, to update timers and for locking. */
+ irq_enter();
+ if(lwps)
+ lwp_threshold_handler(lwps, regs);
+ /*
+ * else {
+ * This is likely a threshold-int triggert by a userspace-
+ * activated lwp.
+ * }
+ */
+
+ irq_exit();
+
+ set_irq_regs(old_regs);
}

static struct pmu perf_lwp_pmu = {
@@ -1239,12 +1405,10 @@ static void lwp_start_cpu(void *c)
msr.cfg.core_id = (u8) smp_processor_id();

/*
- * We currently do not support the threshold-interrupt so
- * bit 31 and [40..47] of msr.msr_value keep 0
- *
- * msr.cfg.allowed_events |= (1U << 31);
- * msr.cfg.interrupt_vector = xxx;
+ * Threshold-Interrrupt-Setup.
*/
+ msr.cfg.allowed_events |= (1U << LWP_CAPS_THRESHOLD);
+ msr.cfg.interrupt_vector = LWP_THRESHOLD_VECTOR;

wrmsrl(MSR_AMD64_LWP_CFG, msr.msr_value);
}
@@ -1280,6 +1444,8 @@ static __init int amd_lwp_init(void)
if (!test_bit(LWP_CAPS_THRESHOLD, &lwp_caps.supported_events))
return -ENODEV;

+ alloc_intr_gate(LWP_THRESHOLD_VECTOR, lwp_threshold_intr1);
+
get_online_cpus();

/*
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 6419bb0..03d47b1 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -966,6 +966,8 @@ apicinterrupt REBOOT_VECTOR \
apicinterrupt UV_BAU_MESSAGE \
uv_bau_message_intr1 uv_bau_message_interrupt
#endif
+apicinterrupt LWP_THRESHOLD_VECTOR \
+ lwp_threshold_intr1 lwp_threshold_interrupt
apicinterrupt LOCAL_TIMER_VECTOR \
apic_timer_interrupt smp_apic_timer_interrupt
apicinterrupt X86_PLATFORM_IPI_VECTOR \

Ingo Molnar

unread,

Dec 18, 2011, 3:10:01 AM12/18/11

* Hans Rosenfeld <hans.ro...@amd.com> wrote:

> From: Benjamin Block <benjami...@amd.com>
>
> Implements a basic integration of LWP into perf. Permits a way
> to create a perf-event that will be backed by LWP. The PMU
> creates the required structures and userspace-memories. The
> PMU also collects the samples from the ring-buffer, but as
> there is currently no interrupt- and overflow-implementation,
> they are not reported (TODO).

Ok, this is a step in the right direction - once the threshold
IRQ flow control mechanism is implemented we are looking at
something that might be mergeable. Any ETA on those bits?

Thanks,

Ingo

Benjamin Block

unread,

Dec 18, 2011, 10:30:03 AM12/18/11

On Sun, 18 Dec 2011 09:04:43 +0100, Ingo Molnar <mi...@elte.hu> wrote:
> * Hans Rosenfeld <hans.ro...@amd.com> wrote:
>
>> From: Benjamin Block <benjami...@amd.com>
>>
>> Implements a basic integration of LWP into perf. Permits a way
>> to create a perf-event that will be backed by LWP. The PMU
>> creates the required structures and userspace-memories. The
>> PMU also collects the samples from the ring-buffer, but as
>> there is currently no interrupt- and overflow-implementation,
>> they are not reported (TODO).
>
> Ok, this is a step in the right direction - once the threshold
> IRQ flow control mechanism is implemented we are looking at
> something that might be mergeable. Any ETA on those bits?
>

The threshold-interrupt is already integrated with patch 5 of this
patch-set.

Maybe I wrote the descriptions a little misleading. Sry for that. :)

best regards,
- Benjamin

Ingo Molnar

unread,

Dec 18, 2011, 6:50:02 PM12/18/11

* Benjamin Block <be...@mageta.org> wrote:

> The threshold-interrupt is already integrated with patch 5 of
> this patch-set.
>
> Maybe I wrote the descriptions a little misleading. Sry for
> that. :)

Okay, i stopped reading at the first patch that claimed that the
threshold irq was not supported ;-)

So the question becomes, how well is it integrated: can
perf 'record -a + perf report', or 'perf top' use LWP,
to do system-wide precise [user-space] profiling and such?

Thanks,

Ingo

Robert Richter

unread,

Dec 19, 2011, 4:10:01 AM12/19/11

On 19.12.11 00:43:10, Ingo Molnar wrote:
> So the question becomes, how well is it integrated: can
> perf 'record -a + perf report', or 'perf top' use LWP,
> to do system-wide precise [user-space] profiling and such?

There is only self-monitoring of a process possible, no kernel and
system-wide profiling. This is because we can not allocate memory
regions in the kernel for a thread other than the current. This would
require a complete rework of mm code.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center

Ingo Molnar

unread,

Dec 19, 2011, 6:00:01 AM12/19/11

* Robert Richter <robert....@amd.com> wrote:

> On 19.12.11 00:43:10, Ingo Molnar wrote:
>
> > So the question becomes, how well is it integrated: can perf
> > 'record -a + perf report', or 'perf top' use LWP, to do
> > system-wide precise [user-space] profiling and such?
>
> There is only self-monitoring of a process possible, no kernel
> and system-wide profiling. This is because we can not allocate
> memory regions in the kernel for a thread other than the
> current. This would require a complete rework of mm code.

Hm, i don't think a rework is needed: check the
vmalloc_to_page() code in kernel/events/ring_buffer.c. Right now
CONFIG_PERF_USE_VMALLOC is an ARM, MIPS, SH and Sparc specific
feature, on x86 it turns on if CONFIG_DEBUG_PERF_USE_VMALLOC=y.

That should be good enough for prototyping the kernel/user
shared buffering approach.

Thanks,

Ingo

Avi Kivity

unread,

Dec 19, 2011, 6:20:02 AM12/19/11

On 12/19/2011 12:54 PM, Ingo Molnar wrote:
> * Robert Richter <robert....@amd.com> wrote:
>
> > On 19.12.11 00:43:10, Ingo Molnar wrote:
> >
> > > So the question becomes, how well is it integrated: can perf
> > > 'record -a + perf report', or 'perf top' use LWP, to do
> > > system-wide precise [user-space] profiling and such?
> >
> > There is only self-monitoring of a process possible, no kernel
> > and system-wide profiling. This is because we can not allocate
> > memory regions in the kernel for a thread other than the
> > current. This would require a complete rework of mm code.
>
> Hm, i don't think a rework is needed: check the
> vmalloc_to_page() code in kernel/events/ring_buffer.c. Right now
> CONFIG_PERF_USE_VMALLOC is an ARM, MIPS, SH and Sparc specific
> feature, on x86 it turns on if CONFIG_DEBUG_PERF_USE_VMALLOC=y.
>
> That should be good enough for prototyping the kernel/user
> shared buffering approach.
>

LWP wants user memory, vmalloc is insufficient. You need do_mmap() with
a different mm.

You could let a workqueue call use_mm() and then do_mmap(). Even then
it is subject to disruption by the monitored thread (and may disrupt the
monitored thread by playing with its address space). This is for thread
monitoring only, I don't think system-wide monitoring is possible with LWP.

--
error compiling committee.c: too many arguments to function

Ingo Molnar

unread,

Dec 19, 2011, 6:50:02 AM12/19/11

* Avi Kivity <a...@redhat.com> wrote:

> On 12/19/2011 12:54 PM, Ingo Molnar wrote:
> > * Robert Richter <robert....@amd.com> wrote:
> >
> > > On 19.12.11 00:43:10, Ingo Molnar wrote:
> > >
> > > > So the question becomes, how well is it integrated: can perf
> > > > 'record -a + perf report', or 'perf top' use LWP, to do
> > > > system-wide precise [user-space] profiling and such?
> > >
> > > There is only self-monitoring of a process possible, no
> > > kernel and system-wide profiling. This is because we can
> > > not allocate memory regions in the kernel for a thread
> > > other than the current. This would require a complete
> > > rework of mm code.
> >
> > Hm, i don't think a rework is needed: check the
> > vmalloc_to_page() code in kernel/events/ring_buffer.c. Right
> > now CONFIG_PERF_USE_VMALLOC is an ARM, MIPS, SH and Sparc
> > specific feature, on x86 it turns on if
> > CONFIG_DEBUG_PERF_USE_VMALLOC=y.
> >
> > That should be good enough for prototyping the kernel/user
> > shared buffering approach.
>
> LWP wants user memory, vmalloc is insufficient. You need
> do_mmap() with a different mm.

Take a look at PERF_USE_VMALLOC, it allows in-kernel allocated
memory to be mmap()ed to user-space. It is basically a
shared/dual user/kernel mode vmalloc implementation.

So all the conceptual pieces are there.

> You could let a workqueue call use_mm() and then do_mmap().
> Even then it is subject to disruption by the monitored thread
> (and may disrupt the monitored thread by playing with its

> address space). [...]

Injecting this into another thread's context is indeed advanced
stuff:

> [...] This is for thread monitoring only, I don't think

> system-wide monitoring is possible with LWP.

That should be possible too, via two methods:

1) the easy hack: a (per cpu) vmalloc()ed buffer is made ring 3
accessible (by clearing the system bit in the ptes) - and
thus accessible to all user-space.

This is obviously globally writable/readable memory so only a
debugging/prototyping hack - but would be a great first step
to prove the concept and see some nice perf top and perf
record results ...

2) the proper solution: creating a 'user-space vmalloc()' that
is per mm and that gets inherited transparently, across
fork() and exec(), and which lies outside the regular vma
spaces. On 64-bit this should be straightforward.

These vmas are not actually 'known' to user-space normally -
the kernel PMU code knows about it and does what we do with
PEBS: flushes it when necessary and puts it into the
regular perf event channels.

This solves the inherited perf record workflow immediately:
the parent task just creates the buffer, which gets inherited
across exec() and fork(), into every portion of the workload.

System-wide profiling is a small additional variant of this:
creating such a user-vmalloc() area for all tasks in the
system so that the PMU code has them ready in the
context-switch code.

Solution #2 has the additional advantage that we could migrate
PEBS to it and could allow interested user-space access to the
'raw' PEBS buffer as well. (currently the PEBS buffer is only
visible to kernel-space.)

I'd suggest the easy hack first, to get things going - we can
then help out with the proper solution.

Thanks,

Ingo

Avi Kivity

unread,

Dec 19, 2011, 7:00:03 AM12/19/11

On 12/19/2011 01:40 PM, Ingo Molnar wrote:
>
> 2) the proper solution: creating a 'user-space vmalloc()' that
> is per mm and that gets inherited transparently, across
> fork() and exec(), and which lies outside the regular vma
> spaces. On 64-bit this should be straightforward.

That probably has uses outside perf too, but I can see mm nacks piling up.

> These vmas are not actually 'known' to user-space normally -
> the kernel PMU code knows about it and does what we do with
> PEBS: flushes it when necessary and puts it into the
> regular perf event channels.
>
> This solves the inherited perf record workflow immediately:
> the parent task just creates the buffer, which gets inherited
> across exec() and fork(), into every portion of the workload.

The buffer still needs to be managed. While you may be able to juggle
different threads on the same cpu using different events, threads on
other cpus need to use separate LWP contexts and buffers.

>
> System-wide profiling is a small additional variant of this:
> creating such a user-vmalloc() area for all tasks in the
> system so that the PMU code has them ready in the
> context-switch code.

What about security? Do we want to allow any userspace process to mess
up the buffers? It can even reprogram the LWP block, so you're counting
different things, or at higher frequencies, or into other processes
ordinary vmas?

You could rebuild the LWP block on every context switch I guess, but you
need to prevent access to other cpus' LWP blocks (since they may be
running other processes). I think this calls for per-cpu cr3, even for
threads in the same process.

> Solution #2 has the additional advantage that we could migrate
> PEBS to it and could allow interested user-space access to the
> 'raw' PEBS buffer as well. (currently the PEBS buffer is only
> visible to kernel-space.)

That's probably useful for jits.

> I'd suggest the easy hack first, to get things going - we can
> then help out with the proper solution.

I think you're underestimating the complexity there. LWP wasn't
designed for this.

--
error compiling committee.c: too many arguments to function

Benjamin

unread,

Dec 19, 2011, 1:40:02 PM12/19/11

Am 19.12.2011 12:58, schrieb Avi Kivity:
>> I'd suggest the easy hack first, to get things going - we can
>> then help out with the proper solution.
> I think you're underestimating the complexity there. LWP wasn't
> designed for this.
>

LWP is highly limited in its ability's to support more than one
"LWP-Instance" being active for a thread, IOW it is not possible.
You can't activate LWP from a threads context and simultaneously
activate lwp-system-wide-profiling in the way you suggested it,
Ingo. Either do the first xor do the last, because you only have
one xsave-area/msr/lwpcb that is read by the hardware and only one
LWP-Buffer that is written by the hw.

So, if one thread is running LWP, because he wants to
(selfmonitoring and stuff [like for what lwp was designed]) and a
su or u would activate this system-wide-monitoring, both would
frequently interfere with the each other. I don't think you want
this to be possible at all.

Frankly, it was already a pain to get LWP running from in-kernel,
like it is done now. I would expect a much higher pain, if you
would want to do this with a transparent buffer, that gets passed
around each scheduling (and this would permanently eliminate the
"lightweight" in "LWP").

best regards,
- Benjamin

Ingo Molnar

unread,

Dec 20, 2011, 4:00:02 AM12/20/11

* Benjamin <be...@mageta.org> wrote:

> LWP is highly limited in its ability's to support more than
> one "LWP-Instance" being active for a thread, IOW it is not
> possible.

That's OK, we can deal with various PMU constraints just fine.

> You can't activate LWP from a threads context and
> simultaneously activate lwp-system-wide-profiling in the way
> you suggested it, Ingo. Either do the first xor do the last,

We have other PMU resources that are exclusive in that sense.

> because you only have one xsave-area/msr/lwpcb that is read by
> the hardware and only one LWP-Buffer that is written by the
> hw.

That's similar to PEBS (which we already support), there's only
one Debug Store per CPU, obviously.

> So, if one thread is running LWP, because he wants to
> (selfmonitoring and stuff [like for what lwp was designed])
> and a su or u would activate this system-wide-monitoring, both
> would frequently interfere with the each other. I don't think
> you want this to be possible at all.

THe LWPCB is designed to allow multiple events, and the LWP
ring-buffer is shared between these events.

If the kernel properly manages the lwpcb then no such
'interference' happens during normal use - both outside and
self-installed events can be activated at once, up to the event
limit - similar to how we handle regular PMU events.

[ This is why the threshold IRQ support i requested is key: it
is needed for the flow of events and for the kernel
event-demultiplexer to work transparently. ]

> Frankly, it was already a pain to get LWP running from
> in-kernel, like it is done now. I would expect a much higher
> pain, if you would want to do this with a transparent buffer,
> that gets passed around each scheduling (and this would
> permanently eliminate the "lightweight" in "LWP").

There's no heavyweight 'passing around' of a buffer needed at
context switch time. The buffer context has to be flipped - part
of the job of context switching.

So no, i don't think any of your objections have any merit.

Thanks,

Ingo

Ingo Molnar

unread,

Dec 20, 2011, 4:20:03 AM12/20/11

* Avi Kivity <a...@redhat.com> wrote:

> On 12/19/2011 01:40 PM, Ingo Molnar wrote:
> >
> > 2) the proper solution: creating a 'user-space vmalloc()' that
> > is per mm and that gets inherited transparently, across
> > fork() and exec(), and which lies outside the regular vma
> > spaces. On 64-bit this should be straightforward.
>
> That probably has uses outside perf too, but I can see mm nacks piling up.

This can be done in arch/x86/ code if it's too x86 specific -
the platform controls the VM layout and can (and does) use
special per CPU VM areas.

> > These vmas are not actually 'known' to user-space
> > normally - the kernel PMU code knows about it and does
> > what we do with PEBS: flushes it when necessary and puts
> > it into the regular perf event channels.
> >
> > This solves the inherited perf record workflow
> > immediately: the parent task just creates the buffer,
> > which gets inherited across exec() and fork(), into every
> > portion of the workload.
>

> The buffer still needs to be managed. [...]

Of course, like we manage the DS buffer for PEBS.

> [...] While you may be able to juggle different threads on

> the same cpu using different events, threads on other cpus
> need to use separate LWP contexts and buffers.

Yes, like different threads on different CPUs have different DS
buffers, *here and today*.

Try this on (most) modern Intel CPUs:

perf top -e cycles:pp

That will activate that exact mechanism.

The LWPCB and the LWP ring-buffer are really just an extension
of that concept: per task buffers which are ring 3 visible.

Note that user-space does not actually have to know about any of
these LWP addresses (but can access them if it wants to - no
strong feelings about that) - in the correctly implemented model
it's fully kernel managed.

In fact the PEBS case had one more complication: there's the BTS
branch-tracing feature which we support as well, and which
overlaps PEBS use of the DS.

All these PMU hardware limitations can be supported, as long as
the instrumentation *capability* adds value to the system in one
way or another.

> > System-wide profiling is a small additional variant of
> > this: creating such a user-vmalloc() area for all tasks
> > in the system so that the PMU code has them ready in the
> > context-switch code.
>
> What about security? Do we want to allow any userspace
> process to mess up the buffers? It can even reprogram the LWP
> block, so you're counting different things, or at higher
> frequencies, or into other processes ordinary vmas?

In most usecases it's the application messing up its own
profiling - don't do that if it hurts.

I'd argue that future LWP versions should allow kernel-protected
LWP pages, as long as the LWPCB is privileged as well as well.
That would be useful for another purpose as well: LWP could be
allowed to sample kernel-space execution as well, an obviously
useful feature that was left out from LWP for barely explicable
reasons.

Granted, LWP was mis-designed to quite a degree, those AMD chip
engineers should have talked to people who understand how modern
PMU abstractions are added to the OS kernel properly. But this
mis-design does not keep us from utilizing this piece of
hardware intelligently. PEBS/DS/BTS wasnt a beauty either.

> You could rebuild the LWP block on every context switch I
> guess, but you need to prevent access to other cpus' LWP
> blocks (since they may be running other processes). I think
> this calls for per-cpu cr3, even for threads in the same
> process.

Why would we want to rebuild the LWPCB? Just keep one per task
and do a lightweight switch to it during switch_to() - like we
do it with the PEBS hardware-ring-buffer. It can be in the same
single block of memory with the ring-buffer itself. (PEBS has
similar characteristics)

Thanks,

Ingo

Avi Kivity

unread,

Dec 20, 2011, 4:50:02 AM12/20/11

On 12/20/2011 11:15 AM, Ingo Molnar wrote:
> The LWPCB and the LWP ring-buffer are really just an extension
> of that concept: per task buffers which are ring 3 visible.

No, it's worse. They are ring 3 writeable, and ring 3 configurable.

> Note that user-space does not actually have to know about any of
> these LWP addresses (but can access them if it wants to - no
> strong feelings about that) - in the correctly implemented model
> it's fully kernel managed.

btw, that means that the intended use case - self-monitoring with no
kernel support - cannot be done. That's not an issue per se, it depends
on the cost of the kernel support and whether any information is lost
(like the records inserted by the explicit LWP instructions).

> In fact the PEBS case had one more complication: there's the BTS
> branch-tracing feature which we support as well, and which
> overlaps PEBS use of the DS.

(semi-related: both DS and LWP cannot be used by kvm to monitor a guest
from the host, since they both use virtual addresses)

> All these PMU hardware limitations can be supported, as long as
> the instrumentation *capability* adds value to the system in one
> way or another.
>
> > > System-wide profiling is a small additional variant of
> > > this: creating such a user-vmalloc() area for all tasks
> > > in the system so that the PMU code has them ready in the
> > > context-switch code.
> >
> > What about security? Do we want to allow any userspace
> > process to mess up the buffers? It can even reprogram the LWP
> > block, so you're counting different things, or at higher
> > frequencies, or into other processes ordinary vmas?
>
> In most usecases it's the application messing up its own
> profiling - don't do that if it hurts.

Not in the system profiling case (not that anything truly bad will
happen, but it's not nice to have the kernel supplying data it can't trust).

> I'd argue that future LWP versions should allow kernel-protected
> LWP pages, as long as the LWPCB is privileged as well as well.
> That would be useful for another purpose as well: LWP could be
> allowed to sample kernel-space execution as well, an obviously
> useful feature that was left out from LWP for barely explicable
> reasons.
>
> Granted, LWP was mis-designed to quite a degree, those AMD chip
> engineers should have talked to people who understand how modern
> PMU abstractions are added to the OS kernel properly. But this
> mis-design does not keep us from utilizing this piece of
> hardware intelligently. PEBS/DS/BTS wasnt a beauty either.

LWP was clearly designed for userspace jits, and clearly designed to
work with minimal kernel support. For this use case, it wasn't
mis-designed. Maybe they designed for the wrong requirements and
constraints (for example, it is much harder to get PMU abstractions into
Windows than into Linux), but within those requirements, it appears to
be well done.

I'm worried that shoe-horning LWP into the system profiling role will
result in poor support for that role, *and* prevent its use in the
intended use case.

> > You could rebuild the LWP block on every context switch I
> > guess, but you need to prevent access to other cpus' LWP
> > blocks (since they may be running other processes). I think
> > this calls for per-cpu cr3, even for threads in the same
> > process.
>
> Why would we want to rebuild the LWPCB? Just keep one per task
> and do a lightweight switch to it during switch_to() - like we
> do it with the PEBS hardware-ring-buffer. It can be in the same
> single block of memory with the ring-buffer itself. (PEBS has
> similar characteristics)

If it's in globally visible memory, the user can reprogram the LWP from
another thread to thrash ordinary VMAs. It has to be process local (at
which point, you can just use do_mmap() to allocate it).

--
error compiling committee.c: too many arguments to function

Ingo Molnar

unread,

Dec 20, 2011, 5:20:02 AM12/20/11

* Avi Kivity <a...@redhat.com> wrote:

> On 12/20/2011 11:15 AM, Ingo Molnar wrote:
>
> > The LWPCB and the LWP ring-buffer are really just an
> > extension of that concept: per task buffers which are ring 3
> > visible.
>
> No, it's worse. They are ring 3 writeable, and ring 3
> configurable.

Avi, i know that very well.

> > Note that user-space does not actually have to know about
> > any of these LWP addresses (but can access them if it wants
> > to - no strong feelings about that) - in the correctly
> > implemented model it's fully kernel managed.
>
> btw, that means that the intended use case - self-monitoring

> with no kernel support - cannot be done. [...]

Arguably many years ago the hardware was designed for brain-dead
instrumentation abstractions.

Note that as i said user-space *can* acccess the area if it
thinks it can do it better than the kernel (and we could export
that information in a well defined way - we could do the same
for PEBS as well) - i have no particular strong feelings about
allowing that other than i think it's an obviously inferior
model - *as long* as proper, generic, usable support is added.

From my perspective there's really just one realistic option to
accept this feature: if it's properly fit into existing, modern
instrumentation abstractions. I made that abundantly clear in my
feedback so far.

It can obviously be done, alongside the suggestions i've given.

That was the condition for Intel PEBS/DS/BTS support as well -
which is hardware that has at least as many brain-dead
constraints and roadblocks as LWP.

> > > You could rebuild the LWP block on every context switch I
> > > guess, but you need to prevent access to other cpus' LWP
> > > blocks (since they may be running other processes). I
> > > think this calls for per-cpu cr3, even for threads in the
> > > same process.
> >
> > Why would we want to rebuild the LWPCB? Just keep one per
> > task and do a lightweight switch to it during switch_to() -
> > like we do it with the PEBS hardware-ring-buffer. It can be
> > in the same single block of memory with the ring-buffer
> > itself. (PEBS has similar characteristics)
>
> If it's in globally visible memory, the user can reprogram the

> LWP from another thread to thrash ordinary VMAs. [...]

User-space can smash it and make it not profile or profile the
wrong thing or into the wrong buffer - but LWP itself runs with
ring3 privileges so it won't do anything the user couldnt do
already.

Lack of protection against self-misconfiguration-damage is a
benign hardware mis-feature - something for LWP v2 to specify i
guess.

But i don't want to reject this feature based on this
mis-feature alone - it's a pretty harmless limitation and the
precise, skid-less profiling that LWP offers is obviously
useful.

> [...] It has to be process local (at which point, you can

> just use do_mmap() to allocate it).

get_unmapped_area() + install_special_mapping() is probably
better, but yeah.

Thanks,

Ingo

Joerg Roedel

unread,

Dec 20, 2011, 10:30:03 AM12/20/11

Hi Ingo,

On Tue, Dec 20, 2011 at 11:09:17AM +0100, Ingo Molnar wrote:
> > No, it's worse. They are ring 3 writeable, and ring 3
> > configurable.
>
> Avi, i know that very well.

So you agree that your ideas presented in this thread of integrating LWP
into perf have serious security implications?

> > btw, that means that the intended use case - self-monitoring
> > with no kernel support - cannot be done. [...]
>
> Arguably many years ago the hardware was designed for brain-dead
> instrumentation abstractions.

The point of LWP design is, that it doesn't require abstractions except
for the threshold interrupt.

I am fine with integrating LWP into perf as long as it makes sense and
does not break the intended usage scenario for LWP.

[ Because LWP is a user-space feature and designed as such,
forcing it into an abstraction makes software that uses LWP
unportable. ]

But Ingo, the ideas you presented in this thread are clearly no-gos.
Having a shared per-cpu buffer for LWP data that is read by perf
obviously has very bad security implications, as Avi already pointed
out. It also destroys the intended use-case for LWP because it disturbs
any process that is doing self-profiling with LWP.

> Note that as i said user-space *can* acccess the area if it
> thinks it can do it better than the kernel (and we could export
> that information in a well defined way - we could do the same
> for PEBS as well) - i have no particular strong feelings about
> allowing that other than i think it's an obviously inferior
> model - *as long* as proper, generic, usable support is added.

LWP can't be compared in any serious way with PEBS. The only common
thing is the hardware-managed ring-buffer. But PEBS is an addition to
MSR based performance monitoring resources (for which a kernel
abstraction makes a lot of sense) and can only be controlled from ring 0
while LWP is a complete user-space controlled PMU which has no link at
all to the MSR-based, ring 0 controlled PMU.

> From my perspective there's really just one realistic option to
> accept this feature: if it's properly fit into existing, modern
> instrumentation abstractions. I made that abundantly clear in my
> feedback so far.

The threshold interrupt fits well into the perf-abstraction layer. Even
self-monitoring of processes does, and Hans posted patches from Benjamin
for that. What do you think about this approach?

> User-space can smash it and make it not profile or profile the
> wrong thing or into the wrong buffer - but LWP itself runs with
> ring3 privileges so it won't do anything the user couldnt do
> already.

The point is, if user-space re-programs LWP it will continue to write
its samples to the new ring-buffer virtual-address set up by user-space.
It will still use that virtual address in another address-space after a
task-switch. This allows processes to corrupt memory of other processes.
There are ways to hack around that but these have a serious impact on
task-switch costs so this is also no way to go.

> Lack of protection against self-misconfiguration-damage is a
> benign hardware mis-feature - something for LWP v2 to specify i
> guess.

So what you are saying is (not just here, also in other emails in this
thread) that every hardware not designed for perf is crap?

> get_unmapped_area() + install_special_mapping() is probably
> better, but yeah.

get_unmapped_area() only works on current. So it can't be used for
that purpose too. Please believe me, we considered and evaluated a lot
of ways to install a mapping into a different process, but none of them
worked out. It is clearly not possible in a sane way without major
changes to the VMM code. Feel free to show us a sane way if you disagree
with that.

So okay, where are we now? We have patches from Hans that make LWP
mostly usable in the way it is intended for. There are already a lot of
people waiting for this to support LWP in the kernel (and they want to
use it in the intended way, not via perf). And we have patches from
Benjamin adding the missing threshold interrupt and a self-monitoring
abstraction of LWP for perf. Monitoring other processes using perf is
not possible because we can't reliably install a mapping into another
process. System wide monitoring has bad security implications and
destroys the intended use-cases. So as I see it, the only abstraction
for integrating LWP into perf that is feasible is posted in this thread.
Can we agree to focus on the posted approach?

Thanks,

Joerg

Vince Weaver

unread,

Dec 20, 2011, 10:50:02 AM12/20/11

On Tue, 20 Dec 2011, Ingo Molnar wrote:
> Granted, LWP was mis-designed to quite a degree, those AMD chip
> engineers should have talked to people who understand how modern
> PMU abstractions are added to the OS kernel properly.

You do realize that LWP was probably in design 5+ years ago, at a time
when most Linux kernel developers wanted nothing to do with perf counters,
and thus anyone they did contact for help would have been from the
since-rejected perfctr or perfmon2 camp.

Also, I'm sure Linux isn't the only Operating System that they had in mind
when designing this functionality.

Running LWP through the kernel is a foolish idea. Does anyone have any
numbers on what that would do to overhead?

perf_events creates huge overhead when doing self monitoring. For simple
self-monintoring counter reads it is an *order of magnitude* worse than
doing the same thing with perfctr.
(see numbers here if you don't believe me:
http://web.eecs.utk.edu/~vweaver1/projects/perf-events/benchmarks/rdtsc_overhead/ )

Vince

Ingo Molnar

unread,

Dec 20, 2011, 1:40:02 PM12/20/11

* Vince Weaver <vwea...@eecs.utk.edu> wrote:

> On Tue, 20 Dec 2011, Ingo Molnar wrote:

> > Granted, LWP was mis-designed to quite a degree, those AMD
> > chip engineers should have talked to people who understand
> > how modern PMU abstractions are added to the OS kernel
> > properly.
>
> You do realize that LWP was probably in design 5+ years ago,
> at a time when most Linux kernel developers wanted nothing to
> do with perf counters, and thus anyone they did contact for
> help would have been from the since-rejected perfctr or
> perfmon2 camp.

That does not really contradict what i said.

> Also, I'm sure Linux isn't the only Operating System that they
> had in mind when designing this functionality.
>
> Running LWP through the kernel is a foolish idea. Does anyone
> have any numbers on what that would do to overhead?

At most an LLWPCB instruction is needed.

> perf_events creates huge overhead when doing self monitoring.
> For simple self-monintoring counter reads it is an *order of
> magnitude* worse than doing the same thing with perfctr.

Only if you are comparing apples to oranges: if you compare a
full kernel based read of self-profiling counters with an RDPMC
instruction.

But as we told you previously, you could use RDPMC under perf as
well, last i checked PeterZ posted experimental patches for
that. Peter, what's the status of that?

Thanks,

Ingo

Ingo Molnar

unread,

Dec 20, 2011, 1:50:01 PM12/20/11

* Joerg Roedel <jo...@8bytes.org> wrote:

> Hi Ingo,
>
> On Tue, Dec 20, 2011 at 11:09:17AM +0100, Ingo Molnar wrote:
> > >
> > > No, it's worse. They are ring 3 writeable, and ring 3
> > > configurable.
> >
> > Avi, i know that very well.
>
> So you agree that your ideas presented in this thread of
> integrating LWP into perf have serious security implications?

No, i do not agree at all - you are drastically misrepresending
my position.

> > > btw, that means that the intended use case -
> > > self-monitoring with no kernel support - cannot be done.
> > > [...]
> >
> > Arguably many years ago the hardware was designed for
> > brain-dead instrumentation abstractions.
>
> The point of LWP design is, that it doesn't require
> abstractions except for the threshold interrupt.
>
> I am fine with integrating LWP into perf as long as it makes
> sense and does not break the intended usage scenario for LWP.

That's the wrong way around - in reality we'll integrate LWP
upstream only once it makes sense and works well with the
primary instrumentation abstraction we have in the kernel.

Otherwise my "sorry, it's not convincing enough yet" NAK against
the new feature stands.

In fact as per Linus's rules about new kernel features,
maintainers don't even have to justify NAK's by offering an
implementation roadmap that would make the feature acceptable.

Me or PeterZ could just say "this feature is too limited and not
convincing enough yet, sorry".

*You* who are pushing the feature have to convince the objecting
maintainer that the feature is worth integrating.

But i'm being nice and helpful here by giving you a rough
technical outline of how you could overcome my "sorry, this is
not convincing in its current form yet" rejection of the current
LWP patches.

> [ Because LWP is a user-space feature and designed as such,
> forcing it into an abstraction makes software that uses LWP
> unportable. ]
>
> But Ingo, the ideas you presented in this thread are clearly
> no-gos.

Nonsense.

> Having a shared per-cpu buffer for LWP data that is read by
> perf obviously has very bad security implications, as Avi

> already pointed out. [...]

Stop this stupidity already!

There's no "security implications" whatsoever. LWP is a ring-3
hw feature and it can do nothing that the user-space app could
not already do ...

> [...] It also destroys the intended use-case for LWP because

> it disturbs any process that is doing self-profiling with LWP.

Why would it destroy that? Self-profiling can install events
just fine, the kernel will arbitrate the resource.

The 'intended usecase' is meaningless to me - it was done in
some closed process apparently not talking to anyone who knows a
bit about Linux instrumentation. If you want this code upstream
then you need to convince me that the feature makes sense in the
general and current scheme of things.

I've outlined the (rather workable) technical roadmap for that.

> > Note that as i said user-space *can* acccess the area if it
> > thinks it can do it better than the kernel (and we could
> > export that information in a well defined way - we could do
> > the same for PEBS as well) - i have no particular strong
> > feelings about allowing that other than i think it's an
> > obviously inferior model - *as long* as proper, generic,
> > usable support is added.
>
> LWP can't be compared in any serious way with PEBS. The only

> common thing is the hardware-managed ring-buffer. [...]

Which ring-buffer is actually happens to be one of the main
things that has to be managed ...

> [...] But PEBS is an addition to MSR based performance

> monitoring resources (for which a kernel abstraction makes a
> lot of sense) and can only be controlled from ring 0 while LWP
> is a complete user-space controlled PMU which has no link at
> all to the MSR-based, ring 0 controlled PMU.

It's a ring-3 controlled PMU feature, not a user-space PMU
feature. It *can* be controlled by user-space - but it obviously
can also (and i argue, it should be) - managed by the kernel,
under Linux.

The kernel is running ring-3 code as well, and it's managing
ring-3 accessible resources as well, there's nothing new about
that.

> > From my perspective there's really just one realistic option
> > to accept this feature: if it's properly fit into existing,
> > modern instrumentation abstractions. I made that abundantly
> > clear in my feedback so far.
>
> The threshold interrupt fits well into the perf-abstraction
> layer. Even self-monitoring of processes does, and Hans posted
> patches from Benjamin for that. What do you think about this
> approach?

As as i said it's a promising first step - although the
discussion here convinced me that it needs to be even more
feature complete, i don't really see that you guys understand
how such things should be implemented.

You seem to be dead set on supporting a weird special case
'intended workload' while forgetting the *much* more common
profilin workloads we have under Linux.

I don't mind supporting weird stuff as well, but you have to
keep the common case in mind ...

I'd like to see the ring-buffer and the events managed by the
kernel too, at least so that perf record works fine with this
PMU feature.

> > User-space can smash it and make it not profile or profile
> > the wrong thing or into the wrong buffer - but LWP itself
> > runs with ring3 privileges so it won't do anything the user
> > couldnt do already.
>
> The point is, if user-space re-programs LWP it will continue
> to write its samples to the new ring-buffer virtual-address
> set up by user-space. It will still use that virtual address
> in another address-space after a task-switch. This allows

> processes to corrupt memory of other processes. [...]

That's nonsense. As i said it my previous mail the LWPC should
be per task and switched on task switch - just like the DS/PEBS
context is.

> [...] There are ways to hack around that but these have a

> serious impact on task-switch costs so this is also no way to
> go.

We are seeing no problems with this approach under PEBS.

> > Lack of protection against self-misconfiguration-damage is a
> > benign hardware mis-feature - something for LWP v2 to
> > specify i guess.
>
> So what you are saying is (not just here, also in other emails
> in this thread) that every hardware not designed for perf is
> crap?

No - PMU hardware designed to not allow the profiling of the
kernel is obviously a crappy aspect of it. Also, PMU hardware
that does not allow 100% encapsulation by the kernel is
obviously not very wisely done either.

Those limitations are not a big problem for usable Linux support
- and future iterations of the LWP hardware can trivially
address those shortcomings.

> > get_unmapped_area() + install_special_mapping() is probably
> > better, but yeah.
>

> get_unmapped_area() only works on current. [...]

Which is a perfectly fine first step to support the
'perf record' inheritance-tree case - which is a very
common profiling method.

> [...] So it can't be used for that purpose too. [...]

Hey, i wrote bits of get_unmapped_area(), way back. I had code
on my machine that inserted vmas into other tasks's address
spaces and can confirm that it works. Do you take my word for it
that it's possible?

Firstly, the perf record case - which is an important, primary
workflow - can work with the code as-is just fine.

Secondly, for system-wide profiling vmas can be inserted into
another task's mm context just fine as well: technically we do
that all the time, when a threaded program is running.

Inserting a vma into another task's mm where that mm is not ours
is indeed not typical, but not unprecedented either, UML patches
did that a couple of years ago. (In fact the upcoming uprobes
patches are doing something far more intrusive.)

The VM modification is trivial AFAICS: an 'mm' parameter has to
be added to a new do_mmap() variant, that's all - the code is
already SMP-safe, due to the threaded case.

Otherwise using another task's mm is safe if you acquire it via
get_task_mm()/mmput().

[ Sidenote: as a bonus this would put infrastructure in place to
have user-space accessible trace buffers, insertable via
the LWPINS instruction and recoverable via the regular kernel
perf event processing facilities. LWP has more potential than
just self-profiling, if we use the right abstractions... ]

Thanks,

Ingo

Vince Weaver

unread,

Dec 20, 2011, 5:50:01 PM12/20/11

On Tue, 20 Dec 2011, Ingo Molnar wrote:

>
> * Vince Weaver <vwea...@eecs.utk.edu> wrote:
>
> > On Tue, 20 Dec 2011, Ingo Molnar wrote:
>
> > > Granted, LWP was mis-designed to quite a degree, those AMD
> > > chip engineers should have talked to people who understand
> > > how modern PMU abstractions are added to the OS kernel
> > > properly.
> >
> > You do realize that LWP was probably in design 5+ years ago,
> > at a time when most Linux kernel developers wanted nothing to
> > do with perf counters, and thus anyone they did contact for
> > help would have been from the since-rejected perfctr or
> > perfmon2 camp.
>
> That does not really contradict what i said.

Well I'm just assuming that when you say "people who understand

how modern PMU abstractions are added to the OS kernel properly"

you mean yourself and the perf_event crew.

There are many other schools of thought on what kernel PMU abstractions
should look like, and I'm sure AMD conferred with them.

> > Running LWP through the kernel is a foolish idea. Does anyone
> > have any numbers on what that would do to overhead?
>
> At most an LLWPCB instruction is needed.

you're saying that all the crazy kernel stuff you're proposing will have
no extra overhead when compared to just implementing the proper xsave
context switch code?

> > perf_events creates huge overhead when doing self monitoring.
> > For simple self-monintoring counter reads it is an *order of
> > magnitude* worse than doing the same thing with perfctr.
>
> Only if you are comparing apples to oranges: if you compare a
> full kernel based read of self-profiling counters with an RDPMC
> instruction.

The benchmarks I posted show measurements getting *real data* from the
counters. Yes, on perfctr this is mostly just a rdpmc call plus a quick
access to some mmap'd memory to make sure the context is valid.

perfctr is an order of magnitude less overhead because it was designed
from the beginning to be a very low-overhead way to get self-monitoring
data. A lot of time and tuning was spent getting it that fast.

perf_event throws everything and the kitchen sink in the the kernel. I'm
guessing low-overhead self-monitoring was not really one of your primary
design goals, and it shows.

> But as we told you previously, you could use RDPMC under perf as
> well, last i checked PeterZ posted experimental patches for
> that. Peter, what's the status of that?

yes. If you checked the benchmark results I showed, you'd have seen that
I run tests against that patchset too, and it's really only marginally
better that the current perf_event stuff. I might have written the
benchmark poorly, but that's mainly because as-posted the documentation
for how to use that patchset is a bit unclear.

Vince
vwea...@eecs.utk.edu

Joerg Roedel

unread,

Dec 20, 2011, 7:10:02 PM12/20/11

On Tue, Dec 20, 2011 at 07:40:04PM +0100, Ingo Molnar wrote:

> > I am fine with integrating LWP into perf as long as it makes
> > sense and does not break the intended usage scenario for LWP.
>
> That's the wrong way around - in reality we'll integrate LWP
> upstream only once it makes sense and works well with the
> primary instrumentation abstraction we have in the kernel.

I still don't see why you want an abstraction for a hardware feature
that clearly doesn't need it. From an enablement perspective LWP is much
closer to AVX than to the MSR based PMU. And nobody really wants or
needs a kernel abstraction for AVX, no?

> Me or PeterZ could just say "this feature is too limited and not
> convincing enough yet, sorry".

This statement shows very clearly the bottom-line of our conflict. You
see this as a perf-topic, for everyone else it is an x86 topic.

> But i'm being nice and helpful here [...]

And I appreciate the discussion. But we have fundamentally different
stand-points. I hope we can come to an agreement.

> There's no "security implications" whatsoever. LWP is a ring-3
> hw feature and it can do nothing that the user-space app could
> not already do ...

Really? How could an application count DCache misses today without
instrumentation? I guess your answer is 'with perf', but LWP is a much
more light-weight way to do that because it works _completly_ in
hardware when the kernel supports context-switching it.

>
> > [...] It also destroys the intended use-case for LWP because
> > it disturbs any process that is doing self-profiling with LWP.
>
> Why would it destroy that? Self-profiling can install events
> just fine, the kernel will arbitrate the resource.

Because you can't reliably hand over the LWPCB management to the kernel.
The instruction to load a new LWPCB is executable in ring-3. Any
kernel-use of LWP will never be reliable.

> > So what you are saying is (not just here, also in other emails
> > in this thread) that every hardware not designed for perf is
> > crap?
>
> No - PMU hardware designed to not allow the profiling of the
> kernel is obviously a crappy aspect of it. Also, PMU hardware
> that does not allow 100% encapsulation by the kernel is
> obviously not very wisely done either.

Why? Whats wrong with user-space having control over its own PMU in a
safe way? This is what the feature was designed for.

Thanks,

Joerg

Gleb Natapov

unread,

Dec 21, 2011, 6:50:03 AM12/21/11

On Tue, Dec 20, 2011 at 07:40:04PM +0100, Ingo Molnar wrote:

> > The point is, if user-space re-programs LWP it will continue
> > to write its samples to the new ring-buffer virtual-address
> > set up by user-space. It will still use that virtual address
> > in another address-space after a task-switch. This allows
> > processes to corrupt memory of other processes. [...]
>
> That's nonsense. As i said it my previous mail the LWPC should
> be per task and switched on task switch - just like the DS/PEBS
> context is.
>

Is it? Looking at arch/x86/kernel/cpu/perf_event_intel_ds.c it seems
like DS is per cpu, not per task.

--
Gleb.

Ingo Molnar

unread,

Dec 21, 2011, 7:10:02 AM12/21/11

* Vince Weaver <vwea...@eecs.utk.edu> wrote:

> > But as we told you previously, you could use RDPMC under
> > perf as well, last i checked PeterZ posted experimental
> > patches for that. Peter, what's the status of that?
>
> yes. If you checked the benchmark results I showed, you'd
> have seen that I run tests against that patchset too, and it's
> really only marginally better that the current perf_event

> stuff. I might have written the benchmark poorly, [...]

It is significantly faster for the self-monitoring case - which
is a pretty niche usecase btw.

Have a look at how the 'perf test' self-test utilizes RDPMC in
these commits in tip:perf/fast:

08aa0d1f376e: perf tools: Add x86 RDPMC, RDTSC test
e3f3541c19c8: perf: Extend the mmap control page with time (TSC) fields
0c9d42ed4cee: perf, x86: Provide means for disabling userspace RDPMC
fe4a330885ae: perf, x86: Implement user-space RDPMC support, to allow fast, user-space access to self-monitoring counters
365a4038486b: perf: Fix mmap_page::offset computation
35edc2a5095e: perf, arch: Rework perf_event_index()
9a0f05cb3688: perf: Update the mmap control page on mmap()

You can find these commits in today's -tip. Overhead should be
somewhere around 50 cycles per call (i suspect it could
optimized more), which is a fraction of what a syscall is
costing.

> [...] but that's mainly because as-posted the documentation

> for how to use that patchset is a bit unclear.

In your world there's always someone else to blame.

The thing is, *you* are interested in this niche feature, PeterZ
not so much.

You made a false claim that perf cannot use RDPMC and PeterZ has
proven you wrong once again. Your almost non-stop whining and
the constant misrepresentations you make are not very
productive.

Thanks,

Ingo

Ingo Molnar

unread,

Dec 21, 2011, 7:40:02 AM12/21/11

* Joerg Roedel <jo...@8bytes.org> wrote:

> On Tue, Dec 20, 2011 at 07:40:04PM +0100, Ingo Molnar wrote:
>
> > > I am fine with integrating LWP into perf as long as it makes
> > > sense and does not break the intended usage scenario for LWP.
> >
> > That's the wrong way around - in reality we'll integrate LWP
> > upstream only once it makes sense and works well with the
> > primary instrumentation abstraction we have in the kernel.
>
> I still don't see why you want an abstraction for a hardware

> feature [...]

Because if done properly then Linux users and developers will be
able to utilize the hardware feature well beyond the limited
scope these patches are giving it.

A couple of examples:

1) This command:

perf record -e lwp:instructions ./myapp

will be possible and will be able to do skid-less profiling.

2) In the long run apps might be able to insert lightweight
trace entries without entering the kernel, using the LWPINS
instruction.

3) Maybe LWP will be enhanced with the ability to profile system
mode execution as well - which we'll be able to support very
easily.

These features are *far* more interesting than some limited
self-monitoring use of LWP.

I don't mind niches per se, so i don't mind the self-monitoring
usecase either, as long as they are not trying to be the *only*
feature, at the expense of more interesting features.

I think it can all be supported in a consistent way (see my
previous mails) - but the feature as presented today just does
not look useful enough to me if only supports that niche
self-monitoring usecase.

> > > [...] It also destroys the intended use-case for LWP
> > > because it disturbs any process that is doing
> > > self-profiling with LWP.
> >
> > Why would it destroy that? Self-profiling can install events
> > just fine, the kernel will arbitrate the resource.
>
> Because you can't reliably hand over the LWPCB management to
> the kernel. The instruction to load a new LWPCB is executable
> in ring-3. Any kernel-use of LWP will never be reliable.

It will be reliable for all tasks that don't intentionally
modify their own LWPCB's but stay with the defined APIs and no
task will be able to destroy *another* task's LWPCB (be it in or
outside of any APIs), if properly implemented.

So a task can mess with itself - and it can already do that
today.

So what's your point?

> > > So what you are saying is (not just here, also in other
> > > emails in this thread) that every hardware not designed
> > > for perf is crap?
> >
> > No - PMU hardware designed to not allow the profiling of the
> > kernel is obviously a crappy aspect of it. Also, PMU
> > hardware that does not allow 100% encapsulation by the
> > kernel is obviously not very wisely done either.
>
> Why? Whats wrong with user-space having control over its own
> PMU in a safe way? This is what the feature was designed for.

Read what i've written: 'PMU hardware designed to not allow the
profiling of the kernel is obviously a crappy aspect of it'.

There is no reason why LWP could not allow profiling of kernel
execution as well, with a simple security model to make sure
unprivileged user-space does not profile kernel execution: such
as a LWP-load-time check whether the LWPCB lies on a system pte
or not.

This would allow everything that is possible today - and more.

Allowing user-space access to the PMU does not preclude a proper
PMU abstraction.

Thanks,

Ingo

Avi Kivity

unread,

Dec 21, 2011, 7:50:02 AM12/21/11

On 12/21/2011 02:34 PM, Ingo Molnar wrote:
> I think it can all be supported in a consistent way (see my
> previous mails) - but the feature as presented today just does
> not look useful enough to me if only supports that niche
> self-monitoring usecase.

I hate to re-enter this thread, but this "niche use case" is exactly
what LWP is designed for. And once the JVM is adapted to exploit LWP,
its use will dwarf all of the uses of perf put together (except the NMI
watchdog). You're only causing the developers needless pain by forcing
them to fit this red peg into a green hole.

--
error compiling committee.c: too many arguments to function

Vince Weaver

unread,

Dec 21, 2011, 9:00:02 AM12/21/11

On Wed, 21 Dec 2011, Ingo Molnar wrote:

>
> * Vince Weaver <vwea...@eecs.utk.edu> wrote:
>
> Have a look at how the 'perf test' self-test utilizes RDPMC in
> these commits in tip:perf/fast:

I did. How many times do I have to tell you I already applied, ran, and
benchmarked this code already, and the results were posted on that link in
the previous e-mail.

> You can find these commits in today's -tip. Overhead should be
> somewhere around 50 cycles per call (i suspect it could
> optimized more), which is a fraction of what a syscall is
> costing.

No, it's more than a "50-cycle" call. To get a value out you need to do
two rdpmc calls plus some mucking about with some mmap'd values. It still
benchmarks much slower than the perctr implementation.

I'd be glad to see _actual_ numbers for an _actual_ test that measures
useful values. Until then I'm believing the numbers I measure on three
different architectures which still show that perf_event has high
overhead.

> > [...] but that's mainly because as-posted the documentation
> > for how to use that patchset is a bit unclear.
>
> In your world there's always someone else to blame.

Yes. I was blaming myself for not understanding the code well enough to
write a good benchmark.

> The thing is, *you* are interested in this niche feature, PeterZ
> not so much.

The thing *we* are interested in is the main PAPI use case. It's arguable
that more people use PAPI under Linux than actually use perf.

> You made a false claim that perf cannot use RDPMC and PeterZ has
> proven you wrong once again. Your almost non-stop whining and
> the constant misrepresentations you make are not very
> productive.

I made no such claim. Please cite.

You made the questionable claim that the AMD devels didn't consult with
any competent perf counter experts. What you meant was that they didn't
have foresight 5 years that Ingo Molnar would come in late with some NIH
implementation of some niche kernel functionality and take it over.
Though in retrospect I guess that's inevitable.

Vince

Ingo Molnar

unread,

Dec 21, 2011, 9:50:01 AM12/21/11

* Avi Kivity <a...@redhat.com> wrote:

> On 12/21/2011 02:34 PM, Ingo Molnar wrote:
>
> > I think it can all be supported in a consistent way (see my
> > previous mails) - but the feature as presented today just
> > does not look useful enough to me if only supports that
> > niche self-monitoring usecase.
>
> I hate to re-enter this thread, but this "niche use case" is

> exactly what LWP is designed for. [...]

It's not the only usecase that it can be used in, and that is
what matters to me.

> [...] And once the JVM is adapted to exploit LWP, its use will

> dwarf all of the uses of perf put together (except the NMI
> watchdog). You're only causing the developers needless pain
> by forcing them to fit this red peg into a green hole.

I disagree - i think LWP has been seriously over-sold and
seriously under-designed. Anyway, i'm willing to be convinced
that it's worth to be merged upstream, if it brings tangible
benefits to the usecases i mentioned.

Thanks,

Ingo

Joerg Roedel

unread,

Dec 21, 2011, 5:50:01 PM12/21/11

On Wed, Dec 21, 2011 at 02:22:33PM +0100, Ingo Molnar wrote:
>
> * Avi Kivity <a...@redhat.com> wrote:
> > I hate to re-enter this thread, but this "niche use case" is
> > exactly what LWP is designed for. [...]
>
> It's not the only usecase that it can be used in, and that is
> what matters to me.

The really sad thing about it is, that this only matters to you. At
least I have seen nobody in this thread yet who joined or agreed on your
line of reasoning. Especially sad is that your NAK based on this also
prevents that all the people who want to use LWP in the intended way on
Linux can't use it too.

Joerg

Ingo Molnar

unread,

Dec 23, 2011, 6:00:01 AM12/23/11

* Joerg Roedel <jo...@8bytes.org> wrote:

> On Wed, Dec 21, 2011 at 02:22:33PM +0100, Ingo Molnar wrote:
> >
> > * Avi Kivity <a...@redhat.com> wrote:
> > > I hate to re-enter this thread, but this "niche use case" is
> > > exactly what LWP is designed for. [...]
> >
> > It's not the only usecase that it can be used in, and that is
> > what matters to me.
>
> The really sad thing about it is, that this only matters to

> you. [...]

What the hell are you talking about? Precise, skid-free profiles
matter to pretty much *every* single developer profiling
user-space code on AMD CPUs. That you are not representing them
properly in this thread qualifies you, not me.

Thanks,

Ingo

Ingo Molnar

unread,

Dec 23, 2011, 6:00:01 AM12/23/11

* Gleb Natapov <gl...@redhat.com> wrote:

> On Tue, Dec 20, 2011 at 07:40:04PM +0100, Ingo Molnar wrote:
> > > The point is, if user-space re-programs LWP it will continue
> > > to write its samples to the new ring-buffer virtual-address
> > > set up by user-space. It will still use that virtual address
> > > in another address-space after a task-switch. This allows
> > > processes to corrupt memory of other processes. [...]
> >
> > That's nonsense. As i said it my previous mail the LWPC
> > should be per task and switched on task switch - just like
> > the DS/PEBS context is.
>
> Is it? Looking at arch/x86/kernel/cpu/perf_event_intel_ds.c it
> seems like DS is per cpu, not per task.

We flush it on context switch and reuse it for the next task via
the x86_pmu.drain_pebs() callback - so the buffering of PEBS
events is per task.

Thanks,

Ingo

0 new messages