
[PATCH v3 0/7] x86: Rewrite switch_to()


Brian Gerst

Aug 13, 2016, 12:40:05 PM
This patch set simplifies the switch_to() code, by moving the stack switch
code out of line into an asm stub before calling __switch_to(). This ends
up being more readable, and using the C calling convention instead of
clobbering all registers improves code generation. It also allows newly
forked processes to construct a special stack frame to seamlessly flow
to ret_from_fork, instead of using a test and branch, or an unbalanced
call/ret.

Changes from v2:
- Updated comments around kernel threads being uncommon for fork, etc.
- Removed STACK_FRAME_NON_STANDARD annotation from __schedule() per Josh Poimboeuf
- A few minor cleanups added

Changes from v1:
- Added struct inactive_task_frame
- Added comments about kernel threads returning to userspace
- Cleaned up some incorrect uses of thread.sp
- Rearranged inactive stack frame so that BP (frame pointer) is in the natural position right below the return address. This should take care of unwinding issues Josh raised.

Brian Gerst (7):
x86-32, kgdb: Don't use thread.ip in sleeping_thread_to_gdb_regs()
x86-64, kgdb: clear GDB_PS on 64-bit
x86: Add struct inactive_task_frame
x86: Rewrite switch_to() code
x86: Pass kernel thread parameters in fork_frame
x86: Fix thread_saved_pc()
Revert "sched: Mark __schedule() stack frame as non-standard"

arch/x86/entry/entry_32.S | 68 +++++++++++++-----
arch/x86/entry/entry_64.S | 78 ++++++++++++++------
arch/x86/include/asm/processor.h | 13 +---
arch/x86/include/asm/stacktrace.h | 4 +-
arch/x86/include/asm/switch_to.h | 144 ++++++++-----------------------------
arch/x86/include/asm/thread_info.h | 2 -
arch/x86/kernel/asm-offsets.c | 6 ++
arch/x86/kernel/asm-offsets_32.c | 5 ++
arch/x86/kernel/asm-offsets_64.c | 5 ++
arch/x86/kernel/kgdb.c | 8 +--
arch/x86/kernel/process.c | 14 +++-
arch/x86/kernel/process_32.c | 31 +++-----
arch/x86/kernel/process_64.c | 21 +++---
arch/x86/kernel/smpboot.c | 1 -
kernel/sched/core.c | 1 -
15 files changed, 190 insertions(+), 211 deletions(-)

--
2.5.5

Brian Gerst

Aug 13, 2016, 12:40:05 PM
switch_to() no longer saves EFLAGS, so it's bogus to look for it on the
stack. Set it to zero, as the 32-bit code already does.

Signed-off-by: Brian Gerst <brg...@gmail.com>
---
arch/x86/kernel/kgdb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index fe649a5..5e3f294 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -176,7 +176,7 @@ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, struct task_struct *p)
gdb_regs[GDB_FS] = 0xFFFF;
gdb_regs[GDB_GS] = 0xFFFF;
#else
- gdb_regs32[GDB_PS] = *(unsigned long *)(p->thread.sp + 8);
+ gdb_regs32[GDB_PS] = 0;
gdb_regs32[GDB_CS] = __KERNEL_CS;
gdb_regs32[GDB_SS] = __KERNEL_DS;
gdb_regs[GDB_R8] = 0;
--
2.5.5

Brian Gerst

Aug 13, 2016, 12:40:07 PM
Add struct inactive_task_frame, which defines the layout of the stack for
a sleeping process. For now, the only defined field is the BP register
(frame pointer).

Signed-off-by: Brian Gerst <brg...@gmail.com>
---
arch/x86/include/asm/stacktrace.h | 4 ++--
arch/x86/include/asm/switch_to.h | 5 +++++
arch/x86/kernel/kgdb.c | 3 ++-
arch/x86/kernel/process.c | 3 ++-
4 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 0944218..7646fb2 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -8,6 +8,7 @@

#include <linux/uaccess.h>
#include <linux/ptrace.h>
+#include <asm/switch_to.h>

extern int kstack_depth_to_print;

@@ -70,8 +71,7 @@ stack_frame(struct task_struct *task, struct pt_regs *regs)
return bp;
}

- /* bp is the last reg pushed by switch_to */
- return *(unsigned long *)task->thread.sp;
+ return ((struct inactive_task_frame *)task->thread.sp)->bp;
}
#else
static inline unsigned long
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1..02de86e 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,11 @@ struct tss_struct;
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
struct tss_struct *tss);

+/* data that is pointed to by thread.sp */
+struct inactive_task_frame {
+ unsigned long bp;
+};
+
#ifdef CONFIG_X86_32

#ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 5e3f294..8e36f24 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -50,6 +50,7 @@
#include <asm/apicdef.h>
#include <asm/apic.h>
#include <asm/nmi.h>
+#include <asm/switch_to.h>

struct dbg_reg_def_t dbg_reg_def[DBG_MAX_REG_NUM] =
{
@@ -166,7 +167,7 @@ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, struct task_struct *p)
gdb_regs[GDB_DX] = 0;
gdb_regs[GDB_SI] = 0;
gdb_regs[GDB_DI] = 0;
- gdb_regs[GDB_BP] = *(unsigned long *)p->thread.sp;
+ gdb_regs[GDB_BP] = ((struct inactive_task_frame *)p->thread.sp)->bp;
#ifdef CONFIG_X86_32
gdb_regs[GDB_DS] = __KERNEL_DS;
gdb_regs[GDB_ES] = __KERNEL_DS;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 62c0b0e..0115a4a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -32,6 +32,7 @@
#include <asm/tlbflush.h>
#include <asm/mce.h>
#include <asm/vm86.h>
+#include <asm/switch_to.h>

/*
* per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -556,7 +557,7 @@ unsigned long get_wchan(struct task_struct *p)
if (sp < bottom || sp > top)
return 0;

- fp = READ_ONCE_NOCHECK(*(unsigned long *)sp);
+ fp = READ_ONCE_NOCHECK(((struct inactive_task_frame *)sp)->bp);
do {
if (fp < bottom || fp > top)
return 0;
--
2.5.5

Brian Gerst

Aug 13, 2016, 12:40:07 PM
Now that the x86 switch_to() uses the standard C calling convention,
STACK_FRAME_NON_STANDARD is no longer needed.

Suggested-by: Josh Poimboeuf <jpoi...@redhat.com>
Signed-off-by: Brian Gerst <brg...@gmail.com>
---
kernel/sched/core.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b6b23c..dbf73db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3384,7 +3384,6 @@ static void __sched notrace __schedule(bool preempt)

balance_callback(rq);
}
-STACK_FRAME_NON_STANDARD(__schedule); /* switch_to() */

static inline void sched_submit_work(struct task_struct *tsk)
{
--
2.5.5

Brian Gerst

Aug 13, 2016, 12:50:05 PM
Move the low-level context switch code to an out-of-line asm stub instead of
using complex inline asm. This allows constructing a new stack frame for the
child process to make it seamlessly flow to ret_from_fork without an extra
test and branch in __switch_to(). It also improves code generation for
__schedule() by using the C calling convention instead of clobbering all
registers.

Signed-off-by: Brian Gerst <brg...@gmail.com>
---
arch/x86/entry/entry_32.S | 37 ++++++++++
arch/x86/entry/entry_64.S | 41 ++++++++++-
arch/x86/include/asm/processor.h | 3 -
arch/x86/include/asm/switch_to.h | 137 ++++++-------------------------------
arch/x86/include/asm/thread_info.h | 2 -
arch/x86/kernel/asm-offsets.c | 6 ++
arch/x86/kernel/asm-offsets_32.c | 5 ++
arch/x86/kernel/asm-offsets_64.c | 5 ++
arch/x86/kernel/process_32.c | 9 ++-
arch/x86/kernel/process_64.c | 9 ++-
arch/x86/kernel/smpboot.c | 1 -
11 files changed, 125 insertions(+), 130 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 0b56666..bf8f221 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -204,6 +204,43 @@
POP_GS_EX
.endm

+/*
+ * %eax: prev task
+ * %edx: next task
+ */
+ENTRY(__switch_to_asm)
+ /*
+ * Save callee-saved registers
+ * This must match the order in struct inactive_task_frame
+ */
+ pushl %ebp
+ pushl %ebx
+ pushl %edi
+ pushl %esi
+
+ /* switch stack */
+ movl %esp, TASK_threadsp(%eax)
+ movl TASK_threadsp(%edx), %esp
+
+#ifdef CONFIG_CC_STACKPROTECTOR
+ movl TASK_stack_canary(%edx), %ebx
+ movl %ebx, PER_CPU_VAR(stack_canary)+stack_canary_offset
+#endif
+
+ /* restore callee-saved registers */
+ popl %esi
+ popl %edi
+ popl %ebx
+ popl %ebp
+
+ jmp __switch_to
+END(__switch_to_asm)
+
+/*
+ * A newly forked process directly context switches into this address.
+ *
+ * eax: prev task we switched from
+ */
ENTRY(ret_from_fork)
pushl %eax
call schedule_tail
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f6b40e5..c1af8ac 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -368,13 +368,48 @@ END(ptregs_\func)
#include <asm/syscalls_64.h>

/*
+ * %rdi: prev task
+ * %rsi: next task
+ */
+ENTRY(__switch_to_asm)
+ /*
+ * Save callee-saved registers
+ * This must match the order in inactive_task_frame
+ */
+ pushq %rbp
+ pushq %rbx
+ pushq %r12
+ pushq %r13
+ pushq %r14
+ pushq %r15
+
+ /* switch stack */
+ movq %rsp, TASK_threadsp(%rdi)
+ movq TASK_threadsp(%rsi), %rsp
+
+#ifdef CONFIG_CC_STACKPROTECTOR
+ movq TASK_stack_canary(%rsi), %rbx
+ movq %rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
+#endif
+
+ /* restore callee-saved registers */
+ popq %r15
+ popq %r14
+ popq %r13
+ popq %r12
+ popq %rbx
+ popq %rbp
+
+ jmp __switch_to
+END(__switch_to_asm)
+
+/*
* A newly forked process directly context switches into this address.
*
- * rdi: prev task we switched from
+ * rax: prev task we switched from
*/
ENTRY(ret_from_fork)
- LOCK ; btr $TIF_FORK, TI_flags(%r8)
-
+ movq %rax, %rdi
call schedule_tail /* rdi: 'prev' task parameter */

testb $3, CS(%rsp) /* from kernel_thread? */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 63def95..6fee863 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -389,9 +389,6 @@ struct thread_struct {
unsigned short fsindex;
unsigned short gsindex;
#endif
-#ifdef CONFIG_X86_32
- unsigned long ip;
-#endif
#ifdef CONFIG_X86_64
unsigned long fsbase;
unsigned long gsbase;
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 02de86e..bf4e2ec 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -2,135 +2,40 @@
#define _ASM_X86_SWITCH_TO_H

struct task_struct; /* one of the stranger aspects of C forward declarations */
+
+struct task_struct *__switch_to_asm(struct task_struct *prev,
+ struct task_struct *next);
+
__visible struct task_struct *__switch_to(struct task_struct *prev,
- struct task_struct *next);
+ struct task_struct *next);
struct tss_struct;
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
struct tss_struct *tss);

/* data that is pointed to by thread.sp */
struct inactive_task_frame {
+#ifdef CONFIG_X86_64
+ unsigned long r15;
+ unsigned long r14;
+ unsigned long r13;
+ unsigned long r12;
+#else
+ unsigned long si;
+ unsigned long di;
+#endif
+ unsigned long bx;
unsigned long bp;
+ unsigned long ret_addr;
};

-#ifdef CONFIG_X86_32
-
-#ifdef CONFIG_CC_STACKPROTECTOR
-#define __switch_canary \
- "movl %P[task_canary](%[next]), %%ebx\n\t" \
- "movl %%ebx, "__percpu_arg([stack_canary])"\n\t"
-#define __switch_canary_oparam \
- , [stack_canary] "=m" (stack_canary.canary)
-#define __switch_canary_iparam \
- , [task_canary] "i" (offsetof(struct task_struct, stack_canary))
-#else /* CC_STACKPROTECTOR */
-#define __switch_canary
-#define __switch_canary_oparam
-#define __switch_canary_iparam
-#endif /* CC_STACKPROTECTOR */
+struct fork_frame {
+ struct inactive_task_frame frame;
+ struct pt_regs regs;
+};

-/*
- * Saving eflags is important. It switches not only IOPL between tasks,
- * it also protects other tasks from NT leaking through sysenter etc.
- */
#define switch_to(prev, next, last) \
do { \
- /* \
- * Context-switching clobbers all registers, so we clobber \
- * them explicitly, via unused output variables. \
- * (EAX and EBP is not listed because EBP is saved/restored \
- * explicitly for wchan access and EAX is the return value of \
- * __switch_to()) \
- */ \
- unsigned long ebx, ecx, edx, esi, edi; \
- \
- asm volatile("pushl %%ebp\n\t" /* save EBP */ \
- "movl %%esp,%[prev_sp]\n\t" /* save ESP */ \
- "movl %[next_sp],%%esp\n\t" /* restore ESP */ \
- "movl $1f,%[prev_ip]\n\t" /* save EIP */ \
- "pushl %[next_ip]\n\t" /* restore EIP */ \
- __switch_canary \
- "jmp __switch_to\n" /* regparm call */ \
- "1:\t" \
- "popl %%ebp\n\t" /* restore EBP */ \
- \
- /* output parameters */ \
- : [prev_sp] "=m" (prev->thread.sp), \
- [prev_ip] "=m" (prev->thread.ip), \
- "=a" (last), \
- \
- /* clobbered output registers: */ \
- "=b" (ebx), "=c" (ecx), "=d" (edx), \
- "=S" (esi), "=D" (edi) \
- \
- __switch_canary_oparam \
- \
- /* input parameters: */ \
- : [next_sp] "m" (next->thread.sp), \
- [next_ip] "m" (next->thread.ip), \
- \
- /* regparm parameters for __switch_to(): */ \
- [prev] "a" (prev), \
- [next] "d" (next) \
- \
- __switch_canary_iparam \
- \
- : /* reloaded segment registers */ \
- "memory"); \
+ ((last) = __switch_to_asm((prev), (next))); \
} while (0)

-#else /* CONFIG_X86_32 */
-
-/* frame pointer must be last for get_wchan */
-#define SAVE_CONTEXT "pushq %%rbp ; movq %%rsi,%%rbp\n\t"
-#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp\t"
-
-#define __EXTRA_CLOBBER \
- , "rcx", "rbx", "rdx", "r8", "r9", "r10", "r11", \
- "r12", "r13", "r14", "r15", "flags"
-
-#ifdef CONFIG_CC_STACKPROTECTOR
-#define __switch_canary \
- "movq %P[task_canary](%%rsi),%%r8\n\t" \
- "movq %%r8,"__percpu_arg([gs_canary])"\n\t"
-#define __switch_canary_oparam \
- , [gs_canary] "=m" (irq_stack_union.stack_canary)
-#define __switch_canary_iparam \
- , [task_canary] "i" (offsetof(struct task_struct, stack_canary))
-#else /* CC_STACKPROTECTOR */
-#define __switch_canary
-#define __switch_canary_oparam
-#define __switch_canary_iparam
-#endif /* CC_STACKPROTECTOR */
-
-/*
- * There is no need to save or restore flags, because flags are always
- * clean in kernel mode, with the possible exception of IOPL. Kernel IOPL
- * has no effect.
- */
-#define switch_to(prev, next, last) \
- asm volatile(SAVE_CONTEXT \
- "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \
- "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */ \
- "call __switch_to\n\t" \
- "movq "__percpu_arg([current_task])",%%rsi\n\t" \
- __switch_canary \
- "movq %P[thread_info](%%rsi),%%r8\n\t" \
- "movq %%rax,%%rdi\n\t" \
- "testl %[_tif_fork],%P[ti_flags](%%r8)\n\t" \
- "jnz ret_from_fork\n\t" \
- RESTORE_CONTEXT \
- : "=a" (last) \
- __switch_canary_oparam \
- : [next] "S" (next), [prev] "D" (prev), \
- [threadrsp] "i" (offsetof(struct task_struct, thread.sp)), \
- [ti_flags] "i" (offsetof(struct thread_info, flags)), \
- [_tif_fork] "i" (_TIF_FORK), \
- [thread_info] "i" (offsetof(struct task_struct, stack)), \
- [current_task] "m" (current_task) \
- __switch_canary_iparam \
- : "memory", "cc" __EXTRA_CLOBBER)
-
-#endif /* CONFIG_X86_32 */
-
#endif /* _ASM_X86_SWITCH_TO_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8b7c8d8..494c4b5 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,7 +95,6 @@ struct thread_info {
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* IA32 compatibility process */
-#define TIF_FORK 18 /* ret_from_fork */
#define TIF_NOHZ 19 /* in adaptive nohz mode */
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
@@ -119,7 +118,6 @@ struct thread_info {
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
-#define _TIF_FORK (1 << TIF_FORK)
#define _TIF_NOHZ (1 << TIF_NOHZ)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 2bd5c6f..db3a0af 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -29,6 +29,12 @@

void common(void) {
BLANK();
+ OFFSET(TASK_threadsp, task_struct, thread.sp);
+#ifdef CONFIG_CC_STACKPROTECTOR
+ OFFSET(TASK_stack_canary, task_struct, stack_canary);
+#endif
+
+ BLANK();
OFFSET(TI_flags, thread_info, flags);
OFFSET(TI_status, thread_info, status);

diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index ecdc1d2..880aa09 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -57,6 +57,11 @@ void foo(void)
/* Size of SYSENTER_stack */
DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));

+#ifdef CONFIG_CC_STACKPROTECTOR
+ BLANK();
+ OFFSET(stack_canary_offset, stack_canary, canary);
+#endif
+
#if defined(CONFIG_LGUEST) || defined(CONFIG_LGUEST_GUEST) || defined(CONFIG_LGUEST_MODULE)
BLANK();
OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index d875f97..210927e 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -56,6 +56,11 @@ int main(void)
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
BLANK();

+#ifdef CONFIG_CC_STACKPROTECTOR
+ DEFINE(stack_canary_offset, offsetof(union irq_stack_union, stack_canary));
+ BLANK();
+#endif
+
DEFINE(__NR_syscall_max, sizeof(syscalls_64) - 1);
DEFINE(NR_syscalls, sizeof(syscalls_64));

diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index d86be29..4bedbc0 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -133,17 +133,20 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct *p, unsigned long tls)
{
struct pt_regs *childregs = task_pt_regs(p);
+ struct fork_frame *fork_frame = container_of(childregs, struct fork_frame, regs);
+ struct inactive_task_frame *frame = &fork_frame->frame;
struct task_struct *tsk;
int err;

- p->thread.sp = (unsigned long) childregs;
+ frame->bp = 0;
+ p->thread.sp = (unsigned long) fork_frame;
p->thread.sp0 = (unsigned long) (childregs+1);
memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));

if (unlikely(p->flags & PF_KTHREAD)) {
/* kernel thread */
memset(childregs, 0, sizeof(struct pt_regs));
- p->thread.ip = (unsigned long) ret_from_kernel_thread;
+ frame->ret_addr = (unsigned long) ret_from_kernel_thread;
task_user_gs(p) = __KERNEL_STACK_CANARY;
childregs->ds = __USER_DS;
childregs->es = __USER_DS;
@@ -161,7 +164,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
if (sp)
childregs->sp = sp;

- p->thread.ip = (unsigned long) ret_from_fork;
+ frame->ret_addr = (unsigned long) ret_from_fork;
task_user_gs(p) = get_user_gs(current_pt_regs());

p->thread.io_bitmap_ptr = NULL;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 63236d8..827eeed 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -141,12 +141,17 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
{
int err;
struct pt_regs *childregs;
+ struct fork_frame *fork_frame;
+ struct inactive_task_frame *frame;
struct task_struct *me = current;

p->thread.sp0 = (unsigned long)task_stack_page(p) + THREAD_SIZE;
childregs = task_pt_regs(p);
- p->thread.sp = (unsigned long) childregs;
- set_tsk_thread_flag(p, TIF_FORK);
+ fork_frame = container_of(childregs, struct fork_frame, regs);
+ frame = &fork_frame->frame;
+ frame->bp = 0;
+ frame->ret_addr = (unsigned long) ret_from_fork;
+ p->thread.sp = (unsigned long) fork_frame;
p->thread.io_bitmap_ptr = NULL;

savesegment(gs, p->thread.gsindex);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 067de61..a3fefdf 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -935,7 +935,6 @@ void common_cpu_up(unsigned int cpu, struct task_struct *idle)
per_cpu(cpu_current_top_of_stack, cpu) =
(unsigned long)task_stack_page(idle) + THREAD_SIZE;
#else
- clear_tsk_thread_flag(idle, TIF_FORK);
initial_gs = per_cpu_offset(cpu);
#endif
}
--
2.5.5

Brian Gerst

Aug 13, 2016, 12:50:05 PM
thread_saved_pc() was using a completely bogus method to get the return
address. Since switch_to() was previously inlined, there was no sane way
to know where on the stack the return address was stored. Now with the
frame of a sleeping thread well defined, this can be implemented correctly.

Signed-off-by: Brian Gerst <brg...@gmail.com>
---
arch/x86/include/asm/processor.h | 10 ++--------
arch/x86/kernel/process.c | 11 +++++++++++
arch/x86/kernel/process_32.c | 8 --------
3 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 6fee863..b22fb5a 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -721,8 +721,6 @@ static inline void spin_lock_prefetch(const void *x)
.addr_limit = KERNEL_DS, \
}

-extern unsigned long thread_saved_pc(struct task_struct *tsk);
-
/*
* TOP_OF_KERNEL_STACK_PADDING reserves 8 bytes on top of the ring0 stack.
* This is necessary to guarantee that the entire "struct pt_regs"
@@ -773,17 +771,13 @@ extern unsigned long thread_saved_pc(struct task_struct *tsk);
.addr_limit = KERNEL_DS, \
}

-/*
- * Return saved PC of a blocked thread.
- * What is this good for? it will be always the scheduler or ret_from_fork.
- */
-#define thread_saved_pc(t) READ_ONCE_NOCHECK(*(unsigned long *)((t)->thread.sp - 8))
-
#define task_pt_regs(tsk) ((struct pt_regs *)(tsk)->thread.sp0 - 1)
extern unsigned long KSTK_ESP(struct task_struct *task);

#endif /* CONFIG_X86_64 */

+extern unsigned long thread_saved_pc(struct task_struct *tsk);
+
extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
unsigned long new_sp);

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0115a4a..c1fa790 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -514,6 +514,17 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
}

/*
+ * Return saved PC of a blocked thread.
+ * What is this good for? it will be always the scheduler or ret_from_fork.
+ */
+unsigned long thread_saved_pc(struct task_struct *tsk)
+{
+ struct inactive_task_frame *frame =
+ (struct inactive_task_frame *) READ_ONCE(tsk->thread.sp);
+ return READ_ONCE_NOCHECK(frame->ret_addr);
+}
+
+/*
* Called from fs/proc with a reference on @p to find the function
* which called into schedule(). This needs to be done carefully
* because the task might wake up and we might look at a stack
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 18714a1..404efdf 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -55,14 +55,6 @@
#include <asm/switch_to.h>
#include <asm/vm86.h>

-/*
- * Return saved PC of a blocked thread.
- */
-unsigned long thread_saved_pc(struct task_struct *tsk)
-{
- return ((unsigned long *)tsk->thread.sp)[3];
-}
-
void __show_regs(struct pt_regs *regs, int all)
{
unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
--
2.5.5

Linus Torvalds

Aug 13, 2016, 1:20:06 PM
On Sat, Aug 13, 2016 at 9:38 AM, Brian Gerst <brg...@gmail.com> wrote:
> This patch set simplifies the switch_to() code, by moving the stack switch
> code out of line into an asm stub before calling __switch_to(). This ends
> up being more readable, and using the C calling convention instead of
> clobbering all registers improves code generation. It also allows newly
> forked processes to construct a special stack frame to seamlessly flow
> to ret_from_fork, instead of using a test and branch, or an unbalanced
> call/ret.

Do you have performance numbers? Is it noticeable/measurable?

Linus

Ingo Molnar

Aug 14, 2016, 4:30:06 AM

* Brian Gerst <brg...@gmail.com> wrote:
> How do I measure it? The perf documentation isn't easy to understand.

Something like this:

taskset 1 perf stat -a -e '{instructions,cycles}' --repeat 10 perf bench sched pipe

... will give a very good idea about the general impact of these changes on
context switch overhead.
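As a side note on reading the output: the two raw counters can be post-processed into an insn-per-cycle figure with standard tools. A sketch, with sample lines standing in for a real run; the CSV mode via -x, and the exact field layout are assumptions to check against your perf version:

```shell
# Extract instruction and cycle counts from `perf stat -x,` CSV output
# and compute instructions per cycle.  The here-string mimics two lines
# of real -x, output.
stat_csv='12010932128,,instructions
11691797513,,cycles'

printf '%s\n' "$stat_csv" |
awk -F, '/instructions/ {i=$1} /cycles/ {c=$1} END {printf "%.2f insn per cycle\n", i/c}'
```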

Thanks,

Ingo

Brian Gerst

Aug 14, 2016, 6:30:06 AM
On Sat, Aug 13, 2016 at 1:16 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> Do you have performance numbers? Is it noticeable/measurable?

How do I measure it? The perf documentation isn't easy to understand.

It shouldn't be a significant change. On a 64-bit defconfig build,
__schedule() shrinks by 103 bytes. It's hard to analyse what exactly
changes, but it's likely that GCC can allocate registers better
without all the clobbers of the old inline asm version interfering.
The new stub adds just 39 bytes.

--
Brian Gerst

Andy Lutomirski

Aug 14, 2016, 7:20:18 AM
I will be quite surprised if you can measure any effect at all. I've
never seen context switches take fewer than ~2k cycles, and on my
laptop, they take 8k-9k cycles. The scheduler is really, really slow.

(Why doesn't that perf command show cycles per context switch?)

--Andy

Brian Gerst

Aug 14, 2016, 10:20:05 AM
Before:

 Performance counter stats for 'system wide' (10 runs):

    12,010,932,128      instructions              #    1.03  insn per cycle           ( +-  0.31% )
    11,691,797,513      cycles                                                        ( +-  0.76% )

       3.487329979 seconds time elapsed                                               ( +-  0.78% )

After:

 Performance counter stats for 'system wide' (10 runs):

    12,097,706,506      instructions              #    1.04  insn per cycle           ( +-  0.14% )
    11,612,167,742      cycles                                                        ( +-  0.81% )

       3.451278789 seconds time elapsed                                               ( +-  0.82% )

The numbers with or without this patch series are roughly the same.
There is noticeable variation in the numbers each time I run it, so
I'm not sure how good of a benchmark this is.
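A back-of-the-envelope reading of those counters (a sketch only: it assumes perf bench sched pipe's default 1,000,000 loops, i.e. roughly two million context switches, and that the system-wide cycle count is dominated by the benchmark):

```python
LOOPS = 1_000_000          # perf bench sched pipe default (assumed)
SWITCHES = 2 * LOOPS       # each pipe round trip switches twice

before_cycles, before_err = 11_691_797_513, 0.0076
after_cycles,  after_err  = 11_612_167_742, 0.0081

def cycles_per_switch(cycles):
    """Upper-bound estimate; system-wide counts include other activity."""
    return cycles / SWITCHES

delta = before_cycles - after_cycles
noise = before_cycles * before_err + after_cycles * after_err

print(f"~{cycles_per_switch(before_cycles):,.0f} cycles per switch")
print(f"improvement {delta / before_cycles:.2%}, run-to-run noise ~{noise / before_cycles:.2%}")
```

This is consistent with the delta being within the noise: a roughly 0.7% cycle reduction against a combined run-to-run variation of about 1.6%, and it lines up with Andy's observation that a switch costs thousands of cycles, so saving a handful is hard to see.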

--
Brian Gerst

Ingo Molnar

Aug 15, 2016, 1:20:05 AM
Weird, I get an order of magnitude lower noise:

triton:~/tip> taskset 1 perf stat -a -e '{instructions,cycles}' --repeat 10 perf bench sched pipe >/dev/null

Performance counter stats for 'system wide' (10 runs):

11,503,026,062 instructions # 1.23 insn per cycle ( +- 2.64% )
9,377,410,613 cycles ( +- 2.05% )

1.669425407 seconds time elapsed ( +- 0.12% )

But note that I also had '--sync' for perf stat and did a >/dev/null at the end to
make sure no terminal output and subsequent Xorg activities interfere. Also, full
screen terminal.

Maybe try 'taskset 4' as well to put the workload on another CPU, if the first CPU
is busier than the others?

(Any Hyperthreading on your test system?)

Thanks,

Ingo

Brian Gerst

Aug 15, 2016, 7:50:06 AM
It is an AMD Phenom(tm) II X6 1055T, no hyperthreading.

--
Brian Gerst

Herbert Xu

Aug 17, 2016, 1:20:05 AM
Andy Lutomirski <lu...@amacapital.net> wrote:
>
> I will be quite surprised if you can measure any effect at all. I've
> never seen context switches take fewer than ~2k cycles, and on my
> laptop, they take 8k-9k cycles. The scheduler is really, really slow.

Indeed, I think I still have a bugzilla entry somewhere regarding
the huge performance regression when we first switched over to the
current scheduler.

Over the years people have been trying to hack around it, e.g.,
network driver polling, low latency socket, anything to avoid
going into the scheduler. We really should fix it properly some
day.

Cheers,
--
Email: Herbert Xu <her...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Andy Lutomirski

Aug 17, 2016, 5:30:05 PM
I've never investigated for real, but I suspect that cgroups are a big
part of it. If you do a regular perf recording, I think you'll find
that nearly all of the time is in the scheduler.

Josh Poimboeuf

Aug 17, 2016, 5:30:06 PM
On Sat, Aug 13, 2016 at 12:38:15PM -0400, Brian Gerst wrote:
> This patch set simplifies the switch_to() code, by moving the stack switch
> code out of line into an asm stub before calling __switch_to(). This ends
> up being more readable, and using the C calling convention instead of
> clobbering all registers improves code generation. It also allows newly
> forked processes to construct a special stack frame to seamlessly flow
> to ret_from_fork, instead of using a test and branch, or an unbalanced
> call/ret.
>
> Changes from v2:
> - Updated comments around kernel threads being uncommon for fork, etc.
> - Removed STACK_FRAME_NON_STANDARD annotation from __schedule() per Josh Poimboeuf
> - A few minor cleanups added

There are a few minor conflicts with my x86 stack dump patch set, but
for the most part they should be orthogonal.

For the series:

Reviewed-by: Josh Poimboeuf <jpoi...@redhat.com>

--
Josh

tip-bot for Brian Gerst

Aug 24, 2016, 9:20:05 AM
Commit-ID: 0100301bfdf56a2a370c7157b5ab0fbf9313e1cd
Gitweb: http://git.kernel.org/tip/0100301bfdf56a2a370c7157b5ab0fbf9313e1cd
Author: Brian Gerst <brg...@gmail.com>
AuthorDate: Sat, 13 Aug 2016 12:38:19 -0400
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 24 Aug 2016 12:31:41 +0200

sched/x86: Rewrite the switch_to() code

Move the low-level context switch code to an out-of-line asm stub instead of
using complex inline asm. This allows constructing a new stack frame for the
child process to make it seamlessly flow to ret_from_fork without an extra
test and branch in __switch_to(). It also improves code generation for
__schedule() by using the C calling convention instead of clobbering all
registers.

Signed-off-by: Brian Gerst <brg...@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoi...@redhat.com>
Cc: Andy Lutomirski <lu...@kernel.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Denys Vlasenko <dvla...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-5-g...@gmail.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
arch/x86/entry/entry_32.S | 37 ++++++++++
arch/x86/entry/entry_64.S | 41 ++++++++++-
arch/x86/include/asm/processor.h | 3 -
arch/x86/include/asm/switch_to.h | 139 ++++++-------------------------------
arch/x86/include/asm/thread_info.h | 2 -
arch/x86/kernel/asm-offsets.c | 6 ++
arch/x86/kernel/asm-offsets_32.c | 5 ++
arch/x86/kernel/asm-offsets_64.c | 5 ++
arch/x86/kernel/process_32.c | 9 ++-
arch/x86/kernel/process_64.c | 9 ++-
arch/x86/kernel/smpboot.c | 1 -
11 files changed, 125 insertions(+), 132 deletions(-)
index ec689c6..886d5ea 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -2,8 +2,12 @@
#define _ASM_X86_SWITCH_TO_H

struct task_struct; /* one of the stranger aspects of C forward declarations */
+
+struct task_struct *__switch_to_asm(struct task_struct *prev,
+ struct task_struct *next);
+
__visible struct task_struct *__switch_to(struct task_struct *prev,
- struct task_struct *next);
+ struct task_struct *next);
struct tss_struct;
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
struct tss_struct *tss);
@@ -32,131 +36,30 @@ static inline void prepare_switch_to(struct task_struct *prev,
prepare_switch_to(prev, next); \
- prepare_switch_to(prev, next); \
- \
index c85d2c6..7e52f83 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -942,7 +942,6 @@ void common_cpu_up(unsigned int cpu, struct task_struct *idle)

tip-bot for Brian Gerst

Aug 24, 2016, 9:20:05 AM
Commit-ID: ffcb043ba524d3fbd979a9dac2c9ce8ad352000d
Gitweb: http://git.kernel.org/tip/ffcb043ba524d3fbd979a9dac2c9ce8ad352000d
Author: Brian Gerst <brg...@gmail.com>
AuthorDate: Sat, 13 Aug 2016 12:38:21 -0400
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 24 Aug 2016 12:31:51 +0200

sched/x86: Fix thread_saved_pc()

thread_saved_pc() was using a completely bogus method to get the return
address. Since switch_to() was previously inlined, there was no sane way
to know where on the stack the return address was stored. Now with the
frame of a sleeping thread well defined, this can be implemented correctly.

Signed-off-by: Brian Gerst <brg...@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoi...@redhat.com>
Cc: Andy Lutomirski <lu...@kernel.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Denys Vlasenko <dvla...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-7-g...@gmail.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>

tip-bot for Brian Gerst
Aug 24, 2016, 9:20:13 AM
Commit-ID: 7b32aeadbc95d4a41402c1c0da6aa3ab51af4c10
Gitweb: http://git.kernel.org/tip/7b32aeadbc95d4a41402c1c0da6aa3ab51af4c10
Author: Brian Gerst <brg...@gmail.com>
AuthorDate: Sat, 13 Aug 2016 12:38:18 -0400
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 24 Aug 2016 12:27:41 +0200

sched/x86: Add 'struct inactive_task_frame' to better document the sleeping task stack frame

Add 'struct inactive_task_frame', which defines the layout of the stack for
a sleeping process. For now, the only defined field is the BP register
(frame pointer).

Signed-off-by: Brian Gerst <brg...@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoi...@redhat.com>
Cc: Andy Lutomirski <lu...@kernel.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Denys Vlasenko <dvla...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-4-g...@gmail.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
arch/x86/include/asm/stacktrace.h | 4 ++--
arch/x86/include/asm/switch_to.h | 5 +++++
arch/x86/kernel/kgdb.c | 3 ++-
arch/x86/kernel/process.c | 3 ++-
4 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 0944218..7646fb2 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -8,6 +8,7 @@

#include <linux/uaccess.h>
#include <linux/ptrace.h>
+#include <asm/switch_to.h>

extern int kstack_depth_to_print;

@@ -70,8 +71,7 @@ stack_frame(struct task_struct *task, struct pt_regs *regs)
return bp;
}

- /* bp is the last reg pushed by switch_to */
- return *(unsigned long *)task->thread.sp;
+ return ((struct inactive_task_frame *)task->thread.sp)->bp;
}
#else
static inline unsigned long
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 14e4b20..ec689c6 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -30,6 +30,11 @@ static inline void prepare_switch_to(struct task_struct *prev,
#endif

tip-bot for Brian Gerst
Aug 24, 2016, 9:50:06 AM
Commit-ID: 163630191ecb0dd9e4146d3c910045aba1cfeec1
Gitweb: http://git.kernel.org/tip/163630191ecb0dd9e4146d3c910045aba1cfeec1
Author: Brian Gerst <brg...@gmail.com>
AuthorDate: Sat, 13 Aug 2016 12:38:17 -0400
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 24 Aug 2016 12:27:40 +0200

sched/x86/64, kgdb: Clear GDB_PS on 64-bit

switch_to() no longer saves EFLAGS, so it's bogus to look for it on the
stack. Set it to zero like 32-bit.

Signed-off-by: Brian Gerst <brg...@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoi...@redhat.com>
Cc: Andy Lutomirski <lu...@kernel.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Denys Vlasenko <dvla...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Jason Wessel <jason....@windriver.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-3-g...@gmail.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>

tip-bot for Brian Gerst
Aug 24, 2016, 9:50:07 AM
Commit-ID: 01175255fd8e3e993353a779f819ec8c0c59137e
Gitweb: http://git.kernel.org/tip/01175255fd8e3e993353a779f819ec8c0c59137e
Author: Brian Gerst <brg...@gmail.com>
AuthorDate: Sat, 13 Aug 2016 12:38:22 -0400
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 24 Aug 2016 12:31:51 +0200

sched: Remove __schedule() non-standard frame annotation

Now that the x86 switch_to() uses the standard C calling convention,
the STACK_FRAME_NON_STANDARD() annotation is no longer needed.

Suggested-by: Josh Poimboeuf <jpoi...@redhat.com>
Signed-off-by: Brian Gerst <brg...@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoi...@redhat.com>
Cc: Andy Lutomirski <lu...@kernel.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Denys Vlasenko <dvla...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-8-g...@gmail.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
kernel/sched/core.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a906f2..3d91b63dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3381,7 +3381,6 @@ static void __sched notrace __schedule(bool preempt)