
New scheduler for OpenBSD


Michal Mazurek

Mar 12, 2016, 11:44:46 AM
Gregor Best attempted to improve the scheduler in 2011:
http://comments.gmane.org/gmane.os.openbsd.tech/27059
Here is another attempt; it takes up where the previous one left off.

This is also mostly based on the main idea behind Linux's CFS and
BFS schedulers. I found BFS to be the more clearly described of the two:
http://ck.kolivas.org/patches/bfs/4.0/4.3/4.3-sched-bfs-467.patch


Some notes:

Chrome is still not very usable.

Much more work is needed, e.g. there is some MD code on sparc64 and
alpha that depends on spc_schedticks and needs to be understood and
rewritten.

Maybe using RB trees to queue what is usually no more than 5 elements
is overkill.

p_usrpri and p_priority will go away, so userland utilities like 'ps'
will need to be changed.

I also want to try and see if implementing one shared queue, instead of
keeping one queue per cpu, will improve performance even further. Right
now there are some heuristics to determine whether a process should
switch cpus. This doesn't work very well yet; in my tests with the
attached code, one queue was sometimes a second behind another. From
what I understand, a single shared queue is the idea behind BFS and the
reason why it doesn't scale to 4096 CPUs. I see that OpenBSD supports
256 CPUs on sparc64:
./arch/sparc64/include/cpu.h:#define MAXCPUS 256

Nice is not yet supported, but it's easy to add (uncomment and modify
'offset' in sched_deadline()). I didn't want to add it yet, to test just
how fair this method is. Currently all processes are equal.
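
For illustration, here is roughly how 'offset' could be wired up,
reusing the nice_mul[] table from the diff below. This is only a
sketch: the 100ms base matches the commented-out value in
sched_deadline(), but the scaling is untested, and it assumes ps_nice
always indexes within the 40-entry table:

    void
    sched_deadline(struct proc *p)
    {
            struct timespec now;
            u_int64_t niffies;
            /* 100ms base offset, ~10% more per nice level */
            u_int64_t offset = 100ULL * 1000 * 1000 *
                nice_mul[p->p_p->ps_nice] / 1024;

            nanouptime(&now);
            niffies = now.tv_sec * (1000 * 1000 * 1000) + now.tv_nsec;

            p->p_deadline = niffies + offset;
            p->p_rrticks = rrticks_init;
    }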

Other operating systems provide a couple of different schedulers to
choose from; this diff overwrites the 4BSD scheduler with the new code
and leaves no choice.

With the new code I got some visible improvements when compiling the
kernel:
old: 1m46.74s real 5m08.48s user 1m27.37s system
old: 1m46.61s real 5m06.93s user 1m28.13s system
old: 1m49.91s real 5m19.71s user 1m28.25s system
new: 1m38.05s real 5m12.16s user 0m55.86s system
new: 1m38.70s real 5m12.36s user 0m57.82s system
new: 1m39.10s real 5m16.06s user 0m56.91s system

I would be interested in hearing if the new code broke something for
anyone.

Is there a chance for a scheduler rewrite like this to be committed?


Index: sys/kern/init_main.c
===================================================================
RCS file: /cvs/src/sys/kern/init_main.c,v
retrieving revision 1.248
diff -u -p -r1.248 init_main.c
--- sys/kern/init_main.c 3 Jan 2016 00:15:59 -0000 1.248
+++ sys/kern/init_main.c 12 Mar 2016 15:48:36 -0000
@@ -265,6 +265,8 @@ main(void *framep)
*/
pr = &process0;
process_initialize(pr, p);
+ p->p_deadline = 0;
+ p->p_rrticks = 10;

LIST_INSERT_HEAD(&allprocess, pr, ps_list);
atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);
@@ -554,6 +556,7 @@ main(void *framep)
/*
* proc0: nothing to do, back to sleep
*/
+ printf("*** modified scheduler ***\n");
while (1)
tsleep(&proc0, PVM, "scheduler", 0);
/* NOTREACHED */
Index: sys/kern/kern_clock.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_clock.c,v
retrieving revision 1.88
diff -u -p -r1.88 kern_clock.c
--- sys/kern/kern_clock.c 11 Jun 2015 16:03:04 -0000 1.88
+++ sys/kern/kern_clock.c 12 Mar 2016 15:48:36 -0000
@@ -180,7 +180,7 @@ hardclock(struct clockframe *frame)
if (stathz == 0)
statclock(frame);

- if (--ci->ci_schedstate.spc_rrticks <= 0)
+ if (p && (--(p->p_rrticks) <= 0))
roundrobin(ci);

/*
@@ -398,15 +398,7 @@ statclock(struct clockframe *frame)

if (p != NULL) {
p->p_cpticks++;
- /*
- * If no schedclock is provided, call it here at ~~12-25 Hz;
- * ~~16 Hz is best
- */
- if (schedhz == 0) {
- if ((++curcpu()->ci_schedstate.spc_schedticks & 3) ==
- 0)
- schedclock(p);
- }
+ ++curcpu()->ci_schedstate.spc_schedticks;
}
}

Index: sys/kern/kern_fork.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_fork.c,v
retrieving revision 1.184
diff -u -p -r1.184 kern_fork.c
--- sys/kern/kern_fork.c 9 Oct 2015 01:10:27 -0000 1.184
+++ sys/kern/kern_fork.c 12 Mar 2016 15:48:36 -0000
@@ -498,6 +498,7 @@ fork1(struct proc *curp, int flags, void
/*
* Make child runnable and add to run queue.
*/
+ sched_deadline(p);
if ((flags & FORK_IDLE) == 0) {
SCHED_LOCK(s);
p->p_stat = SRUN;
Index: sys/kern/kern_resource.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_resource.c,v
retrieving revision 1.55
diff -u -p -r1.55 kern_resource.c
--- sys/kern/kern_resource.c 5 Dec 2015 10:11:53 -0000 1.55
+++ sys/kern/kern_resource.c 12 Mar 2016 15:48:36 -0000
@@ -178,8 +178,6 @@ int
donice(struct proc *curp, struct process *chgpr, int n)
{
struct ucred *ucred = curp->p_ucred;
- struct proc *p;
- int s;

if (ucred->cr_uid != 0 && ucred->cr_ruid != 0 &&
ucred->cr_uid != chgpr->ps_ucred->cr_uid &&
@@ -193,10 +191,13 @@ donice(struct proc *curp, struct process
if (n < chgpr->ps_nice && suser(curp, 0))
return (EACCES);
chgpr->ps_nice = n;
+ /* XXXNICE */
+ /*
SCHED_LOCK(s);
TAILQ_FOREACH(p, &chgpr->ps_threads, p_thr_link)
(void)resetpriority(p);
SCHED_UNLOCK(s);
+ */
return (0);
}

Index: sys/kern/kern_sched.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.41
diff -u -p -r1.41 kern_sched.c
--- sys/kern/kern_sched.c 23 Dec 2015 14:51:17 -0000 1.41
+++ sys/kern/kern_sched.c 12 Mar 2016 15:48:36 -0000
@@ -33,6 +33,21 @@ void sched_kthreads_create(void *);
int sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p);
struct proc *sched_steal_proc(struct cpu_info *);

+int
+sched_cmp_proc(struct proc *e1, struct proc *e2)
+{
+ if (e1->p_deadline == e2->p_deadline) {
+ /*
+ * this implementation of rb trees needs the keys to be unique
+ * XXX compare pointers for now, UB
+ */
+ return (e1 < e2 ? -1 : e1 > e2);
+ }
+ return (e1->p_deadline < e2->p_deadline ? -1 : e1->p_deadline > e2->p_deadline);
+}
+RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
+RB_GENERATE(prochead2, proc, p_runq2, sched_cmp_proc);
+
/*
* To help choosing which cpu should run which process we keep track
* of cpus which are currently idle and which cpus have processes
@@ -81,10 +96,8 @@ void
sched_init_cpu(struct cpu_info *ci)
{
struct schedstate_percpu *spc = &ci->ci_schedstate;
- int i;

- for (i = 0; i < SCHED_NQS; i++)
- TAILQ_INIT(&spc->spc_qs[i]);
+ RB_INIT(&spc->spc_rq);

spc->spc_idleproc = NULL;

@@ -238,14 +251,13 @@ void
setrunqueue(struct proc *p)
{
struct schedstate_percpu *spc;
- int queue = p->p_priority >> 2;

SCHED_ASSERT_LOCKED();
spc = &p->p_cpu->ci_schedstate;
spc->spc_nrun++;

- TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
- spc->spc_whichqs |= (1 << queue);
+ RB_INSERT(prochead2, &spc->spc_rq, p);
+ spc->spc_whichqs = 1;
cpuset_add(&sched_queued_cpus, p->p_cpu);

if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
@@ -256,17 +268,15 @@ void
remrunqueue(struct proc *p)
{
struct schedstate_percpu *spc;
- int queue = p->p_priority >> 2;

SCHED_ASSERT_LOCKED();
spc = &p->p_cpu->ci_schedstate;
spc->spc_nrun--;

- TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
- if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
- spc->spc_whichqs &= ~(1 << queue);
- if (spc->spc_whichqs == 0)
- cpuset_del(&sched_queued_cpus, p->p_cpu);
+ RB_REMOVE(prochead2, &spc->spc_rq, p);
+ if (RB_EMPTY(&spc->spc_rq)) {
+ spc->spc_whichqs = 0;
+ cpuset_del(&sched_queued_cpus, p->p_cpu);
}
}

@@ -282,13 +292,11 @@ sched_chooseproc(void)
#ifdef MULTIPROCESSOR
if (spc->spc_schedflags & SPCF_SHOULDHALT) {
if (spc->spc_whichqs) {
- for (queue = 0; queue < SCHED_NQS; queue++) {
- while ((p = TAILQ_FIRST(&spc->spc_qs[queue]))) {
- remrunqueue(p);
- p->p_cpu = sched_choosecpu(p);
- KASSERT(p->p_cpu != curcpu());
- setrunqueue(p);
- }
+ while ((p = RB_MIN(prochead2, &spc->spc_rq)) != NULL) {
+ remrunqueue(p);
+ p->p_cpu = sched_choosecpu(p);
+ KASSERT(p->p_cpu != curcpu());
+ setrunqueue(p);
}
}
p = spc->spc_idleproc;
@@ -302,7 +310,7 @@ sched_chooseproc(void)
again:
if (spc->spc_whichqs) {
queue = ffs(spc->spc_whichqs) - 1;
- p = TAILQ_FIRST(&spc->spc_qs[queue]);
+ p = RB_MIN(prochead2, &spc->spc_rq);
remrunqueue(p);
sched_noidle++;
KASSERT(p->p_stat == SRUN);
@@ -478,7 +486,7 @@ sched_steal_proc(struct cpu_info *self)
spc = &ci->ci_schedstate;

queue = ffs(spc->spc_whichqs) - 1;
- TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
+ RB_FOREACH(p, prochead2, &spc->spc_rq) {
if (p->p_flag & P_CPUPEG)
continue;

@@ -550,8 +558,9 @@ sched_proc_to_cpu_cost(struct cpu_info *
* and the higher the priority of the proc.
*/
if (!cpuset_isset(&sched_idle_cpus, ci)) {
- cost += (p->p_priority - spc->spc_curpriority) *
- sched_cost_priority;
+ /* 100ms difference? */
+ cost += ((p->p_deadline - spc->spc_curdeadline) / 100000000)
+ * sched_cost_priority;
cost += sched_cost_runnable;
}
if (cpuset_isset(&sched_queued_cpus, ci))
Index: sys/kern/kern_sig.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.193
diff -u -p -r1.193 kern_sig.c
--- sys/kern/kern_sig.c 9 Mar 2016 13:38:50 -0000 1.193
+++ sys/kern/kern_sig.c 12 Mar 2016 15:48:36 -0000
@@ -1894,6 +1894,7 @@ userret(struct proc *p)
}

p->p_cpu->ci_schedstate.spc_curpriority = p->p_priority = p->p_usrpri;
+ p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
}

int
Index: sys/kern/kern_synch.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.129
diff -u -p -r1.129 kern_synch.c
--- sys/kern/kern_synch.c 9 Mar 2016 13:38:50 -0000 1.129
+++ sys/kern/kern_synch.c 12 Mar 2016 15:48:36 -0000
@@ -274,6 +274,7 @@ sleep_finish(struct sleep_state *sls, in
#endif

p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
+ p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
SCHED_UNLOCK(sls->sls_s);

/*
Index: sys/kern/sched_bsd.c
===================================================================
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.43
diff -u -p -r1.43 sched_bsd.c
--- sys/kern/sched_bsd.c 9 Mar 2016 13:38:50 -0000 1.43
+++ sys/kern/sched_bsd.c 12 Mar 2016 15:48:36 -0000
@@ -61,7 +61,6 @@ struct __mp_lock sched_lock;
#endif

void schedcpu(void *);
-void updatepri(struct proc *);

void
scheduler_start(void)
@@ -91,6 +90,7 @@ roundrobin(struct cpu_info *ci)
spc->spc_rrticks = rrticks_init;

if (ci->ci_curproc != NULL) {
+ sched_deadline(ci->ci_curproc);
if (spc->spc_schedflags & SPCF_SEENRR) {
/*
* The process has already been through a roundrobin
@@ -109,74 +109,6 @@ roundrobin(struct cpu_info *ci)
need_resched(ci);
}

-/*
- * Constants for digital decay and forget:
- * 90% of (p_estcpu) usage in 5 * loadav time
- * 95% of (p_pctcpu) usage in 60 seconds (load insensitive)
- * Note that, as ps(1) mentions, this can let percentages
- * total over 100% (I've seen 137.9% for 3 processes).
- *
- * Note that hardclock updates p_estcpu and p_cpticks independently.
- *
- * We wish to decay away 90% of p_estcpu in (5 * loadavg) seconds.
- * That is, the system wants to compute a value of decay such
- * that the following for loop:
- * for (i = 0; i < (5 * loadavg); i++)
- * p_estcpu *= decay;
- * will compute
- * p_estcpu *= 0.1;
- * for all values of loadavg:
- *
- * Mathematically this loop can be expressed by saying:
- * decay ** (5 * loadavg) ~= .1
- *
- * The system computes decay as:
- * decay = (2 * loadavg) / (2 * loadavg + 1)
- *
- * We wish to prove that the system's computation of decay
- * will always fulfill the equation:
- * decay ** (5 * loadavg) ~= .1
- *
- * If we compute b as:
- * b = 2 * loadavg
- * then
- * decay = b / (b + 1)
- *
- * We now need to prove two things:
- * 1) Given factor ** (5 * loadavg) ~= .1, prove factor == b/(b+1)
- * 2) Given b/(b+1) ** power ~= .1, prove power == (5 * loadavg)
- *
- * Facts:
- * For x close to zero, exp(x) =~ 1 + x, since
- * exp(x) = 0! + x**1/1! + x**2/2! + ... .
- * therefore exp(-1/b) =~ 1 - (1/b) = (b-1)/b.
- * For x close to zero, ln(1+x) =~ x, since
- * ln(1+x) = x - x**2/2 + x**3/3 - ... -1 < x < 1
- * therefore ln(b/(b+1)) = ln(1 - 1/(b+1)) =~ -1/(b+1).
- * ln(.1) =~ -2.30
- *
- * Proof of (1):
- * Solve (factor)**(power) =~ .1 given power (5*loadav):
- * solving for factor,
- * ln(factor) =~ (-2.30/5*loadav), or
- * factor =~ exp(-1/((5/2.30)*loadav)) =~ exp(-1/(2*loadav)) =
- * exp(-1/b) =~ (b-1)/b =~ b/(b+1). QED
- *
- * Proof of (2):
- * Solve (factor)**(power) =~ .1 given factor == (b/(b+1)):
- * solving for power,
- * power*ln(b/(b+1)) =~ -2.30, or
- * power =~ 2.3 * (b + 1) = 4.6*loadav + 2.3 =~ 5*loadav. QED
- *
- * Actual power values for the implemented algorithm are as follows:
- * loadav: 1 2 3 4
- * power: 5.68 10.32 14.94 19.55
- */
-
-/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
-#define loadfactor(loadav) (2 * (loadav))
-#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
-
/* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */

@@ -197,17 +129,40 @@ fixpt_t ccpu = 0.95122942450071400909 *
/*
* Recompute process priorities, every second.
*/
+RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
void
schedcpu(void *arg)
{
struct timeout *to = (struct timeout *)arg;
- fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
struct proc *p;
int s;
- unsigned int newcpu;
int phz;

/*
+ CPU_INFO_ITERATOR cii;
+ struct cpu_info *ci;
+ int i = 0;
+ CPU_INFO_FOREACH(cii, ci) {
+ i++;
+ printf("%d: ", i);
+ if (ci->ci_curproc != NULL) {
+ printf("[%s %llu %d] ",
+ ci->ci_curproc->p_comm,
+ ci->ci_curproc->p_deadline,
+ ci->ci_curproc->p_rrticks);
+
+ }
+ RB_FOREACH(p, prochead2, &ci->ci_schedstate.spc_rq) {
+ printf("<%s %llu %d>",
+ p->p_comm,
+ p->p_deadline,
+ p->p_rrticks);
+ }
+ printf("\n");
+ }
+ */
+
+ /*
* If we have a statistics clock, use that to calculate CPU
* time, otherwise revert to using the profiling clock (which,
* in turn, defaults to hz if there is no separate profiling
@@ -243,19 +198,6 @@ schedcpu(void *arg)
(p->p_cpticks * FSCALE / phz)) >> FSHIFT;
#endif
p->p_cpticks = 0;
- newcpu = (u_int) decay_cpu(loadfac, p->p_estcpu);
- p->p_estcpu = newcpu;
- resetpriority(p);
- if (p->p_priority >= PUSER) {
- if (p->p_stat == SRUN &&
- (p->p_priority / SCHED_PPQ) !=
- (p->p_usrpri / SCHED_PPQ)) {
- remrunqueue(p);
- p->p_priority = p->p_usrpri;
- setrunqueue(p);
- } else
- p->p_priority = p->p_usrpri;
- }
SCHED_UNLOCK(s);
}
uvm_meter();
@@ -264,27 +206,32 @@ schedcpu(void *arg)
}

/*
- * Recalculate the priority of a process after it has slept for a while.
- * For all load averages >= 1 and max p_estcpu of 255, sleeping for at
- * least six times the loadfactor will decay p_estcpu to zero.
+ * 10% more for every nice level. scaled by a power of 2, these are
+ * later divided
*/
+int nice_mul[] = {
+ 1024, 1126, 1239, 1362, 1499,
+ 1649, 1814, 1995, 2195, 2414,
+ 2655, 2921, 3213, 3535, 3888,
+ 4277, 4705, 5175, 5693, 6262,
+ 6888, 7577, 8335, 9169, 10086,
+ 11094, 12204, 13424, 14767, 16243,
+ 17868, 19655, 21620, 23782, 26160,
+ 28776, 31654, 34820, 38302, 42132
+};
+
void
-updatepri(struct proc *p)
+sched_deadline(struct proc *p)
{
- unsigned int newcpu = p->p_estcpu;
- fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
+ u_int64_t niffies;
+ /* u_int64_t offset = 100 * 1000 * 1000; */
+ struct timespec now;

- SCHED_ASSERT_LOCKED();
+ nanouptime(&now);
+ niffies = now.tv_sec * (1000 * 1000 * 1000) + now.tv_nsec;

- if (p->p_slptime > 5 * loadfac)
- p->p_estcpu = 0;
- else {
- p->p_slptime--; /* the first time was done in schedcpu */
- while (newcpu && --p->p_slptime)
- newcpu = (int) decay_cpu(loadfac, newcpu);
- p->p_estcpu = newcpu;
- }
- resetpriority(p);
+ p->p_deadline = niffies; // + offset;
+ p->p_rrticks = rrticks_init;
}

/*
@@ -326,6 +273,7 @@ preempt(struct proc *newp)

SCHED_LOCK(s);
p->p_priority = p->p_usrpri;
+ sched_deadline(p);
p->p_stat = SRUN;
p->p_cpu = sched_choosecpu(p);
setrunqueue(p);
@@ -473,7 +421,7 @@ resched_proc(struct proc *p, u_char pri)
* sched state, which we currently do not do.
*/
ci = (p->p_cpu != NULL) ? p->p_cpu : curcpu();
- if (pri < ci->ci_schedstate.spc_curpriority)
+ if (ci->ci_curproc && p->p_deadline < ci->ci_curproc->p_deadline)
need_resched(ci);
}

@@ -509,56 +457,8 @@ setrunnable(struct proc *p)
p->p_stat = SRUN;
p->p_cpu = sched_choosecpu(p);
setrunqueue(p);
- if (p->p_slptime > 1)
- updatepri(p);
p->p_slptime = 0;
resched_proc(p, p->p_priority);
-}
-
-/*
- * Compute the priority of a process when running in user mode.
- * Arrange to reschedule if the resulting priority is better
- * than that of the current process.
- */
-void
-resetpriority(struct proc *p)
-{
- unsigned int newpriority;
-
- SCHED_ASSERT_LOCKED();
-
- newpriority = PUSER + p->p_estcpu +
- NICE_WEIGHT * (p->p_p->ps_nice - NZERO);
- newpriority = min(newpriority, MAXPRI);
- p->p_usrpri = newpriority;
- resched_proc(p, p->p_usrpri);
-}
-
-/*
- * We adjust the priority of the current process. The priority of a process
- * gets worse as it accumulates CPU time. The cpu usage estimator (p_estcpu)
- * is increased here. The formula for computing priorities (in kern_synch.c)
- * will compute a different value each time p_estcpu increases. This can
- * cause a switch, but unless the priority crosses a PPQ boundary the actual
- * queue will not change. The cpu usage estimator ramps up quite quickly
- * when the process is running (linearly), and decays away exponentially, at
- * a rate which is proportionally slower when the system is busy. The basic
- * principle is that the system will 90% forget that the process used a lot
- * of CPU time in 5 * loadav seconds. This causes the system to favor
- * processes which haven't run much recently, and to round-robin among other
- * processes.
- */
-void
-schedclock(struct proc *p)
-{
- int s;
-
- SCHED_LOCK(s);
- p->p_estcpu = ESTCPULIM(p->p_estcpu + 1);
- resetpriority(p);
- if (p->p_priority >= PUSER)
- p->p_priority = p->p_usrpri;
- SCHED_UNLOCK(s);
}

void (*cpu_setperf)(int);
Index: sys/sys/proc.h
===================================================================
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.217
diff -u -p -r1.217 proc.h
--- sys/sys/proc.h 9 Mar 2016 13:38:50 -0000 1.217
+++ sys/sys/proc.h 12 Mar 2016 15:48:37 -0000
@@ -44,6 +44,7 @@
#include <sys/selinfo.h> /* For struct selinfo */
#include <sys/syslimits.h> /* For LOGIN_NAME_MAX */
#include <sys/queue.h>
+#include <sys/tree.h>
#include <sys/timeout.h> /* For struct timeout */
#include <sys/event.h> /* For struct klist */
#include <sys/mutex.h> /* For struct mutex */
@@ -267,6 +268,7 @@ struct process {

struct proc {
TAILQ_ENTRY(proc) p_runq;
+ RB_ENTRY(proc) p_runq2;
LIST_ENTRY(proc) p_list; /* List of all threads. */

struct process *p_p; /* The process of this thread. */
@@ -320,6 +322,8 @@ struct proc {

/* The following fields are all copied upon creation in fork. */
#define p_startcopy p_sigmask
+ u_int64_t p_deadline;
+ int p_rrticks;
sigset_t p_sigmask; /* Current signal mask. */

u_char p_priority; /* Process priority. */
@@ -488,6 +492,7 @@ void fixjobc(struct process *, struct pg
int inferior(struct process *, struct process *);
void leavepgrp(struct process *);
void preempt(struct proc *);
+void sched_deadline(struct proc *);
void pgdelete(struct pgrp *);
void procinit(void);
void resetpriority(struct proc *);
@@ -570,4 +575,3 @@ struct cpu_info *cpuset_first(struct cpu

#endif /* _KERNEL */
#endif /* !_SYS_PROC_H_ */
-
Index: sys/sys/sched.h
===================================================================
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.40
diff -u -p -r1.40 sched.h
--- sys/sys/sched.h 9 Mar 2016 13:38:50 -0000 1.40
+++ sys/sys/sched.h 12 Mar 2016 15:48:37 -0000
@@ -70,6 +70,7 @@
#define _SYS_SCHED_H_

#include <sys/queue.h>
+#include <sys/tree.h>

/*
* Posix defines a <sched.h> which may want to include <sys/sched.h>
@@ -99,6 +100,7 @@ struct schedstate_percpu {
u_int spc_schedticks; /* ticks for schedclock() */
u_int64_t spc_cp_time[CPUSTATES]; /* CPU state statistics */
u_char spc_curpriority; /* usrpri of curproc */
+ u_int64_t spc_curdeadline;
int spc_rrticks; /* ticks until roundrobin() */
int spc_pscnt; /* prof/stat counter */
int spc_psdiv; /* prof/stat divisor */
@@ -109,6 +111,8 @@ struct schedstate_percpu {

TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
volatile uint32_t spc_whichqs;
+
+ RB_HEAD(prochead2, proc) spc_rq;

#ifdef notyet
struct proc *spc_reaper; /* dead proc reaper */
--
Michal Mazurek

Juan Francisco Cantero Hurtado

Mar 12, 2016, 3:16:28 PM
On Sat, Mar 12, 2016 at 05:36:21PM +0100, Michal Mazurek wrote:
> Gregor Best attempted to improve the scheduler in 2011:
> http://comments.gmane.org/gmane.os.openbsd.tech/27059
> Here is another attempt, it takes up where the previous one left off.
>
> This is also mostly based on the main idea behind Linux CFS or
> BFS. I found BFS to be described more clearly:
> http://ck.kolivas.org/patches/bfs/4.0/4.3/4.3-sched-bfs-467.patch
>
>
> Some notes:
>
> Chrome is still not very usable.

Chrome is not the best benchmark for the scheduler ATM.

Until the work in malloc to enhance the support for multithreaded
programs is finished, forget Chrome and Firefox.
That is going to make people running bulk builds happy :)
--
Juan Francisco Cantero Hurtado http://juanfra.info

Juan Francisco Cantero Hurtado

Mar 12, 2016, 10:45:59 PM
From time to time, I run a batch video conversion. I tried the same
commands with and without your patch. Your patch slows down the
frames per second by about 30%.

Here are the commands:
$ download some 20-30 minute videos in mp4 format (youtube is fine for
this)
$ install the packages ffmpeg and libx264
$ mkdir output
$ for i in *.mp4; do ffmpeg -y -i "$i" -vf 'scale=320:-1' -map 0:v:0 -map 0:a:0 -sws_flags spline -map_metadata -1 -c:v libx264 -profile:v baseline -b:v 360k -c:a aac -strict -2 -ar 44100 -ac 2 -b:a 128k -movflags faststart "output/$i"; done

Martin Pieuchot

Mar 13, 2016, 11:40:13 AM
On 12/03/16(Sat) 17:36, Michal Mazurek wrote:
> [...]
> Some notes:
>
> Chrome is still not very usable.

Do you want to improve the browser experience on OpenBSD? If that's
your goal, then I'd suggest you start by analysing how the browsers
behave. My personal analysis makes me believe that librthread is what
needs some love.

> Much more work is needed, e.g. there is some MD code on sparc64 and
> alpha that depends on spc_schedticks that needs to be understood and
> rewritten.
>
>
> Maybe using RB trees to queue what is usually no more than 5 elements
> is overkill.

I think it's overkill and makes your diff harder to read.

> p_usrpri and p_priority will go away, so userland utilities like 'ps'
> will need to be changed.

So what's your plan for process priorities?

> I also want to try and see if implementing one shared queue, instead of
> keeping one queue per cpu will improve performance even further. Right
> now there are some heuristics to determine whether a process should
> switch cpus. This doesn't work very well yet, in my tests with the
> attached code sometimes one queue was a second behind another. From
> what I understand that's the idea behind BFS and the reason why it
> doesn't scale to 4096 CPUs. I see that OpenBSD supports 256 CPUs on
> sparc64:
> ./arch/sparc64/include/cpu.h:#define MAXCPUS 256

"improve performance" does not say much. For which workload on which
machine? There's maybe room for improvement in this area, but you
really need more than "building a kernel" as test bed.

> Is there a chance for a scheduler rewrite like this to be commited?

Maybe, but not like this. It would be nice if you could analyse various
workloads with the existing scheduler in order to argue that some of the
decisions made could be improved, then explain how the changes you're
proposing improve things.

It's nice to see you experimenting with this and I think experimenting
is important but if your goal is to improve the existing software, you
also need to do measurements ;)

More comments on the diff below.

> Index: sys/kern/init_main.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/init_main.c,v
> retrieving revision 1.248
> diff -u -p -r1.248 init_main.c
> --- sys/kern/init_main.c 3 Jan 2016 00:15:59 -0000 1.248
> +++ sys/kern/init_main.c 12 Mar 2016 15:48:36 -0000
> @@ -265,6 +265,8 @@ main(void *framep)
> */
> pr = &process0;
> process_initialize(pr, p);
> + p->p_deadline = 0;
> + p->p_rrticks = 10;

Here you could use rrticks_init ;)

>
> LIST_INSERT_HEAD(&allprocess, pr, ps_list);
> atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);
> @@ -554,6 +556,7 @@ main(void *framep)
> /*
> * proc0: nothing to do, back to sleep
> */
> + printf("*** modified scheduler ***\n");
> while (1)
> tsleep(&proc0, PVM, "scheduler", 0);
> /* NOTREACHED */
> Index: sys/kern/kern_clock.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_clock.c,v
> retrieving revision 1.88
> diff -u -p -r1.88 kern_clock.c
> --- sys/kern/kern_clock.c 11 Jun 2015 16:03:04 -0000 1.88
> +++ sys/kern/kern_clock.c 12 Mar 2016 15:48:36 -0000
> @@ -180,7 +180,7 @@ hardclock(struct clockframe *frame)
> if (stathz == 0)
> statclock(frame);
>
> - if (--ci->ci_schedstate.spc_rrticks <= 0)
> + if (p && (--(p->p_rrticks) <= 0))
> roundrobin(ci);

That's an interesting change. Why did you decide to move a per-CPU
counter to a per-process one? Can you explain the effect it will have?
Or even better can you measure its effect on different workloads?
Can you explain how you chose this period of time and what it is
supposed to represent?
What's happening to p_slptime?

Amit Kulkarni

Mar 13, 2016, 1:13:52 PM
On Sat, Mar 12, 2016 at 10:36 AM, Michal Mazurek <akf...@jasminek.net>
wrote:

> Gregor Best attempted to improve the scheduler in 2011:
> http://comments.gmane.org/gmane.os.openbsd.tech/27059
> Here is another attempt, it takes up where the previous one left off.
>
> This is also mostly based on the main idea behind Linux CFS or
> BFS. I found BFS to be described more clearly:
> http://ck.kolivas.org/patches/bfs/4.0/4.3/4.3-sched-bfs-467.patch
>
>
> Some notes:
>
> Chrome is still not very usable.
>
> Much more work is needed, e.g. there is some MD code on sparc64 and
> alpha that depends on spc_schedticks that needs to be understood and
> rewritten.
>
> Maybe using RB trees to queue what is usually no more than 5 elements
> is overkill.
>
> p_usrpri and p_priority will go away, so userland utilities like 'ps'
> will need to be changed.
>
> I also want to try and see if implementing one shared queue, instead of
> keeping one queue per cpu will improve performance even further. Right
> now there are some heuristics to determine whether a process should
> switch cpus. This doesn't work very well yet, in my tests with the
> attached code sometimes one queue was a second behind another. From
> what I understand that's the idea behind BFS and the reason why it
> doesn't scale to 4096 CPUs. I see that OpenBSD supports 256 CPUs on
> sparc64:
> ./arch/sparc64/include/cpu.h:#define MAXCPUS 256
>
>
Hi Michal,
One shared queue is bad as the number of CPUs increases, since it
effectively becomes a global lock.
Thanks

Michal Mazurek

Mar 14, 2016, 11:17:57 AM
On 04:41:05, 13.03.16, Juan Francisco Cantero Hurtado wrote:
> Here are the commands:
> ...
> ffmpeg
> ...

Thank you for this.

ffmpeg runs differently from gcc or make - it creates a lot of threads.
I can verify that it is indeed slower. Instead of spending 2 seconds in
'system', it spends 30 or 40 seconds there on the new scheduler.

After some tests I realised that ffmpeg, when running on the present BSD
scheduler, will call sys_sched_yield() 321,068 times. When running on the
new scheduler, it will call sys_sched_yield() 2,507,894 times - nearly
eight times as many. The bulk of the calls are made from this function:

lib/librthread/rthread.c:
void
_spinlock(volatile struct _spinlock *lock)
{
while (_atomic_lock(&lock->ticket))
sched_yield();
}

To verify I replaced sched_yield() with usleep():

void
_spinlock(volatile struct _spinlock *lock)
{
while (_atomic_lock(&lock->ticket))
usleep(1);
}

The number of calls to yield() dropped to 4,576.



I had a look at how the Linux BFS scheduler solves this problem, and I
saw this:
/**
* yield - yield the current processor to other threads.
*
* Do not ever use this function, there's a 99% chance you're doing it wrong.
*
* The scheduler is at all times free to pick the calling task as the most
* eligible task to run, if removing the yield() call from your code breaks
* it, its already broken.
*
* Typical broken usage is:
*
* while (!event)
* yield();
*
* where one assumes that yield() will let 'the other' process run that will
* make event true. If the current task is a SCHED_FIFO task that will never
* happen. Never use yield() as a progress guarantee!!
*
* If you want to use yield() to wait for something, use wait_event().
* If you want to use yield() to be 'nice' for others, use cond_resched().
* If you still want to use yield(), do not!
*/
void __sched yield(void)
{
set_current_state(TASK_RUNNING);
sys_sched_yield();
}
EXPORT_SYMBOL(yield);


This is where I get stuck: I don't know how to replace the call to
sched_yield(), or whether it is a good idea to do so. Any advice?
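
For reference, one direction (a sketch only; the threshold of 1000
spins is an arbitrary, untested value) would be to retry in userland
for a bounded number of iterations before falling back to sched_yield():

    void
    _spinlock(volatile struct _spinlock *lock)
    {
            int spins = 0;

            while (_atomic_lock(&lock->ticket)) {
                    /* retry in userland before giving up the cpu */
                    if (++spins >= 1000) {
                            sched_yield();
                            spins = 0;
                    }
            }
    }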

--
Michal Mazurek

Martin Pieuchot

Mar 14, 2016, 11:36:26 AM
This is really similar to what I observed with Firefox and Chrome.

> This is where I get stuck, I don't know how to replace the call to
> sched_yield(), or whether it is a good idea to do so. Any advice?

I'd assume the problem is not in the _spinlock() implementation itself
but rather in the code calling this lock. Do you know which code is
triggering this?

See r1.13 of lib/librthread/rthread_libc.c; this is an example where
the use of a spinlock created similar symptoms.

Michal Mazurek

Mar 14, 2016, 11:36:29 AM
On 16:35:49, 13.03.16, Martin Pieuchot wrote:
> On 12/03/16(Sat) 17:36, Michal Mazurek wrote:
> > [...]
> > Some notes:
> >
> > Chrome is still not very usable.
>
> Are you wanting to improve the browser experience on OpenBSD? If that's
> your goal then I'd suggest you to start by analysing how the browsers
> behave. My personal analysis makes me believe that librthread is what
> needs some love.

It seems you were right; see my other email.

> > p_usrpri and p_priority will go away, so userland utilities like 'ps'
> > will need to be changed.
>
> So what's your plan for process priorities?
> ...
> "improve performance" does not say much. For which workload on which
> machine? There's maybe room for improvement in this area, but you
> really need more than "building a kernel" as test bed.

Replace them with "deadlines", to achieve "fair" scheduling. This should
provide superior scheduling for interactive applications. We'll see :)

> > - if (--ci->ci_schedstate.spc_rrticks <= 0)
> > + if (p && (--(p->p_rrticks) <= 0))
> > roundrobin(ci);
>
> That's an interesting change. Why did you decide to move a per-CPU
> counter to a per-process one? Can you explain the effect it will have?
> Or even better can you measure its effect on different workloads?

That's the supposedly genius idea behind BFS. An interactive process
can yield and let batch processes do their work. Then, when it's time to
process cursor movement, it will come back with the best priority, but
with used-up p_rrticks to prevent abuse. We'll see...

Perhaps it should be operated on as an element of the ci_schedstate
structure, similar to how spc_curpriority is a copy of p_priority now,
and copied to/from proc during context switches? I don't know yet.

> > - SCHED_ASSERT_LOCKED();
> > + nanouptime(&now);
> > + niffies = now.tv_sec * (1000 * 1000 * 1000) + now.tv_nsec;
>
> Can you explain how you chose this period of time and what it is
> supposed to represent?

These are niffies: like jiffies, but in nanoseconds. A high-precision,
monotonically increasing counter. The value itself isn't really
important. Some multipliers for 'nice' offsets will need to be
chosen, though.

> > @@ -509,56 +457,8 @@ setrunnable(struct proc *p)
> > p->p_stat = SRUN;
> > p->p_cpu = sched_choosecpu(p);
> > setrunqueue(p);
> > - if (p->p_slptime > 1)
> > - updatepri(p);
>
> What's happening to p_slptime?

It's not needed for the new scheduler. The 4BSD scheduler adjusts
priorities based on how many seconds a process has been sleeping.

--
Michal Mazurek

Alexandre Ratchov

Mar 15, 2016, 10:00:54 AM
On Sat, Mar 12, 2016 at 05:36:21PM +0100, Michal Mazurek wrote:
>
> p_usrpri and p_priority will go away, so userland utilities like 'ps'
> will need to be changed.
>

AFAIU, this would hurt interactive programs (audio, players, games,
etc). Currently i/o bound processes wake up with increased
priority and steal the cpu from cpu-bound processes. Removing
p_priority would break this mechanism.

Michal Mazurek

Mar 15, 2016, 10:10:57 AM
On 16:28:33, 14.03.16, Martin Pieuchot wrote:
> > The number of calls to yield() dropped to 4,576.
>
> This is really similar to what I observed with Firefox and Chrome.
>
> > This is where I get stuck, I don't know how to replace the call to
> > sched_yield(), or whether it is a good idea to do so. Any advice?
>
> I'd assume the problem is not in the _spinlock() implementation itself
> but rather on the code calling this lock. Do you know which code is
> triggering this?
>
> See r1.13 of lib/librthread/rthread_libc.c, this is an example where a
> the use of a spinlock created similar symptoms.

I don't know how to fix it, but in the meanwhile here is a workaround so
I can continue working on the scheduler. In yield():

+ tmp = p->p_rrticks;
+ sched_deadline(p);
+ p->p_rrticks = tmp;

So penalise a process calling yield() by giving it the worst deadline,
i.e. make it go to the end of the run queue.

Other than this, no changes.

Chrome still isn't smooth.

Please test, and let me know if the performance of something else
degrades.


Index: kern/init_main.c
===================================================================
RCS file: /cvs/src/sys/kern/init_main.c,v
retrieving revision 1.248
diff -u -p -r1.248 init_main.c
--- kern/init_main.c 3 Jan 2016 00:15:59 -0000 1.248
+++ kern/init_main.c 15 Mar 2016 13:56:58 -0000
@@ -265,6 +265,8 @@ main(void *framep)
*/
pr = &process0;
process_initialize(pr, p);
+ p->p_deadline = 0;
+ p->p_rrticks = 10;

LIST_INSERT_HEAD(&allprocess, pr, ps_list);
atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);
@@ -554,6 +556,7 @@ main(void *framep)
/*
* proc0: nothing to do, back to sleep
*/
+ printf("*** modified scheduler ***\n");
while (1)
tsleep(&proc0, PVM, "scheduler", 0);
/* NOTREACHED */
Index: kern/kern_clock.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_clock.c,v
retrieving revision 1.88
diff -u -p -r1.88 kern_clock.c
--- kern/kern_clock.c 11 Jun 2015 16:03:04 -0000 1.88
+++ kern/kern_clock.c 15 Mar 2016 13:56:58 -0000
@@ -180,7 +180,7 @@ hardclock(struct clockframe *frame)
if (stathz == 0)
statclock(frame);

- if (--ci->ci_schedstate.spc_rrticks <= 0)
+ if (p && (--(p->p_rrticks) <= 0))
roundrobin(ci);

/*
@@ -398,15 +398,7 @@ statclock(struct clockframe *frame)

if (p != NULL) {
p->p_cpticks++;
- /*
- * If no schedclock is provided, call it here at ~~12-25 Hz;
- * ~~16 Hz is best
- */
- if (schedhz == 0) {
- if ((++curcpu()->ci_schedstate.spc_schedticks & 3) ==
- 0)
- schedclock(p);
- }
+ ++curcpu()->ci_schedstate.spc_schedticks;
}
}

Index: kern/kern_fork.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_fork.c,v
retrieving revision 1.185
diff -u -p -r1.185 kern_fork.c
--- kern/kern_fork.c 11 Mar 2016 19:10:14 -0000 1.185
+++ kern/kern_fork.c 15 Mar 2016 13:56:58 -0000
@@ -498,6 +498,7 @@ fork1(struct proc *curp, int flags, void
/*
* Make child runnable and add to run queue.
*/
+ sched_deadline(p);
if ((flags & FORK_IDLE) == 0) {
SCHED_LOCK(s);
p->p_stat = SRUN;
Index: kern/kern_resource.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_resource.c,v
retrieving revision 1.55
diff -u -p -r1.55 kern_resource.c
--- kern/kern_resource.c 5 Dec 2015 10:11:53 -0000 1.55
+++ kern/kern_resource.c 15 Mar 2016 13:56:58 -0000
@@ -178,8 +178,6 @@ int
donice(struct proc *curp, struct process *chgpr, int n)
{
struct ucred *ucred = curp->p_ucred;
- struct proc *p;
- int s;

if (ucred->cr_uid != 0 && ucred->cr_ruid != 0 &&
ucred->cr_uid != chgpr->ps_ucred->cr_uid &&
@@ -193,10 +191,13 @@ donice(struct proc *curp, struct process
if (n < chgpr->ps_nice && suser(curp, 0))
return (EACCES);
chgpr->ps_nice = n;
+ /* XXXNICE */
+ /*
SCHED_LOCK(s);
TAILQ_FOREACH(p, &chgpr->ps_threads, p_thr_link)
(void)resetpriority(p);
SCHED_UNLOCK(s);
+ */
return (0);
}

Index: kern/kern_sched.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.41
diff -u -p -r1.41 kern_sched.c
--- kern/kern_sched.c 23 Dec 2015 14:51:17 -0000 1.41
+++ kern/kern_sched.c 15 Mar 2016 13:56:58 -0000
@@ -550,8 +558,6 @@ sched_proc_to_cpu_cost(struct cpu_info *
* and the higher the priority of the proc.
*/
if (!cpuset_isset(&sched_idle_cpus, ci)) {
- cost += (p->p_priority - spc->spc_curpriority) *
- sched_cost_priority;
cost += sched_cost_runnable;
}
if (cpuset_isset(&sched_queued_cpus, ci))
Index: kern/kern_sig.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.193
diff -u -p -r1.193 kern_sig.c
--- kern/kern_sig.c 9 Mar 2016 13:38:50 -0000 1.193
+++ kern/kern_sig.c 15 Mar 2016 13:56:58 -0000
@@ -1894,6 +1894,7 @@ userret(struct proc *p)
}

p->p_cpu->ci_schedstate.spc_curpriority = p->p_priority = p->p_usrpri;
+ p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
}

int
Index: kern/kern_synch.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.129
diff -u -p -r1.129 kern_synch.c
--- kern/kern_synch.c 9 Mar 2016 13:38:50 -0000 1.129
+++ kern/kern_synch.c 15 Mar 2016 13:56:58 -0000
@@ -274,6 +274,7 @@ sleep_finish(struct sleep_state *sls, in
#endif

p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
+ p->p_cpu->ci_schedstate.spc_curdeadline = p->p_deadline;
SCHED_UNLOCK(sls->sls_s);

/*
Index: kern/sched_bsd.c
===================================================================
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.43
diff -u -p -r1.43 sched_bsd.c
--- kern/sched_bsd.c 9 Mar 2016 13:38:50 -0000 1.43
+++ kern/sched_bsd.c 15 Mar 2016 13:56:58 -0000
@@ -61,7 +61,6 @@ struct __mp_lock sched_lock;
#endif

void schedcpu(void *);
-void updatepri(struct proc *);

void
scheduler_start(void)
@@ -88,9 +87,8 @@ roundrobin(struct cpu_info *ci)
{
struct schedstate_percpu *spc = &ci->ci_schedstate;

- spc->spc_rrticks = rrticks_init;
-
if (ci->ci_curproc != NULL) {
+ sched_deadline(ci->ci_curproc);
if (spc->spc_schedflags & SPCF_SEENRR) {
/*
* The process has already been through a roundrobin
@@ -109,74 +107,6 @@ roundrobin(struct cpu_info *ci)
@@ -197,14 +127,13 @@ fixpt_t ccpu = 0.95122942450071400909 *
/*
* Recompute process priorities, every second.
*/
+RB_PROTOTYPE(prochead2, proc, p_runq2, sched_cmp_proc);
void
schedcpu(void *arg)
{
struct timeout *to = (struct timeout *)arg;
- fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
struct proc *p;
int s;
- unsigned int newcpu;
int phz;

/*
@@ -243,19 +172,6 @@ schedcpu(void *arg)
(p->p_cpticks * FSCALE / phz)) >> FSHIFT;
#endif
p->p_cpticks = 0;
- newcpu = (u_int) decay_cpu(loadfac, p->p_estcpu);
- p->p_estcpu = newcpu;
- resetpriority(p);
- if (p->p_priority >= PUSER) {
- if (p->p_stat == SRUN &&
- (p->p_priority / SCHED_PPQ) !=
- (p->p_usrpri / SCHED_PPQ)) {
- remrunqueue(p);
- p->p_priority = p->p_usrpri;
- setrunqueue(p);
- } else
- p->p_priority = p->p_usrpri;
- }
SCHED_UNLOCK(s);
}
uvm_meter();
@@ -264,27 +180,32 @@ schedcpu(void *arg)
- SCHED_ASSERT_LOCKED();
+ nanouptime(&now);
+ niffies = now.tv_sec * (1000 * 1000 * 1000) + now.tv_nsec;

- if (p->p_slptime > 5 * loadfac)
- p->p_estcpu = 0;
- else {
- p->p_slptime--; /* the first time was done in schedcpu */
- while (newcpu && --p->p_slptime)
- newcpu = (int) decay_cpu(loadfac, newcpu);
- p->p_estcpu = newcpu;
- }
- resetpriority(p);
+ p->p_deadline = niffies; // + offset;
+ p->p_rrticks = rrticks_init;
}

/*
@@ -296,8 +217,16 @@ yield(void)
{
struct proc *p = curproc;
int s;
+ int tmp;

SCHED_LOCK(s);
+ /*
+ * workaround for rthreads calling yield() from _spinlock
+ */
+ tmp = p->p_rrticks;
+ sched_deadline(p);
+ p->p_rrticks = tmp;
+
p->p_priority = p->p_usrpri;
p->p_stat = SRUN;
setrunqueue(p);
@@ -326,6 +255,7 @@ preempt(struct proc *newp)

SCHED_LOCK(s);
p->p_priority = p->p_usrpri;
+ sched_deadline(p);
p->p_stat = SRUN;
p->p_cpu = sched_choosecpu(p);
setrunqueue(p);
@@ -455,7 +385,7 @@ mi_switch(void)
}

static __inline void
-resched_proc(struct proc *p, u_char pri)
+resched_proc(struct proc *p)
{
struct cpu_info *ci;

@@ -473,7 +403,7 @@ resched_proc(struct proc *p, u_char pri)
* sched state, which we currently do not do.
*/
ci = (p->p_cpu != NULL) ? p->p_cpu : curcpu();
- if (pri < ci->ci_schedstate.spc_curpriority)
+ if (ci->ci_curproc && p->p_deadline < ci->ci_curproc->p_deadline)
need_resched(ci);
}

@@ -509,56 +439,8 @@ setrunnable(struct proc *p)
p->p_stat = SRUN;
p->p_cpu = sched_choosecpu(p);
setrunqueue(p);
- if (p->p_slptime > 1)
- updatepri(p);
p->p_slptime = 0;
- resched_proc(p, p->p_priority);
+ resched_proc(p);
}

void (*cpu_setperf)(int);
Index: sys/proc.h
===================================================================
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.217
diff -u -p -r1.217 proc.h
--- sys/proc.h 9 Mar 2016 13:38:50 -0000 1.217
+++ sys/proc.h 15 Mar 2016 13:56:58 -0000
Index: sys/sched.h
===================================================================
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.40
diff -u -p -r1.40 sched.h
--- sys/sched.h 9 Mar 2016 13:38:50 -0000 1.40
+++ sys/sched.h 15 Mar 2016 13:56:58 -0000

Michal Mazurek

Mar 15, 2016, 10:16:37 AM
No, with the new scheduler a process gets assigned a 'deadline' - the
value of an arbitrary monotonically increasing timer. If it runs out of
ticks it gets a new deadline. The process with the smallest deadline
gets to run.

If a process goes to sleep, and then after a while wakes up, it will
have a smaller deadline than the processes that were allowed to keep
running (and thus kept getting new deadlines). It will therefore preempt
other tasks.
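
Stripped of locking and the idle/steal paths, the selection logic in
the diff reduces to taking the leftmost node of the per-cpu tree;
pick_next() below is not a real function, just a distillation of
sched_chooseproc():

    struct proc *
    pick_next(struct schedstate_percpu *spc)
    {
            /* earliest deadline first: RB_MIN is the smallest deadline */
            return (RB_MIN(prochead2, &spc->spc_rq));
    }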

--
Michal Mazurek

Chris Cappuccio

Mar 17, 2016, 3:01:01 AM
Michal Mazurek [akf...@jasminek.net] wrote:
> On 16:28:33, 14.03.16, Martin Pieuchot wrote:
> > > The number of calls to yield() dropped to 4,576.
> >
> > This is really similar to what I observed with Firefox and Chrome.
> >
> > > This is where I get stuck, I don't know how to replace the call to
> > > sched_yield(), or whether it is a good idea to do so. Any advice?
> >
> > I'd assume the problem is not in the _spinlock() implementation itself
> > but rather on the code calling this lock. Do you know which code is
> > triggering this?
> >
> > See r1.13 of lib/librthread/rthread_libc.c, this is an example where a
> > the use of a spinlock created similar symptoms.
>
> I don't know how to fix it, but in the meanwhile here is a workaround so
> I can continue working on the scheduler. In yield():
>
> + tmp = p->p_rrticks;
> + sched_deadline(p);
> + p->p_rrticks = tmp;
>
> So penalise a process calling yield() by giving it the worst deadline,
> i.e. make it go to the end of the run queue.
>
> Other than this, no changes.
>
> Chrome still isn't smooth.
>

On a frankenstein 5.9 kernel with bob beck's latest NOWAIT vfs_biomem buffer
alloc and this, I'm getting very smooth action on chrome, with video, even on
an old x201. I don't remember the last time it could re-open 20 tabs without
constantly pausing most of itself, or the last time youtube videos would
play on this laptop in chrome, without random and frequent stuttering.

Ray Lai

Mar 17, 2016, 4:12:02 PM
On my X200, YouTube used to be a stuttering mess in both chrome and
firefox; with this diff it is now buttery smooth, even at 1080p. Thanks!

Henrik Friedrichsen

Mar 17, 2016, 4:29:49 PM
Hey,

On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> Chrome still isn't smooth.
>
> Please test, and let me know if the performance of something else
> degrades.

While Chrome may not be 100% smooth yet, the system is a lot more
interactive. I can now play YouTube videos without stutters while doing
other things.

So for me it's a major improvement, thanks!

Juan Francisco Cantero Hurtado

Mar 17, 2016, 9:05:53 PM
On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> On 16:28:33, 14.03.16, Martin Pieuchot wrote:
> > > The number of calls to yield() dropped to 4,576.
> >
> > This is really similar to what I observed with Firefox and Chrome.
> >
> > > This is where I get stuck, I don't know how to replace the call to
> > > sched_yield(), or whether it is a good idea to do so. Any advice?
> >
> > I'd assume the problem is not in the _spinlock() implementation itself
> > but rather on the code calling this lock. Do you know which code is
> > triggering this?
> >
> > See r1.13 of lib/librthread/rthread_libc.c, this is an example where a
> > the use of a spinlock created similar symptoms.
>
> I don't know how to fix it, but in the meanwhile here is a workaround so
> I can continue working on the scheduler. In yield():
>
> + tmp = p->p_rrticks;
> + sched_deadline(p);
> + p->p_rrticks = tmp;
>
> So penalise a process calling yield() by giving it the worst deadline,
> i.e. make it go to the end of the run queue.
>
> Other than this, no changes.
>
> Chrome still isn't smooth.
>
> Please test, and let me know if the performance of something else
> degrades.

The performance in the ffmpeg test has increased by about 30-40%
compared with the default scheduler.

Please keep working on it.

Edd Barrett

Mar 18, 2016, 6:09:03 AM
On Thu, Mar 17, 2016 at 09:26:08PM +0100, Henrik Friedrichsen wrote:
> On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> > Chrome still isn't smooth.
> >
> > Please test, and let me know if the performance of something else
> > degrades.
>
> While Chrome may not be 100% smooth yet, the system is a lot more
> interactive. I can now play YouTube videos without stutters while doing
> other things.

I can't vouch for the code, but this makes video playback in firefox
usable on my x240t. Before it would stutter beyond belief.

I'll run with this for a while and let you know if anything comes up.

--
Best Regards
Edd Barrett

http://www.theunixzoo.co.uk

Alexandre Ratchov

Mar 18, 2016, 7:23:46 AM
On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
>
> Please test, and let me know if the performance of something else
> degrades.
>

With your diff, firefox consumes about half as much CPU (I watched the
same video with and without the diff). This suggests firefox spins
somewhere and your diff makes it spin less.

When the audio device is using small blocks, it stutters more with your
diff, though; according to the device counters, the stuttering is caused
by sndiod not getting the CPU fast enough.

Norman Golisz

Mar 18, 2016, 10:57:10 AM
Hi Michal,

On Fri Mar 18 2016 10:03, Edd Barrett wrote:
> On Thu, Mar 17, 2016 at 09:26:08PM +0100, Henrik Friedrichsen wrote:
> > On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> > > Chrome still isn't smooth.
> > >
> > > Please test, and let me know if the performance of something else
> > > degrades.
> >
> > While Chrome may not be 100% smooth yet, the system is a lot more
> > interactive. I can now play YouTube videos without stutters while doing
> > other things.
>
> I can't vouch for the code, but this makes video playback in firefox
> usable on my x240t. Before it would stutter beyond belief.
>
> I'll run with this for a while and let you know if anything comes up.

I can also confirm this patch makes a HUGE difference in video playback
performance in firefox and minitube on my T400.

This is the first time I can watch videos without stuttering (even in
HD/full screen).

And it seems to improve "GUI responsiveness" in general, too.

Thank you for working on this!

Bob Beck

Mar 18, 2016, 11:25:09 AM
this is cool .. but

I would be interested in someone comparing server workloads, as
opposed to interactive GUI response, using this.

I wouldn't be surprised if inspiration from BFS produced
better interactive response; my bigger concern
is whether this adversely impacts how we deal with non-interactive workloads.

Michael McConville

Mar 18, 2016, 11:33:06 AM
Bob Beck wrote:
> this is cool .. but
>
> I would be interested in someone comparing server workloads, as
> opposed to interactive GUI response, using this.
>
> I wouldn't be surprised that inspiriation from BFS would produce
> better interactive response, my bigger concern would be does this
> adversely impact how we deal with non-interactive workloads.

Those interested might find the benchmarks/siege port useful.

Chris Cappuccio

Mar 18, 2016, 11:37:06 AM
Alexandre Ratchov [al...@caoua.org] wrote:
> On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> >
> > Please test, and let me know if the performance of something else
> > degrades.
> >
>
> With your diff firefox consumes twice less cpu (watched the same
> video with and without the diff). This suggests firefox spins
> somewhere and your diff makes it spin less.
>
> When audio device is using small blocks it stutters more with your
> diff though; according to device counters the stuttering is caused
> by sndiod not getting the cpu fast enough.

there is no 'nice' functionality at the moment

Chris Cappuccio

Mar 18, 2016, 12:25:35 PM
Bob Beck [be...@openbsd.org] wrote:
> this is cool .. but
>
> I would be interested in someone comparing server workloads, as
> opposed to interactive GUI response, using this.
>
> I wouldn't be surprised that inspiriation from BFS would produce
> better interactive response, my bigger concern
> would be does this adversely impact how we deal with non-interactive workloads.

I've been testing it on my heavily loaded Zabbix server, which regularly
gets dozens of variables from over 5,000 devices. I get mixed results: the
load avg is higher, and the idle CPU time is generally higher. I regularly
see 1-second (top 's 1') CPU idle of 50-70% where I saw 20-50%
with the old scheduler. This is in top with all CPUs collapsed into 1 (top
'1'). I suspect that if I averaged it over time, the difference could be
less dramatic.

I'm using Postgres, and there is no heavily threaded stuff, so I have
no idea why the idle CPU seems to be higher. The load avg is a bit higher;
it seems to stay up around 48 longer with the new scheduler, and also
shoots up and down more quickly. None of this is particularly well measured,
just a weird observation. With the old scheduler the load avg would fluctuate
from 32-42, and seemed to stay at a particular value for a longer period
of time (whatever that means).

Zabbix is a horrible pig, and to top it off, my polling configuration is
not fine-tuned. The box is a "Xeon E2-1230 v3 @ 3.30GHz" with 16GB RAM and
two Samsung 845DC SSDs in softraid raid1. We use Postgres as a time-series
database, I'm looking at alternatives using collectd and graphite and
whatever, I really want to get away from Zabbix + Postgres.

Since this scheduler has a hack to help spinning threaded apps, that
explains why there is less CPU usage during video playback, but I don't
know how to explain my Zabbix results. It takes at least 2 minutes after
boot before the Zabbix box stabilizes to the levels I'm describing
here.

If I set top to .1 second (top 's .1') then the new scheduler seems to
drive all CPUs to 0% idle for shorter periods of time, but more
frequently.

These are really rough observations. This box spawns lots of dirty
perl processes and also lots of fping processes for monitoring. I've
not spent the time to optimize this area at all. I was curious if this
scheduler or other OS-level optimizations might take away some of the
costs here. With this rather stupid polling architecture, perhaps
copy-on-write is still the biggest win...

Chris

Michal Mazurek

Mar 18, 2016, 1:08:16 PM
On 09:22:18, 18.03.16, Chris Cappuccio wrote:
> These are really rough observations. This box spawns lots of dirty
> perl processes and also lots of fping processes for monitoring.

The next step I had planned was related to juggling processes between
cpus. Right now that code is untouched, other than removing a line
related to priority. One idea is to compare the deadline of the last
element on every queue, and balance based on that. Maybe this will
improve this use case?
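
A sketch of that idea (most_behind_cpu() is hypothetical; nothing like
it is in the diff yet): walk the cpus and find the run queue whose last
element has the oldest deadline, i.e. the queue that lags furthest
behind. Processes could then be stolen from it.

    struct cpu_info *
    most_behind_cpu(void)
    {
            CPU_INFO_ITERATOR cii;
            struct cpu_info *ci, *behind = NULL;
            struct proc *p;
            u_int64_t oldest = ~0ULL;

            CPU_INFO_FOREACH(cii, ci) {
                    /* RB_MAX is the last element on this queue */
                    p = RB_MAX(prochead2, &ci->ci_schedstate.spc_rq);
                    if (p != NULL && p->p_deadline < oldest) {
                            oldest = p->p_deadline;
                            behind = ci;
                    }
            }
            return (behind);
    }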

The original diff contains commented out code (grep for
CPU_INFO_ITERATOR) that prints out the queues every second. When running
tests with it I sometimes observe one cpu is 0.5-1 second behind another,
so I think this is a good next step for this scheduler.

BFS has one shared queue for all CPUs, maybe there is a very good reason
for that, we'll see.


I'd like to thank everyone for all the feedback.

--
Michal Mazurek

Mike Belopuhov

Mar 18, 2016, 1:33:13 PM
Something else to consider: it would be nice if sched_bsd.c
wouldn't be changed and new code would be placed in sched_bfs.c
(or sched_deadline.c or whatever) and kern_sched.c would act as
an interface to that code so that we could switch schedulers
via a compile time option (SCHED_xxx). I think this was the
intention behind splitting the code.
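
A sketch of what that interface could look like (none of these hooks
exist today; the names are made up for illustration):

    struct sched_ops {
            void            (*enqueue)(struct proc *);
            void            (*dequeue)(struct proc *);
            struct proc     *(*choose)(struct cpu_info *);
    };

    extern const struct sched_ops sched_4bsd_ops, sched_deadline_ops;

    #ifdef SCHED_DEADLINE
    #define sched_ops       sched_deadline_ops
    #else
    #define sched_ops       sched_4bsd_ops
    #endif

    /* kern_sched.c then dispatches via sched_ops.enqueue(p), etc. */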

> --
> Michal Mazurek
>

Mark Kettenis

Mar 18, 2016, 3:02:16 PM
> From: Bob Beck <be...@openbsd.org>
> Date: Fri, 18 Mar 2016 09:20:35 -0600
>
> this is cool .. but
>
> I would be interested in someone comparing server workloads, as
> opposed to interactive GUI response, using this.
>
> I wouldn't be surprised that inspiriation from BFS would produce
> better interactive response, my bigger concern would be does this
> adversely impact how we deal with non-interactive workloads.

One other important case to test is network packet forwarding. Some
of our network stack is now running inside a kernel thread. And any
changes in the scheduling behaviour could have a significant impact
there.

Another interesting case is the page zeroing thread. It relies on
priority-based scheduling to make sure it only runs if we have nothing
better to do.

So I'm skeptical about this BFS scheduler. Don't get me wrong; I do
think the scheduler needs attention. But I'm not sure a scheduler
designed for interactive desktop use is the best option for OpenBSD.
I'm willing to be proven wrong. But that means we need careful
benchmarking on a wide variety of workloads and hardware.

> On Fri, Mar 18, 2016 at 8:56 AM, Norman Golisz <li...@zcat.de> wrote:
> > Hi Michal,
> >
> > On Fri Mar 18 2016 10:03, Edd Barrett wrote:
> >> On Thu, Mar 17, 2016 at 09:26:08PM +0100, Henrik Friedrichsen wrote:
> >> > On Tue, Mar 15, 2016 at 03:05:47PM +0100, Michal Mazurek wrote:
> >> > > Chrome still isn't smooth.
> >> > >
> >> > > Please test, and let me know if the performance of something else
> >> > > degrades.
> >> >

Theo de Raadt

Mar 18, 2016, 3:11:44 PM
> So I'm skeptical about this BFS scheduler. Don't get me wrong; I do
> think the scheduler needs attention. But I'm not sure a scheduler
> designed for interactive desktop use is the best option for OpenBSD.
> I'm willing to be proven wrong. But that means we need careful
> benchmarking on a wide variety of workloads and hardware.

I don't think this scheduler is making things better because of
its scheduling behaviour.

I think the scheduling choices it makes are hiding a poor-performing
artifact we have elsewhere.

Karel Gardas

Mar 18, 2016, 4:44:34 PM
On Fri, Mar 18, 2016 at 6:02 PM, Michal Mazurek <akf...@jasminek.net> wrote:
> BFS has one shared queue for all CPUs; maybe there is a very good reason
> for that. We'll see.

Michal,

first of all, congrats on the promising results in interactive workloads.
Honestly, I'm a little worried about this effort, since I think any
scheduler modernization should also be NUMA-aware, or at least take
NUMA into consideration. The problem is that NUMA is no longer the
domain of multi-socket boxes; it is also available, or may be, in
single-socket Haswell boxes when configured right. Intel calls this
"cluster-on-die", for your information.
Anyway, although worried, I'm still curious about your scheduler
hacking, so I keep my fingers crossed! Thanks, Karel

Martin Pieuchot
Mar 18, 2016, 5:21:09 PM

I think the same thing. After reading your diff carefully, I think it
might be nice to split it into pieces to be able to see the effect of
each one.

I saw the following interesting items, but there might be even more:

1. Each process gets the same deadline. So you basically don't have
priorities, or did I misunderstand something?

2. In any case, the cost of migrating a process, calculated in
sched_proc_to_cpu_cost() and used via sched_choosecpu(), no longer
takes the priority of a process into account. But this might just be
a side effect of 1.

3. You changed the semantics of roundrobin(). With your diff there's no
context switch every 100ms. There's now a context switch every time a
process has been ``curproc'' for ``rrticks'' long (see the sketch at
the end of this mail).

4. You are no longer making a difference between ``p_usrpri'' and
``p_priority''; there's only one priority: ``p_deadline''.

5. You use only one runqueue.

6. You're using deadlines, which means you don't have all this
complicated, different and possibly buggy logic to recalculate the
priority of a process.


Please correct me if I said anything wrong.
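
For item 3, here is how I read the hardclock() change, simplified and
wrapped in a helper invented for illustration:

/*
 * Simplified reading of the diff, not the literal code: the
 * round-robin budget now lives in struct proc, so a process is
 * preempted p_rrticks ticks after it became curproc, instead of on
 * the old fixed per-CPU grid kept in spc_rrticks.
 */
static inline void
sched_tick(struct cpu_info *ci, struct proc *p)
{
	if (p != NULL && --p->p_rrticks <= 0)
		roundrobin(ci);		/* ask for a context switch */
}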

Alexandre Ratchov
Mar 19, 2016, 4:07:19 AM

On Fri, Mar 18, 2016 at 08:00:35PM +0100, Mark Kettenis wrote:
> [...]
>
> So I'm skeptical about this BFS scheduler. Don't get me wrong; I do
> think the scheduler needs attention. But I'm not sure a scheduler
> designed for interactive desktop use is the best option for OpenBSD.

IMHO the current OpenBSD scheduler *is* designed for interactive
programs too; that's why I keep using OpenBSD for real-time audio
and didn't switch to Linux or OS X or whatever.

The browser problems seem to be caused by the way pthreads behave;
browsers appear to spin. With the proposed scheduler they spin
less. But the real question is why they spin at all?

My 2 cents.

Edd Barrett
Mar 19, 2016, 9:01:44 AM

On Sat, Mar 19, 2016 at 09:06:26AM +0100, Alexandre Ratchov wrote:
> The browser problems seem to be caused by the way pthreads behave;
> browsers appear to spin. With the proposed scheduler they spin
> less. But the real question is why they spin at all?

Inspired by this comment, I set out to see if I could find *where* the browsers
are spinning. "Just build the browsers with profiling instrumentation and look
where the time goes when playing a YouTube video," I thought. "Easy," I thought.

Well, no. We build our Chrome and Firefox with clang, which doesn't
support gprof-style profiling.

Clang does have this -fprofile-instr-generate flag, which throws out some
profiling data at runtime, but it appears it is designed to be used by the
compiler itself as compile-time optimisation hints[1]. It's not even clear if
there is any timing data in there.

There's another clang profiling mode which depends upon Linux perf, which is
obviously not an option for us.

Nevertheless, I decided to try it on the off-chance that clang's profiling data
could be useful (and I'm totally accepting that, if it is, I will probably have
to write a python script to make sense of the output). Sadly I stumbled at the
first hurdle:

---8<---
$ cat a.c
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
return (EXIT_SUCCESS);
}
$ clang -fprofile-instr-generate a.c -lclang
/usr/local/lib/libclang.so.2.0: warning: warning: sprintf() is often misused, please use snprintf()
/tmp/a-f91c15.o: In function `__llvm_profile_register_functions':
a.c:(.text+0x3c): undefined reference to `__llvm_profile_register_function'
/tmp/a-f91c15.o: In function `__llvm_profile_runtime_user':
a.c:(.text+0x53): undefined reference to `__llvm_profile_runtime'
clang-3.7: error: linker command failed with exit code 1 (use -v to see invocation)
--->8---

Any clang aficionados know what is missing here? Or is there a better way to
find the spinning code? There has to be a better way.
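
My best guess so far, untested: the missing __llvm_profile_* symbols
live in compiler-rt's profile runtime (libclang_rt.profile), and our
clang port may simply not ship it for OpenBSD, so the driver has
nothing to link in even when the flag is passed at link time; -lclang
pulls in the C API to the compiler and is a red herring. Building
compiler-rt's profile library and adding it to the link might be all
that is missing.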

A potential option would be to port and use the profiling portion of [2].

[1] http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
[2] https://github.com/gperftools/gperftools

li...@wrant.com
Mar 19, 2016, 11:10:25 AM

Fri, 18 Mar 2016 13:04:49 -0600 Theo de Raadt <der...@cvs.openbsd.org>
> > So I'm skeptical about this BFS scheduler. Don't get me wrong; I do
> > think the scheduler needs attention. But I'm not sure a scheduler
> > designed for interactive desktop use is the best option for OpenBSD.
> > I'm willing to be proven wrong. But that means we need careful
> > benchmarking on a wide variety of workloads and hardware.
>
> I don't think this scheduler is making things better because of
> its scheduling behaviour.
>
> I think the scheduling choices it makes are hiding a poor-performing
> artifact we have elsewhere.

One would recommend that the proposer devise multiple schedulers and
compare them, before justifying the decisions taken. And that they be
available to maintain the work proposed.

Michael McConville
Mar 19, 2016, 2:06:09 PM

Edd Barrett wrote:
> On Sat, Mar 19, 2016 at 09:06:26AM +0100, Alexandre Ratchov wrote:
> > The browser problems seem to be caused by the way pthreads behave;
> > browsers appear to spin. With the proposed scheduler they spin
> > less. But the real question is why they spin at all?
>
> Inspired by this comment, I set out to see if I could find *where* the
> browsers are spinning. "Just build the browsers with profiling
> instrumentation and look where the time goes when playing a YouTube
> video," I thought. "Easy," I thought.
>
> [...]

I've also been meaning to try something like this:

http://poormansprofiler.org/

Seems applicable here.
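
For the unfamiliar: it boils down to attaching gdb in batch mode,
dumping all thread backtraces a handful of times, and counting which
stacks dominate. No special build flags are needed, so it should work
on our clang-built browsers.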

Mark Kettenis
Mar 20, 2016, 2:43:49 PM

> Date: Sat, 19 Mar 2016 09:06:26 +0100
> From: Alexandre Ratchov <al...@caoua.org>
>
> On Fri, Mar 18, 2016 at 08:00:35PM +0100, Mark Kettenis wrote:
> > > From: Bob Beck <be...@openbsd.org>
> > > Date: Fri, 18 Mar 2016 09:20:35 -0600
> > >
> > > this is cool .. but
> > >
> > > I would be interested in someone comparing server workloads, as
> > > opposed to interactive GUI response, using this.
> > >
> > > I wouldn't be surprised that inspiriation from BFS would produce
> > > better interactive response, my bigger concern would be does this
> > > adversely impact how we deal with non-interactive workloads.
> >
> > One other important case to test is network packet forwarding. Some
> > of our network stack is now running inside a kernel thread. And any
> > changes in the scheduling behaviour have the potential of having a
> > significant impact there.
> >
> > Another interesting case is the page zeroing thread. It relies on
> > priority-based scheduling to make sure it only runs if we have nothing
> > better to do.
> >
> > So I'm skeptical about this BFS scheduler. Don't get me wrong; I do
> > think the scheduler needs attention. But I'm not sure a scheduler
> > designed for interactive desktop use is the best option for OpenBSD.
>
> IMHO current OpenBSD scheduler *is* designed for interactive
> programs too, that's why I keep using OpenBSD for real-time audio
> and didn't switch to Linux or OS X or whatever.
>
> The browser problems seem to be caused by the way pthreads behave;
> browsers appear to spin. With the proposed scheduler they spin
> less. But the real question is why they spin at all?

If we still hit the remaining spinlocks in libpthread, we
should probably replace the ones we hit with mutexes like I did for
malloc.
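
That is, roughly this shape of change; the sketch below is
illustrative only, not the actual malloc diff:

#include <pthread.h>

/*
 * Illustrative only: replace a userland spin loop with a mutex that
 * blocks in the kernel once the lock is contended, instead of
 * burning CPU (and scheduler slices) while waiting.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void
critical_enter(void)
{
	/* was, roughly: while (test-and-set fails) keep spinning */
	pthread_mutex_lock(&lock);
}

void
critical_leave(void)
{
	pthread_mutex_unlock(&lock);
}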

Hrvoje Popovski
Mar 20, 2016, 6:49:12 PM

On 18.3.2016. 20:00, Mark Kettenis wrote:
> One other important case to test is network packet forwarding. Some
> of our network stack is now running inside a kernel thread. And any
> changes in the scheduling behaviour have the potential of having a
> significant impact there.

I've done some basic PPS tests over ix interfaces and the numbers are the
same with or without the new scheduler, i.e. 720pps.

routing only
pf=NO
kern.pool_debug=0
net.inet.ip.forwarding=1
net.inet.ip.ifq.maxlen=8192
kern.maxclusters=32768
