[PATCH 0 of 4] Generic AIO by scheduling stacks

Zach Brown

unread,

Jan 30, 2007, 4:41:28 PM1/30/07

to linux-...@vger.kernel.org

This very rough patch series introduces a different way to provide AIO support
for system calls.

Right now to provide AIO support for a system call you have to express your
interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
translate this into some call path in the kernel that passes in an iocb, and
then update your code path implement with either completion-based (EIOCBQUEUED)
or retry-based (EIOCBRETRY) AIO with the iocb.

This patch series changes this by moving the complexity into generic code such
that a system call handler would provide AIO support in exactly the same way
that it supports a synchronous call. It does this by letting a task have
multiple stacks executing system calls in the kernel. Stacks are switched in
schedule() as they block and are made runnable.

First, let's introduce the term 'fibril'. It means small fiber or thread. It
represents the stack and the bits of data which manage its scheduling under the
task_struct. There's a 1:1:1 relationship between a fibril, the stack it's
managing, and the thread_struct it's operating under. They're static for the
fibril's lifetime. I came to choosing a funny new term after a few iterations
of trying to use existing terms (stack, call, thread, task, path, fiber)
without going insane. They were just too confusing to use with a clear
conscious.

So, let's illustrate the changes by walking through the execution of an asys
call. Let's say sys_nanosleep(). Then we'll talk about the trade-offs.

Maybe it'd make sense to walk through the patches in another window while
reading the commentary. I'm sorry if this seems tedious, but this is
non-trivial stuff. I want to get the point across.

We start in sys_asys_submit(). It allocates a fibril for this executing
submission syscall and hangs it off the task_struct. This lets this submission
fibril be scheduled along with the later asys system calls themselves.

sys_asys_submit() has arguments which specify system call numbers and
arguments. For each of these calls the submission syscall allocates a fibril
and constructs an initial stack. The fibril has eip pointing to the system
call handler and esp pointing to the new stack such that when the handler is
called its arguments will be on the stack. The stack is also constructed so
that when the handler returns it jumps to a function which queues the return
code for collection in userspace.

sys_asys_submit() asks the scheduler to put these new fibrils on the simple run
queue in the task_struct. It's just a list_head for now. It then calls
schedule().

After we've gotten the run queue lock in the scheduler we notice that there are
fibrils on the task_struct's run queue. Instead of continuing with schedule(),
then, we switch fibrils. The old submission fibril will still be in schedule()
and we'll start executing sys_nanosleep() in the context of the submission
task_struct.

The specific switching mechanics of this implementation rely on the notion of
tracking a stack as a full thread_info pointer. To make the switch we transfer
the non-stack bits of the thread_info from the old fibril's ti to the new
fibril's ti. We update the book keeping in the task_struct to
consider the new thread_info as the current thread_info for the task. Like so:

*next->ti = *ti;
*thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);

current->thread_info = next->ti;
current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
current->fibril = next;
current->state = next->state;
current->per_call = next->per_call;

Yeah, messy. I'm interested in aggressive feedback on how to do this sanely.
Especially from the perspective of worrying about all the archs.

Did everyone catch that "per_call" thing there? That's to switch members of
task_struct which are local to a specific call. link_count, journal_info, that
sort of thing. More on that as we talk about the costs later.

After the switch we're executing in sys_nanosleep(). Eventually it gets to the
point where it's building its timer which will wake it after the sleep
interval. Currently it would store a bare task_struct reference for
wake_up_process(). Instead we introduce a helper which returns a cookie that
is given to a specific wake_up_*() variant. Like so:

- sl->task = task;
+ sl->wake_target = task_wake_target(task);

It then marks itself as TASK_INTERRUPTIBLE and calls schedule(). schedule()
notices that we have another fibril on the run queue. It's the submission
fibril that we switched from earlier. As we switched we saw that it was still
TASK_RUNNING so we put it on the run queue as we switched. We now switch back to
the submission fibril, leaving the sys_nanosleep() fibril sleeping. Let's say
the submission task returns to userspace which then immediately calls
sys_asys_await_completion(). It's an easy case :). It goes to sleep, there
are no running fibrils and the schedule() path really puts the task to sleep.

Eventually the timer fires and the hrtimer code path wakes the fibril:

- if (task)
- wake_up_process(task);
+ if (wake_target)
+ wake_up_target(wake_target);

We've doctored try_to_wake_up() to be able to tell if its argument is a
task_struct or one of these fibril targets. In the fibril case it calls
try_to_wake_up_fibril(). It notices that the target fibril does need to be
woken and sets it TASK_RUNNING. It notices that it it's current in the task so
it puts the fibril on the task's fibril run queue and wakes the task. There's
grossness here. It needs the task to come through schedule() again so that it
can find the new runnable fibril instead of continuing to execute its current
fibril. To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.
This seems to work, but there are some pretty terrifying interactions between
schedule, wake-up, and the maintenance of fibril->state and task->state that
need to be sorted out.

Remember our task was sleeping in asys_await_completion()? The task was woken
by the fibril wake-up path, but it's still executing the
asys_await_completion() fibril. It comes out of schedule() and sees
TIF_NEED_RESCHED and comes back through the top of schedule(). This time it
finds the runnable sys_nanosleep() fibril and switches to it. sys_nanosleep()
runs to completion and it returns which, because of the way we built its stack,
calls asys_teardown_stack().

asys_teardown_stack() takes the return code and puts it off in a list for
asys_await_completion(). It wakes a wait_queue to notify waiters of pending
completions. In so doing it wakes up the asys_await_completion() fibril that
was sleeping in our task.

Then it has to tear down the fibril for the call that just completed. In the
current implementation the fibril struct is actually embedded in an "asys_call"
struct. asys_teardown_stack() frees the asys_call struct, and so the fibril,
after having cleared current->fibril. It then calls schedule(). Our
asys_await_completion() fibril is on the run queue so we switch to it.
Switching notices the null current->fibril that we're switching from and takes
that as a queue to mark the previous thread_info for freeing *after* the
switch.

After the switch we're in asys_await_completion(). We find the waiting return
code completion struct in the list that was left by asys_teardown_stack(). We
return it to userspace.

Phew. OK, so what are the trade-offs here? I'll start with the benefits for
obvious reasons :).

- With get AIO support for all syscalls. Every single one. (Well, please, no
asys sys_exit() :)). Buffered IO, sendfile, recvmsg, poll, epoll, hardware
crypto ioctls, open, mmap, getdents, the entire splice API, etc.

- The syscall API does not change just because calls are being issued AIO,
particularly things that reference task_struct. AIO sys_getpid() does what
you'd expect, signal masking, etc. You don't have to worry about your AIO call
being secretly handled by some worker threads that get very different results
from current-> references.

- We wouldn't multiply testing and maintenance burden with separate AIO paths.
No is_sync_kiocb() testing and divergence between returning or calling
aio_complete(). No auditing to make sure that EIOCBRETRY only being returned
after any significant references of current->. No worries about completion
racing from the submission return path and some aio_complete() being called
from another context. In this scheme if your sync syscall path isn't broken,
your AIO path stands a great chance of working.

- The submission syscall won't block while handling submitted calls. Not for
metadata IO, not for memory allocation, not for mutex contention, nothing.

- AIO syscalls which *don't* block see very little overhead. They'll allocate
stacks and juggle the run queue locks a little, but they'll execute in turn on
the submitting (cache-hot, presumably) processor. There's room to optimize
this path, too, of course.

- We don't need to duplicate interfaces for each AIO interface we want to
support. No iocb unions mirroring the syscall API, no magical AIO sys_
variants.

And the costs? It's not free.

- The 800lb elephant in the room. It uses a full stack per blocked operation.
I believe this is a reasonable price to pay for the flexibility of having *any*
call pending. It rules out some loads which would want to keep *millions* of
operations pending, but I humbly submit that a load rarely takes that number of
concurrent ops to saturate a resource. (think of it this way: we've gotten
this far by having to burn a full *task* to have *any* syscall pending.) While
not optimal, it opens to door to a lot of functionality without having to
rewrite the kernel as a giant non-blocking state machine.

It should be noted that my very first try was to copy the used part of stacks
in to and out of one full allocated stack. This uses less memory per blocking
operation at the cpu cost of copying the used regions. And it's a terrible
idea, for at least two reasons. First, to actually get the memory overhead
savings you allocate at stack switch time. If that allocation can't be
satisfied you are in *trouble* because you might not be able to switch over to
a fibril that is trying to free up memory. Deadlock city. Second, it means
that *you can't reference on-stack data in the wake-up path*. This is a
nightmare. Even our trivial sys_nanosleep() example would have had to take its
hrtimer_sleeper off the stack and allocate it. Never mind, you know, basically
every single user of <linux/wait.h>. My current thinking is that it's just
not worth it.

- We would now have some measure of task_struct concurrency. Read that twice,
it's scary. As two fibrils execute and block in turn they'll each be
referencing current->. It means that we need to audit task_struct to make sure
that paths can handle racing as its scheduled away. The current implementation
*does not* let preemption trigger a fibril switch. So one only has to worry
about racing with voluntary scheduling of the fibril paths. This can mean
moving some task_struct members under an accessor that hides them in a struct
in task_struct so they're switched along with the fibril. I think this is a
manageable burden.

- The fibrils can only execute in the submitter's task_struct. While I think
this is in fact a feature, it does imply some interesting behaviour.
Submitters will be required to explicitly manage any concurrent between CPUs by
issuing their ops in tasks. To guarantee forward-progress in syscall handling
paths (releasing i_mutex, say) we'll have to interrupt userspace when a fibril
is ready to run.

- Signals. I have no idea what behaviour we want. Help? My first guess is
that we'll want signal state to be shared by fibrils by keeping it in the
task_struct. If we want something like individual cancellation, we'll augment
signal_pending() with some some per-fibril test which will cause it to return
from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
cancellation, I'll argue) as it would have if a signal was pending.

- lockdep and co. will need to be updated to track fibrils instead of tasks.
sysrq-t might want to dump fibril stacks, too. That kind of thing. Sorry.

As for the current implementation, it's obviously just a rough sketch. I'm
sending it out in this state because this is the first point at which a tree
walker using AIO openat(), fstat(), and getdents() actually worked on ext3.
Generally, though, these are paths that I don't have the most experience in.
I'd be thrilled to implement whatever the experts think is the right way to do
this.

Blah blah blah. Too much typing. What do people think?

Zach Brown

unread,

Jan 30, 2007, 4:42:27 PM1/30/07

to linux-...@vger.kernel.org

This finally does something useful with the notion of being able to schedule
stacks as fibrils under a task_struct. Again, i386-specific and in need of
proper layering with archs.

sys_asys_submit() is added to let userspace submit asynchronous system calls.
It specifies the system call number and arguments. A fibril is constructed for
each call. Each starts with a stack which executes the given system call
handler and then returns to a function which records the return code of the
system call handler. sys_asys_await_completion() then lets userspace collect
these results.

sys_asys_submit() is careful to construct a fibril for the submission syscall
itself so that it can return to userspace if the calls it is dispatching block.
If none of them block, however, they will have all been run hot in this
submitting task on this processor.

It allocates and runs each system call in turn. It could certainly work in
batches to decrease locking overhead at the cost of increased peak memory
overhead for calls which don't end up blocking.

The complexity of a fully-formed submission and completion interface hasn't
been addressed. Details like targeting explicit completion contexts, batching,
timeouts, signal delivery, and syscall-free submission and completion (now with
more rings!) can all be hashed out in some giant thread, no doubt. I didn't
want them to cloud the basic mechanics being presented here.

diff -r 4ea674e8825e -r 5bdda0f7bef2 arch/i386/kernel/syscall_table.S
--- a/arch/i386/kernel/syscall_table.S Mon Jan 29 15:46:47 2007 -0800
+++ b/arch/i386/kernel/syscall_table.S Mon Jan 29 15:50:10 2007 -0800
@@ -319,3 +319,5 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_asys_submit /* 320 */
+ .long sys_asys_await_completion
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h Mon Jan 29 15:46:47 2007 -0800
+++ b/include/asm-i386/unistd.h Mon Jan 29 15:50:10 2007 -0800
@@ -325,6 +325,8 @@
#define __NR_move_pages 317
#define __NR_getcpu 318
#define __NR_epoll_pwait 319
+#define __NR_asys_submit 320
+#define __NR_asys_await_completion 321

#ifdef __KERNEL__

diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/init_task.h
--- a/include/linux/init_task.h Mon Jan 29 15:46:47 2007 -0800
+++ b/include/linux/init_task.h Mon Jan 29 15:50:10 2007 -0800
@@ -148,6 +148,8 @@ extern struct group_info init_groups;
.pi_lock = SPIN_LOCK_UNLOCKED, \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP \
+ .asys_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tsk.asys_wait), \
+ .asys_completed = LIST_HEAD_INIT(tsk.asys_completed), \
}

diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/sched.h
--- a/include/linux/sched.h Mon Jan 29 15:46:47 2007 -0800
+++ b/include/linux/sched.h Mon Jan 29 15:50:10 2007 -0800
@@ -1019,6 +1019,14 @@ struct task_struct {

/* Protection of the PI data structures: */
spinlock_t pi_lock;
+
+ /*
+ * XXX This is just a dummy that should be in a seperately managed
+ * context. An explicit contexts lets asys calls be nested (!) and
+ * will let us provide the sys_io_*() API on top of asys.
+ */
+ struct list_head asys_completed;
+ wait_queue_head_t asys_wait;

#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/Makefile
--- a/kernel/Makefile Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/Makefile Mon Jan 29 15:50:10 2007 -0800
@@ -8,7 +8,7 @@ obj-y = sched.o fork.o exec_domain.o
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
- hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+ hrtimer.o rwsem.o latency.o nsproxy.o srcu.o asys.o

obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/exit.c
--- a/kernel/exit.c Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/exit.c Mon Jan 29 15:50:10 2007 -0800
@@ -42,6 +42,7 @@
#include <linux/audit.h> /* for audit_free() */
#include <linux/resource.h>
#include <linux/blkdev.h>
+#include <linux/asys.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -926,6 +927,8 @@ fastcall NORET_TYPE void do_exit(long co
taskstats_exit(tsk, group_dead);

exit_mm(tsk);
+
+ asys_task_exiting(tsk);

if (group_dead)
acct_process();
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/fork.c
--- a/kernel/fork.c Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/fork.c Mon Jan 29 15:50:10 2007 -0800
@@ -49,6 +49,7 @@
#include <linux/delayacct.h>
#include <linux/taskstats_kern.h>
#include <linux/random.h>
+#include <linux/asys.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -987,6 +988,8 @@ static struct task_struct *copy_process(
goto fork_out;

rt_mutex_init_task(p);
+
+ asys_init_task(p);

#ifdef CONFIG_TRACE_IRQFLAGS
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/asys.h
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/include/linux/asys.h Mon Jan 29 15:50:10 2007 -0800
@@ -0,0 +1,7 @@
+#ifndef _LINUX_ASYS_H
+#define _LINUX_ASYS_H
+
+void asys_task_exiting(struct task_struct *tsk);
+void asys_init_task(struct task_struct *tsk);
+
+#endif
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/asys.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/kernel/asys.c Mon Jan 29 15:50:10 2007 -0800
@@ -0,0 +1,252 @@
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/asys.h>
+
+/* XXX */
+#include <asm/processor.h>
+
+/*
+ * system call and argument specification given to _submit from userspace
+ */
+struct asys_input {
+ int syscall_nr;
+ unsigned long cookie;
+ unsigned long nr_args;
+ unsigned long *args;
+};
+
+/*
+ * system call completion event given to userspace
+ * XXX: compat
+ */
+struct asys_completion {
+ long return_code;
+ unsigned long cookie;
+};
+
+/*
+ * This record of a completed async system call is kept around until it
+ * is collected by userspace.
+ */
+struct asys_result {
+ struct list_head item;
+ struct asys_completion comp;
+};
+
+/*
+ * This stack is built-up and handed to the scheduler to first process
+ * the system call. It stores the progress of the call until the call returns
+ * and this structure is freed.
+ */
+struct asys_call {
+ struct asys_result *result;
+ struct fibril fibril;
+};
+
+void asys_init_task(struct task_struct *tsk)
+{
+ INIT_LIST_HEAD(&tsk->asys_completed);
+ init_waitqueue_head(&tsk->asys_wait);
+}
+
+void asys_task_exiting(struct task_struct *tsk)
+{
+ struct asys_result *res, *next;
+
+ list_for_each_entry_safe(res, next, &tsk->asys_completed, item)
+ kfree(res);
+
+ /*
+ * XXX this only works if tsk->fibril was allocated by
+ * sys_asys_submit(), not if its embedded in an asys_call. This
+ * implies that we must forbid sys_exit in asys_submit.
+ */
+ if (tsk->fibril) {
+ BUG_ON(!list_empty(&tsk->fibril->run_list));
+ kfree(tsk->fibril);
+ tsk->fibril = NULL;
+ }
+}
+
+/*
+ * Initial asys call stacks are constructed such that this is called when
+ * the system call handler returns. It records the return code from
+ * the handler in a completion event and frees data associated with the
+ * completed asys call.
+ *
+ * XXX we know that the x86 syscall handlers put their return code in eax and
+ * that regparm(3) here will take our rc argument from eax.
+ */
+static void fastcall NORET_TYPE asys_teardown_stack(long rc)
+{
+ struct asys_result *res;
+ struct asys_call *call;
+ struct fibril *fibril;
+
+ fibril = current->fibril;
+ call = container_of(fibril, struct asys_call, fibril);
+ res = call->result;
+ call->result = NULL;
+
+ res->comp.return_code = rc;
+ list_add_tail(&res->item, &current->asys_completed);
+ wake_up(&current->asys_wait);
+
+ /*
+ * We embedded the fibril in the call so that we could dereference it
+ * here without adding some tracking to the fibril. We then free the
+ * call and fibril because we're done with them.
+ *
+ * The ti itself, though, is still in use. It will only be freed once
+ * the scheduler switches away from it to another fibril. It does
+ * that when it sees current->fibril assigned to NULL.
+ */
+ current->fibril = NULL;
+ BUG_ON(!list_empty(&fibril->run_list));
+ kfree(call);
+
+ /*
+ * XXX This is sloppy. We "know" this is likely for now as the task
+ * with fibrils is only going to be in sys_asys_submit() or
+ * sys_asys_complete()
+ */
+ BUG_ON(list_empty(&current->runnable_fibrils));
+
+ schedule();
+ BUG();
+}
+
+asmlinkage long sys_asys_await_completion(struct asys_completion __user *comp)
+{
+ struct asys_result *res;
+ long ret;
+
+ ret = wait_event_interruptible(current->asys_wait,
+ !list_empty(&current->asys_completed));
+ if (ret)
+ goto out;
+
+ res = list_entry(current->asys_completed.next, struct asys_result,
+ item);
+
+ /* XXX compat */
+ ret = copy_to_user(comp, &res->comp, sizeof(struct asys_completion));
+ if (ret) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ list_del(&res->item);
+ kfree(res);
+ ret = 1;
+
+out:
+ return ret;
+}
+
+/*
+ * This initializes a newly allocated fibril so that it can be handed to the
+ * scheduler. The fibril is private to this code path at this point.
+ *
+ * XXX
+ * - this is arch specific
+ * - should maybe have a sched helper that uses INIT_PER_CALL_CHAIN
+ */
+extern unsigned long sys_call_table[]; /* XXX */
+static int asys_init_fibril(struct fibril *fibril, struct thread_info *ti,
+ struct asys_input *inp)
+{
+ unsigned long *stack_bottom;
+
+ INIT_LIST_HEAD(&fibril->run_list);
+ fibril->ti = ti;
+
+ /* XXX sanity check syscall_nr */
+ fibril->eip = sys_call_table[inp->syscall_nr];
+ /* this mirrors copy_thread()'s use of task_pt_regs() */
+ fibril->esp = (unsigned long)thread_info_pt_regs(ti) -
+ ((inp->nr_args + 1) * sizeof(long));
+
+ /*
+ * now setup the stack so that our syscall handler gets its arguments
+ * and we return to asys_teardown_stack.
+ */
+ stack_bottom = (unsigned long *)fibril->esp;
+ stack_bottom[0] = (unsigned long)asys_teardown_stack;
+ /* XXX compat */
+ if (copy_from_user(&stack_bottom[1], inp->args,
+ inp->nr_args * sizeof(long)))
+ return -EFAULT;
+
+ return 0;
+}
+
+asmlinkage long sys_asys_submit(struct asys_input __user *user_inp,
+ unsigned long nr_inp)
+{
+ struct asys_input inp;
+ struct asys_result *res;
+ struct asys_call *call;
+ struct thread_info *ti;
+ unsigned long i;
+ long err = 0;
+
+ /* Allocate a fibril for the submitter's thread_info */
+ if (current->fibril == NULL) {
+ current->fibril = kzalloc(sizeof(struct fibril), GFP_KERNEL);
+ if (current->fibril == NULL)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&current->fibril->run_list);
+ current->fibril->state = TASK_RUNNING;
+ current->fibril->ti = current_thread_info();
+ }
+
+ for (i = 0; i < nr_inp; i++) {
+
+ if (copy_from_user(&inp, &user_inp[i], sizeof(inp))) {
+ err = -EFAULT;
+ break;
+ }
+
+ res = kmalloc(sizeof(struct asys_result), GFP_KERNEL);
+ if (res == NULL) {
+ err = -ENOMEM;
+ break;
+ }
+
+ /* XXX kzalloc to init call.fibril.per_cpu, add helper */
+ call = kzalloc(sizeof(struct asys_call), GFP_KERNEL);
+ if (call == NULL) {
+ kfree(res);
+ err = -ENOMEM;
+ break;
+ }
+
+ ti = alloc_thread_info(tsk);
+ if (ti == NULL) {
+ kfree(res);
+ kfree(call);
+ err = -ENOMEM;
+ break;
+ }
+
+ err = asys_init_fibril(&call->fibril, ti, &inp);
+ if (err) {
+ kfree(res);
+ kfree(call);
+ free_thread_info(ti);
+ break;
+ }
+
+ res->comp.cookie = inp.cookie;
+ call->result = res;
+ ti->task = current;
+
+ sched_new_runnable_fibril(&call->fibril);
+ schedule();
+ }
+
+ return i ? i : err;
+}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Linus Torvalds

unread,

Jan 30, 2007, 4:58:56 PM1/30/07

to Zach Brown

On Tue, 30 Jan 2007, Zach Brown wrote:
>
> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Yee-haa!

I looked at this approach a long time ago, and basically gave up because
it looked like too much work.

I heartily approve, although I only gave the actual patches a very cursory
glance. I think the approach is the proper one, but the devil is in the
details. It might be that the stack allocation overhead or some other
subtle fundamental problem ends up making this impractical in the end, but
I would _really_ like for this to basically go in.

One of the biggest issues I see is signalling. Your mention it, and I
think that's going to be one of the big issues.

It won't matter at all for a certain class of calls (a lot of the people
who want to do AIO really end up doing non-interruptible things, and
signalling is a non-issue), but not only is it going to matter for some
others, we will almost certainly want to have a way to not just signal a
task, but a single "fibril" (and let me say that I'm not convinced about
your naming, but I don't hate it either ;)

But from a quick overview of the patches, I really don't see anything
fundamentally wrong. It needs some error checking and some limiting (I
_really_ don't think we want a regular user starting a thousand fibrils
concurrently), but it actually looks much less invasive than I thought it
would be.

So yay! Consider me at least preliminarily very happy.

Linus

Linus Torvalds

unread,

Jan 30, 2007, 5:23:45 PM1/30/07

to Zach Brown

On Tue, 30 Jan 2007, Linus Torvalds wrote:
>
> But from a quick overview of the patches, I really don't see anything
> fundamentally wrong. It needs some error checking and some limiting (I
> _really_ don't think we want a regular user starting a thousand fibrils
> concurrently), but it actually looks much less invasive than I thought it
> would be.

Side note (and maybe this was obvious to people already): I would suggest
that the "limiting" not be done the way fork() is limited ("return EAGAIN
if you go over a limit") but be done as a per-task counting semaphore
(down() on submit, up() on fibril exit).

So we should limit these to basically have some maximum concurrency
factor, but rather than consider it an error to go over it, we'd just cap
the concurrency by default, so that people can freely use asynchronous
interfaces without having to always worry about what happens if their
resources run out..

However, that also implies that we should probably add a "flags" parameter
to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or
something to tell the kernel to rather return EAGAIN than wait. Sometimes
you don't want to block just because you already have too much work.

Hmm?

Zach Brown

unread,

Jan 30, 2007, 5:47:58 PM1/30/07

to Linus Torvalds

> I looked at this approach a long time ago, and basically gave up
> because
> it looked like too much work.

Indeed, your mention of it in that thread.. a year ago?.. is what got
this notion sitting in the back of my head. I didn't like it at
first, but it grew on me.

> I heartily approve, although I only gave the actual patches a very
> cursory
> glance. I think the approach is the proper one, but the devil is in
> the
> details. It might be that the stack allocation overhead or some other
> subtle fundamental problem ends up making this impractical in the
> end, but
> I would _really_ like for this to basically go in.

As for efficiency and overhead, I hope to get some time with the team
that work on the Giant Database Software Whose Name We Shall Not
Speak. That'll give us some non-trival loads to profile.

> It won't matter at all for a certain class of calls (a lot of the
> people
> who want to do AIO really end up doing non-interruptible things, and
> signalling is a non-issue), but not only is it going to matter for
> some
> others, we will almost certainly want to have a way to not just
> signal a
> task, but a single "fibril" (and let me say that I'm not convinced
> about
> your naming, but I don't hate it either ;)

Yeah, no doubt. I'm wildly open to discussion here. (and yeah, me
either, but I don't care much about the name. I got tired of
qualifying overloaded uses of 'stack' or 'thread', that's all :)).

> But from a quick overview of the patches, I really don't see anything
> fundamentally wrong. It needs some error checking and some limiting (I
> _really_ don't think we want a regular user starting a thousand
> fibrils
> concurrently), but it actually looks much less invasive than I
> thought it
> would be.

I think we'll also want to flesh out the submission and completion
interface so that we don't find ourselves frustrated with it in
another 5 years. What's there now is just scaffolding to support the
interesting kernel-internal part. No doubt the kevent thread will
come into play here.

- z

Zach Brown

unread,

Jan 30, 2007, 5:54:11 PM1/30/07

to Linus Torvalds

> So we should limit these to basically have some maximum concurrency
> factor, but rather than consider it an error to go over it, we'd
> just cap
> the concurrency by default, so that people can freely use asynchronous
> interfaces without having to always worry about what happens if their
> resources run out..

Yeah, call it the socket transmit queue model :). Maybe tuned by a
ulimit?

I don't have very strong opinions abou the specific mechanics of
limiting concurrent submissions, as long as they're there. Some
folks in Oracle complain about having one more thing to have to tune,
but the alternative seems worse.

> However, that also implies that we should probably add a "flags"
> parameter
> to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or
> something to tell the kernel to rather return EAGAIN than wait.
> Sometimes
> you don't want to block just because you already have too much work.

EAGAIN or the initial number of submissions completed before the one
that ran over the limit, perhaps. Sure. Nothing too controversial
here :). I have this kind of stuff queued up for worrying about
once the internal mechanics are stronger.

- z

Linus Torvalds

unread,

Jan 30, 2007, 5:55:08 PM1/30/07

to Zach Brown

On Tue, 30 Jan 2007, Zach Brown wrote:
>

> I think we'll also want to flesh out the submission and completion interface
> so that we don't find ourselves frustrated with it in another 5 years. What's
> there now is just scaffolding to support the interesting kernel-internal part.
> No doubt the kevent thread will come into play here.

Actually, the thing I like about kernel micro-threads (and that "fibril"
name is starting to grow on me) is that I'm hoping we might be able to use
them for that kevent thing too. And perhaps some other issues (ACPI has
some "events" that might work with synchronously scheduled threads).

IOW, synchronous threading does have its advantages..

Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just
because he's basically the normal scheduler maintainer, but also because
he's historically been involved in things like the async filename lookup
that the in-kernel web server thing used. EXACTLY the kinds of things
where fibrils actually give you *much* nicer interfaces.

Ingo - see linux-kernel for the announcement and WIP patches.

Linus

Zach Brown

unread,

Jan 30, 2007, 6:47:11 PM1/30/07

to Linus Torvalds

> Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not
> just
> because he's basically the normal scheduler maintainer, but also
> because
> he's historically been involved in things like the async filename
> lookup
> that the in-kernel web server thing used.

Yeah, that was dumb. I had him in the cc: initially, then thought it
was too large and lobbed a bunch off. My mistake.

Ingo, I'm interested in your reaction to the i386-specific mechanics
here (the thread_info copies terrify me) and the general notion of
how to tie this cleanly into the scheduler.

- z

Benjamin Herrenschmidt

unread,

Jan 30, 2007, 9:04:55 PM1/30/07

to Zach Brown

> - We would now have some measure of task_struct concurrency. Read that twice,
> it's scary. As two fibrils execute and block in turn they'll each be
> referencing current->. It means that we need to audit task_struct to make sure
> that paths can handle racing as its scheduled away. The current implementation
> *does not* let preemption trigger a fibril switch. So one only has to worry
> about racing with voluntary scheduling of the fibril paths. This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril. I think this is a
> manageable burden.

That's the one scaring me in fact ... Maybe it will end up being an easy
one but I don't feel too comfortable... we didn't create fibril-like
things for threads, instead, we share PIDs between tasks. I wonder if
the sane approach would be to actually create task structs (or have a
pool of them pre-created sitting there for performances) and add a way
to share the necessary bits so that syscalls can be run on those
spin-offs.

Ben.

Benjamin Herrenschmidt

unread,

Jan 30, 2007, 9:08:03 PM1/30/07

to Zach Brown

On Tue, 2007-01-30 at 15:45 -0800, Zach Brown wrote:
> > Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not
> > just
> > because he's basically the normal scheduler maintainer, but also
> > because
> > he's historically been involved in things like the async filename
> > lookup
> > that the in-kernel web server thing used.
>
> Yeah, that was dumb. I had him in the cc: initially, then thought it
> was too large and lobbed a bunch off. My mistake.
>
> Ingo, I'm interested in your reaction to the i386-specific mechanics
> here (the thread_info copies terrify me) and the general notion of
> how to tie this cleanly into the scheduler.

Thread info copies aren't such a big deal, we do that for irq stacks
already afaik

Ben.

Linus Torvalds

unread,

Jan 30, 2007, 9:47:04 PM1/30/07

to Benjamin Herrenschmidt

On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:

> > - We would now have some measure of task_struct concurrency. Read that twice,
> > it's scary. As two fibrils execute and block in turn they'll each be
> > referencing current->. It means that we need to audit task_struct to make sure
> > that paths can handle racing as its scheduled away. The current implementation
> > *does not* let preemption trigger a fibril switch. So one only has to worry
> > about racing with voluntary scheduling of the fibril paths. This can mean
> > moving some task_struct members under an accessor that hides them in a struct
> > in task_struct so they're switched along with the fibril. I think this is a
> > manageable burden.
>
> That's the one scaring me in fact ... Maybe it will end up being an easy
> one but I don't feel too comfortable...

We actually have almost zero "interesting" data in the task-struct.

All the real meat of a task has long since been split up into structures
that can be shared for threading anyway (ie signal/files/mm/etc).

Which is why I'm personally very comfy with just re-using task_struct
as-is.

NOTE! This is with the understanding that we *never* do any preemption.
The whole point of the microthreading as far as I'm concerned is exactly
that it is cooperative. It's not preemptive, and it's emphatically *not*
concurrent (ie you'd never have two fibrils running at the same time on
separate CPU's).

If you want preemptive of concurrent CPU usage, you use separate threads.
The point of AIO scheduling is very much inherent in its name: it's for
filling up CPU's when there's IO.

So the theory (and largely practice) is that you want to use real threads
to fill your CPU's, but then *within* those threads you use AIO to make
sure that each thread actually uses the CPU efficiently and doesn't just
block with nothing to do.

So with the understanding that this is neither CPU-concurrent nor
preemptive (*within* a fibril group - obviously the thread itself gets
both preempted and concurrently run with other threads), I don't worry at
all about sharing "struct task_struct".

Does that mean that we might not have some cases where we'd need to make
sure we do things differently? Of course not. Something migt show up. But
this actually makes it very clear what the difference between "struct
thread_struct" and "struct task_struct" are. One is shared between
fibrils, the other isn't.

Linus

Linus Torvalds

unread,

Jan 30, 2007, 10:02:54 PM1/30/07

to Benjamin Herrenschmidt

On Tue, 30 Jan 2007, Linus Torvalds wrote:
>
> Does that mean that we might not have some cases where we'd need to make
> sure we do things differently? Of course not. Something migt show up. But
> this actually makes it very clear what the difference between "struct
> thread_struct" and "struct task_struct" are. One is shared between
> fibrils, the other isn't.

Btw, this is also something where we should just disallow certain system
calls from being done through the asynchronous method.

Notably, clone/fork(), execve() and exit() are all things that we probably
simply shouldn't allow as "AIO" events.

The process handling ones are obvious: they are very much about the shared
"struct task_struct", so they rather clearly should only done "natively".

More interesting is the question about "close()", though. Currently we
have an optimization (fget/fput_light) that basically boils down to "we
know we are the only owners". That optimization becomes more "interesting"
with AIO - we need to disable it when fibrils are active (because other
fibrils or the main thread can do it), but we can still keep it for the
non-fibril case.

So we can certainly allow close() to happen in a fibril, but at the same
time, this is an area where just the *existence* of fibrils means that
certain other decisions that were thread-related may be modified to be
aware of the micro-threads too.

I suspect there are rather few of those, though. The only one I can think
of is literally the fget/fput_light() case, but there could be others.

Benjamin Herrenschmidt

unread,

Jan 31, 2007, 12:17:33 AM1/31/07

to Linus Torvalds

> NOTE! This is with the understanding that we *never* do any preemption.
> The whole point of the microthreading as far as I'm concerned is exactly
> that it is cooperative. It's not preemptive, and it's emphatically *not*
> concurrent (ie you'd never have two fibrils running at the same time on
> separate CPU's).

That makes it indeed much less worrisome...

> If you want preemptive of concurrent CPU usage, you use separate threads.
> The point of AIO scheduling is very much inherent in its name: it's for
> filling up CPU's when there's IO.

Ok, I see, that's in fact pretty similar to some task switching hack I
did about 10 years ago on MacOS to have "asynchronous" IO code be
implemented linearily :-)

Makes lots of sense imho.

Ben.

Nick Piggin

unread,

Jan 31, 2007, 12:38:11 AM1/31/07

to Linus Torvalds

Linus Torvalds wrote:
>
> On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:
>
>
>>>- We would now have some measure of task_struct concurrency. Read that twice,
>>>it's scary. As two fibrils execute and block in turn they'll each be
>>>referencing current->. It means that we need to audit task_struct to make sure
>>>that paths can handle racing as its scheduled away. The current implementation
>>>*does not* let preemption trigger a fibril switch. So one only has to worry
>>>about racing with voluntary scheduling of the fibril paths. This can mean
>>>moving some task_struct members under an accessor that hides them in a struct
>>>in task_struct so they're switched along with the fibril. I think this is a
>>>manageable burden.
>>
>>That's the one scaring me in fact ... Maybe it will end up being an easy
>>one but I don't feel too comfortable...
>
>
> We actually have almost zero "interesting" data in the task-struct.
>
> All the real meat of a task has long since been split up into structures
> that can be shared for threading anyway (ie signal/files/mm/etc).
>
> Which is why I'm personally very comfy with just re-using task_struct
> as-is.
>
> NOTE! This is with the understanding that we *never* do any preemption.
> The whole point of the microthreading as far as I'm concerned is exactly
> that it is cooperative. It's not preemptive, and it's emphatically *not*
> concurrent (ie you'd never have two fibrils running at the same time on
> separate CPU's).

So using stacks to hold state is (IMO) the logical choice to do async
syscalls, especially once you have a look at some of the other AIO
stuff going around.

I always thought that the AIO people didn't do this because they wanted
to avoid context switch overhead.

So now if we introduce the context switch overhead back, why do we need
just another scheduling primitive? What's so bad about using threads? The
upside is that almost everything is already there and working, and also
they don't have any of these preemption or concurrency restrictions.

The only thing I saw in Zach's post against the use of threads is that
some kernel API would change. But surely if this is the showstopper then
there must be some better argument than sys_getpid()?!

Aside from that, I'm glad that someone is looking at this way for AIO,
because I really don't like some aspects in the other approach.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

Nick Piggin

unread,

Jan 31, 2007, 12:52:28 AM1/31/07

to

In other words, while I share the appreciation for this clever trick of
using cooperative switching between these little thriblets, I don't
actually feel it is very elegant to then have to change the kernel so
much in order to handle them.

I would be fascinated to see where such a big advantage comes from using
these rather than threads. Maybe we can then improve threads not to suck
so much and everybody wins.

Linus Torvalds

unread,

Jan 31, 2007, 1:07:44 AM1/31/07

to Nick Piggin

On Wed, 31 Jan 2007, Nick Piggin wrote:
>
> I always thought that the AIO people didn't do this because they wanted
> to avoid context switch overhead.

I don't think that scheduling overhead was ever a really the reason, at
least not the primary one, and at least not on Linux. Sure, we can
probably make cooperative thread switching a bit faster than even
VM-sharing thread switching (maybe), but it's not going to be *that* big
an issue.

Ifaik, the bigger issues were about setup costs (but also purely semantic
- it was hard to do AIO semantics with threads).

And memory costs. The "one stack page per outstanding AIO" may end up
still being too expensive, but threads were even more so.

[ Of course, that used to also be the claim by all the people who thought
we couldn't do native kernel threads for "normal" threading either, and
should go with the n*m setup. Shows how much they knew ;^]

But I've certainly _personally_ always wanted to do AIO with threads. I
wanted to do it with regular threads (ie the "clone()" kind). It didn't
fly. But I think we can possibly lower both the setup costs and the memory
costs with the cooperative approach, to the point where maybe this one is
more palatable and workable.

And maybe it also solves some of the scalability worries (threads have ID
space and scheduling setup things that essentially go away by just not
doing them - which is what the fibrils simply wouldn't have).

(Sadly, some of the people who really _use_ AIO are the database people,
and they really only care about a particularly stupid and trivial case:
pure reads and writes. A lot of other loads care about much more complex
things: filename lookups etc, that traditional AIO cannot do at all, and
that you really want something more thread-like for. But those other loads
get kind of swamped by the DB needs, which are might tighter and trivial
enough that you don't "need" a real thread for them).

Linus

Andi Kleen

unread,

Jan 31, 2007, 2:58:27 AM1/31/07

to Zach Brown

Zach Brown <zach....@oracle.com> writes:

> This finally does something useful with the notion of being able to schedule
> stacks as fibrils under a task_struct. Again, i386-specific and in need of
> proper layering with archs.
>
> sys_asys_submit() is added to let userspace submit asynchronous system calls.
> It specifies the system call number and arguments. A fibril is constructed for
> each call. Each starts with a stack which executes the given system call
> handler and then returns to a function which records the return code of the
> system call handler. sys_asys_await_completion() then lets userspace collect
> these results.

Do you have any numbers how this compares cycle wise to just doing
clone+syscall+exit in user space?

If the difference is not too big might it be easier to fix
clone+syscall to be more efficient than teach all the rest
of the kernel about fibrils?

-Andi

Ingo Molnar

unread,

Jan 31, 2007, 3:45:59 AM1/31/07

to Linus Torvalds

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> [ Of course, that used to also be the claim by all the people who
> thought we couldn't do native kernel threads for "normal" threading
> either, and should go with the n*m setup. Shows how much they knew
> ;^]
>
> But I've certainly _personally_ always wanted to do AIO with threads.
> I wanted to do it with regular threads (ie the "clone()" kind). It
> didn't fly. But I think we can possibly lower both the setup costs and
> the memory costs with the cooperative approach, to the point where
> maybe this one is more palatable and workable.

as you know i've been involved in all the affected IO and scheduling
disciplines in question: i hacked on scheduling, 1:1 threading, on Tux,
i even optimized kaio under TPC workloads, and various other things. So
i believe to have seen most of these things first-hand and thus i
(pretend to) have no particular prejudice or other subjective
preference. In the following, rather long (and boring) analysis i touch
upon both 1:1 threading, light threads, state machines and AIO.

The first and most important thing is that i think there are only /two/
basic, fundamental IO approaches that matter to kernel design:

1) the most easy to program one.

2) the fastest one.

1), the most easily programmed model: this is tasks and synchronous IO.
Samba (and Squid, and even Apache most of the time ...) is still using
plain old /processes/ and is doing fine! Most of the old LWP arguments
are not really true anymore, TLB switches are very fast meanwhile (per
hardware improvements) and our scheduler is still pretty light.
Furthermore, for certain specialized environments that are able isolate
the programmer from the risks and complexities of thread programming,
threading can be somewhat faster (such as Java or C#). But for the
general C/C++ based server environment even threading is pure hell. (We
know it from the kernel that programming a threaded design is /hard/,
needs alot of discipline and takes alot more resources than any of the
saner models.)

2) the fastest model: is a pure, minimal state-machine tailored to the
task/workload we want to do. Tux does that (and i'm keeping it uptodate
for current 2.6 kernels too), and it still beats the pants off anything
user-space.

But reality is that most people care more about programmability: 300
MB/sec or 500 MB/sec web throughput limit (or a serving limit 30K
requests in parallel or 60K requests in parallel - whatever the
performance metric you care about is) isnt really making any difference
in all but the most specialized environments - /because/ the price to
pay for it is much worse programmability (and hence much worse
flexibility). [ And this is all about basic economy: if it's 10 times
harder to program something and what takes 1 month to program in one
model takes 10 months to program in the other model - but in 10 months
the hardware already got another 50% faster, likely offsetting the
advantage of that 'other' model to begin with ... ]

Note that often the speed difference between task based and state based
designs is alot smaller. But God is the programming complexity
difference huge. Not even huge but /HUGE/. All our programming languages
are procedural and stack based. (Not only is there a basic lack of
state-based programming languages, it's a product of our brain
structure: state machines are not really what the human mind is
structured for, but i digress.)

Now where do all these LWP, fibre, firbril, micro-thread or N:M concepts
fit? Most of the time they are just a /weakening/ of the #1 concept. And
that's why they will lose out, because #1 is all about programmability
and they dont offer anything new: because they cannot. Either you go for
programmability or you go for performance. There is /no/ middle ground
for us in the kernel! (User-space can do the middle ground thing, but
the kernel must only be structured around the two most fundamental
concepts that exist. [NOTE: more about KAIO later.])

Having a 1:1 relationship between user-space and kernel-space context is
a strong concept, and on modern CPUs we perform /process/ context
switches in 650 nsecs, we do user thread context switches in 500 nsecs,
and kernel<->kernel thread context switches in 350 nsecs. That's pretty
damn fast.

The big problem is, the M:N concepts (and i consider micro-threads or
'schedulable stacks' to be of that type) tend to operate the following
way: they build up hype by being "faster" - but the performance
advantage only comes from weakening #1! (for example by not making the
contexts generally schedulable, bindable, load-balancable, suspendable,
migratable, etc.) But then a few years later they grow (by virtue of
basic pressure from programmers, who /want/ a clean easily programmable
concept #1) the /very same/ features that they "optimized away" in the
beginning. With the difference that now all that infrastructure they
left out initially is replicated in user-space, in a poorer and
inconsistent way, blowing up cache-size, and slowing the /sane/ things
down in the end. (and then i havent even begun about the nightmare of
extending the debug infrastructure to light threads, ABI dependencies,
etc, etc...)

One often repeated (because pretty much only) performance advantage of
'light threads' is context-switch performance between user-space
threads. But reality is, nobody /cares/ about being able to
context-switch between "light user-space threads"! Why? Because there
are only two reasons why such a high-performance context-switch would
occur:

1) there's contention between those two tasks. Wonderful: now two
artificial threads are running on the /same/ CPU and they are even
contending each other. Why not run a single context on a single CPU
instead and only get contended if /another/ CPU runs a conflicting
context?? While this makes for nice "pthread locking benchmarks",
it is not really useful for anything real.

2) there has been an IO event. The thing is, for IO events we enter the
kernel no matter what - and we'll do so for the next 10 years at
minimum. We want to abstract away the hardware, we want to do
reliable resource accounting, we want to share hardware resources,
we want to rate-limit, etc., etc. While in /theory/ you could handle
IO purely from user-space, in practice you dont want to do that. And
if we accept the premise that we'll enter the kernel anyway, there's
zero performance difference between scheduling right there in the
kernel, or returning back to user-space to schedule there. (in fact
i submit that the former is faster). Or if we accept the theoretical
possibility of 'perfect IO hardware' that implements /all/ the
features that the kernel wants (in a secure and generic way, and
mind you, such IO hardware does not exist yet), then /at most/ the
performance advantage of user-space doing the scheduling is the
overhead of a null syscall entry. Which is a whopping 100 nsecs on
modern CPUs! That's roughly the latency of a /single/ DRAM access!

Furthermore, 'light thread' concepts can no way approach the performance
of #2 state-machines: if you /know/ what the structure of your context
is, and you can program it in a specialized state-machine way, there's
just so many shortcuts possible that it's not even funny.

> And maybe it also solves some of the scalability worries (threads have
> ID space and scheduling setup things that essentially go away by just
> not doing them - which is what the fibrils simply wouldn't have).

our PID management is very fast - i've never seen it show up in
profiles. (We could make it even more scalable if the need arises, for
example right now we still maintain the luxory of having a globally
increasing and tightly compressed PID/TID space.)

until a few days ago we even had a /global lock/ in the thread/task
creation and destruction hotpath since 2.5.43 (when oprofile support was
added, more than 4 years ago, the oprofile exit notifier lock) and
nobody really noticed or cared!

so i believe there are only two directions that make sense in the long
run:

1) improve our basic #1 design gradually. If something is a bottleneck,
if the scheduler has grown too fat, cut some slack. If micro-threads
or fibrils offer anything nice for our basic thread model: integrate
it into the kernel. (Dont worry about having the enter the kernel for
that, we /want/ that isolation! Resource control /needs/ a secure and
trustable context. Good debuggability /needs/ a secure and trustable
context. Hardware makers will make damn sure that such hardware-based
isolation is fast as lightning. SYSENTER will continue to be very
fast.) In any case, unless someone can answer the issues i raised
above, please dont mess with the basic 1:1 task model we have. It's
really what we want to have.

2) enable crazy people to do IO in the #2 way. For this i think the most
programmable interface is KAIO - because the 'context' /only/
involves the IO entity itself, which is simple enough for our brain
to wrap itself around. (and the hardware itself has the concept of
'in flight IO' anyway, so people are forced to learn about it and to
deal with it anyway, even in the synchronous IO model. So there's
quite some mindshare to build upon.) This also happens to realize
/most/ of the performance advantages that a state-machine like Tux
tends to have. (In that sense i dont see epoll to be conceptually
different from KAIO, although internally within the kernel it's quite
a bit different.)

now, KAIO 'behind the hood' (within the kernel) is like /very/ hard to
program in a fully state-driven manner - but we are crazy folks who
/might/ be able to pull it off. Initially we should just simulate the
really hard bits via a pool of in-kernel synchronous threads. Some IO
disciplines are already state-driven internally (networking most
notably, but also timers), so there KAIO can be implemented 'natively'.

And it will be /faster/ than micro-threads based AIO!

but frankly, KAIO is the most i can see a normal, sane human programmer
to be able to deal with. Any more exposure to state-machines will just
drive them crazy, and they'll lynch us (or more realistically: ignore
us) if we try anything more. And with KAIO they still have the /option/
to make all their user-space functionality state-driven. Also, abstract,
fully managed programming environments like Java can hide KAIO
complexities by implementing their synchronous primitives using KAIO.

Ingo

Xavier Bestel

unread,

Jan 31, 2007, 5:51:13 AM1/31/07

to Linus Torvalds

On Tue, 2007-01-30 at 19:02 -0800, Linus Torvalds wrote:
> Btw, this is also something where we should just disallow certain system
> calls from being done through the asynchronous method.

Does that mean that doing an AIO-disabled syscall will wait for all in-
flight AIO syscalls to finish ?

Xav

Zach Brown

unread,

Jan 31, 2007, 12:16:14 PM1/31/07

to Andi Kleen

On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:

> Do you have any numbers how this compares cycle wise to just doing
> clone+syscall+exit in user space?

Not yet, no. Release early, release often, and all that. I'll throw
something together.

- z

Andi Kleen

unread,

Jan 31, 2007, 12:24:06 PM1/31/07

to Zach Brown

On Wednesday 31 January 2007 18:15, Zach Brown wrote:
>
> On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
>
> > Do you have any numbers how this compares cycle wise to just doing
> > clone+syscall+exit in user space?
>
> Not yet, no. Release early, release often, and all that. I'll throw
> something together.

So what was the motivation for doing this then? It's only point
is to have smaller startup costs for AIO than clone+fork without
fixing the VFS code to be a state machine, right?

I'm personally unclear if it's really less work to teach a lot of
code in the kernel about a new thread abstraction than changing VFS.

Your patches don't look that complicated yet but you openly
admitted you waved away many of the more tricky issues (like
signals etc.) and I bet there are yet-unknown side effects
of this too that will need more changes.

I would expect a VFS solution to be the fastest of any at least.

I'm not sure the fibrils thing will be that much faster than
a possibly somewhat fast pathed for this case clone+syscall+exit.

-Andi

Zach Brown

unread,

Jan 31, 2007, 12:39:05 PM1/31/07

to Benjamin Herrenschmidt

>> - We would now have some measure of task_struct concurrency. Read
>> that twice,
>> it's scary.

> That's the one scaring me in fact ... Maybe it will end up being an

> easy
> one but I don't feel too comfortable...

Indeed, that was my first reaction too. I dismissed the idea for a
good six months after initially realizing that it implied sharing
journal_info, etc.

But when I finally sat down and started digging through the
task_struct members and, after quickly dismissing involuntary
preemption of the fibrils, it didn't seem so bad. I haven't done an
exhaustive audit yet (and I won't advocate merging until I have) but
I haven't seen any train wrecks.

> we didn't create fibril-like
> things for threads, instead, we share PIDs between tasks. I wonder if
> the sane approach would be to actually create task structs (or have a
> pool of them pre-created sitting there for performances) and add a way
> to share the necessary bits so that syscalls can be run on those
> spin-offs.

Maybe, if it comes to that. I have some hopes that sharing by
default and explicitly marking the bits that we shouldn't share will
be good enough.

- z

Zach Brown

unread,

Jan 31, 2007, 12:49:07 PM1/31/07

to Linus Torvalds

> Does that mean that we might not have some cases where we'd need to
> make
> sure we do things differently? Of course not. Something migt show up.

Might, and has. For a good time, take journal_info out of
per_call_chain() in the patch set and watch it helpfully and loudly
wet itself. There really really are bits of thread_struct which are
strictly thread-local-storage, of a sort, for a kernel call path.
Sharing them, even only through cooperate scheduling, is fatal.
link_count is another obvious one.

They're also the only ones I've bothered to discover so far :).

> But
> this actually makes it very clear what the difference between "struct
> thread_struct" and "struct task_struct" are. One is shared between
> fibrils, the other isn't.

Indeed.

Right now the per-fibril uses of task_struct members are left inline
in task_struct and are copied on fibril switches.

We *could* put them in thread_info, at the cost of stack pressure, or
hang them off task_struct in their own struct to avoid the copies, at
the cost of indirection. I didn't like imposing a cost on paths that
don't use fibrils, though, so I left them inline.

(I think you know all this. I'm clarifying for the peanut gallery, I
hope.)

- z

Benjamin LaHaise

unread,

Jan 31, 2007, 12:52:00 PM1/31/07

to Zach Brown

On Wed, Jan 31, 2007 at 09:38:11AM -0800, Zach Brown wrote:
> Indeed, that was my first reaction too. I dismissed the idea for a
> good six months after initially realizing that it implied sharing
> journal_info, etc.
>
> But when I finally sat down and started digging through the
> task_struct members and, after quickly dismissing involuntary
> preemption of the fibrils, it didn't seem so bad. I haven't done an
> exhaustive audit yet (and I won't advocate merging until I have) but
> I haven't seen any train wrecks.

I'm still of the opinion that you cannot do this without creating actual
threads. That said, there is room for setting up the task_struct beforehand
without linking it into the system lists. The reason I don't think this
approach works (and I looked at it a few times) is that many things end
up requiring special handling: things like permissions, signals, FPU state,
segment registers.... The list ends up looking exactly the way task_struct
is, making the actual savings very small.

What the fibrils approach is useful for is the launching of the thread
initially, as you don't have to retain things like the current FPU state,
change segment registers, etc. Changing the stack is cheap, the rest of
the work is not.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <do...@kvack.org>.

Zach Brown

unread,

Jan 31, 2007, 1:00:45 PM1/31/07

to Linus Torvalds

> Btw, this is also something where we should just disallow certain
> system
> calls from being done through the asynchronous method.

Yeah. Maybe just a bitmap built from __NR_ constants? I don't know
if we can do it in a way that doesn't require arch maintainer's
attention.

It seems like it would be nice to avoid putting a test in the
handlers themselves, and leave it up to the aio syscall submission
processing.

> More interesting is the question about "close()", though. Currently we
> have an optimization (fget/fput_light) that basically boils down to
> "we
> know we are the only owners". That optimization becomes more
> "interesting"
> with AIO - we need to disable it when fibrils are active (because
> other
> fibrils or the main thread can do it), but we can still keep it for
> the
> non-fibril case.

I'll take a look, thanks for pointing it out.

- z

Zach Brown

unread,

Jan 31, 2007, 1:22:43 PM1/31/07

to Nick Piggin

> The only thing I saw in Zach's post against the use of threads is that
> some kernel API would change. But surely if this is the showstopper
> then
> there must be some better argument than sys_getpid()?!

Haha, yeah, that's the silly example I keep throwing around :). I
guess it does leave a little too much of the exercise up to the reader.

Perhaps a less goofy example are the uses of current->ioprio and
current->io_context?

If you create and destroy threads around each operation then you're
going to be creating and destroying an io_context around each op
instead of getting a reference on a pre-existing context in
additional ops. ioprio is inherited it seems, though, so that's not
so bad.

If you have a pool of threads and you want to update the ioprio for
future IOs, you now have to sync up the pool's ioprio with new
desired priority.

It's all solvable, sure. Get an io_context ref in copy_process().
Share ioprio instead of inheriting it. Have a fun conversation with
Jens about the change in behaviour this implies. Broadcasting to
threads to update ioprio isn't exactly rocket science.

But with the fibril model the user don't have to know to worry about
the inconsistencies and we kernel developers don't have to worry
about pro-actively stamping them out. A series of sync write and
ioprio setting calls behaves exactly the same as that series of calls
issued sequentially as "async" calls. That's worth aiming for, I think.

- z

Zach Brown

unread,

Jan 31, 2007, 2:24:43 PM1/31/07

to Andi Kleen

On Jan 31, 2007, at 9:21 AM, Andi Kleen wrote:

> On Wednesday 31 January 2007 18:15, Zach Brown wrote:
>>
>> On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
>>
>>> Do you have any numbers how this compares cycle wise to just doing
>>> clone+syscall+exit in user space?
>>
>> Not yet, no. Release early, release often, and all that. I'll throw
>> something together.
>
> So what was the motivation for doing this then?

Most fundamentally? Providing AIO system call functionality at a
much lower maintenance cost. The hope is that the cost of adopting
these fibril things will be lower than the cost of having to touch a
code path that wants AIO support.

I simply don't believe that it's cheap to update code paths to
support non-blocking state machines. As just one example of a
looming cost, consider the retry-based buffered fs AIO patches that
exist today. Their requirement to maintain these precisely balanced
code paths that know to only return -EIOCBRETRY once they're at a
point where retries won't access current-> seems.. unsustainable to
me. This stems from the retries being handled off in the aio kernel
threads which have their own task_struct. fs/aio.c goes to the
trouble of migrating ->mm from the submitting task_struct, but
nothing else. Continually adjusting this finely balanced
relationship between paths that return -EIOCBRETY and the fields of
task_struct that fs/aio.c knows to share with the submitting context
seems unacceptably fragile.

Even with those buffered IO patches we still only get non-blocking
behaviour at a few specific blocking points in the buffered IO path.
It's nothing like the guarantee of non-blocking submission returns
that the fibril-based submission guarantees.

> It's only point
> is to have smaller startup costs for AIO than clone+fork without
> fixing the VFS code to be a state machine, right?

Smaller startup costs and fewer behavioural differences. Did that
message to Nick about ioprio and io_context resonate with you at all?

> I'm personally unclear if it's really less work to teach a lot of
> code in the kernel about a new thread abstraction than changing VFS.

Why are we limiting the scope of moving to a state machine just to
the VFS? If you look no further than some hypothetical AIO iscsi/aoe/
nbd/whatever target you obviously include networking. Probably splice
() if you're aggressive :).

Let's be clear. I would be thrilled if AIO was implemented by native
non-blocking handler implementations. I don't think it will happen.
Not because we don't think it sounds great on paper, but because it's
a hugely complex endeavor that would take development and maintenance
effort away from the task of keeping basic functionality working.

So the hope with fibrils is that we lower the barrier to getting AIO
syscall support across the board at an acceptable cost.

It doesn't *stop* us from migrating very important paths (storage,
networking) to wildly optimized AIO implementations. But it also
doesn't force AIO support to wait for that.

> Your patches don't look that complicated yet but you openly
> admitted you waved away many of the more tricky issues (like
> signals etc.) and I bet there are yet-unknown side effects
> of this too that will need more changes.

To quibble, "waved away" implies that they've been dismissed. That's
not right. It's a work in progress, so yes, there will be more
fiddly details discovered and addressed over time. The hope is that
when it's said and done it'll still be worth merging. If at some
point it gets to be too much, well, at least we'll have this work to
reference as a decisive attempt.

> I'm not sure the fibrils thing will be that much faster than
> a possibly somewhat fast pathed for this case clone+syscall+exit.

I'll try and get some numbers for you sooner rather than later.

Thanks for being diligent, this is exactly the kind of hard look I
want this work to get.

- z

Zach Brown

unread,

Jan 31, 2007, 2:26:37 PM1/31/07

to Benjamin LaHaise

> without linking it into the system lists. The reason I don't think
> this
> approach works (and I looked at it a few times) is that many things
> end
> up requiring special handling: things like permissions, signals,
> FPU state,
> segment registers....

Can you share a specific example of the special handling required?

- z

Zach Brown

unread,

Jan 31, 2007, 2:29:32 PM1/31/07

to Xavier Bestel

On Jan 31, 2007, at 2:50 AM, Xavier Bestel wrote:

> On Tue, 2007-01-30 at 19:02 -0800, Linus Torvalds wrote:
>> Btw, this is also something where we should just disallow certain
>> system
>> calls from being done through the asynchronous method.
>
> Does that mean that doing an AIO-disabled syscall will wait for all
> in-
> flight AIO syscalls to finish ?

That seems unlikely. I imagine we'd just return EINVAL or ENOSYS or
something to that effect.

- z

Benjamin LaHaise

unread,

Jan 31, 2007, 3:05:46 PM1/31/07

to Zach Brown

On Wed, Jan 31, 2007 at 11:25:30AM -0800, Zach Brown wrote:
> >without linking it into the system lists. The reason I don't think
> >this
> >approach works (and I looked at it a few times) is that many things
> >end
> >up requiring special handling: things like permissions, signals,
> >FPU state,
> >segment registers....
>
> Can you share a specific example of the special handling required?

Take FPU state: memory copies and RAID xor functions use MMX/SSE and
require that the full task state be saved and restored.

Task priority is another. POSIX AIO lets you specify request priority, and
it really is needed for realtime workloads where things like keepalive
must be processed at a higher priority. This is especially important on
embedded systems which don't have a surplus of CPU cycles.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <do...@kvack.org>.

Joel Becker

unread,

Jan 31, 2007, 3:21:17 PM1/31/07

to Linus Torvalds

On Tue, Jan 30, 2007 at 10:06:48PM -0800, Linus Torvalds wrote:
> (Sadly, some of the people who really _use_ AIO are the database people,
> and they really only care about a particularly stupid and trivial case:
> pure reads and writes. A lot of other loads care about much more complex
> things: filename lookups etc, that traditional AIO cannot do at all, and
> that you really want something more thread-like for. But those other loads
> get kind of swamped by the DB needs, which are might tighter and trivial
> enough that you don't "need" a real thread for them).

While certainly not an exhaustive list, DB people love their
reads and writes, but are pining for network reads and writes. They
also are very excited about async poll(), connect(), and accept(). One
of the big problems today is that you can either sleep for your I/O in
io_getevents() or for your connect()/accept() in poll()/epoll(), but
there is nowhere you can sleep for all of them at once. That's why the
aio list continually proposes aio_poll() or returning aio events
via epoll().
Fibril based syscalls would allow async connect()/accept() and
the rest of networking, plus one stop shopping for completions.

Joel

--

"Born under a bad sign.
I been down since I began to crawl.
If it wasn't for bad luck,
I wouldn't have no luck at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel....@oracle.com
Phone: (650) 506-8127

Zach Brown

unread,

Jan 31, 2007, 3:42:32 PM1/31/07

to Benjamin LaHaise

> Take FPU state: memory copies and RAID xor functions use MMX/SSE and
> require that the full task state be saved and restored.

Sure, that much is obvious. I was hoping to see what FPU state
juggling actually requires. I'm operating under the assumption that
it won't be *terrible*.

> Task priority is another. POSIX AIO lets you specify request
> priority, and
> it really is needed for realtime workloads where things like keepalive
> must be processed at a higher priority.

Yeah. A first-pass approximation might be to have threads with asys
system calls grouped by priority. Leaving all that priority handling
to the *task* scheduler, instead of the dirt-stupid fibril
"scheduler", would be great. If we can get away with it. I don't
have a good feeling for what portion of the world actually cares
about this, or to what degree.

- z

Suparna Bhattacharya

unread,

Feb 1, 2007, 6:08:35 AM2/1/07

to Zach Brown

Wooo ...hold on ... I think this is swinging out of perspective :)

I have said some of this before, but let me try again.

As you already discovered when going down the fibril path, there are
two kinds of accesses to current-> state, (1) common state
for a given call chain (e.g. journal info etc), and (2) for
various validations against the caller's process (uid, ulimit etc).

(1) is not an issue when it comes to execution in background threads
(the VFS already uses background writeback for example).

As for (2), such checks need to happen upfront at the time of IO submission,
so again are not an issue.

This is aside from access to the caller's address space, a familiar
concept which the AIO threads use. If there is any state that
relates to address space access, then it belongs in the ->mm struct, rather
than in current (and we should fix that if we find anything which isn't
already there).

I don't see any other reason why IO paths should be assuming that they are
running in the original caller's context, midway through doing the IO. If
that were the case background writeouts and readaheads could be fragile as
well (or ptrace). The reason it isn't is because of this conceptual division of
responsibility.

Sure, having explicit separation of submission checks as an interface
would likely make this clearer and cleaner, but I'm just
pointing out that usage of current-> state isn't and shouldn't be arbitrary
when it comes to filesystem IO paths. We should be concerned in any case
if that starts happening.

Of course, this is fundamentally connected to the way filesystem IO is
designed to work, and may not necessarily apply to all syscal

When you want to make any and every syscall asynchronous, then indeed
the challenge is magnified and that is where it could get scary. But that
isn't the problem the current AIO code is trying to tackle.

>
> Even with those buffered IO patches we still only get non-blocking
> behaviour at a few specific blocking points in the buffered IO path.
> It's nothing like the guarantee of non-blocking submission returns
> that the fibril-based submission guarantees.

This one is a better reason, and why I have thought of fibrils (or the
equivalent alternative of enhancing kernel theads to become even lighter)
as an interesting fallback option to implement AIO for cases which we
haven't (maybe some of which are too complicated) gotton around to
supporting natively. Especially if it could be coupled with some clever
tricks to keep stack space to be minimal (I have wondered whether any of
the ideas from similar user-level efforts like Cappricio, or laio would help).

BTW, I like the way you are approaching this with a cautiously
critical eye cognizant of lurking details/issues, despite the obvious
(and justified) excitement/eureka feeling. AIO _is_ hard !

Regards
Suparna

>
> - z
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majo...@kvack.org. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"aa...@kvack.org">aa...@kvack.org</a>

--
Suparna Bhattacharya (sup...@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

Trond Myklebust

unread,

Feb 1, 2007, 2:50:54 PM2/1/07

to sup...@in.ibm.com

On Thu, 2007-02-01 at 16:43 +0530, Suparna Bhattacharya wrote:
> Wooo ...hold on ... I think this is swinging out of perspective :)
>
> I have said some of this before, but let me try again.
>
> As you already discovered when going down the fibril path, there are
> two kinds of accesses to current-> state, (1) common state
> for a given call chain (e.g. journal info etc), and (2) for
> various validations against the caller's process (uid, ulimit etc).
>
> (1) is not an issue when it comes to execution in background threads
> (the VFS already uses background writeback for example).
>
> As for (2), such checks need to happen upfront at the time of IO submission,
> so again are not an issue.

Wrong! These checks can and do occur well after the time of I/O
submission in the case of remote filesystems with asynchronous writeback
support.

Consider, for instance, the cases where the server reboots and loses all
state. Then there is the case of failover and/or migration events, where
the entire filesystem gets moved from one server to another, and again
you may have to recover state, etc...

> I don't see any other reason why IO paths should be assuming that they are
> running in the original caller's context, midway through doing the IO. If
> that were the case background writeouts and readaheads could be fragile as
> well (or ptrace). The reason it isn't is because of this conceptual division of
> responsibility.

The problem with this is that the security context is getting
progressively more heavy as we add more and more features. In addition
to the original uid/gid/fsuid/fsgid/groups, we now have stuff like
keyrings to carry around. Then there is all the context needed to
support selinux,...
In the end, you end up recreating most of struct task_struct...

Cheers
Trond

bert hubert

unread,

Feb 1, 2007, 3:27:01 PM2/1/07

to Zach Brown

On Tue, Jan 30, 2007 at 01:39:45PM -0700, Zach Brown wrote:

> sys_asys_submit() is added to let userspace submit asynchronous system calls.
> It specifies the system call number and arguments. A fibril is constructed for
> each call. Each starts with a stack which executes the given system call
> handler and then returns to a function which records the return code of the
> system call handler. sys_asys_await_completion() then lets userspace collect
> these results.

Zach,

Do you have any userspace code that can be used to get started experimenting
with your fibril based AIO stuff?

I want to try it on from a userspace perspective.

I'm confident I could hack it up from scratch, but I'm sure you'll have some
test code lying around.

I scanned the discussion so far, but I might've missed any references to
userspace code so far.

Thanks!

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Zach Brown

unread,

Feb 1, 2007, 4:30:51 PM2/1/07

to bert hubert

> Do you have any userspace code that can be used to get started
> experimenting
> with your fibril based AIO stuff?

I only have a goofy little test app so far:

http://www.zabbo.net/~zab/aio-walk-tree.c

It's not to be taken too seriously :)

> I want to try it on from a userspace perspective.

Frankly, I'm not sure its ready for that yet. You're welcome to give
it a try, but it's early enough that you're sure to hit problems
almost immediately.

- z

Zach Brown

unread,

Feb 1, 2007, 5:20:06 PM2/1/07

to sup...@in.ibm.com

> Wooo ...hold on ... I think this is swinging out of perspective :)

I'm sorry, but I don't. I think using the EIOCBRETRY method in
complicated code paths requires too much maintenance cost to justify
its benefits. We can agree to disagree on that judgement :).

- z

Suparna Bhattacharya

unread,

Feb 1, 2007, 10:31:00 PM2/1/07

to Zach Brown

On Thu, Feb 01, 2007 at 02:18:55PM -0800, Zach Brown wrote:
> >Wooo ...hold on ... I think this is swinging out of perspective :)
>
> I'm sorry, but I don't. I think using the EIOCBRETRY method in
> complicated code paths requires too much maintenance cost to justify
> its benefits. We can agree to disagree on that judgement :).

I don't disagree about limitations of the EIOCBRETRY approach, nor do I
recommend it for all sorts of complicated code paths. It is only good as
an approximation for specific blocking points involving idempotent
behaviour, and I was trying to emphasise that that is *all* it was ever
intended for. I certainly do not see it as a viable path to make all syscalls
asynchronous, or to address all blocking points in filesystem IO.

And I do like the general direction of your approach, only need to think
deeper about the details like how to reduce stack per IO request so this
can scale better. So we don't disagree as much as you think :)

The point where we seem to disagree is that I think there is goodness in
maintaining the conceptual clarity between what parts of the operation assume
that it is executing in the original submitters context. For the IO paths
this is what allows things like readahead and writeback to work and to cluster
operations which may end up to/from multiple submitters. This means that
if there is some context that is needed thereafter it could be associated
with the IO request (as an argument or in some other manner), so that this
division is still maintained.

Regards
Suparna

>
> - z
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majo...@kvack.org. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"aa...@kvack.org">aa...@kvack.org</a>

--
Suparna Bhattacharya (sup...@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

-

bert hubert

unread,

Feb 2, 2007, 2:13:02 AM2/2/07

to Zach Brown

On Thu, Feb 01, 2007 at 01:29:41PM -0800, Zach Brown wrote:

> >I want to try it on from a userspace perspective.
>
> Frankly, I'm not sure its ready for that yet. You're welcome to give
> it a try, but it's early enough that you're sure to hit problems
> almost immediately.

I'm counting on it - what I want to taste is if the concept is a match for
the things I want to do.

Thanks!

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Suparna Bhattacharya

unread,

Feb 2, 2007, 2:14:53 AM2/2/07

to Trond Myklebust

Isn't that kind of information supposed to be captured in nfs_open_context ?
Which is associated with the open file instance ...

I know this has been a traditional issue with network filesystems, and I
haven't kept up with the latest code and decisions in that respect, but how
would you do background writeback if there is an assumption of running in
the context of the original submitter ?

Regards
Suparna

> In the end, you end up recreating most of struct task_struct...
>
> Cheers
> Trond
>

> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majo...@kvack.org. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"aa...@kvack.org">aa...@kvack.org</a>

--
Suparna Bhattacharya (sup...@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

-

Andi Kleen

unread,

Feb 2, 2007, 2:46:03 AM2/2/07

to sup...@in.ibm.com

> Isn't that kind of information supposed to be captured in nfs_open_context ?
> Which is associated with the open file instance ...

Or a refcounted struct cred. Which would be needed for strict POSIX thread
semantics likely anyways. But there never was enough incentive to go down
that path and it would be likely somewhat slow.

>
> I know this has been a traditional issue with network filesystems, and I
> haven't kept up with the latest code and decisions in that respect, but how
> would you do background writeback if there is an assumption of running in
> the context of the original submitter ?

AFAIK (Trond will hopefully correct me if I'm wrong) in the special case of
NFS there isn't much problem because the server does the (passive) authentication
and there is no background writeback from server to client. The client just
does the usual checks at open time and then forgets about it. The server
threads don't have own credentials but just check those of others.

I can't think of any cases where you would need to do authentication
in the client for every read() or write()

Overall the arguments for reusing current don't seem to be strong to me.

-Andi

Davide Libenzi

unread,

Feb 4, 2007, 12:13:36 AM2/4/07

to Zach Brown

On Tue, 30 Jan 2007, Zach Brown wrote:

> +void asys_task_exiting(struct task_struct *tsk)
> +{
> + struct asys_result *res, *next;
> +
> + list_for_each_entry_safe(res, next, &tsk->asys_completed, item)
> + kfree(res);
> +
> + /*
> + * XXX this only works if tsk->fibril was allocated by
> + * sys_asys_submit(), not if its embedded in an asys_call. This
> + * implies that we must forbid sys_exit in asys_submit.
> + */
> + if (tsk->fibril) {
> + BUG_ON(!list_empty(&tsk->fibril->run_list));
> + kfree(tsk->fibril);
> + tsk->fibril = NULL;
> + }
> +}

What happens to lingering fibrils? Better keep track of both runnable and
sleepers, and do proper cleanup.

> +asmlinkage long sys_asys_submit(struct asys_input __user *user_inp,
> + unsigned long nr_inp)
> +{
> + struct asys_input inp;
> + struct asys_result *res;
> + struct asys_call *call;
> + struct thread_info *ti;
> + unsigned long i;
> + long err = 0;
> +
> + /* Allocate a fibril for the submitter's thread_info */
> + if (current->fibril == NULL) {
> + current->fibril = kzalloc(sizeof(struct fibril), GFP_KERNEL);
> + if (current->fibril == NULL)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&current->fibril->run_list);
> + current->fibril->state = TASK_RUNNING;
> + current->fibril->ti = current_thread_info();
> + }

Why do we need the "special" submission fibril?

> + for (i = 0; i < nr_inp; i++) {
> +
> + if (copy_from_user(&inp, &user_inp[i], sizeof(inp))) {
> + err = -EFAULT;
> + break;
> + }
> +
> + res = kmalloc(sizeof(struct asys_result), GFP_KERNEL);
> + if (res == NULL) {
> + err = -ENOMEM;
> + break;
> + }
> +
> + /* XXX kzalloc to init call.fibril.per_cpu, add helper */
> + call = kzalloc(sizeof(struct asys_call), GFP_KERNEL);
> + if (call == NULL) {
> + kfree(res);
> + err = -ENOMEM;
> + break;
> + }
> +
> + ti = alloc_thread_info(tsk);
> + if (ti == NULL) {
> + kfree(res);
> + kfree(call);
> + err = -ENOMEM;
> + break;
> + }
> +
> + err = asys_init_fibril(&call->fibril, ti, &inp);
> + if (err) {
> + kfree(res);
> + kfree(call);
> + free_thread_info(ti);
> + break;
> + }
> +
> + res->comp.cookie = inp.cookie;
> + call->result = res;
> + ti->task = current;
> +
> + sched_new_runnable_fibril(&call->fibril);
> + schedule();
> + }
> +
> + return i ? i : err;
> +}

Streamline error path (kfree(NULL) is OK):

err = -ENOMEM;
a = alloc();
b = alloc();
c = alloc();
if (!a || !b || !c)
goto error;
...
error:
kfree(c);
kfree(b);
kfree(a);
return err;

- Davide

Davide Libenzi

unread,

Feb 4, 2007, 12:13:42 AM2/4/07

to Zach Brown

On Tue, 30 Jan 2007, Zach Brown wrote:

> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Zab, great stuff!
I've found a little time to take a look at the patches and throw some
comments at you.
Keep in mind though, that the last time I seriously looked at this stuff,
sched.c was like 2K lines of code, and now it's 7K. Enough said ;)

> Right now to provide AIO support for a system call you have to express your
> interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
> translate this into some call path in the kernel that passes in an iocb, and
> then update your code path implement with either completion-based (EIOCBQUEUED)
> or retry-based (EIOCBRETRY) AIO with the iocb.
>
> This patch series changes this by moving the complexity into generic code such
> that a system call handler would provide AIO support in exactly the same way
> that it supports a synchronous call. It does this by letting a task have
> multiple stacks executing system calls in the kernel. Stacks are switched in
> schedule() as they block and are made runnable.

As I said in another email, I think this is a *great* idea. It allows for
generic async execution, while leaving the core kernel AIO unware. Of
course, ot be usable, a lot of details needs to be polished, and
performance evaluated.

> We start in sys_asys_submit(). It allocates a fibril for this executing
> submission syscall and hangs it off the task_struct. This lets this submission
> fibril be scheduled along with the later asys system calls themselves.

Why do you need this "special" fibril for the submission task?

> The specific switching mechanics of this implementation rely on the notion of
> tracking a stack as a full thread_info pointer. To make the switch we transfer
> the non-stack bits of the thread_info from the old fibril's ti to the new
> fibril's ti. We update the book keeping in the task_struct to
> consider the new thread_info as the current thread_info for the task. Like so:
>
> *next->ti = *ti;
> *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
>
> current->thread_info = next->ti;
> current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
> current->fibril = next;
> current->state = next->state;
> current->per_call = next->per_call;
>
> Yeah, messy. I'm interested in aggressive feedback on how to do this sanely.
> Especially from the perspective of worrying about all the archs.

Maybe an arch-specific copy_thread_info()? Or, since there's a 1:1
relationship, just merging them.

> Did everyone catch that "per_call" thing there? That's to switch members of
> task_struct which are local to a specific call. link_count, journal_info, that
> sort of thing. More on that as we talk about the costs later.

Yes ;) Brutally copying the structure over does not look good IMO. Better
keep a pointer and swapping them. A clone_per_call() and free_per_call()
might be needed.

> Eventually the timer fires and the hrtimer code path wakes the fibril:
>
> - if (task)
> - wake_up_process(task);
> + if (wake_target)
> + wake_up_target(wake_target);
>
> We've doctored try_to_wake_up() to be able to tell if its argument is a
> task_struct or one of these fibril targets. In the fibril case it calls
> try_to_wake_up_fibril(). It notices that the target fibril does need to be
> woken and sets it TASK_RUNNING. It notices that it it's current in the task so
> it puts the fibril on the task's fibril run queue and wakes the task. There's
> grossness here. It needs the task to come through schedule() again so that it
> can find the new runnable fibril instead of continuing to execute its current
> fibril. To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.

Fine IMO. Better keep scheduling code localized inside schedule().

> - With get AIO support for all syscalls. Every single one. (Well, please, no
> asys sys_exit() :)). Buffered IO, sendfile, recvmsg, poll, epoll, hardware
> crypto ioctls, open, mmap, getdents, the entire splice API, etc.

Eeek, ... poll, epoll :)
That might solve the async() <-> POSIX bridge in the other way around. The
collector will become the async() events fetcher, instead of the other way
around. Will work just fine ...

> - We wouldn't multiply testing and maintenance burden with separate AIO paths.
> No is_sync_kiocb() testing and divergence between returning or calling
> aio_complete(). No auditing to make sure that EIOCBRETRY only being returned
> after any significant references of current->. No worries about completion
> racing from the submission return path and some aio_complete() being called
> from another context. In this scheme if your sync syscall path isn't broken,
> your AIO path stands a great chance of working.

This is the *big win* of the whole thing IMO.

> - AIO syscalls which *don't* block see very little overhead. They'll allocate
> stacks and juggle the run queue locks a little, but they'll execute in turn on
> the submitting (cache-hot, presumably) processor. There's room to optimize
> this path, too, of course.

Stack allocation can be optimized/cached, as someone else already said.

> - The 800lb elephant in the room. It uses a full stack per blocked operation.
> I believe this is a reasonable price to pay for the flexibility of having *any*
> call pending. It rules out some loads which would want to keep *millions* of
> operations pending, but I humbly submit that a load rarely takes that number of
> concurrent ops to saturate a resource. (think of it this way: we've gotten
> this far by having to burn a full *task* to have *any* syscall pending.) While
> not optimal, it opens to door to a lot of functionality without having to
> rewrite the kernel as a giant non-blocking state machine.

This should not be a huge problem IMO. High latency operations like
network sockets can be handled with standard I/O events interfaces like
poll/epoll, so I do not expect to have a huge number of fibrils around.
The number of fibrils will be proportional to the number of active
connections, not to the total number of connections.

> It should be noted that my very first try was to copy the used part of stacks
> in to and out of one full allocated stack. This uses less memory per blocking
> operation at the cpu cost of copying the used regions. And it's a terrible
> idea, for at least two reasons. First, to actually get the memory overhead
> savings you allocate at stack switch time. If that allocation can't be
> satisfied you are in *trouble* because you might not be able to switch over to
> a fibril that is trying to free up memory. Deadlock city. Second, it means
> that *you can't reference on-stack data in the wake-up path*. This is a
> nightmare. Even our trivial sys_nanosleep() example would have had to take its
> hrtimer_sleeper off the stack and allocate it. Never mind, you know, basically
> every single user of <linux/wait.h>. My current thinking is that it's just
> not worth it.

Agreed. Most definitely not worth it, for the reasons above.

> - We would now have some measure of task_struct concurrency. Read that twice,

> it's scary. As two fibrils execute and block in turn they'll each be
> referencing current->. It means that we need to audit task_struct to make sure
> that paths can handle racing as its scheduled away. The current implementation
> *does not* let preemption trigger a fibril switch. So one only has to worry
> about racing with voluntary scheduling of the fibril paths. This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril. I think this is a
> manageable burden.

That seems the correct policy in any case.

> - Signals. I have no idea what behaviour we want. Help? My first guess is
> that we'll want signal state to be shared by fibrils by keeping it in the
> task_struct. If we want something like individual cancellation, we'll augment
> signal_pending() with some some per-fibril test which will cause it to return
> from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> cancellation, I'll argue) as it would have if a signal was pending.

Fibril should IMO use current thread signal policies. I think a signal
should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread
policies mandate that. I'd keep a list_head of currently scheduled-out
TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is
delivered to the thread (wake_target bit #1 set to mean wake-all-interruptable-fibrils?).
The other thing is signal_pending(). The sigpending flag test is not going
to work as is (cleared at the first do_signal). Setting a bit in each
fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe
a sequential signal counter in task_struct, matched by one in the fibril.
A signal would increment the task_struct counter, and a fibril
schedule-out would save the task_struct counter to the fibril. The
signal_pending() for a fibril is a compare of the two. Or something
similar.
In general, I think it'd make sense to have a fibril-based implemenation
and a kthread-based one, and compare the messyness :) of the two related
to cons/performnces.

Davide Libenzi

unread,

Feb 4, 2007, 3:00:48 PM2/4/07

to Zach Brown

On Sat, 3 Feb 2007, Davide Libenzi wrote:

> > - Signals. I have no idea what behaviour we want. Help? My first guess is
> > that we'll want signal state to be shared by fibrils by keeping it in the
> > task_struct. If we want something like individual cancellation, we'll augment
> > signal_pending() with some some per-fibril test which will cause it to return
> > from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> > cancellation, I'll argue) as it would have if a signal was pending.
>
> Fibril should IMO use current thread signal policies. I think a signal
> should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread
> policies mandate that. I'd keep a list_head of currently scheduled-out
> TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is
> delivered to the thread (wake_target bit #1 set to mean wake-all-interruptable-fibrils?).
> The other thing is signal_pending(). The sigpending flag test is not going
> to work as is (cleared at the first do_signal). Setting a bit in each
> fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe
> a sequential signal counter in task_struct, matched by one in the fibril.
> A signal would increment the task_struct counter, and a fibril
> schedule-out would save the task_struct counter to the fibril. The
> signal_pending() for a fibril is a compare of the two. Or something
> similar.

Another thing linked to signals that was not talked about, is cancellation
of an in-flight request. We want to give the ability to cancel an
in-flight request, with something like async_cancel(cookie). In my
userspace library I simply disable SA_RESTART of SIGUSR2, and I do a
pthread_kill() on the thread servicing the request. But this will IMO have
other implications (linked to signal delivery) in a kernel fibril-based
implementation, to think about it.

Linus Torvalds

unread,

Feb 9, 2007, 5:34:25 PM2/9/07

to Zach Brown

Ok, here's another entry in this discussion.

This is a *really* small patch. Yes, it adds 174 lines, and yes it's
actually x86 (32-bit) only, but about half of it is totally generic, and
*all* of it is almost ludicrously simple.

There's no new assembly language. The one-liner addition to
"syscall_table.S" is just adding the system call entry stub. It's all in
C, and none of it is even hard to understand.

It's architecture-specific, because different architectures do the whole
"fork()" entrypath differently, and this is basically a "delayed fork()",
not really an AIO thing at all.

So what this does, very simply is:

- on system call entry just save away the pt_regs pointer needed to do a
fork (on some architectures, this means that you need to use a longer
system call entry that saves off all registers - on x86 even that isn't
an issue)

- save that away as a magic cookie in the task structure

- do the system call

- IF the system call blocks, we call the architecture-specific
"schedule_async()" function before we even get any scheduler locks, and
it can just do a fork() at that time, and let the *child* return to the
original user space. The process that already started doing the system
call will just continue to do the system call.

- when the system call is done, we check to see if it was done totally
synchronously or not. If we ended up doing the clone(), we just exit
the new thread.

Now, I agree that this is a bit ugly in some of the details: in
particular, it means that if the system call blocks, we will literally
return as a *different* thread to user space. If you care, you shouldn't
use this interface, or come up with some way to make it work nicely (doing
it this way meant that I could just re-use all the clone/fork code as-is).

Also, it actually does take the hit of creating a full new thread. We
could optimize that a bit. But at least the cached case has basically
*zero* overhead: we literally end up doing just a few extra CPU
instructions to copy the arguments around etc, but no locked cycles, no
memory allocations, no *nothing*.

So I actually like this, because it means that while we slow down real IO,
we don't slow down the cached cases at all.

Final warning: I didn't do any cancel/wait crud. It doesn't even return
the thread ID as it is now. And I only hooked up "stat64()" as an exmple.
So this really is just a total toy. But it's kind of funny how simple it
was, once I started thinking about how I could do this in some clever way.

I even added comments, so a lot of the few new added lines aren't even
code!

Linus

---

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index c641056..0909724 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -698,6 +698,71 @@ struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct tas
return prev_p;
}

+/*
+ * This gets called when an async event causes a schedule.
+ * We should try to
+ *
+ * (a) create a new thread
+ * (b) within that new thread, return to the original
+ * user mode call-site.
+ * (c) clear the async event flag, since it is now no
+ * longer relevant.
+ *
+ * If anything fails (a resource issue etc), we just do
+ * the async system call as a normal synchronous event!
+ */
+#define CLONE_ALL (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PARENT | CLONE_THREAD)
+#define FAILED_CLONE ((struct pt_regs *)1)
+void schedule_async(void)
+{
+ struct pt_regs *regs = current->async_cookie;
+ int retval;
+
+ if (regs == FAILED_CLONE)
+ return;
+
+ current->async_cookie = NULL;
+ /*
+ * This is magic. The child will return through "ret_from_fork()" to
+ * where the original thread started it all. It's not the same thread
+ * any more, and we don't much care. The "real thread" has now become
+ * the async worker thread, and will exit once the async work is done.
+ */
+ retval = do_fork(CLONE_ALL, regs->esp, regs, 0, NULL, NULL);
+
+ /*
+ * If it failed, we could just restore the async_cookie and try again
+ * on the next scheduling event.
+ *
+ * But it's just better to set it to some magic value to indicate
+ * "do not try this again". If it failed once, we shouldn't waste
+ * time trying it over and over again.
+ *
+ * Any non-NULL value will tell "do_async()" at the end that it was
+ * done "synchronously".
+ */
+ if (retval < 0)
+ current->async_cookie = FAILED_CLONE;
+}
+
+asmlinkage int sys_async(struct pt_regs regs)
+{
+ void *async_cookie;
+ unsigned long syscall, flags;
+ int __user *status;
+ unsigned long __user *user_args;
+
+ /* Pick out the do_async() arguments.. */
+ async_cookie = &regs;
+ syscall = regs.ebx;
+ flags = regs.ecx;
+ status = (int __user *) regs.edx;
+ user_args = (unsigned long __user *) regs.esi;
+
+ /* ..and call the generic helper routine */
+ return do_async(async_cookie, syscall, flags, status, user_args);
+}
+
asmlinkage int sys_fork(struct pt_regs regs)
{
return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 2697e92..647193c 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_async /* 320 */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4463735..e14b11b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -844,6 +844,13 @@ struct task_struct {

struct mm_struct *mm, *active_mm;

+ /*
+ * The scheduler uses this to determine if the current call is a
+ * standalone thread or just an async system call that hasn't
+ * had its real thread created yet.
+ */
+ void *async_cookie;
+
/* task state */
struct linux_binfmt *binfmt;
long exit_state;
@@ -1649,6 +1656,12 @@ extern int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls);

extern void normalize_rt_tasks(void);

+/* Async system call support */
+extern long do_async(void *async_cookie, unsigned int syscall, unsigned long flags,
+ int __user *status, unsigned long __user *user_args);
+extern void schedule_async(void);
+
+
#endif /* __KERNEL__ */

#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 14f4d45..13bda9f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
- hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+ hrtimer.o rwsem.o latency.o nsproxy.o srcu.o async.o

obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
diff --git a/kernel/async.c b/kernel/async.c
new file mode 100644
index 0000000..29b14f3
--- /dev/null
+++ b/kernel/async.c
@@ -0,0 +1,71 @@
+/*
+ * kernel/async.c
+ *
+ * Create a light-weight kernel-level thread.
+ */
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+
+#include <asm/uaccess.h>
+
+/* Fake "generic" system call pointer type */
+typedef asmlinkage long (*master_syscall_t)(unsigned long arg, ...);
+
+#define ASYNC_SYSCALL(syscall, param) \
+ { (master_syscall_t) (syscall), (param) }
+
+static struct async_call {
+ master_syscall_t fn;
+ int args;
+} call_descriptor[] = {
+ ASYNC_SYSCALL(sys_stat64, 2),
+};
+
+long do_async(
+ void *async_cookie,
+ unsigned int syscall,
+ unsigned long flags,
+ int __user *status,
+ unsigned long __user *user_args)
+{
+ int ret, size;
+ struct async_call *desc;
+ unsigned long args[6];
+
+ if (syscall >= ARRAY_SIZE(call_descriptor))
+ return -EINVAL;
+
+ desc = call_descriptor + syscall;
+ if (!desc->fn)
+ return -EINVAL;
+
+ if (desc->args > ARRAY_SIZE(args))
+ return -EINVAL;
+
+ size = sizeof(unsigned long)*desc->args;
+ if (copy_from_user(args, user_args, size))
+ return -EFAULT;
+
+ /* We don't nest async calls! */
+ if (current->async_cookie)
+ return -EINVAL;
+ current->async_cookie = async_cookie;
+
+ ret = desc->fn(args[0], args[1], args[2], args[3], args[4], args[5]);
+ put_user(ret, status);
+
+ /*
+ * Did we end up doing part of the work in a separate thread?
+ *
+ * If so, the async thread-creation already returned in the
+ * origial parent, and cleared out the async_cookie. We're
+ * now just in the worker thread, and should just exit. Our
+ * job here is done.
+ */
+ if (!current->async_cookie)
+ do_exit(0);
+
+ /* We did it synchronously - return 0 */
+ current->async_cookie = 0;
+ return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index d57118d..6f38c46 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1413,6 +1413,18 @@ long do_fork(unsigned long clone_flags,
return nr;
}

+/*
+ * Architectures that don't have async support get this
+ * dummy async thread scheduler callback.
+ *
+ * They had better not set task->async_cookie in the
+ * first place, so this should never get called!
+ */
+void __attribute__ ((weak)) schedule_async(void)
+{
+ BUG();
+}
+
#ifndef ARCH_MIN_MMSTRUCT_ALIGN
#define ARCH_MIN_MMSTRUCT_ALIGN 0
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index cca93cc..cc73dee 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3436,6 +3436,17 @@ asmlinkage void __sched schedule(void)
}
profile_hit(SCHED_PROFILING, __builtin_return_address(0));

+ /* Are we running within an async system call? */
+ if (unlikely(current->async_cookie)) {
+ /*
+ * If so, we now try to start a new thread for it, but
+ * not for a preemption event or a scheduler timeout
+ * triggering!
+ */
+ if (!(preempt_count() & PREEMPT_ACTIVE) && current->state != TASK_RUNNING)
+ schedule_async();
+ }
+
need_resched:
preempt_disable();
prev = current;

Davide Libenzi

unread,

Feb 9, 2007, 6:12:32 PM2/9/07

to Linus Torvalds

On Fri, 9 Feb 2007, Linus Torvalds wrote:

>
> Ok, here's another entry in this discussion.

That's another way to do it. But you end up creating/destroying a new
thread for every request. May be performing just fine.
Another, even simpler way IMO, is to just have a plain per-task kthread
pool, and a queue. An async_submit() drops a request in the queue, and
wakes the requests queue-head where the kthreads are sleeping. One kthread
picks up the request, service it, drops a result in the result queue, and
wakes results queue-head (where async_fetch() are sleeping). Cancellation
is not problem here (by the mean of sending a signal to the service
kthread). Also, no problem with arch-dependent code. This is a 1:1
match of what my userspace implementation does.
Of course, no hot-path optimization are performed here, and you need a few
context switches more than necessary.
Let's have Zach (Ingo support to Zach would be great) play with the
optimized version, and then we can maybe bench the three to see if the
more complex code that the optimized version require, gets a pay-back from
the performance side.

/me thinks it likely will

- Davide

Linus Torvalds

unread,

Feb 9, 2007, 6:37:46 PM2/9/07

to Davide Libenzi

On Fri, 9 Feb 2007, Davide Libenzi wrote:
>
> That's another way to do it. But you end up creating/destroying a new
> thread for every request. May be performing just fine.

Well, I actually wanted to add a special CLONE_ASYNC flag, because I
think we could do it better if we know it's a particularly limited special
case. But that's really just a "small implementation detail", and I don't
know how big a deal it is. I didn't want to obscure the basic idea with
anything bigger.

I agree that the create/destroy is a big overhead, but at least it's now
only done when we actually end up doing some IO (and _after_ we've started
the IO, of course - that's when we block), so compared to doing it up
front, I'm hoping that it's not actually that horrid.

The "fork-like" approach also means that it's very flexible. It's not
really even limited to doing simple system calls any more: you *could*,
for example, decide that since you already have the thread, and now that
it's asynchronous, you'd actually return to user space (to let user space
"complete" whatever asynchronous action it wanted to complete).

> Another, even simpler way IMO, is to just have a plain per-task kthread
> pool, and a queue.

Yes, that is actually quite doable with basically the same interface. It's
literally a "small decision" inside of "schedule_async()" on how it
actually would want to handle the case of "hey, we now have concurrent
work to be done".

But I actually don't think a per-task kthread pool is necessarily a good
idea. If a thread pool works for this, then it should have worked for
regular thread create/destroy loads too - ie there really is little reason
to special-case the "async system call" case.

NOTE! I'm also not at all sure that we actually want to waste real threads
on this. My patch is in no way meant to be an "exclusive alternative" to
fibrils. Quite the reverse, actually: I _like_ those synchronous fibrils,
but I didn't like how Zach did the overhead of creating them up-front,
because I really would like the cached case to be totally *synchronous*.

So I wrote my patch with a "schedule_async()" implementation that just
creates a full-sized thread, but I actually wanted very much to try to
make it use fibrils that are allocated on-demand too. I was just too lazy.

So the patch is really meant as a "ok, this is how easy it is to make the
thread allocation be 'on-demand' instead of 'up-front'". The actual
_policy_ on how thread allocation is done isn't even interesting to me, to
some degree. I think Zack's fibrils would work fine, a thread pool would
work fine, and just the silly outright "new thread for everything" that
the example patch actually used may also possibly work well enough.

It's one reason I liked my patch. It was not only small and simple, it
really is very flexible, I think. It's also totally independent on how
you actually end up _executing_ the async requests.

(In fact, you could easily make it a config option whether you support any
asynchronous behaviour AT ALL. The "async()" system call might still be
there, but it would just return "0" all the time, and do the actual work
synchronously).

Linus

Eric Dumazet

unread,

Feb 9, 2007, 7:06:56 PM2/9/07

to Linus Torvalds

Linus Torvalds a écrit :

> Ok, here's another entry in this discussion.

>

> - IF the system call blocks, we call the architecture-specific
> "schedule_async()" function before we even get any scheduler locks, and
> it can just do a fork() at that time, and let the *child* return to the
> original user space. The process that already started doing the system
> call will just continue to do the system call.

Well, I guess if the original program was mono-threaded, and syscall used
fget_light(), we might have a problem here if the child try a close(). So you
may have to disable fget_light() magic if async call is the originator of the
syscall.

Eric

Linus Torvalds

unread,

Feb 9, 2007, 7:15:42 PM2/9/07

to Eric Dumazet

On Sat, 10 Feb 2007, Eric Dumazet wrote:
>
> Well, I guess if the original program was mono-threaded, and syscall used
> fget_light(), we might have a problem here if the child try a close(). So you
> may have to disable fget_light() magic if async call is the originator of the
> syscall.

Yes. All the issues that I already brought up with Zach's patches are
still there. This doesn't really change any of them. Any optimization that
checks for "am I single-threaded" will need to be aware of pending and
running async things.

With my patch, any _running_ async things will always be seen as normal
clones, but the pending ones won't. So you'd need to effectively change
anything that looks like

if (atomic_read(&current->mm->count) == 1)
.. do some simplified version ..

into

if (!current->async_cookie && atomic_read(..) == 1)
.. do the simplified thing ..

to make it safe.

I think we only do it for fget_light and some VM TLB simplification, so it
shouldn't be a big burden to check.

Side note: the real issues still remain. The interfaces, and the
performance testing.

Linus

bert hubert

unread,

Feb 10, 2007, 5:47:52 AM2/10/07

to Linus Torvalds

On Fri, Feb 09, 2007 at 02:33:01PM -0800, Linus Torvalds wrote:

> - IF the system call blocks, we call the architecture-specific
> "schedule_async()" function before we even get any scheduler locks, and
> it can just do a fork() at that time, and let the *child* return to the
> original user space. The process that already started doing the system
> call will just continue to do the system call.

Ah - cool. The average time we have to wait is probably far greater than the
fork overhead, microseconds versus milliseconds.

However, and there probably is a good reason for this, why isn't it possible
to do it the other way around, and have the *child* do the work and the
original return to userspace?

Would confuse me at lot less in any case.

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

Davide Libenzi

unread,

Feb 10, 2007, 1:19:44 PM2/10/07

to bert hubert

On Sat, 10 Feb 2007, bert hubert wrote:

> On Fri, Feb 09, 2007 at 02:33:01PM -0800, Linus Torvalds wrote:
>
> > - IF the system call blocks, we call the architecture-specific
> > "schedule_async()" function before we even get any scheduler locks, and
> > it can just do a fork() at that time, and let the *child* return to the
> > original user space. The process that already started doing the system
> > call will just continue to do the system call.
>
> Ah - cool. The average time we have to wait is probably far greater than the
> fork overhead, microseconds versus milliseconds.
>
> However, and there probably is a good reason for this, why isn't it possible
> to do it the other way around, and have the *child* do the work and the
> original return to userspace?

If the parent is going to schedule(), someone above has already dropped
the parent's task_struct inside a wait queue, so the *parent* will be the
wakeup target [1].
Linus take to the generic AIO is a neat one, but IMO continuos fork/exits
are going to be expensive. Even if the task is going to sleep, that does
not mean that the parent (well, in Linus case, the child actually) does
not have more stuff to feed to async(). IMO the frequency of AIO
submission and retrieval can get pretty high (hence the frequency of
fork/exit), and there might be a price to pay for it at the end.
IMO one solution, following the non-fibril way, may be:

- Keep a pool of per-process threads (a per-process pool already has stuff
like "files" already correctly setup, just for example - no need to
teach everywhere around the kernel of the "async" special case)

- When a schedule happen on the submission thread, we get a thread
(task_struct really) of the available pool

- We setup the submission (now going to sleep) thread return IP to an
async_complete (or whatever name) stub. This will drop a result in a
queue, and wake the async_wait (or whatever name) wait queue head

- We may want to swap at least the PID (signals, ...?) between the two, so
even if we're re-emrging with a new task_struct, the TID will be the same

- We make the "returning" thread to come back to userspace through some
special helper ala ret_from_fork (ret_from_async ?)

- We want also to keep a record (hash?) of userspace cookies and threads
currently servicing them, so that we can implement cancel (send signal)

Open issues:

- What if the pool becomes empty since all thread are stuck under schedule?
o Grow the pool (and delay-shrink at quiter times)?
o Make the caller really sleep?
o Fall back in queue-request mode?

- Look at the Devil hiding in the details and showing up many times during
the process

Yup, I can see Zach having a lot of fun with it ;)

[1] Well, you could add a list_head to the task_struct, and teach the
add-to-waitqueue to drop a reference to all the wait queue entries
hosting the task_struct. Then walk&fix (likely be only one entry) when
you swap the submission thread context (thread_info, per_call stuff, ...)
over a service thread task_struct. At that point you can re-emerge
with the same task_struct. Pretty nasty though.

- Davide

Davide Libenzi

unread,

Feb 10, 2007, 1:45:56 PM2/10/07

to Linus Torvalds

On Fri, 9 Feb 2007, Linus Torvalds wrote:

> > Another, even simpler way IMO, is to just have a plain per-task kthread
> > pool, and a queue.
>
> Yes, that is actually quite doable with basically the same interface. It's
> literally a "small decision" inside of "schedule_async()" on how it
> actually would want to handle the case of "hey, we now have concurrent
> work to be done".

For the queue approach, I meant the async_submit() to simply add the
request (cookie, syscall number and params) inside queue, and not trying
to execute the syscall. Once you're inside schedule, "stuff" has already
partially happened, and you cannot have the same request re-initiated by a
different thread.

> But I actually don't think a per-task kthread pool is necessarily a good
> idea. If a thread pool works for this, then it should have worked for
> regular thread create/destroy loads too - ie there really is little reason
> to special-case the "async system call" case.

A per-process thread pool already has things correctly inherited, so we
don't need to add special "adopt" routines for things like "files" and such.

> NOTE! I'm also not at all sure that we actually want to waste real threads
> on this. My patch is in no way meant to be an "exclusive alternative" to
> fibrils. Quite the reverse, actually: I _like_ those synchronous fibrils,
> but I didn't like how Zach did the overhead of creating them up-front,
> because I really would like the cached case to be totally *synchronous*.

I'm not advocating threads against fibrils. The use of threads may make
things easier under certain POVs (less ad-hoc changes into mainline). The
ideal would be to have a look at both and see Pros&Cons under different
POVs (performance, code impact, etc..).

- Davide

Linus Torvalds

unread,

Feb 10, 2007, 2:03:45 PM2/10/07

to Davide Libenzi

On Sat, 10 Feb 2007, Davide Libenzi wrote:
>
> For the queue approach, I meant the async_submit() to simply add the
> request (cookie, syscall number and params) inside queue, and not trying
> to execute the syscall. Once you're inside schedule, "stuff" has already
> partially happened, and you cannot have the same request re-initiated by a
> different thread.

But that makes it impossible to do things synchronously, which I think is
a *major* mistake.

The whole (and really _only_) point of my patch was really the whole
"synchronous call" part. I'm personally of the opinion that if you cannot
handle the cached case as fast as just doing the system call directly,
then the whole thing is almost pointless.

Things that take a long time we already have good methods for. "epoll" and
"kevent" are always going to be the best way to handle the "we have ten
thousand events outstanding". There simply isn't any question about it.
You can *never* handle ten thousand long-running events efficiently with
threads - even if you ignore all the CPU overhead, you're going to have a
much bigger memory (and thus *cache*) footprint.

So anybody who wants to use AIO to do those kinds of long-running async
things is out to lunch. It's not the point.

You use the AIO stuff for things that you *expect* to be almost
instantaneous. Even if you actually start ten thousand IO's in one go, and
they all do IO, you would hopefully expect that the first ones start
completingn before you've even submitted them all. If that's not true,
then you'd just be better off using epoll.

Also, if you can make the cached case as fast as just doing the direct
system call itself, that just changes the whole equation for using it. You
suddenly don't have any downsides. You can start using the async
interfaces in places you simply couldn't otherwise, or in places where
you'd have to do a lot of performance tuning ("it makes sense under this
particular load because I actually need to get 100 IO's going at the same
time to saturate the disk").

So the "do cached things synchronously" really is important. Just because
that makes a whole complicated optimization question go away: you
basically *always* win for normal stuff.

Linus

Linus Torvalds

unread,

Feb 10, 2007, 2:36:46 PM2/10/07

to Davide Libenzi

On Sat, 10 Feb 2007, Linus Torvalds wrote:
>
> But that makes it impossible to do things synchronously, which I think is
> a *major* mistake.
>
> The whole (and really _only_) point of my patch was really the whole
> "synchronous call" part. I'm personally of the opinion that if you cannot
> handle the cached case as fast as just doing the system call directly,
> then the whole thing is almost pointless.

Side note: one of the nice things with "do it synchronously if you can" is
that it also likely would allow us to do a reasonable job at "self-tuning"
things in the kernel. With my async approach, we get notified only when we
block, so it'seasy (for example) to have a simple counter that
automatically adapts to the number of outstanding IO's, in a way that it's
_not_ if we do things at submit time when we won't even know whether it
will block or not.

As a trivial example: we actually see what *kind* of blocking it is. Is it
blocking interruptibly ("long wait") or uninterruptibly ("disk wait")? So
by the time schedule_async() is called, we actually have some more
information about the situation, and we can even do different things
(possibly based on just hints that the user and/or system maintainer gives
us; ie you can tune the behaviour from _outside_ by setting different
limits, for example).

Davide Libenzi

unread,

Feb 10, 2007, 4:00:46 PM2/10/07

to Linus Torvalds

On Sat, 10 Feb 2007, Linus Torvalds wrote:

> On Sat, 10 Feb 2007, Davide Libenzi wrote:
> >
> > For the queue approach, I meant the async_submit() to simply add the
> > request (cookie, syscall number and params) inside queue, and not trying
> > to execute the syscall. Once you're inside schedule, "stuff" has already
> > partially happened, and you cannot have the same request re-initiated by a
> > different thread.
>
> But that makes it impossible to do things synchronously, which I think is
> a *major* mistake.

Yes! That's what I said when I described the method. No synco fast-paths.
At that point you could implement the full-queued method in userspace.

> The whole (and really _only_) point of my patch was really the whole
> "synchronous call" part. I'm personally of the opinion that if you cannot
> handle the cached case as fast as just doing the system call directly,
> then the whole thing is almost pointless.
>
> Things that take a long time we already have good methods for. "epoll" and
> "kevent" are always going to be the best way to handle the "we have ten
> thousand events outstanding". There simply isn't any question about it.
> You can *never* handle ten thousand long-running events efficiently with
> threads - even if you ignore all the CPU overhead, you're going to have a
> much bigger memory (and thus *cache*) footprint.

Think about the old-fashioned web server, using epoll to handle thousands
of connections. You'll be hosting an epoll_wait() over an async
thread/fibril. Now a burst of 500 connections becomes suddendly "hot", and
you start looping through those 500 hot connections trying to handle them.
You'll need to stat/open/read (let's assume a trivial, non-cached HTTP
server) from the file pointed by the URL's doc, and those better be
handled in async fashion otherwise you'll starve the others and pay huge
time in performance. You can multiplex using a state machine or coroutines
for example. Using coroutines your epoll dispatching loop end up doing
something like:

struct conn {
coroutine_t co;
int res;
int skfd;
...
};

void events_dispatch(struct epoll_event *events, int n) {

for (i = 0; i < n; i++) {
struct conn *c = (struct conn *) events[i].data;
co_call(c->co);
}
}

Note that co_call() will make the coroutine to re-emerge from the last
co_resume() they issued.
Your code doesn't not need to be coroutine/async aware, once you wrap the
possibly-blocking calls with like:

int my_stat(struct conn *c, const char *path, struct stat *buf) {
/* "c" is the cookie */
if ((c->res = async_submit(c, __NR_stat, path, buf)) == EASYNC)
/* co_resume() will bounce back to the scheduler loop */
co_resume();
return c->res;
}

Now, the *main* loop will be the async_wait() driven one:

struct async_result {
void *cookie;
long result;
};

n = async_wait(ares, nares);
for (i = 0; i < n; i++) {
if (ares[i].cookie == epoll_special_cookie)
events_dispatch(...);
else {
struct conn *c = (struct conn *) ares[i].cookie;
c->res = ares[i].result;
co_call(c->co);
}
}

Many of the async submission will complete in a synco way, but many of
them will require reschedule and service-thread attention. Of these 500
burst, you can expect 100 or more to require async service. Right away,
not sometime later. According to Oracle, 1000 or more requests, 90% of
which can be expected to block, can be fired at a time.
It is true that we need to have a fast synco path, but it is also true
that we must not suck in the non-synco path, because the submitting thread
has something else to do then simply issue one async request (it like
have *many* of them to push).
There we go, I broke my "No more than 20 lines" rule again :)

- Davide

David Miller

unread,

Feb 10, 2007, 7:56:33 PM2/10/07

to torv...@linux-foundation.org

From: Linus Torvalds <torv...@linux-foundation.org>
Date: Fri, 9 Feb 2007 14:33:01 -0800 (PST)

> So I actually like this, because it means that while we slow down
> real IO, we don't slow down the cached cases at all.

Even if you have everything, every page, every log file, in the page
cache, everything talking over the network wants to block.

Will you create a thread every time tcp_sendmsg() hits the send queue
limits?

Add some logging to tcp_sendmsg() on a busy web server if you do not
believe me :-)

The idea is probably excellent for operations on real files, but it's
going to stink badly for networking stuff.

Linus Torvalds

unread,

Feb 10, 2007, 9:51:58 PM2/10/07

to David Miller

On Sat, 10 Feb 2007, David Miller wrote:
>
> Even if you have everything, every page, every log file, in the page
> cache, everything talking over the network wants to block.
>
> Will you create a thread every time tcp_sendmsg() hits the send queue
> limits?

No. You use epoll() for those.

> The idea is probably excellent for operations on real files, but it's
> going to stink badly for networking stuff.

And I actually talked about that in one of the emails already. There is no
way you can beat an event-based thing for things that _are_ event-based.
That means mainly networking.

For things that aren't event-based, but based on real IO (ie filesystems
etc), event models *suck*. They suck because the code isn't amenable to it
in the first place (ie anybody who thinks that a filesystem is like a
network stack and can be done as a state machine with packets is just
crazy!).

So you would be crazy to makea web server that uses this to handle _all_
outstanding IO. Network connections are often slow, and you can have tens
of thousands outstanding (and some may be outstanding for hours until they
time out, if ever). But that's the whole point: you can easily mix the
two, as given in several examples already (ie you can easily make the main
loop itself basically do just

for (;;) {
async(epoll); /* wait for networking events */
async_wait(); /* wait for epoll _or_ any of the outstanding file IO events */
handle_completed_events();
}

and it's actually a lot better than an event model, exactly because now
you can handle events _and_ non-events well (a pure event model requires
that _everything_ be an event, which works fine for some things, but works
really badly for other things).

There's a reason why a lot of UNIX system calls are blocking: they just
don't make sense as event models, because there is no sensible half-way
point that you can keep track of (filename lookup is the most common
example).

Linus

Alan

unread,

Feb 11, 2007, 7:23:35 PM2/11/07

to Linus Torvalds

> I think we only do it for fget_light and some VM TLB simplification, so it
> shouldn't be a big burden to check.

And all the permission management stuff that relies on one thread not
being able to manipulate the uid/gid of another to race security checks.

Alan

James Antill

unread,

Feb 14, 2007, 12:10:47 PM2/14/07

to linux-...@vger.kernel.org

On Sat, 10 Feb 2007 18:49:56 -0800, Linus Torvalds wrote:

> And I actually talked about that in one of the emails already. There is no
> way you can beat an event-based thing for things that _are_ event-based.
> That means mainly networking.
>
> For things that aren't event-based, but based on real IO (ie filesystems
> etc), event models *suck*. They suck because the code isn't amenable to it
> in the first place (ie anybody who thinks that a filesystem is like a
> network stack and can be done as a state machine with packets is just
> crazy!).
>
> So you would be crazy to makea web server that uses this to handle _all_
> outstanding IO. Network connections are often slow, and you can have tens
> of thousands outstanding (and some may be outstanding for hours until they
> time out, if ever). But that's the whole point: you can easily mix the
> two, as given in several examples already (ie you can easily make the main
> loop itself basically do just

I don't see any replies to this, so here's my 2¢. The simple model of
what a webserver does when sending static data is:

1. local_disk_fd = open()
2. fstat(local_disk_fd)
3. TCP_CORK on
4. send_headers();
5. LOOP
5a. sendfile(network_con_fd, local_disk_fd)
5b. epoll(network_con_fd)
6. TCP_CORK off

..and here's my personal plan (again, somewhat simplified), which I
think will be "better":

7. helper_proc_pipe_fd = DO open() + fstat()
8. read_stat_event_data(helper_proc_pipe_fd)
9. TCP_CORK on network_con_fd
10. send_headers(network_con_fd);
11. LOOP
11a. splice(helper_proc_pipe_fd, network_con_fd)
11b. epoll(network_con_fd && helper_proc_pipe_fd)
12. TCP_CORK off network_con_fd

..where the "helper proc" is doing splice() from disk to the pipe, on the
other end. This, at least in theory, gives you an async webserver and zero
copy disk to network[1]. My assumption is that Evgeniy's aio_sendfile()
could fit into that model pretty easily, and would be faster.

However, from what you've said above you're only trying to help #1 and #2
(which are likely to be cached in the app. anyway) and apps.
that want to sendfile() to the network either do horrible hacks like
lighttpd's "AIO"[2], do a read+write copy loop with AIO or don't use AIO.

[1] And allows things like IO limiting, which aio_sendfile() won't.

[2] http://illiterat.livejournal.com/2989.html

--
James Antill -- ja...@and.org
http://www.and.org/and-httpd/ -- $2,000 security guarantee
http://www.and.org/vstr/