Account Options

  1. Sign in
The old Google Groups will be going away soon.
Switch to the new Google Groups.
Google Groups Home
« Groups Home
Generic AIO by scheduling stacks
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 1 - 25 of 59 - Collapse all  -  Translate all to Translated (View all originals)   Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Zach Brown  
View profile  
 More options Jan 30 2007, 4:41 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Tue, 30 Jan 2007 21:41:28 UTC
Local: Tues, Jan 30 2007 4:41 pm
Subject: [PATCH 0 of 4] Generic AIO by scheduling stacks

This very rough patch series introduces a different way to provide AIO support
for system calls.

Right now to provide AIO support for a system call you have to express your
interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
translate this into some call path in the kernel that passes in an iocb, and
then update your code path implement with either completion-based (EIOCBQUEUED)
or retry-based (EIOCBRETRY) AIO with the iocb.

This patch series changes this by moving the complexity into generic code such
that a system call handler would provide AIO support in exactly the same way
that it supports a synchronous call.  It does this by letting a task have
multiple stacks executing system calls in the kernel.  Stacks are switched in
schedule() as they block and are made runnable.

First, let's introduce the term 'fibril'.  It means small fiber or thread.  It
represents the stack and the bits of data which manage its scheduling under the
task_struct.  There's a 1:1:1 relationship between a fibril, the stack it's
managing, and the thread_struct it's operating under.  They're static for the
fibril's lifetime.  I came to choosing a funny new term after a few iterations
of trying to use existing terms (stack, call, thread, task, path, fiber)
without going insane.  They were just too confusing to use with a clear
conscious.

So, let's illustrate the changes by walking through the execution of an asys
call.  Let's say sys_nanosleep().  Then we'll talk about the trade-offs.

Maybe it'd make sense to walk through the patches in another window while
reading the commentary.  I'm sorry if this seems tedious, but this is
non-trivial stuff.  I want to get the point across.

We start in sys_asys_submit().  It allocates a fibril for this executing
submission syscall and hangs it off the task_struct.  This lets this submission
fibril be scheduled along with the later asys system calls themselves.

sys_asys_submit() has arguments which specify system call numbers and
arguments.  For each of these calls the submission syscall allocates a fibril
and constructs an initial stack.  The fibril has eip pointing to the system
call handler and esp pointing to the new stack such that when the handler is
called its arguments will be on the stack.  The stack is also constructed so
that when the handler returns it jumps to a function which queues the return
code for collection in userspace.

sys_asys_submit() asks the scheduler to put these new fibrils on the simple run
queue in the task_struct.  It's just a list_head for now.  It then calls
schedule().

After we've gotten the run queue lock in the scheduler we notice that there are
fibrils on the task_struct's run queue.  Instead of continuing with schedule(),
then, we switch fibrils.  The old submission fibril will still be in schedule()
and we'll start executing sys_nanosleep() in the context of the submission
task_struct.

The specific switching mechanics of this implementation rely on the notion of
tracking a stack as a full thread_info pointer.  To make the switch we transfer
the non-stack bits of the thread_info from the old fibril's ti to the new
fibril's ti.  We update the book keeping in the task_struct to
consider the new thread_info as the current thread_info for the task.  Like so:

        *next->ti = *ti;
        *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);

        current->thread_info = next->ti;
        current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
        current->fibril = next;
        current->state = next->state;
        current->per_call = next->per_call;

Yeah, messy.  I'm interested in aggressive feedback on how to do this sanely.
Especially from the perspective of worrying about all the archs.

Did everyone catch that "per_call" thing there?  That's to switch members of
task_struct which are local to a specific call.  link_count, journal_info, that
sort of thing.  More on that as we talk about the costs later.

After the switch we're executing in sys_nanosleep().  Eventually it gets to the
point where it's building its timer which will wake it after the sleep
interval.  Currently it would store a bare task_struct reference for
wake_up_process().  Instead we introduce a helper which returns a cookie that
is given to a specific wake_up_*() variant.  Like so:

-       sl->task = task;
+       sl->wake_target = task_wake_target(task);

It then marks itself as TASK_INTERRUPTIBLE and calls schedule().  schedule()
notices that we have another fibril on the run queue.  It's the submission
fibril that we switched from earlier.  As we switched we saw that it was still
TASK_RUNNING so we put it on the run queue as we switched.  We now switch back to
the submission fibril, leaving the sys_nanosleep() fibril sleeping.  Let's say
the submission task returns to userspace which then immediately calls
sys_asys_await_completion().  It's an easy case :).  It goes to sleep, there
are no running fibrils and the schedule() path really puts the task to sleep.

Eventually the timer fires and the hrtimer code path wakes the fibril:

-       if (task)
-               wake_up_process(task);
+       if (wake_target)
+               wake_up_target(wake_target);

We've doctored try_to_wake_up() to be able to tell if its argument is a
task_struct or one of these fibril targets.  In the fibril case it calls
try_to_wake_up_fibril().  It notices that the target fibril does need to be
woken and sets it TASK_RUNNING.  It notices that it it's current in the task so
it puts the fibril on the task's fibril run queue and wakes the task.  There's
grossness here.  It needs the task to come through schedule() again so that it
can find the new runnable fibril instead of continuing to execute its current
fibril.  To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.
This seems to work, but there are some pretty terrifying interactions between
schedule, wake-up, and the maintenance of fibril->state and task->state that
need to be sorted out.

Remember our task was sleeping in asys_await_completion()?  The task was woken
by the fibril wake-up path, but it's still executing the
asys_await_completion() fibril.  It comes out of schedule() and sees
TIF_NEED_RESCHED and comes back through the top of schedule().  This time it
finds the runnable sys_nanosleep() fibril and switches to it.  sys_nanosleep()
runs to completion and it returns which, because of the way we built its stack,
calls asys_teardown_stack().

asys_teardown_stack() takes the return code and puts it off in a list for
asys_await_completion().  It wakes a wait_queue to notify waiters of pending
completions.  In so doing it wakes up the asys_await_completion() fibril that
was sleeping in our task.

Then it has to tear down the fibril for the call that just completed.  In the
current implementation the fibril struct is actually embedded in an "asys_call"
struct.  asys_teardown_stack() frees the asys_call struct, and so the fibril,
after having cleared current->fibril.  It then calls schedule().  Our
asys_await_completion() fibril is on the run queue so we switch to it.
Switching notices the null current->fibril that we're switching from and takes
that as a queue to mark the previous thread_info for freeing *after* the
switch.

After the switch we're in asys_await_completion().  We find the waiting return
code completion struct in the list that was left by asys_teardown_stack().  We
return it to userspace.

Phew.  OK, so what are the trade-offs here?  I'll start with the benefits for
obvious reasons :).

- With get AIO support for all syscalls.  Every single one.  (Well, please, no
asys sys_exit() :)).  Buffered IO, sendfile, recvmsg, poll, epoll, hardware
crypto ioctls, open, mmap, getdents, the entire splice API, etc.

- The syscall API does not change just because calls are being issued AIO,
particularly things that reference task_struct.  AIO sys_getpid() does what
you'd expect, signal masking, etc.  You don't have to worry about your AIO call
being secretly handled by some worker threads that get very different results
from current-> references.

- We wouldn't multiply testing and maintenance burden with separate AIO paths.
No is_sync_kiocb() testing and divergence between returning or calling
aio_complete().  No auditing to make sure that EIOCBRETRY only being returned
after any significant references of current->.  No worries about completion
racing from the submission return path and some aio_complete() being called
from another context.  In this scheme if your sync syscall path isn't broken,
your AIO path stands a great chance of working.

- The submission syscall won't block while handling submitted calls.  Not for
metadata IO, not for memory allocation, not for mutex contention, nothing.  

- AIO syscalls which *don't* block see very little overhead.  They'll allocate
stacks and juggle the run queue locks a little, but they'll execute in turn on
the submitting (cache-hot, presumably) processor.  There's room to optimize
this path, too, of course.

- We don't need to duplicate interfaces for each AIO interface we want to
support.  No iocb unions mirroring the syscall API, no magical AIO sys_
variants.

And the costs?  It's not free.

- The 800lb elephant in the room.  It uses a full stack per blocked operation.
I believe this is a reasonable price to pay for the flexibility of having *any*
call pending.  It rules out some loads which would want to keep *millions* of
operations pending, but I humbly submit that a load rarely takes that number of
concurrent ops to saturate a resource.  (think of it this way: we've gotten
this far by having to burn a full *task* to have *any* syscall pending.)  While
not optimal, it opens to door to a lot of functionality without having to
rewrite the kernel as a giant non-blocking state machine.

It should be noted that my very first try was to copy the used part of stacks
in to and out of one full allocated stack.  This uses less
...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Introduce aio system call submission and completion system calls" by Zach Brown
Zach Brown  
View profile  
 More options Jan 30 2007, 4:42 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Tue, 30 Jan 2007 21:42:27 UTC
Local: Tues, Jan 30 2007 4:42 pm
Subject: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
This finally does something useful with the notion of being able to schedule
stacks as fibrils under a task_struct.  Again, i386-specific and in need of
proper layering with archs.

sys_asys_submit() is added to let userspace submit asynchronous system calls.
It specifies the system call number and arguments.  A fibril is constructed for
each call.  Each starts with a stack which executes the given system call
handler and then returns to a function which records the return code of the
system call handler.  sys_asys_await_completion() then lets userspace collect
these results.

sys_asys_submit() is careful to construct a fibril for the submission syscall
itself so that it can return to userspace if the calls it is dispatching block.
If none of them block, however, they will have all been run hot in this
submitting task on this processor.

It allocates and runs each system call in turn.  It could certainly work in
batches to decrease locking overhead at the cost of increased peak memory
overhead for calls which don't end up blocking.

The complexity of a fully-formed submission and completion interface hasn't
been addressed.  Details like targeting explicit completion contexts, batching,
timeouts, signal delivery, and syscall-free submission and completion (now with
more rings!) can all be hashed out in some giant thread, no doubt.  I didn't
want them to cloud the basic mechanics being presented here.

diff -r 4ea674e8825e -r 5bdda0f7bef2 arch/i386/kernel/syscall_table.S
--- a/arch/i386/kernel/syscall_table.S  Mon Jan 29 15:46:47 2007 -0800
+++ b/arch/i386/kernel/syscall_table.S  Mon Jan 29 15:50:10 2007 -0800
@@ -319,3 +319,5 @@ ENTRY(sys_call_table)
        .long sys_move_pages
        .long sys_getcpu
        .long sys_epoll_pwait
+       .long sys_asys_submit           /* 320 */
+       .long sys_asys_await_completion
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h Mon Jan 29 15:46:47 2007 -0800
+++ b/include/asm-i386/unistd.h Mon Jan 29 15:50:10 2007 -0800
@@ -325,6 +325,8 @@
 #define __NR_move_pages                317
 #define __NR_getcpu            318
 #define __NR_epoll_pwait       319
+#define __NR_asys_submit       320
+#define __NR_asys_await_completion     321

 #ifdef __KERNEL__

diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/init_task.h
--- a/include/linux/init_task.h Mon Jan 29 15:46:47 2007 -0800
+++ b/include/linux/init_task.h Mon Jan 29 15:50:10 2007 -0800
@@ -148,6 +148,8 @@ extern struct group_info init_groups;
        .pi_lock        = SPIN_LOCK_UNLOCKED,                           \
        INIT_TRACE_IRQFLAGS                                             \
        INIT_LOCKDEP                                                    \
+       .asys_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tsk.asys_wait),      \
+       .asys_completed = LIST_HEAD_INIT(tsk.asys_completed),           \
 }

diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/sched.h
--- a/include/linux/sched.h     Mon Jan 29 15:46:47 2007 -0800
+++ b/include/linux/sched.h     Mon Jan 29 15:50:10 2007 -0800
@@ -1019,6 +1019,14 @@ struct task_struct {

        /* Protection of the PI data structures: */
        spinlock_t pi_lock;
+
+       /*
+        * XXX This is just a dummy that should be in a seperately managed
+        * context.  An explicit contexts lets asys calls be nested (!) and
+        * will let us provide the sys_io_*() API on top of asys.
+        */
+       struct list_head asys_completed;
+       wait_queue_head_t       asys_wait;

 #ifdef CONFIG_RT_MUTEXES
        /* PI waiters blocked on a rt_mutex held by this task */
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/Makefile
--- a/kernel/Makefile   Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/Makefile   Mon Jan 29 15:50:10 2007 -0800
@@ -8,7 +8,7 @@ obj-y     = sched.o fork.o exec_domain.o
            signal.o sys.o kmod.o workqueue.o pid.o \
            rcupdate.o extable.o params.o posix-timers.o \
            kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
-           hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+           hrtimer.o rwsem.o latency.o nsproxy.o srcu.o asys.o

 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/exit.c
--- a/kernel/exit.c     Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/exit.c     Mon Jan 29 15:50:10 2007 -0800
@@ -42,6 +42,7 @@
 #include <linux/audit.h> /* for audit_free() */
 #include <linux/resource.h>
 #include <linux/blkdev.h>
+#include <linux/asys.h>

 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -926,6 +927,8 @@ fastcall NORET_TYPE void do_exit(long co
        taskstats_exit(tsk, group_dead);

        exit_mm(tsk);
+
+       asys_task_exiting(tsk);

        if (group_dead)
                acct_process();
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/fork.c
--- a/kernel/fork.c     Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/fork.c     Mon Jan 29 15:50:10 2007 -0800
@@ -49,6 +49,7 @@
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
+#include <linux/asys.h>

 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -987,6 +988,8 @@ static struct task_struct *copy_process(
                goto fork_out;

        rt_mutex_init_task(p);
+
+       asys_init_task(p);

 #ifdef CONFIG_TRACE_IRQFLAGS
        DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/asys.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/include/linux/asys.h      Mon Jan 29 15:50:10 2007 -0800
@@ -0,0 +1,7 @@
+#ifndef _LINUX_ASYS_H
+#define _LINUX_ASYS_H
+
+void asys_task_exiting(struct task_struct *tsk);
+void asys_init_task(struct task_struct *tsk);
+
+#endif
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/asys.c
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/kernel/asys.c     Mon Jan 29 15:50:10 2007 -0800
@@ -0,0 +1,252 @@
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/asys.h>
+
+/* XXX */
+#include <asm/processor.h>
+
+/*
+ * system call and argument specification given to _submit from userspace
+ */
+struct asys_input {
+       int             syscall_nr;
+       unsigned long   cookie;
+       unsigned long   nr_args;
+       unsigned long   *args;
+};
+
+/*
+ * system call completion event given to userspace
+ * XXX: compat
+ */
+struct asys_completion {
+       long            return_code;
+       unsigned long   cookie;
+};
+
+/*
+ * This record of a completed async system call is kept around until it
+ * is collected by userspace.
+ */
+struct asys_result {
+       struct list_head        item;
+       struct asys_completion  comp;
+};
+
+/*
+ * This stack is built-up and handed to the scheduler to first process
+ * the system call.  It stores the progress of the call until the call returns
+ * and this structure is freed.
+ */
+struct asys_call {
+       struct asys_result      *result;
+       struct fibril           fibril;
+};
+
+void asys_init_task(struct task_struct *tsk)
+{
+       INIT_LIST_HEAD(&tsk->asys_completed);
+       init_waitqueue_head(&tsk->asys_wait);
+}
+
+void asys_task_exiting(struct task_struct *tsk)
+{
+       struct asys_result *res, *next;
+
+       list_for_each_entry_safe(res, next, &tsk->asys_completed, item)
+               kfree(res);
+
+       /*
+        * XXX this only works if tsk->fibril was allocated by
+        * sys_asys_submit(), not if its embedded in an asys_call.  This
+        * implies that we must forbid sys_exit in asys_submit.
+        */
+       if (tsk->fibril) {
+               BUG_ON(!list_empty(&tsk->fibril->run_list));
+               kfree(tsk->fibril);
+               tsk->fibril = NULL;
+       }
+}
+
+/*
+ * Initial asys call stacks are constructed such that this is called when
+ * the system call handler returns.  It records the return code from
+ * the handler in a completion event and frees data associated with the
+ * completed asys call.
+ *
+ * XXX we know that the x86 syscall handlers put their return code in eax and
+ * that regparm(3) here will take our rc argument from eax.
+ */
+static void fastcall NORET_TYPE asys_teardown_stack(long rc)
+{
+       struct asys_result *res;
+       struct asys_call *call;
+       struct fibril *fibril;
+
+       fibril = current->fibril;
+       call = container_of(fibril, struct asys_call, fibril);
+       res = call->result;
+       call->result = NULL;
+
+       res->comp.return_code = rc;
+       list_add_tail(&res->item, &current->asys_completed);
+       wake_up(&current->asys_wait);
+
+       /*
+        * We embedded the fibril in the call so that we could dereference it
+        * here without adding some tracking to the fibril.  We then free the
+        * call and fibril because we're done with them.
+        *
+        * The ti itself, though, is still in use.  It will only be freed once
+        * the scheduler switches away from it to another fibril.  It does
+        * that when it sees current->fibril assigned to NULL.
+        */
+       current->fibril = NULL;
+       BUG_ON(!list_empty(&fibril->run_list));
+       kfree(call);
+
+       /*
+        * XXX This is sloppy.  We "know" this is likely for now as the task
+        * with fibrils is only going to be in sys_asys_submit() or
+        * sys_asys_complete()
+        */
+       BUG_ON(list_empty(&current->runnable_fibrils));
+
+       schedule();
+       BUG();
+}
+
+asmlinkage long sys_asys_await_completion(struct asys_completion __user *comp)
+{
+       struct asys_result *res;
+       long ret;
+
+       ret = wait_event_interruptible(current->asys_wait,
+                                      !list_empty(&current->asys_completed));
+       if (ret)
+               goto out;
+
+       res = list_entry(current->asys_completed.next, struct asys_result,
+                         item);
+
+       /* XXX compat */
+       ret = copy_to_user(comp, &res->comp, sizeof(struct asys_completion));
+       if (ret) {
+               ret = -EFAULT;
+               goto out;
+       }
+
+       list_del(&res->item);
+       kfree(res);
+       ret = 1;
+
+out:
+       return ret;
+}
+
+/*
+ * This initializes a newly allocated fibril so that it can be handed to the
+ * scheduler.  The fibril is private to this code path at this point.
+ *
+ * XXX
+ *  - this is arch specific
+ *  - should maybe have a sched helper that uses INIT_PER_CALL_CHAIN
+ */
+extern unsigned long sys_call_table[]; /* XXX */
+static int asys_init_fibril(struct fibril *fibril, struct thread_info *ti,
+                           struct asys_input *inp)
+{
+       unsigned long *stack_bottom;
+
+       INIT_LIST_HEAD(&fibril->run_list);
+       fibril->ti = ti;
+
+       /* XXX sanity check syscall_nr */
+       fibril->eip = sys_call_table[inp->syscall_nr];
+       /* this mirrors copy_thread()'s use of task_pt_regs() */
+       fibril->esp = (unsigned long)thread_info_pt_regs(ti) -
+                       ((inp->nr_args + 1) * sizeof(long));
+
+       /*
+        * now setup the stack so that our syscall handler gets its arguments
+        *
...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Generic AIO by scheduling stacks" by Linus Torvalds
Linus Torvalds  
View profile  
 More options Jan 30 2007, 4:58 pm
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Tue, 30 Jan 2007 21:58:56 UTC
Local: Tues, Jan 30 2007 4:58 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 30 Jan 2007, Zach Brown wrote:

> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Yee-haa!

I looked at this approach a long time ago, and basically gave up because
it looked like too much work.

I heartily approve, although I only gave the actual patches a very cursory
glance. I think the approach is the proper one, but the devil is in the
details. It might be that the stack allocation overhead or some other
subtle fundamental problem ends up making this impractical in the end, but
I would _really_ like for this to basically go in.

One of the biggest issues I see is signalling. Your mention it, and I
think that's going to be one of the big issues.

It won't matter at all for a certain class of calls (a lot of the people
who want to do AIO really end up doing non-interruptible things, and
signalling is a non-issue), but not only is it going to matter for some
others, we will almost certainly want to have a way to not just signal a
task, but a single "fibril" (and let me say that I'm not convinced about
your naming, but I don't hate it either ;)

But from a quick overview of the patches, I really don't see anything
fundamentally wrong. It needs some error checking and some limiting (I
_really_ don't think we want a regular user starting a thousand fibrils
concurrently), but it actually looks much less invasive than I thought it
would be.

So yay! Consider me at least preliminarily very happy.

                Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Linus Torvalds  
View profile  
 More options Jan 30 2007, 5:23 pm
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Tue, 30 Jan 2007 22:23:45 UTC
Local: Tues, Jan 30 2007 5:23 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 30 Jan 2007, Linus Torvalds wrote:

> But from a quick overview of the patches, I really don't see anything
> fundamentally wrong. It needs some error checking and some limiting (I
> _really_ don't think we want a regular user starting a thousand fibrils
> concurrently), but it actually looks much less invasive than I thought it
> would be.

Side note (and maybe this was obvious to people already): I would suggest
that the "limiting" not be done the way fork() is limited ("return EAGAIN
if you go over a limit") but be done as a per-task counting semaphore
(down() on submit, up() on fibril exit).

So we should limit these to basically have some maximum concurrency
factor, but rather than consider it an error to go over it, we'd just cap
the concurrency by default, so that people can freely use asynchronous
interfaces without having to always worry about what happens if their
resources run out..

However, that also implies that we should probably add a "flags" parameter
to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or
something to tell the kernel to rather return EAGAIN than wait. Sometimes
you don't want to block just because you already have too much work.

Hmm?

                Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Zach Brown  
View profile  
 More options Jan 30 2007, 5:47 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Tue, 30 Jan 2007 22:47:58 UTC
Local: Tues, Jan 30 2007 5:47 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> I looked at this approach a long time ago, and basically gave up  
> because
> it looked like too much work.

Indeed, your mention of it in that thread.. a year ago?.. is what got  
this notion sitting in the back of my head.  I didn't like it at  
first, but it grew on me.

> I heartily approve, although I only gave the actual patches a very  
> cursory
> glance. I think the approach is the proper one, but the devil is in  
> the
> details. It might be that the stack allocation overhead or some other
> subtle fundamental problem ends up making this impractical in the  
> end, but
> I would _really_ like for this to basically go in.

As for efficiency and overhead, I hope to get some time with the team  
that work on the Giant Database Software Whose Name We Shall Not  
Speak.  That'll give us some non-trival loads to profile.

> It won't matter at all for a certain class of calls (a lot of the  
> people
> who want to do AIO really end up doing non-interruptible things, and
> signalling is a non-issue), but not only is it going to matter for  
> some
> others, we will almost certainly want to have a way to not just  
> signal a
> task, but a single "fibril" (and let me say that I'm not convinced  
> about
> your naming, but I don't hate it either ;)

Yeah, no doubt.  I'm wildly open to discussion here.  (and yeah, me  
either, but I don't care much about the name.  I got tired of  
qualifying overloaded uses of 'stack' or 'thread', that's all :)).

> But from a quick overview of the patches, I really don't see anything
> fundamentally wrong. It needs some error checking and some limiting (I
> _really_ don't think we want a regular user starting a thousand  
> fibrils
> concurrently), but it actually looks much less invasive than I  
> thought it
> would be.

I think we'll also want to flesh out the submission and completion  
interface so that we don't find ourselves frustrated with it in  
another 5 years.  What's there now is just scaffolding to support the  
interesting kernel-internal part.  No doubt the kevent thread will  
come into play here.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Zach Brown  
View profile  
 More options Jan 30 2007, 5:54 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Tue, 30 Jan 2007 22:54:11 UTC
Local: Tues, Jan 30 2007 5:54 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> So we should limit these to basically have some maximum concurrency
> factor, but rather than consider it an error to go over it, we'd  
> just cap
> the concurrency by default, so that people can freely use asynchronous
> interfaces without having to always worry about what happens if their
> resources run out..

Yeah, call it the socket transmit queue model :).  Maybe tuned by a  
ulimit?

I don't have very strong opinions abou the specific mechanics of  
limiting concurrent submissions, as long as they're there.  Some  
folks in Oracle complain about having one more thing to have to tune,  
but the alternative seems worse.

> However, that also implies that we should probably add a "flags"  
> parameter
> to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or
> something to tell the kernel to rather return EAGAIN than wait.  
> Sometimes
> you don't want to block just because you already have too much work.

EAGAIN or the initial number of submissions completed before the one  
that ran over the limit, perhaps.  Sure.  Nothing too controversial  
here :).   I have this kind of stuff queued up for worrying about  
once the internal mechanics are stronger.

- z

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Linus Torvalds  
View profile  
 More options Jan 30 2007, 5:55 pm
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Tue, 30 Jan 2007 22:55:08 UTC
Local: Tues, Jan 30 2007 5:55 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 30 Jan 2007, Zach Brown wrote:

> I think we'll also want to flesh out the submission and completion interface
> so that we don't find ourselves frustrated with it in another 5 years.  What's
> there now is just scaffolding to support the interesting kernel-internal part.
> No doubt the kevent thread will come into play here.

Actually, the thing I like about kernel micro-threads (and that "fibril"
name is starting to grow on me) is that I'm hoping we might be able to use
them for that kevent thing too. And perhaps some other issues (ACPI has
some "events" that might work with synchronously scheduled threads).

IOW, synchronous threading does have its advantages..

Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just
because he's basically the normal scheduler maintainer, but also because
he's historically been involved in things like the async filename lookup
that the in-kernel web server thing used. EXACTLY the kinds of things
where fibrils actually give you *much* nicer interfaces.

Ingo - see linux-kernel for the announcement and WIP patches.

                        Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Zach Brown  
View profile  
 More options Jan 30 2007, 6:47 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Tue, 30 Jan 2007 23:47:11 UTC
Local: Tues, Jan 30 2007 6:47 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not  
> just
> because he's basically the normal scheduler maintainer, but also  
> because
> he's historically been involved in things like the async filename  
> lookup
> that the in-kernel web server thing used.

Yeah, that was dumb.  I had him in the cc: initially, then thought it  
was too large and lobbed a bunch off.  My mistake.

Ingo, I'm interested in your reaction to the i386-specific mechanics  
here (the thread_info copies terrify me) and the general notion of  
how to tie this cleanly into the scheduler.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Benjamin Herrenschmidt  
View profile  
 More options Jan 30 2007, 9:04 pm
Newsgroups: fa.linux.kernel
From: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Date: Wed, 31 Jan 2007 02:04:55 UTC
Local: Tues, Jan 30 2007 9:04 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> - We would now have some measure of task_struct concurrency.  Read that twice,
> it's scary.  As two fibrils execute and block in turn they'll each be
> referencing current->.  It means that we need to audit task_struct to make sure
> that paths can handle racing as its scheduled away.  The current implementation
> *does not* let preemption trigger a fibril switch.  So one only has to worry
> about racing with voluntary scheduling of the fibril paths.  This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril.  I think this is a
> manageable burden.

That's the one scaring me in fact ... Maybe it will end up being an easy
one but I don't feel too comfortable... we didn't create fibril-like
things for threads, instead, we share PIDs between tasks. I wonder if
the sane approach would be to actually create task structs (or have a
pool of them pre-created sitting there for performances) and add a way
to share the necessary bits so that syscalls can be run on those
spin-offs.

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Benjamin Herrenschmidt  
View profile  
 More options Jan 30 2007, 9:08 pm
Newsgroups: fa.linux.kernel
From: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Date: Wed, 31 Jan 2007 02:08:03 UTC
Local: Tues, Jan 30 2007 9:08 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 2007-01-30 at 15:45 -0800, Zach Brown wrote:
> > Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not  
> > just
> > because he's basically the normal scheduler maintainer, but also  
> > because
> > he's historically been involved in things like the async filename  
> > lookup
> > that the in-kernel web server thing used.

> Yeah, that was dumb.  I had him in the cc: initially, then thought it  
> was too large and lobbed a bunch off.  My mistake.

> Ingo, I'm interested in your reaction to the i386-specific mechanics  
> here (the thread_info copies terrify me) and the general notion of  
> how to tie this cleanly into the scheduler.

Thread info copies aren't such a big deal, we do that for irq stacks
already afaik

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Linus Torvalds  
View profile  
 More options Jan 30 2007, 9:47 pm
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Wed, 31 Jan 2007 02:47:04 UTC
Local: Tues, Jan 30 2007 9:47 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:
> > - We would now have some measure of task_struct concurrency.  Read that twice,
> > it's scary.  As two fibrils execute and block in turn they'll each be
> > referencing current->.  It means that we need to audit task_struct to make sure
> > that paths can handle racing as its scheduled away.  The current implementation
> > *does not* let preemption trigger a fibril switch.  So one only has to worry
> > about racing with voluntary scheduling of the fibril paths.  This can mean
> > moving some task_struct members under an accessor that hides them in a struct
> > in task_struct so they're switched along with the fibril.  I think this is a
> > manageable burden.

> That's the one scaring me in fact ... Maybe it will end up being an easy
> one but I don't feel too comfortable...

We actually have almost zero "interesting" data in the task-struct.

All the real meat of a task has long since been split up into structures
that can be shared for threading anyway (ie signal/files/mm/etc).

Which is why I'm personally very comfy with just re-using task_struct
as-is.

NOTE! This is with the understanding that we *never* do any preemption.
The whole point of the microthreading as far as I'm concerned is exactly
that it is cooperative. It's not preemptive, and it's emphatically *not*
concurrent (ie you'd never have two fibrils running at the same time on
separate CPU's).

If you want preemptive of concurrent CPU usage, you use separate threads.
The point of AIO scheduling is very much inherent in its name: it's for
filling up CPU's when there's IO.

So the theory (and largely practice) is that you want to use real threads
to fill your CPU's, but then *within* those threads you use AIO to make
sure that each thread actually uses the CPU efficiently and doesn't just
block with nothing to do.

So with the understanding that this is neither CPU-concurrent nor
preemptive (*within* a fibril group - obviously the thread itself gets
both preempted and concurrently run with other threads), I don't worry at
all about sharing "struct task_struct".

Does that mean that we might not have some cases where we'd need to make
sure we do things differently? Of course not. Something migt show up. But
this actually makes it very clear what the difference between "struct
thread_struct" and "struct task_struct" are. One is shared between
fibrils, the other isn't.

                        Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Linus Torvalds  
View profile  
 More options Jan 30 2007, 10:02 pm
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Wed, 31 Jan 2007 03:02:54 UTC
Local: Tues, Jan 30 2007 10:02 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 30 Jan 2007, Linus Torvalds wrote:

> Does that mean that we might not have some cases where we'd need to make
> sure we do things differently? Of course not. Something migt show up. But
> this actually makes it very clear what the difference between "struct
> thread_struct" and "struct task_struct" are. One is shared between
> fibrils, the other isn't.

Btw, this is also something where we should just disallow certain system
calls from being done through the asynchronous method.

Notably, clone/fork(), execve() and exit() are all things that we probably
simply shouldn't allow as "AIO" events.

The process handling ones are obvious: they are very much about the shared
"struct task_struct", so they rather clearly should only done "natively".

More interesting is the question about "close()", though. Currently we
have an optimization (fget/fput_light) that basically boils down to "we
know we are the only owners". That optimization becomes more "interesting"
with AIO - we need to disable it when fibrils are active (because other
fibrils or the main thread can do it), but we can still keep it for the
non-fibril case.

So we can certainly allow close() to happen in a fibril, but at the same
time, this is an area where just the *existence* of fibrils means that
certain other decisions that were thread-related may be modified to be
aware of the micro-threads too.

I suspect there are rather few of those, though. The only one I can think
of is literally the fget/fput_light() case, but there could be others.

                        Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Benjamin Herrenschmidt  
View profile  
 More options Jan 31 2007, 12:17 am
Newsgroups: fa.linux.kernel
From: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Date: Wed, 31 Jan 2007 05:17:33 UTC
Local: Wed, Jan 31 2007 12:17 am
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> NOTE! This is with the understanding that we *never* do any preemption.
> The whole point of the microthreading as far as I'm concerned is exactly
> that it is cooperative. It's not preemptive, and it's emphatically *not*
> concurrent (ie you'd never have two fibrils running at the same time on
> separate CPU's).

That makes it indeed much less worrisome...

> If you want preemptive of concurrent CPU usage, you use separate threads.
> The point of AIO scheduling is very much inherent in its name: it's for
> filling up CPU's when there's IO.

Ok, I see, that's in fact pretty similar to some task switching hack I
did about 10 years ago on MacOS to have "asynchronous" IO code be
implemented linearily :-)

Makes lots of sense imho.

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nick Piggin  
View profile  
 More options Jan 31 2007, 12:38 am
Newsgroups: fa.linux.kernel
From: Nick Piggin <nickpig...@yahoo.com.au>
Date: Wed, 31 Jan 2007 05:38:11 UTC
Local: Wed, Jan 31 2007 12:38 am
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

So using stacks to hold state is (IMO) the logical choice to do async
syscalls, especially once you have a look at some of the other AIO
stuff going around.

I always thought that the AIO people didn't do this because they wanted
to avoid context switch overhead.

So now if we introduce the context switch overhead back, why do we need
just another scheduling primitive? What's so bad about using threads? The
upside is that almost everything is already there and working, and also
they don't have any of these preemption or concurrency restrictions.

The only thing I saw in Zach's post against the use of threads is that
some kernel API would change. But surely if this is the showstopper then
there must be some better argument than sys_getpid()?!

Aside from that, I'm glad that someone is looking at this way for AIO,
because I really don't like some aspects in the other approach.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Nick Piggin  
View profile  
 More options Jan 31 2007, 12:52 am
Newsgroups: fa.linux.kernel
From: Nick Piggin <nickpig...@yahoo.com.au>
Date: Wed, 31 Jan 2007 05:52:28 UTC
Local: Wed, Jan 31 2007 12:52 am
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

In other words, while I share the appreciation for this clever trick of
using cooperative switching between these little thriblets, I don't
actually feel it is very elegant to then have to change the kernel so
much in order to handle them.

I would be fascinated to see where such a big advantage comes from using
these rather than threads. Maybe we can then improve threads not to suck
so much and everybody wins.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Linus Torvalds  
View profile  
 More options Jan 31 2007, 1:07 am
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Wed, 31 Jan 2007 06:07:44 UTC
Local: Wed, Jan 31 2007 1:07 am
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Wed, 31 Jan 2007, Nick Piggin wrote:

> I always thought that the AIO people didn't do this because they wanted
> to avoid context switch overhead.

I don't think that scheduling overhead was ever a really the reason, at
least not the primary one, and at least not on Linux. Sure, we can
probably make cooperative thread switching a bit faster than even
VM-sharing thread switching (maybe), but it's not going to be *that* big
an issue.

Ifaik, the bigger issues were about setup costs (but also purely semantic
- it was hard to do AIO semantics with threads).

And memory costs. The "one stack page per outstanding AIO" may end up
still being too expensive, but threads were even more so.

[ Of course, that used to also be the claim by all the people who thought
  we couldn't do native kernel threads for "normal" threading either, and
  should go with the n*m setup. Shows how much they knew ;^]

But I've certainly _personally_ always wanted to do AIO with threads. I
wanted to do it with regular threads (ie the "clone()" kind). It didn't
fly. But I think we can possibly lower both the setup costs and the memory
costs with the cooperative approach, to the point where maybe this one is
more palatable and workable.

And maybe it also solves some of the scalability worries (threads have ID
space and scheduling setup things that essentially go away by just not
doing them - which is what the fibrils simply wouldn't have).

(Sadly, some of the people who really _use_ AIO are the database people,
and they really only care about a particularly stupid and trivial case:
pure reads and writes. A lot of other loads care about much more complex
things: filename lookups etc, that traditional AIO cannot do at all, and
that you really want something more thread-like for. But those other loads
get kind of swamped by the DB needs, which are might tighter and trivial
enough that you don't "need" a real thread for them).

                Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Introduce aio system call submission and completion system calls" by Andi Kleen
Andi Kleen  
View profile  
 More options Jan 31 2007, 2:58 am
Newsgroups: fa.linux.kernel
From: Andi Kleen <a...@suse.de>
Date: Wed, 31 Jan 2007 07:58:27 UTC
Local: Wed, Jan 31 2007 2:58 am
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

Zach Brown <zach.br...@oracle.com> writes:
> This finally does something useful with the notion of being able to schedule
> stacks as fibrils under a task_struct.  Again, i386-specific and in need of
> proper layering with archs.

> sys_asys_submit() is added to let userspace submit asynchronous system calls.
> It specifies the system call number and arguments.  A fibril is constructed for
> each call.  Each starts with a stack which executes the given system call
> handler and then returns to a function which records the return code of the
> system call handler.  sys_asys_await_completion() then lets userspace collect
> these results.

Do you have any numbers how this compares cycle wise to just doing
clone+syscall+exit in user space?

If the difference is not too big might it be easier to fix
clone+syscall to be more efficient than teach all the rest
of the kernel about fibrils?

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Generic AIO by scheduling stacks" by Ingo Molnar
Ingo Molnar  
View profile  
 More options Jan 31 2007, 3:45 am
Newsgroups: fa.linux.kernel
From: Ingo Molnar <mi...@elte.hu>
Date: Wed, 31 Jan 2007 08:45:59 UTC
Local: Wed, Jan 31 2007 3:45 am
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

* Linus Torvalds <torva...@linux-foundation.org> wrote:

> [ Of course, that used to also be the claim by all the people who
>   thought we couldn't do native kernel threads for "normal" threading
>   either, and should go with the n*m setup. Shows how much they knew
>   ;^]

> But I've certainly _personally_ always wanted to do AIO with threads.
> I wanted to do it with regular threads (ie the "clone()" kind). It
> didn't fly. But I think we can possibly lower both the setup costs and
> the memory costs with the cooperative approach, to the point where
> maybe this one is more palatable and workable.

as you know i've been involved in all the affected IO and scheduling
disciplines in question: i hacked on scheduling, 1:1 threading, on Tux,
i even optimized kaio under TPC workloads, and various other things. So
i believe to have seen most of these things first-hand and thus i
(pretend to) have no particular prejudice or other subjective
preference. In the following, rather long (and boring) analysis i touch
upon both 1:1 threading, light threads, state machines and AIO.

The first and most important thing is that i think there are only /two/
basic, fundamental IO approaches that matter to kernel design:

 1) the most easy to program one.

 2) the fastest one.

1), the most easily programmed model: this is tasks and synchronous IO.
Samba (and Squid, and even Apache most of the time ...) is still using
plain old /processes/ and is doing fine! Most of the old LWP arguments
are not really true anymore, TLB switches are very fast meanwhile (per
hardware improvements) and our scheduler is still pretty light.
Furthermore, for certain specialized environments that are able isolate
the programmer from the risks and complexities of thread programming,
threading can be somewhat faster (such as Java or C#). But for the
general C/C++ based server environment even threading is pure hell. (We
know it from the kernel that programming a threaded design is /hard/,
needs alot of discipline and takes alot more resources than any of the
saner models.)

2) the fastest model: is a pure, minimal state-machine tailored to the
task/workload we want to do. Tux does that (and i'm keeping it uptodate
for current 2.6 kernels too), and it still beats the pants off anything
user-space.

But reality is that most people care more about programmability: 300
MB/sec or 500 MB/sec web throughput limit (or a serving limit 30K
requests in parallel or 60K requests in parallel - whatever the
performance metric you care about is) isnt really making any difference
in all but the most specialized environments - /because/ the price to
pay for it is much worse programmability (and hence much worse
flexibility). [ And this is all about basic economy: if it's 10 times
harder to program something and what takes 1 month to program in one
model takes 10 months to program in the other model - but in 10 months
the hardware already got another 50% faster, likely offsetting the
advantage of that 'other' model to begin with ... ]

Note that often the speed difference between task based and state based
designs is alot smaller. But God is the programming complexity
difference huge. Not even huge but /HUGE/. All our programming languages
are procedural and stack based. (Not only is there a basic lack of
state-based programming languages, it's a product of our brain
structure: state machines are not really what the human mind is
structured for, but i digress.)

Now where do all these LWP, fibre, firbril, micro-thread or N:M concepts
fit? Most of the time they are just a /weakening/ of the #1 concept. And
that's why they will lose out, because #1 is all about programmability
and they dont offer anything new: because they cannot. Either you go for
programmability or you go for performance. There is /no/ middle ground
for us in the kernel! (User-space can do the middle ground thing, but
the kernel must only be structured around the two most fundamental
concepts that exist. [NOTE: more about KAIO later.])

Having a 1:1 relationship between user-space and kernel-space context is
a strong concept, and on modern CPUs we perform /process/ context
switches in 650 nsecs, we do user thread context switches in 500 nsecs,
and kernel<->kernel thread context switches in 350 nsecs. That's pretty
damn fast.

The big problem is, the M:N concepts (and i consider micro-threads or
'schedulable stacks' to be of that type) tend to operate the following
way: they build up hype by being "faster" - but the performance
advantage only comes from weakening #1! (for example by not making the
contexts generally schedulable, bindable, load-balancable, suspendable,
migratable, etc.) But then a few years later they grow (by virtue of
basic pressure from programmers, who /want/ a clean easily programmable
concept #1) the /very same/ features that they "optimized away" in the
beginning. With the difference that now all that infrastructure they
left out initially is replicated in user-space, in a poorer and
inconsistent way, blowing up cache-size, and slowing the /sane/ things
down in the end. (and then i havent even begun about the nightmare of
extending the debug infrastructure to light threads, ABI dependencies,
etc, etc...)

One often repeated (because pretty much only) performance advantage of
'light threads' is context-switch performance between user-space
threads. But reality is, nobody /cares/ about being able to
context-switch between "light user-space threads"! Why? Because there
are only two reasons why such a high-performance context-switch would
occur:

 1) there's contention between those two tasks. Wonderful: now two
    artificial threads are running on the /same/ CPU and they are even
    contending each other. Why not run a single context on a single CPU
    instead and only get contended if /another/ CPU runs a conflicting
    context?? While this makes for nice "pthread locking benchmarks",
    it is not really useful for anything real.

 2) there has been an IO event. The thing is, for IO events we enter the
    kernel no matter what - and we'll do so for the next 10 years at
    minimum. We want to abstract away the hardware, we want to do
    reliable resource accounting, we want to share hardware resources,
    we want to rate-limit, etc., etc. While in /theory/ you could handle
    IO purely from user-space, in practice you dont want to do that. And
    if we accept the premise that we'll enter the kernel anyway, there's
    zero performance difference between scheduling right there in the
    kernel, or returning back to user-space to schedule there. (in fact
    i submit that the former is faster). Or if we accept the theoretical
    possibility of 'perfect IO hardware' that implements /all/ the
    features that the kernel wants (in a secure and generic way, and
    mind you, such IO hardware does not exist yet), then /at most/ the
    performance advantage of user-space doing the scheduling is the
    overhead of a null syscall entry. Which is a whopping 100 nsecs on
    modern CPUs! That's roughly the latency of a /single/ DRAM access!

Furthermore, 'light thread' concepts can no way approach the performance
of #2 state-machines: if you /know/ what the structure of your context
is, and you can program it in a specialized state-machine way, there's
just so many shortcuts possible that it's not even funny.

> And maybe it also solves some of the scalability worries (threads have
> ID space and scheduling setup things that essentially go away by just
> not doing them - which is what the fibrils simply wouldn't have).

our PID management is very fast - i've never seen it show up in
profiles. (We could make it even more scalable if the need arises, for
example right now we still maintain the luxory of having a globally
increasing and tightly compressed PID/TID space.)

until a few days ago we even had a /global lock/ in the thread/task
creation and destruction hotpath since 2.5.43 (when oprofile support was
added, more than 4 years ago, the oprofile exit notifier lock) and
nobody really noticed or cared!

so i believe there are only two directions that make sense in the long
run:

1) improve our basic #1 design gradually. If something is a bottleneck,
   if the scheduler has grown too fat, cut some slack. If micro-threads
   or fibrils offer anything nice for our basic thread model: integrate
   it into the kernel. (Dont worry about having the enter the kernel for
   that, we /want/ that isolation! Resource control /needs/ a secure and
   trustable context. Good debuggability /needs/ a secure and trustable
   context. Hardware makers will make damn sure that such hardware-based
   isolation is fast as lightning. SYSENTER will continue to be very
   fast.) In any case, unless someone can answer the issues i raised
   above, please dont mess with the basic 1:1 task model we have. It's
   really what we want to have.

2) enable crazy people to do IO in the #2 way. For this i think the most
   programmable interface is KAIO - because the 'context' /only/
   involves the IO entity itself, which is simple enough for our brain
   to wrap itself around. (and the hardware itself has the concept of
   'in flight IO' anyway, so people are forced to learn about it and to
   deal with it anyway, even in the synchronous IO model. So there's
   quite some mindshare to build upon.) This also happens to realize
   /most/ of the performance advantages that a state-machine like Tux
   tends to have. (In that sense i dont see epoll to be conceptually
   different from KAIO, although internally within the kernel it's quite
   a bit different.)

now, KAIO 'behind the hood' (within the kernel) is like /very/ hard to
program in a fully state-driven manner - but we are crazy folks who
/might/ be able to pull it off. Initially we should just simulate the
really hard bits via a pool of in-kernel ...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Xavier Bestel  
View profile  
 More options Jan 31 2007, 5:51 am
Newsgroups: fa.linux.kernel
From: Xavier Bestel <xavier.bes...@free.fr>
Date: Wed, 31 Jan 2007 10:51:13 UTC
Local: Wed, Jan 31 2007 5:51 am
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 2007-01-30 at 19:02 -0800, Linus Torvalds wrote:
> Btw, this is also something where we should just disallow certain system
> calls from being done through the asynchronous method.

Does that mean that doing an AIO-disabled syscall will wait for all in-
flight AIO syscalls to finish ?

        Xav

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Introduce aio system call submission and completion system calls" by Zach Brown
Zach Brown  
View profile  
 More options Jan 31 2007, 12:16 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Wed, 31 Jan 2007 17:16:14 UTC
Local: Wed, Jan 31 2007 12:16 pm
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:

> Do you have any numbers how this compares cycle wise to just doing
> clone+syscall+exit in user space?

Not yet, no.  Release early, release often, and all that.  I'll throw  
something together.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andi Kleen  
View profile  
 More options Jan 31 2007, 12:24 pm
Newsgroups: fa.linux.kernel
From: Andi Kleen <a...@suse.de>
Date: Wed, 31 Jan 2007 17:24:06 UTC
Local: Wed, Jan 31 2007 12:24 pm
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
On Wednesday 31 January 2007 18:15, Zach Brown wrote:

> On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:

> > Do you have any numbers how this compares cycle wise to just doing
> > clone+syscall+exit in user space?

> Not yet, no.  Release early, release often, and all that.  I'll throw  
> something together.

So what was the motivation for doing this then?  It's only point
is to have smaller startup costs for AIO than clone+fork without
fixing the VFS code to be a state machine, right?

I'm personally unclear if it's really less work to teach a lot of
code in the kernel about a new thread abstraction than changing VFS.

Your patches don't look that complicated yet but you openly
admitted you waved away many of the more tricky issues (like
signals etc.) and I bet there are yet-unknown side effects
of this too that will need more changes.

I would expect a VFS solution to be the fastest of any at least.

I'm not sure the fibrils thing will be that much faster than
a possibly somewhat fast pathed for this case clone+syscall+exit.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Generic AIO by scheduling stacks" by Zach Brown
Zach Brown  
View profile  
 More options Jan 31 2007, 12:39 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Wed, 31 Jan 2007 17:39:05 UTC
Local: Wed, Jan 31 2007 12:39 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

>> - We would now have some measure of task_struct concurrency.  Read  
>> that twice,
>> it's scary.
> That's the one scaring me in fact ... Maybe it will end up being an  
> easy
> one but I don't feel too comfortable...

Indeed, that was my first reaction too.  I dismissed the idea for a  
good six months after initially realizing that it implied sharing  
journal_info, etc.

But when I finally sat down and started digging through the  
task_struct members and, after quickly dismissing involuntary  
preemption of the fibrils, it didn't seem so bad.  I haven't done an  
exhaustive audit yet (and I won't advocate merging until I have) but  
I haven't seen any train wrecks.

> we didn't create fibril-like
> things for threads, instead, we share PIDs between tasks. I wonder if
> the sane approach would be to actually create task structs (or have a
> pool of them pre-created sitting there for performances) and add a way
> to share the necessary bits so that syscalls can be run on those
> spin-offs.

Maybe, if it comes to that.  I have some hopes that sharing by  
default and explicitly marking the bits that we shouldn't share will  
be good enough.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Zach Brown  
View profile  
 More options Jan 31 2007, 12:49 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Wed, 31 Jan 2007 17:49:07 UTC
Local: Wed, Jan 31 2007 12:49 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> Does that mean that we might not have some cases where we'd need to  
> make
> sure we do things differently? Of course not. Something migt show up.

Might, and has.  For a good time, take journal_info out of  
per_call_chain() in the patch set and watch it helpfully and loudly  
wet itself.  There really really are bits of thread_struct which are  
strictly thread-local-storage, of a sort, for a kernel call path.    
Sharing them, even only through cooperate scheduling, is fatal.  
link_count is another obvious one.

They're also the only ones I've bothered to discover so far :).

> But
> this actually makes it very clear what the difference between "struct
> thread_struct" and "struct task_struct" are. One is shared between
> fibrils, the other isn't.

Indeed.

Right now the per-fibril uses of task_struct members are left inline  
in task_struct and are copied on fibril switches.

We *could* put them in thread_info, at the cost of stack pressure, or  
hang them off task_struct in their own struct to avoid the copies, at  
the cost of indirection.  I didn't like imposing a cost on paths that  
don't use fibrils, though, so I left them inline.

(I think you know all this.  I'm clarifying for the peanut gallery, I  
hope.)

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Benjamin LaHaise  
View profile  
 More options Jan 31 2007, 12:52 pm
Newsgroups: fa.linux.kernel
From: Benjamin LaHaise <b...@kvack.org>
Date: Wed, 31 Jan 2007 17:52:00 UTC
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Wed, Jan 31, 2007 at 09:38:11AM -0800, Zach Brown wrote:
> Indeed, that was my first reaction too.  I dismissed the idea for a  
> good six months after initially realizing that it implied sharing  
> journal_info, etc.

> But when I finally sat down and started digging through the  
> task_struct members and, after quickly dismissing involuntary  
> preemption of the fibrils, it didn't seem so bad.  I haven't done an  
> exhaustive audit yet (and I won't advocate merging until I have) but  
> I haven't seen any train wrecks.

I'm still of the opinion that you cannot do this without creating actual
threads.  That said, there is room for setting up the task_struct beforehand
without linking it into the system lists.  The reason I don't think this
approach works (and I looked at it a few times) is that many things end
up requiring special handling: things like permissions, signals, FPU state,
segment registers....  The list ends up looking exactly the way task_struct
is, making the actual savings very small.

What the fibrils approach is useful for is the launching of the thread
initially, as you don't have to retain things like the current FPU state,
change segment registers, etc.  Changing the stack is cheap, the rest of
the work is not.

                -ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <d...@kvack.org>.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Zach Brown  
View profile  
 More options Jan 31 2007, 1:00 pm
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Wed, 31 Jan 2007 18:00:45 UTC
Local: Wed, Jan 31 2007 1:00 pm
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> Btw, this is also something where we should just disallow certain  
> system
> calls from being done through the asynchronous method.

Yeah.  Maybe just a bitmap built from __NR_ constants?  I don't know  
if we can do it in a way that doesn't require arch maintainer's  
attention.

It seems like it would be nice to avoid putting a test in the  
handlers themselves, and leave it up to the aio syscall submission  
processing.

> More interesting is the question about "close()", though. Currently we
> have an optimization (fget/fput_light) that basically boils down to  
> "we
> know we are the only owners". That optimization becomes more  
> "interesting"
> with AIO - we need to disable it when fibrils are active (because  
> other
> fibrils or the main thread can do it), but we can still keep it for  
> the
> non-fibril case.

I'll take a look, thanks for pointing it out.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Messages 1 - 25 of 59   Newer >
« Back to Discussions « Newer topic     Older topic »