Introduce per_call_chain()
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Tue, 30 Jan 2007 21:40:26 UTC
Local: Tues, Jan 30 2007 4:40 pm
Subject: [PATCH 1 of 4] Introduce per_call_chain()
There are members of task_struct which are only used by a given call chain to
pass arguments up and down the chain itself.  They are logically thread-local
storage.

The patches later in the series want to have multiple calls pending for a given
task, though only one will be executing at a given time.  By putting these
thread-local members of task_struct in a separate storage structure we're able
to trivially swap them in and out as their calls are swapped in and out.

per_call_chain() doesn't have a terribly great name. It was chosen in the
spirit of per_cpu().

The storage was left inline in task_struct to avoid introducing indirection for
the vast majority of uses which will never have multiple calls executing in a
task.
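As a rough user-space illustration of the swapping described above (not the
kernel code itself), the sketch below keeps the storage inline in a stand-in
task structure and copies it out and in when the running call chain changes.
The names struct task and swap_call_chain() are hypothetical; only
per_call_chain_storage and per_call_chain() come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Same members as the structure introduced by the patch. */
struct per_call_chain_storage {
	int link_count;
	int total_link_count;
	void *journal_info;
};

/* Hypothetical stand-in for task_struct: the storage stays inline. */
struct task {
	struct per_call_chain_storage per_call;
};

static struct task the_task;
static struct task *current = &the_task;

/* Parenthesized variant of the accessor from the patch. */
#define per_call_chain(member) (current->per_call.member)

/*
 * Hypothetical helper: park the running chain's storage in 'save' and
 * install the storage of the chain that is about to run.
 */
static void swap_call_chain(struct per_call_chain_storage *save,
			    const struct per_call_chain_storage *next)
{
	assert(save != NULL && next != NULL);
	*save = current->per_call;
	current->per_call = *next;
}
```

Because the common case never swaps, accesses through per_call_chain() stay a
plain member access with no extra pointer dereference.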

I chose a few members of task_struct to migrate under per_call_chain() along
with the introduction as an example of what it looks like.  These would be
separate patches in a patch series that was suitable for merging.

diff -r b1128b48dc99 -r 26e278468209 fs/jbd/journal.c
--- a/fs/jbd/journal.c  Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/jbd/journal.c  Mon Jan 29 15:36:13 2007 -0800
@@ -471,7 +471,7 @@ int journal_force_commit_nested(journal_
        tid_t tid;

        spin_lock(&journal->j_state_lock);
-       if (journal->j_running_transaction && !current->journal_info) {
+       if (journal->j_running_transaction && !per_call_chain(journal_info)) {
                transaction = journal->j_running_transaction;
                __log_start_commit(journal, transaction->t_tid);
        } else if (journal->j_committing_transaction)
diff -r b1128b48dc99 -r 26e278468209 fs/jbd/transaction.c
--- a/fs/jbd/transaction.c      Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/jbd/transaction.c      Mon Jan 29 15:36:13 2007 -0800
@@ -279,12 +279,12 @@ handle_t *journal_start(journal_t *journ
        if (!handle)
                return ERR_PTR(-ENOMEM);

-       current->journal_info = handle;
+       per_call_chain(journal_info) = handle;

        err = start_this_handle(journal, handle);
        if (err < 0) {
                jbd_free_handle(handle);
-               current->journal_info = NULL;
+               per_call_chain(journal_info) = NULL;
                handle = ERR_PTR(err);
        }
        return handle;
@@ -1368,7 +1368,7 @@ int journal_stop(handle_t *handle)
                } while (old_handle_count != transaction->t_handle_count);
        }

-       current->journal_info = NULL;
+       per_call_chain(journal_info) = NULL;
        spin_lock(&journal->j_state_lock);
        spin_lock(&transaction->t_handle_lock);
        transaction->t_outstanding_credits -= handle->h_buffer_credits;
diff -r b1128b48dc99 -r 26e278468209 fs/namei.c
--- a/fs/namei.c        Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/namei.c        Mon Jan 29 15:36:13 2007 -0800
@@ -628,20 +628,20 @@ static inline int do_follow_link(struct
 static inline int do_follow_link(struct path *path, struct nameidata *nd)
 {
        int err = -ELOOP;
-       if (current->link_count >= MAX_NESTED_LINKS)
+       if (per_call_chain(link_count) >= MAX_NESTED_LINKS)
                goto loop;
-       if (current->total_link_count >= 40)
+       if (per_call_chain(total_link_count) >= 40)
                goto loop;
        BUG_ON(nd->depth >= MAX_NESTED_LINKS);
        cond_resched();
        err = security_inode_follow_link(path->dentry, nd);
        if (err)
                goto loop;
-       current->link_count++;
-       current->total_link_count++;
+       per_call_chain(link_count)++;
+       per_call_chain(total_link_count)++;
        nd->depth++;
        err = __do_follow_link(path, nd);
-       current->link_count--;
+       per_call_chain(link_count)--;
        nd->depth--;
        return err;
 loop:
@@ -1025,7 +1025,7 @@ int fastcall link_path_walk(const char *

 int fastcall path_walk(const char * name, struct nameidata *nd)
 {
-       current->total_link_count = 0;
+       per_call_chain(total_link_count) = 0;
        return link_path_walk(name, nd);
 }

@@ -1153,7 +1153,7 @@ static int fastcall do_path_lookup(int d

                fput_light(file, fput_needed);
        }
-       current->total_link_count = 0;
+       per_call_chain(total_link_count) = 0;
        retval = link_path_walk(name, nd);
 out:
        if (likely(retval == 0)) {
diff -r b1128b48dc99 -r 26e278468209 include/linux/init_task.h
--- a/include/linux/init_task.h Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/init_task.h Mon Jan 29 15:36:13 2007 -0800
@@ -88,6 +88,11 @@ extern struct nsproxy init_nsproxy;

 extern struct group_info init_groups;

+#define INIT_PER_CALL_CHAIN(tsk)                                       \
+{                                                                      \
+       .journal_info   = NULL,                                         \
+}
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -124,6 +129,7 @@ extern struct group_info init_groups;
        .keep_capabilities = 0,                                         \
        .user           = INIT_USER,                                    \
        .comm           = "swapper",                                  \
+       .per_call       = INIT_PER_CALL_CHAIN(tsk),                     \
        .thread         = INIT_THREAD,                                  \
        .fs             = &init_fs,                                 \
        .files          = &init_files,                                      \
@@ -135,7 +141,6 @@ extern struct group_info init_groups;
                .signal = {{0}}},                                       \
        .blocked        = {{0}},                                        \
        .alloc_lock     = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock),         \
-       .journal_info   = NULL,                                         \
        .cpu_timers     = INIT_CPU_TIMERS(tsk.cpu_timers),              \
        .fs_excl        = ATOMIC_INIT(0),                               \
        .pi_lock        = SPIN_LOCK_UNLOCKED,                           \
diff -r b1128b48dc99 -r 26e278468209 include/linux/jbd.h
--- a/include/linux/jbd.h       Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/jbd.h       Mon Jan 29 15:36:13 2007 -0800
@@ -883,7 +883,7 @@ extern void         __wait_on_journal (journal_

 static inline handle_t *journal_current_handle(void)
 {
-       return current->journal_info;
+       return per_call_chain(journal_info);
 }

 /* The journaling code user interface:
diff -r b1128b48dc99 -r 26e278468209 include/linux/sched.h
--- a/include/linux/sched.h     Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/sched.h     Mon Jan 29 15:36:13 2007 -0800
@@ -784,6 +784,20 @@ static inline void prefetch_stack(struct
 static inline void prefetch_stack(struct task_struct *t) { }
 #endif

+/*
+ * Members of this structure are used to pass arguments down call chains
+ * without specific arguments.  Historically they lived on task_struct,
+ * putting them in one place gives us some flexibility.  They're accessed
+ * with per_call_chain(name).
+ */
+struct per_call_chain_storage {
+       int link_count;         /* number of links in one symlink */
+       int total_link_count;   /* total links followed in a lookup */
+       void *journal_info;     /* journalling filesystem info */
+};
+
+#define per_call_chain(foo) current->per_call.foo
+
 struct audit_context;          /* See audit.c */
 struct mempolicy;
 struct pipe_inode_info;
@@ -920,7 +934,7 @@ struct task_struct {
                                       it with task_lock())
                                     - initialized normally by flush_old_exec */
 /* file system info */
-       int link_count, total_link_count;
+       struct per_call_chain_storage per_call;
 #ifdef CONFIG_SYSVIPC
 /* ipc stuff */
        struct sysv_sem sysvsem;
@@ -993,9 +1007,6 @@ struct task_struct {
        struct held_lock held_locks[MAX_LOCK_DEPTH];
        unsigned int lockdep_recursion;
 #endif
-
-/* journalling filesystem info */
-       void *journal_info;

 /* VM state */
        struct reclaim_state *reclaim_state;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
Discussion subject changed to "Introduce i386 fibril scheduling" by Zach Brown
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Tue, 30 Jan 2007 21:40:31 UTC
Local: Tues, Jan 30 2007 4:40 pm
Subject: [PATCH 2 of 4] Introduce i386 fibril scheduling
This patch introduces the notion of a 'fibril'.  It's meant to be a lighter
kernel thread.  There can be multiple of them in the process of executing for a
given task_struct, but only one can ever be actively running at a time.  Think
of it as a stack and some metadata for scheduling them inside the task_struct.
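For readers who want a feel for "several stacks, one running at a time", here
is a loose user-space analogy using the glibc ucontext calls.  It is only an
analogy under stated assumptions: the kernel patch switches stacks with its
own i386 assembly, not with ucontext, and every name below (fibril_body(),
run_fibril_demo(), the trace array) is made up for the illustration.

```c
#include <assert.h>
#include <ucontext.h>

#define NR_FIBRILS 2
#define STACK_SIZE (64 * 1024)
#define TRACE_MAX  8

static ucontext_t main_ctx;
static ucontext_t fibril_ctx[NR_FIBRILS];
static char fibril_stack[NR_FIBRILS][STACK_SIZE];
static int trace[TRACE_MAX];	/* order in which the stacks ran */
static int trace_len;

static void record(int id)
{
	assert(trace_len < TRACE_MAX);
	trace[trace_len++] = id;
}

static void fibril_body(int id)
{
	int other = (id + 1) % NR_FIBRILS;

	record(id);
	/* "Block": save this stack's context and resume the other one. */
	swapcontext(&fibril_ctx[id], &fibril_ctx[other]);
	record(id);
	/* Returning resumes uc_link, which is main_ctx here. */
}

static int run_fibril_demo(void)
{
	int i;

	trace_len = 0;
	for (i = 0; i < NR_FIBRILS; i++) {
		getcontext(&fibril_ctx[i]);
		fibril_ctx[i].uc_stack.ss_sp = fibril_stack[i];
		fibril_ctx[i].uc_stack.ss_size = STACK_SIZE;
		fibril_ctx[i].uc_link = &main_ctx;
		makecontext(&fibril_ctx[i], (void (*)(void))fibril_body, 1, i);
	}
	/* Run stack 0; it hands off to 1, which hands back, then 0 finishes. */
	swapcontext(&main_ctx, &fibril_ctx[0]);
	/* Stack 1 is still parked inside its swapcontext(); let it finish. */
	swapcontext(&main_ctx, &fibril_ctx[1]);
	return trace_len;
}
```

At any instant exactly one of the contexts is executing; the others are just a
saved stack pointer and instruction pointer, which is the role the esp and eip
members play in struct fibril.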

This implementation is wildly architecture-specific but isn't put in the right
places.  Since these are not code paths that I have extensive experience with,
I focused more on getting it going and representative of the concept than on
making it right on the first try.  I'm actively interested in feedback from
people who know more about the places this touches.

The fibril struct itself is left stand-alone for clarity.  There is a 1:1
relationship between fibrils and struct thread_info, though, so it might make
more sense to embed the two somehow.

The use of list_head for the run queue is simplistic.  As long as we're not
removing specific fibrils from the list, which seems unlikely, we could be more
clever.  Maybe no more clever than a singly-linked list, though.
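To make the singly-linked alternative concrete, here is a minimal sketch of a
FIFO run queue that only supports adding at the tail and removing from the
head.  It is an illustration, not part of the patch; struct fibril_node,
struct fibril_queue and the fq_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue linkage that would replace the list_head. */
struct fibril_node {
	struct fibril_node *next;
	int id;
};

struct fibril_queue {
	struct fibril_node *head;
	struct fibril_node **tail;	/* address of the last 'next' pointer */
};

static void fq_init(struct fibril_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* Append in O(1); mirrors list_add_tail() on the runnable list. */
static void fq_push_tail(struct fibril_queue *q, struct fibril_node *n)
{
	/* Cheap sanity check; it cannot catch a node that is already the tail. */
	assert(n->next == NULL);
	*q->tail = n;
	q->tail = &n->next;
}

/* Remove and return the oldest runnable entry, or NULL if empty. */
static struct fibril_node *fq_pop_head(struct fibril_queue *q)
{
	struct fibril_node *n = q->head;

	if (n == NULL)
		return NULL;
	q->head = n->next;
	if (q->head == NULL)
		q->tail = &q->head;
	n->next = NULL;
	return n;
}
```

The trade-off is the one the text hints at: one pointer per entry instead of
two, at the cost of no O(1) removal from the middle of the queue.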

Fibril management is under the runqueue lock because that ends up working well
for the wake-up path as well.  In the current patch, though, it makes for some
pretty sloppy code for unlocking the runqueue lock (and re-enabling interrupts
and pre-emption) on the other side of the switch.

The actual mechanics of switching from one stack to another at the end of
schedule_fibril() makes me nervous.  I'm not convinced that blindly copying the
contents of thread_info from the previous to the next stack is safe, even if
done with interrupts disabled.  (NMIs?)  The juggling of current->thread_info
might be racy, etc.

diff -r 26e278468209 -r df7bc026d50e arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c        Mon Jan 29 15:36:13 2007 -0800
+++ b/arch/i386/kernel/process.c        Mon Jan 29 15:36:16 2007 -0800
@@ -698,6 +698,28 @@ struct task_struct fastcall * __switch_t
        return prev_p;
 }

+/*
+ * We've just switched the stack and instruction pointer to point to a new
+ * fibril.  We were called from schedule() -> schedule_fibril() with the
+ * runqueue lock held _irq and with preemption disabled.
+ *
+ * We let finish_fibril_switch() unwind the state that was built up by
+ * our callers.  We do that here so that we don't need to ask fibrils to
+ * first execute something analogous to schedule_tail(). Maybe that's
+ * wrong.
+ *
+ * We'd also have to reacquire the kernel lock here.  For now we know the
+ * BUG_ON(lock_depth) prevents us from having to worry about it.
+ */
+void fastcall __switch_to_fibril(struct thread_info *ti)
+{
+       finish_fibril_switch();
+
+       /* free the ti if schedule_fibril() told us that it's done */
+       if (ti->status & TS_FREE_AFTER_SWITCH)
+               free_thread_info(ti);
+}
+
 asmlinkage int sys_fork(struct pt_regs regs)
 {
        return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/system.h
--- a/include/asm-i386/system.h Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/system.h Mon Jan 29 15:36:16 2007 -0800
@@ -31,6 +31,31 @@ extern struct task_struct * FASTCALL(__s
                      "=a" (last),"=S" (esi),"=D" (edi)                   \
                     :"m" (next->thread.esp),"m" (next->thread.eip),  \
                      "2" (prev), "d" (next));                              \
+} while (0)
+
+struct thread_info;
+void fastcall __switch_to_fibril(struct thread_info *ti);
+
+/*
+ * This is called with the run queue lock held _irq and with preemption
+ * disabled.  __switch_to_fibril drops those.
+ */
+#define switch_to_fibril(prev, next, ti) do {                          \
+       unsigned long esi,edi;                                          \
+       asm volatile("pushfl\n\t"             /* Save flags */        \
+                    "pushl %%ebp\n\t"                                        \
+                    "movl %%esp,%0\n\t"      /* save ESP */          \
+                    "movl %4,%%esp\n\t"      /* restore ESP */       \
+                    "movl $1f,%1\n\t"                /* save EIP */          \
+                    "pushl %5\n\t"           /* restore EIP */       \
+                    "jmp __switch_to_fibril\n"                               \
+                    "1:\t"                                           \
+                    "popl %%ebp\n\t"                                 \
+                    "popfl"                                          \
+                    :"=m" (prev->esp),"=m" (prev->eip),              \
+                     "=S" (esi),"=D" (edi)                         \
+                    :"m" (next->esp),"m" (next->eip),                        \
+                     "d" (prev), "a" (ti));                                \
 } while (0)

 #define _set_base(addr,base) do { unsigned long __pr; \
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/thread_info.h
--- a/include/asm-i386/thread_info.h    Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/thread_info.h    Mon Jan 29 15:36:16 2007 -0800
@@ -91,6 +91,12 @@ static inline struct thread_info *curren
 static inline struct thread_info *current_thread_info(void)
 {
        return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+}
+
+/* XXX perhaps should be integrated with task_pt_regs(task) */
+static inline struct pt_regs *thread_info_pt_regs(struct thread_info *info)
+{
+       return (struct pt_regs *)(KSTK_TOP(info)-8) - 1;
 }

 /* thread information allocation */
@@ -169,6 +175,7 @@ static inline struct thread_info *curren
  */
 #define TS_USEDFPU             0x0001  /* FPU was used by this task this quantum (SMP) */
 #define TS_POLLING             0x0002  /* True if in idle loop and not sleeping */
+#define TS_FREE_AFTER_SWITCH   0x0004  /* free ti in __switch_to_fibril() */

 #define tsk_is_polling(t) ((t)->thread_info->status & TS_POLLING)

diff -r 26e278468209 -r df7bc026d50e include/linux/init_task.h
--- a/include/linux/init_task.h Mon Jan 29 15:36:13 2007 -0800
+++ b/include/linux/init_task.h Mon Jan 29 15:36:16 2007 -0800
@@ -111,6 +111,8 @@ extern struct group_info init_groups;
        .cpus_allowed   = CPU_MASK_ALL,                                 \
        .mm             = NULL,                                         \
        .active_mm      = &init_mm,                                 \
+       .fibril         = NULL,                                         \
+       .runnable_fibrils = LIST_HEAD_INIT(tsk.runnable_fibrils),       \
        .run_list       = LIST_HEAD_INIT(tsk.run_list),                 \
        .ioprio         = 0,                                            \
        .time_slice     = HZ,                                           \
diff -r 26e278468209 -r df7bc026d50e include/linux/sched.h
--- a/include/linux/sched.h     Mon Jan 29 15:36:13 2007 -0800
+++ b/include/linux/sched.h     Mon Jan 29 15:36:16 2007 -0800
@@ -812,6 +812,38 @@ enum sleep_type {

 struct prio_array;

+/*
+ * A 'fibril' is a very small fiber.  It's used here to mean a small thread.
+ *
+ * (Choosing a weird new name avoided yet more overloading of 'task', 'call',
+ * 'thread', 'stack', 'fib{er,re}', etc).
+ *
+ * This structure is used by the scheduler to track multiple executing stacks
+ * inside a task_struct.
+ *
+ * Only one fibril executes for a given task_struct at a time.  When it
+ * blocks, however, another fibril has the chance to execute while it sleeps.
+ * This means that call chains executing in fibrils can see concurrent
+ * current-> accesses at blocking points.  "per_call_chain()" members are
+ * switched along with the fibril, so they remain local.  Preemption *will not*
+ * trigger a fibril switch.
+ *
+ * XXX
+ *  - arch specific
+ */
+struct fibril {
+       struct list_head                run_list;
+       /* -1 unrunnable, 0 runnable, >0 stopped */
+       long                            state;
+       unsigned long                   eip;
+       unsigned long                   esp;
+       struct thread_info              *ti;
+       struct per_call_chain_storage   per_call;
+};
+
+void sched_new_runnable_fibril(struct fibril *fibril);
+void finish_fibril_switch(void);
+
 struct task_struct {
        volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
        struct thread_info *thread_info;
@@ -857,6 +889,20 @@ struct task_struct {
        struct list_head ptrace_list;

        struct mm_struct *mm, *active_mm;
+
+       /*
+        * The scheduler uses this to determine if the current call is a
+        * stand-alone task or a fibril.  If it's a fibril then wake-ups
+        * will target the fibril and a schedule() might result in swapping
+        * in another runnable fibril.  So to start executing fibrils at all
+        * one allocates a fibril to represent the running task and then
+        * puts initialized runnable fibrils in the run list.
+        *
+        * The state members of the fibril and runnable_fibrils list are
+        * managed under the task's run queue lock.
+        */
+       struct fibril *fibril;
+       struct list_head runnable_fibrils;

 /* task state */
        struct linux_binfmt *binfmt;
diff -r 26e278468209 -r df7bc026d50e kernel/exit.c
--- a/kernel/exit.c     Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/exit.c     Mon Jan 29 15:36:16 2007 -0800
@@ -854,6 +854,13 @@ fastcall NORET_TYPE void do_exit(long co
 {
        struct task_struct *tsk = current;
        int group_dead;
+
+       /*
+        * XXX this is just a debug helper, this should be waiting for all
+        * fibrils to return.  Possibly after sending them lots of -KILL
+        * signals?
+        */
+       BUG_ON(!list_empty(&current->runnable_fibrils));

        profile_task_exit(tsk);

diff -r 26e278468209 -r df7bc026d50e kernel/fork.c
--- a/kernel/fork.c     Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/fork.c     Mon Jan 29 15:36:16 2007 -0800
@@ -1179,6 +1179,9 @@ static struct task_struct *copy_process(

        /* for sys_ioprio_set(IOPRIO_WHO_PGRP) */
        p->ioprio = current->ioprio;
+
+       p->fibril = NULL;
+       INIT_LIST_HEAD(&p->runnable_fibrils);

        /*
         * The task hasn't been attached yet, so its cpus_allowed mask will
diff -r 26e278468209 -r df7bc026d50e kernel/sched.c
--- a/kernel/sched.c    Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/sched.c    Mon Jan 29 15:36:16 2007 -0800
@@ -3407,6 +3407,111 @@ static inline int interactive_sleep(enum
 }

 /*
+ * This unwinds the state that was built up by schedule -> schedule_fibril().
+ * The arch-specific switch_to_fibril() path calls here once the new fibril
+ * is executing.
+ */
+void finish_fibril_switch(void)
+{
+       spin_unlock_irq(&this_rq()->lock);
+       preempt_enable_no_resched();
+}
+
+/*
+ * Add a new fibril to the runnable list.  It'll be switched to next time
+ * the caller comes through schedule().
+ */
+void sched_new_runnable_fibril(struct fibril *fibril)
+{
+       struct task_struct *tsk = current;
+       unsigned long flags;
+       struct rq *rq = task_rq_lock(tsk, &flags);
+
+       fibril->state = TASK_RUNNING;
+       BUG_ON(!list_empty(&fibril->run_list));
+       list_add_tail(&fibril->run_list, &tsk->runnable_fibrils);
+
+       task_rq_unlock(rq, &flags);
+}
+
+/*
+ * This is called from schedule() when we're not being preempted and there is a
+ * fibril ...

Discussion subject changed to "Teach paths to wake a specific void * target instead of a whole task_struct" by Zach Brown
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Tue, 30 Jan 2007 21:41:23 UTC
Local: Tues, Jan 30 2007 4:41 pm
Subject: [PATCH 3 of 4] Teach paths to wake a specific void * target instead of a whole task_struct
The addition of multiple sleeping fibrils under a task_struct means that we
can't simply wake a task_struct to be able to wake a specific sleeping code
path.

This patch introduces task_wake_target() as a way to refer to a code path that
is about to sleep and will be woken in the future.  Sleepers that used to wake
a current task_struct reference with wake_up_process() now use this helper to
get a wake target cookie and wake it with wake_up_target().

Some paths know that waking a task will be sufficient.  Paths working with
kernel threads that never use fibrils fall into this category.  They're changed
to use wake_up_task() instead of wake_up_process().
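The definitions of task_wake_target() and wake_up_target() are not visible in
this part of the thread, so the following is only one plausible way such an
opaque cookie could work, written as a user-space sketch.  The low-bit tagging,
the simplified struct task and struct fibril, and the woken flags are all
assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; 'woken' replaces the real scheduler state. */
struct fibril {
	int woken;
};

struct task {
	struct fibril *fibril;	/* NULL when the task is not using fibrils */
	int woken;
};

/* Assumed encoding: the low pointer bit marks a fibril cookie. */
#define WAKE_TARGET_FIBRIL ((uintptr_t)1)

/* Return a cookie naming whatever is about to sleep in task 't'. */
static void *task_wake_target(struct task *t)
{
	if (t->fibril != NULL) {
		uintptr_t bits = (uintptr_t)t->fibril;

		/* The tag bit must be free, which alignment guarantees here. */
		assert((bits & WAKE_TARGET_FIBRIL) == 0);
		return (void *)(bits | WAKE_TARGET_FIBRIL);
	}
	return t;
}

/* Wake exactly the sleeper the cookie refers to. */
static void wake_up_target(void *target)
{
	uintptr_t bits = (uintptr_t)target;

	if (bits & WAKE_TARGET_FIBRIL) {
		struct fibril *f = (struct fibril *)(bits & ~WAKE_TARGET_FIBRIL);

		f->woken = 1;
	} else {
		((struct task *)target)->woken = 1;
	}
}
```

Callers only store and pass the void * cookie, which is why the converted
structures in the diff swap a struct task_struct pointer for a void *
wake_target member.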

This is not an exhaustive patch.  It isn't yet clear how signals are going to
interact with fibrils.  Once that is decided callers of wake_up_state() are
going to need to reflect the desired behaviour.  I add __deprecated to it to
highlight this detail.

The actual act of performing the wake-up is hidden under try_to_wake_up() and
is serialized with the scheduler under the runqueue lock.  This is very
fiddly stuff.  I'm sure I've missed some details.  I've tried to comment
the intent above try_to_wake_up_fibril().

diff -r df7bc026d50e -r 4ea674e8825e arch/i386/kernel/ptrace.c
--- a/arch/i386/kernel/ptrace.c Mon Jan 29 15:36:16 2007 -0800
+++ b/arch/i386/kernel/ptrace.c Mon Jan 29 15:46:47 2007 -0800
@@ -492,7 +492,7 @@ long arch_ptrace(struct task_struct *chi
                child->exit_code = data;
                /* make sure the single step bit is not set. */
                clear_singlestep(child);
-               wake_up_process(child);
+               wake_up_task(child);
                ret = 0;
                break;

@@ -508,7 +508,7 @@ long arch_ptrace(struct task_struct *chi
                child->exit_code = SIGKILL;
                /* make sure the single step bit is not set. */
                clear_singlestep(child);
-               wake_up_process(child);
+               wake_up_task(child);
                break;

        case PTRACE_SYSEMU_SINGLESTEP: /* Same as SYSEMU, but singlestep if not syscall */
@@ -526,7 +526,7 @@ long arch_ptrace(struct task_struct *chi
                set_singlestep(child);
                child->exit_code = data;
                /* give it a chance to run. */
-               wake_up_process(child);
+               wake_up_task(child);
                ret = 0;
                break;

diff -r df7bc026d50e -r 4ea674e8825e drivers/block/loop.c
--- a/drivers/block/loop.c      Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/block/loop.c      Mon Jan 29 15:46:47 2007 -0800
@@ -824,7 +824,7 @@ static int loop_set_fd(struct loop_devic
                goto out_clr;
        }
        lo->lo_state = Lo_bound;
-       wake_up_process(lo->lo_thread);
+       wake_up_task(lo->lo_thread);
        return 0;

 out_clr:
diff -r df7bc026d50e -r 4ea674e8825e drivers/md/dm-io.c
--- a/drivers/md/dm-io.c        Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/md/dm-io.c        Mon Jan 29 15:46:47 2007 -0800
@@ -18,7 +18,7 @@ struct io {
 struct io {
        unsigned long error;
        atomic_t count;
-       struct task_struct *sleeper;
+       void *wake_target;
        io_notify_fn callback;
        void *context;
 };
@@ -110,8 +110,8 @@ static void dec_count(struct io *io, uns
                set_bit(region, &io->error);

        if (atomic_dec_and_test(&io->count)) {
-               if (io->sleeper)
-                       wake_up_process(io->sleeper);
+               if (io->wake_target)
+                       wake_up_target(io->wake_target);

                else {
                        int r = io->error;
@@ -323,7 +323,7 @@ static int sync_io(unsigned int num_regi

        io.error = 0;
        atomic_set(&io.count, 1); /* see dispatch_io() */
-       io.sleeper = current;
+       io.wake_target = task_wake_target(current);

        dispatch_io(rw, num_regions, where, dp, &io, 1);

@@ -358,7 +358,7 @@ static int async_io(unsigned int num_reg
        io = mempool_alloc(_io_pool, GFP_NOIO);
        io->error = 0;
        atomic_set(&io->count, 1); /* see dispatch_io() */
-       io->sleeper = NULL;
+       io->wake_target = NULL;
        io->callback = fn;
        io->context = context;

diff -r df7bc026d50e -r 4ea674e8825e drivers/scsi/qla2xxx/qla_os.c
--- a/drivers/scsi/qla2xxx/qla_os.c     Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/scsi/qla2xxx/qla_os.c     Mon Jan 29 15:46:47 2007 -0800
@@ -2403,7 +2403,7 @@ qla2xxx_wake_dpc(scsi_qla_host_t *ha)
 qla2xxx_wake_dpc(scsi_qla_host_t *ha)
 {
        if (ha->dpc_thread)
-               wake_up_process(ha->dpc_thread);
+               wake_up_task(ha->dpc_thread);
 }

 /*
diff -r df7bc026d50e -r 4ea674e8825e drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/scsi/scsi_error.c Mon Jan 29 15:46:47 2007 -0800
@@ -51,7 +51,7 @@ void scsi_eh_wakeup(struct Scsi_Host *sh
 void scsi_eh_wakeup(struct Scsi_Host *shost)
 {
        if (shost->host_busy == shost->host_failed) {
-               wake_up_process(shost->ehandler);
+               wake_up_task(shost->ehandler);
                SCSI_LOG_ERROR_RECOVERY(5,
                                printk("Waking error handler thread\n"));
        }
diff -r df7bc026d50e -r 4ea674e8825e fs/aio.c
--- a/fs/aio.c  Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/aio.c  Mon Jan 29 15:46:47 2007 -0800
@@ -907,7 +907,7 @@ void fastcall kick_iocb(struct kiocb *io
         * single context. */
        if (is_sync_kiocb(iocb)) {
                kiocbSetKicked(iocb);
-               wake_up_process(iocb->ki_obj.tsk);
+               wake_up_target(iocb->ki_obj.wake_target);
                return;
        }

@@ -941,7 +941,7 @@ int fastcall aio_complete(struct kiocb *
                BUG_ON(iocb->ki_users != 1);
                iocb->ki_user_data = res;
                iocb->ki_users = 0;
-               wake_up_process(iocb->ki_obj.tsk);
+               wake_up_target(iocb->ki_obj.wake_target);
                return 1;
        }

@@ -1053,7 +1053,7 @@ struct aio_timeout {
 struct aio_timeout {
        struct timer_list       timer;
        int                     timed_out;
-       struct task_struct      *p;
+       void                    *wake_target;
 };

 static void timeout_func(unsigned long data)
@@ -1061,7 +1061,7 @@ static void timeout_func(unsigned long d
        struct aio_timeout *to = (struct aio_timeout *)data;

        to->timed_out = 1;
-       wake_up_process(to->p);
+       wake_up_target(to->wake_target);
 }

 static inline void init_timeout(struct aio_timeout *to)
@@ -1070,7 +1070,7 @@ static inline void init_timeout(struct a
        to->timer.data = (unsigned long)to;
        to->timer.function = timeout_func;
        to->timed_out = 0;
-       to->p = current;
+       to->wake_target = task_wake_target(current);
 }

 static inline void set_timeout(long start_jiffies, struct aio_timeout *to,
diff -r df7bc026d50e -r 4ea674e8825e fs/direct-io.c
--- a/fs/direct-io.c    Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/direct-io.c    Mon Jan 29 15:46:47 2007 -0800
@@ -124,7 +124,7 @@ struct dio {
        spinlock_t bio_lock;            /* protects BIO fields below */
        unsigned long refcount;         /* direct_io_worker() and bios */
        struct bio *bio_list;           /* singly linked via bi_private */
-       struct task_struct *waiter;     /* waiting task (NULL if none) */
+       void *wake_target;              /* waiting initiator (NULL if none) */

        /* AIO related stuff */
        struct kiocb *iocb;             /* kiocb */
@@ -278,8 +278,8 @@ static int dio_bio_end_aio(struct bio *b

        spin_lock_irqsave(&dio->bio_lock, flags);
        remaining = --dio->refcount;
-       if (remaining == 1 && dio->waiter)
-               wake_up_process(dio->waiter);
+       if (remaining == 1 && dio->wake_target)
+               wake_up_target(dio->wake_target);
        spin_unlock_irqrestore(&dio->bio_lock, flags);

        if (remaining == 0) {
@@ -309,8 +309,8 @@ static int dio_bio_end_io(struct bio *bi
        spin_lock_irqsave(&dio->bio_lock, flags);
        bio->bi_private = dio->bio_list;
        dio->bio_list = bio;
-       if (--dio->refcount == 1 && dio->waiter)
-               wake_up_process(dio->waiter);
+       if (--dio->refcount == 1 && dio->wake_target)
+               wake_up_target(dio->wake_target);
        spin_unlock_irqrestore(&dio->bio_lock, flags);
        return 0;
 }
@@ -393,12 +393,12 @@ static struct bio *dio_await_one(struct
         */
        while (dio->refcount > 1 && dio->bio_list == NULL) {
                __set_current_state(TASK_UNINTERRUPTIBLE);
-               dio->waiter = current;
+               dio->wake_target = task_wake_target(current);
                spin_unlock_irqrestore(&dio->bio_lock, flags);
                io_schedule();
                /* wake up sets us TASK_RUNNING */
                spin_lock_irqsave(&dio->bio_lock, flags);
-               dio->waiter = NULL;
+               dio->wake_target = NULL;
        }
        if (dio->bio_list) {
                bio = dio->bio_list;
@@ -990,7 +990,7 @@ direct_io_worker(int rw, struct kiocb *i
        spin_lock_init(&dio->bio_lock);
        dio->refcount = 1;
        dio->bio_list = NULL;
-       dio->waiter = NULL;
+       dio->wake_target = NULL;

        /*
         * In case of non-aligned buffers, we may need 2 more
diff -r df7bc026d50e -r 4ea674e8825e fs/jbd/journal.c
--- a/fs/jbd/journal.c  Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/jbd/journal.c  Mon Jan 29 15:46:47 2007 -0800
@@ -94,7 +94,7 @@ static void commit_timeout(unsigned long
 {
        struct task_struct * p = (struct task_struct *) __data;

-       wake_up_process(p);
+       wake_up_task(p);
 }

 /*
diff -r df7bc026d50e -r 4ea674e8825e include/linux/aio.h
--- a/include/linux/aio.h       Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/aio.h       Mon Jan 29 15:46:47 2007 -0800
@@ -98,7 +98,7 @@ struct kiocb {

        union {
                void __user             *user;
-               struct task_struct      *tsk;
+               void                    *wake_target;
        } ki_obj;

        __u64                   ki_user_data;   /* user's data for completion */
@@ -124,7 +124,6 @@ struct kiocb {
 #define is_sync_kiocb(iocb)    ((iocb)->ki_key == KIOCB_SYNC_KEY)
 #define init_sync_kiocb(x, filp)                       \
        do {                                            \
-               struct task_struct *tsk = current;      \
                (x)->ki_flags = 0;                   \
                (x)->ki_users = 1;                   \
                (x)->ki_key = KIOCB_SYNC_KEY;                \
@@ -133,7 +132,7 @@ struct kiocb {
                (x)->ki_cancel = NULL;                       \
                (x)->ki_retry = NULL;                        \
                (x)->ki_dtor = NULL;                 \
-               (x)->ki_obj.tsk = tsk;                       \
+               (x)->ki_obj.wake_target = task_wake_target(current); \
                (x)->ki_user_data = 0;                  \
                init_wait((&(x)->ki_wait));             \
        } while (0)
diff -r df7bc026d50e -r 4ea674e8825e include/linux/freezer.h
--- a/include/linux/freezer.h   Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/freezer.h   Mon Jan 29 15:46:47 2007 -0800
@@ -42,7 +42,7 @@ static inline int thaw_process(struct ta
 {
        if (frozen(p)) {
                p->flags &= ~PF_FROZEN;
-               wake_up_process(p);
+               wake_up_task(p);
                return 1;
        }
        return 0;
diff -r df7bc026d50e -r 4ea674e8825e include/linux/hrtimer.h
--- a/include/linux/hrtimer.h   Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/hrtimer.h   Mon Jan 29 15:46:47 2007 -0800
@@ -65,7 +65,7 @@ struct hrtimer {
  */
 struct hrtimer_sleeper {
        struct hrtimer timer;
-       struct
...

Discussion subject changed to "Introduce i386 fibril scheduling" by Ingo Molnar
Ingo Molnar  
Newsgroups: fa.linux.kernel
From: Ingo Molnar <mi...@elte.hu>
Date: Thu, 01 Feb 2007 08:38:49 UTC
Local: Thurs, Feb 1 2007 3:38 am
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

* Zach Brown <zach.br...@oracle.com> wrote:

> This patch introduces the notion of a 'fibril'.  It's meant to be a
> lighter kernel thread. [...]

as per my other email, i dont really like this concept. This is the
killer:

> [...]  There can be multiple of them in the process of executing for a
> given task_struct, but only one can ever be actively running at a
> time. [...]

there's almost no scheduling cost from being able to arbitrarily
schedule a kernel thread - but there are /huge/ benefits in it.

would it be hard to redo your AIO patches based on a pool of plain
simple kernel threads?

We could even extend the scheduling properties of kernel threads so that
they could also be 'companion threads' of any given user-space task.
(i.e. they'd always schedule on the same CPU as that user-space task)

I bet most of the real benefit would come from co-scheduling them on the
same CPU. But this should be a performance property, not a basic design
property. (And i also think that having a limited per-CPU pool of AIO
threads works better than having a per-user-thread pool - but again this
is a detail that can be easily changed, not a fundamental design
property.)

        Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
Ingo Molnar
Newsgroups: fa.linux.kernel
From: Ingo Molnar <mi...@elte.hu>
Date: Thu, 01 Feb 2007 13:05:04 UTC
Local: Thurs, Feb 1 2007 8:05 am
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

* Ingo Molnar <mi...@elte.hu> wrote:

> * Zach Brown <zach.br...@oracle.com> wrote:

> > This patch introduces the notion of a 'fibril'.  It's meant to be a
> > lighter kernel thread. [...]

> as per my other email, i dont really like this concept. This is the
> killer:

let me clarify this: i very much like your AIO patchset in general, in
the sense that it 'completes' the AIO implementation: finally everything
can be done via it, greatly increasing its utility and hopefully its
penetration. This is the most important step, by far.

what i dont really like is /the particular/ concept above - the
introduction of 'fibrils' as a hard distinction of kernel threads. They
are /almost/ kernel threads, but still by being different they create
a lot of duplication and miss out on a good deal of features that kernel
threads have naturally.

It kind of hurts to say this because i'm usually quite concept-happy -
one can easily get addicted to the introduction of new core kernel
concepts :-) But i really, really think we dont want to do fibrils but
we want to do kernel threads, and i havent really seen a discussion
about why they shouldnt be done via kernel threads.

Nor have i seen a discussion that whatever threading concept we use for
AIO within the kernel, it is really a fallback thing, not the primary
goal of "native" KAIO design. The primary goal of KAIO design is to
arrive at a state machine - and for one of the most important IO
disciplines, networking, that is reality already. (For filesystem events
i doubt we will ever be able to build an IO state machine - but there
are lots of crazy folks out there so it's not fundamentally impossible,
just very, very hard.)

so my suggestions center around the notion of extending kernel threads
to support the features you find important in fibrils:

> would it be hard to redo your AIO patches based on a pool of plain
> simple kernel threads?

> We could even extend the scheduling properties of kernel threads so
> that they could also be 'companion threads' of any given user-space
> task. (i.e. they'd always schedule on the same CPU as that user-space
> task)

> I bet most of the real benefit would come from co-scheduling them on
> the same CPU. But this should be a performance property, not a basic
> design property. (And i also think that having a limited per-CPU pool
> of AIO threads works better than having a per-user-thread pool - but
> again this is a detail that can be easily changed, not a fundamental
> design property.)

but i'm willing to be convinced of the opposite as well, as always. (I'm
real good at quickly changing my mind, especially when i'm embarrassingly
wrong about something. So please fire away and dont hold back.)

        Ingo
-
Christoph Hellwig
Newsgroups: fa.linux.kernel
From: Christoph Hellwig <h...@infradead.org>
Date: Thu, 01 Feb 2007 13:19:43 UTC
Local: Thurs, Feb 1 2007 8:19 am
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Thu, Feb 01, 2007 at 02:02:34PM +0100, Ingo Molnar wrote:
> what i dont really like is /the particular/ concept above - the
> introduction of 'fibrils' as a hard distinction of kernel threads. They
> are /almost/ kernel threads, but still by being different they create
> a lot of duplication and miss out on a good deal of features that kernel
> threads have naturally.

> It kind of hurts to say this because i'm usually quite concept-happy -
> one can easily get addicted to the introduction of new core kernel
> concepts :-) But i really, really think we dont want to do fibrils but
> we want to do kernel threads, and i havent really seen a discussion
> about why they shouldnt be done via kernel threads.

I tend to agree.  Note that one thing we should be doing one day
(not only if we want to use it for aio) is to make kernel threads
more lightweight.  There is a lot of baggage we keep around in task_struct
and co that only makes sense for threads that have a user space part and
isn't or shouldn't be needed for a purely kernel-resident thread.
-
Ingo Molnar
Newsgroups: fa.linux.kernel
From: Ingo Molnar <mi...@elte.hu>
Date: Thu, 01 Feb 2007 13:55:32 UTC
Local: Thurs, Feb 1 2007 8:55 am
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

* Christoph Hellwig <h...@infradead.org> wrote:

> I tend to agree.  Note that one thing we should be doing one day
> (not only if we want to use it for aio) is to make kernel
> threads more lightweight.  There is a lot of baggage we keep around in
> task_struct and co that only makes sense for threads that have a user
> space part and isn't or shouldn't be needed for a purely
> kernel-resident thread.

yeah. I'm totally open to such efforts. I'd also be most happy if this
was primarily driven via the KAIO effort: i.e. to implement it via
kernel threads and then to benchmark the hell out of it. I volunteer to
fix whatever fat kernel thread handling has left.

and if people agree with me that 'native' state-machine driven KAIO is
what we ultimately want to achieve (it is certainly the best performing
implementation) then i dont see the point in fibrils as an interim
mechanism anyway. Lets just hide AIO complexities from userspace via
kernel threads, and optimize this via two methods: by making kernel
threads faster, and by simultaneously and gradually converting as much
KAIO code to a native state machine - which would not need any kind of
kernel thread help anyway.

(plus as i mentioned previously, co-scheduling kernel threads with
related user space threads on the same CPU might be something useful too
- not just for KAIO, and we could add that too.)

also, we context-switch kernel threads in 350 nsecs on current hardware
and the -rt kernel is certainly happy with that and runs all hardirqs
and softirqs in separate kernel thread contexts. There's not /that/ much
fat left to cut off - and if there's something more to optimize there
then there are a good number of projects interested in that, not just
the KAIO effort :)

        Ingo
-
Mark Lord
Newsgroups: fa.linux.kernel
From: Mark Lord <l...@rtr.ca>
Date: Thu, 01 Feb 2007 17:14:37 UTC
Local: Thurs, Feb 1 2007 12:14 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

Ingo Molnar wrote:

> also, we context-switch kernel threads in 350 nsecs on current hardware
> and the -rt kernel is certainly happy with that and runs all hardirqs

Ingo, how relevant is that "350 nsecs on current hardware" claim?

I don't mean that in a bad way, but my own experience suggests that
most people doing real hard RT (or tight soft RT) are not doing it
on x86 architectures.  But rather on lowly 1GHz (or less) ARM based
processors and the like.

For RT issues, those are the platforms I care more about,
as those are the ones that get embedded into real-time devices.

??

Cheers
-
Ingo Molnar
Newsgroups: fa.linux.kernel
From: Ingo Molnar <mi...@elte.hu>
Date: Thu, 01 Feb 2007 18:05:46 UTC
Local: Thurs, Feb 1 2007 1:05 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

* Mark Lord <l...@rtr.ca> wrote:

> >also, we context-switch kernel threads in 350 nsecs on current
> >hardware and the -rt kernel is certainly happy with that and runs all
> >hardirqs

> Ingo, how relevant is that "350 nsecs on current hardware" claim?

> I don't mean that in a bad way, but my own experience suggests that
> most people doing real hard RT (or tight soft RT) are not doing it on
> x86 architectures.  But rather on lowly 1GHz (or less) ARM based
> processors and the like.

it's not relevant to those embedded boards, but it's relevant to the AIO
discussion, which centers around performance.

> For RT issues, those are the platforms I care more about, as those are
> the ones that get embedded into real-time devices.

yeah. Nevertheless if you want to use -rt on your desktop (under Fedora
4/5/6) you can track an rpmized+distroized full kernel package quite
easily, via 3 easy commands:

   cd /etc/yum.repos.d
   wget http://people.redhat.com/~mingo/realtime-preempt/rt.repo

   yum install kernel-rt.x86_64   # on x86_64
   yum install kernel-rt          # on i686

which is closely tracking latest upstream -git. (for example, the
current kernel-rt-2.6.20-rc7.1.rt3.0109.i686.rpm is based on
2.6.20-rc7-git1, so if you want to run a kernel rpm that has all of
Linus' latest commits from yesterday, this might be for you.)

it's rumored to be a quite smooth kernel ;-) So in this sense, because
this also runs on all my testboxes by default, it matters on modern
hardware too, at least to me. Today's commodity hardware is tomorrow's
embedded hardware. If a kernel is good on today's colorful desktop
hardware then it will be perfect for tomorrow's embedded hardware.

        Ingo
-
Linus Torvalds
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Thu, 01 Feb 2007 20:08:46 UTC
Local: Thurs, Feb 1 2007 3:08 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Thu, 1 Feb 2007, Ingo Molnar wrote:

> there's almost no scheduling cost from being able to arbitrarily
> schedule a kernel thread - but there are /huge/ benefits in it.

That's a singularly *stupid* argument.

Of course scheduling is fast. That's the whole *point* of fibrils. They
still schedule. Nobody claimed anything else.

Bringing up RT kernels and scheduling latency is idiotic. It's like saying
"we should do this because the sky is blue". Sure, that's true, but what
the *hell* does Rayleigh scattering have to do with anything?

The cost has _never_ been scheduling. That was never the point. Why do you
even bring it up? Only to make an argument that makes no sense?

The cost of AIO is

 - maintenance. It's a separate code-path, and it's one that simply doesn't
   fit into anything else AT ALL. It works (mostly) for simple things, ie
   reads and writes, but even there, it's really adding a lot of crud that
   we could do without.

 - setup and teardown costs: both in CPU and in memory. These are the big
   costs. It's especially true since a lot of AIO actually ends up cached.
   The user program just wants the data - 99% of the time it's likely to
   be there, and the whole point of AIO is to get at it cheaply, but not
   block if it's not there.

So your scheduling arguments are inane. They totally miss the point. They
have nothing to do with *anything*.

Ingo: everybody *agrees* that scheduling is cheap. Scheduling isn't the
issue. Scheduling isn't even needed in the perfect path where the AIO
didn't need to do any real IO (and that _is_ the path we actually would
like to optimize most).

So instead of talking about totally irrelevant things, please keep your
eyes on the ball.

So I claim that the ball is here:

 - cached data (and that is *especially* true of some of the more
   interesting things we can do with a more generic AIO thing: path
   lookup, inode filling (stat/fstat) etc usually has hit-rates in the 99%
   range, but missing even just 1% of the time can be deadly, if the miss
   costs you a hundred msec of not doing anything else!)

   Do the math. A "stat()" system call generally takes on the order of a
   couple of microseconds. But if it misses even just 1% of the time (and
   takes 100 msec when it does that, because there is other IO also
   competing for the disk arm), ON AVERAGE it takes 1ms.

   So what you should aim for is improving that number. The cached case
   should hopefully still be in the microseconds, and the uncached case
   should be nonblocking for the caller.

 - setup/teardown costs. Both memory and CPU. This is where the current
   threads simply don't work. The setup cost of doing a clone/exit is
   actually much higher than the cost of doing the whole operation, most
   of the time. Remember: caches still work.

 - maintenance. Clearly AIO will always have some special code, but if we
   can move the special code *away* from filesystems and networking and
   all the thousands of device drivers, and into core kernel code, we've
   done something good. And if we can extend it from just pure read/write
   into just about *anything*, then people will be happy.

So stop blathering about scheduling costs, RT kernels and interrupts.
Interrupts generally happen a few thousand times a second. This is
something you want to do a *million* times a second, without any IO
happening at all except for when it has to.

                        Linus
-
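The "do the math" step in the message above can be checked with a few lines of C. This is only an illustrative sketch: the function name `expected_latency_us` is made up here, and the 2 microsecond hit cost is an assumed reading of "a couple of microseconds"; the 1% miss rate and 100 msec miss cost are the figures quoted in the post.

```c
#include <assert.h>

/*
 * Expected cost of one call, in microseconds, given the probability of a
 * cache miss and the cost of the hit path and the miss path.
 * (Hypothetical helper for illustration; not kernel code.)
 */
double expected_latency_us(double miss_rate, double hit_us, double miss_us)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    /* Weighted average of the two paths. */
    return (1.0 - miss_rate) * hit_us + miss_rate * miss_us;
}
```

Calling `expected_latency_us(0.01, 2.0, 100000.0)` gives roughly 1002 microseconds, i.e. about 1 ms: the rare 100 msec miss dominates the average, even though 99% of the calls finish in microseconds.
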
Zach Brown
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Thu, 01 Feb 2007 21:53:25 UTC
Local: Thurs, Feb 1 2007 4:53 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> let me clarify this: i very much like your AIO patchset in general, in
> the sense that it 'completes' the AIO implementation: finally everything
> can be done via it, greatly increasing its utility and hopefully its
> penetration. This is the most important step, by far.

We violently agree on this :).

> what i dont really like is /the particular/ concept above - the
> introduction of 'fibrils' as a hard distinction of kernel threads. They
> are /almost/ kernel threads, but still by being different they create
> a lot of duplication and miss out on a good deal of features that kernel
> threads have naturally.

I might quibble with some of the details, but I understand your  
fundamental concern.  I do.  I don't get up each morning *thrilled*  
by the idea of having to update lockdep, sysrq-t, etc, to understand  
these fibril things :).  The current fibril switch isn't nearly as  
clever as the lock-free task scheduling switch.  It'd be nice if we  
didn't have to do that work to optimize the hell out of it, sure.

> It kind of hurts to say this because i'm usually quite concept-happy -
> one can easily get addicted to the introduction of new core kernel
> concepts :-)

:)

> so my suggestions center around the notion of extending kernel threads
> to support the features you find important in fibrils:

>> would it be hard to redo your AIO patches based on a pool of plain
>> simple kernel threads?

It'd certainly be doable to throw together a credible attempt to  
service "asys" system call submission with full-on kernel threads.  
That seems like reasonable due diligence to me.  If full-on threads  
are almost as cheap, great.  If fibrils are so much cheaper that they  
seem to warrant investing in, great.

I am concerned about the change in behaviour if we fall back to full  
kernel threads, though.  I really, really, want aio syscalls to  
behave just like sync ones.

Would your strategy be to update the syscall implementations to share  
data in task_struct so that there isn't as significant a change in  
behaviour?  (sharing current->ioprio, instead of just inheriting it,
for example.).  We'd be betting that there would be few of these and  
that they'd be pretty reasonable to share?

- z
-
Benjamin LaHaise
Newsgroups: fa.linux.kernel
From: Benjamin LaHaise <b...@kvack.org>
Date: Thu, 01 Feb 2007 22:24:38 UTC
Local: Thurs, Feb 1 2007 5:24 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Thu, Feb 01, 2007 at 01:52:13PM -0800, Zach Brown wrote:
> >let me clarify this: i very much like your AIO patchset in general, in
> >the sense that it 'completes' the AIO implementation: finally everything
> >can be done via it, greatly increasing its utility and hopefully its
> >penetration. This is the most important step, by far.

> We violently agree on this :).

There is also the old kernel_thread based method that should probably be
compared, especially if pre-created threads are thrown into the mix.  Also,
since the old days, a lot of thread scaling issues have been fixed that
could even make userland threads more viable.

> Would your strategy be to update the syscall implementations to share  
> data in task_struct so that there isn't as significant a change in  
> behaviour?  (sharing current->ioprio, instead of just inheriting it,
> for example.).  We'd be betting that there would be few of these and  
> that they'd be pretty reasonable to share?

Priorities cannot be shared, as they have to adapt to the per-request
priority when we get down to the nitty gritty of POSIX AIO, as otherwise
realtime issues like keepalive transmits will be handled incorrectly.

                -ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <d...@kvack.org>.
-
Zach Brown
Newsgroups: fa.linux.kernel
From: Zach Brown <zach.br...@oracle.com>
Date: Thu, 01 Feb 2007 22:39:08 UTC
Local: Thurs, Feb 1 2007 5:39 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> Priorities cannot be shared, as they have to adapt to the per-request
> priority when we get down to the nitty gritty of POSIX AIO, as otherwise
> realtime issues like keepalive transmits will be handled incorrectly.

Well, maybe not *blind* sharing.  But something more than the  
disconnect threads currently have with current->ioprio.

Today an existing kernel thread would most certainly ignore a  
sys_ioprio_set() in the submitter and then handle an aio syscall with  
an old current->ioprio.

Something more smart than that is all I'm on about.

- z
-
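The stale-ioprio concern in the two messages above can be modelled in user space. The sketch below is not kernel code and none of these names exist in the kernel: `task_model`, `aio_request_model`, `aio_submit_model` and `aio_service_model` are hypothetical stand-ins. It assumes one possible policy, snapshotting the submitter's priority into each request, which is roughly the "something smarter than blind sharing" being asked for while still allowing a per-request priority.

```c
#include <assert.h>

/* Hypothetical, heavily simplified stand-ins for task_struct and a request. */
struct task_model {
    int ioprio;
};

struct aio_request_model {
    int ioprio;     /* priority captured when the request was submitted */
};

/* Snapshot the submitter's current priority into the request. */
void aio_submit_model(const struct task_model *submitter,
                      struct aio_request_model *req)
{
    assert(submitter && req);
    req->ioprio = submitter->ioprio;
}

/*
 * A pooled worker services the request. It temporarily adopts the
 * request's priority instead of using its own, possibly stale, value,
 * then restores its own. Returns the priority the operation ran with.
 */
int aio_service_model(struct task_model *worker,
                      const struct aio_request_model *req)
{
    int saved, used;

    assert(worker && req);
    saved = worker->ioprio;
    worker->ioprio = req->ioprio;
    used = worker->ioprio;      /* the blocking syscall would run here */
    worker->ioprio = saved;
    return used;
}
```

With this policy a priority change made by the submitter before submission is honoured by the worker, and a later change does not retroactively affect requests already queued.
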
Ingo Molnar
Newsgroups: fa.linux.kernel
From: Ingo Molnar <mi...@elte.hu>
Date: Fri, 02 Feb 2007 10:51:18 UTC
Local: Fri, Feb 2 2007 5:51 am
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

* Linus Torvalds <torva...@linux-foundation.org> wrote:

> So stop blathering about scheduling costs, RT kernels and interrupts.
> Interrupts generally happen a few thousand times a second. This is
> something you want to do a *million* times a second, without any IO
> happening at all except for when it has to.

we might be talking past each other.

i never suggested every aio op should create/destroy a kernel thread!

My only suggestion was to have a couple of transparent kernel threads
(not fibrils) attached to a user context that does asynchronous
syscalls! Those kernel threads would be 'switched in' if the current
user-space thread blocks - so instead of having to 'create' any of them
- the fast path would be to /switch/ them to under the current
user-space, so that user-space processing can continue under that other
thread!

That means that in the 'switch kernel context' fastpath it simply needs
to copy the blocked thread's user-space ptregs (~64 bytes) to its own
kernel stack, and then it can do a return-from-syscall without
user-space noticing the switch! Never would we really see the cost of
kernel thread creation. We would never see that cost in the fully cached
case (no other thread is needed then), nor would we see it in the
blocking-IO case, due to pooling. (there are some other details related
to things like the FPU context, but you get the idea.)

Let me quote Zach's reply to my suggestions:

| It'd certainly be doable to throw together a credible attempt to
| service "asys" system call submission with full-on kernel threads.
| That seems like reasonable due diligence to me.  If full-on threads
| are almost as cheap, great. If fibrils are so much cheaper that they
| seem to warrant investing in, great.

that's all i wanted to see being considered!

Please ignore my points about scheduling costs - i only talked about
them at length because the only fundamental difference between kernel
threads and fibrils is their /scheduling/ properties. /Not/ the
setup/teardown costs - those are not relevant /precisely/ because they
can be pooled and because they happen relatively rarely, compared to the
cached case. The 'switch to the blocked thread's ptregs' operation also
involves a context-switch under this design. That's why i was talking
about scheduling so much: the /only/ true difference between fibrils and
kernel threads is their /scheduling/.

I believe this is the point where your argument fails:

> - setup/teardown costs. Both memory and CPU. This is where the current
>   threads simply don't work. The setup cost of doing a clone/exit is
>   actually much higher than the cost of doing the whole operation,
>   most of the time.

you are comparing apples to oranges - i never said we should
create/destroy a kernel thread for every async op. That would be insane!

what we need to support asynchronous system-calls is the ability to pick
up an /already created/ kernel thread from a pool of per-task kernel
threads and to switch it to under the current user-space and return to
the user-space stack with that new kernel thread running. (The other,
blocked kernel thread stays blocked and is returned into the pool of
'pending' AIO kernel threads.) And this only needs to happen in the
'cachemiss' case anyway. In the 'cached' case no other kernel thread
would be involved at all, the current one just falls straight through
the system-call.

my argument is that the whole notion of cutting this at the kernel stack
and thread info level and making fibrils in essence a separate
scheduling entity is wrong, wrong, wrong. Why not use plain kernel
threads for this?

[ finally, i think you totally ignored my main argument, state machines.
  The networking stack is a full and very nice state machine. It's
  kicked from user-space, and zillions of small contexts (sockets) are
  living on without any of the originating tasks having to be involved.
  So i'm still holding to the fundamental notion that within the kernel
  this form of AIO is a nice but /secondary/ mechanism. If a subsystem
  is able to pull it off, it can implement asynchrony via a state
  machine - and it will outperform any thread based AIO. Or not. We'll
  see. For something like the VFS i doubt we'll see (and i doubt we
  /want/ to see) a 'native' state-machine implementation.

  this is btw. quite close to the Tux model of doing asynchronous block
  IO and asynchronous VFS events such as asynchronous open(). Tux uses a
  pool of kernel threads to pass blocking work to, while not holding up
  the 'main' thread. But the main Tux speedup comes from having a native
  state machine for all the networking IO. ]

        Ingo
-
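The pooled-thread scheme described in the message above can be sketched as a small user-space model. Everything here is hypothetical and for illustration only: `user_regs_model`, `kt_pool` and the `pool_*` functions are invented names, the pool size of 4 is arbitrary, and the 16 x 32-bit register block is just an assumed way to get the "~64 bytes" of ptregs mentioned in the post. Real context switching, FPU state and locking are ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 4

/* Stand-in for the saved user-space registers (16 * 4 = 64 bytes). */
struct user_regs_model {
    uint32_t r[16];
};

enum kt_state { KT_IDLE = 0, KT_RUNNING, KT_BLOCKED };

struct kthread_model {
    enum kt_state state;
    struct user_regs_model regs;
};

struct kt_pool {
    struct kthread_model t[POOL_SIZE];
    int current;                /* thread running under user space */
};

/* All threads are pre-created; thread 0 starts out running. */
void pool_init(struct kt_pool *p)
{
    memset(p, 0, sizeof(*p));   /* KT_IDLE is 0, so the rest are idle */
    p->t[0].state = KT_RUNNING;
    p->current = 0;
}

/*
 * Cache-miss path: the current thread is about to block. Pick an idle
 * pooled thread, hand it the user-space registers, and let it return to
 * user space. Returns the new thread's index, or -1 if the pool is empty.
 */
int pool_switch_on_block(struct kt_pool *p)
{
    int old = p->current;

    for (int i = 0; i < POOL_SIZE; i++) {
        if (p->t[i].state == KT_IDLE) {
            p->t[i].regs = p->t[old].regs;  /* the ~64 byte copy */
            p->t[old].state = KT_BLOCKED;   /* stays pending on its IO */
            p->t[i].state = KT_RUNNING;
            p->current = i;
            return i;
        }
    }
    return -1;
}

/* The blocked thread's IO finished: it goes back to the idle pool. */
void pool_complete(struct kt_pool *p, int idx)
{
    assert(p->t[idx].state == KT_BLOCKED);
    p->t[idx].state = KT_IDLE;
}
```

In the fully cached case `pool_switch_on_block()` is never called, which matches the claim that no second thread is involved unless the call actually blocks.
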
Andi Kleen
Newsgroups: fa.linux.kernel
From: Andi Kleen <a...@suse.de>
Date: Fri, 02 Feb 2007 12:22:26 UTC
Local: Fri, Feb 2 2007 7:22 am
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

Ingo Molnar <mi...@elte.hu> writes:
> and for one of the most important IO
> disciplines, networking, that is reality already.

Not 100% -- a few things in TCP/IP at least are blocking still.
Mostly relatively obscure things though.

Also the sockets model is currently incompatible with direct zero-copy RX/TX,
which needs fixing.

-Andi


 
Andi Kleen
Newsgroups: fa.linux.kernel
From: Andi Kleen <a...@suse.de>
Date: Fri, 02 Feb 2007 12:23:59 UTC
Local: Fri, Feb 2 2007 7:23 am
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

Christoph Hellwig <h...@infradead.org> writes:

> I tend to agree.  Note that one thing we should be doing one
> day (not only if we want to use it for aio) is to make kernel threads
> more lightweight.  There is a lot of baggage we keep around in task_struct
> and co that only makes sense for threads that have a user space part and
> isn't or shouldn't be needed for a purely kernel-resident thread.

I suspect you will get a lot of this for free from the current namespace
efforts.

-Andi


 
Linus Torvalds
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Fri, 02 Feb 2007 15:58:10 UTC
Local: Fri, Feb 2 2007 10:58 am
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, 2 Feb 2007, Ingo Molnar wrote:

> My only suggestion was to have a couple of transparent kernel threads
> (not fibrils) attached to a user context that does asynchronous
> syscalls! Those kernel threads would be 'switched in' if the current
> user-space thread blocks - so instead of having to 'create' any of them
> - the fast path would be to /switch/ them to under the current
> user-space, so that user-space processing can continue under that other
> thread!

But in that case, you really do end up with "fibrils" anyway.

Because those fibrils are what would be the state for the blocked system
calls when they aren't scheduled.

We may have a few hundred thousand system calls a second (maybe that's not
actually reasonable, but it should be what we *aim* for), and 99% of them
will hopefully hit the cache and never need any separate IO, but even if
it's just 1%, we're talking about thousands of threads.

I do _not_ think that it's reasonable to have thousands of threads' state
around just "in case". Especially if all those threadlets are then
involved in signals etc - something that they are totally uninterested in.

I think it's a lot more reasonable to have just the kernel stack page for
"this was where I was when I blocked". IOW, a fibril-like thing. You need
some data structure to set up the state *before* you start doing any
threads at all, because hopefully the operation will be totally
synchronous, and no secondary thread is ever really needed!

What I like about fibrils is that they should be able to handle the cached
case well: the case where no "real" scheduling (just the fibril stack
switches) takes place.

Now, most traditional DB loads would tend to use AIO only when they "know"
that real IO will take place (the AIO call itself will probably be
O_DIRECT most of the time). So I suspect that a lot of those users will
never really have the cached case, but one of my hopes is to be able to do
exactly the things that we have *not* done well: asynchronous file opens
and pathname lookups, which is very common in a file server.

If done *really* right, a perfectly normal app could do things like
asynchronous stat() calls to fill in the readdir results. In other words,
what *I* would like to see is the ability to have something *really*
simple like "ls" use this, without it actually being a performance hit
for the common case where everything is cached.

Have you done "ls -l" on a big uncached directory where the inodes
are all over the disk lately? You can hear the disk whirr. THAT is the
kind of "normal user" thing I'd like to be able to fix, and the db case is
actually secondary. The DB case is much much more limited (ok, so somebody
pointed out that they want slightly more than just read/write, but
still.. We're talking "special code".)

> [ finally, i think you totally ignored my main argument, state machines.

I ignored your argument, because it's not really relevant. The fact that
networking (and TCP in particular) has state machines is because it is a
packetized environment. Nothing else is. Think pathname lookup etc. They
are all *fundamentally* environments with a call stack.

So the state machine argument is totally bogus - it results in a
programming model that simply doesn't match the *normal* setup. You want
the kernel programming model to appear "linear" even when it isn't,
because it's too damn hard to think nonlinearly.

Yes, we could do pathname lookup with that kind of insane setup too. But
it would be HORRID!

                        Linus


 
Alan
Newsgroups: fa.linux.kernel
From: Alan <a...@lxorguk.ukuu.org.uk>
Date: Fri, 02 Feb 2007 19:48:44 UTC
Local: Fri, Feb 2 2007 2:48 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
This one got shelved while I sorted other things out as it warranted a
longer look. Some comments follow, but firstly can we please bury this
"fibril" name. The constructs Zach is using appear to be identical to
co-routines, and they've been called that in computer science literature
for fifty years. They are one of the great and somehow forgotten ideas.
(and I admit I've used them extensively in past things where it's
wonderful for multi-player gaming so I'm a convert already).

The stuff however isn't as free as you make out. Current kernel logic
knows about various things being "safe" but with fibrils you have to
address additional questions such as "What happens if I issue an I/O and
change priority". You also have an 800lb gorilla hiding behind a tree
waiting for you in privilege and permission checking.

Right now current->*u/gid is safe across a syscall start to end, with an
asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do
this, in fact we'd be able to do some of the utterly moronic POSIX thread
uid handling in kernel space if we did, just that it isn't free. We have
locking rules defined by the magic serializing construct called
"the syscall" and you break those.

I'd expect the odd other gorilla waiting to mug you as well and the ones
nobody has thought of will be the worst 8)

The number of co-routines and stacks can be dealt with two ways - you use
small stacks allocated when you create a fibril, or you grab a page, use
separate IRQ stacks and either fail creation with -ENOBUFS etc which
drops work on user space, or block (for which cases ??) which also means
an overhead on co-routine exits. That can be tunable, for embedded easily
tuned right down.

Traditional co-routines have clear notions of being able to create a
co-routine, stack them and fire up specific ones. In part this is done
because many things expressed in this way know what to fire up next. It's
also a very clean way to express driver problems with a lot of state.

Essentially, a co-routine simply makes "%esp" roughly the same as
the C++ world's "this".

You get some other funny things from co-routines which are very powerful,
very dangerous, or plain insane depending upon your view of life. One big
one is the ability for real men (and women) to do stuff like this,
because you don't need to keep the context attached to the same task.

        send_reset_command(dev);
        wait_for_irq_event(dev->irq);
        /* co-routine continues in IRQ context here */
        clean_up_reset_command(dev);
        exit_irq_event();
        /* co-routine continues out of IRQ context here */
        send_identify_command(dev);

Notice we just dealt with all the IRQ stack problems the moment an IRQ is
a co-routine transfer 8)

Ditto with timers, although for the kernel that might not be smart as we
have a lot of timers.

Less insanely you can create a context, start doing stuff in it and then
pass it to someone else local variables, state and all. This one is
actually rather useful for avoiding a lot of the 'D' state crap in the
kernel.

For example we have driver code that sleeps uninterruptibly because it's
too hard to undo the mess and get out of the current state if it is
interrupted. In the world of sending other people co-routines you just do
this

        coroutine_set(MUST_COMPLETE);

and in exit

        foreach(coroutine)
                if(coroutine->flags & MUST_COMPLETE)
                        inherit_coroutine(init, coroutine);

and obviously you don't pass any over that will then not do the right
thing before accessing user space (well unless implementing
'read_for_someone_else()' or other strange syscalls - like ptrace...)
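
A minimal sketch of that exit-time hand-off, with every name hypothetical: the struct layouts, the list handling and the return value are assumptions made for illustration, and nothing here is taken from the patch series. Co-routines flagged MUST_COMPLETE are relinked onto init's list; the rest are simply unlinked, where a real implementation would have to tear them down.

```c
#include <assert.h>
#include <stddef.h>

#define MUST_COMPLETE 0x1u

/* Hypothetical bookkeeping: each task owns a singly linked list of co-routines. */
struct coroutine {
	unsigned flags;
	struct coroutine *next;
};

struct task {
	struct coroutine *coroutines;
};

/* Put co at the head of new_owner's list. */
static void inherit_coroutine(struct task *new_owner, struct coroutine *co)
{
	co->next = new_owner->coroutines;
	new_owner->coroutines = co;
}

/*
 * Called when 'exiting' goes away. Returns how many co-routines were
 * handed to 'init' so they can run to completion.
 */
int task_exit_coroutines(struct task *exiting, struct task *init)
{
	struct coroutine *co = exiting->coroutines;
	int handed = 0;

	assert(exiting != init);
	exiting->coroutines = NULL;

	while (co != NULL) {
		struct coroutine *next = co->next; /* save before relinking */

		if (co->flags & MUST_COMPLETE) {
			inherit_coroutine(init, co);
			handed++;
		} else {
			co->next = NULL; /* dropped; teardown would go here */
		}
		co = next;
	}
	return handed;
}
```

The walk saves the next pointer before relinking because inherit_coroutine() overwrites it; the inherited co-routines end up on init's list in reverse order, which does not matter for this purpose.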

Other questions really relate to the scheduling - Zach do you intend
schedule_fibrils() to be a call code would make or just from schedule() ?

Linus will now tell me I'm out of my tree...

Alan (who used to use Co-routines in real languages on 36bit
computers with 9bit bytes before learning C)


 
Linus Torvalds
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Fri, 02 Feb 2007 20:15:16 UTC
Local: Fri, Feb 2 2007 3:15 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, 2 Feb 2007, Alan wrote:

> This one got shelved while I sorted other things out as it warranted a
> longer look. Some comments follow, but firstly can we please bury this
> "fibril" name. The constructs Zach is using appear to be identical to
> co-routines, and they've been called that in computer science literature
> for fifty years. They are one of the great and somehow forgotten ideas.
> (and I admit I've used them extensively in past things where it's
> wonderful for multi-player gaming so I'm a convert already).

Well, they are indeed coroutines, but they are coroutines in the same
sense any "CPU scheduler" ends up being a coroutine.

They are NOT the generic co-routine that some languages support natively.
So I think trying to call them coroutines would be even more misleading
than calling them fibrils.

In other words the whole *point* of the fibril is that you can do
"coroutine-like stuff" while using a "normal functional linear programming
paradigm".

Wouldn't you agree?

(I love the concept of coroutines, but I absolutely detest what the code
ends up looking like. There's a good reason why people program mostly in
linear flow: that's how people think consciously - even if it's obviously
not how the brain actually works).

And we *definitely* don't want to have a coroutine programming interface
in the kernel. Not in C.

> The stuff however isn't as free as you make out. Current kernel logic
> knows about various things being "safe" but with fibrils you have to
> address additional questions such as "What happens if I issue an I/O and
> change priority". You also have an 800lb gorilla hiding behind a tree
> waiting for you in privilege and permission checking.

This is why I think it should be 100% clear that things happen in process
context. That just answers everything. If you want to synchronize with
async events and change IO priority, you should do exactly that:

        wait_for_async();
        ioprio(newpriority);

and that "solves" that problem. Leave it to user space.

> Right now current->*u/gid is safe across a syscall start to end, with an
> asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do
> this, in fact we'd be able to do some of the utterly moronic POSIX thread
> uid handling in kernel space if we did, just that it isn't free. We have
> locking rules defined by the magic serializing construct called
> "the syscall" and you break those.

I agree. As mentioned, we probably will have fallout.

> The number of co-routines and stacks can be dealt with two ways - you use
> small stacks allocated when you create a fibril, or you grab a page, use
> separate IRQ stacks and either fail creation with -ENOBUFS etc which
> drops work on user space, or block (for which cases ??) which also means
> an overhead on co-routine exits. That can be tunable, for embedded easily
> tuned right down.

Right. It should be possible to just say "use a max parallelism factor of
5", and if somebody submits a hundred AIO calls and they all block, when
it hits #6, it will just do it synchronously.
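
A minimal sketch of that submission policy, assuming hypothetical names throughout (async_ctx, async_submit, async_complete); it is not code from the patch series. Whether a call would block is passed in as a flag here, whereas in the proposal the scheduler would discover it. A call that hits the cache finishes inline, a blocking call is parked while fewer than max_parallel calls are outstanding, and anything past the cap runs synchronously.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of submitting one asynchronous call. */
enum submit_result {
	DONE_INLINE,	/* hit the cache: finished without blocking */
	PARKED,		/* blocked: state parked, caller keeps running */
	DONE_SYNC	/* blocked at the cap: caller waited for it */
};

struct async_ctx {
	unsigned max_parallel;	/* the "max parallelism factor" */
	unsigned in_flight;	/* blocked calls currently parked */
};

void async_ctx_init(struct async_ctx *ctx, unsigned max_parallel)
{
	assert(ctx != NULL);
	ctx->max_parallel = max_parallel;
	ctx->in_flight = 0;
}

/* would_block stands in for "the operation had to wait for IO". */
enum submit_result async_submit(struct async_ctx *ctx, int would_block)
{
	if (!would_block)
		return DONE_INLINE;	/* cached case: no extra state at all */

	if (ctx->in_flight < ctx->max_parallel) {
		ctx->in_flight++;	/* park it and return to the caller */
		return PARKED;
	}
	return DONE_SYNC;		/* over the cap: just do it synchronously */
}

/* One parked call finished; its slot becomes available again. */
void async_complete(struct async_ctx *ctx)
{
	assert(ctx->in_flight > 0);
	ctx->in_flight--;
}
```

With max_parallel set to 5, the first five blocking submissions are parked and the sixth comes back as DONE_SYNC, while cached submissions never touch the counter.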

Basically, what I'm hoping can come out of this (and this is a simplistic
example, but perhaps exactly *because* of that it hopefully also shows
that we can actually make *simple* interfaces for complex asynchronous
things):

        struct one_entry *prev = NULL;
        struct dirent *de;

        while ((de = readdir(dir)) != NULL) {
                struct one_entry *entry = malloc(..);

                /* Add it to the list, fill in the name */
                entry->next = prev;
                prev = entry;
                strcpy(entry->name, de->d_name);

                /* Do the stat lookup async */
                async_stat(de->d_name, &entry->stat_buf);
        }
        wait_for_async();
        .. Ta-daa! All done ..

and it *should* allow us to do all the stat lookup asynchronously.

Done right, this should basically be no slower than doing it with a real
stat() if everything was cached. That would kind of be the holy grail
here.

> You get some other funny things from co-routines which are very powerful,
> very dangerous, or plain insane

You forgot "very hard to think about".

We DO NOT want coroutines in general. It's clever, but it's
 (a) impossible to do without language support that C doesn't have, or
     some really really horrid macro constructs that really only work for
     very specific and simple cases.
 (b) very non-intuitive unless you've worked with coroutines a lot (and
     almost nobody has)

> Linus will now tell me I'm out of my tree...

I don't think you're wrong in theory, I just think that in practice,
within the confines of (a) existing code, (b) existing languages, and (c)
existing developers, we really REALLY don't want to expose coroutines as
such.

But if you wanted to point out that what we want to do is get the
ADVANTAGES of coroutines, without actually having to program them as such,
then yes, I agree 100%. But we shouldn't call them coroutines, because the
whole point is that as far as the user interface is concerned, they don't
look like that. In the kernel, they just look like normal linear
programming.

                Linus


 
Davide Libenzi
Newsgroups: fa.linux.kernel
From: Davide Libenzi <davi...@xmailserver.org>
Date: Fri, 02 Feb 2007 21:06:53 UTC
Local: Fri, Feb 2 2007 4:06 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, 2 Feb 2007, Linus Torvalds wrote:
> > You get some other funny things from co-routines which are very powerful,
> > very dangerous, or plain insane

> You forgot "very hard to think about".

> We DO NOT want coroutines in general. It's clever, but it's
>  (a) impossible to do without language support that C doesn't have, or
>      some really really horrid macro constructs that really only work for
>      very specific and simple cases.
>  (b) very non-intuitive unless you've worked with coroutines a lot (and
>      almost nobody has)

Actually, coroutines are not too bad to program once you have a
total-coverage async scheduler to run them. The attached (very sketchy)
example uses libpcl ( http://www.xmailserver.org/libpcl.html ) and epoll
as scheduler (but here you can really use anything). You can implement
coroutines in many way, from C preprocessor macros up to anything, but in
the libpcl case they are simply switched stacks. Like fibrils are supposed
to be. The problem is that in order to make a real-life example of
coroutine-based application work, you need everything that can put you at
sleep (syscalls or any external library call you have no control on)
implemented in an async way. And what I ended up doing is exactly what Zab
did inside the kernel. In my case a dynamic pool of (userspace) threads
servicing any non-native potentially pre-emptive call, and signaling the
result to a pollable fd (pipe in my case) that is integrated in the epoll
(poll/select whatever) scheduler.
I personally find Zab idea a really good one, since it allows for generic
kernel async implementation, w/out the burden of dirtying kernel code
paths with AIO knowledge. Being it fibrils or real kthreads, it is IMO
definitely worth a very close look.
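
A minimal sketch of "simply switched stacks" in userspace. It uses the glibc ucontext calls (getcontext, makecontext, swapcontext) as a stand-in and does not use the libpcl API; the stack size, the trace array and the run_demo() name are assumptions made for illustration. The co-routine gets its own stack, yields once back to the main context, and is later resumed exactly where it left off.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

#define CO_STACK_SIZE (64 * 1024)

ucontext_t main_ctx, co_ctx;
char co_stack[CO_STACK_SIZE];	/* the co-routine's private stack */
int trace[8];			/* order in which the steps ran */
int trace_len;

static void record(int step)
{
	assert(trace_len < 8);
	trace[trace_len++] = step;
}

static void co_body(void)
{
	record(1);
	swapcontext(&co_ctx, &main_ctx);	/* yield back to main */
	record(3);
	/* returning from here resumes uc_link, i.e. main_ctx */
}

/* Runs main and the co-routine interleaved; returns the number of steps. */
int run_demo(void)
{
	trace_len = 0;

	if (getcontext(&co_ctx) == -1)
		return -1;
	co_ctx.uc_stack.ss_sp = co_stack;
	co_ctx.uc_stack.ss_size = sizeof co_stack;
	co_ctx.uc_link = &main_ctx;
	makecontext(&co_ctx, co_body, 0);

	record(0);
	swapcontext(&main_ctx, &co_ctx);	/* run until the first yield */
	record(2);
	swapcontext(&main_ctx, &co_ctx);	/* resume until co_body returns */
	record(4);
	return trace_len;
}
```

Both sides read as ordinary linear code, yet the recorded steps alternate between the two stacks. A scheduler built on epoll or similar would decide when to issue each swapcontext() instead of hard-coding it as done here.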

- Davide

  cotest.c

 
Linus Torvalds
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Fri, 02 Feb 2007 21:10:47 UTC
Local: Fri, Feb 2 2007 4:10 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, 2 Feb 2007, Davide Libenzi wrote:

> Actually, coroutines are not too bad to program once you have a
> total-coverage async scheduler to run them.

No, no, I don't disagree at all. In fact, I agree emphatically.

It's just that you need the scheduler to run them, in order to not "see"
them as coroutines. Then, you can program everything *as*if* it was just a
regular declarative linear language with multiple threads.

And that gets us the same programming interface as we always have, and
people can forget about the fact that in a very real sense, they are using
coroutines with the scheduler just keeping track of it all for them.

After all, that's what we do between processes *anyway*. You can
technically see the kernel as one big program that uses coroutines and the
scheduler just keeping track of every coroutine instance. It's just that I
doubt that any kernel programmer really thinks in those terms. You *think*
in terms of "threads".

                        Linus


 
Alan
Newsgroups: fa.linux.kernel
From: Alan <a...@lxorguk.ukuu.org.uk>
Date: Fri, 02 Feb 2007 21:18:46 UTC
Local: Fri, Feb 2 2007 4:18 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> They are NOT the generic co-routine that some languages support natively.
> So I think trying to call them coroutines would be even more misleading
> than calling them fibrils.

It's actually pretty damned close to the Honeywell B co-routine package, with
a kernel twist to be honest.

> ends up looking like. There's a good reason why people program mostly in
> linear flow: that's how people think consciously - even if it's obviously
> not how the brain actually works).

The IRQ example below is an example of how it linearizes - so it cuts
both ways like most tools, admittedly one of the blades is at the handle
end in this case ...

The brown and sticky will hit the rotating air impeller pretty hard if you
are not very careful about how that ends up scheduled. It's one thing to
exploit the ability to pull all the easy lookups out in advance, and
another having created all the parallelism to turn it into sane disk
scheduling and wakeups without scaling hit. But you do at least have the
opportunity to exploit it I guess.

> > You get some other funny things from co-routines which are very powerful,
> > very dangerous, or plain insane

> You forgot "very hard to think about".

I'm not sure handing a fibril off to another task is that hard to think
about. It's not easy to turn it around as an async_exit() keeping the
other fibrils around because of the mass of rules and behaviours tied to
process exit but it's perhaps not impossible.

Other minor evil. If we use fibrils we need to be careful we
know in advance how many fibrils an operation needs so we don't deadlock
on them in critical places like writeout paths when we either hit the per
task limit or we have no page for another stack.

Alan


 
Linus Torvalds
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Fri, 02 Feb 2007 21:30:55 UTC
Local: Fri, Feb 2 2007 4:30 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, 2 Feb 2007, Alan wrote:

> The brown and sticky will hit the rotating air impeller pretty hard if you
> are not very careful about how that ends up scheduled

Why do you think that?

With cooperative scheduling (like the example Zach posted), there is
absolutely no "brown and sticky" wrt any CPU usage. Which is why
cooperative scheduling is a *good* thing. If you want to blow up your
1024-node CPU cluster, you'd do it with "real threads".

Also, with sane default limits of fibrils per process (say, in the 5-10 range),
it also ends up being good for IO. No "insane" IO bombs, but an easy way
for users to just get a reasonable amount of IO parallelism without
having to use threading (which is hard).

So, best of both worlds.

Yes, *of*course* you want to have limits on outstanding work. And yes, a
database server would set those limits much higher ("Only a thousand
outstanding IO requests? Can we raise that to ten thousand, please?") than
a regular process ("default: 5, and the super-user can raise it for you if
you're good").

But there really shouldn't be any downsides.

(Of course, there will be downsides. I'm sure there will be. But I don't
see any really serious and obvious ones).

> Other minor evil. If we use fibrils we need to be careful we
> know in advance how many fibrils an operation needs so we don't deadlock
> on them in critical places like writeout paths when we either hit the per
> task limit or we have no page for another stack.

Since we'd only create fibrils on a system call entry level, and system
calls are independent, how would you do that anyway?

Once a fibril has been created, it will *never* depend on any other fibril
resources ever again. At least not in any way that any normal non-fibril
call wouldn't already do as far as I can see.

                Linus


 
Ingo Molnar
Newsgroups: fa.linux.kernel
From: Ingo Molnar <mi...@elte.hu>
Date: Fri, 02 Feb 2007 22:28:23 UTC
Local: Fri, Feb 2 2007 5:28 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

* Linus Torvalds <torva...@linux-foundation.org> wrote:

ok, i think i noticed another misunderstanding. The kernel thread based
scheme i'm suggesting would /not/ 'switch' to another kernel thread in
the cached case, by default. It would just execute in the original
context (as if it were a synchronous syscall), and the switch to a
kernel thread from the pool would only occur /if/ the context is about
to block. (this 'switch' thing would be done by the scheduler)
User-space gets back an -EAIO error code immediately and transparently -
but already running under the new kernel thread.

i.e. in the fully cached case there would be no scheduling at all - in
fact no thread pool is needed at all.

regarding cost:

the biggest memory resource cost of a kernel thread (assuming it has no
real user-space context) /is/ its kernel stack page, which is 4K or 8K.
The task struct takes ~1.5K. Once we have a ready kernel thread around,
it's quite cheap to 'flip' it to under any arbitrary user-space context:
change its thread_info->task pointer to the user-space context's task
struct, copy the mm pointer, the fs pointer to the "worker thread",
switch the thread_info, update ptregs - done. Hm?

Note: such a 'flip' would only occur when the original context blocks,
/not/ on every async syscall.

regarding CPU resource costs, i dont think there should be significant
signal overhead, because the original task is still only one instance,
and the kernel thread that is now running with the blocked kernel stack
is not part of the signal set. (Although it might make sense to make
such async syscalls interruptible, just like any syscall.)

The 'pool' of kernel threads doesnt even have to be per-task, it can be
a natural per-CPU thing - and its size will grow/shrink [with a low
update frequency] depending on how much AIO parallelism there is in the
workload. (But it can also be strictly per-user-context - to make sure
that a proper ->mm ->fs, etc. is set up and that when the async system
calls execute they have all the right context info.)

and note the immediate scheduling benefits: if an app (say like
OpenOffice) is single-threaded but has certain common ops coded as async
syscalls, then if any of those syscalls blocks then it could utilize
/more than one/ CPU. I.e. we could 'spread' a single-threaded app's
processing to multiple cores/hardware-threads /without/ having to
multi-thread the app in an intrusive way. I.e. this would be a
finegrained threading of syscalls, executed as coroutines in essence.
With fibrils all sorts of scheduling limitations occur and no
parallelism is possible.

in fact an app could also /trigger/ the execution of a syscall in a
different context - to create parallelism artificially - without any
blocking event. So we could do:

  cookie1 = sys_async(sys_read, params);
  cookie2 = sys_async(sys_write, params);

  [ ... calculation loop ... ]

  wait_on_async_syscall(cookie1);
  wait_on_async_syscall(cookie2);

or something like that. Without user-space having to create threads
itself, etc. So basically, we'd make kernel threads more useful, and
we'd make threading safer - by only letting syscalls thread.

> What I like about fibrils is that they should be able to handle the
> cached case well: the case where no "real" scheduling (just the fibril
> stack switches) takes place.

the cached case (when a system call would not block at all) would not
necessitate any switch to another kernel thread at all - the task just
executes its system call as if it were synchronous!

that's the nice thing: we can do this switch-to-another-kernel-thread
magic thing right in the scheduler when we block - and the switched-to
thread will magically return to user-space (with a -EAIO return code) as
if nothing happened (while the original task blocks). I.e. under this
scheme i'm suggesting we have /zero/ setup cost in the cached case. The
optimistic case just falls through and switches to nothing else. Any
switching cost only occurs in the slowpath - and even that cost is very
low.

once a kernel thread that ran off with the original stack finishes the
async syscall and wants to return the return code, this can be gathered
via a special return-code ringbuffer that notifies finished syscalls. (A
magic cookie is associated to every async syscall.)
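
A minimal sketch of such a return-code ring, assuming a fixed-size single-producer, single-consumer layout; the struct names, RING_SIZE and the -1 "full/empty" convention are all hypothetical, and locking or memory barriers between kernel and user space are left out entirely. Each finished syscall is published as a (cookie, return value) pair that the submitter can match against the cookie it got at submission time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8	/* power of two, so the unsigned counters may wrap */

struct completion {
	uint64_t cookie;	/* identifies the async syscall */
	long retval;		/* what the syscall would have returned */
};

struct completion_ring {
	struct completion slots[RING_SIZE];
	unsigned head;		/* total entries ever written */
	unsigned tail;		/* total entries ever read */
};

void ring_init(struct completion_ring *r)
{
	assert(r != NULL);
	r->head = 0;
	r->tail = 0;
}

/* Producer side: the worker that finished the syscall. Returns -1 if full. */
int ring_push(struct completion_ring *r, uint64_t cookie, long retval)
{
	if (r->head - r->tail == RING_SIZE)
		return -1;
	r->slots[r->head % RING_SIZE].cookie = cookie;
	r->slots[r->head % RING_SIZE].retval = retval;
	r->head++;
	return 0;
}

/* Consumer side: the submitter reaping results. Returns -1 if empty. */
int ring_pop(struct completion_ring *r, struct completion *out)
{
	if (r->head == r->tail)
		return -1;
	*out = r->slots[r->tail % RING_SIZE];
	r->tail++;
	return 0;
}
```

Completions come out in the order they finished, not the order they were submitted, which is why each entry has to carry its cookie.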

> So the state machine argument is totally bogus - it results in a
> programming model that simply doesn't match the *normal* setup. You
> want the kernel programming model to appear "linear" even when it
> isn't, because it's too damn hard to think nonlinearly.

> Yes, we could do pathname lookup with that kind of insane setup too.
> But it would be HORRID!

yeah, but i guess not nearly as horrid as writing a new OS from scratch
;-)

seriously, i very much think and agree that programming state machines
is hard and not desired in most of the kernel. But it can be done, and
sometimes (definitely not in the common case) it's /cleaner/ than
functional programming. I've programmed an HTTP and an FTP in-kernel
server via a state machine and it worked better than i initially
expected. It needs different thinking but there /are/ people around with
that kind of thinking, so we just cannot exclude the possibility. [ It's
just that such people usually dedicate their brain to mental
  fantasies^H^H^Hexercises called 'Higher Mathematics' :-) ]

> [...] The fact that networking (and TCP in particular) has state
> machines is because it is a packetized environment.

rough ballpark figures: for things like webserving or fileserving (or
mailserving), networking sockets are the reason for context-blocking
events in 90% of the cases (mostly due to networking latency). 9% of the
blocking happens due to plain block IO, and 1% happens due to VFS
metadata (inode, directory, etc.) blocking.

( in Tux i had to handle /all/ of these sources of blocking because even
  1% kills your performance if you do a hundred thousand requests per
  second - but in terms of design weight, networking is pretty damn
  important. )

and interestingly, modern IO frameworks tend to gravitate towards a
packetized environment as well. I.e. i don't think state machines are
/that/ unimportant.

        Ingo


 
Alan  
Newsgroups: fa.linux.kernel
From: Alan <a...@lxorguk.ukuu.org.uk>
Date: Fri, 02 Feb 2007 22:37:24 UTC
Local: Fri, Feb 2 2007 5:37 pm
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> > The brown and sticky will hit the rotating air impeller pretty hard if you
> > are not very careful about how that ends up scheduled

> Why do you think that?

> With cooperative scheduling (like the example Zach posted), there is
> absolutely no "brown and sticky" wrt any CPU usage. Which is why
> cooperative scheduling is a *good* thing. If you want to blow up your
> 1024-node CPU cluster, you'd do it with "real threads".

You end up with a lot more things running asynchronously. In the current
world we see a series of requests for attributes and hopefully we do
readahead and all is neatly ordered. If fibrils are not ordered the same
way then we could make it worse as we might not pick the right readahead
for example.

> Since we'd only create fibrils on a system call entry level, and system
> calls are independent, how would you do that anyway?

If we stick to that limit it ought to be ok. We've been busy slapping
people who call sys_*, except for internal magic like kernel_thread

 