There are members of task_struct which are only used by a given call chain to pass arguments up and down the chain itself. They are logically thread-local storage.
The patches later in the series want to have multiple calls pending for a given task, though only one will be executing at a given time. By putting these thread-local members of task_struct in a separate storage structure we're able to trivially swap them in and out as their calls are swapped in and out.
per_call_chain() doesn't have a terribly great name. It was chosen in the spirit of per_cpu().
The storage was left inline in task_struct to avoid introducing indirection for the vast majority of uses which will never have multiple calls executing in a task.
I chose a few members of task_struct to migrate under per_call_chain() along with the introduction as an example of what it looks like. These would be separate patches in a patch series that was suitable for merging.
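To show the shape of the idea before the diff, here is a minimal userspace sketch. The storage structure and the accessor macro mirror the patch below; `struct mock_task`, `current_task`, and `swap_call_chain()` are hypothetical names made up for illustration and are not part of the series.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the structure introduced by the patch. */
struct per_call_chain_storage {
	int link_count;		/* number of links in one symlink */
	int total_link_count;	/* total links followed in a lookup */
	void *journal_info;	/* journalling filesystem info */
};

/* Hypothetical stand-in for task_struct; only the inline storage matters. */
struct mock_task {
	struct per_call_chain_storage per_call;
};

static struct mock_task mock_task0;
/* Stands in for the kernel's 'current'. */
static struct mock_task *current_task = &mock_task0;

/* Same idea as the patch's macro, parenthesized for safety. */
#define per_call_chain(foo) (current_task->per_call.foo)

/*
 * Hypothetical helper: save the running call's storage into 'out' and
 * install the storage of the call that is being swapped in.
 */
static void swap_call_chain(struct per_call_chain_storage *out,
			    const struct per_call_chain_storage *in)
{
	*out = current_task->per_call;
	current_task->per_call = *in;
}
```

Because the storage stays inline, the common case of a single call per task pays no extra pointer dereference; only the swap copies the structure.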
diff -r b1128b48dc99 -r 26e278468209 fs/jbd/journal.c
--- a/fs/jbd/journal.c	Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/jbd/journal.c	Mon Jan 29 15:36:13 2007 -0800
@@ -471,7 +471,7 @@ int journal_force_commit_nested(journal_
 	tid_t tid;
 
 	spin_lock(&journal->j_state_lock);
-	if (journal->j_running_transaction && !current->journal_info) {
+	if (journal->j_running_transaction && !per_call_chain(journal_info)) {
 		transaction = journal->j_running_transaction;
 		__log_start_commit(journal, transaction->t_tid);
 	} else if (journal->j_committing_transaction)
diff -r b1128b48dc99 -r 26e278468209 fs/jbd/transaction.c
--- a/fs/jbd/transaction.c	Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/jbd/transaction.c	Mon Jan 29 15:36:13 2007 -0800
@@ -279,12 +279,12 @@ handle_t *journal_start(journal_t *journ
 	if (!handle)
 		return ERR_PTR(-ENOMEM);

+/*
+ * Members of this structure are used to pass arguments down call chains
+ * without specific arguments.  Historically they lived on task_struct,
+ * putting them in one place gives us some flexibility.  They're accessed
+ * with per_call_chain(name).
+ */
+struct per_call_chain_storage {
+	int link_count;		/* number of links in one symlink */
+	int total_link_count;	/* total links followed in a lookup */
+	void *journal_info;	/* journalling filesystem info */
+};
+
+#define per_call_chain(foo) current->per_call.foo
+
 struct audit_context;		/* See audit.c */
 struct mempolicy;
 struct pipe_inode_info;
@@ -920,7 +934,7 @@ struct task_struct {
 				       it with task_lock())
 				     - initialized normally by flush_old_exec */
 /* file system info */
-	int link_count, total_link_count;
+	struct per_call_chain_storage per_call;
 #ifdef CONFIG_SYSVIPC
 /* ipc stuff */
 	struct sysv_sem sysvsem;
@@ -993,9 +1007,6 @@ struct task_struct {
 	struct held_lock held_locks[MAX_LOCK_DEPTH];
 	unsigned int lockdep_recursion;
 #endif
-
-/* journalling filesystem info */
-	void *journal_info;
 /* VM state */
 	struct reclaim_state *reclaim_state;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
This patch introduces the notion of a 'fibril'. It's meant to be a lighter kernel thread. There can be multiple of them in the process of executing for a given task_struct, but only one can ever be actively running at a time. Think of it as a stack and some metadata for scheduling them inside the task_struct.

This implementation is wildly architecture-specific but isn't put in the right places. Since these are not code paths that I have extensive experience with, I focused more on getting it going and representative of the concept than on making it right on the first try. I'm actively interested in feedback from people who know more about the places this touches.
The fibril struct itself is left stand-alone for clarity. There is a 1:1 relationship between fibrils and struct thread_info, though, so it might make more sense to embed the two somehow.
The use of list_head for the run queue is simplistic. As long as we're not removing specific fibrils from the list, which seems unlikely, we could be more clever. Maybe no more clever than a singly-linked list, though.
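A rough userspace sketch of what the singly-linked variant could look like, assuming the run list only ever needs "add at the tail" and "take from the head". `struct mock_fibril`, `struct fibril_queue`, and the three helpers are hypothetical; nothing like this is in the patch, which uses list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fibril with one forward link instead of a list_head. */
struct mock_fibril {
	struct mock_fibril *next;
	int id;
};

/* FIFO queue: 'tail' points at the last 'next' pointer, or at 'head'. */
struct fibril_queue {
	struct mock_fibril *head;
	struct mock_fibril **tail;
};

static void fibril_queue_init(struct fibril_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* O(1) append; no back pointer is needed. */
static void fibril_queue_add_tail(struct fibril_queue *q, struct mock_fibril *f)
{
	f->next = NULL;
	*q->tail = f;
	q->tail = &f->next;
}

/* O(1) removal of the oldest runnable fibril, or NULL if none. */
static struct mock_fibril *fibril_queue_pop(struct fibril_queue *q)
{
	struct mock_fibril *f = q->head;

	if (!f)
		return NULL;
	q->head = f->next;
	if (!q->head)
		q->tail = &q->head;	/* queue became empty */
	f->next = NULL;
	return f;
}
```

The trade-off is the one named above: this saves a pointer per fibril, but removing an arbitrary fibril from the middle would need a walk from the head.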
Fibril management is under the runqueue lock because that ends up working well for the wake-up path as well. In the current patch, though, it makes for some pretty sloppy code for unlocking the runqueue lock (and re-enabling interrupts and pre-emption) on the other side of the switch.
The actual mechanics of switching from one stack to another at the end of schedule_fibril() makes me nervous. I'm not convinced that blindly copying the contents of thread_info from the previous to the next stack is safe, even if done with interrupts disabled. (NMIs?) The juggling of current->thread_info might be racy, etc.
+/*
+ * We've just switched the stack and instruction pointer to point to a new
+ * fibril.  We were called from schedule() -> schedule_fibril() with the
+ * runqueue lock held _irq and with preemption disabled.
+ *
+ * We let finish_fibril_switch() unwind the state that was built up by
+ * our callers.  We do that here so that we don't need to ask fibrils to
+ * first execute something analogous to schedule_tail().  Maybe that's
+ * wrong.
+ *
+ * We'd also have to reacquire the kernel lock here.  For now we know the
+ * BUG_ON(lock_depth) prevents us from having to worry about it.
+ */
+void fastcall __switch_to_fibril(struct thread_info *ti)
+{
+	finish_fibril_switch();
+
+	/* free the ti if schedule_fibril() told us that it's done */
+	if (ti->status & TS_FREE_AFTER_SWITCH)
+		free_thread_info(ti);
+}
+
 asmlinkage int sys_fork(struct pt_regs regs)
 {
 	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/system.h
--- a/include/asm-i386/system.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/system.h	Mon Jan 29 15:36:16 2007 -0800
@@ -31,6 +31,31 @@ extern struct task_struct * FASTCALL(__s
 		      "=a" (last),"=S" (esi),"=D" (edi) \
 		     :"m" (next->thread.esp),"m" (next->thread.eip), \
 		      "2" (prev), "d" (next)); \
+} while (0)
+
+struct thread_info;
+void fastcall __switch_to_fibril(struct thread_info *ti);
+
+/*
+ * This is called with the run queue lock held _irq and with preemption
+ * disabled.  __switch_to_fibril drops those.
+ */
+#define switch_to_fibril(prev, next, ti) do { \
+	unsigned long esi,edi; \
+	asm volatile("pushfl\n\t"		/* Save flags */ \
+		     "pushl %%ebp\n\t" \
+		     "movl %%esp,%0\n\t"	/* save ESP */ \
+		     "movl %4,%%esp\n\t"	/* restore ESP */ \
+		     "movl $1f,%1\n\t"		/* save EIP */ \
+		     "pushl %5\n\t"		/* restore EIP */ \
+		     "jmp __switch_to_fibril\n" \
+		     "1:\t" \
+		     "popl %%ebp\n\t" \
+		     "popfl" \
+		    :"=m" (prev->esp),"=m" (prev->eip), \
+		     "=S" (esi),"=D" (edi) \
+		    :"m" (next->esp),"m" (next->eip), \
+		     "d" (prev), "a" (ti)); \
 } while (0)

 /* thread information allocation */
@@ -169,6 +175,7 @@ static inline struct thread_info *curren
  */
 #define TS_USEDFPU		0x0001	/* FPU was used by this task this quantum (SMP) */
 #define TS_POLLING		0x0002	/* True if in idle loop and not sleeping */
+#define TS_FREE_AFTER_SWITCH	0x0004	/* free ti in __switch_to_fibril() */
+/*
+ * A 'fibril' is a very small fiber.  It's used here to mean a small thread.
+ *
+ * (Choosing a weird new name avoided yet more overloading of 'task', 'call',
+ * 'thread', 'stack', 'fib{er,re}', etc).
+ *
+ * This structure is used by the scheduler to track multiple executing stacks
+ * inside a task_struct.
+ *
+ * Only one fibril executes for a given task_struct at a time.  When it
+ * blocks, however, another fibril has the chance to execute while it sleeps.
+ * This means that call chains executing in fibrils can see concurrent
+ * current-> accesses at blocking points.  "per_call_chain()" members are
+ * switched along with the fibril, so they remain local.  Preemption *will not*
+ * trigger a fibril switch.
+ *
+ * XXX
+ *  - arch specific
+ */
+struct fibril {
+	struct list_head		run_list;
+	/* -1 unrunnable, 0 runnable, >0 stopped */
+	long				state;
+	unsigned long			eip;
+	unsigned long			esp;
+	struct thread_info		*ti;
+	struct per_call_chain_storage	per_call;
+};
+
+void sched_new_runnable_fibril(struct fibril *fibril);
+void finish_fibril_switch(void);
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -857,6 +889,20 @@ struct task_struct {
 	struct list_head ptrace_list;

 	struct mm_struct *mm, *active_mm;
+
+	/*
+	 * The scheduler uses this to determine if the current call is a
+	 * stand-alone task or a fibril.  If it's a fibril then wake-ups
+	 * will target the fibril and a schedule() might result in swapping
+	 * in another runnable fibril.  So to start executing fibrils at all
+	 * one allocates a fibril to represent the running task and then
+	 * puts initialized runnable fibrils in the run list.
+	 *
+	 * The state members of the fibril and runnable_fibrils list are
+	 * managed under the task's run queue lock.
+	 */
+	struct fibril *fibril;
+	struct list_head runnable_fibrils;

 /* task state */
 	struct linux_binfmt *binfmt;
diff -r 26e278468209 -r df7bc026d50e kernel/exit.c
--- a/kernel/exit.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/exit.c	Mon Jan 29 15:36:16 2007 -0800
@@ -854,6 +854,13 @@ fastcall NORET_TYPE void do_exit(long co
 {
 	struct task_struct *tsk = current;
 	int group_dead;
+
+	/*
+	 * XXX this is just a debug helper, this should be waiting for all
+	 * fibrils to return.  Possibly after sending them lots of -KILL
+	 * signals?
+	 */
+	BUG_ON(!list_empty(&current->runnable_fibrils));
 	/*
 	 * The task hasn't been attached yet, so its cpus_allowed mask will
diff -r 26e278468209 -r df7bc026d50e kernel/sched.c
--- a/kernel/sched.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/sched.c	Mon Jan 29 15:36:16 2007 -0800
@@ -3407,6 +3407,111 @@ static inline int interactive_sleep(enum
 }

 /*
+ * This unwinds the state that was built up by schedule -> schedule_fibril().
+ * The arch-specific switch_to_fibril() path calls here once the new fibril
+ * is executing.
+ */
+void finish_fibril_switch(void)
+{
+	spin_unlock_irq(&this_rq()->lock);
+	preempt_enable_no_resched();
+}
+
+/*
+ * Add a new fibril to the runnable list.  It'll be switched to next time
+ * the caller comes through schedule().
+ */
+void sched_new_runnable_fibril(struct fibril *fibril)
+{
+	struct task_struct *tsk = current;
+	unsigned long flags;
+	struct rq *rq = task_rq_lock(tsk, &flags);
+
+	fibril->state = TASK_RUNNING;
+	BUG_ON(!list_empty(&fibril->run_list));
+	list_add_tail(&fibril->run_list, &tsk->runnable_fibrils);
+
+	task_rq_unlock(rq, &flags);
+}
+
+/*
+ * This is called from schedule() when we're not being preempted and there is a
+ * fibril
...
The addition of multiple sleeping fibrils under a task_struct means that we can't simply wake a task_struct to be able to wake a specific sleeping code path.
This patch introduces task_wake_target() as a way to refer to a code path that is about to sleep and will be woken in the future. Sleepers that used to wake a current task_struct reference with wake_up_process() now use this helper to get a wake target cookie and wake it with wake_up_target().
Some paths know that waking a task will be sufficient. Paths working with kernel threads that never use fibrils fall into this category. They're changed to use wake_up_task() instead of wake_up_process().
This is not an exhaustive patch. It isn't yet clear how signals are going to interact with fibrils. Once that is decided callers of wake_up_state() are going to need to reflect the desired behaviour. I add __deprecated to it to highlight this detail.
The actual act of performing the wake-up is hidden under try_to_wake_up() and is serialized with the scheduler under the runqueue lock. This is very fiddly stuff. I'm sure I've missed some details. I've tried to comment the intent above try_to_wake_up_fibril().
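One way to picture a wake target cookie is as an opaque pointer that names either the task or the specific fibril that is about to sleep. The sketch below is a userspace illustration under that assumption only; the low-bit tagging scheme, `mock_wake_target()`, `mock_wake_up_target()`, and both mock structures are hypothetical and should not be read as the encoding the patch actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; 'woken' replaces real scheduler state. */
struct mock_fibril {
	int woken;
};

struct mock_task {
	int woken;
	struct mock_fibril *fibril;	/* NULL if the task isn't using fibrils */
};

/* Assumed tag: low pointer bit set means "this cookie names a fibril". */
#define WAKE_TARGET_FIBRIL ((uintptr_t)1)

/* Called by a sleeper to get a cookie for the code path about to sleep. */
static void *mock_wake_target(struct mock_task *tsk)
{
	if (tsk->fibril)
		return (void *)((uintptr_t)tsk->fibril | WAKE_TARGET_FIBRIL);
	return tsk;
}

/* Called by the waker; wakes exactly the path that took the cookie. */
static void mock_wake_up_target(void *target)
{
	uintptr_t bits = (uintptr_t)target;

	if (bits & WAKE_TARGET_FIBRIL)
		((struct mock_fibril *)(bits & ~WAKE_TARGET_FIBRIL))->woken = 1;
	else
		((struct mock_task *)target)->woken = 1;
}
```

A caller that knows it is dealing with a plain kernel thread can skip the cookie entirely, which is the role wake_up_task() plays in the description above.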
diff -r df7bc026d50e -r 4ea674e8825e arch/i386/kernel/ptrace.c
--- a/arch/i386/kernel/ptrace.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/arch/i386/kernel/ptrace.c	Mon Jan 29 15:46:47 2007 -0800
@@ -492,7 +492,7 @@ long arch_ptrace(struct task_struct *chi
 		child->exit_code = data;
 		/* make sure the single step bit is not set. */
 		clear_singlestep(child);
-		wake_up_process(child);
+		wake_up_task(child);
 		ret = 0;
 		break;

@@ -508,7 +508,7 @@ long arch_ptrace(struct task_struct *chi
 		child->exit_code = SIGKILL;
 		/* make sure the single step bit is not set. */
 		clear_singlestep(child);
-		wake_up_process(child);
+		wake_up_task(child);
 		break;

 	case PTRACE_SYSEMU_SINGLESTEP: /* Same as SYSEMU, but singlestep if not syscall */
@@ -526,7 +526,7 @@ long arch_ptrace(struct task_struct *chi
 		set_singlestep(child);
 		child->exit_code = data;
 		/* give it a chance to run. */
-		wake_up_process(child);
+		wake_up_task(child);
 		ret = 0;
 		break;

 /*
  * In case of non-aligned buffers, we may need 2 more
diff -r df7bc026d50e -r 4ea674e8825e fs/jbd/journal.c
--- a/fs/jbd/journal.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/jbd/journal.c	Mon Jan 29 15:46:47 2007 -0800
@@ -94,7 +94,7 @@ static void commit_timeout(unsigned long
 {
 	struct task_struct * p = (struct task_struct *) __data;
> This patch introduces the notion of a 'fibril'. It's meant to be a
> lighter kernel thread. [...]
as per my other email, i dont really like this concept. This is the killer:
> [...] There can be multiple of them in the process of executing for a
> given task_struct, but only one can ever be actively running at a
> time. [...]
there's almost no scheduling cost from being able to arbitrarily schedule a kernel thread - but there are /huge/ benefits in it.
would it be hard to redo your AIO patches based on a pool of plain simple kernel threads?
We could even extend the scheduling properties of kernel threads so that they could also be 'companion threads' of any given user-space task. (i.e. they'd always schedule on the same CPU as that user-space task)
I bet most of the real benefit would come from co-scheduling them on the same CPU. But this should be a performance property, not a basic design property. (And i also think that having a limited per-CPU pool of AIO threads works better than having a per-user-thread pool - but again this is a detail that can be easily changed, not a fundamental design property.)
> > This patch introduces the notion of a 'fibril'. It's meant to be a
> > lighter kernel thread. [...]

> as per my other email, i dont really like this concept. This is the
> killer:
let me clarify this: i very much like your AIO patchset in general, in the sense that it 'completes' the AIO implementation: finally everything can be done via it, greatly increasing its utility and hopefully its penetration. This is the most important step, by far.
what i dont really like is /the particular/ concept above - the introduction of 'fibrils' as a hard distinction of kernel threads. They are /almost/ kernel threads, but still by being different they create a lot of duplication and miss out on a good deal of features that kernel threads have naturally.
It kind of hurts to say this because i'm usually quite concept-happy - one can easily get addicted to the introduction of new core kernel concepts :-) But i really, really think we dont want to do fibrils but we want to do kernel threads, and i havent really seen a discussion about why they shouldnt be done via kernel threads.
Nor have i seen a discussion that whatever threading concept we use for AIO within the kernel, it is really a fallback thing, not the primary goal of "native" KAIO design. The primary goal of KAIO design is to arrive at a state machine - and for one of the most important IO disciplines, networking, that is reality already. (For filesystem events i doubt we will ever be able to build an IO state machine - but there are lots of crazy folks out there so it's not fundamentally impossible, just very, very hard.)
so my suggestions center around the notion of extending kernel threads to support the features you find important in fibrils:
> would it be hard to redo your AIO patches based on a pool of plain
> simple kernel threads?

> We could even extend the scheduling properties of kernel threads so
> that they could also be 'companion threads' of any given user-space
> task. (i.e. they'd always schedule on the same CPU as that user-space
> task)

> I bet most of the real benefit would come from co-scheduling them on
> the same CPU. But this should be a performance property, not a basic
> design property. (And i also think that having a limited per-CPU pool
> of AIO threads works better than having a per-user-thread pool - but
> again this is a detail that can be easily changed, not a fundamental
> design property.)
but i'm willing to be convinced of the opposite as well, as always. (I'm real good at quickly changing my mind, especially when i'm embarrassingly wrong about something. So please fire away and dont hold back.)
On Thu, Feb 01, 2007 at 02:02:34PM +0100, Ingo Molnar wrote:
> what i dont really like is /the particular/ concept above - the
> introduction of 'fibrils' as a hard distinction of kernel threads. They
> are /almost/ kernel threads, but still by being different they create
> a lot of duplication and miss out on a good deal of features that kernel
> threads have naturally.

> It kind of hurts to say this because i'm usually quite concept-happy -
> one can easily get addicted to the introduction of new core kernel
> concepts :-) But i really, really think we dont want to do fibrils but
> we want to do kernel threads, and i havent really seen a discussion
> about why they shouldnt be done via kernel threads.
I tend to agree. Note that one thing we should be doing one day (not only if we want to use it for aio) is to make kernel threads more lightweight. There is a lot of baggage we keep around in task_struct and co that only makes sense for threads that have a user space part and isn't or shouldn't be needed for a purely kernel-resident thread.
> I tend to agree. Note that one thing we should be doing one day
> (not only if we want to use it for aio) is to make kernel
> threads more lightweight. There is a lot of baggage we keep around in
> task_struct and co that only makes sense for threads that have a user
> space part and isn't or shouldn't be needed for a purely
> kernel-resident thread.
yeah. I'm totally open to such efforts. I'd also be most happy if this was primarily driven via the KAIO effort: i.e. to implement it via kernel threads and then to benchmark the hell out of it. I volunteer to fix whatever fat kernel thread handling has left.
and if people agree with me that 'native' state-machine driven KAIO is what we want to ultimately achieve (it is certainly the best performing implementation) then i dont see the point in fibrils as an interim mechanism anyway. Let's just hide AIO complexities from userspace via kernel threads, and optimize this via two methods: by making kernel threads faster, and by simultaneously and gradually converting as much KAIO code as possible to a native state machine - which would not need any kind of kernel thread help anyway.
(plus as i mentioned previously, co-scheduling kernel threads with related user space threads on the same CPU might be something useful too - not just for KAIO, and we could add that too.)
also, we context-switch kernel threads in 350 nsecs on current hardware and the -rt kernel is certainly happy with that and runs all hardirqs and softirqs in separate kernel thread contexts. There's not /that/ much fat left to cut off - and if there's something more to optimize there then there are a good number of projects interested in that, not just the KAIO effort :)
> also, we context-switch kernel threads in 350 nsecs on current hardware
> and the -rt kernel is certainly happy with that and runs all hardirqs
Ingo, how relevant is that "350 nsecs on current hardware" claim?
I don't mean that in a bad way, but my own experience suggests that most people doing real hard RT (or tight soft RT) are not doing it on x86 architectures. But rather on lowly 1GHz (or less) ARM based processors and the like.
For RT issues, those are the platforms I care more about, as those are the ones that get embedded into real-time devices.
> >also, we context-switch kernel threads in 350 nsecs on current
> >hardware and the -rt kernel is certainly happy with that and runs all
> >hardirqs
> Ingo, how relevant is that "350 nsecs on current hardware" claim?
> I don't mean that in a bad way, but my own experience suggests that
> most people doing real hard RT (or tight soft RT) are not doing it on
> x86 architectures. But rather on lowly 1GHz (or less) ARM based
> processors and the like.
it's not relevant to those embedded boards, but it's relevant to the AIO discussion, which centers around performance.
> For RT issues, those are the platforms I care more about, as those are
> the ones that get embedded into real-time devices.
yeah. Nevertheless if you want to use -rt on your desktop (under Fedora 4/5/6) you can track an rpmized+distroized full kernel package quite easily, via 3 easy commands:
   yum install kernel-rt.x86_64    # on x86_64
   yum install kernel-rt           # on i686
which is closely tracking latest upstream -git. (for example, the current kernel-rt-2.6.20-rc7.1.rt3.0109.i686.rpm is based on 2.6.20-rc7-git1, so if you want to run a kernel rpm that has all of Linus' latest commits from yesterday, this might be for you.)
it's rumored to be a quite smooth kernel ;-) So in this sense, because this also runs on all my testboxes by default, it matters on modern hardware too, at least to me. Today's commodity hardware is tomorrow's embedded hardware. If a kernel is good on today's colorful desktop hardware then it will be perfect for tomorrow's embedded hardware.
> there's almost no scheduling cost from being able to arbitrarily
> schedule a kernel thread - but there are /huge/ benefits in it.
That's a singularly *stupid* argument.
Of course scheduling is fast. That's the whole *point* of fibrils. They still schedule. Nobody claimed anything else.
Bringing up RT kernels and scheduling latency is idiotic. It's like saying "we should do this because the sky is blue". Sure, that's true, but what the *hell* does Rayleigh scattering have to do with anything?
The cost has _never_ been scheduling. That was never the point. Why do you even bring it up? Only to make an argument that makes no sense?
The cost of AIO is
- maintenance. It's a separate code-path, and it's one that simply doesn't fit into anything else AT ALL. It works (mostly) for simple things, ie reads and writes, but even there, it's really adding a lot of crud that we could do without.
- setup and teardown costs: both in CPU and in memory. These are the big costs. It's especially true since a lot of AIO actually ends up cached. The user program just wants the data - 99% of the time it's likely to be there, and the whole point of AIO is to get at it cheaply, but not block if it's not there.
So your scheduling arguments are inane. They totally miss the point. They have nothing to do with *anything*.
Ingo: everybody *agrees* that scheduling is cheap. Scheduling isn't the issue. Scheduling isn't even needed in the perfect path where the AIO didn't need to do any real IO (and that _is_ the path we actually would like to optimize most).
So instead of talking about totally irrelevant things, please keep your eyes on the ball.
So I claim that the ball is here:
- cached data (and that is *especially* true of some of the more interesting things we can do with a more generic AIO thing: path lookup, inode filling (stat/fstat) etc usually has hit-rates in the 99% range, but missing even just 1% of the time can be deadly, if the miss costs you a hundred msec of not doing anything else!)

Do the math. A "stat()" system call generally takes on the order of a couple of microseconds. But if it misses even just 1% of the time (and takes 100 msec when it does that, because there is other IO also competing for the disk arm), ON AVERAGE it takes 1ms.
So what you should aim for is improving that number. The cached case should hopefully still be in the microseconds, and the uncached case should be nonblocking for the caller.
- setup/teardown costs. Both memory and CPU. This is where the current threads simply don't work. The setup cost of doing a clone/exit is actually much higher than the cost of doing the whole operation, most of the time. Remember: caches still work.
- maintenance. Clearly AIO will always have some special code, but if we can move the special code *away* from filesystems and networking and all the thousands of device drivers, and into core kernel code, we've done something good. And if we can extend it from just pure read/write into just about *anything*, then people will be happy.
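The "do the math" step in the cached-data item above is just an expected-value calculation. As a small sketch, with `expected_cost_us()` being a hypothetical helper and the 2 microsecond hit cost standing in for "a couple of microseconds":

```c
#include <assert.h>

/*
 * Expected cost of one call, in microseconds, given the cost of a cache
 * hit, the cost of a miss, and the fraction of calls that miss.
 */
static double expected_cost_us(double hit_us, double miss_us, double miss_rate)
{
	return (1.0 - miss_rate) * hit_us + miss_rate * miss_us;
}
```

Calling `expected_cost_us(2.0, 100000.0, 0.01)` gives 0.99 * 2 + 0.01 * 100000 = 1001.98 microseconds, i.e. about 1ms: the rare miss dominates the average by roughly three orders of magnitude, which is why making the miss non-blocking for the caller matters more than shaving the hit path.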
So stop blathering about scheduling costs, RT kernels and interrupts. Interrupts generally happen a few thousand times a second. This is something you want to do a *million* times a second, without any IO happening at all except for when it has to.

> let me clarify this: i very much like your AIO patchset in general, in
> the sense that it 'completes' the AIO implementation: finally
> everything can be done via it, greatly increasing its utility and
> hopefully its penetration. This is the most important step, by far.
We violently agree on this :).
> what i dont really like is /the particular/ concept above - the
> introduction of 'fibrils' as a hard distinction of kernel threads.
> They are /almost/ kernel threads, but still by being different they
> create a lot of duplication and miss out on a good deal of features
> that kernel threads have naturally.
I might quibble with some of the details, but I understand your fundamental concern. I do. I don't get up each morning *thrilled* by the idea of having to update lockdep, sysrq-t, etc, to understand these fibril things :). The current fibril switch isn't nearly as clever as the lock-free task scheduling switch. It'd be nice if we didn't have to do that work to optimize the hell out of it, sure.
> It kind of hurts to say this because i'm usually quite concept-happy -
> one can easily get addicted to the introduction of new core kernel
> concepts :-)
:)
> so my suggestions center around the notion of extending kernel threads
> to support the features you find important in fibrils:

>> would it be hard to redo your AIO patches based on a pool of plain
>> simple kernel threads?
It'd certainly be doable to throw together a credible attempt to service "asys" system call submission with full-on kernel threads. That seems like reasonable due diligence to me. If full-on threads are almost as cheap, great. If fibrils are so much cheaper that they seem to warrant investing in, great.
I am concerned about the change in behaviour if we fall back to full kernel threads, though. I really, really, want aio syscalls to behave just like sync ones.
Would your strategy be to update the syscall implementations to share data in task_struct so that there isn't as significant a change in behaviour? (sharing current->ioprio, instead of just inheriting it, for example.). We'd be betting that there would be few of these and that they'd be pretty reasonable to share?

On Thu, Feb 01, 2007 at 01:52:13PM -0800, Zach Brown wrote:
> >let me clarify this: i very much like your AIO patchset in general, in
> >the sense that it 'completes' the AIO implementation: finally
> >everything can be done via it, greatly increasing its utility and
> >hopefully its penetration. This is the most important step, by far.
> We violently agree on this :).
There is also the old kernel_thread based method that should probably be compared, especially if pre-created threads are thrown into the mix. Also, since the old days, a lot of thread scaling issues have been fixed that could even make userland threads more viable.
> Would your strategy be to update the syscall implementations to share
> data in task_struct so that there isn't as significant a change in
> behaviour? (sharing current->ioprio, instead of just inheriting it,
> for example.). We'd be betting that there would be few of these and
> that they'd be pretty reasonable to share?

Priorities cannot be shared, as they have to adapt to the per-request priority when we get down to the nitty gritty of POSIX AIO, as otherwise realtime issues like keepalive transmits will be handled incorrectly.
		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <d...@kvack.org>.
> Priorities cannot be shared, as they have to adapt to the per-request
> priority when we get down to the nitty gritty of POSIX AIO, as
> otherwise realtime issues like keepalive transmits will be handled
> incorrectly.
Well, maybe not *blind* sharing. But something more than the disconnect threads currently have with current->ioprio.
Today an existing kernel thread would most certainly ignore a sys_ioprio_set() in the submitter and then handle an aio syscall with an old current->ioprio.
Something more smart than that is all I'm on about.
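To make "something smarter" slightly more concrete, one option is to snapshot the submitter's priority into each request at submission time and have the servicing thread adopt it only while it handles that request. This is a userspace sketch of that assumption, not a proposal for the real interfaces; `struct mock_task`, `struct mock_aio_request`, and the three helpers are all hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for the one task_struct field we care about. */
struct mock_task {
	int ioprio;
};

/* Hypothetical request carrying the priority it was submitted with. */
struct mock_aio_request {
	int ioprio;
};

/* Snapshot the submitter's current priority into the request. */
static void mock_submit(const struct mock_task *submitter,
			struct mock_aio_request *req)
{
	req->ioprio = submitter->ioprio;
}

/*
 * The servicing thread adopts the request's priority and returns its
 * previous value so the caller can restore it afterwards.
 */
static int mock_worker_begin(struct mock_task *worker,
			     const struct mock_aio_request *req)
{
	int old = worker->ioprio;

	worker->ioprio = req->ioprio;
	return old;
}

/* Restore the servicing thread's own priority once the request is done. */
static void mock_worker_end(struct mock_task *worker, int old)
{
	worker->ioprio = old;
}
```

This keeps priorities per-request rather than shared, and a later priority change in the submitter only affects requests submitted after it.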
> So stop blathering about scheduling costs, RT kernels and interrupts.
> Interrupts generally happen a few thousand times a second. This is
> something you want to do a *million* times a second, without any IO
> happening at all except for when it has to.
we might be talking past each other.
i never suggested every aio op should create/destroy a kernel thread!
My only suggestion was to have a couple of transparent kernel threads (not fibrils) attached to a user context that does asynchronous syscalls! Those kernel threads would be 'switched in' if the current user-space thread blocks - so instead of having to 'create' any of them - the fast path would be to /switch/ them to under the current user-space, so that user-space processing can continue under that other thread!
That means that in the 'switch kernel context' fastpath it simply needs to copy the blocked thread's user-space ptregs (~64 bytes) to its own kernel stack, and then it can do a return-from-syscall without user-space noticing the switch! Never would we really see the cost of kernel thread creation. We would never see that cost in the fully cached case (no other thread is needed then), nor would we see it in the blocking-IO case, due to pooling. (there are some other details related to things like the FPU context, but you get the idea.)
Let me quote Zach's reply to my suggestions:
| It'd certainly be doable to throw together a credible attempt to
| service "asys" system call submission with full-on kernel threads.
| That seems like reasonable due diligence to me. If full-on threads
| are almost as cheap, great. If fibrils are so much cheaper that they
| seem to warrant investing in, great.
that's all i wanted to see being considered!
Please ignore my points about scheduling costs - i only talked about them at length because the only fundamental difference between kernel threads and fibrils is their /scheduling/ properties. /Not/ the setup/teardown costs - those are not relevant /precisely/ because they can be pooled and because they happen relatively rarely, compared to the cached case. The 'switch to the blocked thread's ptregs' operation also involves a context-switch under this design. That's why i was talking about scheduling so much: the /only/ true difference between fibrils and kernel threads is their /scheduling/.
I believe this is the point where your argument fails:
> - setup/teardown costs. Both memory and CPU. This is where the current
>   threads simply don't work. The setup cost of doing a clone/exit is
>   actually much higher than the cost of doing the whole operation,
>   most of the time.
you are comparing apples to oranges - i never said we should create/destroy a kernel thread for every async op. That would be insane!
what we need to support asynchronous system-calls is the ability to pick up an /already created/ kernel thread from a pool of per-task kernel threads and to switch it to under the current user-space and return to the user-space stack with that new kernel thread running. (The other, blocked kernel thread stays blocked and is returned into the pool of 'pending' AIO kernel threads.) And this only needs to happen in the 'cachemiss' case anyway. In the 'cached' case no other kernel thread would be involved at all, the current one just falls straight through the system-call.
my argument is that the whole notion of cutting this at the kernel stack and thread info level and making fibrils in essence a separate scheduling entity is wrong, wrong, wrong. Why not use plain kernel threads for this?
[ finally, i think you totally ignored my main argument, state machines. The networking stack is a full and very nice state machine. It's kicked from user-space, and zillions of small contexts (sockets) are living on without any of the originating tasks having to be involved. So i'm still holding to the fundamental notion that within the kernel this form of AIO is a nice but /secondary/ mechanism. If a subsystem is able to pull it off, it can implement asynchronicity via a state machine - and it will outperform any thread based AIO. Or not. We'll see. For something like the VFS i doubt we'll see (and i doubt we /want/ to see) a 'native' state-machine implementation.
this is btw. quite close to the Tux model of doing asynchronous block IO and asynchronous VFS events such as asynchronous open(). Tux uses a pool of kernel threads to pass blocking work to, while not holding up the 'main' thread. But the main Tux speedup comes from having a native state machine for all the networking IO. ]
> I tend to agree. Note that there is one thing we should be doing > one day (not only if we want to use it for aio) is to make kernel threads > more lightweight. There is a lot of baggage we keep around in task_struct > and co that only makes sense for threads that have a user space part and > aren't or shouldn't be needed for a purely kernel-resident thread.
I suspect you will get a lot of this for free from the current namespace efforts.
> My only suggestion was to have a couple of transparent kernel threads > (not fibrils) attached to a user context that does asynchronous > syscalls! Those kernel threads would be 'switched in' if the current > user-space thread blocks - so instead of having to 'create' any of them > - the fast path would be to /switch/ them to under the current > user-space, so that user-space processing can continue under that other > thread!
But in that case, you really do end up with "fibrils" anyway.
Because those fibrils are what would be the state for the blocked system calls when they aren't scheduled.
We may have a few hundred thousand system calls a second (maybe that's not actually reasonable, but it should be what we *aim* for), and 99% of them will hopefully hit the cache and never need any separate IO, but even if it's just 1%, we're talking about thousands of threads.
I do _not_ think that it's reasonable to have thousands of threads state around just "in case". Especially if all those threadlets are then involved in signals etc - something that they are totally uninterested in.
I think it's a lot more reasonable to have just the kernel stack page for "this was where I was when I blocked". IOW, a fibril-like thing. You need some data structure to set up the state *before* you start doing any threads at all, because hopefully the operation will be totally synchronous, and no secondary thread is ever really needed!
What I like about fibrils is that they should be able to handle the cached case well: the case where no "real" scheduling (just the fibril stack switches) takes place.
Now, most traditional DB loads would tend to use AIO only when they "know" that real IO will take place (the AIO call itself will probably be O_DIRECT most of the time). So I suspect that a lot of those users will never really have the cached case, but one of my hopes is to be able to do exactly the things that we have *not* done well: asynchronous file opens and pathname lookups, which is very common in a file server.
If done *really* right, a perfectly normal app could do things like asynchronous stat() calls to fill in the readdir results. In other words, what *I* would like to see is the ability to have something *really* simple like "ls" use this, without it actually being a performance hit for the common case where everything is cached.
Have you done "ls -l" on a big uncached directory where the inodes are all over the disk lately? You can hear the disk whirr. THAT is the kind of "normal user" thing I'd like to be able to fix, and the db case is actually secondary. The DB case is much much more limited (ok, so somebody pointed out that they want slightly more than just read/write, but still.. We're talking "special code".)
> [ finally, i think you totally ignored my main argument, state machines.
I ignored your argument, because it's not really relevant. The fact that networking (and TCP in particular) has state machines is because it is a packetized environment. Nothing else is. Think pathname lookup etc. They are all *fundamentally* environments with a call stack.
So the state machine argument is totally bogus - it results in a programming model that simply doesn't match the *normal* setup. You want the kernel programming model to appear "linear" even when it isn't, because it's too damn hard to think nonlinearly.
Yes, we could do pathname lookup with that kind of insane setup too. But it would be HORRID!
This one got shelved while I sorted other things out as it warranted a longer look. Some comments follow, but firstly can we please bury this "fibril" name. The constructs Zach is using appear to be identical to co-routines, and they've been called that in computer science literature for fifty years. They are one of the great and somehow forgotten ideas. (and I admit I've used them extensively in past things where its wonderful for multi-player gaming so I'm a convert already).
The stuff however isn't as free as you make out. Current kernel logic knows about various things being "safe" but with fibrils you have to address additional questions such as "What happens if I issue an I/O and change priority". You also have an 800lb gorilla hiding behind a tree waiting for you in privilege and permission checking.
Right now current->*u/gid is safe across a syscall from start to end; with an asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do this, in fact we'd be able to do some of the utterly moronic POSIX thread uid handling in kernel space if we did, just that it isn't free. We have locking rules defined by the magic serializing construct called "the syscall" and you break those.
I'd expect the odd other gorilla waiting to mug you as well and the ones nobody has thought of will be the worst 8)
The number of co-routines and stacks can be dealt with two ways - you use small stacks allocated when you create a fibril, or you grab a page, use separate IRQ stacks and either fail creation with -ENOBUFS etc which drops work on user space, or block (for which cases ??) which also means an overhead on co-routine exits. That can be tunable, for embedded easily tuned right down.
Traditional co-routines have clear notions of being able to create a co-routine, stack them and fire up specific ones. In part this is done because many things expressed in this way know what to fire up next. It's also a very clean way to express driver problems with a lot of state.
Essentially, a co-routine is simply making "%esp" roughly the same as the C++ world's "self".
You get some other funny things from co-routines which are very powerful, very dangerous, or plain insane depending upon your view of life. One big one is the ability for real men (and women) to do stuff like this, because you don't need to keep the context attached to the same task.
	send_reset_command(dev);
	wait_for_irq_event(dev->irq);
	/* co-routine continues in IRQ context here */
	clean_up_reset_command(dev);
	exit_irq_event();
	/* co-routine continues out of IRQ context here */
	send_identify_command(dev);
Notice we just dealt with all the IRQ stack problems the moment an IRQ is a co-routine transfer 8)
Ditto with timers, although for the kernel that might not be smart as we have a lot of timers.
Less insanely you can create a context, start doing stuff in it and then pass it to someone else local variables, state and all. This one is actually rather useful for avoiding a lot of the 'D' state crap in the kernel.
For example we have driver code that sleeps uninterruptibly because its too hard to undo the mess and get out of the current state if it is interrupted. In the world of sending other people co-routines you just do this
and obviously you don't pass any over that will then not do the right thing before accessing user space (well unless implementing 'read_for_someone_else()' or other strange syscalls - like ptrace...)
Other questions really relate to the scheduling - Zach do you intend schedule_fibrils() to be a call code would make or just from schedule() ?
Linus will now tell me I'm out of my tree...
Alan (who used to use Co-routines in real languages on 36bit computers with 9bit bytes before learning C) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
> This one got shelved while I sorted other things out as it warranted a > longer look. Some comments follow, but firstly can we please bury this > "fibril" name. The constructs Zach is using appear to be identical to > co-routines, and they've been called that in computer science literature > for fifty years. They are one of the great and somehow forgotten ideas. > (and I admit I've used them extensively in past things where its > wonderful for multi-player gaming so I'm a convert already).
Well, they are indeed coroutines, but they are coroutines in the same sense any "CPU scheduler" ends up being a coroutine.
They are NOT the generic co-routine that some languages support natively. So I think trying to call them coroutines would be even more misleading than calling them fibrils.
In other words the whole *point* of the fibril is that you can do "coroutine-like stuff" while using a "normal functional linear programming paradigm".
Wouldn't you agree?
(I love the concept of coroutines, but I absolutely detest what the code ends up looking like. There's a good reason why people program mostly in linear flow: that's how people think consciously - even if it's obviously not how the brain actually works).
And we *definitely* don't want to have a coroutine programming interface in the kernel. Not in C.
> The stuff however isn't as free as you make out. Current kernel logic > knows about various things being "safe" but with fibrils you have to > address additional questions such as "What happens if I issue an I/O and > change priority". You also have an 800lb gorilla hiding behind a tree > waiting for you in privilege and permission checking.
This is why I think it should be 100% clear that things happen in process context. That just answers everything. If you want to synchronize with async events and change IO priority, you should do exactly that:
	wait_for_async();
	ioprio(newpriority);
and that "solves" that problem. Leave it to user space.
> Right now current->*u/gid is safe across a syscall start to end, with an > asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do > this, in fact we'd be able to do some of the utterly moronic POSIX thread > uid handling in kernel space if we did, just that it isn't free. We have > locking rules defined by the magic serializing construct called > "the syscall" and you break those.
I agree. As mentioned, we probably will have fallout.
> The number of co-routines and stacks can be dealt with two ways - you use > small stacks allocated when you create a fibril, or you grab a page, use > separate IRQ stacks and either fail creation with -ENOBUFS etc which > drops work on user space, or block (for which cases ??) which also means > an overhead on co-routine exits. That can be tunable, for embedded easily > tuned right down.
Right. It should be possible to just say "use a max parallelism factor of 5", and if somebody submits a hundred AIO calls and they all block, when it hits #6, it will just do it synchronously.
Basically, what I'm hoping can come out of this (and this is a simplistic example, but perhaps exactly *because* of that it hopefully also shows that we can actually make *simple* interfaces for complex asynchronous things):
		/* Add it to the list, fill in the name */
		entry->next = prev;
		prev = entry;
		strcpy(entry->name, de->d_name);

		/* Do the stat lookup async */
		async_stat(de->d_name, &entry->stat_buf);
	}
	wait_for_async();
	.. Ta-daa! All done ..
and it *should* allow us to do all the stat lookup asynchronously.
Done right, this should basically be no slower than doing it with a real stat() if everything was cached. That would kind of be the holy grail here.
> You get some other funny things from co-routines which are very powerful, > very dangerous, or plain insane
You forgot "very hard to think about".
We DO NOT want coroutines in general. It's clever, but it's
 (a) impossible to do without language support that C doesn't have, or some really really horrid macro constructs that really only work for very specific and simple cases.
 (b) very non-intuitive unless you've worked with coroutines a lot (and almost nobody has)
> Linus will now tell me I'm out of my tree...
I don't think you're wrong in theory, I just think that in practice, within the confines of (a) existing code, (b) existing languages, and (c) existing developers, we really REALLY don't want to expose coroutines as such.
But if you wanted to point out that what we want to do is get the ADVANTAGES of coroutines, without actually having to program them as such, then yes, I agree 100%. But we shouldn't call them coroutines, because the whole point is that as far as the user interface is concerned, they don't look like that. In the kernel, they just look like normal linear programming.
On Fri, 2 Feb 2007, Linus Torvalds wrote: > > You get some other funny things from co-routines which are very powerful, > > very dangerous, or plain insane
> You forgot "very hard to think about".
> We DO NOT want coroutines in general. It's clever, but it's
>  (a) impossible to do without language support that C doesn't have, or some really really horrid macro constructs that really only work for very specific and simple cases.
>  (b) very non-intuitive unless you've worked with coroutines a lot (and almost nobody has)
Actually, coroutines are not too bad to program once you have a total-coverage async scheduler to run them. The attached (very sketchy) example uses libpcl ( http://www.xmailserver.org/libpcl.html ) and epoll as the scheduler (but here you can really use anything). You can implement coroutines in many ways, from C preprocessor macros on up, but in the libpcl case they are simply switched stacks. Like fibrils are supposed to be.
The problem is that in order to make a real-life coroutine-based application work, you need everything that can put you to sleep (syscalls or any external library call you have no control over) implemented in an async way. And what I ended up doing is exactly what Zab did inside the kernel: in my case, a dynamic pool of (userspace) threads servicing any non-native, potentially pre-emptive call, and signaling the result to a pollable fd (a pipe in my case) that is integrated into the epoll (poll/select, whatever) scheduler.
I personally find Zab's idea a really good one, since it allows for a generic kernel async implementation without the burden of dirtying kernel code paths with AIO knowledge. Be it fibrils or real kthreads, it is IMO definitely worth a very close look.
> Actually, coroutines are not too bad to program once you have a > total-coverage async scheduler to run them.
No, no, I don't disagree at all. In fact, I agree emphatically.
It's just that you need the scheduler to run them, in order to not "see" them as coroutines. Then, you can program everything *as*if* it was just a regular declarative linear language with multiple threads.
And that gets us the same programming interface as we always have, and people can forget about the fact that in a very real sense, they are using coroutines with the scheduler just keeping track of it all for them.
After all, that's what we do between processes *anyway*. You can technically see the kernel as one big program that uses coroutines and the scheduler just keeping track of every coroutine instance. It's just that I doubt that any kernel programmer really thinks in those terms. You *think* in terms of "threads".
> They are NOT the generic co-routine that some languages support natively. > So I think trying to call them coroutines would be even more misleading > than calling them fibrils.
It's actually pretty damned close to the Honeywell B co-routine package, with a kernel twist, to be honest.
> ends up looking like. There's a good reason why people program mostly in > linear flow: that's how people think consciously - even if it's obviously > not how the brain actually works).
The IRQ example below is an example of how it linearizes - so it cuts both ways like most tools, admittedly one of the blades is at the handle end in this case ...
> Basically, what I'm hoping can come out of this (and this is a simplistic > example, but perhaps exactly *because* of that it hopefully also shows > that we can actually make *simple* interfaces for complex asynchronous > things):
> 		/* Add it to the list, fill in the name */
> 		entry->next = prev;
> 		prev = entry;
> 		strcpy(entry->name, de->d_name);
>
> 		/* Do the stat lookup async */
> 		async_stat(de->d_name, &entry->stat_buf);
> 	}
> 	wait_for_async();
The brown and sticky will hit the rotating air impeller pretty hard if you are not very careful about how that ends up scheduled. It's one thing to exploit the ability to pull all the easy lookups out in advance, and another, having created all the parallelism, to turn it into sane disk scheduling and wakeups without a scaling hit. But you do at least have the opportunity to exploit it I guess.
> > You get some other funny things from co-routines which are very powerful, > > very dangerous, or plain insane
> You forgot "very hard to think about".
I'm not sure handing a fibril off to another task is that hard to think about. It's not easy to turn it around as an async_exit() keeping the other fibrils around, because of the mass of rules and behaviours tied to process exit, but it's perhaps not impossible.
Other minor evil. If we use fibrils we need to be careful we know in advance how many fibrils an operation needs so we don't deadlock on them in critical places like writeout paths when we either hit the per task limit or we have no page for another stack.
> The brown and sticky will hit the rotating air impeller pretty hard if you > are not very careful about how that ends up scheduled
Why do you think that?
With cooperative scheduling (like the example Zach posted), there is absolutely no "brown and sticky" wrt any CPU usage. Which is why cooperative scheduling is a *good* thing. If you want to blow up your 1024-node CPU cluster, you'd do it with "real threads".
Also, with sane default limits of fibrils per process (say, in the 5-10 range), it also ends up being good for IO. No "insane" IO bombs, but an easy way for users to just get a reasonable amount of IO parallelism without having to use threading (which is hard).
So, best of both worlds.
Yes, *of*course* you want to have limits on outstanding work. And yes, a database server would set those limits much higher ("Only a thousand outstanding IO requests? Can we raise that to ten thousand, please?") than a regular process ("default: 5, and the super-user can raise it for you if you're good").
But there really shouldn't be any downsides.
(Of course, there will be downsides. I'm sure there will be. But I don't see any really serious and obvious ones).
> Other minor evil. If we use fibrils we need to be careful we > know in advance how many fibrils an operation needs so we don't deadlock > on them in critical places like writeout paths when we either hit the per > task limit or we have no page for another stack.
Since we'd only create fibrils on a system call entry level, and system calls are independent, how would you do that anyway?
Once a fibril has been created, it will *never* depend on any other fibril resources ever again. At least not in any way that any normal non-fibril call wouldn't already do as far as I can see.
> > My only suggestion was to have a couple of transparent kernel threads > > (not fibrils) attached to a user context that does asynchronous > > syscalls! Those kernel threads would be 'switched in' if the current > > user-space thread blocks - so instead of having to 'create' any of them > > - the fast path would be to /switch/ them to under the current > > user-space, so that user-space processing can continue under that other > > thread!
> But in that case, you really do end up with "fibrils" anyway.
> Because those fibrils are what would be the state for the blocked > system calls when they aren't scheduled.
> We may have a few hundred thousand system calls a second (maybe that's > not actually reasonable, but it should be what we *aim* for), and 99% > of them will hopefully hit the cache and never need any separate IO, > but even if it's just 1%, we're talking about thousands of threads.
> I do _not_ think that it's reasonable to have thousands of threads > state around just "in case". Especially if all those threadlets are > then involved in signals etc - something that they are totally > uninterested in.
> I think it's a lot more reasonable to have just the kernel stack page > for "this was where I was when I blocked". IOW, a fibril-like thing.
ok, i think i noticed another misunderstanding. The kernel thread based scheme i'm suggesting would /not/ 'switch' to another kernel thread in the cached case, by default. It would just execute in the original context (as if it were a synchronous syscall), and the switch to a kernel thread from the pool would only occur /if/ the context is about to block. (this 'switch' thing would be done by the scheduler) User-space gets back an -EAIO error code immediately and transparently - but already running under the new kernel thread.
i.e. in the fully cached case there would be no scheduling at all - in fact no thread pool is needed at all.
regarding cost:
the biggest memory resource cost of a kernel thread (assuming it has no real user-space context) /is/ its kernel stack page, which is 4K or 8K. The task struct takes ~1.5K. Once we have a ready kernel thread around, it's quite cheap to 'flip' it to under any arbitrary user-space context: change its thread_info->task pointer to the user-space context's task struct, copy the mm pointer, the fs pointer to the "worker thread", switch the thread_info, update ptregs - done. Hm?
Note: such a 'flip' would only occur when the original context blocks, /not/ on every async syscall.
regarding CPU resource costs, i dont think there should be significant signal overhead, because the original task is still only one instance, and the kernel thread that is now running with the blocked kernel stack is not part of the signal set. (Although it might make sense to make such async syscalls interruptible, just like any syscall.)
The 'pool' of kernel threads doesnt even have to be per-task, it can be a natural per-CPU thing - and its size will grow/shrink [with a low update frequency] depending on how much AIO parallelism there is in the workload. (But it can also be strictly per-user-context - to make sure that a proper ->mm ->fs, etc. is set up and that when the async system calls execute they have all the right context info.)
and note the immediate scheduling benefits: if an app (say like OpenOffice) is single-threaded but has certain common ops coded as async syscalls, then if any of those syscalls blocks then it could utilize /more than one/ CPU. I.e. we could 'spread' a single-threaded app's processing to multiple cores/hardware-threads /without/ having to multi-thread the app in an intrusive way. I.e. this would be a finegrained threading of syscalls, executed as coroutines in essence. With fibrils all sorts of scheduling limitations occur and no parallelism is possible.
in fact an app could also /trigger/ the execution of a syscall in a different context - to create parallelism artificially - without any blocking event. So we could do:
or something like that. Without user-space having to create threads itself, etc. So basically, we'd make kernel threads more useful, and we'd make threading safer - by only letting syscalls thread.
> What I like about fibrils is that they should be able to handle the > cached case well: the case where no "real" scheduling (just the fibril > stack switches) takes place.
the cached case (when a system call would not block at all) would not necessitate any switch to another kernel thread at all - the task just executes its system call as if it were synchronous!
that's the nice thing: we can do this switch-to-another-kernel-thread magic thing right in the scheduler when we block - and the switched-to thread will magically return to user-space (with a -EAIO return code) as if nothing happened (while the original task blocks). I.e. under this scheme i'm suggesting we have /zero/ setup cost in the cached case. The optimistic case just falls through and switches to nothing else. Any switching cost only occurs in the slowpath - and even that cost is very low.
once a kernel thread that ran off with the original stack finishes the async syscall and wants to return the return code, this can be gathered via a special return-code ringbuffer that notifies finished syscalls. (A magic cookie is associated to every async syscall.)
> So the state machine argument is totally bogus - it results in a > programming model that simply doesn't match the *normal* setup. You > want the kernel programming model to appear "linear" even when it > isn't, because it's too damn hard to think nonlinearly.
> Yes, we could do pathname lookup with that kind of insane setup too. > But it would be HORRID!
yeah, but i guess not nearly as horrid as writing a new OS from scratch ;-)
seriously, i very much think and agree that programming state machines is hard and not desired in most of the kernel. But it can be done, and sometimes (definitely not in the common case) it's /cleaner/ than functional programming. I've programmed an HTTP and an FTP in-kernel server via a state machine and it worked better than i initially expected. It needs different thinking but there /are/ people around with that kind of thinking, so we just cannot exclude the possibility. [ It's just that such people usually dedicate their brain to mental fantasies^H^H^Hexercises called 'Higher Mathematics' :-) ]
> [...] The fact that networking (and TCP in particular) has state > machines is because it is a packetized environment.
rough ballpark figures: for things like webserving or fileserving (or mailserving), networking sockets are the reason for context-blocking events in 90% of the cases (mostly due to networking latency). 9% of the blocking happens due to plain block IO, and 1% happens due to VFS metadata (inode, directory, etc.) blocking.
( in Tux i had to handle /all/ of these sources of blocking because even 1% kills your performance if you do a hundred thousand requests per second - but in terms of design weight, networking is pretty damn important. )
and interestingly, modern IO frameworks tend to gravitate towards a packetized environment as well. I.e. i dont think state machines are /that/ unimportant.
> > The brown and sticky will hit the rotating air impeller pretty hard if you > > are not very careful about how that ends up scheduled
> Why do you think that?
> With cooperative scheduling (like the example Zach posted), there is > absolutely no "brown and sticky" wrt any CPU usage. Which is why > cooperative scheduling is a *good* thing. If you want to blow up your > 1024-node CPU cluster, you'd do it with "real threads".
You end up with a lot more things running asynchronously. In the current world we see a series of requests for attributes and hopefully we do readahead and all is neatly ordered. If fibrils are not ordered the same way then we could make it worse as we might not pick the right readahead for example.
> Since we'd only create fibrils on a system call entry level, and system > calls are independent, how would you do that anyway?
If we stick to that limit it ought to be ok. We've been busy slapping people who call sys_*, except for internal magic like kernel_thread