This very rough patch series introduces a different way to provide AIO support for system calls.
Right now to provide AIO support for a system call you have to express your interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to translate this into some call path in the kernel that passes in an iocb, and then update your code path implement with either completion-based (EIOCBQUEUED) or retry-based (EIOCBRETRY) AIO with the iocb.
This patch series changes this by moving the complexity into generic code such that a system call handler would provide AIO support in exactly the same way that it supports a synchronous call. It does this by letting a task have multiple stacks executing system calls in the kernel. Stacks are switched in schedule() as they block and are made runnable.
First, let's introduce the term 'fibril'. It means small fiber or thread. It represents the stack and the bits of data which manage its scheduling under the task_struct. There's a 1:1:1 relationship between a fibril, the stack it's managing, and the thread_struct it's operating under. They're static for the fibril's lifetime. I came to choosing a funny new term after a few iterations of trying to use existing terms (stack, call, thread, task, path, fiber) without going insane. They were just too confusing to use with a clear conscious.
So, let's illustrate the changes by walking through the execution of an asys call. Let's say sys_nanosleep(). Then we'll talk about the trade-offs.
Maybe it'd make sense to walk through the patches in another window while reading the commentary. I'm sorry if this seems tedious, but this is non-trivial stuff. I want to get the point across.
We start in sys_asys_submit(). It allocates a fibril for this executing submission syscall and hangs it off the task_struct. This lets this submission fibril be scheduled along with the later asys system calls themselves.
sys_asys_submit() has arguments which specify system call numbers and arguments. For each of these calls the submission syscall allocates a fibril and constructs an initial stack. The fibril has eip pointing to the system call handler and esp pointing to the new stack such that when the handler is called its arguments will be on the stack. The stack is also constructed so that when the handler returns it jumps to a function which queues the return code for collection in userspace.
sys_asys_submit() asks the scheduler to put these new fibrils on the simple run queue in the task_struct. It's just a list_head for now. It then calls schedule().
After we've gotten the run queue lock in the scheduler we notice that there are fibrils on the task_struct's run queue. Instead of continuing with schedule(), then, we switch fibrils. The old submission fibril will still be in schedule() and we'll start executing sys_nanosleep() in the context of the submission task_struct.
The specific switching mechanics of this implementation rely on the notion of tracking a stack as a full thread_info pointer. To make the switch we transfer the non-stack bits of the thread_info from the old fibril's ti to the new fibril's ti. We update the book keeping in the task_struct to consider the new thread_info as the current thread_info for the task. Like so:
Yeah, messy. I'm interested in aggressive feedback on how to do this sanely. Especially from the perspective of worrying about all the archs.
Did everyone catch that "per_call" thing there? That's to switch members of task_struct which are local to a specific call. link_count, journal_info, that sort of thing. More on that as we talk about the costs later.
After the switch we're executing in sys_nanosleep(). Eventually it gets to the point where it's building its timer which will wake it after the sleep interval. Currently it would store a bare task_struct reference for wake_up_process(). Instead we introduce a helper which returns a cookie that is given to a specific wake_up_*() variant. Like so:
It then marks itself as TASK_INTERRUPTIBLE and calls schedule(). schedule() notices that we have another fibril on the run queue. It's the submission fibril that we switched from earlier. As we switched we saw that it was still TASK_RUNNING so we put it on the run queue as we switched. We now switch back to the submission fibril, leaving the sys_nanosleep() fibril sleeping. Let's say the submission task returns to userspace which then immediately calls sys_asys_await_completion(). It's an easy case :). It goes to sleep, there are no running fibrils and the schedule() path really puts the task to sleep.
Eventually the timer fires and the hrtimer code path wakes the fibril:
- if (task) - wake_up_process(task); + if (wake_target) + wake_up_target(wake_target);
We've doctored try_to_wake_up() to be able to tell if its argument is a task_struct or one of these fibril targets. In the fibril case it calls try_to_wake_up_fibril(). It notices that the target fibril does need to be woken and sets it TASK_RUNNING. It notices that it it's current in the task so it puts the fibril on the task's fibril run queue and wakes the task. There's grossness here. It needs the task to come through schedule() again so that it can find the new runnable fibril instead of continuing to execute its current fibril. To this end, wake-up marks the task's current ti TIF_NEED_RESCHED. This seems to work, but there are some pretty terrifying interactions between schedule, wake-up, and the maintenance of fibril->state and task->state that need to be sorted out.
Remember our task was sleeping in asys_await_completion()? The task was woken by the fibril wake-up path, but it's still executing the asys_await_completion() fibril. It comes out of schedule() and sees TIF_NEED_RESCHED and comes back through the top of schedule(). This time it finds the runnable sys_nanosleep() fibril and switches to it. sys_nanosleep() runs to completion and it returns which, because of the way we built its stack, calls asys_teardown_stack().
asys_teardown_stack() takes the return code and puts it off in a list for asys_await_completion(). It wakes a wait_queue to notify waiters of pending completions. In so doing it wakes up the asys_await_completion() fibril that was sleeping in our task.
Then it has to tear down the fibril for the call that just completed. In the current implementation the fibril struct is actually embedded in an "asys_call" struct. asys_teardown_stack() frees the asys_call struct, and so the fibril, after having cleared current->fibril. It then calls schedule(). Our asys_await_completion() fibril is on the run queue so we switch to it. Switching notices the null current->fibril that we're switching from and takes that as a queue to mark the previous thread_info for freeing *after* the switch.
After the switch we're in asys_await_completion(). We find the waiting return code completion struct in the list that was left by asys_teardown_stack(). We return it to userspace.
Phew. OK, so what are the trade-offs here? I'll start with the benefits for obvious reasons :).
- With get AIO support for all syscalls. Every single one. (Well, please, no asys sys_exit() :)). Buffered IO, sendfile, recvmsg, poll, epoll, hardware crypto ioctls, open, mmap, getdents, the entire splice API, etc.
- The syscall API does not change just because calls are being issued AIO, particularly things that reference task_struct. AIO sys_getpid() does what you'd expect, signal masking, etc. You don't have to worry about your AIO call being secretly handled by some worker threads that get very different results from current-> references.
- We wouldn't multiply testing and maintenance burden with separate AIO paths. No is_sync_kiocb() testing and divergence between returning or calling aio_complete(). No auditing to make sure that EIOCBRETRY only being returned after any significant references of current->. No worries about completion racing from the submission return path and some aio_complete() being called from another context. In this scheme if your sync syscall path isn't broken, your AIO path stands a great chance of working.
- The submission syscall won't block while handling submitted calls. Not for metadata IO, not for memory allocation, not for mutex contention, nothing.
- AIO syscalls which *don't* block see very little overhead. They'll allocate stacks and juggle the run queue locks a little, but they'll execute in turn on the submitting (cache-hot, presumably) processor. There's room to optimize this path, too, of course.
- We don't need to duplicate interfaces for each AIO interface we want to support. No iocb unions mirroring the syscall API, no magical AIO sys_ variants.
And the costs? It's not free.
- The 800lb elephant in the room. It uses a full stack per blocked operation. I believe this is a reasonable price to pay for the flexibility of having *any* call pending. It rules out some loads which would want to keep *millions* of operations pending, but I humbly submit that a load rarely takes that number of concurrent ops to saturate a resource. (think of it this way: we've gotten this far by having to burn a full *task* to have *any* syscall pending.) While not optimal, it opens to door to a lot of functionality without having to rewrite the kernel as a giant non-blocking state machine.
It should be noted that my very first try was to copy the used part of stacks in to and out of one full allocated stack. This uses less
...
This finally does something useful with the notion of being able to schedule stacks as fibrils under a task_struct. Again, i386-specific and in need of proper layering with archs.
sys_asys_submit() is added to let userspace submit asynchronous system calls. It specifies the system call number and arguments. A fibril is constructed for each call. Each starts with a stack which executes the given system call handler and then returns to a function which records the return code of the system call handler. sys_asys_await_completion() then lets userspace collect these results.
sys_asys_submit() is careful to construct a fibril for the submission syscall itself so that it can return to userspace if the calls it is dispatching block. If none of them block, however, they will have all been run hot in this submitting task on this processor.
It allocates and runs each system call in turn. It could certainly work in batches to decrease locking overhead at the cost of increased peak memory overhead for calls which don't end up blocking.
The complexity of a fully-formed submission and completion interface hasn't been addressed. Details like targeting explicit completion contexts, batching, timeouts, signal delivery, and syscall-free submission and completion (now with more rings!) can all be hashed out in some giant thread, no doubt. I didn't want them to cloud the basic mechanics being presented here.
/* Protection of the PI data structures: */ spinlock_t pi_lock; + + /* + * XXX This is just a dummy that should be in a seperately managed + * context. An explicit contexts lets asys calls be nested (!) and + * will let us provide the sys_io_*() API on top of asys. + */ + struct list_head asys_completed; + wait_queue_head_t asys_wait;
#ifdef CONFIG_RT_MUTEXES /* PI waiters blocked on a rt_mutex held by this task */ diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/Makefile --- a/kernel/Makefile Mon Jan 29 15:46:47 2007 -0800 +++ b/kernel/Makefile Mon Jan 29 15:50:10 2007 -0800 @@ -8,7 +8,7 @@ obj-y = sched.o fork.o exec_domain.o signal.o sys.o kmod.o workqueue.o pid.o \ rcupdate.o extable.o params.o posix-timers.o \ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ - hrtimer.o rwsem.o latency.o nsproxy.o srcu.o + hrtimer.o rwsem.o latency.o nsproxy.o srcu.o asys.o
#ifdef CONFIG_TRACE_IRQFLAGS DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled); diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/asys.h --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/include/linux/asys.h Mon Jan 29 15:50:10 2007 -0800 @@ -0,0 +1,7 @@ +#ifndef _LINUX_ASYS_H +#define _LINUX_ASYS_H + +void asys_task_exiting(struct task_struct *tsk); +void asys_init_task(struct task_struct *tsk); + +#endif diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/asys.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/kernel/asys.c Mon Jan 29 15:50:10 2007 -0800 @@ -0,0 +1,252 @@ +#include <linux/sched.h> +#include <linux/uaccess.h> +#include <linux/err.h> +#include <linux/asys.h> + +/* XXX */ +#include <asm/processor.h> + +/* + * system call and argument specification given to _submit from userspace + */ +struct asys_input { + int syscall_nr; + unsigned long cookie; + unsigned long nr_args; + unsigned long *args; +}; + +/* + * system call completion event given to userspace + * XXX: compat + */ +struct asys_completion { + long return_code; + unsigned long cookie; +}; + +/* + * This record of a completed async system call is kept around until it + * is collected by userspace. + */ +struct asys_result { + struct list_head item; + struct asys_completion comp; +}; + +/* + * This stack is built-up and handed to the scheduler to first process + * the system call. It stores the progress of the call until the call returns + * and this structure is freed. + */ +struct asys_call { + struct asys_result *result; + struct fibril fibril; +}; + +void asys_init_task(struct task_struct *tsk) +{ + INIT_LIST_HEAD(&tsk->asys_completed); + init_waitqueue_head(&tsk->asys_wait); +} + +void asys_task_exiting(struct task_struct *tsk) +{ + struct asys_result *res, *next; + + list_for_each_entry_safe(res, next, &tsk->asys_completed, item) + kfree(res); + + /* + * XXX this only works if tsk->fibril was allocated by + * sys_asys_submit(), not if its embedded in an asys_call. This + * implies that we must forbid sys_exit in asys_submit. + */ + if (tsk->fibril) { + BUG_ON(!list_empty(&tsk->fibril->run_list)); + kfree(tsk->fibril); + tsk->fibril = NULL; + } +} + +/* + * Initial asys call stacks are constructed such that this is called when + * the system call handler returns. It records the return code from + * the handler in a completion event and frees data associated with the + * completed asys call. + * + * XXX we know that the x86 syscall handlers put their return code in eax and + * that regparm(3) here will take our rc argument from eax. + */ +static void fastcall NORET_TYPE asys_teardown_stack(long rc) +{ + struct asys_result *res; + struct asys_call *call; + struct fibril *fibril; + + fibril = current->fibril; + call = container_of(fibril, struct asys_call, fibril); + res = call->result; + call->result = NULL; + + res->comp.return_code = rc; + list_add_tail(&res->item, ¤t->asys_completed); + wake_up(¤t->asys_wait); + + /* + * We embedded the fibril in the call so that we could dereference it + * here without adding some tracking to the fibril. We then free the + * call and fibril because we're done with them. + * + * The ti itself, though, is still in use. It will only be freed once + * the scheduler switches away from it to another fibril. It does + * that when it sees current->fibril assigned to NULL. + */ + current->fibril = NULL; + BUG_ON(!list_empty(&fibril->run_list)); + kfree(call); + + /* + * XXX This is sloppy. We "know" this is likely for now as the task + * with fibrils is only going to be in sys_asys_submit() or + * sys_asys_complete() + */ + BUG_ON(list_empty(¤t->runnable_fibrils)); + + schedule(); + BUG(); +} + +asmlinkage long sys_asys_await_completion(struct asys_completion __user *comp) +{ + struct asys_result *res; + long ret; + + ret = wait_event_interruptible(current->asys_wait, + !list_empty(¤t->asys_completed)); + if (ret) + goto out; + + res = list_entry(current->asys_completed.next, struct asys_result, + item); + + /* XXX compat */ + ret = copy_to_user(comp, &res->comp, sizeof(struct asys_completion)); + if (ret) { + ret = -EFAULT; + goto out; + } + + list_del(&res->item); + kfree(res); + ret = 1; + +out: + return ret; +} + +/* + * This initializes a newly allocated fibril so that it can be handed to the + * scheduler. The fibril is private to this code path at this point. + * + * XXX + * - this is arch specific + * - should maybe have a sched helper that uses INIT_PER_CALL_CHAIN + */ +extern unsigned long sys_call_table[]; /* XXX */ +static int asys_init_fibril(struct fibril *fibril, struct thread_info *ti, + struct asys_input *inp) +{ + unsigned long *stack_bottom; + + INIT_LIST_HEAD(&fibril->run_list); + fibril->ti = ti; + + /* XXX sanity check syscall_nr */ + fibril->eip = sys_call_table[inp->syscall_nr]; + /* this mirrors copy_thread()'s use of task_pt_regs() */ + fibril->esp = (unsigned long)thread_info_pt_regs(ti) - + ((inp->nr_args + 1) * sizeof(long)); + + /* + * now setup the stack so that our syscall handler gets its arguments + *
...
> This very rough patch series introduces a different way to provide AIO support > for system calls.
Yee-haa!
I looked at this approach a long time ago, and basically gave up because it looked like too much work.
I heartily approve, although I only gave the actual patches a very cursory glance. I think the approach is the proper one, but the devil is in the details. It might be that the stack allocation overhead or some other subtle fundamental problem ends up making this impractical in the end, but I would _really_ like for this to basically go in.
One of the biggest issues I see is signalling. Your mention it, and I think that's going to be one of the big issues.
It won't matter at all for a certain class of calls (a lot of the people who want to do AIO really end up doing non-interruptible things, and signalling is a non-issue), but not only is it going to matter for some others, we will almost certainly want to have a way to not just signal a task, but a single "fibril" (and let me say that I'm not convinced about your naming, but I don't hate it either ;)
But from a quick overview of the patches, I really don't see anything fundamentally wrong. It needs some error checking and some limiting (I _really_ don't think we want a regular user starting a thousand fibrils concurrently), but it actually looks much less invasive than I thought it would be.
So yay! Consider me at least preliminarily very happy.
> But from a quick overview of the patches, I really don't see anything > fundamentally wrong. It needs some error checking and some limiting (I > _really_ don't think we want a regular user starting a thousand fibrils > concurrently), but it actually looks much less invasive than I thought it > would be.
Side note (and maybe this was obvious to people already): I would suggest that the "limiting" not be done the way fork() is limited ("return EAGAIN if you go over a limit") but be done as a per-task counting semaphore (down() on submit, up() on fibril exit).
So we should limit these to basically have some maximum concurrency factor, but rather than consider it an error to go over it, we'd just cap the concurrency by default, so that people can freely use asynchronous interfaces without having to always worry about what happens if their resources run out..
However, that also implies that we should probably add a "flags" parameter to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or something to tell the kernel to rather return EAGAIN than wait. Sometimes you don't want to block just because you already have too much work.
> I looked at this approach a long time ago, and basically gave up > because > it looked like too much work.
Indeed, your mention of it in that thread.. a year ago?.. is what got this notion sitting in the back of my head. I didn't like it at first, but it grew on me.
> I heartily approve, although I only gave the actual patches a very > cursory > glance. I think the approach is the proper one, but the devil is in > the > details. It might be that the stack allocation overhead or some other > subtle fundamental problem ends up making this impractical in the > end, but > I would _really_ like for this to basically go in.
As for efficiency and overhead, I hope to get some time with the team that work on the Giant Database Software Whose Name We Shall Not Speak. That'll give us some non-trival loads to profile.
> It won't matter at all for a certain class of calls (a lot of the > people > who want to do AIO really end up doing non-interruptible things, and > signalling is a non-issue), but not only is it going to matter for > some > others, we will almost certainly want to have a way to not just > signal a > task, but a single "fibril" (and let me say that I'm not convinced > about > your naming, but I don't hate it either ;)
Yeah, no doubt. I'm wildly open to discussion here. (and yeah, me either, but I don't care much about the name. I got tired of qualifying overloaded uses of 'stack' or 'thread', that's all :)).
> But from a quick overview of the patches, I really don't see anything > fundamentally wrong. It needs some error checking and some limiting (I > _really_ don't think we want a regular user starting a thousand > fibrils > concurrently), but it actually looks much less invasive than I > thought it > would be.
I think we'll also want to flesh out the submission and completion interface so that we don't find ourselves frustrated with it in another 5 years. What's there now is just scaffolding to support the interesting kernel-internal part. No doubt the kevent thread will come into play here.
> So we should limit these to basically have some maximum concurrency > factor, but rather than consider it an error to go over it, we'd > just cap > the concurrency by default, so that people can freely use asynchronous > interfaces without having to always worry about what happens if their > resources run out..
Yeah, call it the socket transmit queue model :). Maybe tuned by a ulimit?
I don't have very strong opinions abou the specific mechanics of limiting concurrent submissions, as long as they're there. Some folks in Oracle complain about having one more thing to have to tune, but the alternative seems worse.
> However, that also implies that we should probably add a "flags" > parameter > to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or > something to tell the kernel to rather return EAGAIN than wait. > Sometimes > you don't want to block just because you already have too much work.
EAGAIN or the initial number of submissions completed before the one that ran over the limit, perhaps. Sure. Nothing too controversial here :). I have this kind of stuff queued up for worrying about once the internal mechanics are stronger.
> I think we'll also want to flesh out the submission and completion interface > so that we don't find ourselves frustrated with it in another 5 years. What's > there now is just scaffolding to support the interesting kernel-internal part. > No doubt the kevent thread will come into play here.
Actually, the thing I like about kernel micro-threads (and that "fibril" name is starting to grow on me) is that I'm hoping we might be able to use them for that kevent thing too. And perhaps some other issues (ACPI has some "events" that might work with synchronously scheduled threads).
IOW, synchronous threading does have its advantages..
Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just because he's basically the normal scheduler maintainer, but also because he's historically been involved in things like the async filename lookup that the in-kernel web server thing used. EXACTLY the kinds of things where fibrils actually give you *much* nicer interfaces.
Ingo - see linux-kernel for the announcement and WIP patches.
> Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not > just > because he's basically the normal scheduler maintainer, but also > because > he's historically been involved in things like the async filename > lookup > that the in-kernel web server thing used.
Yeah, that was dumb. I had him in the cc: initially, then thought it was too large and lobbed a bunch off. My mistake.
Ingo, I'm interested in your reaction to the i386-specific mechanics here (the thread_info copies terrify me) and the general notion of how to tie this cleanly into the scheduler.
> - We would now have some measure of task_struct concurrency. Read that twice, > it's scary. As two fibrils execute and block in turn they'll each be > referencing current->. It means that we need to audit task_struct to make sure > that paths can handle racing as its scheduled away. The current implementation > *does not* let preemption trigger a fibril switch. So one only has to worry > about racing with voluntary scheduling of the fibril paths. This can mean > moving some task_struct members under an accessor that hides them in a struct > in task_struct so they're switched along with the fibril. I think this is a > manageable burden.
That's the one scaring me in fact ... Maybe it will end up being an easy one but I don't feel too comfortable... we didn't create fibril-like things for threads, instead, we share PIDs between tasks. I wonder if the sane approach would be to actually create task structs (or have a pool of them pre-created sitting there for performances) and add a way to share the necessary bits so that syscalls can be run on those spin-offs.
On Tue, 2007-01-30 at 15:45 -0800, Zach Brown wrote: > > Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not > > just > > because he's basically the normal scheduler maintainer, but also > > because > > he's historically been involved in things like the async filename > > lookup > > that the in-kernel web server thing used.
> Yeah, that was dumb. I had him in the cc: initially, then thought it > was too large and lobbed a bunch off. My mistake.
> Ingo, I'm interested in your reaction to the i386-specific mechanics > here (the thread_info copies terrify me) and the general notion of > how to tie this cleanly into the scheduler.
Thread info copies aren't such a big deal, we do that for irq stacks already afaik
On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote: > > - We would now have some measure of task_struct concurrency. Read that twice, > > it's scary. As two fibrils execute and block in turn they'll each be > > referencing current->. It means that we need to audit task_struct to make sure > > that paths can handle racing as its scheduled away. The current implementation > > *does not* let preemption trigger a fibril switch. So one only has to worry > > about racing with voluntary scheduling of the fibril paths. This can mean > > moving some task_struct members under an accessor that hides them in a struct > > in task_struct so they're switched along with the fibril. I think this is a > > manageable burden.
> That's the one scaring me in fact ... Maybe it will end up being an easy > one but I don't feel too comfortable...
We actually have almost zero "interesting" data in the task-struct.
All the real meat of a task has long since been split up into structures that can be shared for threading anyway (ie signal/files/mm/etc).
Which is why I'm personally very comfy with just re-using task_struct as-is.
NOTE! This is with the understanding that we *never* do any preemption. The whole point of the microthreading as far as I'm concerned is exactly that it is cooperative. It's not preemptive, and it's emphatically *not* concurrent (ie you'd never have two fibrils running at the same time on separate CPU's).
If you want preemptive of concurrent CPU usage, you use separate threads. The point of AIO scheduling is very much inherent in its name: it's for filling up CPU's when there's IO.
So the theory (and largely practice) is that you want to use real threads to fill your CPU's, but then *within* those threads you use AIO to make sure that each thread actually uses the CPU efficiently and doesn't just block with nothing to do.
So with the understanding that this is neither CPU-concurrent nor preemptive (*within* a fibril group - obviously the thread itself gets both preempted and concurrently run with other threads), I don't worry at all about sharing "struct task_struct".
Does that mean that we might not have some cases where we'd need to make sure we do things differently? Of course not. Something migt show up. But this actually makes it very clear what the difference between "struct thread_struct" and "struct task_struct" are. One is shared between fibrils, the other isn't.
> Does that mean that we might not have some cases where we'd need to make > sure we do things differently? Of course not. Something migt show up. But > this actually makes it very clear what the difference between "struct > thread_struct" and "struct task_struct" are. One is shared between > fibrils, the other isn't.
Btw, this is also something where we should just disallow certain system calls from being done through the asynchronous method.
Notably, clone/fork(), execve() and exit() are all things that we probably simply shouldn't allow as "AIO" events.
The process handling ones are obvious: they are very much about the shared "struct task_struct", so they rather clearly should only done "natively".
More interesting is the question about "close()", though. Currently we have an optimization (fget/fput_light) that basically boils down to "we know we are the only owners". That optimization becomes more "interesting" with AIO - we need to disable it when fibrils are active (because other fibrils or the main thread can do it), but we can still keep it for the non-fibril case.
So we can certainly allow close() to happen in a fibril, but at the same time, this is an area where just the *existence* of fibrils means that certain other decisions that were thread-related may be modified to be aware of the micro-threads too.
I suspect there are rather few of those, though. The only one I can think of is literally the fget/fput_light() case, but there could be others.
> NOTE! This is with the understanding that we *never* do any preemption. > The whole point of the microthreading as far as I'm concerned is exactly > that it is cooperative. It's not preemptive, and it's emphatically *not* > concurrent (ie you'd never have two fibrils running at the same time on > separate CPU's).
That makes it indeed much less worrisome...
> If you want preemptive of concurrent CPU usage, you use separate threads. > The point of AIO scheduling is very much inherent in its name: it's for > filling up CPU's when there's IO.
Ok, I see, that's in fact pretty similar to some task switching hack I did about 10 years ago on MacOS to have "asynchronous" IO code be implemented linearily :-)
> On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:
>>>- We would now have some measure of task_struct concurrency. Read that twice, >>>it's scary. As two fibrils execute and block in turn they'll each be >>>referencing current->. It means that we need to audit task_struct to make sure >>>that paths can handle racing as its scheduled away. The current implementation >>>*does not* let preemption trigger a fibril switch. So one only has to worry >>>about racing with voluntary scheduling of the fibril paths. This can mean >>>moving some task_struct members under an accessor that hides them in a struct >>>in task_struct so they're switched along with the fibril. I think this is a >>>manageable burden.
>>That's the one scaring me in fact ... Maybe it will end up being an easy >>one but I don't feel too comfortable...
> We actually have almost zero "interesting" data in the task-struct.
> All the real meat of a task has long since been split up into structures > that can be shared for threading anyway (ie signal/files/mm/etc).
> Which is why I'm personally very comfy with just re-using task_struct > as-is.
> NOTE! This is with the understanding that we *never* do any preemption. > The whole point of the microthreading as far as I'm concerned is exactly > that it is cooperative. It's not preemptive, and it's emphatically *not* > concurrent (ie you'd never have two fibrils running at the same time on > separate CPU's).
So using stacks to hold state is (IMO) the logical choice to do async syscalls, especially once you have a look at some of the other AIO stuff going around.
I always thought that the AIO people didn't do this because they wanted to avoid context switch overhead.
So now if we introduce the context switch overhead back, why do we need just another scheduling primitive? What's so bad about using threads? The upside is that almost everything is already there and working, and also they don't have any of these preemption or concurrency restrictions.
The only thing I saw in Zach's post against the use of threads is that some kernel API would change. But surely if this is the showstopper then there must be some better argument than sys_getpid()?!
Aside from that, I'm glad that someone is looking at this way for AIO, because I really don't like some aspects in the other approach.
>> On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:
>>>> - We would now have some measure of task_struct concurrency. Read >>>> that twice, >>>> it's scary. As two fibrils execute and block in turn they'll each be >>>> referencing current->. It means that we need to audit task_struct >>>> to make sure >>>> that paths can handle racing as its scheduled away. The current >>>> implementation >>>> *does not* let preemption trigger a fibril switch. So one only has >>>> to worry >>>> about racing with voluntary scheduling of the fibril paths. This >>>> can mean >>>> moving some task_struct members under an accessor that hides them in >>>> a struct >>>> in task_struct so they're switched along with the fibril. I think >>>> this is a >>>> manageable burden.
>>> That's the one scaring me in fact ... Maybe it will end up being an easy >>> one but I don't feel too comfortable...
>> We actually have almost zero "interesting" data in the task-struct.
>> All the real meat of a task has long since been split up into >> structures that can be shared for threading anyway (ie >> signal/files/mm/etc).
>> Which is why I'm personally very comfy with just re-using task_struct >> as-is.
>> NOTE! This is with the understanding that we *never* do any >> preemption. The whole point of the microthreading as far as I'm >> concerned is exactly that it is cooperative. It's not preemptive, and >> it's emphatically *not* concurrent (ie you'd never have two fibrils >> running at the same time on separate CPU's).
> So using stacks to hold state is (IMO) the logical choice to do async > syscalls, especially once you have a look at some of the other AIO > stuff going around.
> I always thought that the AIO people didn't do this because they wanted > to avoid context switch overhead.
> So now if we introduce the context switch overhead back, why do we need > just another scheduling primitive? What's so bad about using threads? The > upside is that almost everything is already there and working, and also > they don't have any of these preemption or concurrency restrictions.
In other words, while I share the appreciation for this clever trick of using cooperative switching between these little thriblets, I don't actually feel it is very elegant to then have to change the kernel so much in order to handle them.
I would be fascinated to see where such a big advantage comes from using these rather than threads. Maybe we can then improve threads not to suck so much and everybody wins.
> I always thought that the AIO people didn't do this because they wanted > to avoid context switch overhead.
I don't think that scheduling overhead was ever a really the reason, at least not the primary one, and at least not on Linux. Sure, we can probably make cooperative thread switching a bit faster than even VM-sharing thread switching (maybe), but it's not going to be *that* big an issue.
Ifaik, the bigger issues were about setup costs (but also purely semantic - it was hard to do AIO semantics with threads).
And memory costs. The "one stack page per outstanding AIO" may end up still being too expensive, but threads were even more so.
[ Of course, that used to also be the claim by all the people who thought we couldn't do native kernel threads for "normal" threading either, and should go with the n*m setup. Shows how much they knew ;^]
But I've certainly _personally_ always wanted to do AIO with threads. I wanted to do it with regular threads (ie the "clone()" kind). It didn't fly. But I think we can possibly lower both the setup costs and the memory costs with the cooperative approach, to the point where maybe this one is more palatable and workable.
And maybe it also solves some of the scalability worries (threads have ID space and scheduling setup things that essentially go away by just not doing them - which is what the fibrils simply wouldn't have).
(Sadly, some of the people who really _use_ AIO are the database people, and they really only care about a particularly stupid and trivial case: pure reads and writes. A lot of other loads care about much more complex things: filename lookups etc, that traditional AIO cannot do at all, and that you really want something more thread-like for. But those other loads get kind of swamped by the DB needs, which are might tighter and trivial enough that you don't "need" a real thread for them).
Zach Brown <zach.br...@oracle.com> writes: > This finally does something useful with the notion of being able to schedule > stacks as fibrils under a task_struct. Again, i386-specific and in need of > proper layering with archs.
> sys_asys_submit() is added to let userspace submit asynchronous system calls. > It specifies the system call number and arguments. A fibril is constructed for > each call. Each starts with a stack which executes the given system call > handler and then returns to a function which records the return code of the > system call handler. sys_asys_await_completion() then lets userspace collect > these results.
Do you have any numbers how this compares cycle wise to just doing clone+syscall+exit in user space?
If the difference is not too big might it be easier to fix clone+syscall to be more efficient than teach all the rest of the kernel about fibrils?
> [ Of course, that used to also be the claim by all the people who > thought we couldn't do native kernel threads for "normal" threading > either, and should go with the n*m setup. Shows how much they knew > ;^]
> But I've certainly _personally_ always wanted to do AIO with threads. > I wanted to do it with regular threads (ie the "clone()" kind). It > didn't fly. But I think we can possibly lower both the setup costs and > the memory costs with the cooperative approach, to the point where > maybe this one is more palatable and workable.
as you know i've been involved in all the affected IO and scheduling disciplines in question: i hacked on scheduling, 1:1 threading, on Tux, i even optimized kaio under TPC workloads, and various other things. So i believe to have seen most of these things first-hand and thus i (pretend to) have no particular prejudice or other subjective preference. In the following, rather long (and boring) analysis i touch upon both 1:1 threading, light threads, state machines and AIO.
The first and most important thing is that i think there are only /two/ basic, fundamental IO approaches that matter to kernel design:
1) the most easy to program one.
2) the fastest one.
1), the most easily programmed model: this is tasks and synchronous IO. Samba (and Squid, and even Apache most of the time ...) is still using plain old /processes/ and is doing fine! Most of the old LWP arguments are not really true anymore, TLB switches are very fast meanwhile (per hardware improvements) and our scheduler is still pretty light. Furthermore, for certain specialized environments that are able isolate the programmer from the risks and complexities of thread programming, threading can be somewhat faster (such as Java or C#). But for the general C/C++ based server environment even threading is pure hell. (We know it from the kernel that programming a threaded design is /hard/, needs alot of discipline and takes alot more resources than any of the saner models.)
2) the fastest model: is a pure, minimal state-machine tailored to the task/workload we want to do. Tux does that (and i'm keeping it uptodate for current 2.6 kernels too), and it still beats the pants off anything user-space.
But reality is that most people care more about programmability: 300 MB/sec or 500 MB/sec web throughput limit (or a serving limit 30K requests in parallel or 60K requests in parallel - whatever the performance metric you care about is) isnt really making any difference in all but the most specialized environments - /because/ the price to pay for it is much worse programmability (and hence much worse flexibility). [ And this is all about basic economy: if it's 10 times harder to program something and what takes 1 month to program in one model takes 10 months to program in the other model - but in 10 months the hardware already got another 50% faster, likely offsetting the advantage of that 'other' model to begin with ... ]
Note that often the speed difference between task based and state based designs is alot smaller. But God is the programming complexity difference huge. Not even huge but /HUGE/. All our programming languages are procedural and stack based. (Not only is there a basic lack of state-based programming languages, it's a product of our brain structure: state machines are not really what the human mind is structured for, but i digress.)
Now where do all these LWP, fibre, firbril, micro-thread or N:M concepts fit? Most of the time they are just a /weakening/ of the #1 concept. And that's why they will lose out, because #1 is all about programmability and they dont offer anything new: because they cannot. Either you go for programmability or you go for performance. There is /no/ middle ground for us in the kernel! (User-space can do the middle ground thing, but the kernel must only be structured around the two most fundamental concepts that exist. [NOTE: more about KAIO later.])
Having a 1:1 relationship between user-space and kernel-space context is a strong concept, and on modern CPUs we perform /process/ context switches in 650 nsecs, we do user thread context switches in 500 nsecs, and kernel<->kernel thread context switches in 350 nsecs. That's pretty damn fast.
The big problem is, the M:N concepts (and i consider micro-threads or 'schedulable stacks' to be of that type) tend to operate the following way: they build up hype by being "faster" - but the performance advantage only comes from weakening #1! (for example by not making the contexts generally schedulable, bindable, load-balancable, suspendable, migratable, etc.) But then a few years later they grow (by virtue of basic pressure from programmers, who /want/ a clean easily programmable concept #1) the /very same/ features that they "optimized away" in the beginning. With the difference that now all that infrastructure they left out initially is replicated in user-space, in a poorer and inconsistent way, blowing up cache-size, and slowing the /sane/ things down in the end. (and then i havent even begun about the nightmare of extending the debug infrastructure to light threads, ABI dependencies, etc, etc...)
One often repeated (because pretty much only) performance advantage of 'light threads' is context-switch performance between user-space threads. But reality is, nobody /cares/ about being able to context-switch between "light user-space threads"! Why? Because there are only two reasons why such a high-performance context-switch would occur:
1) there's contention between those two tasks. Wonderful: now two artificial threads are running on the /same/ CPU and they are even contending each other. Why not run a single context on a single CPU instead and only get contended if /another/ CPU runs a conflicting context?? While this makes for nice "pthread locking benchmarks", it is not really useful for anything real.
2) there has been an IO event. The thing is, for IO events we enter the kernel no matter what - and we'll do so for the next 10 years at minimum. We want to abstract away the hardware, we want to do reliable resource accounting, we want to share hardware resources, we want to rate-limit, etc., etc. While in /theory/ you could handle IO purely from user-space, in practice you dont want to do that. And if we accept the premise that we'll enter the kernel anyway, there's zero performance difference between scheduling right there in the kernel, or returning back to user-space to schedule there. (in fact i submit that the former is faster). Or if we accept the theoretical possibility of 'perfect IO hardware' that implements /all/ the features that the kernel wants (in a secure and generic way, and mind you, such IO hardware does not exist yet), then /at most/ the performance advantage of user-space doing the scheduling is the overhead of a null syscall entry. Which is a whopping 100 nsecs on modern CPUs! That's roughly the latency of a /single/ DRAM access!
Furthermore, 'light thread' concepts can no way approach the performance of #2 state-machines: if you /know/ what the structure of your context is, and you can program it in a specialized state-machine way, there's just so many shortcuts possible that it's not even funny.
> And maybe it also solves some of the scalability worries (threads have > ID space and scheduling setup things that essentially go away by just > not doing them - which is what the fibrils simply wouldn't have).
our PID management is very fast - i've never seen it show up in profiles. (We could make it even more scalable if the need arises, for example right now we still maintain the luxory of having a globally increasing and tightly compressed PID/TID space.)
until a few days ago we even had a /global lock/ in the thread/task creation and destruction hotpath since 2.5.43 (when oprofile support was added, more than 4 years ago, the oprofile exit notifier lock) and nobody really noticed or cared!
so i believe there are only two directions that make sense in the long run:
1) improve our basic #1 design gradually. If something is a bottleneck, if the scheduler has grown too fat, cut some slack. If micro-threads or fibrils offer anything nice for our basic thread model: integrate it into the kernel. (Dont worry about having the enter the kernel for that, we /want/ that isolation! Resource control /needs/ a secure and trustable context. Good debuggability /needs/ a secure and trustable context. Hardware makers will make damn sure that such hardware-based isolation is fast as lightning. SYSENTER will continue to be very fast.) In any case, unless someone can answer the issues i raised above, please dont mess with the basic 1:1 task model we have. It's really what we want to have.
2) enable crazy people to do IO in the #2 way. For this i think the most programmable interface is KAIO - because the 'context' /only/ involves the IO entity itself, which is simple enough for our brain to wrap itself around. (and the hardware itself has the concept of 'in flight IO' anyway, so people are forced to learn about it and to deal with it anyway, even in the synchronous IO model. So there's quite some mindshare to build upon.) This also happens to realize /most/ of the performance advantages that a state-machine like Tux tends to have. (In that sense i dont see epoll to be conceptually different from KAIO, although internally within the kernel it's quite a bit different.)
now, KAIO 'behind the hood' (within the kernel) is like /very/ hard to program in a fully state-driven manner - but we are crazy folks who /might/ be able to pull it off. Initially we should just simulate the really hard bits via a pool of in-kernel
...
On Tue, 2007-01-30 at 19:02 -0800, Linus Torvalds wrote: > Btw, this is also something where we should just disallow certain system > calls from being done through the asynchronous method.
Does that mean that doing an AIO-disabled syscall will wait for all in- flight AIO syscalls to finish ?
On Wednesday 31 January 2007 18:15, Zach Brown wrote:
> On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
> > Do you have any numbers how this compares cycle wise to just doing > > clone+syscall+exit in user space?
> Not yet, no. Release early, release often, and all that. I'll throw > something together.
So what was the motivation for doing this then? It's only point is to have smaller startup costs for AIO than clone+fork without fixing the VFS code to be a state machine, right?
I'm personally unclear if it's really less work to teach a lot of code in the kernel about a new thread abstraction than changing VFS.
Your patches don't look that complicated yet but you openly admitted you waved away many of the more tricky issues (like signals etc.) and I bet there are yet-unknown side effects of this too that will need more changes.
I would expect a VFS solution to be the fastest of any at least.
I'm not sure the fibrils thing will be that much faster than a possibly somewhat fast pathed for this case clone+syscall+exit.
>> - We would now have some measure of task_struct concurrency. Read >> that twice, >> it's scary. > That's the one scaring me in fact ... Maybe it will end up being an > easy > one but I don't feel too comfortable...
Indeed, that was my first reaction too. I dismissed the idea for a good six months after initially realizing that it implied sharing journal_info, etc.
But when I finally sat down and started digging through the task_struct members and, after quickly dismissing involuntary preemption of the fibrils, it didn't seem so bad. I haven't done an exhaustive audit yet (and I won't advocate merging until I have) but I haven't seen any train wrecks.
> we didn't create fibril-like > things for threads, instead, we share PIDs between tasks. I wonder if > the sane approach would be to actually create task structs (or have a > pool of them pre-created sitting there for performances) and add a way > to share the necessary bits so that syscalls can be run on those > spin-offs.
Maybe, if it comes to that. I have some hopes that sharing by default and explicitly marking the bits that we shouldn't share will be good enough.
> Does that mean that we might not have some cases where we'd need to > make > sure we do things differently? Of course not. Something migt show up.
Might, and has. For a good time, take journal_info out of per_call_chain() in the patch set and watch it helpfully and loudly wet itself. There really really are bits of thread_struct which are strictly thread-local-storage, of a sort, for a kernel call path. Sharing them, even only through cooperate scheduling, is fatal. link_count is another obvious one.
They're also the only ones I've bothered to discover so far :).
> But > this actually makes it very clear what the difference between "struct > thread_struct" and "struct task_struct" are. One is shared between > fibrils, the other isn't.
Indeed.
Right now the per-fibril uses of task_struct members are left inline in task_struct and are copied on fibril switches.
We *could* put them in thread_info, at the cost of stack pressure, or hang them off task_struct in their own struct to avoid the copies, at the cost of indirection. I didn't like imposing a cost on paths that don't use fibrils, though, so I left them inline.
(I think you know all this. I'm clarifying for the peanut gallery, I hope.)
On Wed, Jan 31, 2007 at 09:38:11AM -0800, Zach Brown wrote: > Indeed, that was my first reaction too. I dismissed the idea for a > good six months after initially realizing that it implied sharing > journal_info, etc.
> But when I finally sat down and started digging through the > task_struct members and, after quickly dismissing involuntary > preemption of the fibrils, it didn't seem so bad. I haven't done an > exhaustive audit yet (and I won't advocate merging until I have) but > I haven't seen any train wrecks.
I'm still of the opinion that you cannot do this without creating actual threads. That said, there is room for setting up the task_struct beforehand without linking it into the system lists. The reason I don't think this approach works (and I looked at it a few times) is that many things end up requiring special handling: things like permissions, signals, FPU state, segment registers.... The list ends up looking exactly the way task_struct is, making the actual savings very small.
What the fibrils approach is useful for is the launching of the thread initially, as you don't have to retain things like the current FPU state, change segment registers, etc. Changing the stack is cheap, the rest of the work is not.
-ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <d...@kvack.org>. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
> Btw, this is also something where we should just disallow certain > system > calls from being done through the asynchronous method.
Yeah. Maybe just a bitmap built from __NR_ constants? I don't know if we can do it in a way that doesn't require arch maintainer's attention.
It seems like it would be nice to avoid putting a test in the handlers themselves, and leave it up to the aio syscall submission processing.
> More interesting is the question about "close()", though. Currently we > have an optimization (fget/fput_light) that basically boils down to > "we > know we are the only owners". That optimization becomes more > "interesting" > with AIO - we need to disable it when fibrils are active (because > other > fibrils or the main thread can do it), but we can still keep it for > the > non-fibril case.