
Pending AIO work/patches


Suparna Bhattacharya

Jun 20, 2005, 8:00:13 AM

AIO development is gaining momentum once again: ocfs2 and samba
both appear to be using AIO, NFS needs async semaphores, and so on,
so there is renewed interest in straightening out some of the
pending work in this area. This seems like a good time to re-post
some of those patches for discussion and decision.

Just to help sync up, here is an initial list based on the pieces
that have been in progress with patches in existence (please feel free
to add/update ones I missed or reflected inaccurately here):

(1) Updating AIO to use wait-bit based filtered wakeups (me/wli)
Status: Updated to 2.6.12-rc6, needs review
(2) Buffered filesystem AIO read/write (me/Ben)
Status: aio write: Updated to 2.6.12-rc6, needs review
Status: aio read : Needs rework against readahead changes in mainline
(3) POSIX AIO support (Bull: Laurent/Sébastien or Oracle: Joel)
Status: Needs review and discussion ?
(4) AIO for pipes (Chris Mason)
Status: Needs update to latest kernels
(5) Asynchronous semaphore implementation (Ben/Trond?)
Status: Posted - under development & discussion
(6) epoll - AIO integration (Zach Brown/Feng Zhou/wli)
Status: Needs resurrection ?
(7) Vector AIO (aio readv/writev) (Yasushi Saito)
Status: Needs resurrection ?

On my part, I'll start by re-posting (1) for discussion, and then
move to (2).

Regards
Suparna

--
Suparna Bhattacharya (sup...@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Trond Myklebust

Jun 20, 2005, 9:20:08 AM
On Mon, 20.06.2005 at 17:31 (+0530), Suparna Bhattacharya wrote:

> (5) Asynchronous semaphore implementation (Ben/Trond?)
> Status: Posted - under development & discussion

I'm working on something that attempts to define per-arch low-level
primitives (essentially wrappers to the current arch specific code). I
expect to post something in the next 2 weeks (or at least before the
kernel summit)...

Cheers,
Trond

Sébastien Dugué

Jun 20, 2005, 10:40:11 AM
On Mon, 2005-06-20 at 17:31 +0530, Suparna Bhattacharya wrote:
> (3) POSIX AIO support (Bull: Laurent/Sebastian or Oracle: Joel)
> Status: Needs review and discussion ?

I'm currently running some benchmarks (sysbench + MySQL) using AIO,
results will be available soon.

I will also release libposix-aio V0.5 at the same time.

(2) will surely help us clean up our code.

Sébastien.

--
------------------------------------------------------

Sébastien Dugué BULL/FREC:B1-247
phone: (+33) 476 29 77 70 Bullcom: 229-7770

mailto:sebasti...@bull.net

Linux POSIX AIO: http://www.bullopensource.org/posix

------------------------------------------------------

Suparna Bhattacharya

Jun 20, 2005, 12:00:16 PM
On Mon, Jun 20, 2005 at 05:31:54PM +0530, Suparna Bhattacharya wrote:
> Since AIO development is gaining momentum once again, ocfs2 and
> samba both appear to be using AIO, NFS needs async semaphores etc,
> there appears to be an increase in interest in straightening out some
> of the pending work in this area. So this seems like a good
> time to re-post some of those patches for discussion and decision.
>
> Just to help sync up, here is an initial list based on the pieces
> that have been in progress with patches in existence (please feel free
> to add/update ones I missed or reflected inaccurately here):
>
> (1) Updating AIO to use wait-bit based filtered wakeups (me/wli)
> Status: Updated to 2.6.12-rc6, needs review

Here is a little bit of background on the motivation behind this set of
patches to update AIO for filtered wakeups:

(a) Since the introduction of filtered wakeups support and
the wait_bit_queue infrastructure in mainline, it is no longer
sufficient to just embed a wait queue entry in the kiocb
for AIO operations involving filtered wakeups.
(b) Given that filesystem reads/writes use filtered wakeups underlying
wait_on_page_bit, fixing this becomes a pre-req for buffered
filesystem AIO.
(c) The wait_bit_queue infrastructure actually enables a cleaner
implementation of filesystem AIO because it already provides
for an action routine intended to allow both blocking and
non-blocking or asynchronous behaviour.

While rewriting the patches to address this, I made one other
change to resolve a remaining ugliness in my earlier patchsets:
special-casing of the form

	if (wait == NULL)
		wait = &local_wait

to switch to a stack-based wait queue entry when not passed a wait
queue entry associated with an iocb.

To avoid this, I have tried biting the bullet by including a default
wait bit queue entry in the task structure, to be used instead of
on-demand allocation of a wait bit queue entry on the stack.

All in all, these changes have (hopefully) simplified the code,
as well as made it more up-to-date. Comments (including
better names etc., as requested by Zach) are welcome!
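The action-routine convention described in (c) can be sketched as a small
standalone userspace model. This is illustrative only, not kernel code;
names such as `is_sync_wait` and `sync_page_action` merely mirror the
kernel's naming, and the `EIOCBRETRY` value is a stand-in:

```c
/*
 * Userspace model (not kernel code) of the wait-bit action routine
 * discussed above: the action receives the wait queue entry, so it
 * can return -EIOCBRETRY instead of blocking when the entry marks
 * an asynchronous (AIO) wait.
 */
#include <assert.h>
#include <stddef.h>

#define EIOCBRETRY 529  /* stand-in for the kernel's -EIOCBRETRY */

struct wait_entry {
	int is_async;   /* set when the entry belongs to an iocb */
};

/* Sync waits use the task's default entry; AIO passes the iocb's. */
static int is_sync_wait(const struct wait_entry *wait)
{
	return !wait->is_async;
}

/* Action routine in the new two-argument form: block for sync
 * callers, bail out with -EIOCBRETRY for async ones so the caller
 * can retry on wakeup notification instead of sleeping. */
static int sync_page_action(void *word, struct wait_entry *wait)
{
	(void)word;
	if (!is_sync_wait(wait))
		return -EIOCBRETRY;
	/* a real implementation would io_schedule() here */
	return 0;
}
```

The point of the model is only the shape of the decision: one action
routine serves both blocking and non-blocking callers, keyed off the
wait entry it is handed.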

Suparna Bhattacharya

Jun 20, 2005, 12:20:15 PM
On Mon, Jun 20, 2005 at 09:31:26PM +0530, Suparna Bhattacharya wrote:

> Here is a little bit of background on the motivation behind this set of
> patches to update AIO for filtered wakeups:
>
> (a) Since the introduction of filtered wakeups support and
> the wait_bit_queue infrastructure in mainline, it is no longer
> sufficient to just embed a wait queue entry in the kiocb
> for AIO operations involving filtered wakeups.
> (b) Given that filesystem reads/writes use filtered wakeups underlying
> wait_on_page_bit, fixing this becomes a pre-req for buffered
> filesystem AIO.
> (c) The wait_bit_queue infrastructure actually enables a cleaner
> implementation of filesystem AIO because it already provides
> for an action routine intended to allow both blocking and
> non-blocking or asynchronous behaviour.
>

>> Updated to apply to 2.6.12-rc6.

Add a wait queue parameter to the action routine called by
__wait_on_bit to allow it to determine whether to block or
not.

Signed-off-by: Suparna Bhattacharya <sup...@in.ibm.com>


fs/buffer.c | 2 +-
kernel/wait.c | 14 ++++++++------
mm/filemap.c | 2 +-
3 files changed, 10 insertions(+), 8 deletions(-)

diff -puN kernel/wait.c~modify-wait-bit-action-args kernel/wait.c
--- linux-2.6.9-rc1-mm4/kernel/wait.c~modify-wait-bit-action-args 2004-09-10 16:52:04.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/kernel/wait.c 2004-09-10 16:52:04.000000000 +0530
@@ -152,14 +152,14 @@ EXPORT_SYMBOL(wake_bit_function);
*/
int __sched fastcall
__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *), unsigned mode)
{
int ret = 0;

do {
prepare_to_wait(wq, &q->wait, mode);
if (test_bit(q->key.bit_nr, q->key.flags))
- ret = (*action)(q->key.flags);
+ ret = (*action)(q->key.flags, &q->wait);
} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
finish_wait(wq, &q->wait);
return ret;
@@ -167,7 +167,8 @@ EXPORT_SYMBOL(__wait_on_bit);
EXPORT_SYMBOL(__wait_on_bit);

int __sched fastcall out_of_line_wait_on_bit(void *word, int bit,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *),
+ unsigned mode)
{
wait_queue_head_t *wq = bit_waitqueue(word, bit);
DEFINE_WAIT_BIT(wait, word, bit);
@@ -178,14 +179,14 @@ EXPORT_SYMBOL(out_of_line_wait_on_bit);

int __sched fastcall
__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *), unsigned mode)
{
int ret = 0;

do {
prepare_to_wait_exclusive(wq, &q->wait, mode);
if (test_bit(q->key.bit_nr, q->key.flags)) {
- if ((ret = (*action)(q->key.flags)))
+ if ((ret = (*action)(q->key.flags, &q->wait)))
break;
}
} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
@@ -195,7 +196,8 @@ __wait_on_bit_lock(wait_queue_head_t *wq
EXPORT_SYMBOL(__wait_on_bit_lock);

int __sched fastcall out_of_line_wait_on_bit_lock(void *word, int bit,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *wait),
+ unsigned mode)
{
wait_queue_head_t *wq = bit_waitqueue(word, bit);
DEFINE_WAIT_BIT(wait, word, bit);
diff -puN mm/filemap.c~modify-wait-bit-action-args mm/filemap.c
--- linux-2.6.9-rc1-mm4/mm/filemap.c~modify-wait-bit-action-args 2004-09-10 16:52:04.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/mm/filemap.c 2004-09-10 16:52:04.000000000 +0530
@@ -133,7 +133,7 @@ void remove_from_page_cache(struct page
}
EXPORT_SYMBOL(remove_from_page_cache);

-static int sync_page(void *word)
+static int sync_page(void *word, wait_queue_t *wait)
{
struct address_space *mapping;
struct page *page;
diff -puN fs/buffer.c~modify-wait-bit-action-args fs/buffer.c
--- linux-2.6.9-rc1-mm4/fs/buffer.c~modify-wait-bit-action-args 2004-09-10 16:52:04.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/fs/buffer.c 2004-09-10 16:52:04.000000000 +0530
@@ -52,7 +52,7 @@ void wake_up_buffer(struct buffer_head *
bh->b_private = private;
}

-static int sync_buffer(void *word)
+static int sync_buffer(void *word, wait_queue_t *wait)
{
struct block_device *bd;
struct buffer_head *bh;

_
diff -urp linux-2.6.9-rc3/include/linux/wait.h linux-2.6.9-rc3-mm2/include/linux/wait.h
--- linux-2.6.9-rc3/include/linux/wait.h 2004-10-07 13:19:09.000000000 +0530
+++ linux-2.6.9-rc3-mm2/include/linux/wait.h 2004-10-07 12:04:17.000000000 +0530
@@ -140,11 +151,15 @@ void FASTCALL(__wake_up(wait_queue_head_
extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode));
extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
void FASTCALL(__wake_up_bit(wait_queue_head_t *, void *, int));
-int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned));
-int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned));
+int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *,
+ int (*)(void *, wait_queue_t *), unsigned));
+int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+ int (*)(void *, wait_queue_t *), unsigned));
void FASTCALL(wake_up_bit(void *, int));
-int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned));
-int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *), unsigned));
+int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *,
+ wait_queue_t *), unsigned));
+int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *,
+ wait_queue_t *), unsigned));
wait_queue_head_t *FASTCALL(bit_waitqueue(void *, int));

#define wake_up(x) __wake_up(x, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, NULL)
@@ -364,7 +392,8 @@ int wake_bit_function(wait_queue_t *wait
* but has no intention of setting it.
*/
static inline int wait_on_bit(void *word, int bit,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *),
+ unsigned mode)
{
if (!test_bit(bit, word))
return 0;
@@ -388,7 +417,8 @@ static inline int wait_on_bit(void *word
* clear with the intention of setting it, and when done, clearing it.
*/
static inline int wait_on_bit_lock(void *word, int bit,
- int (*action)(void *), unsigned mode)
+ int (*action)(void *, wait_queue_t *),
+ unsigned mode)
{
if (!test_and_set_bit(bit, word))
return 0;
diff -urp linux-2.6.9-rc3/include/linux/writeback.h linux-2.6.9-rc3-mm2/include/linux/writeback.h
--- linux-2.6.9-rc3/include/linux/writeback.h 2004-10-07 13:19:09.000000000 +0530
+++ linux-2.6.9-rc3-mm2/include/linux/writeback.h 2004-10-07 12:05:30.000000000 +0530
@@ -68,7 +68,7 @@ struct writeback_control {
*/
void writeback_inodes(struct writeback_control *wbc);
void wake_up_inode(struct inode *inode);
-int inode_wait(void *);
+int inode_wait(void *, wait_queue_t *);
void sync_inodes_sb(struct super_block *, int wait);
void sync_inodes(int wait);

diff -urp linux-2.6.9-rc3/fs/inode.c linux-2.6.9-rc3-mm2/fs/inode.c
--- linux-2.6.9-rc3/fs/inode.c 2004-10-07 13:19:03.000000000 +0530
+++ linux-2.6.9-rc3-mm2/fs/inode.c 2004-10-07 12:06:07.000000000 +0530
@@ -1264,7 +1264,7 @@ void remove_dquot_ref(struct super_block

#endif

-int inode_wait(void *word)
+int inode_wait(void *word, wait_queue_t *wait)
{
schedule();
return 0;

Suparna Bhattacharya

Jun 20, 2005, 12:20:09 PM

In order to allow for interruptible and asynchronous versions of
lock_page in conjunction with the wait_on_bit changes, we need to
define low-level lock page routines which take an additional
argument, i.e. a wait queue entry, and may return a non-zero status,
e.g. -EINTR, -EIOCBRETRY, -EWOULDBLOCK. This patch renames
__lock_page to lock_page_slow, so that __lock_page and
__lock_page_slow can denote the versions which take a wait queue
parameter.

Signed-off-by: Suparna Bhattacharya <sup...@in.ibm.com>

include/linux/pagemap.h | 4 ++--
mm/filemap.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff -puN include/linux/pagemap.h~lock_page_slow include/linux/pagemap.h
--- linux-2.6.9-rc1-mm4/include/linux/pagemap.h~lock_page_slow 2004-09-13 11:46:23.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/include/linux/pagemap.h 2004-09-13 12:01:03.000000000 +0530
@@ -151,14 +151,14 @@ static inline pgoff_t linear_page_index(
return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

-extern void FASTCALL(__lock_page(struct page *page));
+extern void FASTCALL(lock_page_slow(struct page *page));
extern void FASTCALL(unlock_page(struct page *page));

static inline void lock_page(struct page *page)
{
might_sleep();
if (TestSetPageLocked(page))
- __lock_page(page);
+ lock_page_slow(page);
}

/*
diff -puN mm/filemap.c~lock_page_slow mm/filemap.c
--- linux-2.6.9-rc1-mm4/mm/filemap.c~lock_page_slow 2004-09-13 11:46:23.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/mm/filemap.c 2004-09-13 12:07:53.000000000 +0530
@@ -438,14 +438,14 @@ EXPORT_SYMBOL(end_page_writeback);
* chances are that on the second loop, the block layer's plug list is empty,
* so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
*/
-void fastcall __lock_page(struct page *page)
+void fastcall lock_page_slow(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
}
-EXPORT_SYMBOL(__lock_page);
+EXPORT_SYMBOL(lock_page_slow);

/*
* Note completion of filesystem specific page synchronisation

_

Suparna Bhattacharya

Jun 20, 2005, 12:30:37 PM
On Mon, Jun 20, 2005 at 09:31:26PM +0530, Suparna Bhattacharya wrote:
> patches to update AIO for filtered wakeups:
>
>> Updated to apply to 2.6.12-rc6


init_wait_bit_key() initializes the key field in an already
allocated wait bit structure, useful for async wait bit support.
Also separate out the wait bit test to a common routine which
can be used by different kinds of wakeup callbacks.

Signed-off-by: Suparna Bhattacharya <sup...@in.ibm.com>

linux-2.6.9-rc1-mm4-suparna/include/linux/wait.h | 17 +++++++++++++++++
linux-2.6.9-rc1-mm4-suparna/kernel/wait.c | 11 ++---------
2 files changed, 19 insertions(+), 9 deletions(-)

diff -puN include/linux/wait.h~init-wait-bit-key include/linux/wait.h
--- linux-2.6.9-rc1-mm4/include/linux/wait.h~init-wait-bit-key 2004-09-14 16:00:46.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/include/linux/wait.h 2004-09-15 15:33:51.000000000 +0530
@@ -103,6 +103,17 @@ static inline int waitqueue_active(wait_
return !list_empty(&q->task_list);
}

+static inline int test_wait_bit_key(wait_queue_t *wait,
+ struct wait_bit_key *key)
+{
+ struct wait_bit_queue *wait_bit
+ = container_of(wait, struct wait_bit_queue, wait);
+
+ return (wait_bit->key.flags == key->flags &&
+ wait_bit->key.bit_nr == key->bit_nr &&
+ !test_bit(key->bit_nr, key->flags));
+}
+
/*
* Used to distinguish between sync and async io wait context:
* sync i/o typically specifies a NULL wait queue entry or a wait
@@ -344,6 +359,19 @@ int wake_bit_function(wait_queue_t *wait
(wait)->task = current; \
(wait)->func = autoremove_wake_function; \
INIT_LIST_HEAD(&(wait)->task_list); \
+ } while (0)
+
+#define init_wait_bit_key(waitbit, word, bit) \
+ do { \
+ (waitbit)->key.flags = word; \
+ (waitbit)->key.bit_nr = bit; \
+ } while (0)
+
+#define init_wait_bit_task(waitbit, tsk) \
+ do { \
+ (waitbit)->wait.task = tsk; \
+ (waitbit)->wait.func = wake_bit_function; \
+ INIT_LIST_HEAD(&(waitbit)->wait.task_list); \
} while (0)

/**
diff -puN kernel/wait.c~init-wait-bit-key kernel/wait.c
--- linux-2.6.9-rc1-mm4/kernel/wait.c~init-wait-bit-key 2004-09-15 12:14:03.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/kernel/wait.c 2004-09-15 15:33:05.000000000 +0530
@@ -132,16 +132,9 @@ EXPORT_SYMBOL(autoremove_wake_function);

int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
{
- struct wait_bit_key *key = arg;
- struct wait_bit_queue *wait_bit
- = container_of(wait, struct wait_bit_queue, wait);
-
- if (wait_bit->key.flags != key->flags ||
- wait_bit->key.bit_nr != key->bit_nr ||
- test_bit(key->bit_nr, key->flags))
+ if (!test_wait_bit_key(wait, arg))
return 0;
- else
- return autoremove_wake_function(wait, mode, sync, key);
+ return autoremove_wake_function(wait, mode, sync, arg);
}
EXPORT_SYMBOL(wake_bit_function);

_

Suparna Bhattacharya

Jun 20, 2005, 12:40:16 PM
On Mon, Jun 20, 2005 at 09:31:26PM +0530, Suparna Bhattacharya wrote:
> > (1) Updating AIO to use wait-bit based filtered wakeups (me/wli)
> > Status: Updated to 2.6.12-rc6, needs review
>

Enable wait bit based filtered wakeups to work for AIO.
Replaces the wait queue entry in the kiocb with a wait bit
structure, to allow enough space for the wait bit key.
This adds an extra level of indirection in references to the
wait queue entry in the iocb. Also, adds an extra check
in aio_wake_function to allow for other kinds of waiters
which do not require wait bit, based on the assumption that
the key passed in would be NULL in such cases.
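The NULL-key convention described here can be modeled in a small
standalone sketch. The names are hypothetical and the key test is
simplified (the real kernel routine also checks that the bit is still
set); this is not the patch's code:

```c
/*
 * Userspace model of the wakeup filtering convention described
 * above: a NULL key means "wake unconditionally" (a waiter that
 * does not use wait-bit filtering), while a non-NULL key requests
 * filtering, so only waiters whose (word, bit_nr) match are woken.
 */
#include <assert.h>
#include <stddef.h>

struct wait_bit_key_model {
	void *word;
	int bit_nr;
};

struct waiter_model {
	struct wait_bit_key_model key;
};

/* Simplified stand-in for test_wait_bit_key(): match on word/bit. */
static int key_matches(const struct waiter_model *w,
		       const struct wait_bit_key_model *key)
{
	return w->key.word == key->word && w->key.bit_nr == key->bit_nr;
}

/* Returns nonzero if this waiter should be woken for the given key. */
static int wake_function_model(const struct waiter_model *w,
			       const struct wait_bit_key_model *key)
{
	/* Assumes a non-NULL key implies wait-bit filtering */
	if (key && !key_matches(w, key))
		return 0;
	return 1;
}
```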

Signed-off-by: Suparna Bhattacharya <sup...@in.ibm.com>

fs/aio.c | 20 +++++++++++++-------
include/linux/aio.h | 4 ++--
include/linux/wait.h | 18 ++++++++++++++++++
kernel/wait.c | 17 ++++++++++++++---
4 files changed, 47 insertions(+), 12 deletions(-)

diff -puN fs/aio.c~aio-wait-bit fs/aio.c
--- linux-2.6.9-rc1-mm4/fs/aio.c~aio-wait-bit 2004-09-15 16:02:13.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/fs/aio.c 2004-09-15 16:41:22.000000000 +0530
@@ -725,13 +725,13 @@ static ssize_t aio_run_iocb(struct kiocb
* the aio_wake_function callback).
*/
- BUG_ON(current->io_wait != NULL);
- current->io_wait = &iocb->ki_wait;
+ BUG_ON(!is_sync_wait(current->io_wait));
+ current->io_wait = &iocb->ki_wait.wait;
ret = retry(iocb);
current->io_wait = NULL;

if (-EIOCBRETRY != ret) {
if (-EIOCBQUEUED != ret) {
- BUG_ON(!list_empty(&iocb->ki_wait.task_list));
+ BUG_ON(!list_empty(&iocb->ki_wait.wait.task_list));
aio_complete(iocb, ret, 0);
/* must not access the iocb after this */
}
@@ -740,7 +740,7 @@ static ssize_t aio_run_iocb(struct kiocb
* Issue an additional retry to avoid waiting forever if
* no waits were queued (e.g. in case of a short read).
*/
- if (list_empty(&iocb->ki_wait.task_list))
+ if (list_empty(&iocb->ki_wait.wait.task_list))
kiocbSetKicked(iocb);
}
out:
@@ -886,7 +886,7 @@ void queue_kicked_iocb(struct kiocb *ioc
unsigned long flags;
int run = 0;

- WARN_ON((!list_empty(&iocb->ki_wait.task_list)));
+ WARN_ON((!list_empty(&iocb->ki_wait.wait.task_list)));

spin_lock_irqsave(&ctx->ctx_lock, flags);
run = __queue_kicked_iocb(iocb);
@@ -1472,7 +1472,13 @@ ssize_t aio_setup_iocb(struct kiocb *kio
*/
int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
-	struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+	struct wait_bit_queue *wait_bit
+		= container_of(wait, struct wait_bit_queue, wait);
+	struct kiocb *iocb = container_of(wait_bit, struct kiocb, ki_wait);
+
+	/* Assumes that a non-NULL key implies wait bit filtering */
+	if (key && !test_wait_bit_key(wait, key))
+		return 0;

	list_del_init(&wait->task_list);
	kick_iocb(iocb);
@@ -1528,8 +1534,9 @@ int fastcall io_submit_one(struct kioctx
req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf;
req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
req->ki_opcode = iocb->aio_lio_opcode;
- init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
- INIT_LIST_HEAD(&req->ki_wait.task_list);
+ init_waitqueue_func_entry(&req->ki_wait.wait, aio_wake_function);
+ INIT_LIST_HEAD(&req->ki_wait.wait.task_list);
+ req->ki_run_list.next = req->ki_run_list.prev = NULL;
req->ki_retried = 0;

ret = aio_setup_iocb(req);
diff -puN kernel/wait.c~aio-wait-bit kernel/wait.c
--- linux-2.6.9-rc1-mm4/kernel/wait.c~aio-wait-bit 2004-09-15 16:02:13.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/kernel/wait.c 2004-09-20 14:59:24.000000000 +0530
@@ -132,7 +132,8 @@ EXPORT_SYMBOL(autoremove_wake_function);

 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 {
-	if (!test_wait_bit_key(wait, arg))
+	/* Assumes that a non-NULL key implies wait bit filtering */
+	if (arg && !test_wait_bit_key(wait, arg))
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, arg);
 }

@@ -152,7 +153,12 @@ __wait_on_bit(wait_queue_head_t *wq, str
 		if (test_bit(q->key.bit_nr, q->key.flags))
 			ret = (*action)(q->key.flags, &q->wait);
 	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
-	finish_wait(wq, &q->wait);
+	/*
+	 * AIO retries require the wait queue entry to remain queued
+	 * for async notification
+	 */
+	if (ret != -EIOCBRETRY)
+		finish_wait(wq, &q->wait);
 	return ret;
 }
 EXPORT_SYMBOL(__wait_on_bit);
@@ -181,7 +187,12 @@ __wait_on_bit_lock(wait_queue_head_t *wq
 				break;
 		}
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
-	finish_wait(wq, &q->wait);
+	/*
+	 * AIO retries require the wait queue entry to remain queued
+	 * for async notification
+	 */
+	if (ret != -EIOCBRETRY)
+		finish_wait(wq, &q->wait);
 	return ret;
 }
 EXPORT_SYMBOL(__wait_on_bit_lock);
diff -puN include/linux/aio.h~aio-wait-bit include/linux/aio.h
--- linux-2.6.9-rc1-mm4/include/linux/aio.h~aio-wait-bit 2004-09-15 16:05:40.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/include/linux/aio.h 2004-09-20 15:20:58.000000000 +0530
@@ -69,7 +69,7 @@ struct kiocb {
size_t ki_nbytes; /* copy of iocb->aio_nbytes */
char __user *ki_buf; /* remaining iocb->aio_buf */
size_t ki_left; /* remaining bytes */
- wait_queue_t ki_wait;
+ struct wait_bit_queue ki_wait;
long ki_retried; /* just for testing */
long ki_kicked; /* just for testing */
long ki_queued; /* just for testing */
@@ -90,7 +90,7 @@ struct kiocb {
(x)->ki_dtor = NULL; \
(x)->ki_obj.tsk = tsk; \
(x)->ki_user_data = 0; \
- init_wait((&(x)->ki_wait)); \
+ init_wait_bit_task((&(x)->ki_wait), current);\
} while (0)

#define AIO_RING_MAGIC 0xa10a10a1

Suparna Bhattacharya

Jun 20, 2005, 12:40:07 PM
On Mon, Jun 20, 2005 at 09:31:26PM +0530, Suparna Bhattacharya wrote:
> > (1) Updating AIO to use wait-bit based filtered wakeups (me/wli)
> > Status: Updated to 2.6.12-rc6, needs review
>


Define low-level page wait and lock page routines which take a
wait queue entry pointer as an additional parameter and
return status (which may be non-zero when the wait queue
parameter signifies an asynchronous wait, typically during
AIO).

Synchronous IO waits become a special case where the wait
queue parameter is the running task's default io wait context.
Asynchronous IO waits happen when the wait queue parameter
is the io wait context of a kiocb. Code paths which choose to
execute synchronous or asynchronous behaviour depending on the
called context specify the current io wait context (which points
to sync or async context as the case may be) as the wait
parameter.
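As a rough userspace sketch of this calling convention (illustrative
only; `struct task`, `lock_page_model`, and the `EIOCBRETRY` value are
simplified stand-ins, not the kernel's structures):

```c
/*
 * Userspace sketch of the sync/async calling convention described
 * above. A code path that serves both contexts takes the wait entry
 * selected by the task-level io_wait pointer, which names either the
 * task's default (sync) entry or the one embedded in a kiocb (async).
 */
#include <assert.h>
#include <stddef.h>

#define EIOCBRETRY 529  /* stand-in for the kernel's -EIOCBRETRY */

struct wait_entry { int is_async; };

struct task {
	struct wait_entry default_wait; /* sync io wait context */
	struct wait_entry *io_wait;     /* current io wait context */
};

/* Model of a lock_page-style helper: a sync wait would sleep and
 * then succeed (returning 0); an async wait must not sleep, so it
 * returns -EIOCBRETRY and the whole operation is retried on wakeup. */
static int lock_page_model(struct task *tsk)
{
	if (tsk->io_wait->is_async)
		return -EIOCBRETRY;
	return 0; /* sync: would have slept and acquired the lock */
}
```

The nonzero status bubbling up through the callers is what drives the
AIO retry machinery; synchronous paths simply never see it.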

Signed-off-by: Suparna Bhattacharya <sup...@in.ibm.com>

include/linux/pagemap.h | 38 ++++++++++++++------
mm/filemap.c | 27 ++++++++------
2 files changed, 44 insertions(+), 21 deletions(-)

diff -urp linux-2.6.9-rc3/kernel/sched.c linux-2.6.9-rc3-mm2/kernel/sched.c
--- linux-2.6.9-rc3/kernel/sched.c 2004-10-07 13:19:10.000000000 +0530
+++ linux-2.6.9-rc3-mm2/kernel/sched.c 2004-10-08 11:53:18.000000000 +0530
@@ -4428,6 +4428,20 @@ long __sched io_schedule_timeout(long ti
return ret;
}

+/*
+ * Sleep only if the wait context passed is not async,
+ * otherwise return so that a retry can be issued later.
+ */
+int __sched io_wait_schedule(wait_queue_t *wait)
+{
+ if (!is_sync_wait(wait))
+ return -EIOCBRETRY;
+ io_schedule();
+ return 0;
+}
+
+EXPORT_SYMBOL(io_wait_schedule);
+
/**
* sys_sched_get_priority_max - return maximum RT priority.
* @policy: scheduling class.
diff -puN mm/filemap.c~aio-wait-page mm/filemap.c
--- linux-2.6.9-rc1-mm4/mm/filemap.c~aio-wait-page 2004-09-17 09:25:48.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/mm/filemap.c 2004-09-20 22:57:37.000000000 +0530
@@ -146,8 +146,7 @@ static int sync_page(void *word, wait_qu
mapping = page_mapping(page);
if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
mapping->a_ops->sync_page(page);
- io_schedule();
- return 0;
+ return io_wait_schedule(wait);
}

/**
@@ -378,13 +377,17 @@ static inline void wake_up_page(struct p
__wake_up_bit(page_waitqueue(page), &page->flags, bit);
}

-void fastcall wait_on_page_bit(struct page *page, int bit_nr)
+int fastcall wait_on_page_bit(struct page *page, int bit_nr,
+ wait_queue_t *wait)
{
-	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
-
-	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sync_page,
+	if (test_bit(bit_nr, &page->flags)) {
+		struct wait_bit_queue *wait_bit
+			= container_of(wait, struct wait_bit_queue, wait);
+
+		init_wait_bit_key(wait_bit, &page->flags, bit_nr);
+		return __wait_on_bit(page_waitqueue(page), wait_bit, sync_page,
 				TASK_UNINTERRUPTIBLE);
+	}
+	return 0;
 }
EXPORT_SYMBOL(wait_on_page_bit);

@@ -431,18 +434,20 @@ void end_page_writeback(struct page *pag
EXPORT_SYMBOL(end_page_writeback);

 /*
- * Get a lock on the page, assuming we need to sleep to get it.
+ * Get a lock on the page, assuming we need to wait to get it.
  *
  * Ugly: running sync_page() in state TASK_UNINTERRUPTIBLE is scary. If some
  * random driver's requestfn sets TASK_RUNNING, we could busywait. However
  * chances are that on the second loop, the block layer's plug list is empty,
  * so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
  */

-void fastcall lock_page_slow(struct page *page)
+int fastcall lock_page_slow(struct page *page, wait_queue_t *wait)
 {
-	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	struct wait_bit_queue *wait_bit
+		= container_of(wait, struct wait_bit_queue, wait);

-	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
+	init_wait_bit_key(wait_bit, &page->flags, PG_locked);
+	return __wait_on_bit_lock(page_waitqueue(page), wait_bit, sync_page,
 				TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(lock_page_slow);
diff -puN include/linux/pagemap.h~aio-wait-page include/linux/pagemap.h
--- linux-2.6.9-rc1-mm4/include/linux/pagemap.h~aio-wait-page 2004-09-17 09:25:48.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/include/linux/pagemap.h 2004-09-20 22:56:21.000000000 +0530
@@ -151,21 +151,25 @@ static inline pgoff_t linear_page_index(
 	return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 }

-extern void FASTCALL(lock_page_slow(struct page *page));
+extern int FASTCALL(lock_page_slow(struct page *page, wait_queue_t *wait));
 extern void FASTCALL(unlock_page(struct page *page));

-static inline void lock_page(struct page *page)
+static inline int __lock_page(struct page *page, wait_queue_t *wait)
{
might_sleep();
if (TestSetPageLocked(page))
- lock_page_slow(page);
+ return lock_page_slow(page, wait);
+ return 0;
}
+
+#define lock_page(page) __lock_page(page, &current->__wait.wait)

/*
* This is exported only for wait_on_page_locked/wait_on_page_writeback.
* Never use this directly!
*/
-extern void FASTCALL(wait_on_page_bit(struct page *page, int bit_nr));
+extern int FASTCALL(wait_on_page_bit(struct page *page, int bit_nr,
+ wait_queue_t *wait));

/*
* Wait for a page to be unlocked.
@@ -174,20 +178,29 @@ extern void FASTCALL(wait_on_page_bit(st
* ie with increased "page->count" so that the page won't
* go away during the wait..
*/
-static inline void wait_on_page_locked(struct page *page)
+static inline int __wait_on_page_locked(struct page *page, wait_queue_t *wait)
{
if (PageLocked(page))
- wait_on_page_bit(page, PG_locked);
+ return wait_on_page_bit(page, PG_locked, wait);
+ return 0;
}

+#define wait_on_page_locked(page) \
+ __wait_on_page_locked(page, &current->__wait.wait)
+
/*
* Wait for a page to complete writeback
*/
-static inline void wait_on_page_writeback(struct page *page)
+static inline int __wait_on_page_writeback(struct page *page,
+ wait_queue_t *wait)
{
if (PageWriteback(page))
- wait_on_page_bit(page, PG_writeback);
+ return wait_on_page_bit(page, PG_writeback, wait);
+ return 0;
}

+#define wait_on_page_writeback(page) \
+ __wait_on_page_writeback(page, &current->__wait.wait)
+
extern void end_page_writeback(struct page *page);

/*
* Fault a userspace page into pagetables. Return non-zero on a fault.
--- linux-2.6.10-rc1/include/linux/sched.h 2004-11-03 12:04:20.000000000 +0530
+++ linux-2.6.10-rc1-aio/include/linux/sched.h 2004-11-10 13:06:03.376403392 +0530
@@ -165,6 +165,7 @@ extern void show_stack(struct task_struc

void io_schedule(void);
long io_schedule_timeout(long timeout);
+int io_wait_schedule(wait_queue_t *wait);

extern void cpu_init (void);
extern void trap_init(void);

Suparna Bhattacharya

Jun 20, 2005, 12:40:15 PM
On Mon, Jun 20, 2005 at 09:31:26PM +0530, Suparna Bhattacharya wrote:
> > (1) Updating AIO to use wait-bit based filtered wakeups (me/wli)
> > Status: Updated to 2.6.12-rc6, needs review
>


Allocate space for the default io wait queue entry (actually a wait
bit entry) in the task struct. Doing so simplifies the AIO wait-page
patches, allowing for a cleaner and more efficient implementation,
at the cost of 28 additional bytes in the task struct versus
on-demand allocation on the stack.

Signed-off-by: Suparna Bhattacharya <sup...@in.ibm.com>

include/linux/sched.h | 11 +++++++----
kernel/fork.c | 3 ++-
2 files changed, 9 insertions(+), 5 deletions(-)

diff -puN include/linux/sched.h~tsk-default-io-wait include/linux/sched.h
--- linux-2.6.9-rc1-mm4/include/linux/sched.h~tsk-default-io-wait 2004-09-20 14:05:09.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/include/linux/sched.h 2004-09-20 15:14:25.000000000 +0530
@@ -584,11 +584,14 @@ struct task_struct {

unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
+
+/* Space for default IO wait bit entry used for synchronous IO waits */
+ struct wait_bit_queue __wait;
/*
- * current io wait handle: wait queue entry to use for io waits
- * If this thread is processing aio, this points at the waitqueue
- * inside the currently handled kiocb. It may be NULL (i.e. default
- * to a stack based synchronous wait) if its doing sync IO.
+ * Current IO wait handle: wait queue entry to use for IO waits
+ * If this thread is processing AIO, this points at the waitqueue
+ * inside the currently handled kiocb. Otherwise, points to the
+ * default IO wait field (i.e &__wait.wait above).
*/
wait_queue_t *io_wait;
#ifdef CONFIG_NUMA
diff -puN kernel/fork.c~tsk-default-io-wait kernel/fork.c
--- linux-2.6.9-rc1-mm4/kernel/fork.c~tsk-default-io-wait 2004-09-20 14:05:09.000000000 +0530
+++ linux-2.6.9-rc1-mm4-suparna/kernel/fork.c 2004-09-20 15:15:22.000000000 +0530
@@ -859,7 +859,8 @@ static task_t *copy_process(unsigned lon
do_posix_clock_monotonic_gettime(&p->start_time);
p->security = NULL;
p->io_context = NULL;
- p->io_wait = NULL;
+ init_wait_bit_task(&p->__wait, p);
+ p->io_wait = &p->__wait.wait;
p->audit_context = NULL;
#ifdef CONFIG_NUMA
p->mempolicy = mpol_copy(p->mempolicy);

_

Benjamin LaHaise

Jun 20, 2005, 2:20:07 PM
On Mon, Jun 20, 2005 at 05:31:54PM +0530, Suparna Bhattacharya wrote:
> (1) Updating AIO to use wait-bit based filtered wakeups (me/wli)
> Status: Updated to 2.6.12-rc6, needs review
> (2) Buffered filesystem AIO read/write (me/Ben)
> Status: aio write: Updated to 2.6.12-rc6, needs review
> Status: aio read : Needs rework against readahead changes in mainline

I've looked over the patches from today and they seem quite sane. Comments
pending...

> (3) POSIX AIO support (Bull: Laurent/Sebastian or Oracle: Joel)
> Status: Needs review and discussion ?

The latest version incorporates changes from the last round of feedback
(great work Sebastien!) and updates the library license, so people should
definitely take a closer look. This includes the necessary changes for
in-kernel signal support, as well as minimal conversion done on iocbs (the
layout matches the in-kernel iocb).

A quick reading shows that most of it looks quite good. I have to stare
at the cancellation code a bit more closely, though.

> (4) AIO for pipes (Chris Mason)
> Status: Needs update to latest kernels

This likely needs a rewrite with whatever the final semaphore operations
turn out to look like, as it then becomes very easy to convert down() into
aio_down(), which minimises the changes to the code.

> (5) Asynchronous semaphore implementation (Ben/Trond?)
> Status: Posted - under development & discussion

I got the aio_down() variant working with cancellation now, and should be
able to post an updated series of patches against 2.6.12 shortly.

> (6) epoll - AIO integration (Zach Brown/Feng Zhou/wli)
> Status: Needs resurrection ?

What are folks thoughts in this area? Zach's patches took the approach of
making multishot iocbs possible, which helped avoid the overhead of plain
aio_poll's command setup, which was quite visible in pipetest. Zach -- did
you do any benchmarking on your aio-epoll patches?

> (7) Vector AIO (aio readv/writev) (Yasushi Saito)
> Status: Needs resurrection ?

Zach also made some noises about this recently...

-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler

Zach Brown

Jun 20, 2005, 3:00:12 PM

>>(6) epoll - AIO integration (Zach Brown/Feng Zhou/wli)
>> Status: Needs resurrection ?
>
>
> What are folks thoughts in this area? Zach's patches took the approach of
> making multishot iocbs possible, which helped avoid the overhead of plain
> aio_poll's command setup, which was quite visible in pipetest. Zach -- did
> you do any benchmarking on your aio-epoll patches?

No, what little work I did on this was just pushing for stable
functionality. I had a little test app that was still missing event
delivery occasionally. I'm sure it'd be easy enough to track down. It
still seems like a pretty reasonable translation of epoll event delivery
through the aio completion queue. I'm not thrilled with the epoll edge
delivery semantics, though; it would be nice to make duplicate event
generation contingent on servicing an initial event: EPOLLIN being
throttled until read activity is seen on the fd, that kind of thing.
Nontrivial work, of course.
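For readers following along, the duplicate-event behaviour being discussed can be seen with plain userspace epoll (this is only an illustration, not part of the aio-epoll patches; the function name is made up): with EPOLLET, every new write re-reports EPOLLIN even though the consumer never read the fd.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Count EPOLLIN notifications delivered across two zero-timeout waits
 * when the reader never drains the pipe.  Each write generates a fresh
 * edge, so the event is reported twice -- the duplicate generation that
 * would ideally be throttled until read activity is seen on the fd. */
int count_duplicate_edges(void)
{
	int pfd[2], epfd, n = 0;
	struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
	struct epoll_event out;

	if (pipe(pfd) < 0 || (epfd = epoll_create(1)) < 0)
		return -1;
	ev.data.fd = pfd[0];
	epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev);

	(void)write(pfd[1], "x", 1);        /* first edge: new data       */
	n += epoll_wait(epfd, &out, 1, 0);  /* EPOLLIN reported           */
	(void)write(pfd[1], "x", 1);        /* second write, fd never read */
	n += epoll_wait(epfd, &out, 1, 0);  /* EPOLLIN reported again     */

	close(pfd[0]); close(pfd[1]); close(epfd);
	return n;
}
```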

>>(7) Vector AIO (aio readv/writev) (Yasushi Saito)
>> Status: Needs resurrection ?
>
> Zach also made some noises about this recently...

Yeah, I've got a patch working that adds CMD_AIO_P{WRITE,READ}V for ext3
via some aio->aio_p{read,write}v ops. It's currently against some
distro 2.6, but I'll port it up to current and post the patch. It seems
pretty noncontroversial -- one obviously wants to scatter/gather
file-contiguous IO with tiny iovec elements, which bubble down well to
the generic fs/block helpers, rather than trying to get the various
layers to merge many large iocb submissions that can be found to be
file-contiguous.
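The synchronous analogue of what an aio readv would submit in one iocb is ordinary readv() scattering one file-contiguous region into several small iovec elements (a sketch for illustration only; the helper name is made up):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Scatter one contiguous 10-byte file region into two 5-byte buffers
 * with a single readv() call.  A vectored aio op would carry the same
 * iovec down through one iocb rather than one iocb per element.
 * Returns 1 on success. */
int demo_readv(void)
{
	char path[] = "/tmp/readv_demo_XXXXXX";
	char a[6] = { 0 }, b[6] = { 0 };
	struct iovec iov[2] = {
		{ .iov_base = a, .iov_len = 5 },
		{ .iov_base = b, .iov_len = 5 },
	};
	int fd = mkstemp(path);
	ssize_t n;

	if (fd < 0)
		return 0;
	(void)write(fd, "helloworld", 10);
	lseek(fd, 0, SEEK_SET);
	n = readv(fd, iov, 2);          /* one syscall, two buffers */
	close(fd);
	unlink(path);
	return n == 10 && strcmp(a, "hello") == 0 && strcmp(b, "world") == 0;
}
```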

- z

Sébastien Dugué

Jun 21, 2005, 4:50:07 AM
On Mon, 2005-06-20 at 14:10 -0400, Benjamin LaHaise wrote:

> > (3) POSIX AIO support (Bull: Laurent/Sebastian or Oracle: Joel)
> > Status: Needs review and discussion ?
>
> The latest version incorporates changes from the last round of feedback
> (great work Sebastien!) and updates the library license, so people should
> definately take a closer look. This includes the necessary changes for
> in-kernel signal support, as well as minimal conversion done on iocbs (the
> layout matches the in-kernel iocb).
>
> A quick reading shows that most of it looks quite good. I have to stare
> at the cancellation code a bit more closely, though.
>

As I understand it, cancellation needs support from the
underlying filesystem in order to dequeue requests not yet
committed to disk.

So far there is no support from the kernel aside from the USB gadget
driver which registers its own 'iocb->ki_cancel' method for dequeuing
requests.

Sébastien.

--
------------------------------------------------------

Sébastien Dugué BULL/FREC:B1-247
phone: (+33) 476 29 77 70 Bullcom: 229-7770

mailto:sebasti...@bull.net

Linux POSIX AIO: http://www.bullopensource.org/posix

------------------------------------------------------


Andi Kleen

Jun 21, 2005, 2:50:13 PM
Suparna Bhattacharya <sup...@in.ibm.com> writes:
>
> All in all, these changes have (hopefully) simplified the code,
> as well as made it more up-to-date. Comments (including
> better names etc as requested by Zach) are welcome !

I looked over the patches and they look all fine to me.

-Andi

William Lee Irwin III

Jun 21, 2005, 3:10:08 PM
On Mon, Jun 20, 2005 at 05:31:54PM +0530, Suparna Bhattacharya wrote:
> Since AIO development is gaining momentum once again, ocfs2 and
> samba both appear to be using AIO, NFS needs async semaphores etc,
> there appears to be an increase in interest in straightening out some
> of the pending work in this area. So this seems like a good
> time to re-post some of those patches for discussion and decision.
> Just to help sync up, here is an initial list based on the pieces
> that have been in progress with patches in existence (please feel free
> to add/update ones I missed or reflected inaccurately here):

I'm going to keep going over these until I get tired of it for general
bulletproofing purposes, but all indications thus far are good.


-- wli

Suparna Bhattacharya

Jun 24, 2005, 6:50:05 AM
On Mon, Jun 20, 2005 at 05:31:54PM +0530, Suparna Bhattacharya wrote:
> Since AIO development is gaining momentum once again, ocfs2 and
> samba both appear to be using AIO, NFS needs async semaphores etc,
> there appears to be an increase in interest in straightening out some
> of the pending work in this area. So this seems like a good
> time to re-post some of those patches for discussion and decision.
>
> Just to help sync up, here is an initial list based on the pieces
> that have been in progress with patches in existence (please feel free
> to add/update ones I missed or reflected inaccurately here):
>
> (1) Updating AIO to use wait-bit based filtered wakeups (me/wli)
> Status: Updated to 2.6.12-rc6, needs review
> (2) Buffered filesystem AIO read/write (me/Ben)
> Status: aio write: Updated to 2.6.12-rc6, needs review
> Status: aio read : Needs rework against readahead changes in mainline
> ...
> ...

> On my part, I'll start by re-posting (1) for discussion, and then
> move to (2).
>

Feedback on (1) seems positive so far, so now moving to (2), here are
the patches that implement the changes to make filesystem AIO read
and write truly asynchronous even without O_DIRECT. With these patches
in place it will no longer be necessary for the POSIX AIO library
(from Sébastien et al) to force O_DIRECT and memcpy for alignment.
(Samba should find this useful)
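From the library side, the point is that a plain buffered POSIX AIO read then just works with an arbitrarily aligned user buffer (a userspace sketch against glibc's thread-based POSIX AIO implementation, for illustration only; the function name is made up and this is not the kernel patch itself):

```c
#include <aio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* A buffered (non-O_DIRECT) POSIX AIO read: no alignment constraints
 * on buf, no bounce-buffer memcpy needed.  Returns 1 on success. */
int buffered_aio_read(void)
{
	char path[] = "/tmp/aio_demo_XXXXXX";
	char buf[16] = { 0 };           /* deliberately unaligned-friendly */
	struct aiocb cb;
	const struct aiocb *list[1] = { &cb };
	int fd = mkstemp(path);
	ssize_t n;

	if (fd < 0)
		return 0;
	(void)write(fd, "async data", 10);

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf    = buf;
	cb.aio_nbytes = 10;
	cb.aio_offset = 0;

	if (aio_read(&cb) < 0)
		return 0;
	while (aio_error(&cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);     /* wait for completion */
	n = aio_return(&cb);
	close(fd);
	unlink(path);
	return n == 10 && memcmp(buf, "async data", 10) == 0;
}
```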

There are 2 patches: one for buffered filesystem AIO read and the other
for buffered filesystem AIO O_SYNC write. These build on the AIO wait bit
integration patches posted earlier.

Comments would be appreciated.

The full patchset including (1) and (2) above is available at
www.kernel.org/pub/linux/kernel/people/suparna/aio/2612

Suparna Bhattacharya

Jun 24, 2005, 7:30:18 AM
On Fri, Jun 24, 2005 at 04:19:28PM +0530, Suparna Bhattacharya wrote:
> > (2) Buffered filesystem AIO read/write (me/Ben)

Filesystem aio read:

Converts the wait for a page to become uptodate (lock page)
after readahead/readpage (in do_generic_mapping_read) into a retry
exit, to make buffered filesystem AIO reads actually asynchronous.

The patch avoids exclusive wakeups with AIO, a problem originally
spotted by Chris Mason, though the reasoning for why it is an
issue is now much clearer (see explanation in the comment below
in aio.c), and the solution is perhaps slightly simpler.

fs/aio.c | 11 ++++++++++-
mm/filemap.c | 20 +++++++++++++++++---
2 files changed, 27 insertions(+), 4 deletions(-)


diff -u orig/mm/filemap.c linux-2.6.12-rc6/mm/filemap.c
--- orig/mm/filemap.c 2005-06-23 21:50:28.000000000 +0530
+++ linux-2.6.12-rc6/mm/filemap.c 2005-06-21 12:13:21.000000000 +0530
@@ -770,6 +770,11 @@
if (!isize)
goto out;

+ if (in_aio()) {
+ /* Avoid repeat readahead */
+ if (is_retried_kiocb(io_wait_to_kiocb(current->io_wait)))
+ next_index = last_index;
+ }
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
@@ -771,7 +771,11 @@ page_ok:

page_not_up_to_date:
/* Get exclusive access to the page ... */
- lock_page(page);
+
+ if ((error = __lock_page(page, current->io_wait))) {
+ pr_debug("queued lock page \n");
+ goto readpage_error;
+ }

/* Did it get unhashed before we got the lock? */
if (!page->mapping) {
@@ -793,7 +798,8 @@ readpage:
goto readpage_error;

if (!PageUptodate(page)) {
- lock_page(page);
+ if ((error = __lock_page(page, current->io_wait)))
+ goto readpage_error;
if (!PageUptodate(page)) {
if (page->mapping == NULL) {
/*
@@ -880,7 +880,11 @@ readpage:
goto page_ok;

readpage_error:
- /* UHHUH! A synchronous read error occurred. Report it */
+ /* We don't have uptodate data in the page yet */
+ /* Could be due to an error or because we need to
+ * retry when we get an async i/o notification.
+ * Report the reason.
+ */
desc->error = error;
page_cache_release(page);
goto out;
diff -u orig/fs/aio.c linux-2.6.12-rc6/fs/aio.c
--- orig/fs/aio.c 2005-06-23 21:51:14.000000000 +0530
+++ linux-2.6.12-rc6/fs/aio.c 2005-06-23 18:47:18.000000000 +0530
@@ -1473,7 +1473,16 @@

list_del_init(&wait->task_list);
kick_iocb(iocb);
- return 1;
+ /*
+ * Avoid exclusive wakeups with retries since an exclusive wakeup
+ * may involve implicit expectations of waking up the next waiter
+ * and there is no guarantee that the retry will take a path that
+ * would do so. For example if a page has become up-to-date, then
+ * a retried read may end up straightaway performing a copyout
+ * and not go through a lock_page - unlock_page that would have
+ * passed the baton to the next waiter.
+ */
+ return 0;
}

int fastcall io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,

Suparna Bhattacharya

Jun 24, 2005, 7:40:16 AM
On Fri, Jun 24, 2005 at 04:19:28PM +0530, Suparna Bhattacharya wrote:
> On Mon, Jun 20, 2005 at 05:31:54PM +0530, Suparna Bhattacharya wrote:
> > (2) Buffered filesystem AIO read/write (me/Ben)

Filesystem AIO write

AIO support for O_SYNC buffered writes, built over O_SYNC-speedup.
It uses tagged radix tree lookups to write out just the pages
pertaining to this request, and retries instead of blocking
for writeback to complete on the same range. All the writeout is
issued at the time of IO submission, and there is a check to make
sure that retries skip straight to wait_on_page_writeback_range.

Signed-off-by: Suparna Bhattacharya <sup...@in.ibm.com>

include/linux/aio.h | 8 ++++-
mm/filemap.c | 82 +++++++++++++++++++++++++++++++++++++++-------------
2 files changed, 69 insertions(+), 21 deletions(-)

diff -urp -X dontdiff2 linux-2.6.10-rc1/include/linux/aio.h linux-2.6.10-rc1-new/include/linux/aio.h
--- linux-2.6.10-rc1/include/linux/aio.h 2004-11-26 14:30:55.000000000 +0530
+++ linux-2.6.10-rc1-new/include/linux/aio.h 2004-11-26 14:07:29.000000000 +0530
@@ -27,21 +27,26 @@ struct kioctx;
#define KIF_LOCKED 0
#define KIF_KICKED 1
#define KIF_CANCELLED 2
+#define KIF_SYNCED 3

#define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
+#define kiocbTrySync(iocb) test_and_set_bit(KIF_SYNCED, &(iocb)->ki_flags)

#define kiocbSetLocked(iocb) set_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbSetKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbSetSynced(iocb) set_bit(KIF_SYNCED, &(iocb)->ki_flags)

#define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbClearSynced(iocb) clear_bit(KIF_SYNCED, &(iocb)->ki_flags)

#define kiocbIsLocked(iocb) test_bit(KIF_LOCKED, &(iocb)->ki_flags)
#define kiocbIsKicked(iocb) test_bit(KIF_KICKED, &(iocb)->ki_flags)
#define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED, &(iocb)->ki_flags)
+#define kiocbIsSynced(iocb) test_bit(KIF_SYNCED, &(iocb)->ki_flags)

struct kiocb {
struct list_head ki_run_list;
@@ -184,7 +189,8 @@ do { \
} \
} while (0)

-#define io_wait_to_kiocb(wait) container_of(wait, struct kiocb, ki_wait)
+#define io_wait_to_kiocb(io_wait) container_of(container_of(io_wait, \
+ struct wait_bit_queue, wait), struct kiocb, ki_wait)
#define is_retried_kiocb(iocb) ((iocb)->ki_retried > 1)

#include <linux/aio_abi.h>
diff -urp -X dontdiff2 linux-2.6.10-rc1/mm/filemap.c linux-2.6.10-rc1-new/mm/filemap.c
--- linux-2.6.10-rc1/mm/filemap.c 2004-11-26 13:33:13.000000000 +0530
+++ linux-2.6.10-rc1-new/mm/filemap.c 2004-11-26 16:04:24.000000000 +0530
@@ -209,10 +209,11 @@ EXPORT_SYMBOL(filemap_flush);

/*
* Wait for writeback to complete against pages indexed by start->end
- * inclusive
+ * inclusive. In AIO context, this may queue an async notification
+ * and retry callback and return, instead of blocking the caller.
*/
-static int wait_on_page_writeback_range(struct address_space *mapping,
- pgoff_t start, pgoff_t end)
+static int __wait_on_page_writeback_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end, wait_queue_t *wait)
{
struct pagevec pvec;
int nr_pages;
@@ -224,20 +225,20 @@ static int wait_on_page_writeback_range(

pagevec_init(&pvec, 0);
index = start;
- while ((index <= end) &&
+ while (!ret && (index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_WRITEBACK,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
unsigned i;

- for (i = 0; i < nr_pages; i++) {
+ for (i = 0; !ret && (i < nr_pages); i++) {
struct page *page = pvec.pages[i];

/* until radix tree lookup accepts end_index */
if (page->index > end)
continue;

- wait_on_page_writeback(page);
+ ret = __wait_on_page_writeback(page, wait);
if (PageError(page))
ret = -EIO;
}
@@ -254,6 +255,14 @@ static int wait_on_page_writeback_range(
return ret;
}

+static inline int wait_on_page_writeback_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end)
+{
+ return __wait_on_page_writeback_range(mapping, start, end,
+ &current->__wait.wait);
+}
+
+
/*
* Write and wait upon all the pages in the passed range. This is a "data
* integrity" operation. It waits upon in-flight writeout before starting and
@@ -267,18 +276,27 @@ int sync_page_range(struct inode *inode,
{
pgoff_t start = pos >> PAGE_CACHE_SHIFT;
pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
- int ret;
+ int ret = 0;

if (!mapping_cap_writeback_dirty(mapping) || !count)
return 0;
+ if (in_aio()) {
+ /* Already issued writeouts for this iocb ? */
+ if (kiocbTrySync(io_wait_to_kiocb(current->io_wait)))
+ goto do_wait; /* just need to check if done */
+ }
ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
- if (ret == 0) {
+
+ if (ret >= 0) {
down(&inode->i_sem);
ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
up(&inode->i_sem);
}
- if (ret == 0)
- ret = wait_on_page_writeback_range(mapping, start, end);
+do_wait:
+ if (ret >= 0) {
+ ret = __wait_on_page_writeback_range(mapping, start, end,
+ current->io_wait);
+ }
return ret;
}
EXPORT_SYMBOL(sync_page_range);
@@ -293,15 +311,23 @@ int sync_page_range_nolock(struct inode
{
pgoff_t start = pos >> PAGE_CACHE_SHIFT;
pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
- int ret;
+ int ret = 0;

if (!mapping_cap_writeback_dirty(mapping) || !count)
return 0;
+ if (in_aio()) {
+ /* Already issued writeouts for this iocb ? */
+ if (kiocbTrySync(io_wait_to_kiocb(current->io_wait)))
+ goto do_wait; /* just need to check if done */
+ }
ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
- if (ret == 0)
+ if (ret >= 0)
ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
- if (ret == 0)
- ret = wait_on_page_writeback_range(mapping, start, end);
+do_wait:
+ if (ret >= 0) {
+ ret = __wait_on_page_writeback_range(mapping, start, end,
+ current->io_wait);
+ }
return ret;
}
EXPORT_SYMBOL(sync_page_range_nolock);
@@ -2001,7 +2028,7 @@ generic_file_buffered_write(struct kiocb
*/
if (likely(status >= 0)) {
if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
- if (!a_ops->writepage || !is_sync_kiocb(iocb))
+ if (!a_ops->writepage)
status = generic_osync_inode(inode, mapping,
OSYNC_METADATA|OSYNC_DATA);
}
@@ -2108,14 +2135,23 @@ generic_file_aio_write_nolock(struct kio
ssize_t ret;
loff_t pos = *ppos;

+ if (!is_sync_kiocb(iocb) && kiocbIsSynced(iocb)) {
+ /* nothing to transfer, may just need to sync data */
+ ret = iov->iov_len; /* vector AIO not supported yet */
+ goto osync;
+ }
+
ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, ppos);

+osync:
if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
int err;

err = sync_page_range_nolock(inode, mapping, pos, ret);
- if (err < 0)
- ret = err;
+ if (err < 0) {
+ ret = err;
+ *ppos = pos;
+ }
}
return ret;
}
@@ -2161,19 +2197,26 @@ ssize_t generic_file_aio_write(struct ki
struct iovec local_iov = { .iov_base = (void __user *)buf,
.iov_len = count };

- BUG_ON(iocb->ki_pos != pos);
+ if (!is_sync_kiocb(iocb) && kiocbIsSynced(iocb)) {
+ /* nothing to transfer, may just need to sync data */
+ ret = count;
+ goto osync;
+ }

down(&inode->i_sem);
ret = __generic_file_aio_write_nolock(iocb, &local_iov, 1,
&iocb->ki_pos);
up(&inode->i_sem);

+osync:
if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
ssize_t err;

err = sync_page_range(inode, mapping, pos, ret);
- if (err < 0)
+ if (err < 0) {
ret = err;
+ iocb->ki_pos = pos;
+ }
}
return ret;

Jeremy Allison

Jun 24, 2005, 12:20:11 PM
On Fri, Jun 24, 2005 at 04:19:28PM +0530, Suparna Bhattacharya wrote:
>
> Feedback on (1) seems positive so far, so now moving to (2), here are
> the patches that implement the changes to make filesystem AIO read
> and write truly asynchronous even without O_DIRECT. With these patches
> in place it will no longer be necessary for the POSIX AIO library
> (from Sébastien et al) to force O_DIRECT and memcpy for alignment.
> (Samba should find this useful)

Wonderful ! That's exactly what we need - thanks. I could have fixed this
in userspace but it would be rather ugly.

Jeremy.

Zach Brown

Jun 28, 2005, 1:00:17 PM

I have to whine at least once about obscure names :)

> -void fastcall __lock_page(struct page *page)
> +void fastcall lock_page_slow(struct page *page)
> {
> DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>
> __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
> TASK_UNINTERRUPTIBLE);
> }
> -EXPORT_SYMBOL(__lock_page);
> +EXPORT_SYMBOL(lock_page_slow);

Can we choose a name that describes what it does? Something like
lock_page_wait_on_locked()?

- z

Suparna Bhattacharya

Jun 29, 2005, 5:50:23 AM
On Tue, Jun 28, 2005 at 09:52:37AM -0700, Zach Brown wrote:
>
> I have to whine at least once about obscure names :)
>
> > -void fastcall __lock_page(struct page *page)
> > +void fastcall lock_page_slow(struct page *page)
> > {
> > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> >
> > __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
> > TASK_UNINTERRUPTIBLE);
> > }
> > -EXPORT_SYMBOL(__lock_page);
> > +EXPORT_SYMBOL(lock_page_slow);
>
> Can we chose a name that describes what it does? Something like
> lock_page_wait_on_locked()?

The usage of the _slow suffix was inspired by fs/buffer.c:
__bread_slow, __find_get_block_slow, __getblk_slow etc.

I actually have an earlier version with the name changes you had
suggested earlier ... but then wasn't sure if those were really
much better, and wanted to hear what others had to say. I do not have
any particular preference, and will go with whatever the general
consensus is.

Regards
Suparna

>
> - z

--
Suparna Bhattacharya (sup...@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


Sébastien Dugué

Jun 30, 2005, 12:00:12 PM

Just found a bug in aio_run_iocb: after having called the retry
method for the iocb, current->io_wait is RESET to NULL. While this
does not affect applications doing only AIO, applications
mixing sync and async IO (MySQL for example) end up crashing
later on in the sync path when calling lock_page_slow, as the io_wait
queue is NULL.

Therefore after the retry method has been called the task io_wait
queue should be set to the default queue.

This patch applies over Suparna's wait-bit patchset and maybe should
be folded into aio-wait-bit.

aio-retry-iowait-fix

Suparna Bhattacharya

Jul 1, 2005, 3:40:12 AM

Yes, this is a problem. I had spotted it too, but the implications hadn't
registered well enough for a prompt fix - thanks for the patch.

Regards
Suparna

>
> Therefore after the retry method has been called the task io_wait
> queue should be set to the default queue.
>
> This patch applies over Suparna's wait-bit patchset and maybe should
> be folded into aio-wait-bit.
>
> Sébastien.


Christoph Hellwig

Jul 24, 2005, 6:20:07 PM
On Mon, Jun 20, 2005 at 09:54:04PM +0530, Suparna Bhattacharya wrote:
> In order to allow for interruptible and asynchronous versions of
> lock_page in conjunction with the wait_on_bit changes, we need to
> define low-level lock page routines which take an additional
> argument, i.e a wait queue entry and may return non-zero status,
> e.g -EINTR, -EIOCBRETRY, -EWOULDBLOCK etc. This patch renames
> __lock_page to lock_page_slow, so that __lock_page and
> __lock_page_slow can denote the versions which take a wait queue
> parameter.

How many users that don't use a waitqueue parameter will be left
once all AIO patches go in?

Christoph Hellwig

Jul 24, 2005, 6:40:05 PM
On Sun, Jul 24, 2005 at 11:17:02PM +0100, Christoph Hellwig wrote:
> On Mon, Jun 20, 2005 at 09:54:04PM +0530, Suparna Bhattacharya wrote:
> > In order to allow for interruptible and asynchronous versions of
> > lock_page in conjunction with the wait_on_bit changes, we need to
> > define low-level lock page routines which take an additional
> > argument, i.e a wait queue entry and may return non-zero status,
> > e.g -EINTR, -EIOCBRETRY, -EWOULDBLOCK etc. This patch renames
> > __lock_page to lock_page_slow, so that __lock_page and
> > __lock_page_slow can denote the versions which take a wait queue
> > parameter.
>
> How many users that don't use a waitqueue parameter will be left
> once all AIO patches go in?

Actually looking at the later patches we always seem to pass
current->io_wait anyway, so is there a real point for having the
argument?

Suparna Bhattacharya

Jul 25, 2005, 9:10:09 PM
On Sun, Jul 24, 2005 at 11:36:34PM +0100, Christoph Hellwig wrote:
> On Sun, Jul 24, 2005 at 11:17:02PM +0100, Christoph Hellwig wrote:
> > On Mon, Jun 20, 2005 at 09:54:04PM +0530, Suparna Bhattacharya wrote:
> > > In order to allow for interruptible and asynchronous versions of
> > > lock_page in conjunction with the wait_on_bit changes, we need to
> > > define low-level lock page routines which take an additional
> > > argument, i.e a wait queue entry and may return non-zero status,
> > > e.g -EINTR, -EIOCBRETRY, -EWOULDBLOCK etc. This patch renames
> > > __lock_page to lock_page_slow, so that __lock_page and
> > > __lock_page_slow can denote the versions which take a wait queue
> > > parameter.
> >
> > How many users that don't use a waitqueue parameter will be left
> > once all AIO patches go in?

Since these patches are intended only for aio reads and (O_SYNC) writes,
that still leaves most other users of regular lock_page() as they are.

>
> Actually looking at the later patches we always seem to pass
> current->io_wait anyway, so is there a real point for having the
> argument?
>

Having the parameter makes it possible to issue a synchronous lock_page()
within an AIO path, overriding current->io_wait (for example, consider the
case of a page fault during AIO!), and thus avoids the need to convert and
audit more paths to handle -EIOCBRETRY propagation (though it is possible
to do so as and when the need arises). This is why I decided to keep this
parameter explicit at the low level, and let the caller decide how to
invoke it and handle the return value propagation.

Does that make sense ?

BTW, there is also a potential secondary benefit of a low level
primitive for asynchronous page IO operations etc directly usable
by kernel threads, without having to use the whole gamut of AIO
interfaces.

Thanks for reviewing the code !

Regards
Suparna

--
Suparna Bhattacharya (sup...@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

