Signed-off-by: Abdel Benamrouche <drac...@gmail.com>
---
:100644 100644 6149e4b... 7a87fad... M fs/partitions/check.c
fs/partitions/check.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 6149e4b..7a87fad 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -378,7 +378,13 @@ void add_partition(struct gendisk *disk, int part, sector_t start, sector_t len,
/* delay uevent until 'holders' subdir is created */
p->dev.uevent_suppress = 1;
- device_add(&p->dev);
+ if (device_add(&p->dev)) {
+ put_device(&p->dev);
+ free_part_stats(p);
+ kfree(p);
+ return;
+ }
+
partition_sysfs_add_subdir(p);
p->dev.uevent_suppress = 0;
if (flags & ADDPART_FLAG_WHOLEDISK)
--
1.5.4.3
> fs/partitions/check.c:381: warning: ignoring return value of 'device_add',
> declared with attribute warn_unused_result
We should go further than this. add_partition() just drops the error
on the floor. It should be propagated back to callers, and callers
should be modified to handle it appropriately.
Presumably we should also handle a device_create_file() failure as well
- that is presently being silently ignored.
Signed-off-by: Abdel Benamrouche <drac...@gmail.com>
---
block/ioctl.c | 5 +++--
fs/partitions/check.c | 8 +++++++-
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/block/ioctl.c b/block/ioctl.c
index 52d6385..77185e5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -17,6 +17,7 @@ static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user
long long start, length;
int part;
int i;
+ int err;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -61,9 +62,9 @@ static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user
}
}
/* all seems OK */
- add_partition(disk, part, start, length, ADDPART_FLAG_NONE);
+ err = add_partition(disk, part, start, length, ADDPART_FLAG_NONE);
mutex_unlock(&bdev->bd_mutex);
- return 0;
+ return err;
case BLKPG_DEL_PARTITION:
if (!disk->part[part-1])
return -ENXIO;
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index a1396a9..94f4e3c 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -500,8 +500,14 @@ int rescan_partitions(struct gendisk *disk, struct block_device *bdev)
if (from + size > get_capacity(disk)) {
printk(" %s: p%d exceeds device capacity\n",
disk->disk_name, p);
+ continue;
+ }
+ res = add_partition(disk, p, from, size, state->parts[p].flags);
+ if (res) {
+ printk(" %s: p%d could not be added. got error %d\n",
+ disk->disk_name, p, -res);
+ continue;
}
- add_partition(disk, p, from, size, state->parts[p].flags);
#ifdef CONFIG_BLK_DEV_MD
if (state->parts[p].flags & ADDPART_FLAG_RAID)
md_autodetect_dev(bdev->bd_dev+p);
--
1.5.4.3
Signed-off-by: Abdel Benamrouche <drac...@gmail.com>
---
fs/partitions/check.c | 25 ++++++++++++++++++++-----
include/linux/genhd.h | 2 +-
2 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 6149e4b..a1396a9 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -344,18 +344,18 @@ static ssize_t whole_disk_show(struct device *dev,
static DEVICE_ATTR(whole_disk, S_IRUSR | S_IRGRP | S_IROTH,
whole_disk_show, NULL);
-void add_partition(struct gendisk *disk, int part, sector_t start, sector_t len, int flags)
+int add_partition(struct gendisk *disk, int part, sector_t start, sector_t len, int flags)
{
struct hd_struct *p;
int err;
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p)
- return;
+ return -ENOMEM;
if (!init_part_stats(p)) {
kfree(p);
- return;
+ return -ENOMEM;
}
p->start_sect = start;
p->nr_sects = len;
@@ -378,15 +378,30 @@ void add_partition(struct gendisk *disk, int part, sector_t start, sector_t len,
/* delay uevent until 'holders' subdir is created */
p->dev.uevent_suppress = 1;
- device_add(&p->dev);
+ err = device_add(&p->dev);
+ if (err)
+ goto out1;
partition_sysfs_add_subdir(p);
p->dev.uevent_suppress = 0;
- if (flags & ADDPART_FLAG_WHOLEDISK)
+ if (flags & ADDPART_FLAG_WHOLEDISK) {
err = device_create_file(&p->dev, &dev_attr_whole_disk);
+ if (err)
+ goto out2;
+ }
/* suppress uevent if the disk supresses it */
if (!disk->dev.uevent_suppress)
kobject_uevent(&p->dev.kobj, KOBJ_ADD);
+
+ return 0;
+
+out2:
+ device_del(&p->dev);
+out1:
+ put_device(&p->dev);
+ free_part_stats(p);
+ kfree(p);
+ return err;
}
/* Not exported, helper to add_disk(). */
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index e9874e7..dd9a37a 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -529,7 +529,7 @@ extern dev_t blk_lookup_devt(const char *name);
extern char *disk_name (struct gendisk *hd, int part, char *buf);
extern int rescan_partitions(struct gendisk *disk, struct block_device *bdev);
-extern void add_partition(struct gendisk *, int, sector_t, sector_t, int);
+extern int __must_check add_partition(struct gendisk *, int, sector_t, sector_t, int);
extern void delete_partition(struct gendisk *, int);
extern void printk_all_partitions(void);
Done.
I split it into two patches so that it's easier to read.
This email contains an RFC patch that introduces init and exit routines to
the file_system_type structure. These routines were mentioned in
an email I saw about XFS starting threads that aren't needed when no
XFS filesystems are mounted.
So I decided to try and implement the infrastructure to do this.
Please let me know what you think. I'm pretty sure I'm missing
something (like a lock, or a refcount), so feedback
would be appreciated.
--
This patch adds tracking to filesystem types, whereby the number of mounts
of a particular filesystem type can be determined. This has the added
benefit of introducing init and exit routines for filesystem types, which
are called on the first mount and last unmount of the filesystem type,
respectively.
This is useful for filesystems which share global resources between all
mounts, but only need these resources when at least one filesystem is
mounted. For example, XFS creates a number of kernel threads which aren't
required when there are no XFS filesystems mounted. This patch will allow
XFS to start those threads just before the first filesystem is mounted, and
to shut them down when the last filesystem has been unmounted.
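As a minimal sketch of how a filesystem might consume these hooks once
this patch is applied (the "examplefs" name and its thread helpers are
hypothetical, purely illustrative):

	/* Called on first mount of this type: bring up shared resources. */
	static int examplefs_init(void)
	{
		return examplefs_start_threads();
	}

	/* Called after the last unmount: tear shared resources down. */
	static void examplefs_exit(void)
	{
		examplefs_stop_threads();
	}

	static struct file_system_type examplefs_fs_type = {
		.owner   = THIS_MODULE,
		.name    = "examplefs",
		.get_sb  = examplefs_get_sb,
		.kill_sb = kill_block_super,
		.init    = examplefs_init,	/* new hook */
		.exit    = examplefs_exit,	/* new hook */
	};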
Signed-off-by: Tom Spink <tsp...@gmail.com>
---
fs/namespace.c | 9 +++++++++
fs/super.c | 25 +++++++++++++++++++++++++
include/linux/fs.h | 3 +++
3 files changed, 37 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 4fc302c..bfa2f39 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1025,6 +1025,7 @@ static void shrink_submounts(struct vfsmount *mnt, struct list_head *umounts);
static int do_umount(struct vfsmount *mnt, int flags)
{
struct super_block *sb = mnt->mnt_sb;
+ struct file_system_type *type = sb->s_type;
int retval;
LIST_HEAD(umount_list);
@@ -1108,6 +1109,14 @@ static int do_umount(struct vfsmount *mnt, int flags)
security_sb_umount_busy(mnt);
up_write(&namespace_sem);
release_mounts(&umount_list);
+
+ /* Check to see if the unmount is successful, and we're unmounting the
+ * last filesystem of this type. If we are, run the exit routine of
+ * the filesystem type.
+ */
+ if (retval == 0 && ((--type->nr_mounts == 0) && type->exit))
+ type->exit();
+
return retval;
}
diff --git a/fs/super.c b/fs/super.c
index 453877c..e1dba4b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -961,14 +961,39 @@ static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
struct vfsmount *
do_kern_mount(const char *fstype, int flags, const char *name, void *data)
{
+ int rc;
struct file_system_type *type = get_fs_type(fstype);
struct vfsmount *mnt;
if (!type)
return ERR_PTR(-ENODEV);
+
+ /* If this is the first mount, then initialise the filesystem type. */
+ if (type->nr_mounts == 0 && type->init) {
+ rc = type->init();
+
+ /* If initialisation failed, pass the error back down the chain. */
+ if (rc) {
+ put_filesystem(type);
+ return ERR_PTR(rc);
+ }
+ }
+
mnt = vfs_kern_mount(type, flags, name, data);
if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
!mnt->mnt_sb->s_subtype)
mnt = fs_set_subtype(mnt, fstype);
+
+ /* Check to see if the mount was successful, and if so, increment
+ * the mount counter. Otherwise, if we initialised the filesystem
+ * type already (and the mount just failed), we need to shut it
+ * back down.
+ */
+ if (!IS_ERR(mnt)) {
+ type->nr_mounts++;
+ } else if (type->nr_mounts == 0 && type->exit) {
+ type->exit();
+ }
+
put_filesystem(type);
return mnt;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f413085..ba92056 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1474,9 +1474,12 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc);
struct file_system_type {
const char *name;
int fs_flags;
+ int nr_mounts;
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
+ int (*init) (void);
+ void (*exit) (void);
struct module *owner;
struct file_system_type * next;
struct list_head fs_supers;
--
1.5.4.3
2008/5/13 Andrew Morton <ak...@linux-foundation.org>:
Given these comments from Andrew, I'm not adding this patch to the Trivial tree.
Please address Andrew's comments and re-submit - at which point I doubt
it'll be suitable for trivial, so please try to get it merged via
other maintainers.
--
Jesper Juhl <jespe...@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
Hi,
I'm just adding people to CC here, but also I had a couple of thoughts
after reviewing my own code.
I see that do_kern_mount is encapsulated with the BKL, but would it be
wise to introduce a lock (e.g. a mutex) now for reading and updating
nr_mounts (and hence calling ->init), rather than wait for the BKL
removal to come round here?
Also, have I got all the cases where a filesystem is unmounted,
because I now see umount_tree, and am wondering if decrementing the
nr_mounts field should be done in here, in the loop of vfsmounts... or
is it sufficient to leave it at the end of do_umount?
--
Regards,
Tom Spink
No, you have not and no, doing that anywhere near that layer is hopeless.
a) Instances of filesystem can easily outlive all vfsmounts,
let alone their attachment to namespaces.
b) What should happen if init is done in the middle of exit?
c) Why do we need to bother, anyway?
We had a discussion about filesystems starting threads without an
active instance. I suggested tracking instances and add ->init / ->exit
methods to struct file_system_type for these kinds of instances.
But we should track superblock instances, not vfsmount instances of
course. Tom, you probably don't even need a counter; emptiness
of file_system_type.fs_supers should be indication enough. And yes
we'd need locking to prevent init racing with exit.
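Roughly, the shape of that suggestion in sget() would be (a sketch only,
not a tested patch; "fs_supers_lock" stands for whatever lock ends up
protecting the fs_supers list):

	mutex_lock(&type->fs_supers_lock);
	if (list_empty(&type->fs_supers) && type->init) {
		/* first superblock of this type */
		err = type->init();
		if (err) {
			mutex_unlock(&type->fs_supers_lock);
			return ERR_PTR(err);
		}
	}
	list_add(&s->s_instances, &type->fs_supers);
	mutex_unlock(&type->fs_supers_lock);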
Hi Al,
> No, you have not and no, doing that anywhere near that layer is hopeless.
>
> a) Instances of filesystem can easily outlive all vfsmounts,
> let alone their attachment to namespaces.
I see! Whoops...
> b) What should happen if init is done in the middle of exit?
Okay, I guess *some* sort of locking is in order. :)
> c) Why do we need to bother, anyway?
Well, just for the reason I mentioned: I saw the posting about XFS
starting threads (when compiled into the kernel) even when there's no use
of an XFS filesystem at all - there was a suggestion that something
like this be tried, so I thought I'd give it a go.
Thanks for replying!
--
Regards,
Tom Spink
Hi Guys,
Thanks for looking! So I've had another go - this time taking the
superblock approach, and hopefully I've got the locking right too.
Let me know what you think (or if I'm still barking up the wrong
tree)!
---
From: Tom Spink <tsp...@gmail.com>
Date: Tue, 20 May 2008 16:04:51 +0100
Subject: [PATCH] Introduce on-demand filesystem initialisation
This patch adds on-demand filesystem initialisation capabilities to the VFS,
whereby an init routine will be executed on first use of a particular
filesystem type. Also, an exit routine will be executed when the last
superblock of a filesystem type is deactivated.
This is useful for filesystems that share global resources between all
instances of the filesystem, but only need those resources when there are
any users of the filesystem. This lets the filesystem initialise those
resources (kernel threads or caches, say) when the first superblock is
created. It also lets the filesystem clean up those resources when the
last superblock is deactivated.
Signed-off-by: Tom Spink <tsp...@gmail.com>
---
fs/filesystems.c | 2 ++
fs/super.c | 22 +++++++++++++++++++++-
include/linux/fs.h | 3 +++
3 files changed, 26 insertions(+), 1 deletions(-)
diff --git a/fs/filesystems.c b/fs/filesystems.c
index f37f872..59b2eaa 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -79,6 +79,7 @@ int register_filesystem(struct file_system_type * fs)
res = -EBUSY;
else
*p = fs;
+ mutex_init(&fs->fs_supers_lock);
write_unlock(&file_systems_lock);
return res;
}
@@ -105,6 +106,7 @@ int unregister_filesystem(struct file_system_type * fs)
tmp = &file_systems;
while (*tmp) {
if (fs == *tmp) {
+ mutex_destroy(&fs->fs_supers_lock);
*tmp = fs->next;
fs->next = NULL;
write_unlock(&file_systems_lock);
diff --git a/fs/super.c b/fs/super.c
index 453877c..e3a3186 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -184,6 +184,11 @@ void deactivate_super(struct super_block *s)
fs->kill_sb(s);
put_filesystem(fs);
put_super(s);
+
+ mutex_lock(&fs->fs_supers_lock);
+ if (list_empty(&fs->fs_supers) && fs->exit)
+ fs->exit();
+ mutex_unlock(&fs->fs_supers_lock);
}
}
@@ -365,10 +370,25 @@ retry:
destroy_super(s);
return ERR_PTR(err);
}
+
+ mutex_lock(&type->fs_supers_lock);
+ if (list_empty(&type->fs_supers) && type->init) {
+ err = type->init();
+ if (err) {
+ mutex_unlock(&type->fs_supers_lock);
+ spin_unlock(&sb_lock);
+ destroy_super(s);
+ return ERR_PTR(err);
+ }
+ }
+
+ list_add(&s->s_instances, &type->fs_supers);
+ mutex_unlock(&type->fs_supers_lock);
+
s->s_type = type;
strlcpy(s->s_id, type->name, sizeof(s->s_id));
list_add_tail(&s->s_list, &super_blocks);
- list_add(&s->s_instances, &type->fs_supers);
+
spin_unlock(&sb_lock);
get_filesystem(type);
return s;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f413085..92d446f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1477,8 +1477,11 @@ struct file_system_type {
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
+ int (*init) (void);
+ void (*exit) (void);
struct module *owner;
struct file_system_type * next;
+ struct mutex fs_supers_lock;
struct list_head fs_supers;
struct lock_class_key s_lock_key;
--
1.5.4.3
You can't take a mutex while holding a spinlock -- what if you had to
sleep to acquire the mutex?
I imagine you also don't want to hold a spinlock while calling the
->init or ->exit -- what if the fs wants to sleep in there (e.g. allocate
memory with GFP_KERNEL)?
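Schematically, the ordering rule being described (a sketch, not code from
the patch):

	/* Broken: mutex_lock() may sleep while we hold a spinlock. */
	spin_lock(&sb_lock);
	mutex_lock(&type->fs_supers_lock);

	/* OK: take the sleeping lock first and the spinlock inside it,
	 * and call nothing that can sleep (->init, ->exit, GFP_KERNEL
	 * allocations) until the spinlock has been dropped. */
	mutex_lock(&type->fs_supers_lock);
	spin_lock(&sb_lock);
	...
	spin_unlock(&sb_lock);
	mutex_unlock(&type->fs_supers_lock);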
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Oh no! This is bad. I really need to devise some script to stress
test my code - and also make myself pay attention to what I'm doing.
Sorry for the noise, guys.
--
Tom Spink
Hi Guys,
I've taken some more time to go over the locking semantics. I wrote a
quick toy filesystem to simulate delays, blocking, memory allocation,
etc in the init and exit routines - and with an appropriately large
amount of printk's everywhere, I saw quite a few interleavings.
I *think* I may have got it right, but please, let me know what you
think! The only thing that I think may be wrong with this patch is the
spin_lock/unlock at the end of sget, where the superblock is
list_add_tailed into the super_blocks list. I believe this opens the
possibility for the same superblock being list_add_tailed twice... can
anyone else see this code-path, and is it a problem?
---
From: Tom Spink <tsp...@gmail.com>
Date: Tue, 20 May 2008 16:04:51 +0100
Subject: [PATCH] Introduce on-demand filesystem initialisation
This patch adds on-demand filesystem initialisation capabilities to the VFS,
whereby an init routine will be executed on first use of a particular
filesystem type. Also, an exit routine will be executed when the last
superblock of a filesystem type is deactivated.
This is useful for filesystems that share global resources between all
instances of the filesystem, but only need those resources when there are
any users of the filesystem. This lets the filesystem initialise those
resources (kernel threads or caches, say) when the first superblock is
created. It also lets the filesystem clean up those resources when the
last superblock is deactivated.
Signed-off-by: Tom Spink <tsp...@gmail.com>
---
fs/filesystems.c | 2 ++
fs/super.c | 31 +++++++++++++++++++++++++++++--
include/linux/fs.h | 3 +++
3 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/fs/filesystems.c b/fs/filesystems.c
index f37f872..59b2eaa 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -79,6 +79,7 @@ int register_filesystem(struct file_system_type * fs)
res = -EBUSY;
else
*p = fs;
+ mutex_init(&fs->fs_supers_lock);
write_unlock(&file_systems_lock);
return res;
}
@@ -105,6 +106,7 @@ int unregister_filesystem(struct file_system_type * fs)
tmp = &file_systems;
while (*tmp) {
if (fs == *tmp) {
+ mutex_destroy(&fs->fs_supers_lock);
*tmp = fs->next;
fs->next = NULL;
write_unlock(&file_systems_lock);
diff --git a/fs/super.c b/fs/super.c
index 453877c..7625a90 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -287,6 +287,7 @@ int fsync_super(struct super_block *sb)
void generic_shutdown_super(struct super_block *sb)
{
const struct super_operations *sop = sb->s_op;
+ struct file_system_type *type = sb->s_type;
if (sb->s_root) {
shrink_dcache_for_umount(sb);
@@ -315,8 +316,14 @@ void generic_shutdown_super(struct super_block *sb)
spin_lock(&sb_lock);
/* should be initialized for __put_super_and_need_restart() */
list_del_init(&sb->s_list);
- list_del(&sb->s_instances);
spin_unlock(&sb_lock);
+
+ mutex_lock(&type->fs_supers_lock);
+ list_del(&sb->s_instances);
+ if (list_empty(&type->fs_supers) && type->exit)
+ type->exit();
+ mutex_unlock(&type->fs_supers_lock);
+
up_write(&sb->s_umount);
}
@@ -365,11 +372,31 @@ retry:
destroy_super(s);
return ERR_PTR(err);
}
+
s->s_type = type;
strlcpy(s->s_id, type->name, sizeof(s->s_id));
- list_add_tail(&s->s_list, &super_blocks);
+
+ spin_unlock(&sb_lock);
+
+ mutex_lock(&type->fs_supers_lock);
+ if (list_empty(&type->fs_supers) && type->init) {
+ err = type->init();
+ if (err) {
+ mutex_unlock(&type->fs_supers_lock);
+ destroy_super(s);
+
+ if (err < 0)
+ return ERR_PTR(err);
+ }
+ }
+
list_add(&s->s_instances, &type->fs_supers);
+ mutex_unlock(&type->fs_supers_lock);
+
+ spin_lock(&sb_lock);
+ list_add_tail(&s->s_list, &super_blocks);
spin_unlock(&sb_lock);
+
get_filesystem(type);
return s;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f413085..92d446f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1477,8 +1477,11 @@ struct file_system_type {
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
+ int (*init) (void);
+ void (*exit) (void);
struct module *owner;
struct file_system_type * next;
+ struct mutex fs_supers_lock;
struct list_head fs_supers;
struct lock_class_key s_lock_key;
--
1.5.4.3
Hi Tom,
I spotted one definite bug; on failure, you leave the superblock on
the super_blocks list.
Your locking may well be correct, but it has the hallmarks of being "a bit
tricky" and a bit tricky means potentially buggy. How about doing the
nesting the other way round, ie take the mutex first, then the spinlock?
The code needs a bit of tweaking because you don't want to put the
superblock on any list where it can be found until it's fully
initialised. This may not be quite right:
> + mutex_lock(&type->fs_supers_lock);
> spin_lock(&sb_lock);
> /* should be initialized for __put_super_and_need_restart() */
> list_del_init(&sb->s_list);
> list_del(&sb->s_instances);
> spin_unlock(&sb_lock);
> +
> + if (list_empty(&type->fs_supers) && type->exit)
> + type->exit();
> + mutex_unlock(&type->fs_supers_lock);
> +
> up_write(&sb->s_umount);
> }
>
sget is a little more complex ... the fs_supers_lock would need to be
dropped in a lot more places than I've shown here:
@@ -365,11 +372,31 @@ retry:
retry:
+ mutex_lock(&type->fs_supers_lock);
spin_lock(&sb_lock);
destroy_super(s);
return ERR_PTR(err);
}
s->s_type = type;
strlcpy(s->s_id, type->name, sizeof(s->s_id));
+ if (list_empty(&type->fs_supers) && type->init) {
+ spin_unlock(&sb_lock);
+ err = type->init();
+ if (err) {
+ mutex_unlock(&type->fs_supers_lock);
+ destroy_super(s);
+ return ERR_PTR(err);
+ }
+ spin_lock(&sb_lock);
+ }
list_add_tail(&s->s_list, &super_blocks);
list_add(&s->s_instances, &type->fs_supers);
spin_unlock(&sb_lock);
+ mutex_unlock(&type->fs_supers_lock);
get_filesystem(type);
return s;
}
Hi Matthew,
> I spotted one definite bug; on failure, you leave the superblock on
> the super_blocks list.
I spotted this while I was coding, and I was careful not to let it get
added to the list... If the ->init routine fails, the superblock
hasn't even been added to the list yet. The patch moves this line:
list_add_tail(&s->s_list, &super_blocks);
down to after the ->init call.
> Your locking may well be correct, but it has the hallmarks of being "a bit
> tricky" and a bit tricky means potentially buggy. How about doing the
> nesting the other way round, ie take the mutex first, then the spinlock?
Thanks for the suggestion!
> The code needs a bit of tweaking because you don't want to put the
> superblock on any list where it can be found until it's fully
> initialised. This may not be quite right:
>
>> + mutex_lock(&type->fs_supers_lock);
>> spin_lock(&sb_lock);
>> /* should be initialized for __put_super_and_need_restart() */
>> list_del_init(&sb->s_list);
>> list_del(&sb->s_instances);
>> spin_unlock(&sb_lock);
>> +
>> + if (list_empty(&type->fs_supers) && type->exit)
>> + type->exit();
>> + mutex_unlock(&type->fs_supers_lock);
>> +
>> up_write(&sb->s_umount);
>> }
>>
I'll definitely give it a go.
I had something similar earlier, but I thought it started to look
slightly messy when I discovered that dropping the spinlock would lead
to a racy ->init... but I hadn't thought of putting the mutex outside
the spinlock, with the mutex protecting ->init and ->exit (I was getting
caught up in trying not to go to sleep inside a spinlock).
Thanks!
--
Tom Spink
The filesystem may want to have the superblock passed.
Well, we'll see once a filesystem has a need for it.
Ready for another? <g>
Here's another try, with Matthew's suggestion of moving the mutex
outside the spinlock. Again, I've used a wee stress test that tries
to mount a toy filesystem many times, with random pauses in the init
routines. It seems to pass this (and again I've seen quite a few
interleavings of the calls), and a mental scan of the code paths leads
me to believe the locking is correct.
Thanks for putting up with me, guys!
-- Tom
--
From: Tom Spink <tsp...@gmail.com>
Date: Wed, 21 May 2008 13:29:07 +0100
Subject: [PATCH] Introduce on-demand filesystem initialisation
This patch adds on-demand filesystem initialisation capabilities to the VFS,
whereby an init routine will be executed on first use of a particular
filesystem type. Also, an exit routine will be executed when the last
superblock of a filesystem type is deactivated.
This is useful for filesystems that share global resources between all
instances of the filesystem, but only need those resources when there are
any users of the filesystem. This lets the filesystem initialise those
resources (kernel threads or caches, say) when the first superblock is
created. It also lets the filesystem clean up those resources when the
last superblock is deactivated.
Signed-off-by: Tom Spink <tsp...@gmail.com>
---
fs/filesystems.c | 2 ++
fs/super.c | 29 ++++++++++++++++++++++++++++-
include/linux/fs.h | 3 +++
3 files changed, 33 insertions(+), 1 deletions(-)
diff --git a/fs/filesystems.c b/fs/filesystems.c
index f37f872..59b2eaa 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -79,6 +79,7 @@ int register_filesystem(struct file_system_type * fs)
res = -EBUSY;
else
*p = fs;
+ mutex_init(&fs->fs_supers_lock);
write_unlock(&file_systems_lock);
return res;
}
@@ -105,6 +106,7 @@ int unregister_filesystem(struct file_system_type * fs)
tmp = &file_systems;
while (*tmp) {
if (fs == *tmp) {
+ mutex_destroy(&fs->fs_supers_lock);
*tmp = fs->next;
fs->next = NULL;
write_unlock(&file_systems_lock);
diff --git a/fs/super.c b/fs/super.c
index 453877c..65252c2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -287,7 +287,9 @@ int fsync_super(struct super_block *sb)
void generic_shutdown_super(struct super_block *sb)
{
const struct super_operations *sop = sb->s_op;
+ struct file_system_type *type = sb->s_type;
+ mutex_lock(&type->fs_supers_lock);
if (sb->s_root) {
shrink_dcache_for_umount(sb);
fsync_super(sb);
@@ -317,7 +319,12 @@ void generic_shutdown_super(struct super_block *sb)
list_del_init(&sb->s_list);
list_del(&sb->s_instances);
spin_unlock(&sb_lock);
+
+ if (list_empty(&type->fs_supers) && type->exit)
+ type->exit();
+
up_write(&sb->s_umount);
+ mutex_unlock(&type->fs_supers_lock);
}
EXPORT_SYMBOL(generic_shutdown_super);
@@ -338,6 +345,7 @@ struct super_block *sget(struct file_system_type *type,
struct super_block *old;
int err;
+ mutex_lock(&type->fs_supers_lock);
retry:
spin_lock(&sb_lock);
if (test) {
@@ -348,14 +356,17 @@ retry:
goto retry;
if (s)
destroy_super(s);
+ mutex_unlock(&type->fs_supers_lock);
return old;
}
}
if (!s) {
spin_unlock(&sb_lock);
s = alloc_super(type);
- if (!s)
+ if (!s) {
+ mutex_unlock(&type->fs_supers_lock);
return ERR_PTR(-ENOMEM);
+ }
goto retry;
}
@@ -363,14 +374,30 @@ retry:
if (err) {
spin_unlock(&sb_lock);
destroy_super(s);
+ mutex_unlock(&type->fs_supers_lock);
return ERR_PTR(err);
}
+
+ if (list_empty(&type->fs_supers) && type->init) {
+ spin_unlock(&sb_lock);
+ err = type->init();
+ if (err < 0) {
+ destroy_super(s);
+ mutex_unlock(&type->fs_supers_lock);
+ return ERR_PTR(err);
+ }
+ spin_lock(&sb_lock);
+ }
+
s->s_type = type;
strlcpy(s->s_id, type->name, sizeof(s->s_id));
+
list_add_tail(&s->s_list, &super_blocks);
list_add(&s->s_instances, &type->fs_supers);
+
spin_unlock(&sb_lock);
get_filesystem(type);
+ mutex_unlock(&type->fs_supers_lock);
return s;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f413085..92d446f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1477,8 +1477,11 @@ struct file_system_type {
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
+ int (*init) (void);
+ void (*exit) (void);
struct module *owner;
struct file_system_type * next;
+ struct mutex fs_supers_lock;
struct list_head fs_supers;
struct lock_class_key s_lock_key;
--
1.5.4.3
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/Makefile | 2 +
block/blk-io-throttle.c | 219 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 10 ++
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 10 ++
5 files changed, 247 insertions(+), 0 deletions(-)
diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..8dec69b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -14,3 +14,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..cc2d10f
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,219 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <righi....@gmail.com>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/hardirq.h>
+#include <linux/spinlock.h>
+#include <linux/blk-io-throttle.h>
+
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ spinlock_t lock; /* protects the accounting of the cgroup i/o stats */
+ unsigned long iorate;
+ unsigned long req;
+ unsigned long last_request;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+ struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct iothrottle *iot = kzalloc(sizeof(*iot), GFP_KERNEL);
+
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+
+ spin_lock_init(&iot->lock);
+ iot->last_request = jiffies;
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ kfree(cgroup_to_iothrottle(cont));
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ ssize_t count, ret;
+ unsigned long delta, iorate, req, last_request;
+ struct iothrottle *iot;
+ char *page;
+
+ page = (char *)__get_free_page(GFP_TEMPORARY);
+ if (!page)
+ return -ENOMEM;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ cgroup_unlock();
+ ret = -ENODEV;
+ goto out;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+ spin_lock_irq(&iot->lock);
+
+ delta = (long)jiffies - (long)iot->last_request;
+ iorate = iot->iorate;
+ req = iot->req;
+ last_request = iot->last_request;
+
+ spin_unlock_irq(&iot->lock);
+ cgroup_unlock();
+
+ /* print also additional debugging stuff */
+ count = sprintf(page, "bandwidth-max: %lu KiB/sec\n"
+ " requested: %lu bytes\n"
+ " last request: %lu jiffies\n"
+ " delta: %lu jiffies\n",
+ iorate, req, last_request, delta);
+
+ ret = simple_read_from_buffer(buf, nbytes, ppos, page, count);
+out:
+ free_page((unsigned long)page);
+ return ret;
+}
+
+static int iothrottle_write_u64(struct cgroup *cont, struct cftype *cft,
+ u64 val)
+{
+ struct iothrottle *iot;
+ int ret = 0;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+
+ spin_lock_irq(&iot->lock);
+ iot->iorate = (unsigned long)val;
+ iot->req = 0;
+ iot->last_request = jiffies;
+ spin_unlock_irq(&iot->lock);
+out:
+ cgroup_unlock();
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth",
+ .read = iothrottle_read,
+ .write_u64 = iothrottle_write_u64,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+};
+
+static inline int __cant_sleep(void)
+{
+ return in_atomic() || in_interrupt() || irqs_disabled();
+}
+
+void cgroup_io_account(size_t bytes)
+{
+ struct iothrottle *iot;
+ unsigned long delta, t;
+ long sleep;
+ int cant_sleep = __cant_sleep();
+
+ iot = task_to_iothrottle(current);
+ if (!iot)
+ return;
+
+ spin_lock_irq(&iot->lock);
+ if (!iot->iorate)
+ goto out;
+
+ /* Account the I/O activity */
+ iot->req += bytes;
+
+ /* Evaluate if we need to throttle the current process */
+ if (cant_sleep)
+ goto out;
+
+ delta = (long)jiffies - (long)iot->last_request;
+ if (!delta)
+ goto out;
+
+ t = msecs_to_jiffies(iot->req / iot->iorate);
+ if (!t)
+ goto out;
+
+ sleep = t - delta;
+ if (sleep > 0) {
+ spin_unlock_irq(&iot->lock);
+ pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
+ current, current->comm, sleep);
+ schedule_timeout_killable(sleep);
+ return;
+ }
+
+ /* Reset I/O accounting */
+ iot->req = 0;
+ iot->last_request = jiffies;
+out:
+ spin_unlock_irq(&iot->lock);
+}
+EXPORT_SYMBOL(cgroup_io_account);
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..950eb03
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,10 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void cgroup_io_account(size_t bytes);
+#else
+static inline void cgroup_io_account(size_t bytes) { }
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif /* BLK_IO_THROTTLE_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e287745..0caf3c2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -48,3 +48,9 @@ SUBSYS(devices)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index 6135d07..6840f64 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -305,6 +305,16 @@ config CGROUP_DEVICE
Provides a cgroup implementing whitelists for devices which
a process in the cgroup can mknod or open.
+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+ depends on CGROUPS && EXPERIMENTAL
+ help
+ This allows limiting the maximum I/O bandwidth for specific
+ cgroup(s).
+ See Documentation/controllers/io-throttle.txt for more information.
+
+ If unsure, say N.
+
config CPUSETS
bool "Cpuset support"
depends on SMP && CGROUPS
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
Documentation/controllers/io-throttle.txt | 81 +++++++++++++++++++++++++++++
1 files changed, 81 insertions(+), 0 deletions(-)
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..e7ab050
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,81 @@
+
+ Block device I/O bandwidth controller
+
+1. Description
+
+This controller allows limiting the block I/O bandwidth for specific process
+containers (cgroups) by imposing additional delays on I/O requests from those
+processes that exceed the limits defined in the control group filesystem.
+
+Bandwidth limiting rules offer better control over QoS than priority-based
+or weight-based solutions, which only give information about applications'
+relative performance requirements.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and QoS of the different control groups sharing the same block
+devices.
+
+NOTE: if you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+2. User Interface
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+Example:
+
+* mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* add the current shell process to the "foo" cgroup:
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* give maximum 1MiB/s of I/O bandwidth for the cgroup "foo":
+ # /bin/echo 1024 > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s (blockio.bandwidth is expressed in KiB/s).
+
+3. Advantages of providing this feature
+
+* Allow QoS for block device I/O among different cgroups
+* Improve I/O performance predictability on block devices shared between
+ different cgroups
+* It is independent of the particular I/O scheduler (anticipatory, deadline,
+ CFQ, noop) and/or the underlying block devices
+* The bandwidth limitations are guaranteed both for synchronous and
+ asynchronous operations, even for I/O passing through the page cache or
+ buffers, and not only for direct I/O (see below for details)
+
+4. Design
+
+The I/O throttling is performed imposing an explicit timeout, via
+schedule_timeout_killable() on the processes that exceed the I/O bandwidth
+dedicated to the cgroup they belong to.
+
+It just works as expected for read operations: the real I/O activity is reduced
+synchronously according to the defined limitations.
+
+Write operations, instead, are modeled depending on the dirty pages ratio
+(write throttling in memory), since the writes to the real block device are
+processed asynchronously by different kernel threads (pdflush). However, the
+dirty pages ratio is directly proportional to the actual I/O that will be
+performed on the real block device. So, due to the asynchronous transfers
+through the page cache, the I/O throttling in memory can be considered a form
+of anticipatory throttling to the underlying block devices.
+
+Multiple re-writes in already dirtied page cache areas are not considered for
+accounting the I/O activity. The same applies to multiple re-reads of
+pages already present in the page cache.
+
+This means that a process that re-writes and/or re-reads multiple times the
+same blocks in a file (without re-creating it by truncate(), ftruncate(),
+creat(), etc.) is affected by the I/O limitations only for the actual I/O
+performed to (or from) the underlying block devices.
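As a rough worked example of the arithmetic in cgroup_io_account() above
(assuming HZ=1000, i.e. one jiffy per millisecond; the numbers are made
up):

	iorate = 1024                /* blockio.bandwidth limit, KiB/s */
	req    = 2097152             /* 2 MiB submitted since last_request */
	t      = msecs_to_jiffies(2097152 / 1024) = 2048 jiffies
	delta  = 500                 /* jiffies since last_request */
	sleep  = t - delta = 1548 jiffies

so the task is put to sleep for roughly 1.5 seconds before it may
continue. (Note that req / iorate treats 1 KiB/s as 1 byte/ms, which
overestimates the required delay by about 2.4%.)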
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/blk-core.c | 2 ++
fs/buffer.c | 6 ++++++
fs/direct-io.c | 2 ++
mm/page-writeback.c | 5 +++++
mm/readahead.c | 2 ++
5 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 6a9cc0d..d6b3353 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/blktrace_api.h>
@@ -1485,6 +1486,7 @@ void submit_bio(int rw, struct bio *bio)
count_vm_events(PGPGOUT, count);
} else {
task_io_account_read(bio->bi_size);
+ cgroup_io_account(bio->bi_size);
count_vm_events(PGPGIN, count);
}
diff --git a/fs/buffer.c b/fs/buffer.c
index a073f3f..eb4893c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -35,6 +35,7 @@
#include <linux/suspend.h>
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
@@ -700,6 +701,8 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
static int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ size_t cgroup_io_acct = 0;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);
@@ -715,12 +718,15 @@ static int __set_page_dirty(struct page *page,
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ if (cgroup_io_acct)
+ cgroup_io_account(cgroup_io_acct);
return 1;
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9e81add..1b82224 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/blk-io-throttle.h>
#include <asm/atomic.h>
/*
@@ -667,6 +668,7 @@ submit_page_section(struct dio *dio, struct page *page,
* Read accounting is performed in submit_bio()
*/
task_io_account_write(len);
+ cgroup_io_account(len);
}
/*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 789b6ad..97a0c74 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1077,6 +1078,7 @@ int __set_page_dirty_nobuffers(struct page *page)
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;
+ size_t cgroup_io_acct = 0;
if (!mapping)
return 1;
@@ -1091,6 +1093,7 @@ int __set_page_dirty_nobuffers(struct page *page)
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -1100,6 +1103,8 @@ int __set_page_dirty_nobuffers(struct page *page)
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
+ if (cgroup_io_acct)
+ cgroup_io_account(cgroup_io_acct);
return 1;
}
return 0;
diff --git a/mm/readahead.c b/mm/readahead.c
index d8723a5..fe2f865 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
@@ -76,6 +77,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_account(PAGE_CACHE_SIZE);
}
return ret;
The goal of this patchset is to implement a block device I/O bandwidth
controller using cgroups.
Detailed information about the design, its goals and usage is described in
the documentation.
Reviews, comments and feedback are welcome.
-Andrea
Hi, Andrea,
There are several parallel efforts for the IO controller. Could you
describe/compare them with yours? Is any consensus building taking place?
CC'ing containers mailing list.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Hi Balbir,
I've seen the Ryo Tsuruta's dm-band (http://lwn.net/Articles/266257),
the CFQ cgroup solution proposed by Vasily Tarasov
(http://lwn.net/Articles/274652) and a similar approach by Satoshi
Uchida (http://lwn.net/Articles/275944).
The first is implemented at the device mapper layer and AFAIK at the
moment it only allows defining per-task, per-user and per-group rules
(cgroup support is in the TODO list anyway).
The second and third solutions are implemented at the i/o scheduler
layer, CFQ in particular.
They work as expected with direct i/o operations (or synchronous reads).
The problem is: how to limit the i/o activity of an application that
already wrote all the data in memory? The i/o scheduler is not the right
place to throttle application writes, because it's "too late". At the
very least we can map back the i/o path to find which process originally
dirtied memory and depending on this information delay the dispatching
of the requests. However, this needs bigger buffers, more page cache
usage, that can lead to potential OOM conditions in massively shared
environments.
So, my approach has the disadvantage or the advantage (depending on the
context and the requirements) of explicitly choking applications'
requests. Other solutions that operate in the subsystem used to
dispatch i/o requests are probably better at maximizing overall
performance, but do not offer the same control over a real QoS as
request limiting can do.
Another difference is that all of them are priority/weight based. The
io-throttle controller, instead, allows defining direct bandwidth
limiting rules. The difference here is that the priority-based approach
propagates bursts and does no smoothing, while the bandwidth-limiting
approach controls bursts by smoothing the i/o rate. This means better performance
predictability at the cost of poor throughput optimization.
I'll run some benchmarks and post the results ASAP. It would also be
interesting to run the same benchmarks using the other i/o controllers
and compare the results in terms of fairness, performance
predictability, responsiveness, throughput, etc. I'll see what I can do.
-Andrea
I wrote a toy filesystem (testfs) to simulate scheduling/allocation delays and
to torture the mount/unmount cycles. I didn't manage to deadlock the system
in my tests. XFS also works as expected, in that the global threads
are not created until an XFS filesystem is mounted for the first time. When the
last XFS filesystem is unmounted, the threads go away.
Please let me know what you think!
-- Tom
fs/filesystems.c | 2 +
fs/super.c | 47 +++++++++++++++++++++++++++++++++++-
fs/xfs/linux-2.6/xfs_super.c | 55 +++++++++++++++++++++++-------------------
include/linux/fs.h | 3 ++
4 files changed, 81 insertions(+), 26 deletions(-)
This is useful when XFS is compiled into the kernel but never
actually used, as it stops XFS from creating global threads until
they are needed.
Signed-off-by: Tom Spink <tsp...@gmail.com>
---
fs/xfs/linux-2.6/xfs_super.c | 55 +++++++++++++++++++++++-------------------
1 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 742b2c7..3e7340a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1422,23 +1422,9 @@ static struct quotactl_ops xfs_quotactl_operations = {
.set_xquota = xfs_fs_setxquota,
};
-static struct file_system_type xfs_fs_type = {
- .owner = THIS_MODULE,
- .name = "xfs",
- .get_sb = xfs_fs_get_sb,
- .kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
-};
-
-
-STATIC int __init
-init_xfs_fs( void )
+static int xfs_fs_init(void)
{
- int error;
- static char message[] __initdata = KERN_INFO \
- XFS_VERSION_STRING " with " XFS_BUILD_OPTIONS " enabled\n";
-
- printk(message);
+ int error;
ktrace_init(64);
@@ -1455,14 +1441,8 @@ init_xfs_fs( void )
uuid_init();
vfs_initquota();
- error = register_filesystem(&xfs_fs_type);
- if (error)
- goto undo_register;
return 0;
-undo_register:
- xfs_buf_terminate();
-
undo_buffers:
xfs_destroy_zones();
@@ -1470,17 +1450,42 @@ undo_zones:
return error;
}
-STATIC void __exit
-exit_xfs_fs( void )
+static void xfs_fs_exit(void)
{
vfs_exitquota();
- unregister_filesystem(&xfs_fs_type);
xfs_cleanup();
xfs_buf_terminate();
xfs_destroy_zones();
ktrace_uninit();
}
+static struct file_system_type xfs_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "xfs",
+ .get_sb = xfs_fs_get_sb,
+ .kill_sb = kill_block_super,
+ .fs_flags = FS_REQUIRES_DEV,
+ .init = xfs_fs_init,
+ .exit = xfs_fs_exit,
+};
+
+
+STATIC int __init
+init_xfs_fs( void )
+{
+ static char message[] __initdata = KERN_INFO \
+ XFS_VERSION_STRING " with " XFS_BUILD_OPTIONS " enabled\n";
+
+ printk(message);
+ return register_filesystem(&xfs_fs_type);
+}
+
+STATIC void __exit
+exit_xfs_fs( void )
+{
+ unregister_filesystem(&xfs_fs_type);
+}
+
module_init(init_xfs_fs);
module_exit(exit_xfs_fs);
--
1.5.4.3
This is useful for filesystems that share global resources between all
instances of the filesystem, but only need those resources when there are
any users of the filesystem. This lets the filesystem initialise those
resources (kernel threads or caches, say) when the first superblock is
created. It also lets the filesystem clean up those resources when the
last superblock is deactivated.
Signed-off-by: Tom Spink <tsp...@gmail.com>
---
fs/filesystems.c | 2 ++
fs/super.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 3 +++
3 files changed, 51 insertions(+), 1 deletions(-)
diff --git a/fs/filesystems.c b/fs/filesystems.c
index f37f872..59b2eaa 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -79,6 +79,7 @@ int register_filesystem(struct file_system_type * fs)
res = -EBUSY;
else
*p = fs;
+ mutex_init(&fs->fs_supers_lock);
write_unlock(&file_systems_lock);
return res;
}
@@ -105,6 +106,7 @@ int unregister_filesystem(struct file_system_type * fs)
tmp = &file_systems;
while (*tmp) {
if (fs == *tmp) {
+ mutex_destroy(&fs->fs_supers_lock);
*tmp = fs->next;
fs->next = NULL;
write_unlock(&file_systems_lock);
diff --git a/fs/super.c b/fs/super.c
index 453877c..af20175 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -181,7 +181,21 @@ void deactivate_super(struct super_block *s)
spin_unlock(&sb_lock);
DQUOT_OFF(s, 0);
down_write(&s->s_umount);
+
+ /* Take the mutex before calling kill_sb, because it may
+ * modify the fs_supers list.
+ */
+ mutex_lock(&fs->fs_supers_lock);
fs->kill_sb(s);
+
+ /* Check to see if this is the last superblock of the
+ * filesystem going down, and if it is, then run the exit
+ * routine.
+ */
+ if (list_empty(&fs->fs_supers) && fs->exit)
+ fs->exit();
+ mutex_unlock(&fs->fs_supers_lock);
+
put_filesystem(fs);
put_super(s);
}
@@ -338,6 +352,7 @@ struct super_block *sget(struct file_system_type *type,
struct super_block *old;
int err;
+ mutex_lock(&type->fs_supers_lock);
retry:
spin_lock(&sb_lock);
if (test) {
@@ -348,14 +363,17 @@ retry:
goto retry;
if (s)
destroy_super(s);
+ mutex_unlock(&type->fs_supers_lock);
return old;
}
}
if (!s) {
spin_unlock(&sb_lock);
s = alloc_super(type);
- if (!s)
+ if (!s) {
+ mutex_unlock(&type->fs_supers_lock);
return ERR_PTR(-ENOMEM);
+ }
goto retry;
}
@@ -363,14 +381,41 @@ retry:
if (err) {
spin_unlock(&sb_lock);
destroy_super(s);
+ mutex_unlock(&type->fs_supers_lock);
return ERR_PTR(err);
}
+
+ /* If this is the first superblock of this particular filesystem,
+ * run the init routine, if any. If we're going to do this, then we
+ * also need to drop the sb_lock spinlock.
+ */
+ if (list_empty(&type->fs_supers) && type->init) {
+ spin_unlock(&sb_lock);
+
+ /* We can do this, because we're holding the fs_supers_lock
+ * mutex.
+ */
+ err = type->init();
+ if (err < 0) {
+ /* If the filesystem failed to initialise, then back
+ * out, destroy the superblock and return the error.
+ */
+ destroy_super(s);
+ mutex_unlock(&type->fs_supers_lock);
+ return ERR_PTR(err);
+ }
+
+ spin_lock(&sb_lock);
+ }
+
s->s_type = type;
strlcpy(s->s_id, type->name, sizeof(s->s_id));
list_add_tail(&s->s_list, &super_blocks);
list_add(&s->s_instances, &type->fs_supers);
spin_unlock(&sb_lock);
get_filesystem(type);
+
+ mutex_unlock(&type->fs_supers_lock);
return s;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f413085..92d446f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1477,8 +1477,11 @@ struct file_system_type {
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
+ int (*init) (void);
+ void (*exit) (void);
struct module *owner;
struct file_system_type * next;
+ struct mutex fs_supers_lock;
struct list_head fs_supers;
struct lock_class_key s_lock_key;
--
1.5.4.3
--
As a practical application, this makes it possible, for example, to quickly find the top I/O
consumer when a process spawns many child threads that perform the actual I/O
work, because the aggregated I/O statistics can always be found in
/proc/pid/io.
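For example (hypothetical PID and made-up numbers, with the fields the
patch prints):

	# cat /proc/1234/io
	rchar: 2097152
	wchar: 1048576
	syscr: 512
	syscw: 256
	read_bytes: 1048576
	write_bytes: 524288
	cancelled_write_bytes: 0

With this patch, /proc/<tgid>/io reports the sums over all threads of the
group (including already-exited threads accumulated in signal_struct),
while /proc/<tgid>/task/<tid>/io keeps the per-thread counters.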
Acked-by: Balbir Singh <bal...@linux.vnet.ibm.com>
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
fs/proc/base.c | 86 ++++++++++++++++++++++++++++++++++++++++--------
include/linux/sched.h | 4 ++
kernel/exit.c | 27 +++++++++++++++
kernel/fork.c | 6 +++
4 files changed, 108 insertions(+), 15 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c447e07..37ba5c5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2356,29 +2356,82 @@ static int proc_base_fill_cache(struct file *filp, void *dirent,
}
#ifdef CONFIG_TASK_IO_ACCOUNTING
-static int proc_pid_io_accounting(struct task_struct *task, char *buffer)
-{
+static int do_io_accounting(struct task_struct *task, char *buffer, int whole)
+{
+ u64 rchar, wchar, syscr, syscw;
+ struct task_io_accounting ioac;
+
+ if (!whole) {
+ rchar = task->rchar;
+ wchar = task->wchar;
+ syscr = task->syscr;
+ syscw = task->syscw;
+ memcpy(&ioac, &task->ioac, sizeof(ioac));
+ } else {
+ unsigned long flags;
+ struct task_struct *t = task;
+ rchar = wchar = syscr = syscw = 0;
+ memset(&ioac, 0, sizeof(ioac));
+
+ rcu_read_lock();
+ do {
+ rchar += t->rchar;
+ wchar += t->wchar;
+ syscr += t->syscr;
+ syscw += t->syscw;
+
+ ioac.read_bytes += t->ioac.read_bytes;
+ ioac.write_bytes += t->ioac.write_bytes;
+ ioac.cancelled_write_bytes +=
+ t->ioac.cancelled_write_bytes;
+ t = next_thread(t);
+ } while (t != task);
+ rcu_read_unlock();
+
+ if (lock_task_sighand(task, &flags)) {
+ struct signal_struct *sig = task->signal;
+
+ rchar += sig->rchar;
+ wchar += sig->wchar;
+ syscr += sig->syscr;
+ syscw += sig->syscw;
+
+ ioac.read_bytes += sig->ioac.read_bytes;
+ ioac.write_bytes += sig->ioac.write_bytes;
+ ioac.cancelled_write_bytes +=
+ sig->ioac.cancelled_write_bytes;
+
+ unlock_task_sighand(task, &flags);
+ }
+ }
+
return sprintf(buffer,
-#ifdef CONFIG_TASK_XACCT
"rchar: %llu\n"
"wchar: %llu\n"
"syscr: %llu\n"
"syscw: %llu\n"
-#endif
"read_bytes: %llu\n"
"write_bytes: %llu\n"
"cancelled_write_bytes: %llu\n",
-#ifdef CONFIG_TASK_XACCT
- (unsigned long long)task->rchar,
- (unsigned long long)task->wchar,
- (unsigned long long)task->syscr,
- (unsigned long long)task->syscw,
-#endif
- (unsigned long long)task->ioac.read_bytes,
- (unsigned long long)task->ioac.write_bytes,
- (unsigned long long)task->ioac.cancelled_write_bytes);
+ (unsigned long long)rchar,
+ (unsigned long long)wchar,
+ (unsigned long long)syscr,
+ (unsigned long long)syscw,
+ (unsigned long long)ioac.read_bytes,
+ (unsigned long long)ioac.write_bytes,
+ (unsigned long long)ioac.cancelled_write_bytes);
+}
+
+static int proc_tid_io_accounting(struct task_struct *task, char *buffer)
+{
+ return do_io_accounting(task, buffer, 0);
}
-#endif
+
+static int proc_tgid_io_accounting(struct task_struct *task, char *buffer)
+{
+ return do_io_accounting(task, buffer, 1);
+}
+#endif /* CONFIG_TASK_IO_ACCOUNTING */
/*
* Thread groups
@@ -2450,7 +2503,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("coredump_filter", S_IRUGO|S_IWUSR, coredump_filter),
#endif
#ifdef CONFIG_TASK_IO_ACCOUNTING
- INF("io", S_IRUGO, pid_io_accounting),
+ INF("io", S_IRUGO, tgid_io_accounting),
#endif
};
@@ -2778,6 +2831,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, fault_inject),
#endif
+#ifdef CONFIG_TASK_IO_ACCOUNTING
+ INF("io", S_IRUGO, tid_io_accounting),
+#endif
};
static int proc_tid_base_readdir(struct file * filp,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5395a61..d4d9adf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -504,6 +504,10 @@ struct signal_struct {
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
+#ifdef CONFIG_TASK_XACCT
+ u64 rchar, wchar, syscr, syscw;
+#endif
+ struct task_io_accounting ioac;
/*
* Cumulative ns of scheduled CPU time for dead threads in the
diff --git a/kernel/exit.c b/kernel/exit.c
index 1510f78..1f3c0ec 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -120,6 +120,18 @@ static void __exit_signal(struct task_struct *tsk)
sig->nivcsw += tsk->nivcsw;
sig->inblock += task_io_get_inblock(tsk);
sig->oublock += task_io_get_oublock(tsk);
+#ifdef CONFIG_TASK_XACCT
+ sig->rchar += tsk->rchar;
+ sig->wchar += tsk->wchar;
+ sig->syscr += tsk->syscr;
+ sig->syscw += tsk->syscw;
+#endif /* CONFIG_TASK_XACCT */
+#ifdef CONFIG_TASK_IO_ACCOUNTING
+ sig->ioac.read_bytes += tsk->ioac.read_bytes;
+ sig->ioac.write_bytes += tsk->ioac.write_bytes;
+ sig->ioac.cancelled_write_bytes +=
+ tsk->ioac.cancelled_write_bytes;
+#endif /* CONFIG_TASK_IO_ACCOUNTING */
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
sig = NULL; /* Marker for below. */
}
@@ -1321,6 +1333,21 @@ static int wait_task_zombie(struct task_struct *p, int noreap,
psig->coublock +=
task_io_get_oublock(p) +
sig->oublock + sig->coublock;
+#ifdef CONFIG_TASK_XACCT
+ psig->rchar += p->rchar + sig->rchar;
+ psig->wchar += p->wchar + sig->wchar;
+ psig->syscr += p->syscr + sig->syscr;
+ psig->syscw += p->syscw + sig->syscw;
+#endif /* CONFIG_TASK_XACCT */
+#ifdef CONFIG_TASK_IO_ACCOUNTING
+ psig->ioac.read_bytes +=
+ p->ioac.read_bytes + sig->ioac.read_bytes;
+ psig->ioac.write_bytes +=
+ p->ioac.write_bytes + sig->ioac.write_bytes;
+ psig->ioac.cancelled_write_bytes +=
+ p->ioac.cancelled_write_bytes +
+ sig->ioac.cancelled_write_bytes;
+#endif /* CONFIG_TASK_IO_ACCOUNTING */
spin_unlock_irq(&p->parent->sighand->siglock);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 19908b2..b495758 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -785,6 +785,12 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0;
+#ifdef CONFIG_TASK_XACCT
+ sig->rchar = sig->wchar = sig->syscr = sig->syscw = 0;
+#endif
+#ifdef CONFIG_TASK_IO_ACCOUNTING
+ memset(&sig->ioac, 0, sizeof(sig->ioac));
+#endif
sig->sum_sched_runtime = 0;
INIT_LIST_HEAD(&sig->cpu_timers[0]);
INIT_LIST_HEAD(&sig->cpu_timers[1]);
Yes please. Removing a gazillion kernel threads when xfs is unused is
nice.
You should probably have cc'ed the fsdevel and xfs lists?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Hi Pavel,
I'll re-send with those CC's!
--
Tom Spink
This (short) patch series is another RFC for the patch that introduces on-demand
filesystem initialisation. In addition to the original infrastructure
implementation (with clean-ups), it changes XFS to use this new infrastructure.
I wrote a toy filesystem (testfs) to simulate scheduling/allocation delays and
to torture the mount/unmount cycles. I didn't manage to deadlock the system
in my tests. XFS also works as expected, in that the global threads
are not created until an XFS filesystem is mounted for the first time. When the
last XFS filesystem is unmounted, the threads go away.
Please let me know what you think!
-- Tom
fs/filesystems.c | 2 +
fs/super.c | 47 +++++++++++++++++++++++++++++++++++-
fs/xfs/linux-2.6/xfs_super.c | 55 +++++++++++++++++++++++-------------------
include/linux/fs.h | 3 ++
4 files changed, 81 insertions(+), 26 deletions(-)
This is useful when XFS is compiled into the kernel but never actually
used, as it stops XFS from creating its global threads until they are
needed.
Signed-off-by: Tom Spink <tsp...@gmail.com>
---
--
This is useful for filesystems that share global resources between all
instances of the filesystem, but only need those resources while the
filesystem has users. It lets the filesystem initialise those resources
(kernel threads or caches, say) when the first superblock is created, and
clean them up again when the last superblock is deactivated.
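For illustration, a filesystem would opt in by filling the two new hooks in
its file_system_type, roughly like this (the XFS hook names below are made
up; only .init and .exit are new):

static struct file_system_type xfs_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "xfs",
	.get_sb		= xfs_fs_get_sb,
	.kill_sb	= kill_block_super,
	.fs_flags	= FS_REQUIRES_DEV,
	.init		= xfs_init_globals,	/* before the first sb is created */
	.exit		= xfs_exit_globals,	/* after the last sb is deactivated */
};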
Signed-off-by: Tom Spink <tsp...@gmail.com>
---
--
Occam's Razor...
You've just serialized ->kill_sb() for given fs type (and made sure that
if one gets stuck, _everything_ gets stuck). Moreover, you've serialized
sget() against the same thing (i.e. pretty much each ->get_sb()).
All of that (and a couple of new methods) is done for something that just
plain does not belong in the VFS. It's trivially doable in the filesystem
*and* it's about objects with lifetimes that make sense only to the
filesystem itself.
Hell, just do

	static DEFINE_MUTEX(foo_lock);
	static int count;

	int want_xfs_threads(void)
	{
		int res = 0;

		mutex_lock(&foo_lock);
		if (!count++) {
			res = start_threads();	/* fs-specific: spawn the global threads */
			if (res)
				count--;	/* failed: roll the refcount back */
		}
		mutex_unlock(&foo_lock);
		return res;
	}

	void leave_xfs_threads(void)
	{
		mutex_lock(&foo_lock);
		if (!--count)
			stop_threads();	/* fs-specific: stop the global threads */
		mutex_unlock(&foo_lock);
	}
Call want_xfs_threads() in xfs_fs_fill_super(); call leave_xfs_threads() at
the end of xfs_put_super() and on the failure exit from xfs_fs_fill_super().
End of story... Any other fs that wants such things can do the same.
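A sketch of how that would wire into XFS (the existing function bodies are
elided; __xfs_fill_super is a stand-in for the current code, not a real
function):

static int xfs_fs_fill_super(struct super_block *sb, void *data, int silent)
{
	int error;

	error = want_xfs_threads();
	if (error)
		return error;

	error = __xfs_fill_super(sb, data, silent);	/* placeholder for the existing body */
	if (error)
		leave_xfs_threads();
	return error;
}

static void xfs_fs_put_super(struct super_block *sb)
{
	/* ... existing teardown ... */
	leave_xfs_threads();
}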
What cc's? Still no xfs cc on it. I added it to this reply....
> This (short) patch series is another RFC for the patch that introduces on-demand
> filesystem initialisation. In addition to the original infrastructure
> implementation (with clean-ups), it changes XFS to use this new infrastructure.
>
> I wrote a toy filesystem (testfs) to simulate scheduling/allocation delays and
> to torture the mount/unmount cycles. I didn't manage to deadlock the system
> in my tests. XFS also works as expected, in that the global threads
> are not created until an XFS filesystem is mounted for the first time. When the
> last XFS filesystem is unmounted, the threads go away.
>
> Please let me know what you think!
Why even bother? This is why we have /modular/ kernels - if you're
not using XFS then don't load it and you won't see those pesky
threads. That'll save on a bunch of memory as well because the xfs
module ain't small (>480k on i386)....
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Here is the test report of the benchmarks. Comments, suggestions,
disapprovals, etc. are welcome.
Tests are mainly focused on demonstrating how the io-throttle controller
can be used to improve the i/o performance predictability on shared
block devices (usually any block device shared by many users and
processes, which can be grouped together by cgroup).
The goal of the tests is *not* to compare the overall performance of the
different solutions.
Tests have been performed using an ext3 fs + CFQ i/o scheduler on
/dev/sda.
Some details of the testbed host:
# hdparm -i /dev/sda
/dev/sda:
Model=SAMSUNG HS08XJC , FwRev=GR100-01,
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=34902, SectSize=0, ECCbytes=4
BuffType=DualPortCache, BuffSize=7986kB, MaxMultSect=16, MultSect=?8?
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156301488
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=yes: unknown setting WriteCache=enabled
Drive conforms to: ATA/ATAPI-6 T13 1410D revision 1: ATA/ATAPI-1,2,3,4,5,6,7
* signifies the current active mode
# grep "model name" /proc/cpuinfo
model name : Intel(R) Core(TM)2 CPU U7600 @ 1.20GHz
model name : Intel(R) Core(TM)2 CPU U7600 @ 1.20GHz
# grep MemTotal /proc/meminfo
MemTotal: 2021616 kB
Latest io-throttle patch, raw data of the benchmarks and scripts used to
calculate the results can be found here:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
== Test #0: basic i/o ops ==
The goal of this pre-test is to demonstrate the basic functionality of
the io-throttle controller, in particular that the i/o bandwidth
limiting is applied only to the actual i/o and does not involve
read/write operations in the page cache.
A simple script (dd-band.sh) creates a 64MiB file, re-writes it (in the
page cache), reads the file in O_DIRECT mode and re-reads it from
the page cache.
The i/o limitations are applied during the first write (because it
generates real writes, processed asynchronously by the pdflush kernel
thread outside the context of the process) and during the O_DIRECT read.
Re-writes and re-reads that only affect the page cache are never
limited.
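The script itself is not reproduced here; the four phases presumably boil
down to something like the following (file name, sizes and dd flags are
guesses based on the description above):
$ dd if=/dev/zero of=testfile bs=1M count=64                 # write
$ dd if=/dev/zero of=testfile bs=1M count=64 conv=notrunc    # re-write
$ dd if=testfile of=/dev/null bs=1M iflag=direct             # read (O_DIRECT)
$ dd if=testfile of=/dev/null bs=1M                          # re-read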
Running in cgroup "foo" without any i/o limitation:
$ echo $$ | sudo tee /cgroup/foo/tasks
$ ./dd-band.sh
===
write: 67108864 bytes (67 MB) copied, 0.917311 s, 73.2 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.437337 s, 153 MB/s
read: 67108864 bytes (67 MB) copied, 2.86287 s, 23.4 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0966508 s, 694 MB/s
===
write: 67108864 bytes (67 MB) copied, 0.893401 s, 75.1 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.447029 s, 150 MB/s
read: 67108864 bytes (67 MB) copied, 2.94607 s, 22.8 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.093962 s, 714 MB/s
===
write: 67108864 bytes (67 MB) copied, 0.896389 s, 74.9 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.437567 s, 153 MB/s
read: 67108864 bytes (67 MB) copied, 2.8686 s, 23.4 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.10307 s, 651 MB/s
===
write: 67108864 bytes (67 MB) copied, 0.896151 s, 74.9 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.436411 s, 154 MB/s
read: 67108864 bytes (67 MB) copied, 3.19625 s, 21.0 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0914293 s, 734 MB/s
===
write: 67108864 bytes (67 MB) copied, 0.896905 s, 74.8 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.436574 s, 154 MB/s
read: 67108864 bytes (67 MB) copied, 2.77595 s, 24.2 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0955861 s, 702 MB/s
The i/o bandwidth of cgroup "foo" is limited to 8MB/s:
# echo 8192 > /cgroup/foo/blockio.bandwidth
$ ./dd-band.sh
===
write: 67108864 bytes (67 MB) copied, 8.51182 s, 7.9 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.473478 s, 142 MB/s
read: 67108864 bytes (67 MB) copied, 8.19994 s, 8.2 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0886488 s, 757 MB/s
===
write: 67108864 bytes (67 MB) copied, 8.64248 s, 7.8 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.440868 s, 152 MB/s
read: 67108864 bytes (67 MB) copied, 8.19815 s, 8.2 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0956493 s, 702 MB/s
===
write: 67108864 bytes (67 MB) copied, 8.65532 s, 7.8 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.477425 s, 141 MB/s
read: 67108864 bytes (67 MB) copied, 8.20111 s, 8.2 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0978265 s, 686 MB/s
===
write: 67108864 bytes (67 MB) copied, 8.6522 s, 7.8 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.439753 s, 153 MB/s
read: 67108864 bytes (67 MB) copied, 8.21767 s, 8.2 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0991603 s, 677 MB/s
===
write: 67108864 bytes (67 MB) copied, 8.66015 s, 7.7 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.449204 s, 149 MB/s
read: 67108864 bytes (67 MB) copied, 8.27809 s, 8.1 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0874324 s, 768 MB/s
The i/o bandwidth of cgroup "foo" is increased to 16MB/s:
# echo 16384 > /cgroup/foo/blockio.bandwidth
$ ./dd-band.sh
===
write: 67108864 bytes (67 MB) copied, 4.37976 s, 15.3 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.460999 s, 146 MB/s
read: 67108864 bytes (67 MB) copied, 4.36709 s, 15.4 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0978816 s, 686 MB/s
===
write: 67108864 bytes (67 MB) copied, 4.40452 s, 15.2 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.437004 s, 154 MB/s
read: 67108864 bytes (67 MB) copied, 4.28799 s, 15.7 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0912823 s, 735 MB/s
===
write: 67108864 bytes (67 MB) copied, 4.45284 s, 15.1 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.440571 s, 152 MB/s
read: 67108864 bytes (67 MB) copied, 4.21476 s, 15.9 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0957647 s, 701 MB/s
===
write: 67108864 bytes (67 MB) copied, 4.4364 s, 15.1 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.440202 s, 152 MB/s
read: 67108864 bytes (67 MB) copied, 4.19432 s, 16.0 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0930223 s, 721 MB/s
===
write: 67108864 bytes (67 MB) copied, 4.45086 s, 15.1 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.4487 s, 150 MB/s
read: 67108864 bytes (67 MB) copied, 4.37955 s, 15.3 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0983858 s, 682 MB/s
The i/o bandwidth of cgroup "foo" is reduced to 4MB/s:
# echo 4096 > /cgroup/foo/blockio.bandwidth
$ ./dd-band.sh
===
write: 67108864 bytes (67 MB) copied, 16.783 s, 4.0 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.456829 s, 147 MB/s
read: 67108864 bytes (67 MB) copied, 16.3971 s, 4.1 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.101392 s, 662 MB/s
===
write: 67108864 bytes (67 MB) copied, 16.8167 s, 4.0 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.472562 s, 142 MB/s
read: 67108864 bytes (67 MB) copied, 16.3957 s, 4.1 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0930801 s, 721 MB/s
===
write: 67108864 bytes (67 MB) copied, 16.8834 s, 4.0 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.445081 s, 151 MB/s
read: 67108864 bytes (67 MB) copied, 16.3901 s, 4.1 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0875281 s, 767 MB/s
===
write: 67108864 bytes (67 MB) copied, 16.8157 s, 4.0 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.469043 s, 143 MB/s
read: 67108864 bytes (67 MB) copied, 16.3917 s, 4.1 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.0922022 s, 728 MB/s
===
write: 67108864 bytes (67 MB) copied, 16.9313 s, 4.0 MB/s
re-write: 67108864 bytes (67 MB) copied, 0.510635 s, 131 MB/s
read: 67108864 bytes (67 MB) copied, 16.3983 s, 4.1 MB/s
re-read: 67108864 bytes (67 MB) copied, 0.089941 s, 746 MB/s
The following, more complex benchmarks use iozone to generate different
i/o workloads. In particular, all the results below are
evaluated considering read, re-read, write, re-write, random-read and
random-write operations on files ranging from 8MB up to 32MB and record
sizes ranging from 1K to 8K.
Using more linear i/o workloads (like the simple "dd"s) strongly reduces
errors and deviations with respect to the expected performance results, but
the purpose of the test is to actually verify the effectiveness of the
throttling approach in performance predictability, in particular in the
presence of heterogeneous and complex workloads.
== Test #1: parallel iozone, balanced load (without cgroup limitations) ==
This test is a run of 2 parallel iozone: 1 in cgroup "foo" and the other
in cgroup "bar", without any bandwidth limitation.
cgroup:foo$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
Results
+==================+==================+
| cgroup "foo" | cgroup "bar" |
+-------------------+==================+==================+
| Average i/o rate | 15869.72 | 14404.63 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Standard dev. | 8002.53 | 6147.71 |
+-------------------+------------------+------------------+
| Percentage err.* | 50.43% | 42.68% |
+-------------------+------------------+------------------+
* Percentage error is evaluated as: (standard deviation / mean) * 100
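For instance, for cgroup "foo" in this test: 8002.53 / 15869.72 * 100 ~= 50.43%.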
== Test #2: parallel iozone, unbalanced load (without cgroup limitations) ==
This test is a run of 5 parallel iozone: 1 in cgroup "foo" and 4
iozone(s) in cgroup "bar" without any bandwidth limitation.
cgroup:foo$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
Results
+==================+==================+
| cgroup "foo" | cgroup "bar" |
+-------------------+==================+==================+
| Average i/o rate | 7139.43 | 26161.22 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Standard dev. | 4484.00 | 9098.80 |
+-------------------+------------------+------------------+
| Percentage err. | 62.81% | 34.78% |
+-------------------+------------------+------------------+
== Test #3: parallel iozone, balanced load (limitation: 8MB/s | 8MB/s)==
This test is a run of 2 parallel iozone: 1 in cgroup "foo" with 8MB/s
bandwidth limitation and the other in cgroup "bar" with 8MB/s as well.
# echo 8192 > /cgroup/foo/blockio.bandwidth
# echo 8192 > /cgroup/bar/blockio.bandwidth
cgroup:foo$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
Results
+==================+==================+
| cgroup "foo" | cgroup "bar" |
+-------------------+==================+==================+
| Expected i/o rate | 8192.00 | 8192.00 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Average i/o rate | 8339.79 | 8242.14 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Standard dev. | 734.66 | 896.59 |
+-------------------+------------------+------------------+
| Percentage err.* | 8.81% | 10.88% |
+-------------------+------------------+------------------+
== Test #4: parallel iozone, balanced load (limitation: 8MB/s | 16MB/s)==
This test is a run of 2 parallel iozone: 1 in cgroup "foo" and the other
in cgroup "bar", using different i/o bandwidth limitations: 8MB/s for
cgroup "foo" and 16MB/s for cgroup "bar".
# echo 8192 > /cgroup/foo/blockio.bandwidth
# echo 16384 > /cgroup/bar/blockio.bandwidth
cgroup:foo$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
Results
+==================+==================+
| cgroup "foo" | cgroup "bar" |
+-------------------+==================+==================+
| Expected i/o rate | 8192.00 | 16384.00 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Average i/o rate | 7978.03 | 14117.72 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Standard dev. | 1028.41 | 2479.05 |
+-------------------+------------------+------------------+
| Percentage err. | 12.89% | 17.56% |
+-------------------+------------------+------------------+
== Test #5: parallel iozone, unbalanced load (limitation: 8MB/s | 8MB/s)==
This test is a run of 5 parallel iozone: 1 in cgroup "foo" with 8MB/s
bandwidth limitation and 4 parallel iozone(s) in cgroup "bar" with
8MB/s.
# echo 8192 > /cgroup/foo/blockio.bandwidth
# echo 8192 > /cgroup/bar/blockio.bandwidth
cgroup:foo$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
Results
+==================+==================+
| cgroup "foo" | cgroup "bar" |
+-------------------+==================+==================+
| Expected i/o rate | 8192.00 | 8192.00 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Average i/o rate | 7873.29 | 8831.25 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Standard dev. | 954.51 | 1191.74 |
+-------------------+------------------+------------------+
| Percentage err. | 12.12% | 13.49% |
+-------------------+------------------+------------------+
== Test #6: parallel iozone, unbalanced load (limitation: 4MB/s | 8MB/s)==
This test is a run of 5 parallel iozone: 1 in cgroup "foo" with 4MB/s
bandwidth limitation and 4 parallel iozone(s) in cgroup "bar" with
8MB/s.
# echo 4096 > /cgroup/foo/blockio.bandwidth
# echo 8192 > /cgroup/bar/blockio.bandwidth
cgroup:foo$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
cgroup:bar$ iozone -I -a -i0 -i1 -i2 -i3 -i4 -i5 -y 1024 -q 8192 -n 8m -g 32m &
Results
+==================+==================+
| cgroup "foo" | cgroup "bar" |
+-------------------+==================+==================+
| Expected i/o rate | 4096.00 | 8192.00 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Average i/o rate | 4166.57 | 8770.42 |
| (KB/s) | | |
+-------------------+------------------+------------------+
| Standard dev. | 487.04 | 988.03 |
+-------------------+------------------+------------------+
| Percentage err. | 11.69% | 11.26% |
+-------------------+------------------+------------------+
== Final Results ==
+==================+==================+
| cgroup "foo" | cgroup "bar" |
+---+-------------------+==================+==================+
| 1 | Percentage err. | 50.43% | 42.68% |
+---+-------------------+------------------+------------------+
| 2 | Percentage err. | 62.81% | 34.78% |
+---+-------------------+------------------+------------------+
| 3 | Percentage err. | 8.81% | 10.88% |
+---+-------------------+------------------+------------------+
| 4 | Percentage err. | 12.89% | 17.56% |
+---+-------------------+------------------+------------------+
| 5 | Percentage err. | 12.12% | 13.49% |
+---+-------------------+------------------+------------------+
| 6 | Percentage err. | 11.69% | 11.26% |
+---+-------------------+------------------+------------------+
1) (no limiting) + balanced i/o load
2) (no limiting) + unbalanced i/o load
3) i/o limit: 8MB/s|8MB/s + balanced i/o load
4) i/o limit: 8MB/s|16MB/s + balanced i/o load
5) i/o limit: 8MB/s|8MB/s + unbalanced i/o load
6) i/o limit: 4MB/s|8MB/s + unbalanced i/o load
Conclusion: limiting the i/o bandwidth of cgroups with the io-throttle
controller reduces the percentage error of i/o performance.
This makes it possible to guarantee certain performance levels according
to the defined i/o limitations and to implement more effective i/o shaping
policies (QoS) for user applications using the cgroups subsystem.
Okay! Thanks for reviewing, anyway. :-)
--
Tom Spink
Yeah, absolutely. But if the filesystem is built-in, you can't unload it.
> Cheers,
>
> Dave.
Thanks for taking a look, anyway!
--
Tom Spink
> Documentation of the block device I/O bandwidth controller: description, usage,
> advantages and design.
>
> Signed-off-by: Andrea Righi <righi....@gmail.com>
> ---
> Documentation/controllers/io-throttle.txt | 81 +++++++++++++++++++++++++++++
> 1 files changed, 81 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
> new file mode 100644
> index 0000000..e7ab050
> --- /dev/null
> +++ b/Documentation/controllers/io-throttle.txt
> @@ -0,0 +1,81 @@
> +
> + Block device I/O bandwidth controller
> +
> +1. Description
> +
> +This controller allows to limit the block I/O bandwidth for specific process
> +containers (cgroups) imposing additional delays on I/O requests for those
> +processes that exceed the limits defined in the control group filesystem.
> +
> +Bandwidth limiting rules offers a better control over QoS respect to priority
offer better control over QoS with respect to priority
> +or weighted-based solutions, that only give information about applications'
weight-based solutions that ...
of
> + CFQ, noop) and/or the underlying block devices
> +* The bandwidth limitations are guaranteed both for synchronous and
> + asynchronous operations, even the I/O passing through the page cache or
> + buffers and not only direct I/O (see below for details)
> +
> +4. Design
> +
> +The I/O throttling is performed imposing an explicit timeout, via
> +schedule_timeout_killable() on the processes that exceed the I/O bandwidth
> +dedicated to the cgroup they belong.
they belong to.
> +
> +It just works as expected for read operations: the real I/O activity is reduced
> +synchronously according to the defined limitations.
> +
> +Write operations, instead, are modeled depending on the dirty pages ratio
> +(write throttling in memory), since the writes to the real block device are
> +processed asynchronously by different kernel threads (pdflush). However, the
> +dirty pages ratio is directly proportional to the actual I/O that will be
> +performed on the real block device. So, due to the asynchronous transfers
> +through the page cache, the I/O throttling in memory can be considered a form
> +of anticipatory throttling to the underlying block devices.
> +
> +Multiple re-writes in already dirtied page cache areas are not considered for
> +accounting the I/O activity. This is valid for multiple re-reads of pages
> +already present in the page cache as well.
> +
> +This means that a process that re-writes and/or re-reads multiple times the
> +same blocks in a file (without re-creating it by truncate(), ftrunctate(),
> +creat(), etc.) is affected by the I/O limitations only for the actual I/O
> +performed to (or from) the underlying block devices.
---
~Randy
Thanks for reviewing the documentation, Randy. I'll apply your fixes to
the next version of the patch.
-Andrea
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/Makefile | 2 +
block/blk-io-throttle.c | 405 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 12 ++
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 10 +
5 files changed, 435 insertions(+), 0 deletions(-)
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h
diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..8dec69b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -14,3 +14,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..804df88
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,405 @@
+#include <linux/uaccess.h>
+#include <linux/blk-io-throttle.h>
+
+#define iothrottle_for_each(n, r) \
+ for (n = rb_first(r); n; n = rb_next(n))
+
+struct iothrottle_node {
+ struct rb_node node;
+ dev_t dev;
+ unsigned long iorate;
+ unsigned long req;
+ unsigned long last_request;
+};
+
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ spinlock_t lock; /* protects the accounting of the cgroup i/o stats */
+ struct rb_root tree;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle_node *iothrottle_search_node(
+ const struct iothrottle *iot,
+ dev_t dev)
+{
+ struct rb_node *node = (&iot->tree)->rb_node;
+
+ while (node) {
+ struct iothrottle_node *data = container_of(node,
+ struct iothrottle_node, node);
+ if (dev < data->dev)
+ node = node->rb_left;
+ else if (dev > data->dev)
+ node = node->rb_right;
+ else
+ return data;
+ }
+ return NULL;
+}
+
+static inline int iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *data)
+{
+ struct rb_root *root = &iot->tree;
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ while (*new) {
+ struct iothrottle_node *this = container_of(*new,
+ struct iothrottle_node, node);
+ parent = *new;
+ if (data->dev < this->dev)
+ new = &((*new)->rb_left);
+ else if (data->dev > this->dev)
+ new = &((*new)->rb_right);
+ else
+ return -EINVAL;
+ }
+ rb_link_node(&data->node, parent, new);
+ rb_insert_color(&data->node, root);
+
+ return 0;
+}
+
+static inline void iothrottle_delete_node(struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *data = iothrottle_search_node(iot, dev);
+
+ if (likely(data)) {
+ rb_erase(&data->node, &iot->tree);
+ kfree(data);
+ }
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+ struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct iothrottle *iot;
+
+ iot = kmalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+
+ spin_lock_init(&iot->lock);
+ iot->tree = RB_ROOT;
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct iothrottle_node *data;
+ struct rb_node *next;
+ struct iothrottle *iot = cgroup_to_iothrottle(cont);
+
+ next = rb_first(&iot->tree);
+ while (next) {
+ data = rb_entry(next, struct iothrottle_node, node);
+ next = rb_next(&data->node);
+ rb_erase(&data->node, &iot->tree);
+ kfree(data);
+ }
+ kfree(iot);
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont,
+ struct cftype *cft,
+ struct file *file,
+ char __user *userbuf,
+ size_t nbytes,
+ loff_t *ppos)
+{
+ struct iothrottle *iot;
+ char *buffer, *s;
+ struct rb_node *n;
+ ssize_t ret;
+
+ buffer = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (unlikely(!buffer))
+ return -ENOMEM;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+
+ s = buffer;
+ spin_lock_irq(&iot->lock);
+ iothrottle_for_each(n, &iot->tree) {
+ struct iothrottle_node *node =
+ rb_entry(n, struct iothrottle_node, node);
+ unsigned long delta = (long)jiffies - (long)node->last_request;
+
+ BUG_ON(!node->dev);
+ s += snprintf(s, nbytes - (s - buffer),
+ "=== device (%u,%u) ===\n"
+ "bandwidth-max: %lu KiB/sec\n"
+ " requested: %lu bytes\n"
+ " last request: %lu jiffies\n"
+ " delta: %lu jiffies\n",
+ MAJOR(node->dev), MINOR(node->dev),
+ node->iorate, node->req,
+ node->last_request, delta);
+ }
+ spin_unlock_irq(&iot->lock);
+ buffer[nbytes] = '\0';
+
+ ret = simple_read_from_buffer(userbuf, nbytes,
+ ppos, buffer, (s - buffer));
+out:
+ cgroup_unlock();
+ kfree(buffer);
+ return ret;
+}
+
+static inline dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t ret;
+
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+
+ BUG_ON(!bdev->bd_inode);
+ ret = bdev->bd_inode->i_rdev;
+ bdput(bdev);
+
+ return ret;
+}
+
+static inline int iothrottle_parse_args(char *buf, size_t nbytes,
+ dev_t *dev, unsigned long *val)
+{
+ char *p;
+
+ p = memchr(buf, ':', nbytes);
+ if (!p)
+ return -EINVAL;
+ *p++ = '\0';
+
+ *dev = devname2dev_t(buf);
+ if (!*dev)
+ return -ENOTBLK;
+
+ return strict_strtoul(p, 10, val);
+}
+
+static ssize_t iothrottle_write(struct cgroup *cont,
+ struct cftype *cft,
+ struct file *file,
+ const char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *node, *tmpn = NULL;
+ char *buffer, *tmpp;
+ dev_t dev;
+ unsigned long val;
+ int ret;
+
+ if (unlikely(!nbytes))
+ return -EINVAL;
+
+ buffer = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (unlikely(!buffer))
+ return -ENOMEM;
+
+ if (copy_from_user(buffer, userbuf, nbytes)) {
+ ret = -EFAULT;
+ goto out1;
+ }
+
+ buffer[nbytes] = '\0';
+ tmpp = strstrip(buffer);
+
+ ret = iothrottle_parse_args(tmpp, nbytes, &dev, &val);
+ if (ret)
+ goto out1;
+
+ /*
+ * Pre-allocate a temporary node structure outside locks to use
+ * GFP_KERNEL, it will be kfree()ed later if unused.
+ */
+ tmpn = kmalloc(sizeof(*tmpn), GFP_KERNEL);
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ ret = -ENODEV;
+ goto out2;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+
+ spin_lock_irq(&iot->lock);
+ if (!val) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, dev);
+ ret = nbytes;
+ goto out3;
+ }
+ node = iothrottle_search_node(iot, dev);
+ if (node) {
+ /* Update a block device limiting rule */
+ node->iorate = val;
+ node->req = 0;
+ node->last_request = jiffies;
+ ret = nbytes;
+ goto out3;
+ }
+ /* Add a new block device limiting rule */
+ if (unlikely(!tmpn)) {
+ ret = -ENOMEM;
+ goto out3;
+ }
+ node = tmpn;
+ tmpn = NULL;
+
+ node->iorate = val;
+ node->req = 0;
+ node->last_request = jiffies;
+ node->dev = dev;
+ ret = iothrottle_insert_node(iot, node);
+ BUG_ON(ret);
+ ret = nbytes;
+out3:
+ spin_unlock_irq(&iot->lock);
+out2:
+ cgroup_unlock();
+ if (tmpn)
+ kfree(tmpn);
+out1:
+ kfree(buffer);
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth",
+ .read = iothrottle_read,
+ .write = iothrottle_write,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+};
+
+static inline int __cant_sleep(void)
+{
+ return in_atomic() || in_interrupt() || irqs_disabled();
+}
+
+void cgroup_io_account(struct block_device *bdev, size_t bytes)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *node;
+ unsigned long delta, t;
+ long sleep;
+
+ if (unlikely(!bdev))
+ return;
+
+ BUG_ON(!bdev->bd_inode);
+
+ iot = task_to_iothrottle(current);
+ if (unlikely(!iot))
+ return;
+
+ spin_lock_irq(&iot->lock);
+
+ node = iothrottle_search_node(iot, bdev->bd_inode->i_rdev);
+ if (!node || !node->iorate)
+ goto out;
+
+ /* Account the I/O activity */
+ node->req += bytes;
+
+ /* Evaluate if we need to throttle the current process */
+ delta = (long)jiffies - (long)node->last_request;
+ if (!delta)
+ goto out;
+
+ t = msecs_to_jiffies(node->req / node->iorate);
+ if (!t)
+ goto out;
+
+ sleep = t - delta;
+ if (unlikely(sleep > 0)) {
+ spin_unlock_irq(&iot->lock);
+ if (__cant_sleep())
+ return;
+ pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
+ current, current->comm, sleep);
+ schedule_timeout_killable(sleep);
+ return;
+ }
+
+ /* Reset I/O accounting */
+ node->req = 0;
+ node->last_request = jiffies;
+out:
+ spin_unlock_irq(&iot->lock);
+}
+EXPORT_SYMBOL(cgroup_io_account);
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..cff0c13
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,12 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void cgroup_io_account(struct block_device *bdev, size_t bytes);
+#else
+static inline void cgroup_io_account(struct block_device *bdev, size_t bytes)
+{
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif /* BLK_IO_THROTTLE_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e287745..0caf3c2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -48,3 +48,9 @@ SUBSYS(devices)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index 6199d11..3117d99 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -306,6 +306,16 @@ config CGROUP_DEVICE
Provides a cgroup implementing whitelists for devices which
a process in the cgroup can mknod or open.
+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+ depends on CGROUPS && EXPERIMENTAL
+ help
+ This makes it possible to limit the maximum I/O bandwidth for
+ specific cgroup(s).
+ See Documentation/controllers/io-throttle.txt for more information.
+
+ If unsure, say N.
+
config CPUSETS
bool "Cpuset support"
depends on SMP && CGROUPS
--
1.5.4.3
--
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
Documentation/controllers/io-throttle.txt | 150 +++++++++++++++++++++++++++++
1 files changed, 150 insertions(+), 0 deletions(-)
create mode 100644 Documentation/controllers/io-throttle.txt
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..5373fa8
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,150 @@
+
+ Block device I/O bandwidth controller
+
+1. Description
+
+This controller makes it possible to limit the I/O bandwidth of specific
+block devices for specific process containers (cgroups), imposing additional
+delays on I/O requests for those processes that exceed the limits defined in
+the control group filesystem.
+
+Bandwidth limiting rules offer better control over QoS with respect to priority
+or weight-based solutions that only give information about applications'
+relative performance requirements.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and QoS of the different control groups sharing the same block
+devices.
+
+NOTE: if you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+2. User Interface
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+The same file can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The syntax is the following:
+# /bin/echo DEVICE:BANDWIDTH > CGROUP/blockio.bandwidth
+
+- DEVICE is the name of the device the limiting rule is applied to,
+- BANDWIDTH is the maximum I/O bandwidth on DEVICE allowed by CGROUP,
+- CGROUP is the name of the limited process container.
+
+Examples:
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda1 for the cgroup "foo":
+ # /bin/echo /dev/sda1:1024 > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda1 (blockio.bandwidth is expressed in
+ KiB/s).
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo":
+ # /bin/echo /dev/sdb:8192 > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda1 and 8MiB/s on /dev/sdb.
+ NOTE: each partition needs its own limitation rule! In this case, for
+ example, there's no limitation on /dev/sdb1 for cgroup "foo".
+
+* Show the I/O limits defined for cgroup "foo":
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ === device (8,1) ===
+ bandwidth-max: 1024 KiB/sec
+ requested: 0 bytes
+ last request: 4294933948 jiffies
+ delta: 2660 jiffies
+ === device (8,5) ===
+ bandwidth-max: 8192 KiB/sec
+ requested: 0 bytes
+ last request: 4294935736 jiffies
+ delta: 872 jiffies
+
+ Devices are reported using (major, minor) numbers when reading
+ blockio.bandwidth.
+
+ The corresponding device names can be retrieved in /proc/diskstats (or in
+ other places as well).
+
+ For example to find the name of the device (8,5):
+ # sed -ne 's/^ \+8 \+5 \([^ ]\+\).*/\1/p' /proc/diskstats
+ sda5
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 8MiB/s:
+ # /bin/echo /dev/sda1:8192 > /mnt/cgroup/foo/blockio.bandwidth
+
+* Remove limiting rule on /dev/sda1 for cgroup "foo":
+ # /bin/echo /dev/sda1:0 > /mnt/cgroup/foo/blockio.bandwidth
+
+3. Advantages of providing this feature
+
+* Allow QoS for block device I/O among different cgroups
+* Improve I/O performance predictability on block devices shared between
+ different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+ deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are guaranteed both for synchronous and
+ asynchronous operations, even the I/O passing through the page cache or
+ buffers and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+ adjust the I/O workload of different process containers at run-time,
+ according to the particular users' requirements and applications' performance
+ constraints
+* It is even possible to implement event-based performance throttling
+ mechanisms; for example the same user-space application could actively
+ throttle the I/O bandwidth to reduce power consumption when the battery of a
+ mobile device is running low (power throttling) or when the temperature of a
+ hardware component is too high (thermal throttling)
+
+4. Design
+
+The I/O throttling is performed imposing an explicit timeout, via
+schedule_timeout_killable() on the processes that exceed the I/O bandwidth
+dedicated to the cgroup they belong to.
+
+It just works as expected for read operations: the real I/O activity is reduced
+synchronously according to the defined limitations.
+
+Write operations, instead, are modeled depending on the dirty pages ratio
+(write throttling in memory), since the writes to the real block devices are
+processed asynchronously by different kernel threads (pdflush). However, the
+dirty pages ratio is directly proportional to the actual I/O that will be
+performed on the real block device. So, due to the asynchronous transfers
+through the page cache, the I/O throttling in memory can be considered a form
+of anticipatory throttling to the underlying block devices.
+
+Multiple re-writes in already dirtied page cache areas are not considered for
+accounting the I/O activity. This is valid for multiple re-reads of pages
+already present in the page cache as well.
+
+This means that a process that re-writes and/or re-reads the same blocks in
+a file multiple times (without re-creating it by truncate(), ftruncate(),
+creat(), etc.) is affected by the I/O limitations only for the actual I/O
+performed to (or from) the underlying block devices.
+
+Multiple rules for different block devices are stored in an rbtree, using the
+dev_t number of each block device as the key. This keeps the controller
+overhead low on systems with many LUNs and different per-LUN I/O bandwidth
+rules (exploiting the O(log n) worst-case complexity of search operations in
+the rbtree structure).
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (i.e. a USB device) the limiting rules
+associated with that device persist, and they are applied again if a new
+device is plugged into the system and uses the same major and minor numbers.
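To give a feeling for the delays involved in the design above (numbers made
up, HZ=1000 assumed): a cgroup limited to 4096 KiB/s that has requested
req = 8388608 bytes on a device gets, in cgroup_io_account(),
t = msecs_to_jiffies(8388608 / 4096) = 2048 jiffies (the code treats
req/iorate, bytes over KiB/s, as milliseconds, i.e. it approximates 1 KiB as
1000 bytes). If only delta = 548 jiffies have passed since last_request, the
task is put to sleep for t - delta = 1500 jiffies, i.e. about 1.5 seconds.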
Compared to other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate i/o activity in the system.
The direct bandwidth limiting method has the advantage of improving
performance predictability, at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
Detailed information about the design, its goals and usage is given in the
documentation.
Tested against latest git (2.6.26-rc5).
The all-in-one patch can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
Previous version and test report can be found here:
http://lkml.org/lkml/2008/5/24/97
Changelog since v1:
* support multiple per-block device i/o limiting rules
* minor optimizations in cgroup_io_account()
* updated the documentation and fixed some typos (thanks to Randy Dunlap for
reviewing)
-Andrea
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/blk-core.c | 2 ++
fs/buffer.c | 11 +++++++++++
fs/direct-io.c | 3 +++
mm/page-writeback.c | 10 ++++++++++
mm/readahead.c | 5 +++++
5 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 1905aab..bf56046 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/blktrace_api.h>
@@ -1482,6 +1483,7 @@ void submit_bio(int rw, struct bio *bio)
count_vm_events(PGPGOUT, count);
} else {
task_io_account_read(bio->bi_size);
+ cgroup_io_account(bio->bi_bdev, bio->bi_size);
count_vm_events(PGPGIN, count);
}
diff --git a/fs/buffer.c b/fs/buffer.c
index a073f3f..700316d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -35,6 +35,7 @@
#include <linux/suspend.h>
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
@@ -700,6 +701,9 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
static int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ struct block_device *bdev = NULL;
+ size_t cgroup_io_acct = 0;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);
@@ -711,16 +715,23 @@ static int __set_page_dirty(struct page *page,
WARN_ON_ONCE(warn && !PageUptodate(page));
if (mapping_cap_account_dirty(mapping)) {
+ bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ if (cgroup_io_acct)
+ cgroup_io_account(bdev, cgroup_io_acct);
return 1;
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9e81add..9e5d783 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/blk-io-throttle.h>
#include <asm/atomic.h>
/*
@@ -666,6 +667,8 @@ submit_page_section(struct dio *dio, struct page *page,
/*
* Read accounting is performed in submit_bio()
*/
+ struct block_device *bdev = dio->bio ? dio->bio->bi_bdev : NULL;
+ cgroup_io_account(bdev, len);
task_io_account_write(len);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 789b6ad..aa74fba 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1077,6 +1078,8 @@ int __set_page_dirty_nobuffers(struct page *page)
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;
+ struct block_device *bdev = NULL;
+ size_t cgroup_io_acct = 0;
if (!mapping)
return 1;
@@ -1087,10 +1090,15 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
if (mapping_cap_account_dirty(mapping)) {
+ bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -1100,6 +1108,8 @@ int __set_page_dirty_nobuffers(struct page *page)
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
+ if (cgroup_io_acct)
+ cgroup_io_account(bdev, cgroup_io_acct);
return 1;
}
return 0;
diff --git a/mm/readahead.c b/mm/readahead.c
index d8723a5..9fdfae8 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
@@ -58,6 +59,9 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
+ struct block_device *bdev =
+ (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
int ret = 0;
while (!list_empty(pages)) {
@@ -76,6 +80,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_account(bdev, PAGE_CACHE_SIZE);
}
return ret;
}
--
1.5.4.3
Patchset against the latest Linus git tree.
This patchset makes it possible to set different per-cgroup overcommit rules
and, according to them, to return a memory allocation failure (ENOMEM) to the
applications, instead of always triggering the OOM killer via
mem_cgroup_out_of_memory() when cgroup memory limits are exceeded.
Default overcommit settings are taken from the vm.overcommit_memory and
vm.overcommit_ratio sysctl values. Child cgroups initially inherit the
parent's VM overcommit settings.
Cgroup overcommit settings can be overridden using the
memory.overcommit_memory and memory.overcommit_ratio files under the cgroup
filesystem.
For example:
1. Initialize a cgroup with 50MB memory limit:
# mount -t cgroup none /cgroups -o memory
# mkdir /cgroups/0
# /bin/echo $$ > /cgroups/0/tasks
# /bin/echo 50M > /cgroups/0/memory.limit_in_bytes
2. Use the "never overcommit" policy with 50% ratio:
# /bin/echo 2 > /cgroups/0/memory.overcommit_memory
# /bin/echo 50 > /cgroups/0/memory.overcommit_ratio
Assuming we have no swap space, cgroup 0 can allocate up to 25MB of virtual
memory. If that limit is exceeded, all further allocation attempts made by
userspace applications will receive -ENOMEM.
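(That is, CommitLimit = 50MB * overcommit_ratio / 100 = 25MB = 25600 kB,
which is exactly what memory.overcommit_as reports below.)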
4. Show committed VM statistics:
# cat /cgroups/0/memory.overcommit_as
CommitLimit: 25600 kB
Committed_AS: 9844 kB
5. Use "always overcommmit":
# /bin/echo 1 > /cgroups/0/memory.overcommit_memory
This is very similar to the default memory controller configuration:
overcommit is allowed, but when there's no more available memory the OOM
killer is invoked.
TODO:
- shared memory is not taken into account (i.e. files in tmpfs)
-Andrea
The accounting functions vm_acct_memory() and vm_unacct_memory() are
rewritten as well, introducing the concept of per-cgroup committed VM
accounting.
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
include/linux/mman.h | 148 ++++++++++++++++++++++++++++++++++++++++++++++++--
mm/memcontrol.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++-
mm/mmap.c | 85 ++++-------------------------
mm/nommu.c | 84 ++++-------------------------
mm/swap.c | 3 +-
5 files changed, 306 insertions(+), 153 deletions(-)
diff --git a/include/linux/mman.h b/include/linux/mman.h
index dab8892..37f695f 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -12,25 +12,165 @@
#ifdef __KERNEL__
#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/mmzone.h>
+#include <linux/mm_types.h>
+#include <linux/hugetlb.h>
+#include <linux/swap.h>
#include <asm/atomic.h>
extern int sysctl_overcommit_memory;
extern int sysctl_overcommit_ratio;
extern atomic_long_t vm_committed_space;
+extern unsigned long totalreserve_pages;
+extern unsigned long totalram_pages;
+
+struct vm_acct_values {
+ int overcommit_memory;
+ int overcommit_ratio;
+ atomic_long_t vm_committed_space;
+};
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern void vm_acct_get_config(const struct mm_struct *mm,
+ struct vm_acct_values *v);
+extern void mem_cgroup_vm_acct_memory(struct mm_struct *mm, long pages);
+#else
+static inline void vm_acct_get_config(const struct mm_struct *mm,
+ struct vm_acct_values *v)
+{
+ v->overcommit_memory = sysctl_overcommit_memory;
+ v->overcommit_ratio = sysctl_overcommit_ratio;
+}
+static inline void mem_cgroup_vm_acct_memory(struct mm_struct *mm, long pages)
+{
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+static inline int __vm_enough_memory_guess(struct mm_struct *mm,
+ long pages,
+ int cap_sys_admin)
+{
+ unsigned long n, free;
+
+ free = global_page_state(NR_FILE_PAGES);
+ free += nr_swap_pages;
+
+ /*
+ * Any slabs which are created with the
+ * SLAB_RECLAIM_ACCOUNT flag claim to have contents
+ * which are reclaimable, under pressure. The dentry
+ * cache and most inode caches should fall into this
+ */
+ free += global_page_state(NR_SLAB_RECLAIMABLE);
+
+ /*
+ * Leave the last 3% for root
+ */
+ if (!cap_sys_admin)
+ free -= free / 32;
+
+ if (free > pages)
+ return 0;
+
+ /*
+ * nr_free_pages() is very expensive on large systems,
+ * only call if we're about to fail.
+ */
+ n = nr_free_pages();
+
+ /*
+ * Leave reserved pages. The pages are not for anonymous pages.
+ */
+ if (n <= totalreserve_pages)
+ return -ENOMEM;
+ else
+ n -= totalreserve_pages;
+
+ /*
+ * Leave the last 3% for root
+ */
+ if (!cap_sys_admin)
+ n -= n / 32;
+ free += n;
+
+ if (free > pages)
+ return 0;
+
+ return -ENOMEM;
+}
+
+static inline int __vm_enough_memory_never(struct mm_struct *mm,
+ long pages,
+ int cap_sys_admin)
+{
+ unsigned long allowed;
+ struct vm_acct_values v;
+
+ vm_acct_get_config(mm, &v);
+
+ allowed = (totalram_pages - hugetlb_total_pages())
+ * v.overcommit_ratio / 100;
+ /*
+ * Leave the last 3% for root
+ */
+ if (!cap_sys_admin)
+ allowed -= allowed / 32;
+ allowed += total_swap_pages;
+
+ /* Don't let a single process grow too big:
+ leave 3% of the size of this process for other processes */
+ allowed -= mm->total_vm / 32;
+
+ /*
+ * cast `allowed' as a signed long because vm_committed_space
+ * sometimes has a negative value
+ */
+ if (atomic_long_read(&vm_committed_space) < (long)allowed)
+ return 0;
+
+ return -ENOMEM;
+}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern int mem_cgroup_vm_enough_memory_guess(struct mm_struct *mm,
+ long pages,
+ int cap_sys_admin);
+
+extern int mem_cgroup_vm_enough_memory_never(struct mm_struct *mm,
+ long pages,
+ int cap_sys_admin);
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+static inline int mem_cgroup_vm_enough_memory_guess(struct mm_struct *mm,
+ long pages,
+ int cap_sys_admin)
+{
+ return __vm_enough_memory_guess(mm, pages, cap_sys_admin);
+}
+
+static inline int mem_cgroup_vm_enough_memory_never(struct mm_struct *mm,
+ long pages,
+ int cap_sys_admin)
+{
+ return __vm_enough_memory_never(mm, pages, cap_sys_admin);
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
#ifdef CONFIG_SMP
-extern void vm_acct_memory(long pages);
+extern void vm_acct_memory(struct mm_struct *mm, long pages);
#else
-static inline void vm_acct_memory(long pages)
+static inline void vm_acct_memory(struct mm_struct *mm, long pages)
{
atomic_long_add(pages, &vm_committed_space);
+ mem_cgroup_vm_acct_memory(mm, pages);
}
#endif
-static inline void vm_unacct_memory(long pages)
+static inline void vm_unacct_memory(struct mm_struct *mm, long pages)
{
- vm_acct_memory(-pages);
+ vm_acct_memory(mm, -pages);
}
/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e46451e..4100e24 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -21,6 +21,7 @@
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
#include <linux/mm.h>
+#include <linux/mman.h>
#include <linux/smp.h>
#include <linux/page-flags.h>
#include <linux/backing-dev.h>
@@ -141,6 +142,10 @@ struct mem_cgroup {
* statistics.
*/
struct mem_cgroup_stat stat;
+ /*
+ * VM overcommit settings
+ */
+ struct vm_acct_values vmacct;
};
static struct mem_cgroup init_mem_cgroup;
@@ -187,6 +192,130 @@ enum charge_type {
MEM_CGROUP_CHARGE_TYPE_MAPPED,
};
+void vm_acct_get_config(const struct mm_struct *mm, struct vm_acct_values *v)
+{
+ struct mem_cgroup *mem;
+ long tmp;
+
+ BUG_ON(!mm);
+
+ rcu_read_lock();
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ v->overcommit_memory = mem->vmacct.overcommit_memory;
+ v->overcommit_ratio = mem->vmacct.overcommit_ratio;
+ tmp = atomic_long_read(&mem->vmacct.vm_committed_space);
+ atomic_long_set(&v->vm_committed_space, tmp);
+ rcu_read_unlock();
+}
+
+void mem_cgroup_vm_acct_memory(struct mm_struct *mm, long pages)
+{
+ struct mem_cgroup *mem;
+ struct task_struct *tsk;
+
+ if (!mm)
+ return;
+
+ rcu_read_lock();
+ tsk = rcu_dereference(mm->owner);
+ mem = mem_cgroup_from_task(tsk);
+ /* Update memory cgroup statistic */
+ atomic_long_add(pages, &mem->vmacct.vm_committed_space);
+ /* Update task statistic */
+ atomic_long_add(pages, &tsk->vm_committed_space);
+ rcu_read_unlock();
+}
+
+int mem_cgroup_vm_enough_memory_guess(struct mm_struct *mm,
+ long pages,
+ int cap_sys_admin)
+{
+ unsigned long n, free;
+ struct mem_cgroup *mem;
+ long total, rss, cache;
+
+ rcu_read_lock();
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ total = (long) (mem->res.limit >> PAGE_SHIFT) + 1L;
+ if (total > (totalram_pages - hugetlb_total_pages())) {
+ rcu_read_unlock();
+ return __vm_enough_memory_guess(mm, pages, cap_sys_admin);
+ }
+ cache = (long)mem_cgroup_read_stat(&mem->stat,
+ MEM_CGROUP_STAT_CACHE);
+ rss = (long)mem_cgroup_read_stat(&mem->stat,
+ MEM_CGROUP_STAT_RSS);
+ rcu_read_unlock();
+
+ free = cache;
+ free += nr_swap_pages;
+
+ /*
+ * Leave the last 3% for root
+ */
+ if (!cap_sys_admin)
+ free -= free / 32;
+
+ if (free > pages)
+ return 0;
+
+ n = total - rss;
+
+ /*
+ * Leave the last 3% for root
+ */
+ if (!cap_sys_admin)
+ n -= n / 32;
+ free += n;
+
+ if (free > pages)
+ return 0;
+
+ return -ENOMEM;
+}
+
+int mem_cgroup_vm_enough_memory_never(struct mm_struct *mm,
+ long pages,
+ int cap_sys_admin)
+{
+ unsigned long allowed;
+ struct vm_acct_values v;
+ struct mem_cgroup *mem;
+ long total;
+
+ rcu_read_lock();
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ total = (long)(mem->res.limit >> PAGE_SHIFT) + 1L;
+ if (total > (totalram_pages - hugetlb_total_pages())) {
+ rcu_read_unlock();
+ return __vm_enough_memory_never(mm, pages, cap_sys_admin);
+ }
+ rcu_read_unlock();
+
+ vm_acct_get_config(mm, &v);
+
+ allowed = total * v.overcommit_ratio / 100;
+ /*
+ * Leave the last 3% for root
+ */
+ if (!cap_sys_admin)
+ allowed -= allowed / 32;
+ allowed += total_swap_pages;
+
+ /* Don't let a single process grow too big:
+ leave 3% of the size of this process for other processes */
+ allowed -= mm->total_vm / 32;
+
+ /*
+ * cast `allowed' as a signed long because vm_committed_space
+ * sometimes has a negative value
+ */
+ if (atomic_long_read(&v.vm_committed_space) < (long)allowed)
+ return 0;
+
+ return -ENOMEM;
+}
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -1022,17 +1151,25 @@ static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
struct mem_cgroup *mem;
+ struct cgroup *p = cont->parent;
int node;
- if (unlikely((cont->parent) == NULL)) {
+ if (unlikely((p) == NULL)) {
mem = &init_mem_cgroup;
page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
+ mem->vmacct.overcommit_memory = sysctl_overcommit_memory;
+ mem->vmacct.overcommit_ratio = sysctl_overcommit_ratio;
} else {
mem = mem_cgroup_alloc();
if (!mem)
return ERR_PTR(-ENOMEM);
+ mem->vmacct.overcommit_memory =
+ mem_cgroup_from_cont(p)->vmacct.overcommit_memory;
+ mem->vmacct.overcommit_ratio =
+ mem_cgroup_from_cont(p)->vmacct.overcommit_ratio;
}
+ atomic_long_set(&mem->vmacct.vm_committed_space, 0);
res_counter_init(&mem->res);
for_each_node_state(node, N_POSSIBLE)
diff --git a/mm/mmap.c b/mm/mmap.c
index 3354fdd..256599e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -25,6 +25,7 @@
#include <linux/module.h>
#include <linux/mount.h>
#include <linux/mempolicy.h>
+#include <linux/memcontrol.h>
#include <linux/rmap.h>
#include <asm/uaccess.h>
@@ -100,87 +101,23 @@ atomic_long_t vm_committed_space = ATOMIC_LONG_INIT(0);
*/
int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
{
- unsigned long free, allowed;
+ struct vm_acct_values v;
- vm_acct_memory(pages);
+ vm_acct_get_config(mm, &v);
+ vm_acct_memory(mm, pages);
/*
* Sometimes we want to use more memory than we have
*/
- if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
+ if (v.overcommit_memory == OVERCOMMIT_ALWAYS)
return 0;
-
- if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
- unsigned long n;
-
- free = global_page_state(NR_FILE_PAGES);
- free += nr_swap_pages;
-
- /*
- * Any slabs which are created with the
- * SLAB_RECLAIM_ACCOUNT flag claim to have contents
- * which are reclaimable, under pressure. The dentry
- * cache and most inode caches should fall into this
- */
- free += global_page_state(NR_SLAB_RECLAIMABLE);
-
- /*
- * Leave the last 3% for root
- */
- if (!cap_sys_admin)
- free -= free / 32;
-
- if (free > pages)
- return 0;
-
- /*
- * nr_free_pages() is very expensive on large systems,
- * only call if we're about to fail.
- */
- n = nr_free_pages();
-
- /*
- * Leave reserved pages. The pages are not for anonymous pages.
- */
- if (n <= totalreserve_pages)
- goto error;
- else
- n -= totalreserve_pages;
-
- /*
- * Leave the last 3% for root
- */
- if (!cap_sys_admin)
- n -= n / 32;
- free += n;
-
- if (free > pages)
- return 0;
-
- goto error;
- }
-
- allowed = (totalram_pages - hugetlb_total_pages())
- * sysctl_overcommit_ratio / 100;
- /*
- * Leave the last 3% for root
- */
- if (!cap_sys_admin)
- allowed -= allowed / 32;
- allowed += total_swap_pages;
-
- /* Don't let a single process grow too big:
- leave 3% of the size of this process for other processes */
- allowed -= mm->total_vm / 32;
-
- /*
- * cast `allowed' as a signed long because vm_committed_space
- * sometimes has a negative value
- */
- if (atomic_long_read(&vm_committed_space) < (long)allowed)
+ if ((v.overcommit_memory == OVERCOMMIT_GUESS) &&
+ (!mem_cgroup_vm_enough_memory_guess(mm, pages, cap_sys_admin)))
+ return 0;
+ else if (!mem_cgroup_vm_enough_memory_never(mm, pages, cap_sys_admin))
return 0;
-error:
- vm_unacct_memory(pages);
+
+ vm_unacct_memory(mm, pages);
return -ENOMEM;
}
diff --git a/mm/nommu.c b/mm/nommu.c
index 3abd084..b194a44 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -20,6 +20,7 @@
#include <linux/file.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/memcontrol.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/ptrace.h>
@@ -1356,86 +1357,23 @@ EXPORT_SYMBOL(get_unmapped_area);
*/
int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
{
- unsigned long free, allowed;
+ struct vm_acct_values v;
- vm_acct_memory(pages);
+ vm_acct_get_config(mm, &v);
+ vm_acct_memory(mm, pages);
/*
* Sometimes we want to use more memory than we have
*/
- if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
+ if (v.overcommit_memory == OVERCOMMIT_ALWAYS)
return 0;
-
- if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
- unsigned long n;
-
- free = global_page_state(NR_FILE_PAGES);
- free += nr_swap_pages;
-
- /*
- * Any slabs which are created with the
- * SLAB_RECLAIM_ACCOUNT flag claim to have contents
- * which are reclaimable, under pressure. The dentry
- * cache and most inode caches should fall into this
- */
- free += global_page_state(NR_SLAB_RECLAIMABLE);
-
- /*
- * Leave the last 3% for root
- */
- if (!cap_sys_admin)
- free -= free / 32;
-
- if (free > pages)
- return 0;
-
- /*
- * nr_free_pages() is very expensive on large systems,
- * only call if we're about to fail.
- */
- n = nr_free_pages();
-
- /*
- * Leave reserved pages. The pages are not for anonymous pages.
- */
- if (n <= totalreserve_pages)
- goto error;
- else
- n -= totalreserve_pages;
-
- /*
- * Leave the last 3% for root
- */
- if (!cap_sys_admin)
- n -= n / 32;
- free += n;
-
- if (free > pages)
- return 0;
-
- goto error;
- }
-
- allowed = totalram_pages * sysctl_overcommit_ratio / 100;
- /*
- * Leave the last 3% for root
- */
- if (!cap_sys_admin)
- allowed -= allowed / 32;
- allowed += total_swap_pages;
-
- /* Don't let a single process grow too big:
- leave 3% of the size of this process for other processes */
- allowed -= current->mm->total_vm / 32;
-
- /*
- * cast `allowed' as a signed long because vm_committed_space
- * sometimes has a negative value
- */
- if (atomic_long_read(&vm_committed_space) < (long)allowed)
+ if ((v.overcommit_memory == OVERCOMMIT_GUESS) &&
+ (!mem_cgroup_vm_enough_memory_guess(mm, pages, cap_sys_admin)))
+ return 0;
+ else if (!mem_cgroup_vm_enough_memory_never(mm, pages, cap_sys_admin))
return 0;
-error:
- vm_unacct_memory(pages);
+
+ vm_unacct_memory(mm, pages);
return -ENOMEM;
}
diff --git a/mm/swap.c b/mm/swap.c
index 45c9f25..f7676db 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -495,7 +495,7 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
static DEFINE_PER_CPU(long, committed_space) = 0;
-void vm_acct_memory(long pages)
+void vm_acct_memory(struct mm_struct *mm, long pages)
{
long *local;
@@ -507,6 +507,7 @@ void vm_acct_memory(long pages)
*local = 0;
}
preempt_enable();
+ mem_cgroup_vm_acct_memory(mm, pages);
}
#ifdef CONFIG_HOTPLUG_CPU
--
1.5.4.3
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
include/linux/sched.h | 3 +++
kernel/fork.c | 3 +++
mm/memcontrol.c | 6 ++++++
3 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae0be3c..8b458df 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1277,6 +1277,9 @@ struct task_struct {
/* cg_list protected by css_set_lock and tsk->alloc_lock */
struct list_head cg_list;
#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ atomic_long_t vm_committed_space;
+#endif
#ifdef CONFIG_FUTEX
struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
diff --git a/kernel/fork.c b/kernel/fork.c
index eaffa56..9fafbdb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -219,6 +219,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
atomic_set(&tsk->fs_excl, 0);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ atomic_long_set(&tsk->vm_committed_space, 0);
+#endif
#ifdef CONFIG_BLK_DEV_IO_TRACE
tsk->btrace_seq = 0;
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3e34e9..bc4923e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1334,6 +1334,7 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
{
struct mm_struct *mm;
struct mem_cgroup *mem, *old_mem;
+ long committed;
if (mem_cgroup_subsys.disabled)
return;
@@ -1355,6 +1356,11 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
if (!thread_group_leader(p))
goto out;
+ preempt_disable();
+ committed = atomic_long_read(&p->vm_committed_space);
+ atomic_long_sub(committed, &old_mem->vmacct.vm_committed_space);
+ atomic_long_add(committed, &mem->vmacct.vm_committed_space);
+ preempt_enable();
out:
mmput(mm);
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
Documentation/controllers/memory.txt | 29 +++++++++++++++++++++++++++++
1 files changed, 29 insertions(+), 0 deletions(-)
diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt
index 866b9cd..e984bfb 100644
--- a/Documentation/controllers/memory.txt
+++ b/Documentation/controllers/memory.txt
@@ -12,6 +12,7 @@ c. Provides *zero overhead* for non memory controller users
d. Provides a double LRU: global memory pressure causes reclaim from the
global LRU; a cgroup on hitting a limit, reclaims from the per
cgroup LRU
+e. Provides distinct cgroup VM overcommit accounting and handling
NOTE: Swap Cache (unmapped) is not accounted now.
@@ -142,6 +143,31 @@ The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per cgroup LRU
list.
+2.5 VM overcommit accounting and handling
+
+The concept of committed VM is replicated within each cgroup, mirroring the
+global committed memory accounting. Each cgroup can set its own overcommit
+policy using
+the files:
+
+memory.overcommit_memory
+memory.overcommit_ratio
+
+These settings override the system sysctl settings (`vm.overcommit_memory` and
+`vm.overcommit_ratio`) and apply locally to the cgroup they refer to.
+
+Global sysctl settings are initially used by the root level cgroups. Child
+cgroups initially inherit the parent's settings. Each cgroup can change its own
+overcommit parameters at any time by simply modifying the files
+`memory.overcommit_memory` and/or `memory.overcommit_ratio`.
+
+Statistics about the current committed space and limit are reported in
+`memory.overcommit_as` for each cgroup.
+
+The per-cgroup overcommit limit depends on the local cgroup overcommit settings
+memory limit (RSS + cache) imposed by the memory controller.
+
+See "Documentation/vm/overcommit-accounting" for additional details.
+
2. Locking
The memory controller uses the following hierarchy
@@ -230,6 +256,9 @@ carried forward. The pages allocated from the original cgroup still
remain charged to it, the charge is dropped when the page is freed or
reclaimed.
+The amount of the task's committed VM, instead, is uncharged from the old
+cgroup and charged to the new one.
+
4.3 Removing a cgroup
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
The file "overcommit_as" can be used to retrieve the current committed space
and limit of each cgroup.
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
mm/memcontrol.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 115 insertions(+), 1 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4100e24..e3e34e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -36,6 +36,8 @@
#include <asm/uaccess.h>
+#define K(x) ((x) << (PAGE_SHIFT - 10))
+
struct cgroup_subsys mem_cgroup_subsys;
static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
static struct kmem_cache *page_cgroup_cache;
@@ -47,7 +49,7 @@ enum mem_cgroup_stat_index {
/*
* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
*/
- MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+ MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS, /* # of pages charged as rss */
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
@@ -55,6 +57,11 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_NSTATS,
};
+enum mem_cgroup_vmacct_index {
+ MEM_CGROUP_OVERCOMMIT_MEMORY,
+ MEM_CGROUP_OVERCOMMIT_RATIO,
+};
+
struct mem_cgroup_stat_cpu {
s64 count[MEM_CGROUP_STAT_NSTATS];
} ____cacheline_aligned_in_smp;
@@ -995,6 +1002,97 @@ static ssize_t mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
mem_cgroup_write_strategy);
}
+static ssize_t mem_cgroup_committed_read(struct cgroup *cont,
+ struct cftype *cft,
+ struct file *file,
+ char __user *userbuf,
+ size_t nbytes,
+ loff_t *ppos)
+{
+ struct mem_cgroup *mem;
+ char *page;
+ unsigned long total, committed, allowed;
+ ssize_t count;
+ int ret;
+
+ page = (char *)__get_free_page(GFP_TEMPORARY);
+ if (unlikely(!page))
+ return -ENOMEM;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ cgroup_unlock();
+ ret = -ENODEV;
+ goto out;
+ }
+
+ mem = mem_cgroup_from_cont(cont);
+ committed = atomic_long_read(&mem->vmacct.vm_committed_space);
+
+ total = (long)(mem->res.limit >> PAGE_SHIFT) + 1L;
+ if (total > (totalram_pages - hugetlb_total_pages()))
+ allowed = ((totalram_pages - hugetlb_total_pages())
+ * mem->vmacct.overcommit_ratio / 100)
+ + total_swap_pages;
+ else
+ allowed = total * mem->vmacct.overcommit_ratio / 100
+ + total_swap_pages;
+ cgroup_unlock();
+
+ count = sprintf(page, "CommitLimit: %8lu kB\n"
+ "Committed_AS: %8lu kB\n",
+ K(allowed), K(committed));
+ ret = simple_read_from_buffer(userbuf, nbytes, ppos, page, count);
+out:
+ free_page((unsigned long)page);
+ return ret;
+}
+
+static s64 mem_cgroup_vmacct_read_s64(struct cgroup *cont, struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ enum mem_cgroup_vmacct_index type = cft->private;
+
+ switch (type) {
+ case MEM_CGROUP_OVERCOMMIT_MEMORY:
+ return mem->vmacct.overcommit_memory;
+ case MEM_CGROUP_OVERCOMMIT_RATIO:
+ return mem->vmacct.overcommit_ratio;
+ default:
+ BUG();
+ }
+}
+
+static int mem_cgroup_vmacct_write_s64(struct cgroup *cont, struct cftype *cft,
+ s64 val)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ enum mem_cgroup_vmacct_index type = cft->private;
+ int ret = 0;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ cgroup_unlock();
+ return -ENODEV;
+ }
+
+ switch (type) {
+ case MEM_CGROUP_OVERCOMMIT_MEMORY:
+ mem->vmacct.overcommit_memory = (int)val;
+ break;
+ case MEM_CGROUP_OVERCOMMIT_RATIO:
+ mem->vmacct.overcommit_ratio = (int)val;
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ cgroup_unlock();
+
+ return ret;
+
+}
+
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *mem;
@@ -1086,6 +1184,22 @@ static struct cftype mem_cgroup_files[] = {
.name = "stat",
.read_map = mem_control_stat_show,
},
+ {
+ .name = "overcommit_memory",
+ .private = MEM_CGROUP_OVERCOMMIT_MEMORY,
+ .read_s64 = mem_cgroup_vmacct_read_s64,
+ .write_s64 = mem_cgroup_vmacct_write_s64,
+ },
+ {
+ .name = "overcommit_ratio",
+ .private = MEM_CGROUP_OVERCOMMIT_RATIO,
+ .read_s64 = mem_cgroup_vmacct_read_s64,
+ .write_s64 = mem_cgroup_vmacct_write_s64,
+ },
+ {
+ .name = "overcommit_as",
+ .read = mem_cgroup_committed_read,
+ },
};
static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
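For illustration, a minimal usage sketch of the interface added above; the
mount point and cgroup name are arbitrary, the numbers in the output are
placeholders, and 2 is OVERCOMMIT_NEVER, as for the vm.overcommit_memory
sysctl:

# mount -t cgroup -o memory none /cgroups
# mkdir /cgroups/foo
# echo 2 > /cgroups/foo/memory.overcommit_memory
# echo 50 > /cgroups/foo/memory.overcommit_ratio
# cat /cgroups/foo/memory.overcommit_as
CommitLimit:     xxxxxx kB
Committed_AS:    xxxxxx kB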
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
ipc/shm.c | 6 ++++--
kernel/fork.c | 2 +-
mm/mmap.c | 10 ++++++----
mm/mprotect.c | 2 +-
mm/mremap.c | 4 ++--
mm/shmem.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
mm/swapfile.c | 4 ++--
7 files changed, 57 insertions(+), 20 deletions(-)
diff --git a/ipc/shm.c b/ipc/shm.c
index 554429a..420bfa9 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -384,12 +384,14 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
shp->mlock_user = current->user;
} else {
int acctflag = VM_ACCOUNT;
+ struct vm_acct_values v;
/*
* Do not allow no accounting for OVERCOMMIT_NEVER, even
- * if it's asked for.
+ * if it's asked for.
*/
+ vm_acct_get_config(current->mm, &v);
if ((shmflg & SHM_NORESERVE) &&
- sysctl_overcommit_memory != OVERCOMMIT_NEVER)
+ v.overcommit_memory != OVERCOMMIT_NEVER)
acctflag = 0;
file = shmem_file_setup(name, size, acctflag);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 19908b2..eaffa56 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -336,7 +336,7 @@ fail_nomem_policy:
kmem_cache_free(vm_area_cachep, tmp);
fail_nomem:
retval = -ENOMEM;
- vm_unacct_memory(charge);
+ vm_unacct_memory(mm, charge);
goto out;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 256599e..ab40277 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1028,6 +1028,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
int error;
struct rb_node **rb_link, *rb_parent;
unsigned long charged = 0;
+ struct vm_acct_values v;
struct inode *inode = file ? file->f_path.dentry->d_inode : NULL;
/* Clear old maps */
@@ -1044,6 +1045,7 @@ munmap_back:
if (!may_expand_vm(mm, len >> PAGE_SHIFT))
return -ENOMEM;
+ vm_acct_get_config(mm, &v);
if (accountable && (!(flags & MAP_NORESERVE) ||
sysctl_overcommit_memory == OVERCOMMIT_NEVER)) {
if (vm_flags & VM_SHARED) {
@@ -1170,7 +1172,7 @@ free_vma:
kmem_cache_free(vm_area_cachep, vma);
unacct_error:
if (charged)
- vm_unacct_memory(charged);
+ vm_unacct_memory(mm, charged);
return error;
}
@@ -1698,7 +1700,7 @@ static void unmap_region(struct mm_struct *mm,
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
- vm_unacct_memory(nr_accounted);
+ vm_unacct_memory(mm, nr_accounted);
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
@@ -1959,7 +1961,7 @@ unsigned long do_brk(unsigned long addr, unsigned long len)
*/
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (!vma) {
- vm_unacct_memory(len >> PAGE_SHIFT);
+ vm_unacct_memory(mm, len >> PAGE_SHIFT);
return -ENOMEM;
}
@@ -1998,7 +2000,7 @@ void exit_mmap(struct mm_struct *mm)
/* Don't update_hiwater_rss(mm) here, do_exit already did */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
- vm_unacct_memory(nr_accounted);
+ vm_unacct_memory(mm, nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a5bf31c..2b1ad3b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -216,7 +216,7 @@ success:
return 0;
fail:
- vm_unacct_memory(charged);
+ vm_unacct_memory(mm, charged);
return error;
}
diff --git a/mm/mremap.c b/mm/mremap.c
index 08e3c7f..9dc5921 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -217,7 +217,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (do_munmap(mm, old_addr, old_len) < 0) {
/* OOM: unable to split vma, just get accounts right */
- vm_unacct_memory(excess >> PAGE_SHIFT);
+ vm_unacct_memory(mm, excess >> PAGE_SHIFT);
excess = 0;
}
mm->hiwater_vm = hiwater_vm;
@@ -407,7 +407,7 @@ unsigned long do_mremap(unsigned long addr,
}
out:
if (ret & ~PAGE_MASK)
- vm_unacct_memory(charged);
+ vm_unacct_memory(mm, charged);
out_nc:
return ret;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index e2a6ae1..2cd56be 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -163,14 +163,30 @@ static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
*/
static inline int shmem_acct_size(unsigned long flags, loff_t size)
{
- return (flags & VM_ACCOUNT)?
- security_vm_enough_memory(VM_ACCT(size)): 0;
+ if (!(flags & VM_ACCOUNT))
+ return 0;
+ /*
+ * TODO: find a way to correctly account shared memory also for the
+ * cgroup memory controller, as well as non-shared memory. For now
+ * simply undo the accounting in the cgroup's committed space and
+ * report the size only in the global committed memory.
+ */
+ mem_cgroup_vm_acct_memory(current->mm, -VM_ACCT(size));
+ return security_vm_enough_memory(VM_ACCT(size));
}
static inline void shmem_unacct_size(unsigned long flags, loff_t size)
{
- if (flags & VM_ACCOUNT)
- vm_unacct_memory(VM_ACCT(size));
+ /*
+ * TODO: find a way to correctly account shared memory also for the
+ * cgroup memory controller, as well as non-shared memory. For now
+ * simply undo the accounting in the cgroup's committed space and
+ * report the size only in the global committed memory.
+ */
+ if (!(flags & VM_ACCOUNT))
+ return;
+ mem_cgroup_vm_acct_memory(current->mm, VM_ACCT(size));
+ vm_unacct_memory(current->mm, VM_ACCT(size));
}
/*
@@ -181,14 +197,31 @@ static inline void shmem_unacct_size(unsigned long flags, loff_t size)
*/
static inline int shmem_acct_block(unsigned long flags)
{
- return (flags & VM_ACCOUNT)?
- 0: security_vm_enough_memory(VM_ACCT(PAGE_CACHE_SIZE));
+ /*
+ * TODO: find a way to correctly account shared memory also for the
+ * cgroup memory controller, as well as non-shared memory. For now
+ * simply undo the accounting in the cgroup's committed space and
+ * report the size only in the global committed memory.
+ */
+ if (flags & VM_ACCOUNT)
+ return 0;
+ mem_cgroup_vm_acct_memory(current->mm, -VM_ACCT(PAGE_CACHE_SIZE));
+ return security_vm_enough_memory(VM_ACCT(PAGE_CACHE_SIZE));
}
static inline void shmem_unacct_blocks(unsigned long flags, long pages)
{
- if (!(flags & VM_ACCOUNT))
- vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE));
+ /*
+ * TODO: find a way to correctly account shared memory also for the
+ * cgroup memory controller, as well as non-shared memory. For now
+ * simply undo the accounting in the cgroup's committed space and
+ * report the size only in the global committed memory.
+ */
+ if (flags & VM_ACCOUNT)
+ return;
+ mem_cgroup_vm_acct_memory(current->mm,
+ pages * VM_ACCT(PAGE_CACHE_SIZE));
+ vm_unacct_memory(current->mm, pages * VM_ACCT(PAGE_CACHE_SIZE));
}
static const struct super_operations shmem_ops;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bd1bb59..87b7d1a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1213,7 +1213,7 @@ asmlinkage long sys_swapoff(const char __user * specialfile)
char * pathname;
int i, type, prev;
int err;
-
+
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -1245,7 +1245,7 @@ asmlinkage long sys_swapoff(const char __user * specialfile)
goto out_dput;
}
if (!security_vm_enough_memory(p->pages))
- vm_unacct_memory(p->pages);
+ vm_unacct_memory(current->mm, p->pages);
else {
err = -ENOMEM;
spin_unlock(&swap_lock);
>
> Provide distinct cgroup VM overcommit accounting and handling using the memory
> resource controller.
>
Could you explain the benefits of this, given that we already have the
memrlimit controller? (If unsure, see 2.6.26-rc5-mm1 and search for the
memrlimit controller.)
And this kind of virtual-address handling should be implemented in the
memrlimit controller, not in the memory resource controller.
It seems this patch doesn't need to handle page_cgroup.
Considering hierarchy, putting several kinds of features into one controller
is not good, I think. Balbir, what do you think?
Thanks,
-Kame
I would tend to agree. With the memrlimit controller, can't we do this in user
space now? Figure out the overcommit value and, based on that, set up the
memrlimit?
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
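For illustration, a rough sketch of the userspace approach suggested here; the
memrlimit file name and the mount point below are assumptions, not taken from
this thread:

# Hypothetical: derive an overcommit-style limit and impose it via memrlimit
ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
ratio=50
limit=$(( (ram_kb * ratio / 100 + swap_kb) * 1024 ))
echo $limit > /cgroups/foo/memrlimit.limit_in_bytes	# file name assumed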
I'll read through the documentation and comment.
> Tested against latest git (2.6.26-rc5).
>
> The all-in-one patch can be found at:
> http://download.systemimager.org/~arighi/linux/patches/io-throttle/
>
Cool! Thanks for doing this, it's useful.
> Previous version and test report can be found here:
> http://lkml.org/lkml/2008/5/24/97
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
I also agree with Balbir and Kamezawa. A separate controller for VM (i.e. vma
lengths) is preferable to yet another fancy feature on top of the existing RSS
one.
Yep! It seems I totally missed the memrlimit controller. I was trying to
implement pretty much the same functionality using a different approach.
However, I agree that a separate controller seems to be a better
solution.
Thank you all for pointing me in the right direction. I'll test the memrlimit
controller and give feedback.
-Andrea
Why bother with the preempt stuff here? What does that actually protect
against? I assume that you're trying to keep other tasks that might run
on this CPU from seeing weird, inconsistent numbers in here. Is there
some other lock that keeps *other* cpus from seeing this?
In any case, I think it needs a big, fat comment.
-- Dave
I know you're trying to make this look like meminfo, but it does butcher
the sysfs rules a bit. How about breaking it out into a couple of
files?
-- Dave
Yes, true, mem_cgroup_move_task() is called after the task->cgroups
pointer has been changed. So, even if the task changes its committed space
between the atomic_long_sub() and atomic_long_add(), it will be correctly
accounted in the new cgroup.
-Andrea
Sounds reasonable.
Anyway, I think this patchset should simply be dropped in favor of
Balbir's memrlimit controller, as pointed out by Kamezawa.
The memrlimit controller makes it possible to implement overcommit policies in
userspace by directly setting the cgroup VM limits (well... I've not read the
memrlimit source yet, but from the documentation it seems so).
-Andrea
I will fix it in the next version.
Thanks again, Randy.
-Andrea
> +
> +The syntax is the following:
> +# /bin/echo DEVICE:BANDWIDTH > CGROUP/blockio.bandwidth
> +
> +- DEVICE is the name of the device the limiting rule is applied to,
> +- BANDWIDTH is the maximum I/O bandwidth on DEVICE allowed by CGROUP,
> +- CGROUP is the name of the limited process container.
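(A concrete instance of the syntax above, purely for illustration; the device
name and the bandwidth value, whose unit is not specified in this excerpt, are
assumptions:)

# /bin/echo /dev/sda:1048576 > /cgroups/foo/blockio.bandwidth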
Thanks.
---
~Randy
"'Daemon' is an old piece of jargon from the UNIX operating system,
where it referred to a piece of low-level utility software, a
fundamental part of the operating system."
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
kernel/power/main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/power/main.c b/kernel/power/main.c
index e937cc8..2bf3d5b 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -685,7 +685,7 @@ static int __init test_suspend(void)
}
/* RTCs have initialized by now too ... can we use one? */
- class_find_device(rtc_class, &pony, has_wakealarm);
+ class_find_device(rtc_class, NULL, &pony, has_wakealarm);
if (pony)
rtc = rtc_class_open(pony);
if (!rtc) {
--
1.5.4.3
u64 val = PAGE_ALIGN(size);
always returns a value below 4GB on 32-bit architectures, even if size is
greater than 4GB.
The problem resides in PAGE_MASK definition (from include/asm-x86/page.h for
example):
#define PAGE_SHIFT 12
#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
...
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
The "~" is performed on a 32-bit value, so everything in "and" with PAGE_MASK
greater than 4GB will be truncated to the 32-bit boundary. Using the ALIGN()
macro seems to be the right way, because it uses typeof(addr) for the mask.
Also move the PAGE_ALIGN() definitions out of include/asm-*/page.h into
include/linux/mm.h.
See also lkml discussion: http://lkml.org/lkml/2008/6/11/237
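For illustration, a minimal userspace sketch of the truncation; build it with
"gcc -m32" so that unsigned long is 32 bits. ALIGN() below is the definition
from linux/kernel.h:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE-1))
#define PAGE_ALIGN(addr)	(((addr)+PAGE_SIZE-1)&PAGE_MASK)

/* ALIGN() casts the mask to typeof(x), so no high bits are lost */
#define __ALIGN_MASK(x, mask)	(((x)+(mask))&~(mask))
#define ALIGN(x, a)		__ALIGN_MASK(x, (typeof(x))(a)-1)

int main(void)
{
	unsigned long long size = 0x100001000ULL;	/* 4GB + 4KB */

	/* with -m32, PAGE_MASK is only 0xfffff000: prints 0x1000 */
	printf("PAGE_ALIGN: %#llx\n", (unsigned long long)PAGE_ALIGN(size));
	/* prints 0x100001000, the full 64-bit result */
	printf("ALIGN:      %#llx\n", ALIGN(size, PAGE_SIZE));
	return 0;
}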
ChangeLog: v1 -> v2
- fix some PAGE_ALIGN() undefined references due to the move of the PAGE_ALIGN
definition into linux/mm.h
Signed-off-by: Andrea Righi <righi....@gmail.com>
Signed-off-by: Andrew Morton <ak...@linux-foundation.org>
---
arch/arm/kernel/module.c | 1 +
arch/arm/plat-omap/fb.c | 1 +
arch/avr32/mm/ioremap.c | 1 +
arch/h8300/kernel/setup.c | 1 +
arch/m68k/amiga/chipram.c | 1 +
arch/m68knommu/kernel/setup.c | 1 +
arch/mips/kernel/module.c | 1 +
arch/mips/sgi-ip27/ip27-klnuma.c | 1 +
arch/powerpc/boot/of.c | 1 +
arch/powerpc/boot/page.h | 3 ---
arch/powerpc/kernel/suspend.c | 1 +
arch/ppc/boot/simple/misc-embedded.c | 1 +
arch/ppc/boot/simple/misc.c | 1 +
arch/sparc64/kernel/iommu_common.h | 2 +-
arch/x86/kernel/module_64.c | 1 +
arch/xtensa/kernel/setup.c | 1 +
drivers/char/random.c | 1 +
drivers/ieee1394/iso.c | 1 +
drivers/media/video/pvrusb2/pvrusb2-ioread.c | 1 +
drivers/media/video/videobuf-core.c | 1 +
drivers/mtd/maps/uclinux.c | 1 +
drivers/net/mlx4/eq.c | 1 +
drivers/pci/intel-iommu.h | 2 +-
drivers/pcmcia/electra_cf.c | 1 +
drivers/scsi/sun_esp.c | 1 +
drivers/video/acornfb.c | 1 +
drivers/video/clps711xfb.c | 1 +
drivers/video/imxfb.c | 1 +
drivers/video/omap/dispc.c | 1 +
drivers/video/omap/omapfb_main.c | 1 +
drivers/video/pxafb.c | 1 +
drivers/video/sa1100fb.c | 1 +
include/asm-alpha/page.h | 3 ---
include/asm-arm/page-nommu.h | 4 +---
include/asm-arm/page.h | 3 ---
include/asm-avr32/page.h | 3 ---
include/asm-blackfin/page.h | 3 ---
include/asm-cris/page.h | 3 ---
include/asm-frv/page.h | 3 ---
include/asm-h8300/page.h | 3 ---
include/asm-ia64/page.h | 1 -
include/asm-m32r/page.h | 3 ---
include/asm-m68k/dvma.h | 2 +-
include/asm-m68k/page.h | 3 ---
include/asm-m68knommu/page.h | 3 ---
include/asm-mips/page.h | 3 ---
include/asm-mn10300/page.h | 3 ---
include/asm-parisc/page.h | 4 ----
include/asm-powerpc/page.h | 3 ---
include/asm-ppc/page.h | 4 ----
include/asm-s390/page.h | 3 ---
include/asm-sh/page.h | 3 ---
include/asm-sparc/page.h | 3 ---
include/asm-sparc64/page.h | 3 ---
include/asm-um/page.h | 3 ---
include/asm-x86/page.h | 3 ---
include/asm-xtensa/page.h | 2 --
include/linux/mm.h | 3 +++
sound/core/info.c | 1 +
59 files changed, 37 insertions(+), 77 deletions(-)
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index 79b7e5c..81891a2 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -14,6 +14,7 @@
#include <linux/moduleloader.h>
#include <linux/kernel.h>
#include <linux/elf.h>
+#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/fs.h>
diff --git a/arch/arm/plat-omap/fb.c b/arch/arm/plat-omap/fb.c
index 96d6f06..5d10752 100644
--- a/arch/arm/plat-omap/fb.c
+++ b/arch/arm/plat-omap/fb.c
@@ -23,6 +23,7 @@
#include <linux/module.h>
#include <linux/kernel.h>
+#include <linux/mm.h>
#include <linux/init.h>
#include <linux/platform_device.h>
#include <linux/bootmem.h>
diff --git a/arch/avr32/mm/ioremap.c b/arch/avr32/mm/ioremap.c
index 3437c82..f03b79f 100644
--- a/arch/avr32/mm/ioremap.c
+++ b/arch/avr32/mm/ioremap.c
@@ -6,6 +6,7 @@
* published by the Free Software Foundation.
*/
#include <linux/vmalloc.h>
+#include <linux/mm.h>
#include <linux/module.h>
#include <linux/io.h>
diff --git a/arch/h8300/kernel/setup.c b/arch/h8300/kernel/setup.c
index b1f25c2..7fda657 100644
--- a/arch/h8300/kernel/setup.c
+++ b/arch/h8300/kernel/setup.c
@@ -20,6 +20,7 @@
#include <linux/sched.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
+#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/fb.h>
#include <linux/console.h>
diff --git a/arch/m68k/amiga/chipram.c b/arch/m68k/amiga/chipram.c
index cbe3653..61df1d3 100644
--- a/arch/m68k/amiga/chipram.c
+++ b/arch/m68k/amiga/chipram.c
@@ -9,6 +9,7 @@
#include <linux/types.h>
#include <linux/kernel.h>
+#include <linux/mm.h>
#include <linux/init.h>
#include <linux/ioport.h>
#include <linux/slab.h>
diff --git a/arch/m68knommu/kernel/setup.c b/arch/m68knommu/kernel/setup.c
index 03f4fe6..5985f19 100644
--- a/arch/m68knommu/kernel/setup.c
+++ b/arch/m68knommu/kernel/setup.c
@@ -22,6 +22,7 @@
#include <linux/interrupt.h>
#include <linux/fb.h>
#include <linux/module.h>
+#include <linux/mm.h>
#include <linux/console.h>
#include <linux/errno.h>
#include <linux/string.h>
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index e7ed0ac..a418b26 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -28,6 +28,7 @@
#include <linux/string.h>
#include <linux/kernel.h>
#include <linux/module.h>
+#include <linux/mm.h>
#include <linux/spinlock.h>
#include <asm/pgtable.h> /* MODULE_START */
diff --git a/arch/mips/sgi-ip27/ip27-klnuma.c b/arch/mips/sgi-ip27/ip27-klnuma.c
index 48932ce..d9c79d8 100644
--- a/arch/mips/sgi-ip27/ip27-klnuma.c
+++ b/arch/mips/sgi-ip27/ip27-klnuma.c
@@ -4,6 +4,7 @@
* Copyright 2000 - 2001 Kanoj Sarcar (ka...@sgi.com)
*/
#include <linux/init.h>
+#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/kernel.h>
#include <linux/nodemask.h>
diff --git a/arch/powerpc/boot/of.c b/arch/powerpc/boot/of.c
index 61d9899..6bc72b1 100644
--- a/arch/powerpc/boot/of.c
+++ b/arch/powerpc/boot/of.c
@@ -8,6 +8,7 @@
*/
#include <stdarg.h>
#include <stddef.h>
+#include <linux/mm.h>
#include "types.h"
#include "elf.h"
#include "string.h"
diff --git a/arch/powerpc/boot/page.h b/arch/powerpc/boot/page.h
index 14eca30..aa42298 100644
--- a/arch/powerpc/boot/page.h
+++ b/arch/powerpc/boot/page.h
@@ -28,7 +28,4 @@
/* align addr on a size boundary - adjust address up if needed */
#define _ALIGN(addr,size) _ALIGN_UP(addr,size)
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE)
-
#endif /* _PPC_BOOT_PAGE_H */
diff --git a/arch/powerpc/kernel/suspend.c b/arch/powerpc/kernel/suspend.c
index 8cee571..6fc6328 100644
--- a/arch/powerpc/kernel/suspend.c
+++ b/arch/powerpc/kernel/suspend.c
@@ -7,6 +7,7 @@
* Copyright (c) 2001 Patrick Mochel <moc...@osdl.org>
*/
+#include <linux/mm.h>
#include <asm/page.h>
/* References to section boundaries */
diff --git a/arch/ppc/boot/simple/misc-embedded.c b/arch/ppc/boot/simple/misc-embedded.c
index d5a00eb..e390cb7 100644
--- a/arch/ppc/boot/simple/misc-embedded.c
+++ b/arch/ppc/boot/simple/misc-embedded.c
@@ -8,6 +8,7 @@
#include <linux/types.h>
#include <linux/string.h>
+#include <linux/mm.h>
#include <asm/bootinfo.h>
#include <asm/mmu.h>
#include <asm/page.h>
diff --git a/arch/ppc/boot/simple/misc.c b/arch/ppc/boot/simple/misc.c
index c3d3305..8c334a2 100644
--- a/arch/ppc/boot/simple/misc.c
+++ b/arch/ppc/boot/simple/misc.c
@@ -16,6 +16,7 @@
#include <linux/types.h>
#include <linux/string.h>
+#include <linux/mm.h>
#include <asm/page.h>
#include <asm/mmu.h>
diff --git a/arch/sparc64/kernel/iommu_common.h b/arch/sparc64/kernel/iommu_common.h
index f3575a6..53b19c8 100644
--- a/arch/sparc64/kernel/iommu_common.h
+++ b/arch/sparc64/kernel/iommu_common.h
@@ -23,7 +23,7 @@
#define IO_PAGE_SHIFT 13
#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT)
#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1))
-#define IO_PAGE_ALIGN(addr) (((addr)+IO_PAGE_SIZE-1)&IO_PAGE_MASK)
+#define IO_PAGE_ALIGN(addr) ALIGN(addr, IO_PAGE_SIZE)
#define IO_TSB_ENTRIES (128*1024)
#define IO_TSB_SIZE (IO_TSB_ENTRIES * 8)
diff --git a/arch/x86/kernel/module_64.c b/arch/x86/kernel/module_64.c
index a888e67..adcab87 100644
--- a/arch/x86/kernel/module_64.c
+++ b/arch/x86/kernel/module_64.c
@@ -20,6 +20,7 @@
#include <linux/elf.h>
#include <linux/vmalloc.h>
#include <linux/fs.h>
+#include <linux/mm.h>
#include <linux/string.h>
#include <linux/kernel.h>
#include <linux/slab.h>
diff --git a/arch/xtensa/kernel/setup.c b/arch/xtensa/kernel/setup.c
index 5e6d75c..a00359e 100644
--- a/arch/xtensa/kernel/setup.c
+++ b/arch/xtensa/kernel/setup.c
@@ -16,6 +16,7 @@
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/mm.h>
#include <linux/proc_fs.h>
#include <linux/screen_info.h>
#include <linux/bootmem.h>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0cf98bd..e0d0e37 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -236,6 +236,7 @@
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/interrupt.h>
+#include <linux/mm.h>
#include <linux/spinlock.h>
#include <linux/percpu.h>
#include <linux/cryptohash.h>
diff --git a/drivers/ieee1394/iso.c b/drivers/ieee1394/iso.c
index 07ca35c..1cf6487 100644
--- a/drivers/ieee1394/iso.c
+++ b/drivers/ieee1394/iso.c
@@ -11,6 +11,7 @@
#include <linux/pci.h>
#include <linux/sched.h>
+#include <linux/mm.h>
#include <linux/slab.h>
#include "hosts.h"
diff --git a/drivers/media/video/pvrusb2/pvrusb2-ioread.c b/drivers/media/video/pvrusb2/pvrusb2-ioread.c
index 05a1376..b482478 100644
--- a/drivers/media/video/pvrusb2/pvrusb2-ioread.c
+++ b/drivers/media/video/pvrusb2/pvrusb2-ioread.c
@@ -22,6 +22,7 @@
#include "pvrusb2-debug.h"
#include <linux/errno.h>
#include <linux/string.h>
+#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/mutex.h>
#include <asm/uaccess.h>
diff --git a/drivers/media/video/videobuf-core.c b/drivers/media/video/videobuf-core.c
index 0a88c44..b7b0584 100644
--- a/drivers/media/video/videobuf-core.c
+++ b/drivers/media/video/videobuf-core.c
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
+#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/interrupt.h>
diff --git a/drivers/mtd/maps/uclinux.c b/drivers/mtd/maps/uclinux.c
index bac000a..eca31a5 100644
--- a/drivers/mtd/maps/uclinux.c
+++ b/drivers/mtd/maps/uclinux.c
@@ -12,6 +12,7 @@
#include <linux/types.h>
#include <linux/init.h>
#include <linux/kernel.h>
+#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/major.h>
#include <linux/mtd/mtd.h>
diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c
index 4ca78aa..7df928d 100644
--- a/drivers/net/mlx4/eq.c
+++ b/drivers/net/mlx4/eq.c
@@ -33,6 +33,7 @@
#include <linux/init.h>
#include <linux/interrupt.h>
+#include <linux/mm.h>
#include <linux/dma-mapping.h>
#include <linux/mlx4/cmd.h>
diff --git a/drivers/pci/intel-iommu.h b/drivers/pci/intel-iommu.h
index afc0ad9..339da69 100644
--- a/drivers/pci/intel-iommu.h
+++ b/drivers/pci/intel-iommu.h
@@ -35,7 +35,7 @@
#define PAGE_SHIFT_4K (12)
#define PAGE_SIZE_4K (1UL << PAGE_SHIFT_4K)
#define PAGE_MASK_4K (((u64)-1) << PAGE_SHIFT_4K)
-#define PAGE_ALIGN_4K(addr) (((addr) + PAGE_SIZE_4K - 1) & PAGE_MASK_4K)
+#define PAGE_ALIGN_4K(addr) ALIGN(addr, PAGE_SIZE_4K)
#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT_4K)
#define DMA_32BIT_PFN IOVA_PFN(DMA_32BIT_MASK)
diff --git a/drivers/pcmcia/electra_cf.c b/drivers/pcmcia/electra_cf.c
index 52d0aa8..257039e 100644
--- a/drivers/pcmcia/electra_cf.c
+++ b/drivers/pcmcia/electra_cf.c
@@ -28,6 +28,7 @@
#include <linux/init.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
+#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <pcmcia/ss.h>
diff --git a/drivers/scsi/sun_esp.c b/drivers/scsi/sun_esp.c
index 2c87db9..f9cf701 100644
--- a/drivers/scsi/sun_esp.c
+++ b/drivers/scsi/sun_esp.c
@@ -7,6 +7,7 @@
#include <linux/types.h>
#include <linux/delay.h>
#include <linux/module.h>
+#include <linux/mm.h>
#include <linux/init.h>
#include <asm/irq.h>
diff --git a/drivers/video/acornfb.c b/drivers/video/acornfb.c
index eedb828..017233d 100644
--- a/drivers/video/acornfb.c
+++ b/drivers/video/acornfb.c
@@ -23,6 +23,7 @@
#include <linux/string.h>
#include <linux/ctype.h>
#include <linux/slab.h>
+#include <linux/mm.h>
#include <linux/init.h>
#include <linux/fb.h>
#include <linux/platform_device.h>
diff --git a/drivers/video/clps711xfb.c b/drivers/video/clps711xfb.c
index 9f8a389..bd017f2 100644
--- a/drivers/video/clps711xfb.c
+++ b/drivers/video/clps711xfb.c
@@ -22,6 +22,7 @@
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
+#include <linux/mm.h>
#include <linux/fb.h>
#include <linux/init.h>
#include <linux/proc_fs.h>
diff --git a/drivers/video/imxfb.c b/drivers/video/imxfb.c
index 94e4d3a..0c5a475 100644
--- a/drivers/video/imxfb.c
+++ b/drivers/video/imxfb.c
@@ -24,6 +24,7 @@
#include <linux/string.h>
#include <linux/interrupt.h>
#include <linux/slab.h>
+#include <linux/mm.h>
#include <linux/fb.h>
#include <linux/delay.h>
#include <linux/init.h>
diff --git a/drivers/video/omap/dispc.c b/drivers/video/omap/dispc.c
index ab32ceb..ab77c51 100644
--- a/drivers/video/omap/dispc.c
+++ b/drivers/video/omap/dispc.c
@@ -20,6 +20,7 @@
*/
#include <linux/kernel.h>
#include <linux/dma-mapping.h>
+#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/clk.h>
#include <linux/io.h>
diff --git a/drivers/video/omap/omapfb_main.c b/drivers/video/omap/omapfb_main.c
index 14d0f7a..f85af5c 100644
--- a/drivers/video/omap/omapfb_main.c
+++ b/drivers/video/omap/omapfb_main.c
@@ -25,6 +25,7 @@
* 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*/
#include <linux/platform_device.h>
+#include <linux/mm.h>
#include <linux/uaccess.h>
#include <asm/mach-types.h>
diff --git a/drivers/video/pxafb.c b/drivers/video/pxafb.c
index 778b250..dc98c3b 100644
--- a/drivers/video/pxafb.c
+++ b/drivers/video/pxafb.c
@@ -30,6 +30,7 @@
#include <linux/string.h>
#include <linux/interrupt.h>
#include <linux/slab.h>
+#include <linux/mm.h>
#include <linux/fb.h>
#include <linux/delay.h>
#include <linux/init.h>
diff --git a/drivers/video/sa1100fb.c b/drivers/video/sa1100fb.c
index 2dd1b8a..78bcdbc 100644
--- a/drivers/video/sa1100fb.c
+++ b/drivers/video/sa1100fb.c
@@ -167,6 +167,7 @@
#include <linux/string.h>
#include <linux/interrupt.h>
#include <linux/slab.h>
+#include <linux/mm.h>
#include <linux/fb.h>
#include <linux/delay.h>
#include <linux/init.h>
diff --git a/include/asm-alpha/page.h b/include/asm-alpha/page.h
index 22ff976..0995f9d 100644
--- a/include/asm-alpha/page.h
+++ b/include/asm-alpha/page.h
@@ -80,9 +80,6 @@ typedef struct page *pgtable_t;
#endif /* !__ASSEMBLY__ */
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
#define __pa(x) ((unsigned long) (x) - PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET))
#ifndef CONFIG_DISCONTIGMEM
diff --git a/include/asm-arm/page-nommu.h b/include/asm-arm/page-nommu.h
index a1bcad0..ea1cde8 100644
--- a/include/asm-arm/page-nommu.h
+++ b/include/asm-arm/page-nommu.h
@@ -7,6 +7,7 @@
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*/
+
#ifndef _ASMARM_PAGE_NOMMU_H
#define _ASMARM_PAGE_NOMMU_H
@@ -42,9 +43,6 @@ typedef unsigned long pgprot_t;
#define __pmd(x) (x)
#define __pgprot(x) (x)
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
extern unsigned long memory_start;
extern unsigned long memory_end;
diff --git a/include/asm-arm/page.h b/include/asm-arm/page.h
index 8e05bdb..7c5fc55 100644
--- a/include/asm-arm/page.h
+++ b/include/asm-arm/page.h
@@ -15,9 +15,6 @@
#define PAGE_SIZE (1UL << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
#ifndef __ASSEMBLY__
#ifndef CONFIG_MMU
diff --git a/include/asm-avr32/page.h b/include/asm-avr32/page.h
index cbbc5ca..f805d1c 100644
--- a/include/asm-avr32/page.h
+++ b/include/asm-avr32/page.h
@@ -57,9 +57,6 @@ static inline int get_order(unsigned long size)
#endif /* !__ASSEMBLY__ */
-/* Align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
-
/*
* The hardware maps the virtual addresses 0x80000000 -> 0x9fffffff
* permanently to the physical addresses 0x00000000 -> 0x1fffffff when
diff --git a/include/asm-blackfin/page.h b/include/asm-blackfin/page.h
index c7db022..344f6a8 100644
--- a/include/asm-blackfin/page.h
+++ b/include/asm-blackfin/page.h
@@ -51,9 +51,6 @@ typedef struct page *pgtable_t;
#define __pgd(x) ((pgd_t) { (x) } )
#define __pgprot(x) ((pgprot_t) { (x) } )
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
extern unsigned long memory_start;
extern unsigned long memory_end;
diff --git a/include/asm-cris/page.h b/include/asm-cris/page.h
index c45bb1e..d19272b 100644
--- a/include/asm-cris/page.h
+++ b/include/asm-cris/page.h
@@ -60,9 +60,6 @@ typedef struct page *pgtable_t;
#define page_to_phys(page) __pa((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
#ifndef __ASSEMBLY__
#endif /* __ASSEMBLY__ */
diff --git a/include/asm-frv/page.h b/include/asm-frv/page.h
index c2c1e89..bd9c220 100644
--- a/include/asm-frv/page.h
+++ b/include/asm-frv/page.h
@@ -40,9 +40,6 @@ typedef struct page *pgtable_t;
#define __pgprot(x) ((pgprot_t) { (x) } )
#define PTE_MASK PAGE_MASK
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
-
#define devmem_is_allowed(pfn) 1
#define __pa(vaddr) virt_to_phys((void *) (unsigned long) (vaddr))
diff --git a/include/asm-h8300/page.h b/include/asm-h8300/page.h
index d6a3eaf..0b6acf0 100644
--- a/include/asm-h8300/page.h
+++ b/include/asm-h8300/page.h
@@ -43,9 +43,6 @@ typedef struct page *pgtable_t;
#define __pgd(x) ((pgd_t) { (x) } )
#define __pgprot(x) ((pgprot_t) { (x) } )
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
extern unsigned long memory_start;
extern unsigned long memory_end;
diff --git a/include/asm-ia64/page.h b/include/asm-ia64/page.h
index 36f3932..5f271bc 100644
--- a/include/asm-ia64/page.h
+++ b/include/asm-ia64/page.h
@@ -40,7 +40,6 @@
#define PAGE_SIZE (__IA64_UL_CONST(1) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE - 1))
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
#define PERCPU_PAGE_SHIFT 16 /* log2() of max. size of per-CPU area */
#define PERCPU_PAGE_SIZE (__IA64_UL_CONST(1) << PERCPU_PAGE_SHIFT)
diff --git a/include/asm-m32r/page.h b/include/asm-m32r/page.h
index 8a677f3..c933308 100644
--- a/include/asm-m32r/page.h
+++ b/include/asm-m32r/page.h
@@ -41,9 +41,6 @@ typedef struct page *pgtable_t;
#endif /* !__ASSEMBLY__ */
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
-
/*
* This handles the memory map.. We could make this a config
* option, but too many people screw it up, and too few need
diff --git a/include/asm-m68k/dvma.h b/include/asm-m68k/dvma.h
index 5d28631..5bd2435 100644
--- a/include/asm-m68k/dvma.h
+++ b/include/asm-m68k/dvma.h
@@ -13,7 +13,7 @@
#define DVMA_PAGE_SHIFT 13
#define DVMA_PAGE_SIZE (1UL << DVMA_PAGE_SHIFT)
#define DVMA_PAGE_MASK (~(DVMA_PAGE_SIZE-1))
-#define DVMA_PAGE_ALIGN(addr) (((addr)+DVMA_PAGE_SIZE-1)&DVMA_PAGE_MASK)
+#define DVMA_PAGE_ALIGN(addr) ALIGN(addr, DVMA_PAGE_SIZE)
extern void dvma_init(void);
extern int dvma_map_iommu(unsigned long kaddr, unsigned long baddr,
diff --git a/include/asm-m68k/page.h b/include/asm-m68k/page.h
index 880c2cb..a34b8ba 100644
--- a/include/asm-m68k/page.h
+++ b/include/asm-m68k/page.h
@@ -103,9 +103,6 @@ typedef struct page *pgtable_t;
#define __pgd(x) ((pgd_t) { (x) } )
#define __pgprot(x) ((pgprot_t) { (x) } )
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
#endif /* !__ASSEMBLY__ */
#include <asm/page_offset.h>
diff --git a/include/asm-m68knommu/page.h b/include/asm-m68knommu/page.h
index 1e82ebb..3a1ede4 100644
--- a/include/asm-m68knommu/page.h
+++ b/include/asm-m68knommu/page.h
@@ -43,9 +43,6 @@ typedef struct page *pgtable_t;
#define __pgd(x) ((pgd_t) { (x) } )
#define __pgprot(x) ((pgprot_t) { (x) } )
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
extern unsigned long memory_start;
extern unsigned long memory_end;
diff --git a/include/asm-mips/page.h b/include/asm-mips/page.h
index 8735aa0..dd1cc0a 100644
--- a/include/asm-mips/page.h
+++ b/include/asm-mips/page.h
@@ -134,9 +134,6 @@ typedef struct { unsigned long pgprot; } pgprot_t;
#endif /* !__ASSEMBLY__ */
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
-
/*
* __pa()/__va() should be used only during mem init.
*/
diff --git a/include/asm-mn10300/page.h b/include/asm-mn10300/page.h
index 124971b..8288e12 100644
--- a/include/asm-mn10300/page.h
+++ b/include/asm-mn10300/page.h
@@ -61,9 +61,6 @@ typedef struct page *pgtable_t;
#endif /* !__ASSEMBLY__ */
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
-
/*
* This handles the memory map.. We could make this a config
* option, but too many people screw it up, and too few need
diff --git a/include/asm-parisc/page.h b/include/asm-parisc/page.h
index 27d50b8..c3941f0 100644
--- a/include/asm-parisc/page.h
+++ b/include/asm-parisc/page.h
@@ -119,10 +119,6 @@ extern int npmem_ranges;
#define PMD_ENTRY_SIZE (1UL << BITS_PER_PMD_ENTRY)
#define PTE_ENTRY_SIZE (1UL << BITS_PER_PTE_ENTRY)
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
-
#define LINUX_GATEWAY_SPACE 0
/* This governs the relationship between virtual and physical addresses.
diff --git a/include/asm-powerpc/page.h b/include/asm-powerpc/page.h
index cffdf0e..e088545 100644
--- a/include/asm-powerpc/page.h
+++ b/include/asm-powerpc/page.h
@@ -119,9 +119,6 @@ extern phys_addr_t kernstart_addr;
/* align addr on a size boundary - adjust address up if needed */
#define _ALIGN(addr,size) _ALIGN_UP(addr,size)
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE)
-
/*
* Don't compare things with KERNELBASE or PAGE_OFFSET to test for
* "kernelness", use is_kernel_addr() - it should do what you want.
diff --git a/include/asm-ppc/page.h b/include/asm-ppc/page.h
index 37e4756..6a53965 100644
--- a/include/asm-ppc/page.h
+++ b/include/asm-ppc/page.h
@@ -43,10 +43,6 @@ typedef unsigned long pte_basic_t;
/* align addr on a size boundary - adjust address up if needed */
#define _ALIGN(addr,size) _ALIGN_UP(addr,size)
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE)
-
-
#undef STRICT_MM_TYPECHECKS
#ifdef STRICT_MM_TYPECHECKS
diff --git a/include/asm-s390/page.h b/include/asm-s390/page.h
index 12fd9c4..991ba93 100644
--- a/include/asm-s390/page.h
+++ b/include/asm-s390/page.h
@@ -138,9 +138,6 @@ void arch_alloc_page(struct page *page, int order);
#endif /* !__ASSEMBLY__ */
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
#define __PAGE_OFFSET 0x0UL
#define PAGE_OFFSET 0x0UL
#define __pa(x) (unsigned long)(x)
diff --git a/include/asm-sh/page.h b/include/asm-sh/page.h
index 3b305ca..77fb8bf 100644
--- a/include/asm-sh/page.h
+++ b/include/asm-sh/page.h
@@ -24,9 +24,6 @@
#define PAGE_MASK (~(PAGE_SIZE-1))
#define PTE_MASK PAGE_MASK
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
#if defined(CONFIG_HUGETLB_PAGE_SIZE_64K)
#define HPAGE_SHIFT 16
#elif defined(CONFIG_HUGETLB_PAGE_SIZE_256K)
diff --git a/include/asm-sparc/page.h b/include/asm-sparc/page.h
index 6aa9e4c..a5029f2 100644
--- a/include/asm-sparc/page.h
+++ b/include/asm-sparc/page.h
@@ -136,9 +136,6 @@ BTFIXUPDEF_SETHI(sparc_unmapped_base)
#endif /* !(__ASSEMBLY__) */
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
#define PAGE_OFFSET 0xf0000000
#ifndef __ASSEMBLY__
extern unsigned long phys_base;
diff --git a/include/asm-sparc64/page.h b/include/asm-sparc64/page.h
index 93f0881..c36a4ba 100644
--- a/include/asm-sparc64/page.h
+++ b/include/asm-sparc64/page.h
@@ -110,9 +110,6 @@ typedef struct page *pgtable_t;
#endif /* !(__ASSEMBLY__) */
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
/* We used to stick this into a hard-coded global register (%g4)
* but that does not make sense anymore.
*/
diff --git a/include/asm-um/page.h b/include/asm-um/page.h
index 93655eb..a6df1f1 100644
--- a/include/asm-um/page.h
+++ b/include/asm-um/page.h
@@ -92,9 +92,6 @@ typedef struct page *pgtable_t;
#define __pgd(x) ((pgd_t) { (x) } )
#define __pgprot(x) ((pgprot_t) { (x) } )
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
extern unsigned long uml_physmem;
#define PAGE_OFFSET (uml_physmem)
diff --git a/include/asm-x86/page.h b/include/asm-x86/page.h
index f409643..534f6f7 100644
--- a/include/asm-x86/page.h
+++ b/include/asm-x86/page.h
@@ -31,9 +31,6 @@
#define HUGE_MAX_HSTATE 2
-/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
-
#ifndef __ASSEMBLY__
#include <linux/types.h>
#endif
diff --git a/include/asm-xtensa/page.h b/include/asm-xtensa/page.h
index 80a6ae0..11f7dc2 100644
--- a/include/asm-xtensa/page.h
+++ b/include/asm-xtensa/page.h
@@ -26,13 +26,11 @@
/*
* PAGE_SHIFT determines the page size
- * PAGE_ALIGN(x) aligns the pointer to the (next) page boundary
*/
#define PAGE_SHIFT 12
#define PAGE_SIZE (__XTENSA_UL_CONST(1) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE - 1) & PAGE_MASK)
#define PAGE_OFFSET XCHAL_KSEG_CACHED_VADDR
#define MAX_MEM_PFN XCHAL_KSEG_SIZE
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 460128d..4c36e83 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -41,6 +41,9 @@ extern unsigned long mmap_min_addr;
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
+/* to align the pointer to the (next) page boundary */
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
+
/*
* Linux kernel virtual memory manager primitives.
* The idea being to have a "virtual" mm in the same way
diff --git a/sound/core/info.c b/sound/core/info.c
index cb5ead3..60edb0a 100644
--- a/sound/core/info.c
+++ b/sound/core/info.c
@@ -22,6 +22,7 @@
#include <linux/init.h>
#include <linux/time.h>
#include <linux/smp_lock.h>
+#include <linux/mm.h>
#include <linux/string.h>
#include <sound/core.h>
#include <sound/minors.h>
BTW, I've used the following script to discover the missing inclusions
of <linux/mm.h> in the files of the various archs that use PAGE_ALIGN().
I don't know if there's a better way to check this (well... obviously
except recompiling everything on all the architectures).
I'm posting the script here; maybe it could also be useful for other
similar stuff.
It just greps the direct and indirect .h inclusions (up to a maximum
level of 5 indirect inclusions, by default) in .c files, and if it doesn't
find the required #include (linux/mm.h in this case), it reports an
error.
It does not cover #ifdefs and different CONFIGs, but it should at least be
able to reduce potential build errors due to undefined references (of
PAGE_ALIGN in this case). If there are no evident bugs, the script should
be able to provide a sufficient condition for correctness.
A downloadable version of the script is also available here:
http://download.systemimager.org/~arighi/linux/scripts/check-include.pl
For example to check if the files that use PAGE_ALIGN() include
<linux/mm.h> on ia64 arch (without recompiling everything):
$ time check-include.pl --arch ia64 --include linux/mm.h \
`git-grep -l [^_]PAGE_ALIGN arch/ia64` \
`git-grep -l [^_]PAGE_ALIGN | grep -v '^arch\|^include\|\.h$'`
...
and on a intel core2 duo 1.2GHz it needs only:
real 0m5.229s
user 0m4.536s
sys 0m0.664s
-Andrea
---
#!/usr/bin/perl -w
#
# check-include.pl
#
# Description:
# Check if one or more C files (.c) in a Linux kernel tree directly or
# indirectly include a C header (.h).
#
# Copyright (C) 2008 Andrea Righi <righi....@gmail.com>
use strict;
use File::Basename;
use Getopt::Long;
my $VERSION = '0.1';
my $program_name = 'check-include.pl';
my $version_info = << "EOF";
$program_name v$VERSION
Copyright (C) 2008 Andrea Righi <righi.andrea\@gmail.com>
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
EOF
my $help_info = $version_info . <<"EOF";
Usage: $program_name --include INCLUDE --arch ARCH [OPTION]... FILE...
Options:
--help, -h Display this output.
--version, -v Display version and copyright information.
--include, -i=INCLUDE Search INCLUDE inclusion, as specified in #include
(e.g. linux/mm.h).
--arch, -a=ARCH Use assembly inclusions of the specified ARCH, the
name of the arch must be the same as specified in
include/asm-* (e.g. x86).
--level, -l=NUM Expand header inclusions up to NUM nested levels.
Beyond this limit we consider the inclusion missing.
Default max level is: 5.
EOF
select(STDERR);
$| = 1;
select(STDOUT);
$| = 1;
Getopt::Long::Configure("posix_default");
Getopt::Long::Configure("no_gnu_compat");
Getopt::Long::Configure("bundling");
GetOptions(
"help|h" => \my $help,
"version|v" => \my $version,
"include|i=s" => \my $include,
"arch|a=s" => \my $arch,
"level|l=i" => \my $level,
) or die("$help_info");
if ($help) {
print "$help_info";
exit(0);
}
if ($version) {
print "$version_info";
exit(0);
}
unless ($include) {
print STDERR "ERROR: --include option is mandatory\n";
print "\n$help_info";
exit(1);
}
unless ($arch) {
print STDERR "ERROR: --arch option is mandatory\n";
print "\n$help_info";
exit(1);
}
unless ($level) {
$level = 5;
}
# try to evaluate if we're in a Linux kernel tree
# TODO: there are surely better ways to do this...
unless ((-f 'README') && (-f 'MAINTAINERS')) {
die("fatal: not a Linux kernel tree\n");
}
my @files = sort_unique(@ARGV);
unless (@files) {
print "no file specified\n";
exit(0);
}
my $ret = 0;
my $cache;
foreach my $file (@files) {
if ($file !~ /\.c$/) {
print STDERR "$file: is not a C file\n";
next;
}
if (! -f $file) {
print STDERR "$file: does not exist\n";
next;
}
print "checking $file : ";
my $found = 0;
for (my $i = 1; $i <= $level; $i++) {
my @includes = get_includes($arch, $i, $include, $file);
$_ = resolve_std_inc($arch, $include);
if (grep(/^include\/$_$/, @includes) ||
!check_include($include, @includes)) {
$found = 1;
last;
}
}
if ($found) {
print "ok\n";
next;
}
$ret = 1;
print "$include was not found!\n";
}
exit($ret);
sub sort_unique
{
my @in = @_;
my %saw;
@saw{@_} = ();
return sort keys %saw;
}
sub resolve_std_inc {
my $arch = shift;
$_ = shift;
s/include\/asm\//include\/asm-$arch\//;
return $_;
}
sub resolve_loc_inc {
my $file = shift;
$_ = shift;
return dirname($file) . "/$_";
}
sub get_includes {
my ($arch, $level, $include, $file) = @_;
my @includes = ();
if ($cache->{$file}->{$level}) {
goto out;
}
@_ = ($file);
for (my $i = 1; $i <= $level; $i++) {
if ($cache->{$file}->{$i}) {
@_ = @{$cache->{$file}->{$i}};
push(@includes, @_);
next;
}
my @list = @_;
@_ = ();
foreach my $e (@list) {
my $res = open(IN, "<$e");
next unless($res);
chomp(my @in = <IN>);
close(IN);
my @inc = grep(s/#include\s+<([^>]+)>.*/include\/$1/,
@in);
if (@inc) {
# Resolve assembly inclusions.
map { $_ = resolve_std_inc($arch, $_) } @inc;
push(@_, @inc);
}
@inc = grep(s/#include\s+"([^>]+)".*/$1/,
@in);
if (@inc) {
# Resolve local inclusions.
map { $_ = resolve_loc_inc($e, $_) } @inc;
push(@_, @inc);
}
}
last unless (@_);
push(@includes, @_);
}
@{$cache->{$file}->{$level}} = sort_unique(@includes);
out:
return @{$cache->{$file}->{$level}};
}
sub check_include {
my ($include, @files) = @_;
foreach (@files) {
my $res = open(IN, "<$_");
next unless ($res);
chomp(my @in = <IN>);
close(IN);
return 0 if (grep(/#include\s+<$include>/, @in))
}
return -1;
}
__END__
> Also move the PAGE_ALIGN() definitions out of include/asm-*/page.h in
> include/linux/mm.h.
I'd rather see it in some other place than this, because
include/linux/mm.h is a large header that includes quite a lot of
other stuff. What's wrong with leaving it in each arch's page.h and
only changing it on those archs that have both 32-bit and 64-bit
variants? Or perhaps there is some other, lower-level header in
include/linux where it could go?
> diff --git a/arch/powerpc/boot/of.c b/arch/powerpc/boot/of.c
> index 61d9899..6bc72b1 100644
> --- a/arch/powerpc/boot/of.c
> +++ b/arch/powerpc/boot/of.c
> @@ -8,6 +8,7 @@
> */
> #include <stdarg.h>
> #include <stddef.h>
> +#include <linux/mm.h>
> #include "types.h"
> #include "elf.h"
> #include "string.h"
> diff --git a/arch/powerpc/boot/page.h b/arch/powerpc/boot/page.h
> index 14eca30..aa42298 100644
> --- a/arch/powerpc/boot/page.h
> +++ b/arch/powerpc/boot/page.h
> @@ -28,7 +28,4 @@
> /* align addr on a size boundary - adjust address up if needed */
> #define _ALIGN(addr,size) _ALIGN_UP(addr,size)
>
> -/* to align the pointer to the (next) page boundary */
> -#define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE)
> -
> #endif /* _PPC_BOOT_PAGE_H */
These parts are NAKed, because arch/powerpc/boot is a separate program
that doesn't use the kernel include files.
> diff --git a/include/asm-powerpc/page.h b/include/asm-powerpc/page.h
> index cffdf0e..e088545 100644
> --- a/include/asm-powerpc/page.h
> +++ b/include/asm-powerpc/page.h
> @@ -119,9 +119,6 @@ extern phys_addr_t kernstart_addr;
> /* align addr on a size boundary - adjust address up if needed */
> #define _ALIGN(addr,size) _ALIGN_UP(addr,size)
>
> -/* to align the pointer to the (next) page boundary */
> -#define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE)
> -
> /*
> * Don't compare things with KERNELBASE or PAGE_OFFSET to test for
> * "kernelness", use is_kernel_addr() - it should do what you want.
We had already come across this issue on powerpc, and we fixed it by
making sure that the type of PAGE_MASK was int, not unsigned int.
However, I have no objection to using the ALIGN() macro from
include/linux/kernel.h instead.
Paul.
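To see why the signedness of PAGE_MASK matters here, a minimal
userspace sketch (not from any posted patch): under the usual arithmetic
conversions a signed 32-bit mask sign-extends to 64 bits, so the high
bits of a 64-bit value survive the AND, while an unsigned 32-bit mask
zero-extends and truncates them.

/* Minimal userspace demo (not from the patches): signed vs. unsigned
 * 32-bit page masks applied to a 64-bit size. */
#include <stdio.h>

int main(void)
{
	unsigned long long size = 0x140000000ULL;	/* 5 GiB */
	unsigned int umask = ~(4096U - 1);	/* zero-extends to 64 bits */
	int smask = ~(4096 - 1);		/* sign-extends to 64 bits */

	printf("%#llx\n", (size + 4095) & umask);	/* 0x40000000: truncated */
	printf("%#llx\n", (size + 4095) & smask);	/* 0x140000000: correct */
	return 0;
}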
I think the only evident advantage of this is to have a single
implementation, instead of dealing with (potentially) N different
implementations.
Maybe a different place could be linux/mm_types.h; it's not so small,
but at least it's smaller than linux/mm.h. However, it's a bit ugly to
put a "function" in a file called mm_types.h.
Anyway, I have to say that fixing PAGE_ALIGN and leaving it in each page.h
for now would surely be a simpler solution and would introduce fewer
potential errors.
OK, so we also shouldn't use ALIGN() from linux/kernel.h here, but should
leave the local _ALIGN() definition, right?
>
>> diff --git a/include/asm-powerpc/page.h b/include/asm-powerpc/page.h
>> index cffdf0e..e088545 100644
>> --- a/include/asm-powerpc/page.h
>> +++ b/include/asm-powerpc/page.h
>> @@ -119,9 +119,6 @@ extern phys_addr_t kernstart_addr;
>> /* align addr on a size boundary - adjust address up if needed */
>> #define _ALIGN(addr,size) _ALIGN_UP(addr,size)
>>
>> -/* to align the pointer to the (next) page boundary */
>> -#define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE)
>> -
>> /*
>> * Don't compare things with KERNELBASE or PAGE_OFFSET to test for
>> * "kernelness", use is_kernel_addr() - it should do what you want.
>
> We had already come across this issue on powerpc, and we fixed it by
> making sure that the type of PAGE_MASK was int, not unsigned int.
> However, I have no objection to using the ALIGN() macro from
> include/linux/kernel.h instead.
Thanks,
-Andrea
Signed-off-by: Rodolfo Giometti <giom...@linux.it>
---
drivers/power/Kconfig | 21 ++
drivers/power/Makefile | 1 +
drivers/power/bq27x00_battery.c | 511 +++++++++++++++++++++++++++++++++++++++
3 files changed, 533 insertions(+), 0 deletions(-)
create mode 100644 drivers/power/bq27x00_battery.c
diff --git a/drivers/power/Kconfig b/drivers/power/Kconfig
index 58c806e..35e45aa 100644
--- a/drivers/power/Kconfig
+++ b/drivers/power/Kconfig
@@ -49,4 +49,25 @@ config BATTERY_OLPC
help
Say Y to enable support for the battery on the OLPC laptop.
+config BATTERY_BQ27x00
+ tristate "BQ27x00 battery driver"
+ help
+ Say Y here to enable support for batteries with BQ27000 or
+ BQ27200 chip.
+
+config BATTERY_BQ27000
+ bool "BQ27000 battery driver"
+ depends on BATTERY_BQ27x00
+ select W1
+ select W1_SLAVE_BQ27000
+ help
+ Say Y here to enable support for batteries with BQ27000(HDQ) chip.
+
+config BATTERY_BQ27200
+ bool "BQ27200 battery driver"
+ depends on BATTERY_BQ27x00
+ select I2C
+ help
+ Say Y here to enable support for batteries with BQ27200(I2C) chip.
+
endif # POWER_SUPPLY
diff --git a/drivers/power/Makefile b/drivers/power/Makefile
index 6413ded..15aa8cb 100644
--- a/drivers/power/Makefile
+++ b/drivers/power/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_APM_POWER) += apm_power.o
obj-$(CONFIG_BATTERY_DS2760) += ds2760_battery.o
obj-$(CONFIG_BATTERY_PMU) += pmu_battery.o
obj-$(CONFIG_BATTERY_OLPC) += olpc_battery.o
+obj-$(CONFIG_BATTERY_BQ27x00) += bq27x00_battery.o
diff --git a/drivers/power/bq27x00_battery.c b/drivers/power/bq27x00_battery.c
new file mode 100644
index 0000000..755a64c
--- /dev/null
+++ b/drivers/power/bq27x00_battery.c
@@ -0,0 +1,511 @@
+/*
+ * linux/drivers/power/bq27x00_battery.c
+ *
+ * BQ27000/BQ27200 battery driver
+ *
+ * Copyright (C) 2008 Rodolfo Giometti <giom...@linux.it>
+ * Copyright (C) 2008 Eurotech S.p.A. <in...@eurotech.it>
+ *
+ * Based on a previous work by Copyright (C) 2008 Texas Instruments, Inc.
+ *
+ * This package is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * THIS PACKAGE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+#include <linux/module.h>
+#include <linux/param.h>
+#include <linux/jiffies.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/platform_device.h>
+#include <linux/power_supply.h>
+#include <linux/idr.h>
+
+#define DRIVER_VERSION "1.0.0"
+
+#ifdef CONFIG_BATTERY_BQ27000
+#include "../w1/w1.h"
+#endif
+#ifdef CONFIG_BATTERY_BQ27200
+#include <linux/i2c.h>
+#endif
+
+#define BQ27x00_REG_TEMP 0x06
+#define BQ27x00_REG_VOLT 0x08
+#define BQ27x00_REG_RSOC 0x0B /* Relative State-of-Charge */
+#define BQ27x00_REG_AI 0x14
+#define BQ27x00_REG_FLAGS 0x0A
+#define HIGH_BYTE(A) ((A) << 8)
+
+/* If the system has several batteries we need a different name for each
+ * of them...
+ */
+DEFINE_IDR(battery_id);
+DEFINE_MUTEX(battery_mutex);
+
+struct bq27x00_device_info;
+struct bq27x00_access_methods {
+ int (*read)(u8 reg, int *rt_value, int b_single,
+ struct bq27x00_device_info *di);
+};
+
+struct bq27x00_device_info {
+ struct device *dev;
+#ifdef CONFIG_BATTERY_BQ27200
+ struct i2c_client *client;
+ int id;
+#endif
+ int voltage_uV;
+ int current_uA;
+ int temp_C;
+ int charge_rsoc;
+ struct bq27x00_access_methods *bus;
+ struct power_supply bat;
+};
+
+static enum power_supply_property bq27x00_battery_props[] = {
+ POWER_SUPPLY_PROP_PRESENT,
+ POWER_SUPPLY_PROP_VOLTAGE_NOW,
+ POWER_SUPPLY_PROP_CURRENT_NOW,
+ POWER_SUPPLY_PROP_CAPACITY,
+ POWER_SUPPLY_PROP_TEMP,
+};
+
+/*
+ * Common code for BQ27x00 devices
+ */
+
+static int bq27x00_read(u8 reg, int *rt_value, int b_single,
+ struct bq27x00_device_info *di)
+{
+ int ret;
+
+ ret = di->bus->read(reg, rt_value, b_single, di);
+ *rt_value = be16_to_cpu(*rt_value);
+
+ return ret;
+}
+
+/*
+ * Return the battery temperature in degrees Celsius,
+ * or < 0 if something fails.
+ */
+static int bq27x00_battery_temperature(struct bq27x00_device_info *di)
+{
+ int ret, temp = 0;
+
+ ret = bq27x00_read(BQ27x00_REG_TEMP, &temp, 0, di);
+ if (ret) {
+ dev_err(di->dev, "error reading temperature\n");
+ return ret;
+ }
+
+ return (temp >> 2) - 273;
+}
+
+/*
+ * Return the battery voltage in millivolts,
+ * or < 0 if something fails.
+ */
+static int bq27x00_battery_voltage(struct bq27x00_device_info *di)
+{
+ int ret, volt = 0;
+
+ ret = bq27x00_read(BQ27x00_REG_VOLT, &volt, 0, di);
+ if (ret) {
+ dev_err(di->dev, "error reading voltage\n");
+ return ret;
+ }
+
+ return volt;
+}
+
+/*
+ * Return the battery average current.
+ * Note that the returned current can be negative.
+ * Returns 0 if something fails.
+ */
+static int bq27x00_battery_current(struct bq27x00_device_info *di)
+{
+ int ret, curr = 0, flags = 0;
+
+ ret = bq27x00_read(BQ27x00_REG_AI, &curr, 0, di);
+ if (ret) {
+ dev_err(di->dev, "error reading current\n");
+ return 0;
+ }
+ ret = bq27x00_read(BQ27x00_REG_FLAGS, &flags, 0, di);
+ if (ret < 0) {
+ dev_err(di->dev, "error reading flags\n");
+ return 0;
+ }
+ if ((flags & (1 << 7)) != 0) {
+ dev_dbg(di->dev, "negative current!\n");
+ return -curr;
+ } else
+ return curr;
+}
+
+/*
+ * Return the battery Relative State-of-Charge,
+ * or < 0 if something fails.
+ */
+static int bq27x00_battery_rsoc(struct bq27x00_device_info *di)
+{
+ int ret, rsoc = 0;
+
+ ret = bq27x00_read(BQ27x00_REG_RSOC, &rsoc, 1, di);
+ if (ret) {
+ dev_err(di->dev, "error reading relative State-of-Charge\n");
+ return ret;
+ }
+
+ return rsoc >> 8;
+}
+
+#define to_bq27x00_device_info(x) container_of((x), \
+ struct bq27x00_device_info, bat)
+
+static int bq27x00_battery_get_property(struct power_supply *psy,
+ enum power_supply_property psp,
+ union power_supply_propval *val)
+{
+ struct bq27x00_device_info *di = to_bq27x00_device_info(psy);
+
+ switch (psp) {
+ case POWER_SUPPLY_PROP_VOLTAGE_NOW :
+ case POWER_SUPPLY_PROP_PRESENT :
+ val->intval = bq27x00_battery_voltage(di);
+ if (psp == POWER_SUPPLY_PROP_PRESENT)
+ val->intval = val->intval <= 0 ? 0 : 1;
+
+ break;
+
+ case POWER_SUPPLY_PROP_CURRENT_NOW :
+ val->intval = bq27x00_battery_current(di);
+
+ break;
+
+ case POWER_SUPPLY_PROP_CAPACITY :
+ val->intval = bq27x00_battery_rsoc(di);
+
+ break;
+
+ case POWER_SUPPLY_PROP_TEMP :
+ val->intval = bq27x00_battery_temperature(di);
+
+ break;
+
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void bq27x00_powersupply_init(struct bq27x00_device_info *di)
+{
+ di->bat.type = POWER_SUPPLY_TYPE_BATTERY;
+ di->bat.properties = bq27x00_battery_props;
+ di->bat.num_properties = ARRAY_SIZE(bq27x00_battery_props);
+ di->bat.get_property = bq27x00_battery_get_property;
+ di->bat.external_power_changed = NULL;
+
+ return;
+}
+
+/*
+ * BQ27000 specific code
+ */
+
+#ifdef CONFIG_BATTERY_BQ27000
+
+extern int w1_bq27000_read(struct device *dev, u8 reg);
+
+static int bq27000_read(u8 reg, int *rt_value, int b_single,
+ struct bq27x00_device_info *di)
+{
+ u8 val;
+
+ val = w1_bq27000_read(di->dev, reg);
+ *rt_value = val;
+
+ if (!b_single) {
+ val = w1_bq27000_read(di->dev, reg + 1);
+ *rt_value += HIGH_BYTE((int) val);
+ }
+
+ return 0;
+}
+
+static int bq27000_battery_probe(struct platform_device *pdev)
+{
+ struct bq27x00_device_info *di;
+ struct bq27x00_access_methods *bus;
+ int retval = 0;
+
+ di = kzalloc(sizeof(*di), GFP_KERNEL);
+ if (!di) {
+ dev_err(&pdev->dev, "failed to allocate device info data\n");
+ return -ENOMEM;
+ }
+
+ bus = kzalloc(sizeof(*bus), GFP_KERNEL);
+ if (!bus) {
+ dev_err(&pdev->dev, "failed to allocate access method data\n");
+ kfree(di);
+ return -ENOMEM;
+ }
+
+ platform_set_drvdata(pdev, di);
+
+ di->dev = pdev->dev.parent;
+ di->bat.name = "bq27000";
+ bus->read = &bq27000_read;
+ di->bus = bus;
+
+ bq27x00_powersupply_init(di);
+
+ retval = power_supply_register(&pdev->dev, &di->bat);
+ if (retval) {
+ dev_err(&pdev->dev, "failed to register battery\n");
+ goto batt_failed;
+ }
+
+ dev_info(&pdev->dev, "support ver. %s enabled\n", DRIVER_VERSION);
+
+ return 0;
+
+batt_failed:
+ kfree(bus);
+ kfree(di);
+ return retval;
+}
+
+static int bq27000_battery_remove(struct platform_device *pdev)
+{
+ struct bq27x00_device_info *di = platform_get_drvdata(pdev);
+
+ power_supply_unregister(&di->bat);
+ platform_set_drvdata(pdev, NULL);
+ kfree(di);
+
+ return 0;
+}
+
+#endif /* CONFIG_BATTERY_BQ27000 */
+
+/*
+ * BQ27200 specific code
+ */
+
+#ifdef CONFIG_BATTERY_BQ27200
+
+static int bq27200_read(u8 reg, int *rt_value, int b_single,
+ struct bq27x00_device_info *di)
+{
+ struct i2c_client *client = di->client;
+ struct i2c_msg msg[1];
+ unsigned char data[2];
+ int err;
+
+ if (!client->adapter)
+ return -ENODEV;
+
+ msg->addr = client->addr;
+ msg->flags = 0;
+ msg->len = 1;
+ msg->buf = data;
+
+ data[0] = reg;
+ err = i2c_transfer(client->adapter, msg, 1);
+
+ if (err >= 0) {
+ if (!b_single)
+ msg->len = 2;
+ else
+ msg->len = 1;
+
+ msg->flags = I2C_M_RD;
+ err = i2c_transfer(client->adapter, msg, 1);
+ if (err >= 0) {
+ if (!b_single)
+ *rt_value = data[1] | HIGH_BYTE(data[0]);
+ else
+ *rt_value = data[0];
+
+ return 0;
+ } else
+ return err;
+ } else
+ return err;
+}
+
+static int bq27200_battery_probe(struct i2c_client *client,
+ const struct i2c_device_id *id)
+{
+ char *name;
+ struct bq27x00_device_info *di;
+ struct bq27x00_access_methods *bus;
+ int num, retval = 0;
+
+ /* Get new ID for the new battery device */
+ retval = idr_pre_get(&battery_id, GFP_KERNEL);
+ if (retval == 0)
+ return -ENOMEM;
+ mutex_lock(&battery_mutex);
+ retval = idr_get_new(&battery_id, client, &num);
+ mutex_unlock(&battery_mutex);
+ if (retval < 0)
+ return retval;
+
+ name = kmalloc(16, GFP_KERNEL);
+ if (!name) {
+ dev_err(&client->dev, "failed to allocate device name\n");
+ retval = -ENOMEM;
+ goto batt_failed_1;
+ }
+ sprintf(name, "bq27200-%d", num);
+
+ di = kzalloc(sizeof(*di), GFP_KERNEL);
+ if (!di) {
+ dev_err(&client->dev, "failed to allocate device info data\n");
+ retval = -ENOMEM;
+ goto batt_failed_2;
+ }
+ di->id = num;
+
+ bus = kzalloc(sizeof(*bus), GFP_KERNEL);
+ if (!bus) {
+ dev_err(&client->dev, "failed to allocate access method "
+ "data\n");
+ retval = -ENOMEM;
+ goto batt_failed_3;
+ }
+
+ i2c_set_clientdata(client, di);
+ di->dev = &client->dev;
+ di->bat.name = name;
+ bus->read = &bq27200_read;
+ di->bus = bus;
+ di->client = client;
+
+ bq27x00_powersupply_init(di);
+
+ retval = power_supply_register(&client->dev, &di->bat);
+ if (retval) {
+ dev_err(&client->dev, "failed to register battery\n");
+ goto batt_failed_4;
+ }
+
+ dev_info(&client->dev, "support ver. %s enabled\n", DRIVER_VERSION);
+
+ return 0;
+
+batt_failed_4:
+ kfree(bus);
+batt_failed_3:
+ kfree(di);
+batt_failed_2:
+ kfree(name);
+batt_failed_1:
+ mutex_lock(&battery_mutex);
+ idr_remove(&battery_id, num);
+ mutex_unlock(&battery_mutex);
+
+ return retval;
+}
+
+static int bq27200_battery_remove(struct i2c_client *client)
+{
+ struct bq27x00_device_info *di = i2c_get_clientdata(client);
+
+ power_supply_unregister(&di->bat);
+
+ kfree(di->bat.name);
+
+ mutex_lock(&battery_mutex);
+ idr_remove(&battery_id, di->id);
+ mutex_unlock(&battery_mutex);
+
+ kfree(di);
+
+ return 0;
+}
+
+#endif /* CONFIG_BATTERY_BQ27200 */
+
+/*
+ * Module stuff
+ */
+
+#ifdef CONFIG_BATTERY_BQ27000
+
+static struct platform_driver bq27000_battery_driver = {
+ .probe = bq27000_battery_probe,
+ .remove = bq27000_battery_remove,
+
+ .driver = {
+ .name = "bq27000-battery",
+ },
+};
+
+#endif /* CONFIG_BATTERY_BQ27000 */
+
+#ifdef CONFIG_BATTERY_BQ27200
+
+static const struct i2c_device_id bq27200_id[] = {
+ { "bq27200", 0 },
+};
+
+static struct i2c_driver bq27200_battery_driver = {
+ .probe = bq27200_battery_probe,
+ .remove = __devexit_p(bq27200_battery_remove),
+
+ .driver = {
+ .name = "bq27200-battery",
+ },
+ .id_table = bq27200_id,
+};
+
+#endif /* CONFIG_BATTERY_BQ27200 */
+
+static int __init bq27x00_battery_init(void)
+{
+ int status = 0;
+
+#ifdef CONFIG_BATTERY_BQ27000
+ status = platform_driver_register(&bq27000_battery_driver);
+ if (status)
+ printk(KERN_ERR "Unable to register BQ27000 driver\n");
+#endif
+#ifdef CONFIG_BATTERY_BQ27200
+ status = i2c_add_driver(&bq27200_battery_driver);
+ if (status)
+ printk(KERN_ERR "Unable to register BQ27200 driver\n");
+#endif
+ return status;
+}
+
+static void __exit bq27x00_battery_exit(void)
+{
+#ifdef CONFIG_BATTERY_BQ27000
+ platform_driver_unregister(&bq27000_battery_driver);
+#endif
+#ifdef CONFIG_BATTERY_BQ27200
+ i2c_del_driver(&bq27200_battery_driver);
+#endif
+}
+
+module_init(bq27x00_battery_init);
+module_exit(bq27x00_battery_exit);
+
+MODULE_AUTHOR("Texas Instruments");
+MODULE_DESCRIPTION("BQ27x00 battery moniter driver");
+MODULE_LICENSE("GPL");
--
1.5.4.3
I'm not sure this can be called QoS, as it does not guarantee
anything; it only throttles?
> +* The bandwidth limitations are guaranteed both for synchronous and
> + asynchronous operations, even the I/O passing through the page cache or
> + buffers and not only direct I/O (see below for details)
The throttling does not seem to cover the I/O path for XFS?
I was unable to throttle processes reading from an XFS file system.
Also, I think the name of the function cgroup_io_account is a bit too innocent?
It sounds like an inline function "{ io += bytes; }", not like
something which may sleep.
--
Carl Henrik
Did you consider using a token bucket instead of this (leaky bucket?)?
I've attached a patch which implements a token bucket. Although not as
precise as the leaky bucket, it performs better at high-bandwidth
streaming loads.
The leaky bucket stops at around 53 MB/s while the token bucket works for
up to 64 MB/s. The baseline (no cgroups) is 66 MB/s.
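For reference, here is a standalone sketch of the token-bucket logic the
patch below implements (simplified types are assumed; the real patch
accounts in bytes and jiffies): tokens accumulate at the configured rate
up to the bucket size, each request consumes tokens, and a negative
balance translates into a sleep.

/* Standalone sketch of token-bucket accounting (assumed, simplified
 * types; not the patch itself). */
struct bucket {
	long rate;	/* refill rate, bytes per second */
	long size;	/* maximum token balance, bytes */
	long tokens;	/* current balance, may go negative */
	double last;	/* time of last refill, seconds */
};

/* Returns how long the caller should sleep, in seconds. */
static double bucket_account(struct bucket *b, long bytes, double now)
{
	/* Refill for the elapsed time, capped at the bucket size. */
	b->tokens += (long)((now - b->last) * b->rate);
	if (b->tokens > b->size)
		b->tokens = b->size;
	b->last = now;

	/* Charge this request; a debt means the caller must throttle. */
	b->tokens -= bytes;
	return b->tokens < 0 ? (double)-b->tokens / b->rate : 0.0;
}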
benchmark:
two streaming readers (fio) with block size 128k, bucket size 4 MB;
90% of the bandwidth was allocated to one process, the other got 10%
bw-limit: actual bw algorithm bw1 bw2
5 MiB/s: 5.0 MiB/s leaky_bucket 0.5 4.5
5 MiB/s: 5.2 MiB/s token_bucket 0.6 4.6
10 MiB/s: 10.0 MiB/s leaky_bucket 1.0 9.0
10 MiB/s: 10.3 MiB/s token_bucket 1.0 9.2
15 MiB/s: 15.0 MiB/s leaky_bucket 1.5 13.5
15 MiB/s: 15.4 MiB/s token_bucket 1.5 13.8
20 MiB/s: 19.9 MiB/s leaky_bucket 2.0 17.9
20 MiB/s: 20.5 MiB/s token_bucket 2.1 18.4
25 MiB/s: 24.4 MiB/s leaky_bucket 2.5 21.9
25 MiB/s: 25.6 MiB/s token_bucket 2.6 23.0
30 MiB/s: 29.2 MiB/s leaky_bucket 3.0 26.2
30 MiB/s: 30.7 MiB/s token_bucket 3.1 27.7
35 MiB/s: 34.3 MiB/s leaky_bucket 3.4 30.9
35 MiB/s: 35.9 MiB/s token_bucket 3.6 32.3
40 MiB/s: 39.7 MiB/s leaky_bucket 3.9 35.8
40 MiB/s: 41.0 MiB/s token_bucket 4.1 36.9
45 MiB/s: 44.0 MiB/s leaky_bucket 4.3 39.7
45 MiB/s: 46.1 MiB/s token_bucket 4.6 41.5
50 MiB/s: 47.9 MiB/s leaky_bucket 4.7 43.2
50 MiB/s: 51.0 MiB/s token_bucket 5.1 45.9
55 MiB/s: 50.5 MiB/s leaky_bucket 5.0 45.5
55 MiB/s: 56.2 MiB/s token_bucket 5.6 50.5
60 MiB/s: 52.9 MiB/s leaky_bucket 5.2 47.7
60 MiB/s: 61.0 MiB/s token_bucket 6.1 54.9
65 MiB/s: 53.0 MiB/s leaky_bucket 5.4 47.6
65 MiB/s: 63.7 MiB/s token_bucket 6.6 57.1
70 MiB/s: 53.8 MiB/s leaky_bucket 5.5 48.4
70 MiB/s: 64.1 MiB/s token_bucket 7.1 57.0
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index 804df88..9ed0c7c 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -40,7 +40,8 @@ struct iothrottle_node {
struct rb_node node;
dev_t dev;
unsigned long iorate;
- unsigned long req;
+ long bucket_size; /* Max value for t */
+ long t;
unsigned long last_request;
};
@@ -180,18 +181,20 @@ static ssize_t iothrottle_read(struct cgroup *cont,
iothrottle_for_each(n, &iot->tree) {
struct iothrottle_node *node =
rb_entry(n, struct iothrottle_node, node);
- unsigned long delta = (long)jiffies - (long)node->last_request;
+ unsigned long delta = (((long)jiffies - (long)node->last_request) * 1000) / HZ;
BUG_ON(!node->dev);
s += snprintf(s, nbytes - (s - buffer),
"=== device (%u,%u) ===\n"
"bandwidth-max: %lu KiB/sec\n"
- " requested: %lu bytes\n"
- " last request: %lu jiffies\n"
- " delta: %lu jiffies\n",
+ "bucket size : %ld KiB\n"
+ "bucket fill : %ld KiB (after last request)\n"
+ "last request : %lu ms ago\n",
MAJOR(node->dev), MINOR(node->dev),
- node->iorate, node->req,
- node->last_request, delta);
+ node->iorate,
+ node->bucket_size / 1024,
+ node->t / 1024,
+ delta);
}
spin_unlock_irq(&iot->lock);
buffer[nbytes] = '\0';
@@ -220,21 +223,33 @@ static inline dev_t devname2dev_t(const char *buf)
return ret;
}
-static inline int iothrottle_parse_args(char *buf, size_t nbytes,
- dev_t *dev, unsigned long *val)
+static inline int iothrottle_parse_args(char *buf, size_t nbytes, dev_t *dev,
+ unsigned long *iorate,
+ unsigned long *bucket_size)
{
- char *p;
+ char *ioratep, *bucket_sizep;
- p = memchr(buf, ':', nbytes);
- if (!p)
+ ioratep = memchr(buf, ':', nbytes);
+ if (!ioratep)
return -EINVAL;
- *p++ = '\0';
+ *ioratep++ = '\0';
+
+ bucket_sizep = memchr(ioratep, ':', nbytes + ioratep - buf);
+ if (!bucket_sizep)
+ return -EINVAL;
+ *bucket_sizep++ = '\0';
*dev = devname2dev_t(buf);
if (!*dev)
return -ENOTBLK;
- return strict_strtoul(p, 10, val);
+ if (strict_strtoul(ioratep, 10, iorate))
+ return -EINVAL;
+
+ if (strict_strtoul(bucket_sizep, 10, bucket_size))
+ return -EINVAL;
+
+ return 0;
}
static ssize_t iothrottle_write(struct cgroup *cont,
@@ -247,7 +262,7 @@ static ssize_t iothrottle_write(struct cgroup *cont,
struct iothrottle_node *node, *tmpn = NULL;
char *buffer, *tmpp;
dev_t dev;
- unsigned long val;
+ unsigned long iorate, bucket_size;
int ret;
if (unlikely(!nbytes))
@@ -265,7 +280,7 @@ static ssize_t iothrottle_write(struct cgroup *cont,
buffer[nbytes] = '\0';
tmpp = strstrip(buffer);
- ret = iothrottle_parse_args(tmpp, nbytes, &dev, &val);
+ ret = iothrottle_parse_args(tmpp, nbytes, &dev, &iorate, &bucket_size);
if (ret)
goto out1;
@@ -284,7 +299,7 @@ static ssize_t iothrottle_write(struct cgroup *cont,
iot = cgroup_to_iothrottle(cont);
spin_lock_irq(&iot->lock);
- if (!val) {
+ if (!iorate) {
/* Delete a block device limiting rule */
iothrottle_delete_node(iot, dev);
ret = nbytes;
@@ -293,8 +308,9 @@ static ssize_t iothrottle_write(struct cgroup *cont,
node = iothrottle_search_node(iot, dev);
if (node) {
/* Update a block device limiting rule */
- node->iorate = val;
- node->req = 0;
+ node->iorate = iorate;
+ node->bucket_size = bucket_size * 1024;
+ node->t = 0;
node->last_request = jiffies;
ret = nbytes;
goto out3;
@@ -307,8 +323,9 @@ static ssize_t iothrottle_write(struct cgroup *cont,
node = tmpn;
tmpn = NULL;
- node->iorate = val;
- node->req = 0;
+ node->iorate = iorate;
+ node->bucket_size = bucket_size * 1024;
+ node->t = 0;
node->last_request = jiffies;
node->dev = dev;
ret = iothrottle_insert_node(iot, node);
@@ -355,7 +372,7 @@ void cgroup_io_account(struct block_device *bdev, size_t bytes)
{
struct iothrottle *iot;
struct iothrottle_node *node;
- unsigned long delta, t;
+ unsigned long delta;
long sleep;
if (unlikely(!bdev))
@@ -370,36 +387,37 @@ void cgroup_io_account(struct block_device *bdev, size_t bytes)
spin_lock_irq(&iot->lock);
node = iothrottle_search_node(iot, bdev->bd_inode->i_rdev);
- if (!node || !node->iorate)
- goto out;
-
- /* Account the I/O activity */
- node->req += bytes;
+ if (!node || !node->iorate) {
+ spin_unlock_irq(&iot->lock);
+ return;
+ }
- /* Evaluate if we need to throttle the current process */
+ /* Add tokens for time elapsed since last read */
delta = (long)jiffies - (long)node->last_request;
- if (!delta)
- goto out;
+ if (delta) {
+ node->last_request = jiffies;
+ node->t += (node->iorate * 1024 * delta) / HZ;
- t = msecs_to_jiffies(node->req / node->iorate);
- if (!t)
- goto out;
+ if (node->t > node->bucket_size)
+ node->t = node->bucket_size;
+ }
- sleep = t - delta;
- if (unlikely(sleep > 0)) {
- spin_unlock_irq(&iot->lock);
- if (__cant_sleep())
- return;
- pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
- current, current->comm, sleep);
- schedule_timeout_killable(sleep);
- return;
+ /* Account the I/O activity */
+ node->t -= bytes;
+
+ if (node->t < 0) {
+ sleep = (-node->t) * HZ / (node->iorate * 1024);
+ } else {
+ sleep = 0;
}
- /* Reset I/O accounting */
- node->req = 0;
- node->last_request = jiffies;
-out:
spin_unlock_irq(&iot->lock);
+
+ if (sleep && !__cant_sleep()) {
+ pr_debug("io-throttle: %s[%d] must sleep %ld jiffies\n",
+ current->comm, current->pid, sleep);
+
+ schedule_timeout_killable(sleep);
+ }
}
EXPORT_SYMBOL(cgroup_io_account);
That's correct. There's nothing to guarantee minimum bandwidth levels
right now; the "QoS" is implemented only by slowing down I/O "traffic" that
exceeds the limits (probably "I/O traffic shaping" is a better wording).
Minimum thresholds are supposed to be guaranteed if the user configures
a proper I/O bandwidth partitioning of the block devices shared among
the different cgroups. That could mean: the sum of all the single limits
for a device doesn't exceed the total I/O bandwidth of that device (for
example, on a disk that sustains 100 MiB/s, giving three cgroups limits
of 60, 30 and 10 MiB/s)... at least theoretically.
I'll try to clarify this concept better in the documentation that I'll
include in the next patchset version.
I'd also like to explore running the io-throttle controller on top of
other I/O bandwidth controlling solutions (see for example:
http://lkml.org/lkml/2008/4/3/45), in order to exploit the limiting
feature of io-throttle while using priority / fair-queueing algorithms
to guarantee minimum performance levels.
>> +* The bandwidth limitations are guaranteed both for synchronous and
>> + asynchronous operations, even the I/O passing through the page cache or
>> + buffers and not only direct I/O (see below for details)
>
> The throttling does not seem to cover the I/O path for XFS?
> I was unable to throttle processes reading from an XFS file system.
mmmh... works for me. Are you sure you've limited the correct block
device?
> Also I think the name of the function cgroup_io_account is a bit too innocent?
> It sounds like a inline function "{ io += bytes; }", not like
> something which may sleep.
Agreed. What about cgroup_acct_and_throttle_io()? Suggestions?
Thanks,
-Andrea
Interesting! It could be great to have both available at runtime and
just switch between leaky and token bucket, e.g. by echoing "leaky" or
"token" to a file in the cgroup filesystem; ummm, block.limiting-algorithm?
>
> The leaky bucket stops at around 53 MB/s while token bucket works for
> up to 64 MB/s. The baseline (no cgroups) is 66 MB/s.
>
> benchmark:
> two streaming readers (fio) with block size 128k, bucket size 4 MB
> 90% of the bandwidth was allocated to one process, the other gets 10%
Thanks for posting the results. I'll look closely at your patch, test
it as well, and try to merge your work.
I also made some scalability improvements in v2; in particular, I've
replaced the rbtree with a linked list, in order to remove the
spinlocks and use RCU to protect the list structure. I need to run
some stress tests first, but I'll post a v3 soon.
Some minor comments below for now.
> diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
> index 804df88..9ed0c7c 100644
> --- a/block/blk-io-throttle.c
> +++ b/block/blk-io-throttle.c
> @@ -40,7 +40,8 @@ struct iothrottle_node {
> struct rb_node node;
> dev_t dev;
> unsigned long iorate;
> - unsigned long req;
> + long bucket_size; /* Max value for t */
> + long t;
> unsigned long last_request;
> };
>
> @@ -180,18 +181,20 @@ static ssize_t iothrottle_read(struct cgroup *cont,
> iothrottle_for_each(n, &iot->tree) {
> struct iothrottle_node *node =
> rb_entry(n, struct iothrottle_node, node);
> - unsigned long delta = (long)jiffies - (long)node->last_request;
> + unsigned long delta = (((long)jiffies - (long)node->last_request) * 1000) / HZ;
Better to use jiffies_to_msecs() here.
The same here:
node->t += node->iorate * 1024
* jiffies_to_msecs(delta) / MSEC_PER_SEC;
>
> - t = msecs_to_jiffies(node->req / node->iorate);
> - if (!t)
> - goto out;
> + if (node->t > node->bucket_size)
> + node->t = node->bucket_size;
> + }
>
> - sleep = t - delta;
> - if (unlikely(sleep > 0)) {
> - spin_unlock_irq(&iot->lock);
> - if (__cant_sleep())
> - return;
> - pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
> - current, current->comm, sleep);
> - schedule_timeout_killable(sleep);
> - return;
> + /* Account the I/O activity */
> + node->t -= bytes;
> +
> + if (node->t < 0) {
> + sleep = (-node->t) * HZ / (node->iorate * 1024);
And again:
sleep = msecs_to_jiffies((-node->t) * MSEC_PER_SEC
/ (node->iorate * 1024));
> + } else {
> + sleep = 0;
> }
>
> - /* Reset I/O accounting */
> - node->req = 0;
> - node->last_request = jiffies;
> -out:
> spin_unlock_irq(&iot->lock);
> +
> + if (sleep && !__cant_sleep()) {
> + pr_debug("io-throttle: %s[%d] must sleep %ld jiffies\n",
> + current->comm, current->pid, sleep);
> +
> + schedule_timeout_killable(sleep);
> + }
> }
> EXPORT_SYMBOL(cgroup_io_account);
Thanks,
-Andrea
On 32-bit architectures,
u64 val = PAGE_ALIGN(size);
always returns a value below 4GB, even if size is greater than 4GB.
The problem resides in the PAGE_MASK definition (from include/asm-x86/page.h,
for example):
#define PAGE_SHIFT 12
#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
...
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
The "~" is performed on a 32-bit value, so everything in "and" with
PAGE_MASK greater than 4GB will be truncated to the 32-bit boundary.
Using the ALIGN() macro seems to be the right way, because it uses
typeof(addr) for the mask.
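For reference, ALIGN() is defined in include/linux/kernel.h roughly as
follows; since the mask is built in typeof(x), a u64 argument gets a
full 64-bit mask and nothing is truncated:

/* From include/linux/kernel.h (roughly, as of this kernel series). */
#define __ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
#define ALIGN(x, a)		__ALIGN_MASK(x, (typeof(x))(a) - 1)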
See also lkml discussion: http://lkml.org/lkml/2008/6/11/237
Changelog: (v2 -> v3)
- do not move the PAGE_ALIGN() definition into linux/mm.h; fixing it and
leaving it in each page.h seems to be a safer solution right now
Signed-off-by: Andrea Righi <righi....@gmail.com>
Signed-off-by: Andrew Morton <ak...@linux-foundation.org>
---
diff -urpN linux-2.6.26-rc5-mm3/arch/sparc64/kernel/iommu_common.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/arch/sparc64/kernel/iommu_common.h
--- linux-2.6.26-rc5-mm3/arch/sparc64/kernel/iommu_common.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/arch/sparc64/kernel/iommu_common.h 2008-06-19 17:04:52.000000000 +0200
@@ -23,7 +23,7 @@
#define IO_PAGE_SHIFT 13
#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT)
#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1))
-#define IO_PAGE_ALIGN(addr) (((addr)+IO_PAGE_SIZE-1)&IO_PAGE_MASK)
+#define IO_PAGE_ALIGN(addr) ALIGN(addr, IO_PAGE_SIZE)
#define IO_TSB_ENTRIES (128*1024)
#define IO_TSB_SIZE (IO_TSB_ENTRIES * 8)
diff -urpN linux-2.6.26-rc5-mm3/drivers/pci/intel-iommu.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/drivers/pci/intel-iommu.h
--- linux-2.6.26-rc5-mm3/drivers/pci/intel-iommu.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/drivers/pci/intel-iommu.h 2008-06-19 17:05:22.000000000 +0200
@@ -35,7 +35,7 @@
#define PAGE_SHIFT_4K (12)
#define PAGE_SIZE_4K (1UL << PAGE_SHIFT_4K)
#define PAGE_MASK_4K (((u64)-1) << PAGE_SHIFT_4K)
-#define PAGE_ALIGN_4K(addr) (((addr) + PAGE_SIZE_4K - 1) & PAGE_MASK_4K)
+#define PAGE_ALIGN_4K(addr) ALIGN(addr, PAGE_SIZE_4K)
#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT_4K)
#define DMA_32BIT_PFN IOVA_PFN(DMA_32BIT_MASK)
diff -urpN linux-2.6.26-rc5-mm3/include/asm-alpha/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-alpha/page.h
--- linux-2.6.26-rc5-mm3/include/asm-alpha/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-alpha/page.h 2008-06-19 17:07:32.000000000 +0200
@@ -81,7 +81,7 @@ typedef struct page *pgtable_t;
#endif /* !__ASSEMBLY__ */
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#define __pa(x) ((unsigned long) (x) - PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET))
diff -urpN linux-2.6.26-rc5-mm3/include/asm-arm/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-arm/page.h
--- linux-2.6.26-rc5-mm3/include/asm-arm/page.h 2008-06-12 12:35:36.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-arm/page.h 2008-06-19 17:08:04.000000000 +0200
@@ -16,7 +16,7 @@
#define PAGE_MASK (~(PAGE_SIZE-1))
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#ifndef __ASSEMBLY__
diff -urpN linux-2.6.26-rc5-mm3/include/asm-arm/page-nommu.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-arm/page-nommu.h
--- linux-2.6.26-rc5-mm3/include/asm-arm/page-nommu.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-arm/page-nommu.h 2008-06-19 17:08:40.000000000 +0200
@@ -43,7 +43,7 @@ typedef unsigned long pgprot_t;
#define __pgprot(x) (x)
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
extern unsigned long memory_start;
extern unsigned long memory_end;
diff -urpN linux-2.6.26-rc5-mm3/include/asm-avr32/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-avr32/page.h
--- linux-2.6.26-rc5-mm3/include/asm-avr32/page.h 2008-06-12 12:35:36.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-avr32/page.h 2008-06-19 17:09:30.000000000 +0200
@@ -58,7 +58,7 @@ static inline int get_order(unsigned lon
#endif /* !__ASSEMBLY__ */
/* Align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
/*
* The hardware maps the virtual addresses 0x80000000 -> 0x9fffffff
diff -urpN linux-2.6.26-rc5-mm3/include/asm-blackfin/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-blackfin/page.h
--- linux-2.6.26-rc5-mm3/include/asm-blackfin/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-blackfin/page.h 2008-06-19 17:09:52.000000000 +0200
@@ -52,7 +52,7 @@ typedef struct page *pgtable_t;
#define __pgprot(x) ((pgprot_t) { (x) } )
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
extern unsigned long memory_start;
extern unsigned long memory_end;
diff -urpN linux-2.6.26-rc5-mm3/include/asm-cris/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-cris/page.h
--- linux-2.6.26-rc5-mm3/include/asm-cris/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-cris/page.h 2008-06-19 17:10:08.000000000 +0200
@@ -61,7 +61,7 @@ typedef struct page *pgtable_t;
#define page_to_phys(page) __pa((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#ifndef __ASSEMBLY__
diff -urpN linux-2.6.26-rc5-mm3/include/asm-frv/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-frv/page.h
--- linux-2.6.26-rc5-mm3/include/asm-frv/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-frv/page.h 2008-06-19 17:10:27.000000000 +0200
@@ -41,7 +41,7 @@ typedef struct page *pgtable_t;
#define PTE_MASK PAGE_MASK
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#define devmem_is_allowed(pfn) 1
diff -urpN linux-2.6.26-rc5-mm3/include/asm-h8300/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-h8300/page.h
--- linux-2.6.26-rc5-mm3/include/asm-h8300/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-h8300/page.h 2008-06-19 17:10:50.000000000 +0200
@@ -44,7 +44,7 @@ typedef struct page *pgtable_t;
#define __pgprot(x) ((pgprot_t) { (x) } )
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
extern unsigned long memory_start;
extern unsigned long memory_end;
diff -urpN linux-2.6.26-rc5-mm3/include/asm-ia64/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-ia64/page.h
--- linux-2.6.26-rc5-mm3/include/asm-ia64/page.h 2008-06-12 12:35:36.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-ia64/page.h 2008-06-19 17:11:09.000000000 +0200
@@ -40,7 +40,7 @@
#define PAGE_SIZE (__IA64_UL_CONST(1) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE - 1))
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#define PERCPU_PAGE_SHIFT 16 /* log2() of max. size of per-CPU area */
#define PERCPU_PAGE_SIZE (__IA64_UL_CONST(1) << PERCPU_PAGE_SHIFT)
diff -urpN linux-2.6.26-rc5-mm3/include/asm-m32r/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-m32r/page.h
--- linux-2.6.26-rc5-mm3/include/asm-m32r/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-m32r/page.h 2008-06-19 17:11:25.000000000 +0200
@@ -42,7 +42,7 @@ typedef struct page *pgtable_t;
#endif /* !__ASSEMBLY__ */
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
/*
* This handles the memory map.. We could make this a config
diff -urpN linux-2.6.26-rc5-mm3/include/asm-m68k/dvma.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-m68k/dvma.h
--- linux-2.6.26-rc5-mm3/include/asm-m68k/dvma.h 2008-06-12 12:38:05.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-m68k/dvma.h 2008-06-19 17:11:58.000000000 +0200
@@ -13,7 +13,7 @@
#define DVMA_PAGE_SHIFT 13
#define DVMA_PAGE_SIZE (1UL << DVMA_PAGE_SHIFT)
#define DVMA_PAGE_MASK (~(DVMA_PAGE_SIZE-1))
-#define DVMA_PAGE_ALIGN(addr) (((addr)+DVMA_PAGE_SIZE-1)&DVMA_PAGE_MASK)
+#define DVMA_PAGE_ALIGN(addr) ALIGN(addr, DVMA_PAGE_SIZE)
extern void dvma_init(void);
extern int dvma_map_iommu(unsigned long kaddr, unsigned long baddr,
diff -urpN linux-2.6.26-rc5-mm3/include/asm-m68k/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-m68k/page.h
--- linux-2.6.26-rc5-mm3/include/asm-m68k/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-m68k/page.h 2008-06-19 17:12:57.000000000 +0200
@@ -104,7 +104,7 @@ typedef struct page *pgtable_t;
#define __pgprot(x) ((pgprot_t) { (x) } )
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#endif /* !__ASSEMBLY__ */
diff -urpN linux-2.6.26-rc5-mm3/include/asm-m68knommu/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-m68knommu/page.h
--- linux-2.6.26-rc5-mm3/include/asm-m68knommu/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-m68knommu/page.h 2008-06-19 17:12:37.000000000 +0200
@@ -44,7 +44,7 @@ typedef struct page *pgtable_t;
#define __pgprot(x) ((pgprot_t) { (x) } )
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
extern unsigned long memory_start;
extern unsigned long memory_end;
diff -urpN linux-2.6.26-rc5-mm3/include/asm-mips/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-mips/page.h
--- linux-2.6.26-rc5-mm3/include/asm-mips/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-mips/page.h 2008-06-19 17:13:17.000000000 +0200
@@ -135,7 +135,7 @@ typedef struct { unsigned long pgprot; }
#endif /* !__ASSEMBLY__ */
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
/*
* __pa()/__va() should be used only during mem init.
diff -urpN linux-2.6.26-rc5-mm3/include/asm-mn10300/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-mn10300/page.h
--- linux-2.6.26-rc5-mm3/include/asm-mn10300/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-mn10300/page.h 2008-06-19 17:13:34.000000000 +0200
@@ -62,7 +62,7 @@ typedef struct page *pgtable_t;
#endif /* !__ASSEMBLY__ */
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
/*
* This handles the memory map.. We could make this a config
diff -urpN linux-2.6.26-rc5-mm3/include/asm-parisc/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-parisc/page.h
--- linux-2.6.26-rc5-mm3/include/asm-parisc/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-parisc/page.h 2008-06-19 17:13:48.000000000 +0200
@@ -120,7 +120,7 @@ extern int npmem_ranges;
#define PTE_ENTRY_SIZE (1UL << BITS_PER_PTE_ENTRY)
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#define LINUX_GATEWAY_SPACE 0
diff -urpN linux-2.6.26-rc5-mm3/include/asm-s390/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-s390/page.h
--- linux-2.6.26-rc5-mm3/include/asm-s390/page.h 2008-06-12 12:35:37.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-s390/page.h 2008-06-19 17:14:45.000000000 +0200
@@ -139,7 +139,7 @@ void arch_alloc_page(struct page *page,
#endif /* !__ASSEMBLY__ */
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#define __PAGE_OFFSET 0x0UL
#define PAGE_OFFSET 0x0UL
diff -urpN linux-2.6.26-rc5-mm3/include/asm-sh/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-sh/page.h
--- linux-2.6.26-rc5-mm3/include/asm-sh/page.h 2008-06-12 12:38:06.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-sh/page.h 2008-06-19 17:15:01.000000000 +0200
@@ -25,7 +25,7 @@
#define PTE_MASK PAGE_MASK
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#if defined(CONFIG_HUGETLB_PAGE_SIZE_64K)
#define HPAGE_SHIFT 16
diff -urpN linux-2.6.26-rc5-mm3/include/asm-sparc/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-sparc/page.h
--- linux-2.6.26-rc5-mm3/include/asm-sparc/page.h 2008-06-12 12:35:37.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-sparc/page.h 2008-06-19 17:15:18.000000000 +0200
@@ -137,7 +137,7 @@ BTFIXUPDEF_SETHI(sparc_unmapped_base)
#endif /* !(__ASSEMBLY__) */
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#define PAGE_OFFSET 0xf0000000
#ifndef __ASSEMBLY__
diff -urpN linux-2.6.26-rc5-mm3/include/asm-sparc64/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-sparc64/page.h
--- linux-2.6.26-rc5-mm3/include/asm-sparc64/page.h 2008-06-12 12:35:40.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-sparc64/page.h 2008-06-19 17:15:33.000000000 +0200
@@ -111,7 +111,7 @@ typedef struct page *pgtable_t;
#endif /* !(__ASSEMBLY__) */
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
/* We used to stick this into a hard-coded global register (%g4)
* but that does not make sense anymore.
diff -urpN linux-2.6.26-rc5-mm3/include/asm-um/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-um/page.h
--- linux-2.6.26-rc5-mm3/include/asm-um/page.h 2008-06-12 12:38:06.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-um/page.h 2008-06-19 17:15:48.000000000 +0200
@@ -93,7 +93,7 @@ typedef struct page *pgtable_t;
#define __pgprot(x) ((pgprot_t) { (x) } )
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
extern unsigned long uml_physmem;
diff -urpN linux-2.6.26-rc5-mm3/include/asm-x86/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-x86/page.h
--- linux-2.6.26-rc5-mm3/include/asm-x86/page.h 2008-06-12 12:38:07.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-x86/page.h 2008-06-19 17:16:04.000000000 +0200
@@ -32,7 +32,7 @@
#define HUGE_MAX_HSTATE 2
/* to align the pointer to the (next) page boundary */
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#ifndef __ASSEMBLY__
#include <linux/types.h>
diff -urpN linux-2.6.26-rc5-mm3/include/asm-xtensa/page.h linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-xtensa/page.h
--- linux-2.6.26-rc5-mm3/include/asm-xtensa/page.h 2008-04-17 04:49:44.000000000 +0200
+++ linux-2.6.26-rc5-mm3-fix-64-bit-page-align/include/asm-xtensa/page.h 2008-06-19 17:16:28.000000000 +0200
@@ -32,7 +32,7 @@
#define PAGE_SHIFT 12
#define PAGE_SIZE (__XTENSA_UL_CONST(1) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
-#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE - 1) & PAGE_MASK)
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
#define PAGE_OFFSET XCHAL_KSEG_CACHED_VADDR
#define MAX_MEM_PFN XCHAL_KSEG_SIZE
Before answering the other questions I'd like to solve the following
issue.
On Fri, Jun 20, 2008 at 03:34:11AM +0400, Anton Vorontsov wrote:
> To avoid these ifdefs, I would suggest you reconsider the model
> of this driver. How about:
>
> - drivers/power/bq27x00_battery.c:
>
> Registers two platform_drivers: the first is "bq27200-bat" (I2C) and
> the second is "bq27000-bat" (W1). Both use the same probe and
> remove methods. This driver will simply register the power_supply
> and will do all the battery logic, separated from the underlying
> bus.
>
> (Two platform drivers are used simply because the underlying
> I2C/W1 drivers will not know about each other, and thus will not be
> able to pass a unique platform_device.id in the case of a single driver).
>
> - include/linux/bq27x00-battery.h:
>
> Declares
> struct bq27x00_access_methods {
> /*
> * dev argument is I2C or W1 device, battery driver will use
> * platform_device->dev.parent to pass the dev argument.
> */
> int (*read)(struct device *dev, u8 reg, int *rt_value, int b_single);
> };
Where should this function be defined? Can you explain in more detail
what you mean?
>
> - drivers/i2c/chips/bq27200.c:
>
> This driver will do all I2C stuff, and then will register
> "bq27200-bat" platform device, with .platform_data pointed
> to the allocated and filled "struct bq27x00_access_methods".
>
> - drivers/w1/slaves/bq27000.c:
>
> This driver will do all W1 stuff, and then will register
> "bq27000-bat" platform device, with .platform_data pointed
> to the allocated and filled "struct bq27x00_access_methods".
>
> Will this (not) work?
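For illustration, a minimal sketch of what the bus-side half of this
proposal might look like (all names below are hypothetical and only
illustrate the idea; they are not from any posted patch):

/* Hypothetical sketch of the proposed split: the I2C chip driver owns
 * the bus I/O and hands the generic battery driver a read() callback
 * through platform_data. Names are assumed, not from a real patch. */
#include <linux/i2c.h>
#include <linux/platform_device.h>
#include <linux/bq27x00-battery.h>	/* the proposed header above */

static int bq27200_bus_read(struct device *dev, u8 reg,
			    int *rt_value, int b_single)
{
	struct i2c_client *client = to_i2c_client(dev);

	if (!client->adapter)
		return -ENODEV;
	/* ... same i2c_transfer() sequence as in bq27200_read() ... */
	*rt_value = 0;
	return 0;
}

static struct bq27x00_access_methods bq27200_access = {
	.read = bq27200_bus_read,
};

static int bq27200_i2c_probe(struct i2c_client *client,
			     const struct i2c_device_id *id)
{
	struct platform_device *pdev;

	pdev = platform_device_alloc("bq27200-bat", client->addr);
	if (!pdev)
		return -ENOMEM;
	/* The battery driver would read through pdev->dev.parent. */
	pdev->dev.parent = &client->dev;
	pdev->dev.platform_data = &bq27200_access;
	return platform_device_add(pdev);
}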
As you can see at the top of the file, I based this work on previous
work by Texas Instruments people, which never went into vanilla Linux
but lives in the linux-omap tree. I just fixed some basic issues which
prevented the driver from compiling and working properly, and I'm
reproposing the driver here.
My hardware only uses the I2C chip version, so I cannot test the W1 one
at all. If you agree I can remove the W1 code and provide only the
bq27200.c driver.
Maybe, as you suggested above, we can try to write the code so that a
future W1 driver writer only needs to write a few lines. :)
Ciao,
Rodolfo
--
GNU/Linux Solutions e-mail: giom...@enneenne.com
Linux Device Driver giom...@linux.it
Embedded Systems phone: +39 349 2432127
UNIX programming skype: rodolfo.giometti
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/blk-core.c | 2 ++
fs/buffer.c | 10 ++++++++++
fs/direct-io.c | 3 +++
mm/page-writeback.c | 9 +++++++++
mm/readahead.c | 5 +++++
5 files changed, 29 insertions(+), 0 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 1905aab..8eddef5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/blktrace_api.h>
@@ -1482,6 +1483,7 @@ void submit_bio(int rw, struct bio *bio)
count_vm_events(PGPGOUT, count);
} else {
task_io_account_read(bio->bi_size);
+ cgroup_io_throttle(bio->bi_bdev, bio->bi_size);
count_vm_events(PGPGIN, count);
}
diff --git a/fs/buffer.c b/fs/buffer.c
index a073f3f..c63dfe7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -35,6 +35,7 @@
#include <linux/suspend.h>
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
@@ -700,6 +701,9 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
static int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ struct block_device *bdev = NULL;
+ size_t cgroup_io_acct = 0;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);
@@ -711,16 +715,22 @@ static int __set_page_dirty(struct page *page,
WARN_ON_ONCE(warn && !PageUptodate(page));
if (mapping_cap_account_dirty(mapping)) {
+ bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ cgroup_io_throttle(bdev, cgroup_io_acct);
return 1;
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9e81add..fe991ac 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/blk-io-throttle.h>
#include <asm/atomic.h>
/*
@@ -666,6 +667,8 @@ submit_page_section(struct dio *dio, struct page *page,
/*
* Read accounting is performed in submit_bio()
*/
+ struct block_device *bdev = dio->bio ? dio->bio->bi_bdev : NULL;
+ cgroup_io_throttle(bdev, len);
task_io_account_write(len);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 789b6ad..a2b820d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1077,6 +1078,8 @@ int __set_page_dirty_nobuffers(struct page *page)
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;
+ struct block_device *bdev = NULL;
+ size_t cgroup_io_acct = 0;
if (!mapping)
return 1;
@@ -1087,10 +1090,15 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
if (mapping_cap_account_dirty(mapping)) {
+ bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -1100,6 +1108,7 @@ int __set_page_dirty_nobuffers(struct page *page)
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
+ cgroup_io_throttle(bdev, cgroup_io_acct);
return 1;
}
return 0;
diff --git a/mm/readahead.c b/mm/readahead.c
index d8723a5..dff6b02 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
@@ -58,6 +59,9 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
+ struct block_device *bdev =
+ (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
int ret = 0;
while (!list_empty(pages)) {
@@ -76,6 +80,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_throttle(bdev, PAGE_CACHE_SIZE);
}
return ret;
}
--
1.5.4.3
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/Makefile | 2 +
block/blk-io-throttle.c | 393 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 12 ++
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 10 +
5 files changed, 424 insertions(+), 0 deletions(-)
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h
diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..8dec69b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -14,3 +14,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..4ec02bb
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,394 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <righi....@gmail.com>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/blk-io-throttle.h>
+
+#define ONE_SEC 1000000L /* # of microseconds in a second */
+#define KBS(x) ((x) * ONE_SEC >> 10)	/* bytes/usec -> KiB/sec */
+
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+ unsigned long iorate;
+ unsigned long timestamp;
+ atomic_long_t stat;
+};
+
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ /* protects the list below, not the single elements */
+ spinlock_t lock;
+ struct list_head list;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle_node *iothrottle_search_node(
+ const struct iothrottle *iot,
+ dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+static inline void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ list_add_rcu(&n->node, &iot->list);
+}
+
+static inline struct iothrottle_node *iothrottle_replace_node(
+ struct iothrottle *iot,
+ struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ list_replace_rcu(&old->node, &new->node);
+ return old;
+}
+
+static inline struct iothrottle_node *iothrottle_delete_node(
+ struct iothrottle *iot,
+ dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ list_for_each_entry(n, &iot->list, node)
+ if (n->dev == dev) {
+ list_del_rcu(&n->node);
+ return n;
+ }
+ return NULL;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+ struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct iothrottle *iot;
+
+ iot = kmalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&iot->list);
+ spin_lock_init(&iot->lock);
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct iothrottle_node *n, *p;
+ struct iothrottle *iot = cgroup_to_iothrottle(cont);
+
+ /*
+	 * don't worry about locking here; at this point there cannot be any
+	 * reference to the list.
+ */
+ list_for_each_entry_safe(n, p, &iot->list, node)
+ kfree(n);
+ kfree(iot);
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont,
+ struct cftype *cft,
+ struct file *file,
+ char __user *userbuf,
+ size_t nbytes,
+ loff_t *ppos)
+{
+ struct iothrottle *iot;
+ char *buffer;
+ int s = 0;
+ struct iothrottle_node *n;
+ ssize_t ret;
+
+ buffer = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+ rcu_read_lock();
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ unsigned long delta, rate;
+
+ BUG_ON(!n->dev);
+ delta = jiffies_to_usecs((long)jiffies - (long)n->timestamp);
+ rate = delta ? KBS(atomic_long_read(&n->stat) / delta) : 0;
+ s += scnprintf(buffer + s, nbytes - s,
+ "=== device (%u,%u) ===\n"
+ " bandwidth limit: %lu KiB/sec\n"
+ "current i/o usage: %lu KiB/sec\n",
+ MAJOR(n->dev), MINOR(n->dev),
+ n->iorate, rate);
+ }
+ rcu_read_unlock();
+ ret = simple_read_from_buffer(userbuf, nbytes, ppos, buffer, s);
+out:
+ cgroup_unlock();
+ kfree(buffer);
+ return ret;
+}
+
+static inline dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t ret;
+
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+
+ BUG_ON(!bdev->bd_inode);
+ ret = bdev->bd_inode->i_rdev;
+ bdput(bdev);
+
+ return ret;
+}
+
+static inline int iothrottle_parse_args(char *buf, size_t nbytes,
+ dev_t *dev, unsigned long *val)
+{
+ char *p;
+
+ p = memchr(buf, ':', nbytes);
+ if (!p)
+ return -EINVAL;
+ *p++ = '\0';
+
+	/* I/O bandwidth is expressed in KiB/s */
+ *val = ALIGN(memparse(p, &p), 1024) >> 10;
+ if (*p)
+ return -EINVAL;
+
+ *dev = devname2dev_t(buf);
+ if (!*dev)
+ return -ENOTBLK;
+
+ return 0;
+}
+
+static ssize_t iothrottle_write(struct cgroup *cont,
+ struct cftype *cft,
+ struct file *file,
+ const char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *tmpn = NULL;
+ char *buffer, *tmpp;
+ dev_t dev;
+ unsigned long val;
+ int ret;
+
+ if (!nbytes)
+ return -EINVAL;
+
+ /* Upper limit on largest io-throttle rule string user might write. */
+ if (nbytes > 1024)
+ return -E2BIG;
+
+ buffer = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ if (copy_from_user(buffer, userbuf, nbytes)) {
+ ret = -EFAULT;
+ goto out1;
+ }
+
+ buffer[nbytes] = '\0';
+ tmpp = strstrip(buffer);
+
+ ret = iothrottle_parse_args(tmpp, nbytes, &dev, &val);
+ if (ret)
+ goto out1;
+
+ if (val) {
+ tmpn = kmalloc(sizeof(*tmpn), GFP_KERNEL);
+ if (!tmpn) {
+ ret = -ENOMEM;
+ goto out1;
+ }
+ atomic_long_set(&tmpn->stat, 0);
+ tmpn->timestamp = jiffies;
+ tmpn->iorate = val;
+ tmpn->dev = dev;
+ }
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+		ret = -ENODEV;
+		kfree(tmpn);
+		goto out2;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+ spin_lock(&iot->lock);
+ if (!val) {
+ /* Delete a block device limiting rule */
+ n = iothrottle_delete_node(iot, dev);
+ goto out3;
+ }
+ n = iothrottle_search_node(iot, dev);
+ if (n) {
+ /* Update a block device limiting rule */
+ iothrottle_replace_node(iot, n, tmpn);
+ goto out3;
+ }
+ /* Add a new block device limiting rule */
+ iothrottle_insert_node(iot, tmpn);
+out3:
+ ret = nbytes;
+ spin_unlock(&iot->lock);
+ if (n) {
+ synchronize_rcu();
+ kfree(n);
+ }
+out2:
+ cgroup_unlock();
+out1:
+ kfree(buffer);
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth",
+ .read = iothrottle_read,
+ .write = iothrottle_write,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+};
+
+static inline int __cant_sleep(void)
+{
+ return in_atomic() || in_interrupt() || irqs_disabled();
+}
+
+void cgroup_io_throttle(struct block_device *bdev, size_t bytes)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n;
+ unsigned long delta, t;
+ long sleep;
+
+ if (unlikely(!bdev || !bytes))
+ return;
+
+ iot = task_to_iothrottle(current);
+ if (unlikely(!iot))
+ return;
+
+ BUG_ON(!bdev->bd_inode);
+
+ rcu_read_lock();
+ n = iothrottle_search_node(iot, bdev->bd_inode->i_rdev);
+ if (!n || !n->iorate)
+ goto out;
+
+ /* Account the i/o activity */
+ atomic_long_add(bytes, &n->stat);
+
+ /* Evaluate if we need to throttle the current process */
+ delta = (long)jiffies - (long)n->timestamp;
+ if (!delta)
+ goto out;
+
+ t = usecs_to_jiffies(KBS(atomic_long_read(&n->stat) / n->iorate));
+ if (!t)
+ goto out;
+
+ sleep = t - delta;
+ if (unlikely(sleep > 0)) {
+ rcu_read_unlock();
+ if (__cant_sleep())
+ return;
+		pr_debug("io-throttle: task %p (%s) must sleep %ld jiffies\n",
+			 current, current->comm, sleep);
+ schedule_timeout_killable(sleep);
+ return;
+ }
+ /* Reset i/o statistics */
+ atomic_long_set(&n->stat, 0);
+ /*
+	 * NOTE: be sure the I/O statistics have been reset before updating the
+	 * timestamp, otherwise another CPU could see the old I/O statistics
+	 * together with a very small time delta and compute an unnecessarily
+	 * long sleep.
+ */
+ smp_wmb();
+ n->timestamp = jiffies;
+out:
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(cgroup_io_throttle);
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..3e08738
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,12 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void cgroup_io_throttle(struct block_device *bdev, size_t bytes);
+#else
+static inline void cgroup_io_throttle(struct block_device *bdev, size_t bytes)
+{
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif /* BLK_IO_THROTTLE_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e287745..0caf3c2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -48,3 +48,9 @@ SUBSYS(devices)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index 6199d11..3117d99 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -306,6 +306,16 @@ config CGROUP_DEVICE
Provides a cgroup implementing whitelists for devices which
a process in the cgroup can mknod or open.
+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+ depends on CGROUPS && EXPERIMENTAL
+ help
+	  This allows limiting the maximum I/O bandwidth for specific
+	  cgroup(s).
+ See Documentation/controllers/io-throttle.txt for more information.
+
+ If unsure, say N.
+
config CPUSETS
bool "Cpuset support"
depends on SMP && CGROUPS
Compared to other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate I/O activity in the system.
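To make the leaky bucket behaviour concrete: with a 4MiB/s limit (iorate =
4096 KiB/s), a cgroup that generates 8MiB of I/O in one second has consumed
two seconds worth of its allowance, so cgroup_io_throttle() computes
sleep = t - delta = 1 second and puts the offending task to sleep for that
long before letting it proceed.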
The direct bandwidth limiting method has the advantage of improving
performance predictability at the cost of reducing, in general, the overall
throughput of the system.
Detailed information about the design, its goals and usage is provided in the
documentation.
Tested against latest git (2.6.26-rc6).
The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
Changelog: (v2 -> v3)
- scalability improvement: replaced the rbtree structure with a linked list
  to store multiple per block device I/O limiting rules; this makes it
  possible to use RCU to protect the whole list structure, since the elements
  in the list are supposed to change rarely (this also provides zero overhead
  for cgroups that don't use any I/O limitation)
- improved user interface
- now it's possible to specify a suffix k, K, m, M, g, G to express
bandwidth values in KB/s, MB/s or GB/s
- current per block device I/O usage is reported in blockio.bandwidth
- renamed cgroup_io_account() to cgroup_io_throttle()
- updated the documentation
TODO:
- implement I/O throttling using a token bucket algorithm, as suggested by
  Carl Henrik Lunde, in addition to the current leaky bucket approach (a rough
  sketch of the idea is below)
- provide a modular interface to switch between different I/O throttling
  algorithms at run-time
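A minimal, illustrative sketch of the token bucket idea (token_bucket_sleep(),
and the tokens/bucket_size fields it relies on, are hypothetical here, not
part of this patchset):

/*
 * Refill the bucket according to the time elapsed since the last visit,
 * consume tokens for the current request and return how long the caller
 * should sleep to pay back any token deficit (0 = no throttling needed).
 */
static long token_bucket_sleep(struct iothrottle_node *n, size_t bytes)
{
	long delta = (long)jiffies - (long)n->timestamp;

	/* refill: jiffies elapsed -> KiB earned at n->iorate KiB/s */
	n->tokens += jiffies_to_msecs(delta) * n->iorate / MSEC_PER_SEC;
	if (n->tokens > n->bucket_size)
		n->tokens = n->bucket_size;
	n->timestamp = jiffies;

	/* consume tokens for this request, in KiB */
	n->tokens -= bytes >> 10;
	if (n->tokens >= 0)
		return 0;
	/* sleep long enough to earn the missing tokens back */
	return msecs_to_jiffies(-n->tokens * MSEC_PER_SEC / n->iorate);
}

Unlike the leaky bucket, bursts up to bucket_size pass unthrottled, which
tends to be friendlier to interactive workloads.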
-Andrea
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
Documentation/controllers/io-throttle.txt | 163 +++++++++++++++++++++++++++++
1 files changed, 163 insertions(+), 0 deletions(-)
create mode 100644 Documentation/controllers/io-throttle.txt
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..e1df98a
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,163 @@
+
+ Block device I/O bandwidth controller
+
+1. Description
+
+This controller makes it possible to limit the I/O bandwidth of specific
+block devices for specific process containers (cgroups), imposing additional
+delays on I/O requests for those processes that exceed the limits defined in
+the control group filesystem.
+
+Bandwidth limiting rules offer better control over QoS with respect to priority
+or weight-based solutions that only give information about applications'
+relative performance requirements.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and QoS of the different control groups sharing the same block
+devices.
+
+NOTE #1: if you're looking for a way to improve the overall throughput of the
+system probably you should use a different solution.
+
+NOTE #2: the current implementation does not guarantee minimum bandwidth
+levels, the QoS is implemented only slowing down i/o "traffic" that exceeds the
+limits specified by the user. Minimum i/o rate thresholds are supposed to be
+guaranteed if the user configures a proper i/o bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically if the sum of
+all the single limits defined for a block device doesn't exceed the total i/o
+bandwidth of that device).
+
+2. User Interface
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+The same file can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The syntax is the following:
+# /bin/echo DEVICE:BANDWIDTH > CGROUP/blockio.bandwidth
+
+- DEVICE is the name of the device the limiting rule is applied to,
+- BANDWIDTH is the maximum I/O bandwidth on DEVICE allowed by CGROUP (we can
+ use a suffix k, K, m, M, g or G to indicate bandwidth values in KB/s, MB/s
+ or GB/s),
+- CGROUP is the name of the limited process container.
+
+Examples:
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda1 for the cgroup "foo":
+ # /bin/echo /dev/sda1:1M > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda1 (blockio.bandwidth is expressed in
+ KiB/s).
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sda5 for the cgroup "foo":
+ # /bin/echo /dev/sda5:8M > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda1 and 8MiB/s on /dev/sda5.
+  NOTE: each partition needs its own limitation rule! For example, the
+  other partitions of /dev/sda remain unlimited for cgroup "foo".
+
+* Run a benchmark doing I/O on /dev/sda1 and /dev/sda5; I/O limits and usage
+ defined for cgroup "foo" can be shown as following:
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ === device (8,1) ===
+ bandwidth limit: 1024 KiB/sec
+ current i/o usage: 819 KiB/sec
+ === device (8,5) ===
+ bandwidth limit: 1024 KiB/sec
+ current i/o usage: 3102 KiB/sec
+
+ Devices are reported using (major, minor) numbers when reading
+ blockio.bandwidth.
+
+ The corresponding device names can be retrieved in /proc/diskstats (or in
+ other places as well).
+
+ For example to find the name of the device (8,5):
+ # sed -ne 's/^ \+8 \+5 \([^ ]\+\).*/\1/p' /proc/diskstats
+ sda5
+
+ Current I/O usage can be greater than bandwidth limit, this means the i/o
+ controller is going to impose the limitation.
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 8MiB/s:
+  # /bin/echo /dev/sda1:8M > /mnt/cgroup/foo/blockio.bandwidth
+
+* Remove limiting rule on /dev/sda1 for cgroup "foo":
+  # /bin/echo /dev/sda1:0 > /mnt/cgroup/foo/blockio.bandwidth
+
+3. Advantages of providing this feature
+
+* Allow I/O traffic shaping for block devices shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+ different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+ deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are guaranteed both for synchronous and
+  asynchronous operations, even for I/O passing through the page cache or
+  buffers, and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+ adjust the I/O workload of different process containers at run-time,
+ according to the particular users' requirements and applications' performance
+ constraints
+* It is even possible to implement event-based performance throttling
+ mechanisms; for example the same user-space application could actively
+ throttle the I/O bandwidth to reduce power consumption when the battery of a
+ mobile device is running low (power throttling) or when the temperature of a
+ hardware component is too high (thermal throttling)
+* Provides zero overhead for non block device I/O bandwidth controller users
+
+4. Design
+
+The I/O throttling is performed by imposing an explicit timeout, via
+schedule_timeout_killable(), on the processes that exceed the I/O bandwidth
+dedicated to the cgroup they belong to. I/O accounting happens per cgroup.
+
+It just works as expected for read operations: the real I/O activity is reduced
+synchronously according to the defined limitations.
+
+Write operations, instead, are modeled depending on the dirty pages ratio
+(write throttling in memory), since the writes to the real block devices are
+processed asynchronously by different kernel threads (pdflush). However, the
+dirty pages ratio is directly proportional to the actual I/O that will be
+performed on the real block device. So, due to the asynchronous transfers
+through the page cache, the I/O throttling in memory can be considered a form
+of anticipatory throttling to the underlying block devices.
+
+Multiple re-writes in already dirtied page cache areas are not considered for
+accounting the I/O activity. This is valid for multiple re-reads of pages
+already present in the page cache as well.
+
+This means that a process that re-writes and/or re-reads multiple times the
+same blocks in a file (without re-creating it by truncate(), ftruncate(),
+creat(), etc.) is affected by the I/O limitations only for the actual I/O
+performed to (or from) the underlying block devices.
+
+Multiple rules for different block devices are stored in a linked list, using
+the dev_t number of each block device as key to uniquely identify each element
+of the list. RCU synchronization is used to protect the whole list structure,
+since the elements in the list are not supposed to change frequently (they
+change only when a new rule is defined or an old rule is removed or updated),
+while reads of the list occur on each operation that generates I/O. This
+provides zero overhead for cgroups that do not use any limitation.
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (i.e. a USB device) the limiting rules
+associated to that device persist and they are still valid if a new device is
+plugged in the system and it uses the same major and minor numbers.
--
1.5.4.3
How can this work? You will limit the number of available buffer heads
per second?
Unfortunately, the problem is the fs above the block device. If the
block device is artificially slowed then the fs will still happily allow
a process to fill buffers forever until memory is full, while the block
device continues to trickle the buffers away.
What one wants is for the fs buffering to be linked to the underlying
block device i/o speed. One wants the rate at which fs buffers are
filled to be no more than (modulo brief spurts) the rate at which the
device operates.
That way networked block devices have a chance of having some memory
left to send the dirty buffers out to the net with. B/w limiting the
device itself doesn't seem to me to do any good.
Peter
> +NOTE #1: if you're looking for a way to improve the overall throughput of the
I would s/if/If/
> +system probably you should use a different solution.
> +
> +NOTE #2: the current implementation does not guarantee minimum bandwidth
s/the/The/
> +levels, the QoS is implemented only slowing down i/o "traffic" that exceeds the
Please consistently use "I/O" instead of "i/o".
The above comma makes a run-on sentence. A period or semi-colon would be better IMO.
Ugh, this makes it look like the output does "pretty printing" (formatting),
which is generally not a good idea. Let some app be responsible for that,
not the kernel. Basically this means don't use leading spaces just to make the
":"s line up in the output.
> +
> + Devices are reported using (major, minor) numbers when reading
> + blockio.bandwidth.
> +
> + The corresponding device names can be retrieved in /proc/diskstats (or in
> + other places as well).
> +
> + For example to find the name of the device (8,5):
> + # sed -ne 's/^ \+8 \+5 \([^ ]\+\).*/\1/p' /proc/diskstats
> + sda5
> +
> + Current I/O usage can be greater than bandwidth limit, this means the i/o
Run-on sentence. Change , to . (with This) or use ;
associated with (?)
> +plugged in the system and it uses the same major and minor numbers.
> --
---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/
diff --git a/Documentation/Intel-IOMMU.txt b/Documentation/Intel-IOMMU.txt
index c232190..21bc416 100644
--- a/Documentation/Intel-IOMMU.txt
+++ b/Documentation/Intel-IOMMU.txt
@@ -48,7 +48,7 @@ IOVA generation is pretty generic. We used the same technique as vmalloc()
but these are not global address spaces, but separate for each domain.
Different DMA engines may support different number of domains.
-We also allocate gaurd pages with each mapping, so we can attempt to catch
+We also allocate guard pages with each mapping, so we can attempt to catch
any overflow that might happen.
@@ -112,4 +112,4 @@ TBD
- For compatibility testing, could use unity map domain for all devices, just
provide a 1-1 for all useful memory under a single domain for all devices.
-- API for paravirt ops for abstracting functionlity for VMM folks.
+- API for paravirt ops for abstracting functionality for VMM folks.
diff --git a/Documentation/accounting/taskstats-struct.txt b/Documentation/accounting/taskstats-struct.txt
index 8aa7529..3d0e4f2 100644
--- a/Documentation/accounting/taskstats-struct.txt
+++ b/Documentation/accounting/taskstats-struct.txt
@@ -6,7 +6,7 @@ This document contains an explanation of the struct taskstats fields.
There are three different groups of fields in the struct taskstats:
1) Common and basic accounting fields
- If CONFIG_TASKSTATS is set, the taskstats inteface is enabled and
+ If CONFIG_TASKSTATS is set, the taskstats interface is enabled and
the common fields and basic accounting fields are collected for
delivery at do_exit() of a task.
2) Delay accounting fields
diff --git a/Documentation/cpu-freq/governors.txt b/Documentation/cpu-freq/governors.txt
index dcec056..5b0cfa6 100644
--- a/Documentation/cpu-freq/governors.txt
+++ b/Documentation/cpu-freq/governors.txt
@@ -122,7 +122,7 @@ around '10000' or more.
show_sampling_rate_(min|max): the minimum and maximum sampling rates
available that you may set 'sampling_rate' to.
-up_threshold: defines what the average CPU usaged between the samplings
+up_threshold: defines what the average CPU usage between the samplings
of 'sampling_rate' needs to be for the kernel to make a decision on
whether it should increase the frequency. For example when it is set
to its default value of '80' it means that between the checking
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index a5c3684..658c654 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -392,7 +392,7 @@ Sdram memory scrubbing rate:
'sdram_scrub_rate'
Read/Write attribute file that controls memory scrubbing. The scrubbing
- rate is set by writing a minimum bandwith in bytes/sec to the attribute
+ rate is set by writing a minimum bandwidth in bytes/sec to the attribute
file. The rate will be translated to an internal value that gives at
least the specified rate.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dbc3c6a..7845ec3 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -880,7 +880,7 @@ group_prealloc max_to_scan mb_groups mb_history min_to_scan order2_req
stats stream_req
mb_groups:
-This file gives the details of mutiblock allocator buddy cache of free blocks
+This file gives the details of multiblock allocator buddy cache of free blocks
mb_history:
Multiblock allocation history.
@@ -1423,7 +1423,7 @@ used because pages_free(1355) is smaller than watermark + protection[2]
normal page requirement. If requirement is DMA zone(index=0), protection[0]
(=0) is used.
-zone[i]'s protection[j] is calculated by following exprssion.
+zone[i]'s protection[j] is calculated by following expression.
(i < j):
zone[i]->protection[j]
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index b7522c6..c4d348d 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -143,7 +143,7 @@ struct file_system_type {
The get_sb() method has the following arguments:
- struct file_system_type *fs_type: decribes the filesystem, partly initialized
+ struct file_system_type *fs_type: describes the filesystem, partly initialized
by the specific filesystem code
int flags: mount flags
@@ -895,9 +895,9 @@ struct dentry_operations {
iput() yourself
d_dname: called when the pathname of a dentry should be generated.
- Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+ Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay
pathname generation. (Instead of doing it when dentry is created,
- its done only when the path is needed.). Real filesystems probably
+ it's done only when the path is needed.). Real filesystems probably
dont want to use it, because their dentries are present in global
dcache hash, so their hash should be an invariant. As no lock is
held, d_dname() should not try to modify the dentry itself, unless
diff --git a/Documentation/ia64/kvm.txt b/Documentation/ia64/kvm.txt
index bec9d81..914d07f 100644
--- a/Documentation/ia64/kvm.txt
+++ b/Documentation/ia64/kvm.txt
@@ -50,9 +50,9 @@ Note: For step 2, please make sure that host page size == TARGET_PAGE_SIZE of qe
/usr/local/bin/qemu-system-ia64 -smp xx -m 512 -hda $your_image
(xx is the number of virtual processors for the guest, now the maximum value is 4)
-5. Known possibile issue on some platforms with old Firmware.
+5. Known possible issue on some platforms with old Firmware.
-If meet strange host crashe issues, try to solve it through either of the following ways:
+In the event of strange host crash issues, try to solve them through either of the following ways:
(1): Upgrade your Firmware to the latest one.
@@ -65,8 +65,8 @@ index 0b53344..f02b0f7 100644
mov ar.pfs = loc1
mov rp = loc0
;;
-- srlz.d // seralize restoration of psr.l
-+ srlz.i // seralize restoration of psr.l
+- srlz.d // serialize restoration of psr.l
++ srlz.i // serialize restoration of psr.l
+ ;;
br.ret.sptk.many b0
END(ia64_pal_call_static)
diff --git a/Documentation/input/cs461x.txt b/Documentation/input/cs461x.txt
index afe0d65..202e9db 100644
--- a/Documentation/input/cs461x.txt
+++ b/Documentation/input/cs461x.txt
@@ -31,7 +31,7 @@ The driver works with ALSA drivers simultaneously. For example, the xracer
uses joystick as input device and PCM device as sound output in one time.
There are no sound or input collisions detected. The source code have
comments about them; but I've found the joystick can be initialized
-separately of ALSA modules. So, you canm use only one joystick driver
+separately of ALSA modules. So, you can use only one joystick driver
without ALSA drivers. The ALSA drivers are not needed to compile or
run this driver.
diff --git a/Documentation/ioctl/ioctl-decoding.txt b/Documentation/ioctl/ioctl-decoding.txt
index bfdf7f3..e35efb0 100644
--- a/Documentation/ioctl/ioctl-decoding.txt
+++ b/Documentation/ioctl/ioctl-decoding.txt
@@ -1,6 +1,6 @@
To decode a hex IOCTL code:
-Most architecures use this generic format, but check
+Most architectures use this generic format, but check
include/ARCH/ioctl.h for specifics, e.g. powerpc
uses 3 bits to encode read/write and 13 bits for size.
@@ -18,7 +18,7 @@ uses 3 bits to encode read/write and 13 bits for size.
7-0 function #
- So for example 0x82187201 is a read with arg length of 0x218,
+So for example 0x82187201 is a read with arg length of 0x218,
character 'r' function 1. Grepping the source reveals this is:
#define VFAT_IOCTL_READDIR_BOTH _IOR('r', 1, struct dirent [2])
diff --git a/Documentation/iostats.txt b/Documentation/iostats.txt
index 5925c3c..59a69ec 100644
--- a/Documentation/iostats.txt
+++ b/Documentation/iostats.txt
@@ -143,7 +143,7 @@ disk and partition statistics are consistent again. Since we still don't
keep record of the partition-relative address, an operation is attributed to
the partition which contains the first sector of the request after the
eventual merges. As requests can be merged across partition, this could lead
-to some (probably insignificant) innacuracy.
+to some (probably insignificant) inaccuracy.
Additional notes
----------------
diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index b8e52c0..6ec901c 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -108,7 +108,7 @@ There are two possible methods of using Kdump.
2) Or use the system kernel binary itself as dump-capture kernel and there is
no need to build a separate dump-capture kernel. This is possible
- only with the architecutres which support a relocatable kernel. As
+ only with the architectures which support a relocatable kernel. As
of today i386 and ia64 architectures support relocatable kernel.
Building a relocatable kernel is advantageous from the point of view that
diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index d5c7a57..b56aacc 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -864,7 +864,7 @@ payload contents" for more information.
request_key_with_auxdata() respectively.
These two functions return with the key potentially still under
- construction. To wait for contruction completion, the following should be
+ construction. To wait for construction completion, the following should be
called:
int wait_for_key_construction(struct key *key, bool intr);
diff --git a/Documentation/leds-class.txt b/Documentation/leds-class.txt
index 18860ad..6399557 100644
--- a/Documentation/leds-class.txt
+++ b/Documentation/leds-class.txt
@@ -59,7 +59,7 @@ Hardware accelerated blink of LEDs
Some LEDs can be programmed to blink without any CPU interaction. To
support this feature, a LED driver can optionally implement the
-blink_set() function (see <linux/leds.h>). If implemeted, triggers can
+blink_set() function (see <linux/leds.h>). If implemented, triggers can
attempt to use it before falling back to software timers. The blink_set()
function should return 0 if the blink setting is supported, or -EINVAL
otherwise, which means that LED blinking will be handled by software.
diff --git a/Documentation/local_ops.txt b/Documentation/local_ops.txt
index 4269a11..f4f8b1c 100644
--- a/Documentation/local_ops.txt
+++ b/Documentation/local_ops.txt
@@ -36,7 +36,7 @@ It can be done by slightly modifying the standard atomic operations : only
their UP variant must be kept. It typically means removing LOCK prefix (on
i386 and x86_64) and any SMP sychronization barrier. If the architecture does
not have a different behavior between SMP and UP, including asm-generic/local.h
-in your archtecture's local.h is sufficient.
+in your architecture's local.h is sufficient.
The local_t type is defined as an opaque signed long by embedding an
atomic_long_t inside a structure. This is made so a cast from this type to a
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index a0cda06..549a38f 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -305,7 +305,7 @@ fail_over_mac
traffic to update its tables) for the traditional method. If
the gratuitous ARP is lost, communication may be disrupted.
- When fail over MAC is used in conjuction with the mii monitor,
+ When fail over MAC is used in conjunction with the mii monitor,
devices which assert link up prior to being able to actually
transmit and receive are particularly susecptible to loss of
the gratuitous ARP, and an appropriate updelay setting may be
@@ -581,7 +581,7 @@ xmit_hash_policy
in environments where a layer3 gateway device is
required to reach most destinations.
- This algorithm is 802.3ad complient.
+ This algorithm is 802.3ad compliant.
layer3+4
diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt
index 641d2af..297ba7b 100644
--- a/Documentation/networking/can.txt
+++ b/Documentation/networking/can.txt
@@ -186,7 +186,7 @@ solution for a couple of reasons:
The Linux network devices (by default) just can handle the
transmission and reception of media dependent frames. Due to the
- arbritration on the CAN bus the transmission of a low prio CAN-ID
+ arbitration on the CAN bus the transmission of a low prio CAN-ID
may be delayed by the reception of a high prio CAN frame. To
reflect the correct* traffic on the node the loopback of the sent
data has to be performed right after a successful transmission. If
@@ -481,7 +481,7 @@ solution for a couple of reasons:
- stats_timer: To calculate the Socket CAN core statistics
(e.g. current/maximum frames per second) this 1 second timer is
invoked at can.ko module start time by default. This timer can be
- disabled by using stattimer=0 on the module comandline.
+ disabled by using stattimer=0 on the module commandline.
- debug: (removed since SocketCAN SVN r546)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 17a6e46..a6af9e1 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -150,7 +150,7 @@ tcp_available_congestion_control - STRING
tcp_base_mss - INTEGER
The initial value of search_low to be used by Packetization Layer
Path MTU Discovery (MTU probing). If MTU probing is enabled,
- this is the inital MSS used by the connection.
+ this is the initial MSS used by the connection.
tcp_congestion_control - STRING
Set the congestion control algorithm to be used for new
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index db0cd51..07c53d5 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -326,7 +326,7 @@ just one call to mmap is needed:
mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
If tp_frame_size is a divisor of tp_block_size frames will be
-contiguosly spaced by tp_frame_size bytes. If not, each
+contiguously spaced by tp_frame_size bytes. If not, each
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot be spawn across two
blocks.
diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt
index 01e716d..dcadf6f 100644
--- a/Documentation/networking/tc-actions-env-rules.txt
+++ b/Documentation/networking/tc-actions-env-rules.txt
@@ -4,26 +4,27 @@ The "enviromental" rules for authors of any new tc actions are:
1) If you stealeth or borroweth any packet thou shalt be branching
from the righteous path and thou shalt cloneth.
-For example if your action queues a packet to be processed later
-or intentionaly branches by redirecting a packet then you need to
+For example if your action queues a packet to be processed later,
+or intentionally branches by redirecting a packet, then you need to
clone the packet.
+
There are certain fields in the skb tc_verd that need to be reset so we
-avoid loops etc. A few are generic enough so much so that skb_act_clone()
-resets them for you. So invoke skb_act_clone() rather than skb_clone()
+avoid loops, etc. A few are generic enough that skb_act_clone()
+resets them for you, so invoke skb_act_clone() rather than skb_clone().
2) If you munge any packet thou shalt call pskb_expand_head in the case
someone else is referencing the skb. After that you "own" the skb.
You must also tell us if it is ok to munge the packet (TC_OK2MUNGE),
this way any action downstream can stomp on the packet.
-3) dropping packets you dont own is a nono. You simply return
+3) Dropping packets you don't own is a no-no. You simply return
TC_ACT_SHOT to the caller and they will drop it.
The "enviromental" rules for callers of actions (qdiscs etc) are:
-*) thou art responsible for freeing anything returned as being
+*) Thou art responsible for freeing anything returned as being
TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
-returned then all is great and you dont need to do anything.
+returned, then all is great and you don't need to do anything.
Post on netdev if something is unclear.
diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt
index 1d2a772..fe32a3c 100644
--- a/Documentation/powerpc/booting-without-of.txt
+++ b/Documentation/powerpc/booting-without-of.txt
@@ -704,7 +704,7 @@ device or bus to be described by the device tree.
In general, the format of an address for a device is defined by the
parent bus type, based on the #address-cells and #size-cells
properties. Note that the parent's parent definitions of #address-cells
-and #size-cells are not inhereted so every node with children must specify
+and #size-cells are not inherited so every node with children must specify
them. The kernel requires the root node to have those properties defining
addresses format for devices directly mapped on the processor bus.
@@ -1820,7 +1820,7 @@ platforms are moved over to use the flattened-device-tree model.
viii) Uploaded QE firmware
- If a new firwmare has been uploaded to the QE (usually by the
+ If a new firmware has been uploaded to the QE (usually by the
boot loader), then a 'firmware' child node should be added to the QE
node. This node provides information on the uploaded firmware that
device drivers may need.
@@ -1956,7 +1956,7 @@ platforms are moved over to use the flattened-device-tree model.
reg = <119c0 30>;
}
- ii) Properties common to mulitple CPM/QE devices
+ ii) Properties common to multiple CPM/QE devices
- fsl,cpm-command : This value is ORed with the opcode and command flag
to specify the device on which a CPM command operates.
@@ -2584,7 +2584,7 @@ platforms are moved over to use the flattened-device-tree model.
Xilinx uartlite devices are simple fixed speed serial ports.
- Requred properties:
+ Required properties:
- current-speed : Baud rate of uartlite
v) Xilinx hwicap
@@ -2606,7 +2606,7 @@ platforms are moved over to use the flattened-device-tree model.
Xilinx UART 16550 devices are very similar to the NS16550 but with
different register spacing and an offset from the base address.
- Requred properties:
+ Required properties:
- clock-frequency : Frequency of the clock input
- reg-offset : A value of 3 is required
- reg-shift : A value of 2 is required
@@ -2883,7 +2883,7 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd.
1) The /system-controller node
This node is used to represent the system-controller and must be
- present when the system uses a system contller chip. The top-level
+ present when the system uses a system controller chip. The top-level
system-controller node contains information that is global to all
devices within the system controller chip. The node name begins
with "system-controller" followed by the unit address, which is
diff --git a/Documentation/powerpc/qe_firmware.txt b/Documentation/powerpc/qe_firmware.txt
index 8962664..06da4d4 100644
--- a/Documentation/powerpc/qe_firmware.txt
+++ b/Documentation/powerpc/qe_firmware.txt
@@ -217,7 +217,7 @@ Although it is not recommended, you can specify '0' in the soc.model
field to skip matching SOCs altogether.
The 'model' field is a 16-bit number that matches the actual SOC. The
-'major' and 'minor' fields are the major and minor revision numbrs,
+'major' and 'minor' fields are the major and minor revision numbers,
respectively, of the SOC.
For example, to match the 8323, revision 1.0:
diff --git a/Documentation/s390/driver-model.txt b/Documentation/s390/driver-model.txt
index e938c44..bde473d 100644
--- a/Documentation/s390/driver-model.txt
+++ b/Documentation/s390/driver-model.txt
@@ -25,7 +25,7 @@ device 4711 via subchannel 1 in subchannel set 0, and subchannel 2 is a non-I/O
subchannel. Device 1234 is accessed via subchannel 0 in subchannel set 1.
The subchannel named 'defunct' does not represent any real subchannel on the
-system; it is a pseudo subchannel where disconnnected ccw devices are moved to
+system; it is a pseudo subchannel where disconnected ccw devices are moved to
if they are displaced by another ccw device becoming operational on their
former subchannel. The ccw devices will be moved again to a proper subchannel
if they become operational again on that subchannel.
diff --git a/Documentation/scsi/ibmmca.txt b/Documentation/scsi/ibmmca.txt
index a810421..3920f28 100644
--- a/Documentation/scsi/ibmmca.txt
+++ b/Documentation/scsi/ibmmca.txt
@@ -524,7 +524,7 @@
- Michael Lang
June 25 1997: (v1.8b)
- 1) Some cosmetical changes for the handling of SCSI-device-types.
+ 1) Some cosmetic changes for the handling of SCSI-device-types.
Now, also CD-Burners / WORMs and SCSI-scanners should work. For
MO-drives I have no experience, therefore not yet supported.
In logical_devices I changed from different type-variables to one
@@ -914,7 +914,7 @@
in version 4.0. This was never really necessary, as all troubles were
based on non-command related reasons up to now, so bypassing commands
did not help to avoid any bugs. It is kept in 3.2X for debugging reasons.
- 5) Dynamical reassignment of ldns was again verified and analyzed to be
+ 5) Dynamic reassignment of ldns was again verified and analyzed to be
completely inoperational. This is corrected and should work now.
6) All commands that get sent to the SCSI adapter were verified and
completed in such a way, that they are now completely conform to the
@@ -1386,7 +1386,7 @@
concerning the Linux-kernel in special, this SCSI-driver comes without any
warranty. Its functionality is tested as good as possible on certain
machines and combinations of computer hardware, which does not exclude,
- that dataloss or severe damage of hardware is possible while using this
+ that data loss or severe damage of hardware is possible while using this
part of software on some arbitrary computer hardware or in combination
with other software packages. It is highly recommended to make backup
copies of your data before using this software. Furthermore, personal
diff --git a/Documentation/scsi/lpfc.txt b/Documentation/scsi/lpfc.txt
index 4dbe413..5741ea8 100644
--- a/Documentation/scsi/lpfc.txt
+++ b/Documentation/scsi/lpfc.txt
@@ -36,7 +36,7 @@ Cable pull and temporary device Loss:
being removed, a switch rebooting, or a device reboot), the driver could
hide the disappearance of the device from the midlayer. I/O's issued to
the LLDD would simply be queued for a short duration, allowing the device
- to reappear or link come back alive, with no inadvertant side effects
+ to reappear or link come back alive, with no inadvertent side effects
to the system. If the driver did not hide these conditions, i/o would be
errored by the driver, the mid-layer would exhaust its retries, and the
device would be taken offline. Manual intervention would be required to
diff --git a/Documentation/scsi/scsi_fc_transport.txt b/Documentation/scsi/scsi_fc_transport.txt
index d403e46..75143f0 100644
--- a/Documentation/scsi/scsi_fc_transport.txt
+++ b/Documentation/scsi/scsi_fc_transport.txt
@@ -65,7 +65,7 @@ Overview:
discussion will concentrate on NPIV.
Note: World Wide Name assignment (and uniqueness guarantees) are left
- up to an administrative entity controling the vport. For example,
+ up to an administrative entity controlling the vport. For example,
if vports are to be associated with virtual machines, a XEN mgmt
utility would be responsible for creating wwpn/wwnn's for the vport,
using it's own naming authority and OUI. (Note: it already does this
@@ -91,7 +91,7 @@ Device Trees and Vport Objects:
Here's what to expect in the device tree :
The typical Physical Port's Scsi_Host:
/sys/devices/.../host17/
- and it has the typical decendent tree:
+ and it has the typical descendant tree:
/sys/devices/.../host17/rport-17:0-0/target17:0:0/17:0:0:0:
and then the vport is created on the Physical Port:
/sys/devices/.../host17/vport-17:0-0
@@ -192,7 +192,7 @@ Vport States:
independent of the adapter's link state.
- Instantiation of the vport on the FC link via ELS traffic, etc.
This is equivalent to a "link up" and successfull link initialization.
- Futher information can be found in the interfaces section below for
+ Further information can be found in the interfaces section below for
Vport Creation.
Once a vport has been instantiated with the kernel/LLDD, a vport state
diff --git a/Documentation/sh/clk.txt b/Documentation/sh/clk.txt
index 9aef710..114b595 100644
--- a/Documentation/sh/clk.txt
+++ b/Documentation/sh/clk.txt
@@ -12,7 +12,7 @@ means no changes to adjanced clock
Internally, the clk_set_rate_ex forwards request to clk->ops->set_rate method,
if it is present in ops structure. The method should set the clock rate and adjust
all needed clocks according to the passed algo_id.
-Exact values for algo_id are machine-dependend. For the sh7722, the following
+Exact values for algo_id are machine-dependent. For the sh7722, the following
values are defined:
NO_CHANGE = 0,
diff --git a/Documentation/sound/alsa/Audiophile-Usb.txt b/Documentation/sound/alsa/Audiophile-Usb.txt
index 2ad5e63..a4c53d8 100644
--- a/Documentation/sound/alsa/Audiophile-Usb.txt
+++ b/Documentation/sound/alsa/Audiophile-Usb.txt
@@ -236,15 +236,15 @@ The parameter can be given:
alias snd-card-1 snd-usb-audio
options snd-usb-audio index=1 device_setup=0x09
-CAUTION when initializaing the device
+CAUTION when initializing the device
-------------------------------------
* Correct initialization on the device requires that device_setup is given to
the module BEFORE the device is turned on. So, if you use the "manual probing"
method described above, take care to power-on the device AFTER this initialization.
- * Failing to respect this will lead in a misconfiguration of the device. In this case
- turn off the device, unproble the snd-usb-audio module, then probe it again with
+ * Failing to respect this will lead to a misconfiguration of the device. In this case
+ turn off the device, unprobe the snd-usb-audio module, then probe it again with
correct device_setup parameter and then (and only then) turn on the device again.
* If you've correctly initialized the device in a valid mode and then want to switch
@@ -388,9 +388,9 @@ There are 2 main potential issues when using Jackd with the device:
Jack supports big endian devices only in recent versions (thanks to
Andreas Steinmetz for his first big-endian patch). I can't remember
-extacly when this support was released into jackd, let's just say that
+exactly when this support was released into jackd, let's just say that
with jackd version 0.103.0 it's almost ok (just a small bug is affecting
-16bits Big-Endian devices, but since you've read carefully the above
+16bits Big-Endian devices, but since you've read carefully the above
paragraphs, you're now using kernel >= 2.6.23 and your 16bits devices
are now Little Endians ;-) ).
diff --git a/Documentation/sound/alsa/hda_codec.txt b/Documentation/sound/alsa/hda_codec.txt
index 8e1b025..34e87ec 100644
--- a/Documentation/sound/alsa/hda_codec.txt
+++ b/Documentation/sound/alsa/hda_codec.txt
@@ -67,7 +67,7 @@ CONFIG_SND_HDA_POWER_SAVE kconfig. It's called when the codec needs
to power up or may power down. The controller should check the all
belonging codecs on the bus whether they are actually powered off
(check codec->power_on), and optionally the driver may power down the
-contoller side, too.
+controller side, too.
The bus instance is created via snd_hda_bus_new(). You need to pass
the card instance, the template, and the pointer to store the
diff --git a/Documentation/sound/alsa/soc/dapm.txt b/Documentation/sound/alsa/soc/dapm.txt
index c784a18..b2ed698 100644
--- a/Documentation/sound/alsa/soc/dapm.txt
+++ b/Documentation/sound/alsa/soc/dapm.txt
@@ -68,7 +68,7 @@ Audio DAPM widgets fall into a number of types:-
(Widgets are defined in include/sound/soc-dapm.h)
Widgets are usually added in the codec driver and the machine driver. There are
-convience macros defined in soc-dapm.h that can be used to quickly build a
+convenience macros defined in soc-dapm.h that can be used to quickly build a
list of widgets of the codecs and machines DAPM widgets.
Most widgets have a name, register, shift and invert. Some widgets have extra
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 8a4863c..d79eeda 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -116,7 +116,7 @@ of kilobytes free. The VM uses this number to compute a pages_min
value for each lowmem zone in the system. Each lowmem zone gets
a number of reserved free pages based proportionally on its size.
-Some minimal ammount of memory is needed to satisfy PF_MEMALLOC
+Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.
diff --git a/Documentation/timers/highres.txt b/Documentation/timers/highres.txt
index a73ecf5..2133223 100644
--- a/Documentation/timers/highres.txt
+++ b/Documentation/timers/highres.txt
@@ -125,7 +125,7 @@ increase of flexibility and the avoidance of duplicated code across
architectures justifies the slight increase of the binary size.
The conversion of an architecture has no functional impact, but allows to
-utilize the high resolution and dynamic tick functionalites without any change
+utilize the high resolution and dynamic tick functionalities without any change
to the clock event device and timer interrupt code. After the conversion the
enabling of high resolution timers and dynamic ticks is simply provided by
adding the kernel/time/Kconfig file to the architecture specific Kconfig and
diff --git a/Documentation/usb/authorization.txt b/Documentation/usb/authorization.txt
index 2af4006..381b22e 100644
--- a/Documentation/usb/authorization.txt
+++ b/Documentation/usb/authorization.txt
@@ -8,7 +8,7 @@ not) in a system. This feature will allow you to implement a lock-down
of USB devices, fully controlled by user space.
As of now, when a USB device is connected it is configured and
-it's interfaces inmediately made available to the users. With this
+its interfaces are immediately made available to the users. With this
modification, only if root authorizes the device to be configured will
then it be possible to use it.
diff --git a/Documentation/video4linux/sn9c102.txt b/Documentation/video4linux/sn9c102.txt
index b26f519..73de405 100644
--- a/Documentation/video4linux/sn9c102.txt
+++ b/Documentation/video4linux/sn9c102.txt
@@ -157,7 +157,7 @@ Loading can be done as shown below:
[root@localhost home]# modprobe sn9c102
-Note that the module is called "sn9c102" for historic reasons, althought it
+Note that the module is called "sn9c102" for historic reasons, although it
does not just support the SN9C102.
At this point all the devices supported by the driver and connected to the USB
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index 3102b81..899b343 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -77,7 +77,7 @@ memory that is preset in system at this time. System administrators may want
to put this command in one of the local rc init files. This will enable the
kernel to request huge pages early in the boot process (when the possibility
of getting physical contiguous pages is still very high). In either
-case, adminstrators will want to verify the number of hugepages actually
+case, administrators will want to verify the number of hugepages actually
allocated by checking the sysctl or meminfo.
/proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index bad16d3..6aaaeb3 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -58,7 +58,7 @@ most general to most specific:
the policy at the time they were allocated.
VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
- virtual adddress space. A task may define a specific policy for a range
+ virtual address space. A task may define a specific policy for a range
of its virtual address space. See the MEMORY POLICIES APIS section,
below, for an overview of the mbind() system call used to set a VMA
policy.
@@ -353,7 +353,7 @@ follows:
Because of this extra reference counting, and because we must lookup
shared policies in a tree structure under spinlock, shared policies are
- more expensive to use in the page allocation path. This is expecially
+ more expensive to use in the page allocation path. This is especially
true for shared policies on shared memory regions shared by tasks running
on different NUMA nodes. This extra overhead can be avoided by always
falling back to task or system default policy for shared memory regions,
diff --git a/Documentation/volatile-considered-harmful.txt b/Documentation/volatile-considered-harmful.txt
index 10c2e41..991c26a 100644
--- a/Documentation/volatile-considered-harmful.txt
+++ b/Documentation/volatile-considered-harmful.txt
@@ -114,6 +114,6 @@ CREDITS
Original impetus and research by Randy Dunlap
Written by Jonathan Corbet
-Improvements via coments from Satyam Sharma, Johannes Stezenbach, Jesper
+Improvements via comments from Satyam Sharma, Johannes Stezenbach, Jesper
Juhl, Heikki Orsila, H. Peter Anvin, Philipp Hahn, and Stefan
Richter.
diff --git a/drivers/message/fusion/lsi/mpi_history.txt b/drivers/message/fusion/lsi/mpi_history.txt
index 241592a..3f15fcf 100644
--- a/drivers/message/fusion/lsi/mpi_history.txt
+++ b/drivers/message/fusion/lsi/mpi_history.txt
@@ -127,7 +127,7 @@ mpi_ioc.h
* 08-08-01 01.02.01 Original release for v1.2 work.
* New format for FWVersion and ProductId in
* MSG_IOC_FACTS_REPLY and MPI_FW_HEADER.
- * 08-31-01 01.02.02 Addded event MPI_EVENT_SCSI_DEVICE_STATUS_CHANGE and
+ * 08-31-01 01.02.02 Added event MPI_EVENT_SCSI_DEVICE_STATUS_CHANGE and
* related structure and defines.
* Added event MPI_EVENT_ON_BUS_TIMER_EXPIRED.
* Added MPI_IOCINIT_FLAGS_DISCARD_FW_IMAGE.
@@ -187,7 +187,7 @@ mpi_ioc.h
* 10-11-06 01.05.12 Added MPI_IOCFACTS_EXCEPT_METADATA_UNSUPPORTED.
* Added MaxInitiators field to PortFacts reply.
* Added SAS Device Status Change ReasonCode for
- * asynchronous notificaiton.
+ * asynchronous notification.
* Added MPI_EVENT_SAS_EXPANDER_STATUS_CHANGE and event
* data structure.
* Added new ImageType values for FWDownload and FWUpload
@@ -213,7 +213,7 @@ mpi_cnfg.h
* Added _RESPONSE_ID_MASK definition to SCSI_PORT_1
* page and updated the page version.
* Added Information field and _INFO_PARAMS_NEGOTIATED
- * definitionto SCSI_DEVICE_0 page.
+ * definition to SCSI_DEVICE_0 page.
* 06-22-00 01.00.03 Removed batch controls from LAN_0 page and updated the
* page version.
* Added BucketsRemaining to LAN_1 page, redefined the
--
1.5.4.3
Randy Dunlap wrote:
>> +* Run a benchmark doing I/O on /dev/sda1 and /dev/sda5; I/O limits and usage
>> + defined for cgroup "foo" can be shown as following:
>> + # cat /mnt/cgroup/foo/blockio.bandwidth
>> + === device (8,1) ===
>> + bandwidth limit: 1024 KiB/sec
>> + current i/o usage: 819 KiB/sec
>> + === device (8,5) ===
>> + bandwidth limit: 1024 KiB/sec
>> + current i/o usage: 3102 KiB/sec
>
> Ugh, this makes it look like the output does "pretty printing" (formatting),
> which is generally not a good idea. Let some app be responsible for that,
> not the kernel. Basically this means don't use leading spaces just to make the
> ":"s line up in the output.
Sounds reasonable. I think the output could be reduced further;
the following format should be self-explanatory:
device: %u,%u
bandwidth: %lu KiB/sec
usage: %lu KiB/sec
>> +WARNING: per-block device limiting rules always refer to the dev_t device
>> +number. If a block device is unplugged (i.e. a USB device) the limiting rules
>> +associated to that device persist and they are still valid if a new device is
>
> associated with (?)
what about:
...the limiting rules defined for that device...
-Andrea
Carl,
based on your token bucket solution I've implemented a run-time leaky
bucket / token bucket switcher:
# leaky bucket #
echo 0 > /cgroups/foo/blockio.throttling_strategy
# token bucket #
echo 1 > /cgroups/foo/blockio.throttling_strategy
The -rc of the new io-throttle patch 2/3 is below; 1/3 and 3/3 are the
same as in patchset version 3, though the documentation still needs to be
updated. It would be great if you could review the patch, in particular
the token_bucket() implementation, and repeat your tests.
The all-in-one patch is available here:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/cgroup-io-throttle-v4-rc1.patch
I also ran some quick tests similar to yours; the benchmark I used is
available here as well:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/benchmark/iobw.c
= Results =
I/O scheduler: cfq
filesystem: ext3
Command: ionice -c 1 -n 0 iobw -direct 2 4m 32m
Bucket size: 4MiB
=== no throttling ===
testing 2 parallel streams, chunk_size 4096KiB, data_size 32768KiB
[task 2] time: 2.929, bw: 10742 KiB/s (WRITE)
[task 2] time: 2.878, bw: 10742 KiB/s (READ )
[task 1] time: 2.377, bw: 13671 KiB/s (WRITE)
[task 1] time: 3.979, bw: 7812 KiB/s (READ )
[parent 0] time: 6.397, bw: 19531 KiB/s (TOTAL)
=== bandwidth limit: 4MiB/s (leaky bucket) ===
[task 2] time: 15.880, bw: 1953 KiB/s (WRITE)
[task 2] time: 14.278, bw: 1953 KiB/s (READ )
[task 1] time: 14.711, bw: 1953 KiB/s (WRITE)
[task 1] time: 16.563, bw: 1953 KiB/s (READ )
[parent 0] time: 31.316, bw: 3906 KiB/s (TOTAL)
=== bandwidth limit: 4MiB/s (token bucket) ===
[task 2] time: 11.864, bw: 1953 KiB/s (WRITE)
[task 2] time: 15.958, bw: 1953 KiB/s (READ )
[task 1] time: 19.233, bw: 976 KiB/s (WRITE)
[task 1] time: 12.643, bw: 1953 KiB/s (READ )
[parent 0] time: 31.917, bw: 3906 KiB/s (TOTAL)
=== bandwidth limit: 8MiB/s (leaky bucket) ===
[task 2] time: 7.198, bw: 3906 KiB/s (WRITE)
[task 2] time: 8.012, bw: 3906 KiB/s (READ )
[task 1] time: 7.891, bw: 3906 KiB/s (WRITE)
[task 1] time: 7.846, bw: 3906 KiB/s (READ )
[parent 0] time: 15.780, bw: 7812 KiB/s (TOTAL)
=== bandwidth limit: 8MiB/s (token bucket) ===
[task 1] time: 6.996, bw: 3906 KiB/s (WRITE)
[task 1] time: 6.529, bw: 4882 KiB/s (READ )
[task 2] time: 10.341, bw: 2929 KiB/s (WRITE)
[task 2] time: 5.681, bw: 4882 KiB/s (READ )
[parent 0] time: 16.079, bw: 7812 KiB/s (TOTAL)
=== bandwidth limit: 12MiB/s (leaky bucket) ===
[task 2] time: 4.992, bw: 5859 KiB/s (WRITE)
[task 2] time: 5.077, bw: 5859 KiB/s (READ )
[task 1] time: 5.500, bw: 5859 KiB/s (WRITE)
[task 1] time: 5.061, bw: 5859 KiB/s (READ )
[parent 0] time: 10.603, bw: 11718 KiB/s (TOTAL)
=== bandwidth limit: 12MiB/s (token bucket) ===
[task 1] time: 5.057, bw: 5859 KiB/s (WRITE)
[task 1] time: 4.329, bw: 6835 KiB/s (READ )
[task 2] time: 5.771, bw: 4882 KiB/s (WRITE)
[task 2] time: 4.961, bw: 5859 KiB/s (READ )
[parent 0] time: 10.786, bw: 11718 KiB/s (TOTAL)
=== bandwidth limit: 16MiB/s (leaky bucket) ===
[task 1] time: 3.737, bw: 7812 KiB/s (WRITE)
[task 1] time: 3.988, bw: 7812 KiB/s (READ )
[task 2] time: 4.043, bw: 7812 KiB/s (WRITE)
[task 2] time: 3.954, bw: 7812 KiB/s (READ )
[parent 0] time: 8.040, bw: 15625 KiB/s (TOTAL)
=== bandwidth limit: 16MiB/s (token bucket) ===
[task 1] time: 3.224, bw: 9765 KiB/s (WRITE)
[task 1] time: 3.550, bw: 8789 KiB/s (READ )
[task 2] time: 5.085, bw: 5859 KiB/s (WRITE)
[task 2] time: 3.033, bw: 10742 KiB/s (READ )
[parent 0] time: 8.160, bw: 15625 KiB/s (TOTAL)
=== bandwidth limit: 20MiB/s (leaky bucket) ===
[task 1] time: 3.265, bw: 9765 KiB/s (WRITE)
[task 1] time: 3.339, bw: 9765 KiB/s (READ )
[task 2] time: 3.001, bw: 10742 KiB/s (WRITE)
[task 2] time: 3.840, bw: 7812 KiB/s (READ )
[parent 0] time: 6.884, bw: 18554 KiB/s (TOTAL)
=== bandwidth limit: 20MiB/s (token bucket) ===
[task 1] time: 2.897, bw: 10742 KiB/s (WRITE)
[task 1] time: 3.071, bw: 9765 KiB/s (READ )
[task 2] time: 3.697, bw: 8789 KiB/s (WRITE)
[task 2] time: 2.925, bw: 10742 KiB/s (READ )
[parent 0] time: 6.657, bw: 19531 KiB/s (TOTAL)
=== bandwidth limit: 24MiB/s (leaky bucket) ===
[task 1] time: 2.283, bw: 13671 KiB/s (WRITE)
[task 1] time: 3.626, bw: 8789 KiB/s (READ )
[task 2] time: 3.892, bw: 7812 KiB/s (WRITE)
[task 2] time: 2.774, bw: 11718 KiB/s (READ )
[parent 0] time: 6.724, bw: 18554 KiB/s (TOTAL)
=== bandwidth limit: 24MiB/s (token bucket) ===
[task 2] time: 3.215, bw: 9765 KiB/s (WRITE)
[task 2] time: 2.767, bw: 11718 KiB/s (READ )
[task 1] time: 2.615, bw: 11718 KiB/s (WRITE)
[task 1] time: 3.958, bw: 7812 KiB/s (READ )
[parent 0] time: 6.610, bw: 19531 KiB/s (TOTAL)
In conclusion, the results seem to confirm that the leaky bucket is more
precise (smoother) than the token bucket; the token bucket, in turn, is
more efficient when approaching the disk's physical I/O limit, as the
theory predicts.
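(As a worked example with the numbers above: with a 4 MiB bucket and a
4 MiB/s limit, a task that has been idle for a second or more finds a
full bucket and can burst roughly 4 MiB at device speed before being
throttled again; the leaky bucket instead enforces the average rate
throughout.)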
It would also be interesting to test how token bucket performance
changes with different bucket sizes. I'll run more accurate tests
ASAP.
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/Makefile | 2 +
block/blk-io-throttle.c | 490 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 12 +
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 10 +
5 files changed, 520 insertions(+), 0 deletions(-)
diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..8dec69b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -14,3 +14,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..c6af273
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,490 @@
+ long bucket_size;
+ atomic_long_t token;
+};
+
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ /* protects the list below, not the single elements */
+ spinlock_t lock;
+ struct list_head list;
+ int strategy;
+ iot->strategy = 0;
+ goto out;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+ rcu_read_lock();
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ unsigned long delta, rate;
+
+ BUG_ON(!n->dev);
+ delta = jiffies_to_usecs((long)jiffies - (long)n->timestamp);
+ rate = delta ? KBS(atomic_long_read(&n->stat) / delta) : 0;
+ s += scnprintf(buffer + s, nbytes - s,
+ "device: %u,%u\n"
+ "bandwidth: %lu KiB/sec\n"
+ "usage: %lu KiB/sec\n"
+ "bucket size: %lu KiB\n"
+ "bucket fill: %li KiB\n",
+ MAJOR(n->dev), MINOR(n->dev),
+ n->iorate, rate,
+ n->bucket_size,
+ atomic_long_read(&n->token) >> 10);
+static inline int iothrottle_parse_args(char *buf, size_t nbytes,
+ dev_t *dev, unsigned long *iorate,
+ unsigned long *bucket_size)
+{
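+	/* Rule format, as parsed here: <device>:<bandwidth>:<bucket size> */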
+ char *ioratep, *bucket_sizep;
+
+ ioratep = memchr(buf, ':', nbytes);
+ if (!ioratep)
+ return -EINVAL;
+ *ioratep++ = '\0';
+
+ bucket_sizep = memchr(ioratep, ':', nbytes + ioratep - buf);
+ if (!bucket_sizep)
+ return -EINVAL;
+ *bucket_sizep++ = '\0';
+
+	/* i/o bandwidth is expressed in KiB/s */
+ *iorate = ALIGN(memparse(ioratep, &ioratep), 1024) >> 10;
+ if (*ioratep)
+ return -EINVAL;
+ *bucket_size = ALIGN(memparse(bucket_sizep, &bucket_sizep), 1024) >> 10;
+ if (*bucket_sizep)
+ return -EINVAL;
+
+ *dev = devname2dev_t(buf);
+ if (!*dev)
+ return -ENOTBLK;
+
+ return 0;
+}
+
+static ssize_t iothrottle_write(struct cgroup *cont,
+ struct cftype *cft,
+ struct file *file,
+ const char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *tmpn = NULL;
+ char *buffer, *tmpp;
+ dev_t dev;
+ unsigned long iorate, bucket_size;
+ int ret;
+
+ if (!nbytes)
+ return -EINVAL;
+
+ /* Upper limit on largest io-throttle rule string user might write. */
+ if (nbytes > 1024)
+ return -E2BIG;
+
+ buffer = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ if (copy_from_user(buffer, userbuf, nbytes)) {
+ ret = -EFAULT;
+ goto out1;
+ }
+
+ buffer[nbytes] = '\0';
+ tmpp = strstrip(buffer);
+
+ ret = iothrottle_parse_args(tmpp, nbytes, &dev, &iorate, &bucket_size);
+ if (ret)
+ goto out1;
+
+ if (iorate) {
+ tmpn = kmalloc(sizeof(*tmpn), GFP_KERNEL);
+ if (!tmpn) {
+ ret = -ENOMEM;
+ goto out1;
+ }
+ atomic_long_set(&tmpn->stat, 0);
+ tmpn->timestamp = jiffies;
+ tmpn->iorate = iorate;
+ tmpn->bucket_size = bucket_size;
+ atomic_long_set(&tmpn->token, 0);
+ tmpn->dev = dev;
+ }
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ ret = -ENODEV;
+ goto out2;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+ spin_lock(&iot->lock);
+ if (!iorate) {
+static s64 iothrottle_strategy_read(struct cgroup *cont, struct cftype *cft)
+{
+ struct iothrottle *iot;
+ s64 ret;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ cgroup_unlock();
+ return -ENODEV;
+ }
+ iot = cgroup_to_iothrottle(cont);
+ ret = iot->strategy;
+ cgroup_unlock();
+ return ret;
+}
+
+static int iothrottle_strategy_write(struct cgroup *cont,
+ struct cftype *cft, s64 val)
+{
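+	/* val: 0 selects the leaky bucket strategy, 1 the token bucket */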
+ struct iothrottle *iot;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ cgroup_unlock();
+ return -ENODEV;
+ }
+ iot = cgroup_to_iothrottle(cont);
+ iot->strategy = (int)val;
+ cgroup_unlock();
+ return 0;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth",
+ .read = iothrottle_read,
+ .write = iothrottle_write,
+ },
+ {
+ .name = "throttling_strategy",
+ .read_s64 = iothrottle_strategy_read,
+ .write_s64 = iothrottle_strategy_write,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+};
+
+static inline int __cant_sleep(void)
+{
+ return in_atomic() || in_interrupt() || irqs_disabled();
+}
+
+static long leaky_bucket(struct iothrottle_node *n, size_t bytes)
+{
+ unsigned long delta, t;
+ long sleep;
+
+ /* Account the i/o activity */
+ atomic_long_add(bytes, &n->stat);
+
+ /* Evaluate if we need to throttle the current process */
+ delta = (long)jiffies - (long)n->timestamp;
+ if (!delta)
+ return 0;
+
+ t = usecs_to_jiffies(KBS(atomic_long_read(&n->stat) / n->iorate));
+ if (!t)
+ return 0;
+
+ sleep = t - delta;
+ if (unlikely(sleep > 0))
+ return sleep;
+
+ /* Reset i/o statistics */
+ atomic_long_set(&n->stat, 0);
+ /*
+	 * NOTE: be sure the i/o statistics have been reset before updating the
+	 * timestamp, otherwise a very small time delta may possibly be read by
+	 * another CPU w.r.t. the accounted i/o statistics, generating
+	 * unnecessarily long sleeps.
+ */
+ smp_wmb();
+ n->timestamp = jiffies;
+ return 0;
+}
+
+/* XXX: need locking in order to evaluate a consistent sleep??? */
+static long token_bucket(struct iothrottle_node *n, size_t bytes)
+{
+ unsigned long delta;
+ long tok;
+
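+	/* Charge this request against the bucket up front */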
+ atomic_long_sub(bytes, &n->token);
+
+ delta = (long)jiffies - (long)n->timestamp;
+ if (!delta)
+ return 0;
+
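+	/* Refill tokens for the elapsed time, capping at the bucket size */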
+ n->timestamp = jiffies;
+ tok = atomic_long_read(&n->token) + jiffies_to_msecs(delta) * n->iorate;
+ if (tok > n->bucket_size)
+ tok = n->bucket_size;
+ atomic_long_set(&n->token, tok);
+
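+	/* A negative balance means the bucket is overdrawn: sleep off the debt */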
+ return (tok < 0) ? msecs_to_jiffies(-tok / n->iorate) : 0;
+}
+
+void cgroup_io_throttle(struct block_device *bdev, size_t bytes)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n;
+ long sleep;
+
+ if (unlikely(!bdev || !bytes))
+ return;
+
+ iot = task_to_iothrottle(current);
+ if (unlikely(!iot))
+ return;
+
+ BUG_ON(!bdev->bd_inode);
+
+ rcu_read_lock();
+ n = iothrottle_search_node(iot, bdev->bd_inode->i_rdev);
+ if (!n || !n->iorate) {
+ rcu_read_unlock();
+ return;
+ }
+ switch (iot->strategy) {
+ case 0:
+ sleep = leaky_bucket(n, bytes);
+ break;
+ case 1:
+ sleep = token_bucket(n, bytes);
+ break;
+ default:
+ sleep = 0;
+ }
+ if (unlikely(sleep)) {
+ rcu_read_unlock();
+ if (__cant_sleep())
+ return;
+ pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
+ current, current->comm, sleep);
+ schedule_timeout_killable(sleep);
+ return;
+ }
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(cgroup_io_throttle);
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
Hi Andrea,
Yes, that's fine.
Thanks.
Signed-off-by: ??
A few nits below...
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index b7522c6..c4d348d 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -895,9 +895,9 @@ struct dentry_operations {
> iput() yourself
>
> d_dname: called when the pathname of a dentry should be generated.
> - Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
> + Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay
> pathname generation. (Instead of doing it when dentry is created,
> - its done only when the path is needed.). Real filesystems probably
> + it's done only when the path is needed.). Real filesystems probably
> dont want to use it, because their dentries are present in global
don't
> dcache hash, so their hash should be an invariant. As no lock is
> held, d_dname() should not try to modify the dentry itself, unless
> diff --git a/Documentation/sound/alsa/Audiophile-Usb.txt b/Documentation/sound/alsa/Audiophile-Usb.txt
> index 2ad5e63..a4c53d8 100644
> --- a/Documentation/sound/alsa/Audiophile-Usb.txt
> +++ b/Documentation/sound/alsa/Audiophile-Usb.txt
> @@ -388,9 +388,9 @@ There are 2 main potential issues when using Jackd with the device:
>
> Jack supports big endian devices only in recent versions (thanks to
> Andreas Steinmetz for his first big-endian patch). I can't remember
> -extacly when this support was released into jackd, let's just say that
> +exactly when this support was released into jackd, let's just say that
> with jackd version 0.103.0 it's almost ok (just a small bug is affecting
> -16bits Big-Endian devices, but since you've read carefully the above
> +16bits Big-Endian devices, but since you've read carefully the above
16-bit (above and below here)
> paragraphs, you're now using kernel >= 2.6.23 and your 16bits devices
> are now Little Endians ;-) ).
>
Thanks. Big ack for all of the others.
---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/
Here is the patch to show the idea. It is only build-tested, so no
guarantees that it will work on the actual hardware. :-)
This patch should be split: one part for the I2C subsystem
maintainer (the drivers/i2c/* changes) and another (drivers/power/* +
include/linux/bq27x00.h) for me.
drivers/i2c/chips/Kconfig | 9 +
drivers/i2c/chips/Makefile | 1
drivers/i2c/chips/bq27200.c | 167 +++++++++++++++++++++++++
drivers/power/Kconfig | 10 +
drivers/power/Makefile | 1
drivers/power/bq27x00_battery.c | 259 ++++++++++++++++++++++++++++++++++++++++
include/linux/bq27x00.h | 28 ++++
7 files changed, 475 insertions(+)
diff --git a/drivers/i2c/chips/Kconfig b/drivers/i2c/chips/Kconfig
index 2da2edf..d4ad4a0 100644
--- a/drivers/i2c/chips/Kconfig
+++ b/drivers/i2c/chips/Kconfig
@@ -129,6 +129,15 @@ config SENSORS_TSL2550
This driver can also be built as a module. If so, the module
will be called tsl2550.
+config BATTERY_BQ27200
+ tristate "BQ27200 I2C battery driver"
+ default y if BATTERY_BQ27x00
+ help
+ Say Y here to enable support for batteries with the BQ27200 I2C chip.
+
+ Note: you'll also need to select the BATTERY_BQ27x00 driver to get
+ the userspace interface for this chip.
+
config MENELAUS
bool "TWL92330/Menelaus PM chip"
depends on I2C=y && ARCH_OMAP24XX
diff --git a/drivers/i2c/chips/Makefile b/drivers/i2c/chips/Makefile
index e47aca0..2f39f73 100644
--- a/drivers/i2c/chips/Makefile
+++ b/drivers/i2c/chips/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_SENSORS_PCF8591) += pcf8591.o
obj-$(CONFIG_ISP1301_OMAP) += isp1301_omap.o
obj-$(CONFIG_TPS65010) += tps65010.o
obj-$(CONFIG_MENELAUS) += menelaus.o
+obj-$(CONFIG_BATTERY_BQ27200) += bq27200.o
obj-$(CONFIG_SENSORS_TSL2550) += tsl2550.o
ifeq ($(CONFIG_I2C_DEBUG_CHIP),y)
diff --git a/drivers/i2c/chips/bq27200.c b/drivers/i2c/chips/bq27200.c
new file mode 100644
index 0000000..fc485be
--- /dev/null
+++ b/drivers/i2c/chips/bq27200.c
@@ -0,0 +1,167 @@
+/*
+ * BQ27200 I2C battery monitor driver
+ *
+ * Copyright (C) 2008 Rodolfo Giometti <giom...@linux.it>
+ * Copyright (C) 2008 Eurotech S.p.A. <in...@eurotech.it>
+ *
+ * Based on a previous work by Copyright (C) 2008 Texas Instruments, Inc.
+ *
+ * This package is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * THIS PACKAGE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
+ * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/idr.h>
+#include <linux/i2c.h>
+#include <linux/bq27x00.h>
+
+#define HIGH_BYTE(A) ((A) << 8)
+
+/*
+ * If the system has several batteries we need a different id for each
+ * of them... Sadly, there is no support for dynamic ids for platform
+ * devices.
+ */
+static DEFINE_IDR(battery_id);
+static DEFINE_MUTEX(battery_mutex);
+
+static int bq27200_read(struct device *dev, u8 reg, int *rt_value, int b_single)
+{
+ struct i2c_client *client = to_i2c_client(dev);
+ struct i2c_msg msg[1];
+ unsigned char data[2];
+ int err;
+
+ if (!client->adapter)
+ return -ENODEV;
+
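+	/* Write the register address first, then read back one or two bytes */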
+ msg->addr = client->addr;
+ msg->flags = 0;
+ msg->len = 1;
+ msg->buf = data;
+ data[0] = reg;
+
+ err = i2c_transfer(client->adapter, msg, 1);
+ if (err >= 0) {
+ if (!b_single)
+ msg->len = 2;
+ else
+ msg->len = 1;
+
+ msg->flags = I2C_M_RD;
+ err = i2c_transfer(client->adapter, msg, 1);
+ if (err >= 0) {
+ if (!b_single)
+ *rt_value = data[1] | HIGH_BYTE(data[0]);
+ else
+ *rt_value = data[0];
+ return 0;
+ } else
+ return err;
+ } else {
+ return err;
+ }
+}
+
+static int __init bq27200_battery_probe(struct i2c_client *client,
+ const struct i2c_device_id *id)
+{
+ struct platform_device *pdev;
+ struct bq27x00_access_methods am = {
+ .read = bq27200_read,
+ };
+ int bat_id;
+ int ret = 0;
+
+idr_try_again:
+ /* Get new ID for the new battery device */
+ ret = idr_pre_get(&battery_id, GFP_KERNEL);
+ if (ret == 0)
+ return -ENOMEM;
+ mutex_lock(&battery_mutex);
+ ret = idr_get_new(&battery_id, client, &bat_id);
+ mutex_unlock(&battery_mutex);
+
+ if (ret == -EAGAIN)
+ goto idr_try_again;
+ else if (ret != 0)
+ return ret;
+
+ bat_id &= MAX_ID_MASK;
+
+ pdev = platform_device_alloc("bq27000-battery", bat_id);
+ if (!pdev) {
+ ret = -ENOMEM;
+ goto err_pdev_alloc;
+ }
+
+ ret = platform_device_add_data(pdev, &am, sizeof(am));
+ if (ret)
+ goto err_pdev;
+
+ pdev->dev.parent = &client->dev;
+
+ ret = platform_device_add(pdev);
+ if (ret)
+ goto err_pdev;
+
+ i2c_set_clientdata(client, pdev);
+
+ dev_info(&client->dev, "probed fine\n");
+
+ return 0;
+err_pdev:
+ platform_device_del(pdev);
+err_pdev_alloc:
+ mutex_lock(&battery_mutex);
+ idr_remove(&battery_id, bat_id);
+ mutex_unlock(&battery_mutex);
+ return ret;
+}
+
+static int __devexit bq27200_battery_remove(struct i2c_client *client)
+{
+ struct platform_device *pdev = i2c_get_clientdata(client);
+
+ platform_device_unregister(pdev);
+ mutex_lock(&battery_mutex);
+ idr_remove(&battery_id, pdev->id);
+ mutex_unlock(&battery_mutex);
+ return 0;
+}
+
+static const struct i2c_device_id bq27200_id[] = {
+	{ "bq27200", 0 },
+	{ },	/* i2c_device_id tables must be zero-terminated */
+};
+
+static struct i2c_driver bq27200_battery_driver = {
+ .probe = bq27200_battery_probe,
+ .remove = __devexit_p(bq27200_battery_remove),
+
+ .driver = {
+ .name = "bq27200-battery",
+ },
+ .id_table = bq27200_id,
+};
+
+static int __init bq27200_battery_init(void)
+{
+ return i2c_add_driver(&bq27200_battery_driver);
+}
+module_init(bq27200_battery_init);
+
+static void __exit bq27200_battery_exit(void)
+{
+ i2c_del_driver(&bq27200_battery_driver);
+}
+module_exit(bq27200_battery_exit);
+
+MODULE_AUTHOR("Texas Instruments");
+MODULE_DESCRIPTION("BQ27000 I2C driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/power/Kconfig b/drivers/power/Kconfig
index 58c806e..4fe3f88 100644
--- a/drivers/power/Kconfig
+++ b/drivers/power/Kconfig
@@ -49,4 +49,14 @@ config BATTERY_OLPC
help
Say Y to enable support for the battery on the OLPC laptop.
+config BATTERY_BQ27x00
+ tristate "BQ27x00 battery monitor driver"
+ help
+ Say Y here to enable support for the common userspace interface for
+ batteries with BQ27000 or BQ27200 chips inside.
+
+ NOTE: this driver only provides the userspace interface for these
+ chips; you need to select the BQ27x00 chip variant, I2C or W1,
+ as found in the appropriate menus.
+
endif # POWER_SUPPLY
diff --git a/drivers/power/Makefile b/drivers/power/Makefile
index 6413ded..15aa8cb 100644
--- a/drivers/power/Makefile
+++ b/drivers/power/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_APM_POWER) += apm_power.o
obj-$(CONFIG_BATTERY_DS2760) += ds2760_battery.o
obj-$(CONFIG_BATTERY_PMU) += pmu_battery.o
obj-$(CONFIG_BATTERY_OLPC) += olpc_battery.o
+obj-$(CONFIG_BATTERY_BQ27x00) += bq27x00_battery.o
diff --git a/drivers/power/bq27x00_battery.c b/drivers/power/bq27x00_battery.c
new file mode 100644
index 0000000..ae5bff7
--- /dev/null
+++ b/drivers/power/bq27x00_battery.c
@@ -0,0 +1,259 @@
+/*
+ * BQ27000/BQ27200 battery monitor driver
+ *
+ * Copyright (C) 2008 Rodolfo Giometti <giom...@linux.it>
+ * Copyright (C) 2008 Eurotech S.p.A. <in...@eurotech.it>
+ *
+ * Based on a previous work by Copyright (C) 2008 Texas Instruments, Inc.
+ *
+ * This package is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * THIS PACKAGE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
+ * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/power_supply.h>
+#include <linux/bq27x00.h>
+
+#define DRIVER_VERSION "1.0.0"
+
+#define BQ27x00_REG_TEMP 0x06
+#define BQ27x00_REG_VOLT 0x08
+#define BQ27x00_REG_RSOC 0x0B /* Relative State-of-Charge */
+#define BQ27x00_REG_AI 0x14
+#define BQ27x00_REG_FLAGS 0x0A
+
+struct bq27x00_device_info {
+ struct device *dev;
+ int voltage_uV;
+ int current_uA;
+ int temp_C;
+ int charge_rsoc;
+ struct bq27x00_access_methods *bus;
+ struct power_supply bat;
+};
+
+#define to_bq27x00_device_info(x) container_of((x), \
+ struct bq27x00_device_info, bat);
+
+static enum power_supply_property bq27x00_battery_props[] = {
+ POWER_SUPPLY_PROP_VOLTAGE_NOW,
+ POWER_SUPPLY_PROP_CURRENT_NOW,
+ POWER_SUPPLY_PROP_CAPACITY,
+ POWER_SUPPLY_PROP_TEMP,
+};
+
+static int bq27x00_read(struct bq27x00_device_info *di, u8 reg, int *rt_value,
+ int b_single)
+{
+ int ret;
+
+ ret = di->bus->read(di->dev->parent, reg, rt_value, b_single);
+ *rt_value = be16_to_cpu(*rt_value);
+
+ return ret;
+}
+
+/*
+ * Return the battery temperature in degrees Celsius,
+ * or < 0 if something fails.
+ */
+static int bq27x00_battery_temperature(struct bq27x00_device_info *di)
+{
+ int ret;
+ int temp = 0;
+
+ ret = bq27x00_read(di, BQ27x00_REG_TEMP, &temp, 0);
+ if (ret) {
+ dev_err(di->dev, "error reading temperature\n");
+ return ret;
+ }
+
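+	/* The raw reading is in units of 0.25 K; convert to whole degrees Celsius */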
+ return (temp >> 2) - 273;
+}
+
+/*
+ * Return the battery voltage in millivolts,
+ * or < 0 if something fails.
+ */
+static int bq27x00_battery_voltage(struct bq27x00_device_info *di)
+{
+ int ret;
+ int volt = 0;
+
+ ret = bq27x00_read(di, BQ27x00_REG_VOLT, &volt, 0);
+ if (ret) {
+ dev_err(di->dev, "error reading voltage\n");
+ return ret;
+ }
+
+ return volt;
+}
+
+/*
+ * Return the battery average current
+ * Note that the current can be negative (signed) as well,
+ * or 0 if something fails.
+ */
+static int bq27x00_battery_current(struct bq27x00_device_info *di)
+{
+ int ret;
+ int curr = 0;
+ int flags = 0;
+
+ ret = bq27x00_read(di, BQ27x00_REG_AI, &curr, 0);
+ if (ret) {
+ dev_err(di->dev, "error reading current\n");
+ return 0;
+ }
+ ret = bq27x00_read(di, BQ27x00_REG_FLAGS, &flags, 0);
+ if (ret < 0) {
+ dev_err(di->dev, "error reading flags\n");
+ return 0;
+ }
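+	/* bit 7 of FLAGS set: the gauge reports a discharge, so negate */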
+ if ((flags & (1 << 7)) != 0) {
+ dev_dbg(di->dev, "negative current!\n");
+ return -curr;
+ }
+ return curr;
+}
+
+/*
+ * Return the battery Relative State-of-Charge
+ * Or < 0 if something fails.
+ */
+static int bq27x00_battery_rsoc(struct bq27x00_device_info *di)
+{
+ int ret;
+ int rsoc = 0;
+
+ ret = bq27x00_read(di, BQ27x00_REG_RSOC, &rsoc, 1);
+ if (ret) {
+ dev_err(di->dev, "error reading relative State-of-Charge\n");
+ return ret;
+ }
+
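+	/* single-byte register: after the be16 swap the value sits in the high byte */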
+ return rsoc >> 8;
+}
+
+static int bq27x00_battery_get_property(struct power_supply *psy,
+ enum power_supply_property psp,
+ union power_supply_propval *val)
+{
+ struct bq27x00_device_info *di = to_bq27x00_device_info(psy);
+
+ switch (psp) {
+ case POWER_SUPPLY_PROP_VOLTAGE_NOW:
+ val->intval = bq27x00_battery_voltage(di);
+ break;
+ case POWER_SUPPLY_PROP_CURRENT_NOW:
+ val->intval = bq27x00_battery_current(di);
+ break;
+ case POWER_SUPPLY_PROP_CAPACITY:
+ val->intval = bq27x00_battery_rsoc(di);
+ break;
+ case POWER_SUPPLY_PROP_TEMP:
+ val->intval = bq27x00_battery_temperature(di);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int __init bq27x00_battery_probe(struct platform_device *pdev)
+{
+ struct bq27x00_device_info *di;
+ int ret;
+
+ di = kzalloc(sizeof(*di), GFP_KERNEL);
+ if (!di) {
+ dev_err(&pdev->dev, "failed to allocate device info data\n");
+ return -ENOMEM;
+ }
+
+ platform_set_drvdata(pdev, di);
+
+ di->dev = &pdev->dev;
+ di->bus = platform_get_drvdata(pdev);
+ di->bat.name = pdev->dev.bus_id;
+ di->bat.type = POWER_SUPPLY_TYPE_BATTERY;
+ di->bat.properties = bq27x00_battery_props;
+ di->bat.num_properties = ARRAY_SIZE(bq27x00_battery_props);
+ di->bat.get_property = bq27x00_battery_get_property;
+
+ ret = power_supply_register(&pdev->dev, &di->bat);
+ if (ret) {
+ dev_err(&pdev->dev, "failed to register battery\n");
+ goto batt_failed;
+ }
+
+ dev_info(&pdev->dev, "support ver. %s enabled\n", DRIVER_VERSION);
+
+ return 0;
+
+batt_failed:
+ kfree(di);
+ return ret;
+}
+
+static int __devexit bq27x00_battery_remove(struct platform_device *pdev)
+{
+ struct bq27x00_device_info *di = platform_get_drvdata(pdev);
+
+ power_supply_unregister(&di->bat);
+ kfree(di);
+ return 0;
+}
+
+static struct platform_driver bq27000_battery_driver = {
+ .probe = bq27x00_battery_probe,
+	.remove = __devexit_p(bq27x00_battery_remove),
+ .driver = {
+ .name = "bq27000-battery",
+ },
+};
+MODULE_ALIAS("platform:bq27000-battery");
+
+static struct platform_driver bq27200_battery_driver = {
+ .probe = bq27x00_battery_probe,
+ .remove = __devexit_p(bq27x00_battery_remove),
+ .driver = {
+ .name = "bq27200-battery",
+ },
+};
+MODULE_ALIAS("platform:bq27200-battery");
+
+static int __init bq27x00_battery_init(void)
+{
+ int ret;
+
+ ret = platform_driver_register(&bq27000_battery_driver);
+ if (ret)
+		return ret;
+
+ ret = platform_driver_register(&bq27200_battery_driver);
+ if (ret) {
+ platform_driver_unregister(&bq27000_battery_driver);
+		return ret;
+ }
+ return 0;
+}
+module_init(bq27x00_battery_init);
+
+static void __exit bq27x00_battery_exit(void)
+{
+ platform_driver_unregister(&bq27000_battery_driver);
+ platform_driver_unregister(&bq27200_battery_driver);
+}
+module_exit(bq27x00_battery_exit);
+
+MODULE_AUTHOR("Texas Instruments");
+MODULE_DESCRIPTION("BQ27x00 battery monitor driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/bq27x00.h b/include/linux/bq27x00.h
new file mode 100644
index 0000000..a3484eb
--- /dev/null
+++ b/include/linux/bq27x00.h
@@ -0,0 +1,28 @@
+/*
+ * BQ27x00 battery monitors
+ *
+ * Copyright (C) 2008 Rodolfo Giometti <giom...@linux.it>
+ * Copyright (C) 2008 Eurotech S.p.A. <in...@eurotech.it>
+ *
+ * Based on a previous work by Copyright (C) 2008 Texas Instruments, Inc.
+ *
+ * This package is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * THIS PACKAGE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
+ * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef __LINUX_BQ27x00_H
+#define __LINUX_BQ27x00_H
+
+#include <linux/types.h>
+#include <linux/device.h>
+
+struct bq27x00_access_methods {
+ int (*read)(struct device *dev, u8 reg, int *rt_value, int b_single);
+};
+
+#endif /* __LINUX_BQ27x00_H */
Signed-off-by: Matt LaPlante <ker...@cyberdogtech.com>
---
Documentation/Intel-IOMMU.txt | 4 +-
Documentation/accounting/taskstats-struct.txt | 2 +-
Documentation/cpu-freq/governors.txt | 2 +-
Documentation/edac.txt | 2 +-
Documentation/filesystems/proc.txt | 4 +-
Documentation/filesystems/vfs.txt | 8 ++--
Documentation/ia64/kvm.txt | 8 ++--
Documentation/input/cs461x.txt | 2 +-
Documentation/ioctl/ioctl-decoding.txt | 4 +-
Documentation/iostats.txt | 2 +-
Documentation/kdump/kdump.txt | 2 +-
Documentation/keys.txt | 2 +-
Documentation/leds-class.txt | 2 +-
Documentation/local_ops.txt | 2 +-
Documentation/networking/bonding.txt | 4 +-
Documentation/networking/can.txt | 4 +-
Documentation/networking/ip-sysctl.txt | 2 +-
Documentation/networking/packet_mmap.txt | 2 +-
Documentation/networking/tc-actions-env-rules.txt | 15 ++++---
Documentation/powerpc/booting-without-of.txt | 12 +++---
Documentation/powerpc/qe_firmware.txt | 2 +-
Documentation/s390/driver-model.txt | 2 +-
Documentation/scsi/ibmmca.txt | 6 +-
Documentation/scsi/lpfc.txt | 2 +-
Documentation/scsi/scsi_fc_transport.txt | 6 +-
Documentation/sh/clk.txt | 2 +-
Documentation/sound/alsa/Audiophile-Usb.txt | 46 ++++++++++----------
Documentation/sound/alsa/hda_codec.txt | 2 +-
Documentation/sound/alsa/soc/dapm.txt | 2 +-
Documentation/sysctl/vm.txt | 2 +-
Documentation/timers/highres.txt | 2 +-
Documentation/usb/authorization.txt | 2 +-
Documentation/video4linux/sn9c102.txt | 2 +-
Documentation/vm/hugetlbpage.txt | 2 +-
Documentation/vm/numa_memory_policy.txt | 4 +-
Documentation/volatile-considered-harmful.txt | 2 +-
drivers/message/fusion/lsi/mpi_history.txt | 6 +-
37 files changed, 89 insertions(+), 88 deletions(-)
index b7522c6..9124733 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -143,7 +143,7 @@ struct file_system_type {
The get_sb() method has the following arguments:
- struct file_system_type *fs_type: decribes the filesystem, partly initialized
+ struct file_system_type *fs_type: describes the filesystem, partly initialized
by the specific filesystem code
int flags: mount flags
@@ -895,10 +895,10 @@ struct dentry_operations {
iput() yourself
d_dname: called when the pathname of a dentry should be generated.
- Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+ Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay
pathname generation. (Instead of doing it when dentry is created,
- its done only when the path is needed.). Real filesystems probably
- dont want to use it, because their dentries are present in global
+ it's done only when the path is needed.). Real filesystems probably
+ don't want to use it, because their dentries are present in global
dcache hash, so their hash should be an invariant. As no lock is
held, d_dname() should not try to modify the dentry itself, unless
appropriate SMP safety is used. CAUTION : d_path() logic is quite
index 2ad5e63..971c8bb 100644
--- a/Documentation/sound/alsa/Audiophile-Usb.txt
+++ b/Documentation/sound/alsa/Audiophile-Usb.txt
@@ -9,7 +9,7 @@ ALSA and JACK.
History
=======
* v1.4 - Thibault Le Meur (2007-07-11)
- - Added Low Endianness nature of 16bits-modes
+ - Added Low Endianness nature of 16-bit modes
found by Hakan Lennestal <Hakan.L...@brfsodrahamn.se>
- Modifying document structure
* v1.5 - Thibault Le Meur (2007-07-12)
@@ -106,8 +106,8 @@ way (I suppose the device's index is 1):
* hw:1,2 is Do in AC3/DTS passthrough mode
In this mode, the device uses Big Endian byte-encoding so that
-supported audio format are S16_BE for 16-bit depth modes and S24_3BE for
-24-bits depth mode.
+supported audio formats are S16_BE for 16-bit depth mode and S24_3BE for
+24-bit depth mode.
One exception is the hw:1,2 port which was reported to be Little Endian
compliant (supposedly supporting S16_LE) but processes in fact only S16_BE streams.
@@ -159,18 +159,18 @@ Others modes are described in the following subsections.
The two supported modes are:
* device_setup=0x01
- - 16bits 48kHz mode with Di disabled
+ - 16-bit 48kHz mode with Di disabled
- Ai,Ao,Do can be used at the same time
- hw:1,0 is not available in capture mode
- hw:1,2 is not available
* device_setup=0x11
- - 16bits 48kHz mode with Di enabled
+ - 16-bit 48kHz mode with Di enabled
- Ai,Ao,Di,Do can be used at the same time
- hw:1,0 is available in capture mode
- hw:1,2 is not available
-In this modes the device operates only at 16bits-modes. Before kernel 2.6.23,
+In these modes the device operates only at 16-bit mode. Before kernel 2.6.23,
the devices where reported to be Big-Endian when in fact they were Little-Endian
so that playing a file was a matter of using:
% aplay -D hw:1,1 -c2 -t raw -r48000 -fS16_BE test_S16_LE.raw
@@ -187,20 +187,20 @@ using:
The three supported modes are:
* device_setup=0x09
- - 24bits 48kHz mode with Di disabled
+ - 24-bit 48kHz mode with Di disabled
- Ai,Ao,Do can be used at the same time
- hw:1,0 is not available in capture mode
- hw:1,2 is not available
* device_setup=0x19
- - 24bits 48kHz mode with Di enabled
+ - 24-bit 48kHz mode with Di enabled
- 3 ports from {Ai,Ao,Di,Do} can be used at the same time
- hw:1,0 is available in capture mode and an active digital source must be
connected to Di
- hw:1,2 is not available
* device_setup=0x0D or 0x10
- - 24bits 96kHz mode
+ - 24-bit 96kHz mode
- Di is enabled by default for this mode but does not need to be connected
to an active source
- Only 1 port from {Ai,Ao,Di,Do} can be used at the same time
@@ -215,7 +215,7 @@ mode" above for an aplay command example)
Thanks to Hakan Lennestal, I now have a report saying that this mode works.
* device_setup=0x03
- - 16bits 48kHz mode with only the Do port enabled
+ - 16-bit 48kHz mode with only the Do port enabled
- AC3 with DTS passthru
- Caution with this setup the Do port is mapped to the pcm device hw:1,0
@@ -236,15 +236,15 @@ The parameter can be given:
alias snd-card-1 snd-usb-audio
options snd-usb-audio index=1 device_setup=0x09
-CAUTION when initializaing the device
+CAUTION when initializing the device
-------------------------------------
* Correct initialization on the device requires that device_setup is given to
the module BEFORE the device is turned on. So, if you use the "manual probing"
method described above, take care to power-on the device AFTER this initialization.
- * Failing to respect this will lead in a misconfiguration of the device. In this case
- turn off the device, unproble the snd-usb-audio module, then probe it again with
+ * Failing to respect this will lead to a misconfiguration of the device. In this case
+ turn off the device, unprobe the snd-usb-audio module, then probe it again with
correct device_setup parameter and then (and only then) turn on the device again.
* If you've correctly initialized the device in a valid mode and then want to switch
@@ -287,9 +287,9 @@ Where:
- When set to "1" the rate range is 48.1-96kHz
- Otherwise the sample rate range is 8-48kHz
* b3 is the bit depth selection flag
- - When set to "1" samples are 24bits long
- - Otherwise they are 16bits long
- - Note that b2 implies b3 as the 96kHz mode is only supported for 24 bits
+ - When set to "1" samples are 24 bits long
+ - Otherwise they are 16 bits long
+ - Note that b2 implies b3 as the 96kHz mode is only supported for 24 bit
samples
* b4 is the Digital input flag
- When set to "1" the device assumes that an active digital source is
@@ -303,10 +303,10 @@ Where:
Caution:
* there is no check on the value you will give to device_setup
- - for instance choosing 0x05 (16bits 96kHz) will fail back to 0x09 since
+ - for instance choosing 0x05 (16-bit 96kHz) will fail back to 0x09 since
b2 implies b3. But _there_will_be_no_warning_ in /var/log/messages
* Hardware constraints due to the USB bus limitation aren't checked
- - choosing b2 will prepare all interfaces for 24bits/96kHz but you'll
+ - choosing b2 will prepare all interfaces for 24-bit/96kHz but you'll
only be able to use one at the same time
3.2.3.2 - USB implementation details for this device
@@ -362,10 +362,10 @@ _must_know_ how the device will be used:
* if DTS is chosen, only Interface 2 with AltSet nb.6 must be
registered
* if 96KHz only AltSets nb.1 of each interface must be selected
- * if samples are using 24bits/48KHz then AltSet 2 must me used if
+ * if samples are using 24-bit/48KHz then AltSet 2 must be used if
Digital input is connected, and only AltSet nb.3 if Digital input
is not connected
- * if samples are using 16bits/48KHz then AltSet 4 must me used if
+ * if samples are using 16-bit/48KHz then AltSet 4 must be used if
Digital input is connected, and only AltSet nb.5 if Digital input
is not connected
@@ -388,10 +388,10 @@ There are 2 main potential issues when using Jackd with the device:
Jack supports big endian devices only in recent versions (thanks to
Andreas Steinmetz for his first big-endian patch). I can't remember
-extacly when this support was released into jackd, let's just say that
+exactly when this support was released into jackd, let's just say that
with jackd version 0.103.0 it's almost ok (just a small bug is affecting
-16bits Big-Endian devices, but since you've read carefully the above
-paragraphs, you're now using kernel >= 2.6.23 and your 16bits devices
+16-bit Big-Endian devices, but since you've read carefully the above
+paragraphs, you're now using kernel >= 2.6.23 and your 16-bit devices
are now Little Endians ;-) ).
You can run jackd with the following command for playback with Ao and
Remove this, use USEC_PER_SEC throughout.
> +#define KBS(x) ((x) * ONE_SEC >> 10)
Convert to lower-case-named C function, please.
> +
> +struct iothrottle_node {
> + struct list_head node;
> + dev_t dev;
> + unsigned long iorate;
> + unsigned long timestamp;
> + atomic_long_t stat;
> +};
Please document each field in structures. This is usually more useful
and important than documenting the code which manipulates those fields.
It is important that the units of fields such as iorate, timestamp,
and stat be documented.
> +struct iothrottle {
> + struct cgroup_subsys_state css;
> + /* protects the list below, not the single elements */
> + spinlock_t lock;
> + struct list_head list;
> +};
Looking elsewhere in the code it appears that some RCU-based locking is
performed. That should be documented somewhere. Fully. At the
definition site of the data which is RCU-protected would be a good
site.
> +static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
> +{
> + return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
> + struct iothrottle, css);
> +}
> +
> +static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
> +{
> + return container_of(task_subsys_state(task, iothrottle_subsys_id),
> + struct iothrottle, css);
> +}
> +
> +static inline struct iothrottle_node *iothrottle_search_node(
> + const struct iothrottle *iot,
> + dev_t dev)
> +{
> + struct iothrottle_node *n;
> +
> + list_for_each_entry_rcu(n, &iot->list, node)
> + if (n->dev == dev)
> + return n;
> + return NULL;
> +}
This will be too large for inlining.
This function presumably has caller-provided locking requirements?
They should be documented here.
> +static inline void iothrottle_insert_node(struct iothrottle *iot,
> + struct iothrottle_node *n)
> +{
> + list_add_rcu(&n->node, &iot->list);
> +}
> +
> +static inline struct iothrottle_node *iothrottle_replace_node(
> + struct iothrottle *iot,
> + struct iothrottle_node *old,
> + struct iothrottle_node *new)
> +{
> + list_replace_rcu(&old->node, &new->node);
> + return old;
> +}
Dittoes.
> +static inline struct iothrottle_node *iothrottle_delete_node(
> + struct iothrottle *iot,
> + dev_t dev)
> +{
> + struct iothrottle_node *n;
> +
> + list_for_each_entry(n, &iot->list, node)
> + if (n->dev == dev) {
> + list_del_rcu(&n->node);
> + return n;
> + }
> + return NULL;
> +}
Too large for inlining.
Was list_for_each_entry_rcu() needed?
Does this function have any caller-provided locking requirements?
> +/*
> + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> + */
> +static struct cgroup_subsys_state *iothrottle_create(
> + struct cgroup_subsys *ss, struct cgroup *cont)
static struct cgroup_subsys_state *iothrottle_create(struct cgroup_subsys *ss,
struct cgroup *cont)
would be more typical code layout (here and elsewhere)
static struct cgroup_subsys_state *
iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cont)
is another way.
This is a kernel->userspace interface. It is part of the kernel ABI.
We will need to support in a back-compatible fashion for ever. Hence
it is important. The entire proposed kernel<->userspace interface
should be completely described in the changelog or the documentation so
that we can understand and review what you are proposing.
> +static inline dev_t devname2dev_t(const char *buf)
> +{
> + struct block_device *bdev;
> + dev_t ret;
> +
> + bdev = lookup_bdev(buf);
> + if (IS_ERR(bdev))
> + return 0;
> +
> + BUG_ON(!bdev->bd_inode);
> + ret = bdev->bd_inode->i_rdev;
> + bdput(bdev);
> +
> + return ret;
> +}
Too large to inline. I get tired of telling people this. Please just
remove all the inlining from all the patches. Then go back and
selectively inline any functions which really do need to be inlined
(overall reduction in generated .text is a good heuristic).
How can this function not be racy? We're returning a dev_t which
refers to a device upon which we have no reference. A better design
might be to rework the whole thing to operate on a `struct
block_device *' upon which this code holds a reference, rather than
using bare dev_t.
I _guess_ it's OK doing an in-kernel filesystem lookup here. But did
you consider just passing the dev_t in from userspace? It's just a
stat().
Does all this code treat /dev/sda1 as a separate device from /dev/sda2?
If so, that would be broken.
> +static inline int iothrottle_parse_args(char *buf, size_t nbytes,
> + dev_t *dev, unsigned long *val)
> +{
> + char *p;
> +
> + p = memchr(buf, ':', nbytes);
> + if (!p)
> + return -EINVAL;
> + *p++ = '\0';
> +
> + /* i/o bandiwth is expressed in KiB/s */
typo.
This comment is incorrect, isn't it? Or at least misleading. The
bandwidth can be expressed in an exotically broad number of different
ways.
> + *val = ALIGN(memparse(p, &p), 1024) >> 10;
> + if (*p)
> + return -EINVAL;
> +
> + *dev = devname2dev_t(buf);
> + if (!*dev)
> + return -ENOTBLK;
> +
> + return 0;
> +}
uninline...
I think the whole memparse() thing is over the top:
+- BANDWIDTH is the maximum I/O bandwidth on DEVICE allowed by CGROUP (we can
+ use a suffix k, K, m, M, g or G to indicate bandwidth values in KB/s, MB/s
+ or GB/s),
For starters, we don't _display_ the bandwidth back to the user in the
units with which it was written, so what's the point?
Secondly, we hope and expect that humans won't be directly echoing raw
data into kernel pseudo files. We should expect and plan for (or even
write) front-end management applications. And such applications won't
need these ornate designed-for-human interfaces.
IOW: I'd suggest this interface be changed to accept a plain old 64-bit
bytes-per-second and leave it at that.
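Something along these lines (just a sketch, using strict_strtoull())
would be enough:

	unsigned long long bw;

	/* plain bytes per second, no unit suffixes */
	if (strict_strtoull(buf, 10, &bw))
		return -EINVAL;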
> +static ssize_t iothrottle_write(struct cgroup *cont,
> + struct cftype *cft,
> + struct file *file,
> + const char __user *userbuf,
> + size_t nbytes, loff_t *ppos)
> +{
> + struct iothrottle *iot;
> + struct iothrottle_node *n, *tmpn = NULL;
> + char *buffer, *tmpp;
Please avoid variables called tmp or things derived from it. Surely we
can think of some more communicative identifier?
> + dev_t dev;
> + unsigned long val;
> + int ret;
> +
> + if (!nbytes)
> + return -EINVAL;
> +
> + /* Upper limit on largest io-throttle rule string user might write. */
> + if (nbytes > 1024)
> + return -E2BIG;
> +
> + buffer = kmalloc(nbytes + 1, GFP_KERNEL);
> + if (!buffer)
> + return -ENOMEM;
> +
> + if (copy_from_user(buffer, userbuf, nbytes)) {
> + ret = -EFAULT;
> + goto out1;
> + }
> +
> + buffer[nbytes] = '\0';
strncpy_from_user()? (I'm not sure that strncpy_from_user() does the
null-termination as desired).
err, no.
I don't know what this is doing or why it was added, but whatever it is
it's a hack and it all needs to go away.
Please describe what problem this is trying to solve and let's take a
look at it. That should have been covered in code comments anyway.
But because it wasn't, I am presently unable to help.
Are you sure that n->iorate cannot be set to zero between the above
test and this division? Taking a copy into a local variable would fix
that small race.
I'm confused. This code is using jiffies but the string "HZ" doesn't
appears anywhere in the diff. Where are we converting from the
kernel-internal HZ rate into suitable-for-exposing-to-userspace units?
HZ can vary from 100 to 1000 (approx). What are the implications of
this for the accuracy of this code?
I have no comments on the overall design. I'm not sure that I understand
it yet.
We do this on every submit_bio(READ)? Ow.
I guess it's not _hugely_ expensive, but the lengthy pointer hop in
task_subsys_state() is going to cost.
All this could be made cheaper if we were to reduce the sampling rate:
only call cgroup_io_throttle() on each megabyte of IO (for example).
current->amount_of_io += bio->bi_size;
if (current->amount_of_io > 1024*1024) {
cgroup_io_throttle(bio->bi_bdev, bio->bi_size);
current->amount_of_io -= 1024 * 1024;
}
but perhaps that has the potential to fail to throttle correctly when
accessing multiple devices, dunno.
Bear in mind that tasks such as pdflush and kswapd will do reads when
performing writeout (indirect blocks).
Ah, and here we see the reason for the __cant_sleep() hack.
It doesn't work, sorry. in_atomic() will return false when called
under spinlock when CONFIG_PREEMPT=n. Your code will call
schedule_timeout_killable() under spinlock and is deadlockable.
We need to think of something smarter here. This problem has already
been largely solved at the balance_dirty_pages() level. Perhaps that
is where the io-throttling should be cutting in?
--
What about ratelimiting the sampling based on i/o requests?
current->io_requests++;
if (current->io_requests > CGROUP_IOTHROTTLE_RATELIMIT) {
cgroup_io_throttle(bio->bi_bdev, bio->bi_size);
current->io_requests = 0;
}
The throttle would fail for large bio->bi_size requests, but it would
work also with multiple devices. And probably this would penalize tasks
having a seeky i/o workload (many requests means more checks for
throttling).
Thanks for your time and the detailed review. I'll try to fix
everything you reported and document the code better, following your
suggestions. I'll re-submit a new patchset version ASAP.
A few comments below.
Andrew Morton wrote:
[snip]
and BTW I was actually wondering if the output could be changed to
something less human-readable and more easily parseable, I mean just
print only raw numbers, and describe the semantics in the documentation.
>> +static inline dev_t devname2dev_t(const char *buf)
>> +{
>> + struct block_device *bdev;
>> + dev_t ret;
>> +
>> + bdev = lookup_bdev(buf);
>> + if (IS_ERR(bdev))
>> + return 0;
>> +
>> + BUG_ON(!bdev->bd_inode);
>> + ret = bdev->bd_inode->i_rdev;
>> + bdput(bdev);
>> +
>> + return ret;
>> +}
>
> Too large to inline. I get tired of telling people this. Please just
> remove all the inlining from all the patches. Then go back and
> selectively inline any functions which really do need to be inlined
> (overall reduction in generated .text is a good heuristic).
>
> How can this function not be racy? We're returning a dev_t which
> refers to a device upon which we have no reference. A better design
> might be to rework the who9le thing to operate on a `struct
> block_device *' upon whcih this code holds a reference, rather than
> using bare dev_t.
However, holding a reference wouldn't allow unplugging the device, e.g.
a USB disk. As reported in Documentation/controllers/io-throttle.txt:
WARNING: per-block device limiting rules always refer to the dev_t
device number. If a block device is unplugged (i.e. a USB device) the
limiting rules defined for that device persist and they are still valid
if a new device is plugged in the system and it uses the same major and
minor numbers.
This would be a feature in my case, but I don't know if it would be a
bug in general.
> I _guess_ it's OK doing an in-kernel filesystem lookup here. But did
> you consider just passing the dev_t in from userspace? It's just a
> stat().
Yes, and that seems more reasonable, since we display major,minor
numbers in the output.
> Does all this code treat /dev/sda1 as a separate device from /dev/sda2?
> If so, that would be broken.
Yes, all the partitions are treated as separate devices with
(potentially) different limiting rules, but I don't understand why it
would be broken... dev_t has both minor and major numbers, so it would
be possible to select single partitions as well.
>> +static inline int iothrottle_parse_args(char *buf, size_t nbytes,
>> + dev_t *dev, unsigned long *val)
>> +{
>> + char *p;
>> +
>> + p = memchr(buf, ':', nbytes);
>> + if (!p)
>> + return -EINVAL;
>> + *p++ = '\0';
>> +
>> + /* i/o bandiwth is expressed in KiB/s */
>
> typo.
>
> This comment is incorrect, isn't it? Or at least misleading. The
> bandwidth can be expressed in an exotically broad number of different
> ways.
Yes.
>
>> + *val = ALIGN(memparse(p, &p), 1024) >> 10;
>> + if (*p)
>> + return -EINVAL;
>> +
>> + *dev = devname2dev_t(buf);
>> + if (!*dev)
>> + return -ENOTBLK;
>> +
>> + return 0;
>> +}
>
> uninline...
>
> I think the whole memparse() thing is over the top:
>
> +- BANDWIDTH is the maximum I/O bandwidth on DEVICE allowed by CGROUP (we can
> + use a suffix k, K, m, M, g or G to indicate bandwidth values in KB/s, MB/s
> + or GB/s),
>
> For starters, we don't _display_ the bacndwidth back to the user in the
> units with which it was written, so what's the point?
>
> Secondly, we hope and expect that humans won't be diorectly echoing raw
> data into kernel pseudo files. We shouild expect and plan for (or even
> write) front-end management applications. And such applications won't
> need these ornate designed-for-human interfaces.
>
> IOW: I'd suggest this interface be changed to accept a plain old 64-bit
> bytes-per-second and leave it at that.
I agree.
n->iorate can only change via the userspace->kernel interface, which
just replaces the node in the list the RCU way. AFAIK this shouldn't be
racy, but it's better to use a local variable to avoid future bugs.
The code uses jiffies_to_usecs() and usecs_to_jiffies(), so that should
be OK, shouldn't it?
-Andrea
> > Does all this code treat /dev/sda1 as a separate device from /dev/sda2?
> > If so, that would be broken.
>
> Yes, all the partitions are treated as separate devices with
> (potentially) different limiting rules, but I don't understand why it
> would be broken... dev_t has both minor and major numbers, so it would
> be possible to select single partitions as well.
Well it's functionally broken, isn't it? A physical disk has a fixed
IO bandwidth and when the administrator wants to partition that
bandwidth amongst control groups he will need to consider the entire
device when doing so?
I mean, the whole point of this feature and of control groups as a
whole is isolation. But /dev/sda1 and /dev/sda2 are very much _not_
isolated. Whereas /dev/sda and /dev/sdb are (to a large degree)
isolated.
> > All this could be made cheaper if we were to reduce the sampling rate:
> > only call cgroup_io_throttle() on each megabyte of IO (for example).
> >
> > current->amount_of_io += bio->bi_size;
> > if (current->amount_of_io > 1024*1024) {
> > cgroup_io_throttle(bio->bi_bdev, bio->bi_size);
> > current->amount_of_io -= 1024 * 1024;
> > }
>
> What about ratelimiting the sampling based on i/o requests?
>
> current->io_requests++;
> if (current->io_requests > CGROUP_IOTHROTTLE_RATELIMIT) {
> cgroup_io_throttle(bio->bi_bdev, bio->bi_size);
> current->io_requests = 0;
> }
>
> The throttle would fail for large bio->bi_size requests, but it would
> work also with multiple devices. And probably this would penalize tasks
> having a seeky i/o workload (many requests means more checks for
> throttling).
Yup. To a large degree, a 4k IO has a similar cost to a 1MB IO.
Certainly the 1MB IO is not 256 times as expensive!
Some sort of averaging fudge factor could be used here. For example, a
1MB IO might be considered, umm 3.4 times as expensive as a 4k IO. But
it varies a lot depending upon the underlying device. For a USB stick,
sure, we're close to 256x. For a slow-seek, high-bandwidth device
(optical?) it's getting closer to 1x. No single fudge-factor will suit
all devices, hence userspace-controlled tunability would be needed here
to avoid orders-of-magnitude inaccuracies.
The above
cost ~= per-device-fudge-factor * io-size
can of course go horridly wrong because it doesn't account for seeks at
all. Some heuristic which incorporates per-cgroup seekiness (in some
weighted fashion) will help there.
I dunno. It's not a trivial problem, and I suspect that we'll need to
get very fancy in doing this if we are to avoid an implementation which
goes unusably badly wrong in some situations.
I wonder if we'd be better off with a completely different design.
<thinks for five seconds>
At present we're proposing that we look at the request stream and
a-priori predict how expensive it will be. That's hard. What if we
were to look at the requests post-facto and work out how expensive they
_were_? Say, for each request which this cgroup issued, look at the
start-time and the completion-time. Take the difference (elapsed time)
and divide that by wall time. Then we end up with a simple percentage:
"this cgroup is using the disk 10% of the time".
That's fine, as long as nobody else is using the device! If multiple
cgroups are using the device then we need to do, err, something.
Or do we? Maybe not - things will, on average, sort themselves out,
because everyone will slow everyone else down. It'll have inaccuracies
and failure scenarios and the smarty-pants IO schedulers will get in
the way.
Interesting project, but I do suspect that we'll need a lot of
complexity in there (and huge amounts of testing) to get something
which is sufficiently accurate to be generally useful.
well... yes, sounds reasonable. In this case we could just ignore the
minor number and use only the major number as the key to identify a
specific block device (both for the userspace<->kernel interface and
when accounting/throttling i/o requests).
-Andrea
oops.. no, this is obviously wrong. So, I dunno if it would be better to
add complexity in cgroup_io_throttle() to identify the disk a partition
belongs to, or to just use the struct block_device as key instead of dev_t,
as you initially suggested. I'll investigate.
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
Documentation/controllers/io-throttle.txt | 265 +++++++++++++++++++++++++++++
1 files changed, 265 insertions(+), 0 deletions(-)
create mode 100644 Documentation/controllers/io-throttle.txt
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..578d78e
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,265 @@
+
+ Block device I/O bandwidth controller
+
+1. Description
+
+This controller makes it possible to limit the I/O bandwidth of specific block
+devices for specific process containers (cgroups), imposing additional delays
+on I/O requests for those processes that exceed the limits defined in the
+control group filesystem.
+
+Bandwidth limiting rules offer better control over QoS than priority or
+weight-based solutions, which only express the applications' relative
+performance requirements. Moreover, priority-based solutions are affected by
+performance bursts when only low-priority requests are submitted to a general
+purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and provide performance isolation of different control groups
+sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is enforced only by slowing down the I/O "traffic" that exceeds
+the limits specified by the user. Minimum I/O rate thresholds can be
+considered guaranteed only if the user configures a proper I/O bandwidth
+partitioning of the block devices shared among the different cgroups (in
+theory, if the sum of all the single limits defined for a block device doesn't
+exceed the total I/O bandwidth of that device).
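+
+For example, on a disk that sustains roughly 80 MiB/s, three cgroups limited
+to 20 MiB/s, 20 MiB/s and 30 MiB/s sum to 70 MiB/s, so each cgroup should
+normally be able to reach its own limit (the numbers are purely illustrative).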
+
+2. User Interface
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+The same file can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The syntax to configure a limiting rule is the following:
+
+# /bin/echo DEV:BW:STRATEGY:BUCKET_SIZE > CGROUP/blockio.bandwidth
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- BW is the maximum I/O bandwidth on DEV allowed for CGROUP; bandwidth must
+  be expressed in bytes/s.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used:
+
+  0 = leaky bucket: the controller accepts at most B bytes (B = BW * time);
+                    further I/O requests are delayed by scheduling a timeout
+                    for the tasks that made those requests.
+
+ Different I/O flow
+ | | |
+ | v |
+ | v
+ v
+ .......
+ \ /
+ \ / leaky-bucket
+ ---
+ |||
+ vvv
+ Smoothed I/O flow
+
+  1 = token bucket: BW tokens are added to the bucket every second; the bucket
+                    can hold at most BUCKET_SIZE tokens; I/O requests are
+                    accepted if there are available tokens in the bucket; when
+                    a request of N bytes arrives, N tokens are removed from
+                    the bucket; if fewer than N tokens are available, the
+                    request is delayed until a sufficient amount of tokens is
+                    available in the bucket.
+
+ Tokens (I/O rate)
+ o
+ o
+ o
+ ....... <--.
+ \ / | Bucket size (burst limit)
+ \ooo/ |
+ --- <--'
+ |ooo
+ Incoming --->|---> Conforming
+ I/O |oo I/O
+ requests -->|--> requests
+ |
+ ---->|
+
+  Leaky bucket is more precise than token bucket at respecting the bandwidth
+  limits, because bursty workloads are always smoothed. Token bucket, instead,
+  allows a small degree of irregularity in the I/O flows (the burst limit),
+  and for this reason it is better in terms of efficiency (bursty workloads
+  are not smoothed when there are sufficient tokens in the bucket).
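+
+  For example, with BW = 1 MiB/s and BUCKET_SIZE = 4 MiB, a task that has been
+  idle long enough for the bucket to fill can burst 4 MiB at full device
+  speed; after that it is throttled down to 1 MiB/s, the rate at which tokens
+  are refilled (the numbers are purely illustrative).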
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes.
+
+- CGROUP is the name of the limited process container.
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the file blockio.bandwidth. The following syntax is used:
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR MINOR BW STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- BW, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the amount of bytes currently allowed by the I/O bandwidth
+ controller (only used with leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the amount of tokens present in the bucket (only used
+ with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+  - the time elapsed (in milliseconds) since the last I/O request (token
+    bucket)
+  - the time (in milliseconds) during which the bytes given by LEAKY_STAT
+    have been accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR2 MINOR2 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+I/O bandwidth limiting rules can be removed by setting the BW value to 0.
+
+Examples:
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; the I/O limits and usage
+  defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 0 16777216 0 0 0 0 110388
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (e.g. a USB device), the limiting rules
+defined for that device persist and they are still valid if a new device is
+plugged into the system and happens to use the same major and minor numbers.
+
+5. Todo
+
+* Think about an alternative design for general purpose usage; right now the
+  special purpose usage is restricted to improving I/O performance
+  predictability and evaluating more precise response timings for applications
+  doing I/O. To be useful in the general case, the block I/O bandwidth
+  controller should implement a more complex logic to better evaluate the real
+  cost of I/O operations, depending also on the particular block device
+  profile (e.g. USB stick, optical drive, hard disk, etc.). This would also
+  allow I/O cost to be accounted appropriately for seeky workloads with
+  respect to large streaming workloads. Instead of looking at the request
+  stream and trying to predict how expensive the I/O will be, a totally
+  different approach could be to collect request timings (start time / elapsed
+  time) and, based on the collected information, estimate the I/O cost and
+  usage (idea proposed by Andrew Morton <ak...@linux-foundation.org>).
+
+* Correctly handle AIO: at the moment the approach is to make a task sleep
+  even when doing asynchronous I/O. A more reasonable behaviour would be to
+  return EAGAIN from aio_write()/aio_read()
+ (reported by Eric Rannaud <eric.r...@gmail.com>).
--
1.5.4.3
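As a concrete illustration of the interface documented above, here is a small
userspace sketch (entirely hypothetical, not part of the patchset) that
composes the DEV:BW:STRATEGY:BUCKET_SIZE string and writes it to a cgroup:

	/* Hypothetical helper for the blockio.bandwidth interface. */
	#include <stdio.h>

	static int set_bw_rule(const char *cgroup_dir, const char *dev,
			       unsigned long long bw, int strategy,
			       unsigned long long bucket_size)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path), "%s/blockio.bandwidth", cgroup_dir);
		f = fopen(path, "w");
		if (!f)
			return -1;
		/* e.g. "/dev/sda:1048576:0:0" = 1 MiB/s, leaky bucket */
		fprintf(f, "%s:%llu:%d:%llu", dev, bw, strategy, bucket_size);
		return fclose(f);
	}

	int main(void)
	{
		/* limit cgroup "foo" to 1 MiB/s on /dev/sda, leaky bucket */
		return set_bw_rule("/mnt/cgroup/foo", "/dev/sda",
				   1024 * 1024, 0, 0);
	}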
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/Makefile | 2 +
block/blk-io-throttle.c | 529 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 14 +
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 10 +
5 files changed, 561 insertions(+), 0 deletions(-)
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h
diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..8dec69b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -14,3 +14,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..caf740a
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,529 @@
+#include <linux/genhd.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/blk-io-throttle.h>
+
+/**
+ * struct iothrottle_node - throttling rule of a single block device
+ * @node: list of per block device throttling rules
+ * @dev: block device number, used as key in the list
+ * @iorate: max i/o bandwidth (in bytes/s)
+ * @strategy: throttling strategy (0 = leaky bucket, 1 = token bucket)
+ * @timestamp: timestamp of the last I/O request (in jiffies)
+ * @stat: i/o activity counter (leaky bucket only)
+ * @bucket_size: bucket size in bytes (token bucket only)
+ * @token: token counter (token bucket only)
+ *
+ * Define an i/o throttling rule for a single block device.
+ *
+ * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
+ * the limiting rules defined for that device persist and they are still valid
+ * if a new device is plugged and it uses the same dev_t number.
+ */
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+ u64 iorate;
+ long strategy;
+ unsigned long timestamp;
+ atomic_long_t stat;
+ s64 bucket_size;
+ atomic_long_t token;
+};
+
+/**
+ * struct iothrottle - throttling rules for a cgroup
+ * @css: pointer to the cgroup state
+ * @lock: spinlock used to protect write operations in the list
+ * @list: list of iothrottle_node elements
+ *
+ * Define multiple per-block device i/o throttling rules.
+ * Note: the list of the throttling rules is protected by RCU locking.
+ */
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ spinlock_t lock;
+ struct list_head list;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() held.
+ */
+static struct iothrottle_node *
+iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ list_add_rcu(&n->node, &iot->list);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline struct iothrottle_node *
+iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ list_replace_rcu(&old->node, &new->node);
+ return old;
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static struct iothrottle_node *
+iothrottle_delete_node(struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev) {
+ list_del_rcu(&n->node);
+ return n;
+ }
+ return NULL;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct iothrottle *iot;
+
+ iot = kmalloc(sizeof(*iot), GFP_KERNEL);
+	if (unlikely(!iot))
+		return ERR_PTR(-ENOMEM);
+	INIT_LIST_HEAD(&iot->list);
+	spin_lock_init(&iot->lock);
+
+	return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct iothrottle_node *n, *p;
+	struct iothrottle *iot = cgroup_to_iothrottle(cont);
+
+	/*
+	 * don't worry about locking here: at this point there must be no
+	 * reference left to this list
+	 */
+	list_for_each_entry_safe(n, p, &iot->list, node)
+		kfree(n);
+	kfree(iot);
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
+ struct file *file, char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct iothrottle *iot;
+ char *buffer;
+ int s = 0;
+ struct iothrottle_node *n;
+ ssize_t ret;
+
+ buffer = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+ rcu_read_lock();
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ unsigned long delta;
+
+ BUG_ON(!n->dev);
+ delta = jiffies_to_msecs((long)jiffies - (long)n->timestamp);
+ s += scnprintf(buffer + s, nbytes - s,
+ "%u %u %llu %li %li %lli %li %lu\n",
+ MAJOR(n->dev), MINOR(n->dev), n->iorate,
+ n->strategy, atomic_long_read(&n->stat),
+ n->bucket_size, atomic_long_read(&n->token),
+ delta);
+ }
+ rcu_read_unlock();
+ ret = simple_read_from_buffer(userbuf, nbytes, ppos, buffer, s);
+out:
+ cgroup_unlock();
+ kfree(buffer);
+ return ret;
+}
+
+static dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t dev = 0;
+ struct gendisk *disk;
+ int part;
+
+ /* use a lookup to validate the block device */
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+
+ /* only entire devices are allowed, not single partitions */
+ disk = get_gendisk(bdev->bd_dev, &part);
+ if (disk && !part) {
+ BUG_ON(!bdev->bd_inode);
+ dev = bdev->bd_inode->i_rdev;
+ }
+ bdput(bdev);
+
+ return dev;
+}
+
+/*
+ * The userspace input string must use the following syntax:
+ *
+ * device:bw-limit:strategy:bucket-size
+ */
+static int iothrottle_parse_args(char *buf, size_t nbytes,
+ dev_t *dev, u64 *iorate,
+ long *strategy, s64 *bucket_size)
+{
+ char *ioratep, *strategyp, *bucket_sizep;
+ int ret;
+
+ ioratep = memchr(buf, ':', nbytes);
+ if (!ioratep)
+ return -EINVAL;
+ *ioratep++ = '\0';
+
+ strategyp = memchr(ioratep, ':', buf + nbytes - ioratep);
+ if (!strategyp)
+ return -EINVAL;
+ *strategyp++ = '\0';
+
+ bucket_sizep = memchr(strategyp, ':', buf + nbytes - strategyp);
+ if (!bucket_sizep)
+ return -EINVAL;
+ *bucket_sizep++ = '\0';
+
+ /* i/o bandwidth limit (0 to delete a limiting rule) */
+ ret = strict_strtoull(ioratep, 10, iorate);
+ if (ret < 0)
+ return ret;
+ *iorate = ALIGN(*iorate, 1024);
+
+ /* throttling strategy */
+ ret = strict_strtol(strategyp, 10, strategy);
+ if (ret < 0)
+ return ret;
+
+ /* bucket size */
+ ret = strict_strtoll(bucket_sizep, 10, bucket_size);
+ if (ret < 0)
+ return ret;
+ if (*bucket_size < 0)
+ return -EINVAL;
+ *bucket_size = ALIGN(*bucket_size, 1024);
+
+ /* block device number */
+ *dev = devname2dev_t(buf);
+ if (!*dev)
+ return -EINVAL;
+
+ return 0;
+}
+
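+/*
+ * Example (illustrative): the string "/dev/sda:1048576:0:0" parses to
+ * /dev/sda's dev_t, iorate = 1048576 bytes/s (1 MiB/s), strategy 0
+ * (leaky bucket) and bucket_size = 0.
+ */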
+static ssize_t iothrottle_write(struct cgroup *cont, struct cftype *cft,
+ struct file *file, const char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *newn = NULL;
+ char *buffer, *s;
+ dev_t dev;
+ u64 iorate;
+ long strategy;
+ s64 bucket_size;
+ int ret;
+
+ if (!nbytes)
+ return -EINVAL;
+
+	/* Upper limit on the largest io-throttle rule string a user might write */
+ if (nbytes > 1024)
+ return -E2BIG;
+
+ buffer = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ ret = strncpy_from_user(buffer, userbuf, nbytes);
+ if (ret < 0)
+ goto out1;
+ buffer[nbytes] = '\0';
+ s = strstrip(buffer);
+
+ ret = iothrottle_parse_args(s, nbytes, &dev, &iorate,
+ &strategy, &bucket_size);
+ if (ret)
+ goto out1;
+
+ if (iorate) {
+ newn = kmalloc(sizeof(*newn), GFP_KERNEL);
+ if (!newn) {
+ ret = -ENOMEM;
+ goto out1;
+ }
+ newn->dev = dev;
+ newn->iorate = iorate;
+ newn->strategy = strategy;
+ newn->bucket_size = bucket_size;
+ newn->timestamp = jiffies;
+ atomic_long_set(&newn->stat, 0);
+ atomic_long_set(&newn->token, 0);
+ }
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ ret = -ENODEV;
+ goto out2;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+ spin_lock(&iot->lock);
+ if (!iorate) {
+ /* Delete a block device limiting rule */
+ n = iothrottle_delete_node(iot, dev);
+ goto out3;
+ }
+ n = iothrottle_search_node(iot, dev);
+ if (n) {
+ /* Update a block device limiting rule */
+ iothrottle_replace_node(iot, n, newn);
+ goto out3;
+ }
+ /* Add a new block device limiting rule */
+ iothrottle_insert_node(iot, newn);
+out3:
+ ret = nbytes;
+ spin_unlock(&iot->lock);
+ if (n) {
+ synchronize_rcu();
+ kfree(n);
+ }
+out2:
+ cgroup_unlock();
+out1:
+ kfree(buffer);
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth",
+ .read = iothrottle_read,
+ .write = iothrottle_write,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+};
+
+/*
+ * Note: called with rcu_read_lock() held.
+ */
+static unsigned long leaky_bucket(struct iothrottle_node *n, size_t bytes)
+{
+ unsigned long delta, t;
+ long sleep;
+
+ /* Account the i/o activity */
+ atomic_long_add(bytes, &n->stat);
+
+ /* Evaluate if we need to throttle the current process */
+ delta = (long)jiffies - (long)n->timestamp;
+ if (!delta)
+ return 0;
+
+	/*
+	 * NOTE: n->iorate cannot be set to zero here, because iorate can only
+	 * change via the userspace->kernel interface, which in case of an
+	 * update fully replaces the iothrottle_node pointer in the list, in
+	 * the RCU way.
+	 */
+ t = usecs_to_jiffies(atomic_long_read(&n->stat)
+ * USEC_PER_SEC / n->iorate);
+ if (!t)
+ return 0;
+
+ sleep = t - delta;
+ if (unlikely(sleep > 0))
+ return sleep;
+
+ /* Reset i/o statistics */
+ atomic_long_set(&n->stat, 0);
+	/*
+	 * NOTE: be sure the i/o statistics have been reset before updating the
+	 * timestamp, otherwise another CPU may read a very small time delta
+	 * w.r.t. the accounted i/o statistics, generating unnecessarily long
+	 * sleeps.
+	 */
+ smp_wmb();
+ n->timestamp = jiffies;
+ return 0;
+}
+
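+/*
+ * Worked example (illustrative numbers): with a 1 MiB/s limit, a cgroup
+ * that has accumulated stat = 2 MiB of i/o has "earned" t = 2 seconds of
+ * budget; if only delta = 1 second of jiffies has actually elapsed, the
+ * caller is asked to sleep for the remaining t - delta = 1 second.
+ */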
+/*
+ * Note: called with rcu_read_lock() held.
+ * XXX: need locking in order to evaluate a consistent sleep???
+ */
+static unsigned long token_bucket(struct iothrottle_node *n, size_t bytes)
+{
+ unsigned long iorate = n->iorate / MSEC_PER_SEC;
+ unsigned long delta;
+ long tok;
+
+ BUG_ON(!iorate);
+
+ atomic_long_sub(bytes, &n->token);
+ delta = jiffies_to_msecs((long)jiffies - (long)n->timestamp);
+ n->timestamp = jiffies;
+ tok = atomic_long_read(&n->token);
+	if (delta && tok < n->bucket_size) {
+		tok += delta * iorate;
+		pr_debug("io-throttle: adding %lu tokens\n", delta * iorate);
+		if (tok > n->bucket_size)
+			tok = n->bucket_size;
+		atomic_long_set(&n->token, tok);
+	}
+
+ return (tok < 0) ? msecs_to_jiffies(-tok / iorate) : 0;
+}
+
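+/*
+ * Worked example (illustrative numbers): with a 1 MiB/s limit, iorate
+ * above is 1048576 / 1000 = 1048 bytes per millisecond. If a burst
+ * leaves token = -104800, the caller sleeps msecs_to_jiffies(100),
+ * i.e. roughly the 100ms needed to earn those tokens back.
+ */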
+/**
+ * cgroup_io_throttle() - account and throttle i/o activity
+ * @bdev: block device involved for the i/o.
+ * @bytes: size in bytes of the i/o operation.
+ * @can_sleep: set to 1 if we're in a sleepable context, 0 otherwise; in a
+ *             non-sleepable context we only account the i/o activity without
+ *             applying any throttling sleep.
+ *
+ * This is the core of the block device i/o bandwidth controller. This function
+ * must be called by any function that generates i/o activity (directly or
+ * indirectly). It provides both i/o accounting and throttling functionalities;
+ * throttling is disabled if @can_sleep is set to 0.
+ **/
+void cgroup_io_throttle(struct block_device *bdev, size_t bytes, int can_sleep)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n;
+ dev_t dev;
+ unsigned long sleep;
+
+ if (unlikely(!bdev))
+ return;
+
+ iot = task_to_iothrottle(current);
+ if (unlikely(!iot))
+ return;
+
+ BUG_ON(!bdev->bd_inode || !bdev->bd_disk);
+ dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
+
+ rcu_read_lock();
+ n = iothrottle_search_node(iot, dev);
+ if (!n || !n->iorate) {
+ rcu_read_unlock();
+ return;
+ }
+ switch (n->strategy) {
+ case 0:
+ sleep = leaky_bucket(n, bytes);
+ break;
+ case 1:
+ sleep = token_bucket(n, bytes);
+ break;
+ default:
+ sleep = 0;
+ }
+ if (unlikely(can_sleep && sleep)) {
+ rcu_read_unlock();
+ pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
+ current, current->comm, sleep);
+ schedule_timeout_killable(sleep);
+ return;
+ }
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(cgroup_io_throttle);
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..0fe7430
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,14 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void
+cgroup_io_throttle(struct block_device *bdev, size_t bytes, int can_sleep);
+#else
+static inline void
+cgroup_io_throttle(struct block_device *bdev, size_t bytes, int can_sleep)
+{
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif /* BLK_IO_THROTTLE_H */
Signed-off-by: Andrea Righi <righi....@gmail.com>
---
block/blk-core.c | 2 ++
fs/buffer.c | 10 ++++++++++
fs/direct-io.c | 3 +++
mm/page-writeback.c | 20 ++++++++++++++++++++
mm/readahead.c | 5 +++++
5 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 1905aab..2ac5463 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/blktrace_api.h>
@@ -1482,6 +1483,7 @@ void submit_bio(int rw, struct bio *bio)
count_vm_events(PGPGOUT, count);
} else {
task_io_account_read(bio->bi_size);
+ cgroup_io_throttle(bio->bi_bdev, bio->bi_size, 1);
count_vm_events(PGPGIN, count);
}
diff --git a/fs/buffer.c b/fs/buffer.c
index 0f51c0f..6f41c3a 100644
+ cgroup_io_throttle(bdev, cgroup_io_acct, 0);
return 1;
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9e81add..42c8e54 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/blk-io-throttle.h>
#include <asm/atomic.h>
/*
@@ -666,6 +667,8 @@ submit_page_section(struct dio *dio, struct page *page,
/*
* Read accounting is performed in submit_bio()
*/
+ struct block_device *bdev = dio->bio ? dio->bio->bi_bdev : NULL;
+ cgroup_io_throttle(bdev, len, 1);
task_io_account_write(len);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 789b6ad..8e5e99d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -430,6 +431,9 @@ static void balance_dirty_pages(struct address_space *mapping)
unsigned long write_chunk = sync_writeback_pages();
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct block_device *bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
for (;;) {
struct writeback_control wbc = {
@@ -512,6 +516,14 @@ static void balance_dirty_pages(struct address_space *mapping)
return; /* pdflush is already working this queue */
/*
+ * Apply the cgroup i/o throttling limitations. The accounting of write
+ * activity in page cache is performed in __set_page_dirty(), but since
+ * we cannot sleep there, 0 bytes are accounted here and the function
+ * is invoked only for throttling purpose.
+ */
+ cgroup_io_throttle(bdev, 0, 1);
+
+ /*
* In laptop mode, we wait until hitting the higher threshold before
* starting background writeout, and then write out all the way down
* to the lower threshold. So slow writers cause minimal disk activity.
@@ -1077,6 +1089,8 @@ int __set_page_dirty_nobuffers(struct page *page)
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;
+ struct block_device *bdev = NULL;
+ size_t cgroup_io_acct = 0;
if (!mapping)
return 1;
@@ -1087,10 +1101,15 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
if (mapping_cap_account_dirty(mapping)) {
+ bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -1100,6 +1119,7 @@ int __set_page_dirty_nobuffers(struct page *page)
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
+ cgroup_io_throttle(bdev, cgroup_io_acct, 1);
return 1;
}
return 0;
diff --git a/mm/readahead.c b/mm/readahead.c
index d8723a5..ec83e63 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
@@ -58,6 +59,9 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
+ struct block_device *bdev =
+ (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
int ret = 0;
while (!list_empty(pages)) {
@@ -76,6 +80,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_throttle(bdev, PAGE_CACHE_SIZE, 1);
}
return ret;
Compared to other priority/weight-based solutions, the approach used by this
controller is to explicitly choke the applications' requests that directly (or
indirectly) generate i/o activity in the system.
The direct bandwidth limiting method has the advantage of improving performance
predictability, at the cost of reducing, in general, the overall performance of
the system (in terms of throughput).
Detailed information about the design, its goals and usage is provided in the
documentation.
Tested against latest git (2.6.26-rc8).
The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
Most of the changes in v4 are based on the Andrew Morton's review of patchset
v3. Thanks Andrew.
Changelog: (v3 -> v4)
- avoid potential deadlock in __set_page_dirty() with CONFIG_PREEMPT=n
- do not treat partitions as separate block devices: only entire block
  devices are allowed to define i/o throttling rules; moreover, i/o activity
  on partitions is accounted to the entire block device they belong to.
- reworked userspace<->kernel interface: accept and store all values in
bytes/sec (a userspace front-end application will take care of properly
showing and accepting values in human-readable format)
- uninlined a lot of functions
- code formatting fixes
- more documentation
Todo:
- see documentation
-Andrea
Peter,
I'm only seeing your message now; it seems you didn't add me in To: or Cc:.
Anyway, I totally agree with you, but it seems there's a
misunderstanding here. The block device i/o bw controller *does*
throttle by slowing down the applications' requests, not the dispatching
of the already submitted i/o requests.
IMHO, for the same reason you pointed out, delaying the dispatching of i/o
requests simply leads to excessive page cache and buffer
consumption, because the userspace apps' dirty ratio is actually never
limited.
As reported in the io-throttle documentation:
"This controller allows to limit the I/O bandwidth of specific block
devices for specific process containers (cgroups) imposing additional
delays on I/O requests for those processes that exceed the limits
defined in the control group filesystem."
Do you think we can use a better wording to describe this concept?
-Andrea