
Slab Fragmentation Reduction V15


Christoph Lameter (Jan 29, 2010, 4:00:03 PM)

This is one of those years-long projects to address fundamental issues in the
Linux VM. The problem is that sparse use of objects in slab caches can cause
large amounts of memory to become unusable. The first ideas to address this
were developed in 2005 by various people. Some of the issues with SLAB that
we discovered while prototyping these ideas also contributed to the locking
design in SLUB, which is highly decentralized and allows stabilizing the object
state slab-wise by taking a per-slab lock.

This patchset was first proposed at the beginning of 2007. It was almost merged
in 2008, when last-minute objections arose over the way it interacts with
filesystem objects (inode/dentry).

Andi has asked that we reconsider this issue. So I have updated the patchset
to apply against current upstream (and also -next with a special patch
at the end). The issues with icache/dentry locking remain. In order
for this to be merged we would have to come up with a revised dentry/inode
locking code that can

1. Establish a reference to a dentry/inode so that it is pinned.
Hopefully in a way that is not too expensive (i.e. no superblock
lock)

2. A means to free dentry/inode objects from the VM reclaim context.

Neither of those needs to work reliably; both can fail. Reclaim is a heuristic
process after all. Failure to reclaim will make the allocator skip the slab on
future scans and use it for allocations instead. Once all objects in a slab have
been used and an object is then freed, the slab becomes subject to
VM reclaim scans again.
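
In terms of the patchset's interface, the contract looks roughly like this
(a sketch assembled from the callback signatures visible in the patches
quoted later in this thread; the comments are editorial, and the assumption
that get()'s return value is handed to kick() as "private" is mine):

	/*
	 * get()  - called while the slab is stabilized; may only establish
	 *          stable references to the objects in v[].  Setting
	 *          v[i] = NULL tells the core to skip that object.
	 * kick() - called later from a context that may sleep; tries to
	 *          free each remaining object and drops the references
	 *          taken by get().  It may fail; reclaim is best-effort.
	 */
	void kmem_cache_setup_defrag(struct kmem_cache *s,
		void *(*get)(struct kmem_cache *s, int nr, void **v),
		void (*kick)(struct kmem_cache *s, int nr, void **v,
			     void *private));

	/* Wiring for buffer heads, as done in the defrag_buffer_head patch: */
	kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);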

The other objection to this patchset was that it does not support
reclaim through SLAB. It is possible to add this type of support to SLAB too,
but one would have to take the node's l3 lock to lock down all objects on
a node (and purge the per-cpu caches beforehand). This would stop all
allocations during a reclaim pass on a slab and make targeted reclaim
much more expensive.


Patch description

Slab fragmentation is mainly an issue if Linux is used as a fileserver
and large amounts of dentries, inodes and buffer heads accumulate. In some
load situations the slabs become very sparsely populated so that a lot of
memory is wasted by slabs that only contain one or a few objects. In
extreme cases the performance of a machine will become sluggish since
we are continually running reclaim without much success.
Slab defragmentation adds the capability to recover the memory that
is wasted.

Memory reclaim for the following slab caches is possible:

1. dentry cache
2. inode cache (with a generic interface to allow easy setup of more
filesystems than the currently supported ext2/3/4, reiserfs, XFS
and proc)
3. buffer_heads

One typical mechanism that triggers slab defragmentation on my systems
is the daily run of

updatedb

Updatedb scans all files on the system, which causes high inode and dentry
use. After updatedb is complete we go back to the regular use
patterns (typical on my machine: kernel compiles), which now need the memory
for different purposes. The inodes and dentries used by updatedb will
gradually be aged out by the dentry/inode reclaim algorithm, which frees
dentries and inodes at random positions throughout the slabs that were
allocated. As a result the slabs become sparsely populated. Slabs that
become completely empty can be freed, but a lot of them will remain sparsely
populated. That is where slab defrag comes in: it removes the objects from
slabs with just a few entries, reclaiming more memory for other uses.
In the simplest case (as provided here) this is done by simply reclaiming
the objects.

However, if the logic in the kick() function is made more
sophisticated then we will be able to move objects out of sparse slabs.
If a slab is fragmented, objects can be allocated into it without involving
the page allocator, because a large number of free slots are available.
Moving an object reduces fragmentation in the slab it is moved to.
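
To make the simplest case concrete: a kick() method in this scheme boils
down to walking the array that get() pinned and trying to free each object.
A hedged sketch, modelled on the kick_inodes()/kick_buffers() callbacks
quoted later in this thread (the my_* names are illustrative, not part of
the patchset):

	static void kick_simple(struct kmem_cache *s, int nr, void **v,
				void *private)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct my_object *obj = v[i];

			if (!obj)
				continue;	/* get() told us to skip it */

			if (my_object_unused(obj))	/* illustrative check */
				my_object_free(obj);	/* releases the slab slot */
			else
				my_object_put(obj);	/* just drop get()'s ref */
		}
	}

If every object in a slab page is freed this way, the page itself goes back
to the page allocator; if any object survives, the slab is simply skipped on
future scans as described above.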

V14->V15
- Provide missing Documentation/ABI documentation pieces
- Add -next transition patch
- Re-add the dentry patch
- Put warnings into the patches with issues

V13->V14
- Rediff against linux-next on request of Andrew
- TestSetPageLocked -> trylock_page conversion.

V12->V13:
- Rebase onto Linux 2.6.27-rc1 (deal with page flags conversion, ctor parameters etc)
- Fix uninitialized variable issue

V11->V12:
- Pekka and I fixed various minor issues pointed out by Andrew.
- Split ext2/3/4 defrag support patches.
- Add more documentation
- Revise the way that slab defrag is triggered from reclaim. No longer
use a timeout but track the amount of slab reclaim done by the shrinkers.
Add a field in /proc/sys/vm/slab_defrag_limit to control the threshold.
- Display current slab_defrag_counters in /proc/zoneinfo (for a zone) and
/proc/sys/vm/slab_defrag_count (for global reclaim).
- Add new config value slab_defrag_limit to /proc/sys/vm/slab_defrag_limit
- Add a patch that obsoletes SLAB and explains why SLOB does not support
defrag (Either of those could be theoretically equipped to support
slab defrag in some way but it seems that Andrew/Linus want to reduce
the number of slab allocators).

V10->V11
- Simplify determination when to reclaim: Just scan over all partials
and check if they are sparsely populated.
- Add support for performance counters
- Rediff on top of current slab-mm.
- Reduce frequency of scanning. A look at the stats showed that we
were calling into reclaim very frequently when the system was under
memory pressure which slowed things down. Various measures to
avoid scanning the partial list too frequently were added and the
earlier (expensive) method of determining the defrag ratio of the slab
cache as a whole was dropped. I think this addresses the issues that
Mel saw with V10.

V9->V10
- Rediff against upstream

V8->V9
- Rediff against 2.6.24-rc6-mm1

V7->V8
- Rediff against 2.6.24-rc3-mm2

V6->V7
- Rediff against 2.6.24-rc2-mm1
- Remove lumpy reclaim support. No point anymore given that the antifrag
handling in 2.6.24-rc2 puts reclaimable slabs into different sections.
Targeted reclaim never triggers. This has to wait until we make
slabs movable or we need to perform a special version of lumpy reclaim
in SLUB while we scan the partial lists for slabs to kick out.
Removal simplifies handling significantly since we
get to slabs in a more controlled way via the partial lists.
The patchset now provides pure reduction of fragmentation levels.
- SLAB/SLOB: Provide inlines that do nothing
- Fix various smaller issues that were brought up during review of V6.

V5->V6
- Rediff against 2.6.24-rc2 + mm slub patches.
- Add reviewed by lines.
- Take out the experimental code to make slab pages movable. That
has to wait until this has been considered by Mel.

V4->V5:
- Support lumpy reclaim for slabs
- Support reclaim via slab_shrink()
- Add constructors to ensure a consistent object state at all times.

V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
improved by slab defragmentation.

V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
have slabs with a high degree of fragmentation.

V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
a new dentry flag that indicates that a dentry is not in the process
of being freed or allocated.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Christoph Lameter (Jan 29, 2010, 4:00:03 PM): defrag_fs_generic

Christoph Lameter (Jan 29, 2010, 4:00:02 PM): slub_add_defrag_ratio

Christoph Lameter (Jan 29, 2010, 4:00:01 PM): defrag_dentry

Christoph Lameter (Jan 29, 2010, 4:00:03 PM): slub_add_defrag_stats

Christoph Lameter (Jan 29, 2010, 4:00:02 PM): defrag_buffer_head

Christoph Lameter (Jan 29, 2010, 4:00:03 PM): slub_replace_ctor_field

Christoph Lameter (Jan 29, 2010, 4:00:02 PM): fixup-next

Christoph Lameter (Jan 29, 2010, 4:00:03 PM): ext4-defrag

Christoph Lameter (Jan 29, 2010, 4:00:01 PM): defrag_proc

Christoph Lameter (Jan 29, 2010, 4:00:02 PM): ext3-defrag

Christoph Lameter (Jan 29, 2010, 4:00:03 PM): slub_defrag_core

Al Viro (Jan 29, 2010, 5:10:02 PM)

On Fri, Jan 29, 2010 at 02:49:48PM -0600, Christoph Lameter wrote:
> + if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
> + (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
> + (dentry->d_inode &&
> + !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
> + /* Ignore this dentry */
> + v[i] = NULL;
> + else
> + /* dget_locked will remove the dentry from the LRU */
> + dget_locked(dentry);
> + }
> + spin_unlock(&dcache_lock);
> + return NULL;
> +}

No. As a matter of fact - fuck, no. For one thing, it's going to race
with umount. For another, kicking a busy dentry out of the hash is worse than
useless - you are just asking to get more and more copies of that sucker
in the dcache. This is fundamentally bogus, especially since there is a 100%
safe time for killing a dentry - when dput() drives the refcount to 0 and
you *are* doing dput() on the references you've acquired. If anything, I'd
suggest setting a flag that would trigger immediate freeing on the final
dput().

And that does not cover the umount races. You *can't* go around grabbing
dentries without making sure that the superblock won't be shut down under
you. And no, I don't know how to deal with that cleanly - simply bumping
superblock ->s_count under sb_lock is enough to make sure it's not freed
under you, but what you want is more than that. An active reference would
be enough, except that you'd get a sudden "oh, sorry, now there's no way
to make sure that superblock is shut down at umount(2), no matter what kind
of setup you have". So you really need to get ->s_umount held shared,
which is not particularly locking-order-friendly, to put it mildly.

Dave Chinner (Jan 29, 2010, 9:00:02 PM)

On Fri, Jan 29, 2010 at 02:49:41PM -0600, Christoph Lameter wrote:
> Defragmentation support for buffer heads. We convert the references to
> buffers to struct page references and try to remove the buffers from
> those pages. If the pages are dirty then trigger writeout so that the
> buffer heads can be removed later.

NACK.

We don't want another random single-page writeback trigger in
the VM - it will only slow down the cleaning of dirty pages by causing
disk thrashing (i.e. it turns writeback into a small random write
workload), and that will ultimately slow down the rate at which we can
reclaim buffer heads.

Hence I suggest that if the buffer head is dirty, then just ignore
it - it'll be cleaned soon enough by one of the other mechanisms we
have and then it can be reclaimed in a later pass.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Dave Chinner (Jan 29, 2010, 9:50:01 PM)

On Fri, Jan 29, 2010 at 02:49:42PM -0600, Christoph Lameter wrote:
> This implements the ability to remove inodes in a particular slab
> from inode caches. In order to remove an inode we may have to write out
> the pages of an inode, the inode itself and remove the dentries referring
> to the inode.
>
> Provide generic functionality that can be used by filesystems that have
> their own inode caches to also tie into the defragmentation functions
> that are made available here.
>
> FIXES NEEDED!
>
> Note Miklos comments on the patch at http://lkml.indiana.edu/hypermail/linux/kernel/0810.1/2003.html
>
> The way we obtain a reference to an inode entry may be unreliable since inode
> refcounting works in different ways. Also a reference to the superblock is necessary
> in order to be able to operate on the inodes.
>
> Cc: Miklos Szeredi <mik...@szeredi.hu>
> Cc: Alexander Viro <vi...@ftp.linux.org.uk>
> Cc: Christoph Hellwig <h...@infradead.org>
> Reviewed-by: Rik van Riel <ri...@redhat.com>
> Signed-off-by: Christoph Lameter <clam...@sgi.com>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>
> ---
> fs/inode.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 6 ++
> 2 files changed, 129 insertions(+)
>
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c 2010-01-29 12:03:04.000000000 -0600
> +++ linux-2.6/fs/inode.c 2010-01-29 12:03:25.000000000 -0600
> @@ -1538,6 +1538,128 @@ static int __init set_ihash_entries(char
> __setup("ihash_entries=", set_ihash_entries);
>
> /*
> + * Obtain a refcount on a list of struct inodes pointed to by v. If the
> + * inode is in the process of being freed then zap the v[] entry so that
> + * we skip the freeing attempts later.
> + *
> + * This is a generic function for the ->get slab defrag callback.
> + */
> +void *get_inodes(struct kmem_cache *s, int nr, void **v)
> +{
> + int i;
> +
> + spin_lock(&inode_lock);
> + for (i = 0; i < nr; i++) {
> + struct inode *inode = v[i];
> +
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))

> + v[i] = NULL;
> + else
> + __iget(inode);
> + }
> + spin_unlock(&inode_lock);
> + return NULL;
> +}
> +EXPORT_SYMBOL(get_inodes);

How do you expect defrag to behave when the filesystem doesn't free
the inode immediately during dispose_list()? That is, the above code
only finds inodes that are still active at the VFS level but they
may still live for a significant period of time after the
dispose_list() call. This is a real issue now that XFS has combined
the VFS and XFS inodes into the same slab...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Pekka Enberg (Jan 30, 2010, 4:00:02 AM)

On Fri, Jan 29, 2010 at 10:49 PM, Christoph Lameter
<c...@linux-foundation.org> wrote:
> This is one of those years-long projects to address fundamental issues in the
> Linux VM. The problem is that sparse use of objects in slab caches can cause
> large amounts of memory to become unusable. The first ideas to address this
> were developed in 2005 by various people. Some of the issues with SLAB that
> we discovered while prototyping these ideas also contributed to the locking
> design in SLUB, which is highly decentralized and allows stabilizing the object
> state slab-wise by taking a per-slab lock.
>
> This patchset was first proposed at the beginning of 2007. It was almost merged
> in 2008, when last-minute objections arose over the way it interacts with
> filesystem objects (inode/dentry).

Yeah, I think the SLUB bits were fine, but it wasn't clear whether
or not the FS bits would be merged. There's no point in merging
functionality into SLUB unless it's going to be used.

Andi Kleen (Jan 30, 2010, 5:50:02 AM)

On Fri, Jan 29, 2010 at 02:49:31PM -0600, Christoph Lameter wrote:
> This patchset was first proposed at the beginning of 2007. It was almost merged
> in 2008, when last-minute objections arose over the way it interacts with
> filesystem objects (inode/dentry).
>
> Andi has asked that we reconsider this issue. So I have updated the patchset

Thanks for reposting.

My motivation here is to improve hwpoison soft offlining, but I think
having this would be a general improvement.

> to apply against current upstream (and also -next with a special patch
> at the end). The issues with icache/dentry locking remain. In order
> for this to be merged we would have to come up with a revised dentry/inode
> locking code that can
>
> 1. Establish a reference to a dentry/inode so that it is pinned.
> Hopefully in a way that is not too expensive (i.e. no superblock
> lock)
>
> 2. A means to free dentry/inode objects from the VM reclaim context.


Al, do you have a suggestions on a good way to do that?

I guess the problem could be simplified by ignoring dentries in "unusual"
states?

> The other objection to this patchset was that it does not support
> reclaim through SLAB. It is possible to add this type of support to SLAB too,

I think not supporting SLAB/SLOB is fine.

-Andi

Rik van Riel (Jan 30, 2010, 10:00:03 AM)

On 01/30/2010 05:48 AM, Andi Kleen wrote:
> On Fri, Jan 29, 2010 at 02:49:31PM -0600, Christoph Lameter wrote:

>> 1. Establish a reference to a dentry/inode so that it is pinned.
>> Hopefully in a way that is not too expensive (i.e. no superblock
>> lock)
>>
>> 2. A means to free dentry/inode objects from the VM reclaim context.
>
>
> Al, do you have a suggestions on a good way to do that?

You cannot free inode objects for files that are open, mmapped, etc.

> I guess the problem could be simplified by ignoring dentries in "unusual"
> states?

You mean dentries that are in use? :)

--
All rights reversed.

ty...@mit.edu (Jan 30, 2010, 6:10:02 PM)

On Fri, Jan 29, 2010 at 02:49:42PM -0600, Christoph Lameter wrote:
> This implements the ability to remove inodes in a particular slab
> from inode caches. In order to remove an inode we may have to write out
> the pages of an inode, the inode itself and remove the dentries referring
> to the inode.

How often is this going to happen? Removing an inode is an incredibly
expensive operation. We have to eject all of the pages from the page
cache, and if those pages are getting a huge amount of use --- say,
those corresponding to some shared library like libc --- or if it has
a huge number of pages that are actively getting used, the thrashing
that is going to result is going to be enormous.

There needs to be some kind of cost/benefit analysis done about
whether or not this is worth it. Does it make sense to potentially
force hundreds and hundreds of megabytes of pages to get thrashed in
and out just to recover a single 4k page? In some cases, maybe yes.
But in other cases, the results could be disastrous.

> +/*
> + * Generic callback function slab defrag ->kick methods. Takes the
> + * array with inodes where we obtained refcounts using fs_get_inodes()
> + * or get_inodes() and tries to free them.
> + */
> +void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
> +{
> + struct inode *inode;
> + int i;
> + int abort = 0;
> + LIST_HEAD(freeable);
> + int active;
> +


> + for (i = 0; i < nr; i++) {

> + inode = v[i];
> + if (!inode)
> + continue;

In some cases, it's going to be impossible to empty a particular slab
cache page. For example, there may be one inode which has pages
locked into memory, or which we may decide (once we add some
intelligence into this function) is really not worth ejecting. In
that case, there's no point dumping the rest of the inodes on that
particular slab page.

> + if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> + if (remove_inode_buffers(inode))
> + /*
> + * Should we really be doing this? Or
> + * limit the writeback here to only a few pages?
> + *
> + * Possibly an expensive operation but we
> + * cannot reclaim the inode if the pages
> + * are still present.
> + */
> + invalidate_mapping_pages(&inode->i_data,
> + 0, -1);

> + }

I do not think this function does what you think it does....

"invalidate_mapping_pages() will not block on I/O activity, and it
will refuse to invalidate pages which are dirty, locked, under
writeback, or mapped into page tables."

So you need to force the data to be written *first*, then get the
pages removed from the page tables, and only then call
invalidate_mapping_pages(). Otherwise, this is just going to
pointlessly drop pages from the page cache, trashing the page
cache's effectiveness, without actually making it possible to drop a
particular inode if it is being used at all by any process.
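
In other words, the ordering being described would look roughly like this
(a sketch of the sequence only, not the patch's code; whether each step is
safe to perform from the defrag context is exactly what is under debate):

	static void reclaim_inode_pages_sketch(struct inode *inode)
	{
		struct address_space *mapping = &inode->i_data;

		/* 1. Write the dirty data out first and wait for it. */
		filemap_write_and_wait(mapping);

		/* 2. Remove the pages from all page tables. */
		unmap_mapping_range(mapping, 0, 0, 1);

		/* 3. Only now will invalidation actually drop the pages. */
		invalidate_mapping_pages(mapping, 0, -1);
	}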

Presumably then the defrag code, since it was unable to free a
particular page, will go on pillaging and raping other inodes in the
inode cache, until it can actually create a hugepage. This is why you
really shouldn't start trying to trash an inode until you're
**really** sure it's possible to completely evict a 4k slab page of all
of its inodes.

- Ted

Andi Kleen (Jan 31, 2010, 3:40:02 AM)

On Sat, Jan 30, 2010 at 02:26:23PM -0500, ty...@mit.edu wrote:
> On Fri, Jan 29, 2010 at 02:49:42PM -0600, Christoph Lameter wrote:
> > This implements the ability to remove inodes in a particular slab
> > from inode caches. In order to remove an inode we may have to write out
> > the pages of an inode, the inode itself and remove the dentries referring
> > to the inode.
>
> How often is this going to happen? Removing an inode is an incredibly

The standard case is the classic updatedb. Lots of dentries/inodes cached
with no or little corresponding data cache.

> a huge number of pages that are actively getting used, the thrashing
> that is going to result is going to be enormous.

I think the consensus so far is to try to avoid any inodes/dentries
which are dirty or used in any way.

I personally would prefer it to be more aggressive for memory offlining
for RAS purposes, but just handling the unused cases is a good first
step.

> "invalidate_mapping_pages() will not block on I/O activity, and it
> will refuse to invalidate pages which are dirty, locked, under
> writeback, or mapped into page tables."

I think that was the point.

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

ty...@mit.edu (Jan 31, 2010, 4:10:02 PM)

On Sun, Jan 31, 2010 at 09:34:09AM +0100, Andi Kleen wrote:
>
> The standard case is the classic updatedb. Lots of dentries/inodes cached
> with no or little corresponding data cache.
>
> > a huge number of pages that are actively getting used, the thrashing
> > that is going to result is going to be enormous.
>
> I think the consensus so far is to try to avoid any inodes/dentries
> which are dirty or used in any way.

OK, but in that case, kick_inodes should check to see if the inode
is in use in any way (i.e., has dentries that tie it down,
is open, has pages that are dirty or are mapped into some page table)
before attempting to invalidate any of its pages. The patch as
currently constituted doesn't do that. It will attempt to drop all
pages owned by that inode before checking for any of these conditions.
If I wanted that, I'd just do "echo 3 > /proc/sys/vm/drop_caches".
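
The kind of "is this inode tied down?" test being asked for might look
roughly like this (a sketch only; the individual checks are approximations
of the conditions listed above, not code from the patch):

	static bool inode_reclaimable_sketch(struct inode *inode)
	{
		if (atomic_read(&inode->i_count) > 1)	/* open somewhere */
			return false;
		if (!list_empty(&inode->i_dentry))	/* dentries pin it */
			return false;
		if (mapping_mapped(&inode->i_data))	/* mmapped pages */
			return false;
		if (inode->i_state & I_DIRTY)		/* dirty inode */
			return false;
		/* dirty pagecache pages would block invalidation */
		return !mapping_tagged(&inode->i_data, PAGECACHE_TAG_DIRTY);
	}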

Worse yet, *after* it does this, it tries to write out the pages of the
inode. #1, this is pointless, since if the inode had any dirty pages,
they wouldn't have been invalidated, since it calls write_inode_now()
*after* calling invalidate_mapping_pages(), so the previously dirty
pages will still be mapped and will prevent the inode from being
flushed. #2, it interferes with delayed allocation and becomes
another writeback path, which means some dirty pages might get flushed
too early, and it does this writeout without any of the congestion
handling code in the bdi writeback paths.

If the consensus is "avoid any inodes/dentries which are dirty or
used in any way", THIS PATCH DOESN'T DO THAT.

I'd go further, and say that it should avoid trying to flush any inode
if any of its sibling inodes on the slab cache are dirty or in use in
any way. Otherwise, you end up dropping pages from the page cache and
still not be able to do any defragmentation.

> I personally would prefer it to be more aggressive for memory offlining
> though for RAS purposes though, but just handling the unused cases is a
> good first step.

If you want something more aggressive, why not just "echo 3 >
/proc/sys/vm/drop_caches"? We have that already. If the answer is,
because it will trash the performance of the system, I'm concerned
this patch series will do this --- consistently.

If the concern is that the inode cache is filled with crap after an
updatedb run, then we should fix *that* problem; we need a way for
programs like updatedb to indicate that they are scanning lots of
inodes, and if the inode wasn't in cache before it was opened, it
should be placed on the short list to be dropped after it's closed.
Making that a new open(2) flag makes a lot of sense. Let's solve the
real problem here, if that's the concern.
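
Purely as an illustration of such a hint (no such flag exists; O_DROPCACHE
is an invented name and value for this sketch), a scanner like updatedb
might open files as follows:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* Hypothetical flag: expire the inode/dentry from the caches once
	 * the last reference goes away.  Invented for illustration only. */
	#define O_DROPCACHE	010000000

	static int scan_one(const char *path)
	{
		char buf[4096];
		int fd = open(path, O_RDONLY | O_NOATIME | O_DROPCACHE);

		if (fd < 0)
			return -1;
		while (read(fd, buf, sizeof(buf)) > 0)
			;	/* index the contents... */
		close(fd);
		return 0;
	}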

But most of the time, I *want* the page cache filled, since it means
less time wasted accessing spinning rust platters. The last thing I
want is some helpful defragmentation kernel thread constantly
wandering through inode caches, and randomly calling
"invalidate_mapping_pages" on inodes since it thinks this will help
defrag huge pages. If I'm not running an Oracle database on my
laptop, but instead am concerned about battery lifetime, this is the
last thing I would want.

- Ted

Nick Piggin (Feb 1, 2010, 3:30:02 AM)

I always preferred to do defrag in the opposite way. Ie. query the
slab allocator from existing shrinkers rather than opposite way
around. This lets you reuse more of the locking and refcounting etc.

So you have a pin on the object somehow via the normal shrinker path,
and therefore you get a pin on the underlying slab. I would just like
to see the performance of even a really simple approach that just asks
whether we are in this slab defrag mode, and if so, whether the slab
is very sparse. If yes, then reclaim aggressively.

If that doesn't perform well enough and you have to go further and
discover objects on the same slab, then it does get a bit more
tricky because:
- you need the pin on the first object in order to discover more
- discovered objects may not be expected in the existing shrinker
code that just picks objects off LRUs

However your code already has to handle the 2nd case anyway, and for
the 1st case it is probably not too hard to do with dcache/icache. And
in either case you seem to avoid the worst of the sleeping and lock
ordering and slab inversion problems of your ->get approach.

But I'm really interested to see numbers, and especially numbers of
the simpler approaches before adding this complexity.
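
A rough sketch of the shape suggested above (assumptions: the object is
already pinned via the normal shrinker/LRU path, and the slab allocator
grows a query for how sparsely populated an object's underlying page is;
slab_page_is_sparse() and dentry_is_expired() are invented names):

	static int should_reclaim_dentry(struct dentry *dentry, int defrag_mode)
	{
		/* Normal mode: follow the usual LRU aging rules. */
		if (!defrag_mode)
			return dentry_is_expired(dentry);

		/*
		 * Defrag mode: if the dentry sits in a very sparse slab
		 * page, reclaim it aggressively even if recently used.
		 */
		if (slab_page_is_sparse(virt_to_page(dentry)))
			return 1;

		return dentry_is_expired(dentry);
	}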

Nick Piggin (Feb 1, 2010, 3:40:02 AM)

On Fri, Jan 29, 2010 at 02:49:41PM -0600, Christoph Lameter wrote:
> Defragmentation support for buffer heads. We convert the references to
> buffers to struct page references and try to remove the buffers from
> those pages. If the pages are dirty then trigger writeout so that the
> buffer heads can be removed later.
>
> Reviewed-by: Rik van Riel <ri...@redhat.com>
> Signed-off-by: Christoph Lameter <clam...@sgi.com>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>
> ---
> fs/buffer.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 99 insertions(+)
>
> Index: slab-2.6/fs/buffer.c
> ===================================================================
> --- slab-2.6.orig/fs/buffer.c 2010-01-22 15:09:43.000000000 -0600
> +++ slab-2.6/fs/buffer.c 2010-01-22 16:17:27.000000000 -0600
> @@ -3352,6 +3352,104 @@ int bh_submit_read(struct buffer_head *b
> }
> EXPORT_SYMBOL(bh_submit_read);
>
> +/*
> + * Writeback a page to clean the dirty state
> + */
> +static void trigger_write(struct page *page)
> +{
> + struct address_space *mapping = page_mapping(page);
> + int rc;
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + .nr_to_write = 1,
> + .range_start = 0,
> + .range_end = LLONG_MAX,
> + .nonblocking = 1,
> + .for_reclaim = 0
> + };
> +
> + if (!mapping->a_ops->writepage)
> + /* No write method for the address space */
> + return;
> +
> + if (!clear_page_dirty_for_io(page))
> + /* Someone else already triggered a write */
> + return;
> +
> + rc = mapping->a_ops->writepage(page, &wbc);
> + if (rc < 0)
> + /* I/O Error writing */
> + return;
> +
> + if (rc == AOP_WRITEPAGE_ACTIVATE)
> + unlock_page(page);
> +}
> +
> +/*
> + * Get references on buffers.
> + *
> + * We obtain references on the page that uses the buffer. v[i] will point to
> + * the corresponding page after get_buffers() is through.
> + *
> + * We are safe from the underlying page being removed simply by doing
> + * a get_page_unless_zero. The buffer head removal may race at will.
> + * try_to_free_buffers will later take appropriate locks to remove the
> + * buffers if they are still there.
> + */
> +static void *get_buffers(struct kmem_cache *s, int nr, void **v)
> +{
> + struct page *page;
> + struct buffer_head *bh;
> + int i, j;
> + int n = 0;

> +
> + for (i = 0; i < nr; i++) {
> + bh = v[i];

> + v[i] = NULL;
> +
> + page = bh->b_page;
> +
> + if (page && PagePrivate(page)) {
> + for (j = 0; j < n; j++)
> + if (page == v[j])
> + continue;
> + }
> +
> + if (get_page_unless_zero(page))
> + v[n++] = page;

This seems wrong to me. The page can have been reused at this
stage.

You technically can't re-check using page->private because that
can be anything and doesn't actually need to be a pointer. You
could re-check bh->b_page, provided that you ensure it is always
cleared before a page is detached, and the correct barriers are
in place.


> + }
> + return NULL;
> +}
> +
> +/*
> + * Despite its name: kick_buffers operates on a list of pointers to
> + * page structs that was set up by get_buffer().
> + */
> +static void kick_buffers(struct kmem_cache *s, int nr, void **v,
> + void *private)
> +{
> + struct page *page;
> + int i;


> +
> + for (i = 0; i < nr; i++) {

> + page = v[i];
> +
> + if (!page || PageWriteback(page))
> + continue;
> +
> + if (trylock_page(page)) {
> + if (PageDirty(page))
> + trigger_write(page);
> + else {
> + if (PagePrivate(page))
> + try_to_free_buffers(page);
> + unlock_page(page);

PagePrivate doesn't necessarily mean it has buffers. try_to_release_page
would be a better idea.

> + }
> + }
> + put_page(page);
> + }
> +}
> +
> static void
> init_buffer_head(void *data)
> {
> @@ -3370,6 +3468,7 @@ void __init buffer_init(void)
> (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
> SLAB_MEM_SPREAD),
> init_buffer_head);
> + kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);
>
> /*
> * Limit the bh occupancy to 10% of ZONE_NORMAL


Buffer heads and buffer head refcounting really stink badly, although
I can see the need for a medium-term solution until fsblock / some
actual sane refcounting arrives.

Andi Kleen (Feb 1, 2010, 5:20:01 AM)

On Sun, Jan 31, 2010 at 04:02:07PM -0500, ty...@mit.edu wrote:
> OK, but in that case, kick_inodes should check to see if the inode
> is in use in any way (i.e., has dentries that tie it down,
> is open, has pages that are dirty or are mapped into some page table)
> before attempting to invalidate any of its pages. The patch as
> currently constituted doesn't do that. It will attempt to drop all
> pages owned by that inode before checking for any of these conditions.
> If I wanted that, I'd just do "echo 3 > /proc/sys/vm/drop_caches".

Yes, the patch is more aggressive and probably needs to be fixed.

On the other hand I would like to keep the option to be more aggressive
for soft page offlining where it's useful and nobody cares about
the cost.

> Worse yet, *after* it does this, it tries to write out the pages of the
> inode. #1, this is pointless, since if the inode had any dirty pages,
> they wouldn't have been invalidated, since it calls write_inode_now()

Yes... I fought with all that for hwpoison too.

> I'd go further, and say that it should avoid trying to flush any inode
> if any of its sibling inodes on the slab cache are dirty or in use in
> any way. Otherwise, you end up dropping pages from the page cache and
> still not be able to do any defragmentation.

It depends -- for normal operation when running low on memory I agree
with you.
But for hwpoison soft offline purposes it's better to be more aggressive
-- even if that is inefficient -- though the number one priority is of
course still to be correct.

>
> If the concern is that the inode cache is filled with crap after an
> updatedb run, then we should fix *that* problem; we need a way for
> programs like updatedb to indicate that they are scanning lots of
> inodes, and if the inode wasn't in cache before it was opened, it
> should be placed on the short list to be dropped after it's closed.

This has been tried many times and nobody has come up with a good
approach to detect it automatically that doesn't have bad regressions
in corner cases.

Or the "let's add a updatedb" hint approach has the problem that
it won't cover a lot of other programs (as Linus always points out
these new interfaces rarely actually get used)


> But most of the time, I *want* the page cache filled, since it means
> less time wasted accessing spinning rust platters. The last thing I
> want is some helpful defragmentation kernel thread constantly
> wandering through inode caches, and randomly calling

The problem this patch series tries to address is that
when you run out of memory it tends to blow away your dcache,
because dcache reclaim is just too stupid to actually free
memory without going through most of the LRU list.

So yes, it's all about improving caching. But yes, some
details also need to be improved.

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Andi Kleen (Feb 1, 2010, 5:20:02 AM)

On Mon, Feb 01, 2010 at 06:08:35PM +1100, Nick Piggin wrote:
> I always preferred to do defrag in the opposite way. Ie. query the
> slab allocator from existing shrinkers rather than opposite way
> around. This lets you reuse more of the locking and refcounting etc.

I looked at this for hwpoison soft offline.

But it works really badly because the LRU list ordering
has nothing to do with the actual ordering inside the slab pages.

Christoph's basic approach is more efficient.

> So you have a pin on the object somehow via the normal shrinker path,
> and therefore you get a pin on the underlying slab. I would just like
> to see the performance of even a really simple approach that just asks
> whether we are in this slab defrag mode, and if so, whether the slab
> is very sparse. If yes, then reclaim aggressively.

The typical result is that you need to get through most of the LRU
list (and prune them all) just to free the page.

>
> If that doesn't perform well enough and you have to go further and

It doesn't.

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Nick Piggin (Feb 1, 2010, 5:20:02 AM)

On Mon, Feb 01, 2010 at 11:10:13AM +0100, Andi Kleen wrote:
> On Mon, Feb 01, 2010 at 06:08:35PM +1100, Nick Piggin wrote:
> > I always preferred to do defrag in the opposite way. Ie. query the
> > slab allocator from existing shrinkers rather than opposite way
> > around. This lets you reuse more of the locking and refcounting etc.
>
> I looked at this for hwpoison soft offline.
>
> But it works really badly because the LRU list ordering
> has nothing to do with the actual ordering inside the slab pages.

No, you don't *have* to follow LRU order. The most important thing
is, if you follow what I wrote, to get a pin on the objects and
the slabs via the regular shrinker path first, then query the slab
rather than calling into all these subsystems from an atomic and
non-slab-reentrant path.

Following LRU order would just be the first and simplest cut at
this.


> Christoph's basic approach is more efficient.

I want to see numbers because it is also the far more complex
approach.


> > So you have a pin on the object somehow via the normal shrinker path,
> > and therefore you get a pin on the underlying slab. I would just like
> > to see the performance of even a really simple approach that just asks
> > whether we are in this slab defrag mode, and if so, whether the slab
> > is very sparse. If yes, then reclaim aggressively.
>
> The typical result is that you need to get through most of the LRU
> list (and prune them all) just to free the page.

Really? If you have a large proportion of slabs which are quite
internally fragmented, then I would have thought it would give a
significant improvement (aggressive reclaim, that is).


> > If that doesn't perform well enough and you have to go further and
>
> It doesn't.

Can we see your numbers? And the patches you tried?

Thanks,
Nick

Andi Kleen (Feb 1, 2010, 5:30:02 AM)

On Mon, Feb 01, 2010 at 09:16:45PM +1100, Nick Piggin wrote:
> On Mon, Feb 01, 2010 at 11:10:13AM +0100, Andi Kleen wrote:
> > On Mon, Feb 01, 2010 at 06:08:35PM +1100, Nick Piggin wrote:
> > > I always preferred to do defrag in the opposite way. Ie. query the
> > > slab allocator from existing shrinkers rather than opposite way
> > > around. This lets you reuse more of the locking and refcounting etc.
> >
> > I looked at this for hwpoison soft offline.
> >
> > But it works really badly because the LRU list ordering
> > has nothing to do with the actual ordering inside the slab pages.
>
> No, you don't *have* to follow LRU order. The most important thing

What list would you follow then?

There's the LRU, there's the hash (which is just as random) and there's
the slab itself. The only one that is guaranteed to match the physical
layout in memory is the slab. That is what this patchkit is
attempting.

> is, if you follow what I wrote, to get a pin on the objects and

Which objects? You first need to collect all that belong to a page.
How else would you do that?

> > > whether we are in this slab defrag mode, and if so, whether the slab
> > > is very sparse. If yes, then reclaim aggressively.
> >
> > The typical result is that you need to get through most of the LRU
> > list (and prune them all) just to free the page.
>
> Really? If you have a large proportion of slabs which are quite
> internally fragmented, then I would have thought it would give a
> significant improvement (aggressive reclaim, that is)


You wrote the same as me?


>
>
> > > If that doesn't perform well enough and you have to go further and
> >
> > It doesn't.
>
> Can we see your numbers? And the patches you tried?

What I tried (in some dirty patches you probably don't want to see)
was to just implement slab shrinking for a single page for soft hwpoison.
But it didn't work too well because it couldn't free the objects
still actually in the dcache.

Then I called the shrinker and tried to pass in the page as a hint
and drop only objects on that page, but I realized that it's terribly
inefficient to do it this way.

Now soft hwpoison doesn't care about a little inefficiency, but I still
didn't like being terribly inefficient.

That is why I asked Christoph to repost his old patchkit, which can
do the shrink from the slab side (which is the right order here).

BTW the other potential user for this would be defragmentation
for large page allocation.

-Andi


--
a...@linux.intel.com -- Speaking for myself only.

Nick Piggin (Feb 1, 2010, 5:40:02 AM)

On Mon, Feb 01, 2010 at 11:22:53AM +0100, Andi Kleen wrote:
> On Mon, Feb 01, 2010 at 09:16:45PM +1100, Nick Piggin wrote:
> > On Mon, Feb 01, 2010 at 11:10:13AM +0100, Andi Kleen wrote:
> > > On Mon, Feb 01, 2010 at 06:08:35PM +1100, Nick Piggin wrote:
> > > > I always preferred to do defrag in the opposite way. Ie. query the
> > > > slab allocator from existing shrinkers rather than opposite way
> > > > around. This lets you reuse more of the locking and refcounting etc.
> > >
> > > I looked at this for hwpoison soft offline.
> > >
> > > But it works really badly because the LRU list ordering
> > > has nothing to do with the actual ordering inside the slab pages.
> >
> > No, you don't *have* to follow LRU order. The most important thing
>
> What list would you follow then?

You can follow the slab, as I said in the first mail.

> There's the LRU, there's the hash (which is just as random) and there's
> the slab itself. The only one that is guaranteed to match the physical
> layout in memory is the slab. That is what this patchkit is
> attempting.
>
> > is, if you follow what I wrote, to get a pin on the objects and
>
> Which objects? You first need to collect all that belong to a page.
> How else would you do that?

Objects that you're interested in reclaiming, I guess. I don't
understand the question.


> > > > whether we are in this slab defrag mode, and if so, whether the slab
> > > > is very sparse. If yes, then reclaim aggressively.
> > >
> > > The typical result is that you need to get through most of the LRU
> > > list (and prune them all) just to free the page.
> >
> > Really? If you have a large proportion of slabs which are quite
> > internally fragmented, then I would have thought it would give a
> > significant improvement (aggressive reclaim, that is)
>
>
> You wrote the same as me?

Aggressive reclaim: as in ignoring the referenced bit on the LRU,
*possibly* even trying to actively invalidate the dentry.


> > > > If that doesn't perform well enough and you have to go further and
> > >
> > > It doesn't.
> >
> > Can we see your numbers? And the patches you tried?
>
> What I tried (in some dirty patches you probably don't want to see)
> was to just implement slab shrinking for a single page for soft hwpoison.
> But it didn't work too well because it couldn't free the objects
> still actually in the dcache.
>
> Then I called the shrinker and tried to pass in the page as a hint
> and drop only objects on that page, but I realized that it's terribly
> inefficient to do it this way.
>
> Now soft hwpoison doesn't care about a little inefficiency, but I still
> didn't like being terribly inefficient.
>
> That is why I asked Christoph to repost his old patchkit that can
> do the shrink from the slab side (which is the right order here)

Right, but as you can see it is complex to do it this way. And I
think for reclaim driven targetted reclaim, then it needn't be so
inefficient because you aren't restricted to just one page, but
in any page which is heavily fragmented (and by definition there
should be a lot of them in the system).

Hwpoison I don't think adds much weight, frankly. Just panic and
reboot if you get unrecoverable error. We have everything to handle
that so I can't see how it's worth adding much complexity to the
kernel for.

Andi Kleen (Feb 1, 2010, 5:50:02 AM)

On Mon, Feb 01, 2010 at 09:35:26PM +1100, Nick Piggin wrote:
> > > > > I always preferred to do defrag in the opposite way. Ie. query the
> > > > > slab allocator from existing shrinkers rather than opposite way
> > > > > around. This lets you reuse more of the locking and refcounting etc.
> > > >
> > > > I looked at this for hwpoison soft offline.
> > > >
> > > > But it works really badly because the LRU list ordering
> > > > has nothing to do with the actual ordering inside the slab pages.
> > >
> > > No, you don't *have* to follow LRU order. The most important thing
> >
> > What list would you follow then?
>
> You can follow the slab, as I said in the first mail.

That's pretty much what Christoph's patchkit is about (with yes some details
improved)

>
> > There's the LRU, there's the hash (which is just as random) and there's
> > the slab itself. The only one that is guaranteed to match the physical
> > layout in memory is the slab. That is what this patchkit is
> > attempting.
> >
> > > is, if you follow what I wrote, to get a pin on the objects and
> >
> > Which objects? You first need to collect all that belong to a page.
> > How else would you do that?
>
> Objects that you're interested in reclaiming, I guess. I don't
> understand the question.

Objects that are in the same page

There are really two different cases here:
- Run out of memory: in this case I just want to find all the objects
of any page, ideally of not that recently used pages.
- I am very fragmented and want a specific page freed to get a 2MB
region back or for hwpoison: same, but do it for a specific page.


> Right, but as you can see it is complex to do it this way. And I
> think for reclaim driven targetted reclaim, then it needn't be so
> inefficient because you aren't restricted to just one page, but
> in any page which is heavily fragmented (and by definition there
> should be a lot of them in the system).

Assuming you can identify them quickly.

>
> Hwpoison I don't think adds much weight, frankly. Just panic and
> reboot if you get unrecoverable error. We have everything to handle

This is for soft hwpoison: offlining pages that might go bad
in the future.

But soft hwpoison isn't the only user. The other big one would
be for large pages or other large page allocations.

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Nick Piggin (Feb 1, 2010, 6:00:02 AM)

On Mon, Feb 01, 2010 at 11:45:44AM +0100, Andi Kleen wrote:
> On Mon, Feb 01, 2010 at 09:35:26PM +1100, Nick Piggin wrote:
> > > > > > I always preferred to do defrag in the opposite way. Ie. query the
> > > > > > slab allocator from existing shrinkers rather than opposite way
> > > > > > around. This lets you reuse more of the locking and refcounting etc.
> > > > >
> > > > > I looked at this for hwpoison soft offline.
> > > > >
> > > > > But it works really badly because the LRU list ordering
> > > > > has nothing to do with the actual ordering inside the slab pages.
> > > >
> > > > No, you don't *have* to follow LRU order. The most important thing
> > >
> > > What list would you follow then?
> >
> > You can follow the slab, as I said in the first mail.
>
> That's pretty much what Christoph's patchkit is about (with yes some details
> improved)

I know what the patch is about. Can you re-read my first mail?


> > > There's the LRU, there's the hash (which is just as random) and there's
> > > the slab itself. The only one that is guaranteed to match the physical
> > > layout in memory is the slab. That is what this patchkit is
> > > attempting.
> > >
> > > is, if you follow what I wrote, to get a pin on the objects and
> > >
> > > Which objects? You first need to collect all that belong to a page.
> > > How else would you do that?
> >
> > Objects that you're interested in reclaiming, I guess. I don't
> > understand the question.
>
> Objects that are in the same page

OK, well you can pin an object, and from there you can find other
objects in the same page.

This is totally different to how Christoph's patch has to pin the
slab, then (in a restrictive context) pin the objects, then go to
a more relaxed context to reclaim the objects. This is where much
of the complexity comes from.


> There are really two different cases here:
> - Run out of memory: in this case I just want to find all the objects
> of any page, ideally of not that recently used pages.
> - I am very fragmented and want a specific page freed to get a 2MB
> region back or for hwpoison: same, but do it for a specific page.
>
>
> > Right, but as you can see it is complex to do it this way. And I
> > think for reclaim driven targetted reclaim, then it needn't be so
> > inefficient because you aren't restricted to just one page, but
> > in any page which is heavily fragmented (and by definition there
> > should be a lot of them in the system).
>
> Assuming you can identify them quickly.

Well, because there are a large number of them, you are likely
to encounter one very quickly just off the LRU list.


> > Hwpoison I don't think adds much weight, frankly. Just panic and
> > reboot if you get unrecoverable error. We have everything to handle
>
> This is for soft hwpoison: offlining pages that might go bad
> in the future.

I still don't think it adds much weight. Especially if you can just
try an inefficient scan.

> But soft hwpoison isn't the only user. The other big one would
> be for large pages or other large page allocations.

Andi Kleen (Feb 1, 2010, 8:30:02 AM)

>
> > > Right, but as you can see it is complex to do it this way. And I
> > > think for reclaim driven targetted reclaim, then it needn't be so
> > > inefficient because you aren't restricted to just one page, but
> > > in any page which is heavily fragmented (and by definition there
> > > should be a lot of them in the system).
> >
> > Assuming you can identify them quickly.
>
> Well, because there are a large number of them, you are likely
> to encounter one very quickly just off the LRU list.

There were some cases in the past where this wasn't the case.
But yes, some up-to-date numbers on this would be good.

Also, it doesn't address the second case, quoted again here.

> > There are really two different cases here:
> > - Run out of memory: in this case I just want to find all the objects
> > of any page, ideally of not that recently used pages.
> > - I am very fragmented and want a specific page freed to get a 2MB
> > region back or for hwpoison: same, but do it for a specific page.
> >
>
>

> I still don't think it adds much weight. Especially if you can just
> try an inefficient scan.

Also see the second point below.


>
>
> > But soft hwpoison isn't the only user. The other big one would
> > be for large pages or other large page allocations.


-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Nick Piggin (Feb 1, 2010, 8:40:04 AM)

On Mon, Feb 01, 2010 at 02:25:27PM +0100, Andi Kleen wrote:
> >
> > > > Right, but as you can see it is complex to do it this way. And I
> > > > think for reclaim driven targetted reclaim, then it needn't be so
> > > > inefficient because you aren't restricted to just one page, but
> > > > in any page which is heavily fragmented (and by definition there
> > > > should be a lot of them in the system).
> > >
> > > Assuming you can identify them quickly.
> >
> > Well because there are a large number of them, then you are likely
> > to encounter one very quickly just off the LRU list.
>
> There were some cases in the past where this wasn't the case.
> But yes some uptodate numbers on this would be good.
>
> Also it doesn't address the second case here quoted again.
>
> > > There are really two different cases here:
> > > - Run out of memory: in this case I just want to find all the objects
> > > of any page, ideally of not that recently used pages.
> > > - I am very fragmented and want a specific page freed to get a 2MB
> > > region back or for hwpoison: same, but do it for a specific page.
> > >
> >
> >
> > I still don't think it adds much weight. Especially if you can just
> > try an inefficient scan.
>
> Also see second point below.
> >
> >
> > > But soft hwpoison isn't the only user. The other big one would
> > > be for large pages or other large page allocations.

Well, yes, it's possible that it could help there.

But it is always possible to do the same reclaim work via the LRU; in the
worst case it just requires reclaiming most objects. So it
probably doesn't fundamentally enable something we can't do already.
It's more a matter of performance, so again, numbers are needed.

ty...@mit.edu (Feb 1, 2010, 8:50:02 AM)

On Mon, Feb 01, 2010 at 11:17:02AM +0100, Andi Kleen wrote:
>
> On the other hand I would like to keep the option to be more aggressive
> for soft page offlining where it's useful and nobody cares about
> the cost.

I'm not sure I understand what the goals are for "soft page
offlining". Can you say a bit more about that?

> Or the "let's add a updatedb" hint approach has the problem that
> it won't cover a lot of other programs (as Linus always points out
> these new interfaces rarely actually get used)

Sure, but the number of programs that scan all of the files in a file
system and would need this sort of hint is actually pretty small.
Updatedb and backup programs are pretty much it.

- Ted

Andi Kleen (Feb 1, 2010, 9:00:02 AM)

On Mon, Feb 01, 2010 at 08:47:39AM -0500, ty...@mit.edu wrote:
> On Mon, Feb 01, 2010 at 11:17:02AM +0100, Andi Kleen wrote:
> >
> > On the other hand I would like to keep the option to be more aggressive
> > for soft page offlining where it's useful and nobody cares about
> > the cost.
>
> I'm not sure I understand what the goals are for "soft page
> offlining". Can you say a bit more about that?

Predictive offlining of memory pages based on corrected error counts.
This is a useful feature to get more out of lower quality (and even
high quality) DIMMs.

This is already implemented in mcelog plus the .33-ish memory-failure.c,
but right now it's quite dumb when trying to free a dcache/inode page
(it basically always has to blow away everything).

Also, this is just one use case. The other would be runtime 2MB page
support by doing targeted freeing (which would be especially
useful with the upcoming transparent huge pages). Probably others
too. I mostly quoted hwpoison because I happen to work on that.

>
> > Or the "let's add a updatedb" hint approach has the problem that
> > it won't cover a lot of other programs (as Linus always points out
> > these new interfaces rarely actually get used)
>
> Sure, but the number of programs that scan all of the files in a file
> system and would need this sort of hint is actually pretty small.

Not sure that's true.

Also consider a file server: how would you get that hint from the
clients?

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Christoph Lameter (Feb 1, 2010, 1:00:01 PM)

On Sat, 30 Jan 2010, Dave Chinner wrote:

> How do you expect defrag to behave when the filesystem doesn't free
> the inode immediately during dispose_list()? That is, the above code
> only finds inodes that are still active at the VFS level but they
> may still live for a significant period of time after the
> dispose_list() call. This is a real issue now that XFS has combined
> the VFS and XFS inodes into the same slab...

Then the freeing of the slab has to be delayed until the objects are
freed.

Christoph Lameter (Feb 1, 2010, 1:00:02 PM)

On Sat, 30 Jan 2010, Andi Kleen wrote:

> I guess the problem could be simplified by ignoring dentries in "unusual"
> states?

Sure.

Christoph Lameter (Feb 1, 2010, 1:00:03 PM)

On Sat, 30 Jan 2010, Rik van Riel wrote:

> On 01/30/2010 05:48 AM, Andi Kleen wrote:
> > On Fri, Jan 29, 2010 at 02:49:31PM -0600, Christoph Lameter wrote:
>
> > > 1. Establish a reference to a dentry/inode so that it is pinned.
> > > Hopefully in a way that is not too expensive (i.e. no
> > > superblock lock)
> > >
> > > 2. A means to free dentry/inode objects from the VM reclaim context.
> >
> >
> > Al, do you have a suggestions on a good way to do that?
>
> You cannot free inode objects for files that are open, mmapped, etc.

Of course. Those objects need to prevent reclaim attempts.

> > I guess the problem could be simplified by ignoring dentries in "unusual"
> > states?
>
> You mean dentries that are in use? :)

The existing patch already tried to discern that and avoid reclaiming
those.

Dave Chinner (Feb 2, 2010, 9:20:01 PM)

On Sun, Jan 31, 2010 at 09:34:09AM +0100, Andi Kleen wrote:
> On Sat, Jan 30, 2010 at 02:26:23PM -0500, ty...@mit.edu wrote:
> > On Fri, Jan 29, 2010 at 02:49:42PM -0600, Christoph Lameter wrote:
> > > This implements the ability to remove inodes in a particular slab
> > > from inode caches. In order to remove an inode we may have to write out
> > > the pages of an inode, the inode itself and remove the dentries referring
> > > to the inode.
> >
> > How often is this going to happen? Removing an inode is an incredibly
>
> The standard case is the classic updatedb. Lots of dentries/inodes cached
> with no or little corresponding data cache.

I don't believe that updatedb has anything to do with causing
internal inode/dentry slab fragmentation. In all my testing I rarely
see use-once filesystem traversals cause internal slab
fragmentation. This appears to be a result of use-once filesystem
traversal resulting in slab pages full of objects that have the same
locality of access. Hence each new slab page that traversal
allocates will contain objects that will be adjacent in the LRU.
Hence LRU-based reclaim is very likely to free all the objects on
each page in the same pass and as such no fragmentation will occur.

All the cases of inode/dentry slab fragmentation I have seen are a
result of access patterns that result in slab pages containing
objects with different temporal localities. It's when the access
pattern is sufficiently distributed throughout the working set we
get the "need to free 95% of the objects in the entire cache to free
a single page" type of reclaim behaviour.

AFAICT, the defrag patches as they stand don't really address the
fundamental problem of differing temporal locality inside a slab
page. It makes the assumption that "partial page == defrag
candidate" but there isn't any further consideration of when any of
the remaining objects were last accessed. I think that this really
does need to be taken into account, especially considering that the
allocator tries to fill partial pages with new objects before
allocating new pages and so the page under reclaim might contain
very recently allocated objects.

Someone in a previous discussion on this patch set (Nick? Hugh,
maybe? I can't find the reference right now) mentioned something
like this about the design of the force-reclaim operations. IIRC the
suggestion was that it may be better to track LRU-ness by per-slab
page rather than per-object so that reclaim can target the slab
pages that - on aggregate - had the oldest objects in it. I think
this has merit - prevention of internal fragmentation seems like a
better approach to me than to try to cure it after it is already
present....

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Christoph Lameter

unread,
Feb 3, 2010, 10:40:02 AM2/3/10
to
On Mon, 1 Feb 2010, Dave Chinner wrote:

> > The standard case is the classic updatedb. Lots of dentries/inodes cached
> > with no or little corresponding data cache.
>
> I don't believe that updatedb has anything to do with causing
> internal inode/dentry slab fragmentation. In all my testing I rarely
> see use-once filesystem traversals cause internal slab
> fragmentation. This appears to be a result of use-once filesystem
> traversal resulting in slab pages full of objects that have the same
> locality of access. Hence each new slab page that traversal
> allocates will contain objects that will be adjacent in the LRU.
> Hence LRU-based reclaim is very likely to free all the objects on
> each page in the same pass and as such no fragmentation will occur.

updatedb causes lots of partially allocated slab pages. While updatedb
runs, other filesystem activities occur. And updatedb does not work in a
straightforward linear fashion: dentries are cached and slowly expired,
etc. Updatedb may not cause fragmentation at the level you observed
with some of the filesystem loads on large systems.

> All the cases of inode/dentry slab fragmentation I have seen are a
> result of access patterns that result in slab pages containing
> objects with different temporal localities. It's when the access
> pattern is sufficiently distributed throughout the working set we
> get the "need to free 95% of the objects in the entire cache to free
> a single page" type of reclaim behaviour.

There are also other factors at play, like different NUMA nodes and
concurrent processes. A strictly optimized HPC workload may be able to
eliminate other factors but that is not the case for typical workloads.
Access patterns are typically somewhat distributed.

> AFAICT, the defrag patches as they stand don't really address the
> fundamental problem of differing temporal locality inside a slab
> page. It makes the assumption that "partial page == defrag
> candidate" but there isn't any further consideration of when any of
> the remaining objects were last accessed. I think that this really
> does need to be taken into account, especially considering that the
> allocator tries to fill partial pages with new objects before
> allocating new pages and so the page under reclaim might contain
> very recently allocated objects.

Reclaim is only run if there is memory pressure. This means that lots of
reclaimable entities exist and therefore we can assume that many of these
have had a somewhat long lifetime. The allocator tries to fill partial
pages with new objects and then retires those pages to the full slab list.
Those are not subject to reclaim efforts covered here. A page under
reclaim is likely to contain many recently freed objects.

The remaining objects may have a long lifetime and a high usage pattern
but it is worth relocating them into other slabs if they prevent reclaim
of the page. Relocation occurs in this patchset through reclaim; the
next use then likely reallocates the object in a partially allocated slab.
This means that objects with a high usage count will tend to be aggregated
in full slabs that are no longer subject to targeted reclaim.

We could improve the situation by allowing the moving of objects (which
would avoid the reclaim and realloc) but that is complex and so needs to
be deferred to a second stage (same approach we went through with page
migration).

> Someone in a previous discussion on this patch set (Nick? Hugh,
> maybe? I can't find the reference right now) mentioned something
> like this about the design of the force-reclaim operations. IIRC the
> suggestion was that it may be better to track LRU-ness by per-slab
> page rather than per-object so that reclaim can target the slab
> pages that - on aggregate - had the oldest objects in it. I think
> this has merit - prevention of internal fragmentation seems like a
> better approach to me than to try to cure it after it is already
> present....

LRUness exists in terms of the list of partial slab pages. Frequently
allocated slabs are in the front of the queue and less used slabs are in
the rear. Defrag/reclaim occurs from the rear.
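
In SLUB terms that ordering lives on the per-node partial list. A minimal
sketch of the scan (try_reclaim_slab() is a hypothetical stand-in for the
targeted reclaim pass, and n->list_lock handling is elided):

/* Allocation refills take slabs from the front of n->partial, so the
 * reclaim pass walks from the rear, where the least used slabs gather. */
static void defrag_partial_list(struct kmem_cache *s,
                                struct kmem_cache_node *n)
{
        struct page *page, *tmp;

        list_for_each_entry_safe_reverse(page, tmp, &n->partial, lru)
                if (!try_reclaim_slab(s, page)) /* hypothetical helper */
                        break;
}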

Dave Chinner

unread,
Feb 3, 2010, 7:40:01 PM2/3/10
to
On Wed, Feb 03, 2010 at 09:31:49AM -0600, Christoph Lameter wrote:
> On Mon, 1 Feb 2010, Dave Chinner wrote:
>
> > > The standard case is the classic updatedb. Lots of dentries/inodes cached
> > > with no or little corresponding data cache.
> >
> > I don't believe that updatedb has anything to do with causing
> > internal inode/dentry slab fragmentation. In all my testing I rarely
> > see use-once filesystem traversals cause internal slab
> > fragmentation. This appears to be a result of use-once filesystem
> > traversal resulting in slab pages full of objects that have the same
> > locality of access. Hence each new slab page that traversal
> > allocates will contain objects that will be adjacent in the LRU.
> > Hence LRU-based reclaim is very likely to free all the objects on
> > each page in the same pass and as such no fragmentation will occur.
>
> updatedb causes lots of partially allocated slab pages. While updatedb
> runs, other filesystem activities occur. And updatedb does not work in a
> straightforward linear fashion: dentries are cached and slowly expired,
> etc.

Sure, but my point was that updatedb hits lots of inodes only once,
and for those objects the order of caching and expiration is
exactly the same. Hence after reclaim of the updatedb dentries/inodes
the amount of fragmentation in the slab will be almost exactly the
same as it was before the updatedb run.

> > All the cases of inode/dentry slab fragmentation I have seen are a
> > result of access patterns that result in slab pages containing
> > objects with different temporal localities. It's when the access
> > pattern is sufficiently distributed throughout the working set we
> > get the "need to free 95% of the objects in the entire cache to free
> > a single page" type of reclaim behaviour.
>
> There are also other factors at play, like different NUMA nodes and
> concurrent processes.

Yes, those are just more factors in the access patterns being
"sufficiently distributed throughout the working set".

> > AFAICT, the defrag patches as they stand don't really address the
> > fundamental problem of differing temporal locality inside a slab
> > page. It makes the assumption that "partial page == defrag
> > candidate" but there isn't any further consideration of when any of
> > the remaining objects were last accessed. I think that this really
> > does need to be taken into account, especially considering that the
> > allocator tries to fill partial pages with new objects before
> > allocating new pages and so the page under reclaim might contain
> > very recently allocated objects.
>
> Reclaim is only run if there is memory pressure. This means that lots of
> reclaimable entities exist and therefore we can assume that many of these
> have had a somewhat long lifetime. The allocator tries to fill partial
> pages with new objects and then retires those pages to the full slab list.
> Those are not subject to reclaim efforts covered here. A page under
> reclaim is likely to contain many recently freed objects.

Not necessarily. It might contain only one recently reclaimed object,
but have several other hot objects in the page....

> The remaining objects may have a long lifetime and a high usage pattern
> but it is worth relocating them into other slabs if they prevent reclaim
> of the page.

I completely disagree. If you have to trash all the cache hot
information related to the cached object in the process of
relocating it, then you've just screwed up application performance
and in a completely unpredictable manner. Admins will be tearing out
their hair trying to work out why their applications randomly slow
down....

> > Someone in a previous discussion on this patch set (Nick? Hugh,
> > maybe? I can't find the reference right now) mentioned something
> > like this about the design of the force-reclaim operations. IIRC the
> > suggestion was that it may be better to track LRU-ness by per-slab
> > page rather than per-object so that reclaim can target the slab
> > pages that - on aggregate - had the oldest objects in it. I think
> > this has merit - prevention of internal fragmentation seems like a
> > better approach to me than to try to cure it after it is already
> > present....
>
> LRUness exists in terms of the list of partial slab pages. Frequently
> allocated slabs are in the front of the queue and less used slabs are in
> the rear. Defrag/reclaim occurs from the rear.

You missed my point again. You're still talking about tracking pages
with no regard to the objects remaining in the pages. A page, full
or partial, is a candidate for object reclaim if none of the objects
on it are referenced and none have been referenced for some time.

You are currently relying on the existing LRU reclaim to move a slab
from full to partial to trigger defragmentation, but you ignore the
hotness of the rest of the objects on the page by trying to reclaim
the page that has been partial for the longest period of time.

What it comes down to is that the slab has two states for objects -
allocated and free - but what we really need here is 3 states -
allocated, unused and freed. We currently track unused objects
outside the slab in LRU lists and, IMO, that is the source of our
fragmentation problems because it has no knowledge of the spatial
layout of the slabs and the state of other objects in the page.

What I'm suggesting is that we ditch the external LRUs and track the
"unused" state inside the slab and then use that knowledge to decide
which pages to reclaim. e.g. slab_object_used() is called when the
first reference on an object is taken. slab_object_unused() is
called when the reference count goes to zero. The slab can then
track unused objects internally and when reclaim is needed can
select pages (full or partial) that only contain unused objects to
reclaim.
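
Roughly, such hooks might look like the sketch below. This is
illustrative only: the unused count, the slab_lru link and the
reclaim_lru list head are invented, locking is elided, and of the
fields used only inuse exists in SLUB today.

/* Called by the subsystem when the first reference to obj is taken. */
void slab_object_used(struct kmem_cache *s, void *obj)
{
        struct page *page = virt_to_head_page(obj);

        /* the page had been fully unused: it turns Active and
         * leaves the slab-internal reclaim LRU */
        if (page->unused-- == page->inuse)
                list_del_init(&page->slab_lru);
}

/* Called by the subsystem when the last reference to obj is dropped. */
void slab_object_unused(struct kmem_cache *s, void *obj)
{
        struct page *page = virt_to_head_page(obj);

        /* every allocated object on the page is now unused: the
         * page becomes a reclaim candidate at the MRU end */
        if (++page->unused == page->inuse)
                list_add_tail(&page->slab_lru, &s->reclaim_lru);
}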

From there the existing reclaim algorithms could be used to reclaim
the objects. i.e. the shrinkers become a slab reclaim callout that
are passed a linked list of objects to reclaim, very similar to the
way __shrink_dcache_sb() and prune_icache() first build a list of
objects to reclaim, then work off that list of objects.

If the goal is to reduce fragmentation, then this seems like a
much better approach to me - it is inherently fragmentation
resistant and much more closely aligned to existing object reclaim
algorithms.

If the goal is random slab page shootdown (e.g. for hwpoison), then
it's a much more complex problem because you can't shoot down
active, referenced objects without revoke(). Hence I think the
two problem spaces should be kept separate as it's not obvious
that they can both be solved with the same mechanism....

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

ty...@mit.edu

unread,
Feb 3, 2010, 10:10:02 PM2/3/10
to
On Thu, Feb 04, 2010 at 11:34:10AM +1100, Dave Chinner wrote:
>
> I completely disagree. If you have to trash all the cache hot
> information related to the cached object in the process of
> relocating it, then you've just screwed up application performance
> and in a completely unpredictable manner. Admins will be tearing out
> their hair trying to work out why their applications randomly slow
> down....

...

> You missed my point again. You're still talking about tracking pages
> with no regard to the objects remaining in the pages. A page, full
> or partial, is a candidate for object reclaim if none of the objects
> on it are referenced and none have been referenced for some time.
>
> You are currently relying on the existing LRU reclaim to move a slab
> from full to partial to trigger defragmentation, but you ignore the
> hotness of the rest of the objects on the page by trying to reclaim
> the page that has been partial for the longest period of time.

Well said.

This is exactly what I was complaining about as well, but apparently I
wasn't understood the first time either. :-(

This *has* to be fixed, or this set of patches is going to completely
trash the overall system performance by trashing the page cache.

> What it comes down to is that the slab has two states for objects -
> allocated and free - but what we really need here is 3 states -
> allocated, unused and freed. We currently track unused objects
> outside the slab in LRU lists and, IMO, that is the source of our
> fragmentation problems because it has no knowledge of the spatial
> layout of the slabs and the state of other objects in the page.
>
> What I'm suggesting is that we ditch the external LRUs and track the
> "unused" state inside the slab and then use that knowledge to decide
> which pages to reclaim.

Or maybe we need to have a way to track the LRU of the slab page as
a whole? Any time we touch an object on the slab page, we update the
last-used time of the slab as a whole.

It's actually more complicated than that, though. Even if no one has
touched a particular inode, if one of the inodes in the slab page is
pinned down because it is in use, there's no point in the
defragmenter trying to throw away valuable cached pages associated
with other inodes in the same slab page --- because of that
single pinned inode, YOU'RE NEVER GOING TO DEFRAG THAT PAGE.

And of course, if the inode is pinned down because it is opened and/or
mmaped, then its associated dcache entry can't be freed either, so
there's no point trying to trash all of its sibling dentries on the
same page as that dcache entry.

Randomly shooting down dcache and inode entries in the hopes of
creating free pages to coalesce into hugepages is just not cool. If
you're that desperate, you might as well just do "echo 3 >
/proc/sys/vm/drop_caches". From my read of the algorithms, it's going
to be almost as destructive to system performance.

- Ted

Dave Chinner

unread,
Feb 3, 2010, 10:40:01 PM2/3/10
to
On Wed, Feb 03, 2010 at 10:07:36PM -0500, ty...@mit.edu wrote:
> On Thu, Feb 04, 2010 at 11:34:10AM +1100, Dave Chinner wrote:
> > What it comes down to is that the slab has two states for objects -
> > allocated and free - but what we really need here is 3 states -
> > allocated, unused and freed. We currently track unused objects
> > outside the slab in LRU lists and, IMO, that is the source of our
> > fragmentation problems because it has no knowledge of the spatial
> > layout of the slabs and the state of other objects in the page.
> >
> > What I'm suggesting is that we ditch the external LRUs and track the
> > "unused" state inside the slab and then use that knowledge to decide
> > which pages to reclaim.
>
> Or maybe we need to have a way to track the LRU of the slab page as
> a whole? Any time we touch an object on the slab page, we update the
> last-used time of the slab as a whole.

Yes, that's pretty much what I have been trying to describe. ;)
(And, IIUC, what I think Nick has been trying to describe as well
when he's been saying we should "turn reclaim upside down".)

It seems to me to be pretty simple to track, too, if we define pages
for reclaim to only be those that are full of unused objects. i.e.
the pages have the two states:

- Active: some allocated and referenced object on the page
=> no need for LRU tracking of these
- Unused: all allocated objects on the page are not used
=> these pages are LRU tracked within the slab

A single referenced object is enough to change the state of the
page from Unused to Active, and when a page transitions from
Active to Unused it goes on the MRU end of the LRU queue.
Reclaim would then start with the oldest pages on the LRU....

> It's actually more complicated than that, though. Even if no one has
> touched a particular inode, if one of the inodes in the slab page is
> pinned down because it is in use,

A single active object like this would make the slab page Active, and
therefore not a candidate for reclaim. Also, we already reclaim
dentries before inodes because dentries pin inodes, so our
algorithms for reclaim already deal with these ordering issues for
us.

...

> And of course, if the inode is pinned down because it is opened and/or
> mmaped, then its associated dcache entry can't be freed either, so
> there's no point trying to trash all of its sibling dentries on the
> same page as that dcache entry.

Agreed - that's why I think preventing fragmentation caused by LRU
reclaim is best dealt with internally to slab where both object age
and locality can be taken into account.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Nick Piggin

unread,
Feb 4, 2010, 4:40:03 AM2/4/10
to
On Thu, Feb 04, 2010 at 02:39:11PM +1100, Dave Chinner wrote:
> On Wed, Feb 03, 2010 at 10:07:36PM -0500, ty...@mit.edu wrote:
> > On Thu, Feb 04, 2010 at 11:34:10AM +1100, Dave Chinner wrote:
> > > What it comes down to is that the slab has two states for objects -
> > > allocated and free - but what we really need here is 3 states -
> > > allocated, unused and freed. We currently track unused objects
> > > outside the slab in LRU lists and, IMO, that is the source of our
> > > fragmentation problems because it has no knowledge of the spatial
> > > layout of the slabs and the state of other objects in the page.
> > >
> > > What I'm suggesting is that we ditch the external LRUs and track the
> > > "unused" state inside the slab and then use that knowledge to decide
> > > which pages to reclaim.
> >
> > Or maybe we need to have a way to track the LRU of the slab page as
> > a whole? Any time we touch an object on the slab page, we update the
> > last-used time of the slab as a whole.
>
> Yes, that's pretty much what I have been trying to describe. ;)
> (And, IIUC, what I think Nick has been trying to describe as well
> when he's been saying we should "turn reclaim upside down".)

Well what I described is to do the slab pinning from the reclaim path
(rather than from slab calling into the subsystem). All slab locking
is basically "innermost", so you can pretty much poke the slab layer as
much as you like from the subsystem.

After that, LRU on slabs should be fairly easy. Slab could provide a
private per-slab pointer for example that is managed by the caller.
Subsystem can then call into slab to find the objects.

--

Christoph Lameter

unread,
Feb 4, 2010, 12:10:03 PM2/4/10
to
On Thu, 4 Feb 2010, Dave Chinner wrote:

> > Or maybe we need to have a way to track the LRU of the slab page as
> > a whole? Any time we touch an object on the slab page, we update the
> > last-used time of the slab as a whole.
>
> Yes, that's pretty much what I have been trying to describe. ;)
> (And, IIUC, what I think Nick has been trying to describe as well
> when he's been saying we should "turn reclaim upside down".)
>
> It seems to me to be pretty simple to track, too, if we define pages
> for reclaim to only be those that are full of unused objects. i.e.
> the pages have the two states:
>
> - Active: some allocated and referenced object on the page
> => no need for LRU tracking of these
> - Unused: all allocated objects on the page are not used
> => these pages are LRU tracked within the slab
>
> A single referenced object is enough to change the state of the
> page from Unused to Active, and when a page transitions from
> Active to Unused it goes on the MRU end of the LRU queue.
> Reclaim would then start with the oldest pages on the LRU....

These describe ways of reclaim that could be implemented by the fs
layer. The information about which items are "unused" or "referenced" is a
notion of the fs. The slab caches know only two object states: free or
allocated. LRU handling of slab pages is something entirely different
from the LRU of the inodes and dentries.

> > And of course, if the inode is pinned down because it is opened and/or
> > mmaped, then its associated dcache entry can't be freed either, so
> > there's no point trying to trash all of its sibling dentries on the
> > same page as that dcache entry.
>
> Agreed - that's why I think preventing fragmentation caused by LRU
> reclaim is best dealt with internally to slab where both object age
> and locality can be taken into account.

Object age is not known by the slab. Locality is only considered in terms
of hardware placement (NUMA nodes), not in relationship to objects of other
caches (like inodes and dentries) or of the same cache.

If we want this then we may end up with a special allocator for the
filesystem.

You and I discussed a couple of years ago adding a reference count to
the objects of the slab allocator. Those explorations resulted in a much
more complicated and different allocator that is geared to the needs of
the filesystem for reclaim.

Christoph Lameter

unread,
Feb 4, 2010, 12:20:02 PM2/4/10
to
On Thu, 4 Feb 2010, Nick Piggin wrote:

> Well what I described is to do the slab pinning from the reclaim path
> (rather than from slab calling into the subsystem). All slab locking
> is basically "innermost", so you can pretty much poke the slab layer as
> much as you like from the subsystem.

Reclaim/defrag is called from the reclaim path (of the VM). We could
enable a call from the fs reclaim code into the slab. But how would this
work?

> After that, LRU on slabs should be fairly easy. Slab could provide a
> private per-slab pointer for example that is managed by the caller.
> Subsystem can then call into slab to find the objects.

Sure, with some minor changes we could have a call that gives you the
list of neighboring objects in a slab while locking it. Then you can look
at the objects and decide which ones can be tossed and then do another
call to release the objects and unlock the slab.

Dave Chinner

unread,
Feb 5, 2010, 8:00:02 PM2/5/10
to
On Thu, Feb 04, 2010 at 10:59:26AM -0600, Christoph Lameter wrote:
> On Thu, 4 Feb 2010, Dave Chinner wrote:
>
> > > Or maybe we need to have a way to track the LRU of the slab page as
> > > a whole? Any time we touch an object on the slab page, we update the
> > > last-used time of the slab as a whole.
> >
> > Yes, that's pretty much what I have been trying to describe. ;)
> > (And, IIUC, what I think Nick has been trying to describe as well
> > when he's been saying we should "turn reclaim upside down".)
> >
> > It seems to me to be pretty simple to track, too, if we define pages
> > for reclaim to only be those that are full of unused objects. i.e.
> > the pages have the two states:
> >
> > - Active: some allocated and referenced object on the page
> > => no need for LRU tracking of these
> > - Unused: all allocated objects on the page are not used
> > => these pages are LRU tracked within the slab
> >
> > A single referenced object is enough to change the state of the
> > page from Unused to Active, and when a page transitions from
> > Active to Unused it goes on the MRU end of the LRU queue.
> > Reclaim would then start with the oldest pages on the LRU....
>
> These describe ways of reclaim that could be implemented by the fs
> layer. The information about which items are "unused" or "referenced" is a
> notion of the fs. The slab caches know only two object states: free or
> allocated. LRU handling of slab pages is something entirely different
> from the LRU of the inodes and dentries.

Ah, perhaps you missed my previous email in the thread about adding
a third object state to the slab - i.e. an unused state? And an
interface (slab_object_used()/slab_object_unused()) to allow the
external users to tell the slab about state changes of objects
on the first/last reference to the object. That would allow the
tracking as I stated above....

> > > And of course, if the inode is pinned down because it is opened and/or
> > > mmaped, then its associated dcache entry can't be freed either, so
> > > there's no point trying to trash all of its sibling dentries on the
> > > same page as that dcache entry.
> >
> > Agreed - that's why I think preventing fragmentation caused by LRU
> > reclaim is best dealt with internally to slab where both object age
> > and locality can be taken into account.
>
> Object age is not known by the slab.

See above.

> Locality is only considered in terms
> of hardware placement (NUMA nodes), not in relationship to objects of other
> caches (like inodes and dentries) or of the same cache.

And that is the deficiency we've been talking about correcting! i.e.
that object <-> page locality needs to be taken into account during
reclaim. Moving used/unused knowledge into the slab where page/object
locality is known is one way of doing that....

> If we want this then we may end up with a special allocator for the
> filesystem.

I don't see why a small extension to the slab code can't fix this...

> You and I discussed a couple of years ago adding a reference count to
> the objects of the slab allocator. Those explorations resulted in a much
> more complicated and different allocator that is geared to the needs of
> the filesystem for reclaim.

And those discussions and explorations led to the current defrag
code. After a couple of years, I don't think that the design we came
up with back then is the best way to approach the problem - it still
has many, many flaws. We need to explore different approaches
because none of the evolutionary approaches (i.e. tacking something
on the side) appear to be sufficient.

Cheers,

Dave.


--
Dave Chinner
da...@fromorbit.com

Nick Piggin

unread,
Feb 8, 2010, 2:40:01 AM2/8/10
to
On Thu, Feb 04, 2010 at 11:13:15AM -0600, Christoph Lameter wrote:
> On Thu, 4 Feb 2010, Nick Piggin wrote:
>
> > Well what I described is to do the slab pinning from the reclaim path
> > (rather than from slab calling into the subsystem). All slab locking
> > is basically "innermost", so you can pretty much poke the slab layer as
> > much as you like from the subsystem.
>
> Reclaim/defrag is called from the reclaim path (of the VM). We could
> enable a call from the fs reclaim code into the slab. But how would this
> work?

Well the exact details will depend, but I feel that things should
get easier because you pin the object (and therefore the slab) via
the normal and well tested reclaim paths.

So for example, for dcache, you will come in and take the normal
locks: dcache_lock, sb_lock, pin the sb, umount_lock. At which
point you have pinned dentries without changing any locking. So
then you can find the first entry on the LRU, and should be able
to then build a list of dentries on the same slab.

You still have the potential issue of now finding objects that would
not be visible by searching the LRU alone. However at least the
locking should be simplified.
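
A rough sketch of that gathering step against the dcache of this era
(the helper name is invented; it assumes dcache_lock is held and the
superblock pinned, and it only finds dentries visible on the per-sb
LRU, per the caveat above):

static void isolate_same_slab_dentries(struct super_block *sb,
                                       struct dentry *victim,
                                       struct list_head *dispose)
{
        struct page *page = virt_to_head_page(victim);
        struct dentry *dentry, *next;

        /* gather the unreferenced LRU dentries that share the victim's
         * slab page onto a private dispose list for reclaim */
        list_for_each_entry_safe(dentry, next, &sb->s_dentry_lru, d_lru)
                if (virt_to_head_page(dentry) == page)
                        list_move_tail(&dentry->d_lru, dispose);
}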


> > After that, LRU on slabs should be fairly easy. Slab could provide a
> > private per-slab pointer for example that is managed by the caller.
> > Subsystem can then call into slab to find the objects.
>
> Sure, with some minor changes we could have a call that gives you the
> list of neighboring objects in a slab while locking it. Then you can look
> at the objects and decide which ones can be tossed and then do another
> call to release the objects and unlock the slab.

Yep. Well... you may not even need to ask the slab layer to lock the
slab, provided that the subsystem is locking out changes. It could
possibly be helpful to have a call to lock and unlock the slab,
although usage of such an API would have to be very careful.

Thanks,
Nick

Christoph Lameter

unread,
Feb 8, 2010, 12:50:02 PM2/8/10
to
On Mon, 8 Feb 2010, Nick Piggin wrote:

> > > After that, LRU on slabs should be fairly easy. Slab could provide a
> > > private per-slab pointer for example that is managed by the caller.
> > > Subsystem can then call into slab to find the objects.
> >
> > Sure, with some minor changes we could have a call that gives you the
> > list of neighboring objects in a slab while locking it. Then you can look
> > at the objects and decide which ones can be tossed and then do another
> > call to release the objects and unlock the slab.
>
> Yep. Well... you may not even need to ask the slab layer to lock the
> slab, provided that the subsystem is locking out changes. It could
> possibly be helpful to have a call to lock and unlock the slab,
> although usage of such an API would have to be very careful.

True, if you are holding a reference to an object in a slab page and
there is a guarantee that the object is not going away, then the slab is
already effectively pinned.

So we just need a call that returns

1. The number of allocated objects in a slab page
2. The total possible number of objects
3. A list of pointers to the objects

?

Then reclaim could decide whether these objects should be
reclaimed.

Such a function could actually be much less code than the current
patchset and would also be easy to do for SLAB/SLOB.
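
Inside mm/slub.c such a call might look roughly like the following
sketch. for_each_object(), slab_lock() and page->objects exist in SLUB
today; object_is_free() stands in for an assumed way to test whether an
object sits on the freelist:

/*
 * Report the allocated objects in the slab page containing 'object'.
 * The caller's reference to 'object' pins the page. Fills 'objects'
 * (sized for at least a full slab) and '*total', returns the number
 * of allocated objects found.
 */
int kmem_cache_page_objects(struct kmem_cache *s, void *object,
                            void **objects, int *total)
{
        struct page *page = virt_to_head_page(object);
        void *p;
        int nr = 0;

        slab_lock(page);
        *total = page->objects;
        for_each_object(p, s, page_address(page), page->objects)
                if (!object_is_free(s, page, p))        /* assumed */
                        objects[nr++] = p;
        slab_unlock(page);
        return nr;
}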

Dave Chinner

unread,
Feb 8, 2010, 5:20:01 PM2/8/10
to
On Mon, Feb 08, 2010 at 06:37:53PM +1100, Nick Piggin wrote:
> On Thu, Feb 04, 2010 at 11:13:15AM -0600, Christoph Lameter wrote:
> > On Thu, 4 Feb 2010, Nick Piggin wrote:
> >
> > > Well what I described is to do the slab pinning from the reclaim path
> > > (rather than from slab calling into the subsystem). All slab locking
> > > is basically "innermost", so you can pretty much poke the slab layer as
> > > much as you like from the subsystem.
> >
> > Reclaim/defrag is called from the reclaim path (of the VM). We could
> > enable a call from the fs reclaim code into the slab. But how would this
> > work?
>
> Well the exact details will depend, but I feel that things should
> get easier because you pin the object (and therefore the slab) via
> the normal and well tested reclaim paths.
>
> So for example, for dcache, you will come in and take the normal
> locks: dcache_lock, sb_lock, pin the sb, umount_lock. At which
> point you have pinned dentries without changing any locking. So
> then you can find the first entry on the LRU, and should be able
> to then build a list of dentries on the same slab.
>
> You still have the potential issue of now finding objects that would
> not be visible by searching the LRU alone. However at least the
> locking should be simplified.

Very true, but that leads us to the same problem of fragmented
caches because we empty unused objects off slabs that are still
pinned by hot objects and don't free the page. I agree that we can't
totally avoid this problem, but I still think that using an object
based LRU for reclaim has a fundamental mismatch with page based
reclaim that makes this problem worse than it could be.

FWIW, if we change the above to keeping a page based LRU in the slab
cache and the slab picks a page to reclaim, then the problem goes
mostly away, I think. We don't need to pin the slab to select and
prepare a page to reclaim - the cache only needs to be locked before
it starts reclaim. I think this has a much better chance of
reclaiming entire pages in situations where LRU based reclaim will
leave fragmentation.

i.e. instead of:

shrink_slab
-> external shrinker
-> lock cache
-> find reclaimable object
-> call into slab w/ object
-> return longer list of objects
-> reclaim objects

we do:

shrink_slab
-> internal shrinker
-> find oldest page and make object list
-> external shrinker
-> lock cache
-> reclaim objects
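
In code, that reordered flow might look roughly like this sketch, where
kmem_cache_oldest_page() and the reclaim_objects callout are invented
names and the batch size is arbitrary:

static unsigned long shrink_one_slab_page(struct kmem_cache *s)
{
        void *objects[64];      /* arbitrary per-pass batch */
        int nr;

        /* internal shrinker: the slab picks the page whose objects
         * have, in aggregate, gone unused the longest */
        nr = kmem_cache_oldest_page(s, objects, ARRAY_SIZE(objects));
        if (nr <= 0)
                return 0;

        /* external callout: the subsystem locks its cache and works
         * off the ready-made object list, much as __shrink_dcache_sb()
         * and prune_icache() already work off their private lists */
        return s->reclaim_objects(objects, nr);
}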

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
