
[patch] mm, page_alloc: allow __GFP_NOFAIL to allocate below watermarks after reclaim


David Rientjes

Dec 9, 2013, 5:10:02 PM
If direct reclaim has failed to free memory, __GFP_NOFAIL allocations
can potentially loop forever in the page allocator. In this case, it's
better to give them the ability to access below watermarks so that they
may allocate with the same privilege given to GFP_ATOMIC
allocations.

We're careful to ensure this is only done after direct reclaim has had
the chance to free memory, however.

Signed-off-by: David Rientjes <rien...@google.com>
---
mm/page_alloc.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2629,6 +2629,11 @@ rebalance:
 						pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+
+		/* Allocations that cannot fail must allocate from somewhere */
+		if (gfp_mask & __GFP_NOFAIL)
+			alloc_flags |= ALLOC_HARDER;
+
 		goto rebalance;
 	} else {
 		/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Mel Gorman

Dec 10, 2013, 3:00:03 AM
On Mon, Dec 09, 2013 at 02:03:45PM -0800, David Rientjes wrote:
> If direct reclaim has failed to free memory, __GFP_NOFAIL allocations
> can potentially loop forever in the page allocator. In this case, it's
> better to give them the ability to access below watermarks so that they
> may allocate with the same privilege given to GFP_ATOMIC
> allocations.
>
> We're careful to ensure this is only done after direct reclaim has had
> the chance to free memory, however.
>
> Signed-off-by: David Rientjes <rien...@google.com>

The main problem with doing something like this is that it just smacks
into the adjusted watermark if there are a number of __GFP_NOFAIL
allocations. Who was the user of __GFP_NOFAIL that was fixed by this
patch?

It appears there are more __GFP_NOFAIL users than I expected and some of
them are silly. md uses it after a GFP_ATOMIC mempool_alloc fails and then
immediately retries with __GFP_NOFAIL in a context that can sleep. It
could just have used GFP_NOIO for the mempool alloc, which would "never"
fail.
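
Roughly the pattern being described, as a hypothetical sketch (not the
actual md code; "pool", "elem" and "size" are placeholders):

	elem = mempool_alloc(pool, GFP_ATOMIC);
	if (!elem)
		/* we can sleep here anyway, so insisting on __GFP_NOFAIL
		 * after a non-sleeping attempt failed is odd */
		elem = kmalloc(size, GFP_NOIO | __GFP_NOFAIL);

vs.

	/* a sleeping mempool_alloc() simply waits for an element to be
	 * returned to the pool and "never" fails, no __GFP_NOFAIL needed */
	elem = mempool_alloc(pool, GFP_NOIO);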

btrfs is using __GFP_NOFAIL to call the slab allocator for the extent
cache but also for a kmalloc cache, which is just dangerous. After this
patch, that thing can push the system below the watermarks and then
effectively "leak" the reserves to other !__GFP_NOFAIL users.

The buffer cache uses __GFP_NOFAIL to grow buffers where it expects that
the page allocator can loop endlessly, but again, allowing it to go below
the reserves is just going to hit the same wall a short time later.

gfs is using the flag with kmalloc slabs; as with btrfs, this can "leak"
the reserves. jbd is the same, although jbd2 avoids using the flag, in a
manner of speaking.

There are enough bad users of __GFP_NOFAIL that I really question how
good an idea it is to allow emergency reserves to be used when they are
potentially leaked to other !__GFP_NOFAIL users via the slab allocator
shortly afterwards.
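
To illustrate the "leak" (a sketch; the cache and variable names are only
for illustration): the slab page that a __GFP_NOFAIL caller pulls from
below the watermarks then backs later allocations from the same cache
that never asked for reserve access:

	/* pulls a fresh slab page from below the watermarks if necessary */
	em = kmem_cache_alloc(extent_map_cache, GFP_NOFS | __GFP_NOFAIL);

	/* later allocations from the same cache are served from that page
	 * without ever passing __GFP_NOFAIL themselves */
	em2 = kmem_cache_alloc(extent_map_cache, GFP_NOFS);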

--
Mel Gorman
SUSE Labs

David Rientjes

Dec 10, 2013, 6:10:03 PM
On Tue, 10 Dec 2013, Mel Gorman wrote:

> > If direct reclaim has failed to free memory, __GFP_NOFAIL allocations
> > can potentially loop forever in the page allocator. In this case, it's
> > better to give them the ability to access below watermarks so that they
> > may allocate with the same privilege given to GFP_ATOMIC
> > allocations.
> >
> > We're careful to ensure this is only done after direct reclaim has had
> > the chance to free memory, however.
> >
> > Signed-off-by: David Rientjes <rien...@google.com>
>
> The main problem with doing something like this is that it just smacks
> into the adjusted watermark if there are a number of __GFP_NOFAIL
> allocations. Who was the user of __GFP_NOFAIL that was fixed by this
> patch?
>

Nobody, it comes out of a memcg discussion where __GFP_NOFAIL allocations
were recently given the ability to bypass charges to the root memcg when
the memcg has hit its limit, since we disallow the oom killer from killing
a process (for the same reason that the vast majority of __GFP_NOFAIL
users, those that do GFP_NOFS | __GFP_NOFAIL, disallow the oom killer in
the page allocator).

Without some other thread freeing memory, these allocations simply loop
forever. We probably don't want to reconsider the choice that prevents
calling the oom killer in !__GFP_FS contexts since it will allow
unnecessary oom killing when memory can actually be freed by another
thread.
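
(For reference, this is the check in question, simplified from
__alloc_pages_slowpath() of that era: the oom killer is only considered
for __GFP_FS && !__GFP_NORETRY allocations, so GFP_NOFS | __GFP_NOFAIL
falls through and just keeps retrying.)

	if (!did_some_progress) {
		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
			/* out of memory and allowed to invoke the oom killer */
			page = __alloc_pages_may_oom(gfp_mask, order, zonelist,
					high_zoneidx, nodemask, preferred_zone,
					migratetype);
			if (page)
				goto got_pg;
			goto restart;
		}
		/* GFP_NOFS | __GFP_NOFAIL never reaches the oom killer */
	}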

Since there are comments in both gfp.h and page_alloc.c that say no new
users will be added, it seems legitimate to ensure that the allocation
will at least have a chance of succeeding, but not to the point of
depleting memory reserves entirely.

> There are enough bad users of __GFP_NOFAIL that I really question how
> good an idea it is to allow emergency reserves to be used when they are
> potentially leaked to other !__GFP_NOFAIL users via the slab allocator
> shortly afterwards.
>

You could make the same argument for GFP_ATOMIC which can also allow
access to memory reserves.
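
For reference, the watermark check treats the two much the same way; a
simplified sketch of __zone_watermark_ok() of that era:

	long min = mark;

	if (alloc_flags & ALLOC_HIGH)	/* __GFP_HIGH, i.e. GFP_ATOMIC */
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)	/* what this patch grants __GFP_NOFAIL */
		min -= min / 4;

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;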

Mel Gorman

Dec 11, 2013, 4:30:02 AM
On Tue, Dec 10, 2013 at 03:03:39PM -0800, David Rientjes wrote:
> Since there are comments in both gfp.h and page_alloc.c that say no new
> users will be added, it seems legitimate to ensure that the allocation
> will at least have a chance of succeeding, but not to the point of
> depleting memory reserves entirely.

Which __GFP_NOFAIL on its own does not guarantee if the allocations just
smack into that barrier and cannot do anything. It changes the timing; it
does not fix the problem.

> > There are enough bad users of __GFP_NOFAIL that I really question how
> > good an idea it is to allow emergency reserves to be used when they are
> > potentially leaked to other !__GFP_NOFAIL users via the slab allocator
> > shortly afterwards.
> >
>
> You could make the same argument for GFP_ATOMIC which can also allow
> access to memory reserves.

The critical difference is that GFP_ATOMIC callers typically can handle
NULL being returned to them. GFP_ATOMIC storms may starve !GFP_ATOMIC
requests, but they do not cause the same types of problems that
__GFP_NOFAIL using the reserves would.
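
i.e. the usual shape of a GFP_ATOMIC caller, as a generic illustration
(not taken from any particular driver; "stats" and "len" are
placeholders):

	skb = alloc_skb(len, GFP_ATOMIC);
	if (!skb) {
		/* drop the packet, bump a counter and move on */
		stats->rx_dropped++;
		return NET_RX_DROP;
	}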

--
Mel Gorman
SUSE Labs

Dave Chinner

Dec 11, 2013, 8:20:02 PM
On Tue, Dec 10, 2013 at 03:03:39PM -0800, David Rientjes wrote:
> On Tue, 10 Dec 2013, Mel Gorman wrote:
>
> > > If direct reclaim has failed to free memory, __GFP_NOFAIL allocations
> > > can potentially loop forever in the page allocator. In this case, it's
> > > better to give them the ability to access below watermarks so that they
> > > may allocate with the same privilege given to GFP_ATOMIC
> > > allocations.
> > >
> > > We're careful to ensure this is only done after direct reclaim has had
> > > the chance to free memory, however.
> > >
> > > Signed-off-by: David Rientjes <rien...@google.com>
> >
> > The main problem with doing something like this is that it just smacks
> > into the adjusted watermark if there are a number of __GFP_NOFAIL
> > allocations. Who was the user of __GFP_NOFAIL that was fixed by this
> > patch?
> >
>
> Nobody, it comes out of a memcg discussion where __GFP_NOFAIL allocations
> were recently given the ability to bypass charges to the root memcg when
> the memcg has hit its limit, since we disallow the oom killer from killing
> a process (for the same reason that the vast majority of __GFP_NOFAIL
> users, those that do GFP_NOFS | __GFP_NOFAIL, disallow the oom killer in
> the page allocator).
>
> Without some other thread freeing memory, these allocations simply loop
> forever.

So what is kswapd doing in this situation?

> Since there are comments in both gfp.h and page_alloc.c that say no new
> users will be added, it seems legitimate to ensure that the allocation
> will at least have a chance of succeeding, but not to the point of
> depleting memory reserves entirely.

As I said before, the filesystem will then simply keep allocating
memory until it hits the next limit, and then you're back in the
same situation. Moving the limit at which it fails does not solve
the problem at all.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com