
Re: WAPBL fix for deallocation exhaustion + slow file removal


Taylor R Campbell

Oct 1, 2016, 10:48:43 AM
to Jaromír Doleček, tech...@netbsd.org
Date: Sat, 1 Oct 2016 16:19:33 +0200
From: Jaromír Doleček <jaromir...@gmail.com>

> The attached patch contains a fix for WAPBL deallocation structure
> exhaustion and panic (kern/47146), and avoids the need to do slow
> partial truncates in a loop, fixing kern/49175.
>
> [...]
>
> I plan to commit this sometime tomorrow.

Thanks for taking a shot at this! But I think it needs a little more
time for review -- certainly I can't digest it in the 24 hours you're
giving.

From a quick glance at the patch, I see one bug immediately in
vfs_wapbl.c that must have been introduced in a recent change:
pool_get(PR_WAITOK) is forbidden while holding a mutex, but
wapbl_register_deallocation does just that.
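The problematic pattern and the usual fix can be sketched in userland terms. This is not the actual vfs_wapbl.c code: pthread mutexes and malloc stand in for the kernel's mutex(9) and pool_get(9), and all names here are illustrative.

```c
#include <pthread.h>
#include <stdlib.h>

struct entry {
	struct entry *next;
	int blkno;
};

static pthread_mutex_t wl_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct entry *wl_dealloclist;

/* BAD: the allocation (which may sleep) happens with the mutex held. */
void
register_dealloc_bad(int blkno)
{
	pthread_mutex_lock(&wl_mtx);
	struct entry *e = malloc(sizeof(*e)); /* may block with wl_mtx held */
	if (e != NULL) {
		e->blkno = blkno;
		e->next = wl_dealloclist;
		wl_dealloclist = e;
	}
	pthread_mutex_unlock(&wl_mtx);
}

/* OK: allocate first, take the mutex only around the list update. */
void
register_dealloc_ok(int blkno)
{
	struct entry *e = malloc(sizeof(*e)); /* sleepable, no lock held */
	if (e == NULL)
		return;
	e->blkno = blkno;
	pthread_mutex_lock(&wl_mtx);
	e->next = wl_dealloclist;
	wl_dealloclist = e;
	pthread_mutex_unlock(&wl_mtx);
}
```

If the allocation must be conditional on state protected by the mutex, the usual idiom is to drop the lock, allocate, reacquire, and re-check the state before using the new entry.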

What happens if wapbl_register_deallocation fails in the middle of
ffs_truncate, and then the power goes out before the caller retries
the truncation? It looks like we may end up committing a transaction
to disk that partially zeros a file -- and it is not immediately clear
to me who will be responsible for freeing the remaining blocks when
the system comes back up.

I don't understand why the possible failure in the fragment extension
code should not be worrying. `Only 7 times during a file's lifetime'
says to me that if I check out and delete the NetBSD CVS repository
once, there are about two million times this can happen. If we can't
restart the transaction safely, we need to find another way to do it.

Can you say anything about alternatives you might have considered that
don't entail adding undo/restart logic to every user of UFS_TRUNCATE
and why they were unfit?

Jaromír Doleček

Oct 1, 2016, 12:40:46 PM
to Taylor R Campbell, Tech-kern
> Thanks for taking a shot at this! But I think it needs a little more
> time for review -- certainly I can't digest it in the 24 hours you're
> giving.

Sure.

> From a quick glance at the patch, I see one bug immediately in
> vfs_wapbl.c that must have been introduced in a recent change:
> pool_get(PR_WAITOK) is forbidden while holding a mutex, but
> wapbl_register_deallocation does just that.

I'll recheck this, thank you. I ran it under rump only; I think I had
it compiled with LOCKDEBUG, so it should trigger the same assertions
as a LOCKDEBUG kernel. But apparently not.

> What happens if wapbl_register_deallocation fails in the middle of
> ffs_truncate, and then the power goes out before the caller retries
> the truncation? It looks like we may end up committing a transaction
> to disk that partially zeros a file -- and it is not immediately clear
> to me who will be responsible for freeing the remaining blocks when
> the system comes back up.

The indirect disk blocks go via the journal also, so if power fails
while the journal is not yet committed, it would work just the same as
usual, i.e. no change would be visible on the filesystem. If truncate
fails once, the partial truncate would be committed; if power then
fails before the truncate is completely finished, the log replay would
replay just that first part, creating the hole.

> I don't understand why the possible failure in the fragment extension
> code should not be worrying. `Only 7 times during a file's lifetime'
> says to me that if I check out and delete the NetBSD CVS repository
> once, there are about two million times this can happen. If we can't
> restart the transaction safely, we need to find another way to do it.

The fragment deallocation is only registered when a file is extended
from a size less than a full block, and only when there is no free
block available just next to the fragment. This can't happen more than
once per data write operation, hence once per each
wapbl_begin()/wapbl_end() block.

The two million file creates are all in different vnode calls, each
call checking the limits in wapbl_start() and flushing if necessary.
It would be extremely difficult to trigger transaction overflow there.
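The limit-check-and-flush behaviour described above can be sketched as follows. The names, the limit value, and the structure are hypothetical stand-ins, not the actual WAPBL code; the point is only that flushing whenever the counter reaches the limit at transaction start bounds the count regardless of how many operations run.

```c
#include <assert.h>

/* Hypothetical limit; the real one is sized from the journal geometry. */
#define DEALLOC_LIMIT 128

static int wl_dealloccnt;	/* deallocations registered so far */
static int wl_flushes;		/* how many times we flushed */

static void
wapbl_flush_sketch(void)
{
	/* Commit the current transaction and reset the counter. */
	wl_flushes++;
	wl_dealloccnt = 0;
}

/*
 * Each vnode operation begins a transaction; if the accumulated
 * deallocations have reached the limit, flush first, so a long run of
 * small operations (e.g. two million unlinks) can never overflow the
 * deallocation array.
 */
static void
wapbl_begin_sketch(void)
{
	if (wl_dealloccnt >= DEALLOC_LIMIT)
		wapbl_flush_sketch();
}

void
do_one_operation(void)
{
	wapbl_begin_sketch();
	wl_dealloccnt++;	/* this op registers one deallocation */
}
```

However many operations are issued, the counter never exceeds the limit, because each operation's begin step flushes before it can.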

The deletions won't cause any problem, since that path uses the
!force variant, so it can't overflow the transaction.

> Can you say anything about alternatives you might have considered that
> don't entail adding undo/restart logic to every user of UFS_TRUNCATE
> and why they were unfit?

Right now every caller of UFS_TRUNCATE() does it implicitly, via
ufs_truncate(). That function currently tries to avoid big
transactions by truncating on indirect block boundaries with WAPBL,
but that fails to account for bigger directories and is of course slow.

There is an alternative patch in kern/49175, which merely changes
ufs_truncate() to check the number of real blocks and adjust the
truncate size to fit the transaction within that limit. That patch
however locks the transaction exclusively in order to prevent some
parallel operations from exhausting the dealloc limit, and I don't
like how it moves logic about allocated blocks away from
ffs_alloc/ffs_inode.

I considered another alternative with dealloc pre-registration, which
also avoids the exclusive lock. But that would still require partial
truncate support similar to my patch in order to avoid stalls when
there are several parallel requests, so basically it doesn't bring any
extra value. Pre-registration also requires some accounting, hence
extra code, so it seemed simpler to go without.

Jaromir

Taylor R Campbell

Oct 1, 2016, 1:00:21 PM
to Jaromír Doleček, Tech-kern
Date: Sat, 1 Oct 2016 18:40:31 +0200
From: Jaromír Doleček <jaromir...@gmail.com>

> > Thanks for taking a shot at this! But I think it needs a little more
> > time for review -- certainly I can't digest it in the 24 hours you're
> > giving.
>
> Sure.

OK, thanks! I'll try to find some time to review this more carefully
in the next few days. For now, one comment:

> > From a quick glance at the patch, I see one bug immediately in
> > vfs_wapbl.c that must have been introduced in a recent change:
> > pool_get(PR_WAITOK) is forbidden while holding a mutex, but
> > wapbl_register_deallocation does just that.
>
> I'll recheck this, thank you. I ran it under rump only; I think I had
> it compiled with LOCKDEBUG, so it should trigger the same assertions
> as a LOCKDEBUG kernel. But apparently not.

Since wl->wl_mtx is an adaptive lock, not a spin lock, nothing
automatically detects this. But in general, with minor exceptions,
one is not supposed to sleep while holding an adaptive lock.
(Exception example: waiting for a cheap cross-call to complete in
pserialize_perform(9).)

It's also suboptimal that we sleep while holding rwlocks for vnode
locks, since rw_enter is uninterruptible, so if a wait inside the
kernel hangs with a vnode lock held, anyone else trying to examine
that vnode will hang uninterruptibly too. But it will take some
effort to undo that state of affairs by teaching all callers of
vn_lock to accept failure.

The same goes for the long-term file system transaction lock
wl->wl_rwlock. However, suboptimality of sleeping with those
long-term locks aside, certainly one should not sleep while holding
the short-term wl->wl_mtx.

David Holland

Oct 1, 2016, 1:54:44 PM
to Taylor R Campbell, Jaromír Doleček, Tech-kern
On Sat, Oct 01, 2016 at 05:00:10PM +0000, Taylor R Campbell wrote:
> It's also suboptimal that we sleep while holding rwlocks for vnode
> locks, since rw_enter is uninterruptable, so if a wait inside the
> kernel hangs with a vnode lock held, anyone else trying to examine
> that vnode will hang uninterruptably too. But it will take some
> effort to undo that state of affairs by teaching all callers of
> vn_lock to accept failure.

...even if someone can persuade the developer consensus to accept
that change, which has failed in the past.

(more on the rest when I have some time, I hope)

--
David A. Holland
dhol...@netbsd.org

Taylor R Campbell

Oct 6, 2016, 8:24:48 PM
to Jaromír Doleček, tech...@netbsd.org
Date: Thu, 6 Oct 2016 22:36:48 +0200
From: Jaromír Doleček <jaromir...@gmail.com>

> I've incorporated the mutex fix, here is the final patch relative to
> trunk. I'd like to commit this sometime next week.

OK, thanks, I'll try to find time to review in the next couple days!