Thank you for explaining the situation. I understand what's going on.
In the end, going back to my question, this is caused by file system
corruption: because the inode bitmap is corrupted, the inode number
that should belong to the ".nilfs" file was reassigned by nilfs_mkdir()
to "file0", creating a duplicated inode at runtime.
This duplication then causes an underflow of i_nlink during rmdir operations.
After considering the issue, I came to the conclusion that although we
could strengthen the checks for bitmap inconsistencies and inode type
inconsistencies to detect such errors in advance, those approaches
cannot fundamentally solve the problem.
The root cause is a corrupted inode bitmap, but such checks are
difficult in the first place when multiple directories with the same
inode number exist in the namespace.
Therefore, the appropriate solution is either to check for i_nlink
underflow during rmdir, or to detect i_nlink == 0 in the call path
nilfs_lookup() --> nilfs_iget(), and I think the latter approach,
which you have chosen, is better.
> >
> > When I mounted the mount_0 image as read-only, the filesystem looked
> > normal without such inode duplication.
> >
> > At least, nilfs_read_inode_common(), which reads inodes from block
> > devices, is implemented to return an error with -ESTALE if i_nlink ==
> > 0. So it seems that nilfs_iget() picked up this inode with i_nlilnk
> > == 0 because it hit an inode being deleted in the inode cache. Why is
> > that happening?
> Are you talking about the following call trace?
> If so, then because the value of inode->i_state is I_DIRTY (set in nilfs_mkdir)
> it will not enter __nilfs_read_inode().
>
> nilfs_iget()->
> __nilfs_read_inode()->
> nilfs_read_inode_common()
Yes, in this case __nilfs_read_inode() is not called. As mentioned
above, my conclusion is that i_nlink anomalies can arise from other FS
corruption patterns as well, such as inode bitmap corruption and
directory inconsistencies.
> >
> > Also, why do you put the i_size check as an AND condition?
> i_size will set to 0 in nilfs_rmdir(), so check it too.
> > i_size is independent of i_nlink and the inode lifecycles. If i_size
> > is also broken, this check will not work properly.
> > If something is not working and you have included it as a workaround,
> > I would like to know about it.
To fix the problem, please modify the patch as follows:
(1) In nilfs_iget(), if the inode obtained by nilfs_iget_locked()
without I_NEW has i_nlink == 0, call iput() and return
ERR_PTR(-ESTALE). Do not call make_bad_inode(). Also, do not check for
i_size == 0 (it is not a good idea to use i_size to identify the
rmdir case).
(2) In nilfs_lookup(), if the return value of nilfs_iget() is
ERR_PTR(-ESTALE), output a message with nilfs_error() indicating that
file system corruption has been detected, and return ERR_PTR(-EIO).
Please refer to ext2_lookup() for the concrete implementation.
The reason for not calling make_bad_inode() is to avoid interfering
with the eviction path when i_nlink has been set to 0 and nilfs_evict()
is about to delete the inode. The VFS layer guarantees that inode
acquisition and eviction are mutually exclusive, but it is still
possible to grab the inode just before evict(), and the result of
interfering in that case is unpredictable.
Instead, as mentioned above, I think it is safer to call nilfs_error()
at the nilfs_lookup() level, where namespace operations are involved.
Also, could you please explain in the commit log that the inode
duplication occurred due to file system corruption (in this case,
inode bitmap corruption), and that when inode duplication occurs due
to such file system inconsistencies, a link count underflow can occur
during an rmdir operation, which is why a runtime link count check is
necessary?
Thank you in advance.
Ryusuke Konishi