BUG ALERT: Dirty inode was destroyed!


Nicolas Deschildre

Aug 21, 2020, 9:03:53 AM
to s3ql
Hi everybody!

I'm willing to debug this bug. Could you please give me the theoretical ways it could happen, so that I can search more efficiently for its cause?
--> S3QL 3.3.2 (I know, cannot upgrade right now)
--> Happened on 6 of ~200 S3QL instances I manage
--> Openstack OVH backend
--> fsck.s3ql does not fix the issue; it will happen again at some random time later

2020-08-12 09:54:14.716 5090:fuse-worker-9 root.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
  File "/data/s3ql/src/s3ql/inode_cache.py", line 92, in __del__
    raise RuntimeError('BUG ALERT: Dirty inode was destroyed!')
RuntimeError: BUG ALERT: Dirty inode was destroyed!
2020-08-12 09:54:15.006 5090:MainThread s3ql.mount.unmount: Unmounting file system...
2020-08-12 09:54:15.141 5090:MainThread root.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
  File "/data/s3ql/bin/mount.s3ql", line 21, in <module>
    s3ql.mount.main(sys.argv[1:])
  File "/data/s3ql/src/s3ql/mount.py", line 214, in main
    raise RuntimeError('Received signal %d, terminating' % (ret,))

Thanks a lot!
Nicolas Deschildre

Daniel Jagszent

Aug 21, 2020, 1:04:36 PM
to s3ql
Hello Nicolas,

> --> fsck.s3ql does not fix the issue; it will happen again at some random
> time later
S3QL somehow manages to delete/garbage-collect an _Inode object that is
dirty (i.e. has an attribute modification that is not yet persisted in the
S3QL database).
This bug is probably not related to the SQLite database or the backend,
so a fsck.s3ql does not help.
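
For reference, the check that fires looks roughly like this (a simplified sketch, not the exact inode_cache.py code): each _Inode carries a dirty flag that is set when an attribute is modified and cleared when the attributes are written back to the database, and __del__ complains if the object dies while the flag is still set.

# Simplified sketch of the guard (not the exact s3ql source):
class _Inode:
    def __init__(self):
        object.__setattr__(self, 'dirty', False)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        if name != 'dirty':
            # any attribute modification marks the inode as dirty
            object.__setattr__(self, 'dirty', True)

    def __del__(self):
        if self.dirty:
            raise RuntimeError('BUG ALERT: Dirty inode was destroyed!')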

Can it be reproduced? I suspect it happens when you do many file
attribute modifications (chmod, chown, touch) simultaneously. You might
want to test this with a throw-away local backend without
compression/encryption – since this bug is unlikely to be related to the
backend used – so you can skip all the network traffic/costs of a remote backend.
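
For example, something along these lines might do as a stress test (a hypothetical script; it assumes the throw-away file system is already mounted at MOUNTPOINT):

# Hypothetical stress test: hammer a mounted S3QL file system with many
# concurrent attribute modifications (the equivalent of chmod/touch).
# Adjust MOUNTPOINT to wherever the throw-away local backend is mounted.
import os
import threading
import time

MOUNTPOINT = '/mnt/s3ql-test'   # assumed mount point
NFILES, NTHREADS, ROUNDS = 200, 8, 1000

paths = [os.path.join(MOUNTPOINT, 'file-%d' % i) for i in range(NFILES)]
for p in paths:
    open(p, 'w').close()

def worker():
    now = time.time()
    for _ in range(ROUNDS):
        for p in paths:
            os.chmod(p, 0o644)        # attribute change -> dirties the inode
            os.utime(p, (now, now))   # another attribute change

threads = [threading.Thread(target=worker) for _ in range(NTHREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()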

> --> S3QL 3.3.2 (I know, cannot upgrade right now)
You might want to upgrade first before diving into debugging this issue.
Since newer versions of S3QL use pyfuse3 instead of llfuse and also use
newer trio versions, it might very well be that your bug has vanished in
newer S3QL versions.


Nicolas Deschildre

Sep 9, 2020, 3:24:52 AM
to s3ql
Hello!

Sorry, it seems a previous answer did not go through.
Thank you for your answer.

> --> S3QL 3.3.2 (I know, cannot upgrade right now)
> You might want to upgrade first before diving into debugging this issue.
> Since newer versions of S3QL use pyfuse3 instead of llfuse and also use
> newer trio versions, it might very well be that your bug has vanished in
> newer S3QL versions.

OK, I did the upgrade to S3QL 3.5.0 last week.
--> The bug is still present (had one occurrence in the week since the upgrade)
--> BUT the filesystem no longer crashes, which is good.

> S3QL somehow manages to delete/garbage-collect an _Inode object that is
> dirty (i.e. has an attribute modification that is not yet persisted in the
> S3QL database).

So, if I understand correctly, since it is a pending modification on a now-deleted inode, this is not really a problem, right? In other words, the filesystem is not corrupted?
If the filesystem is not corrupted, and it now no longer crashes, I can live with that :-)
 
Thank you,
Nicolas Deschildre

Daniel Jagszent

Sep 9, 2020, 11:50:19 AM
to s3ql
Hello Nicolas,
 
> S3QL somehow manages to delete/garbage-collect an _Inode object that is
> dirty (i.e. has an attribute modification that is not yet persisted in the
> S3QL database).

> So, if I understand correctly, since it is a pending modification on a now-deleted inode, this is not really a problem, right? In other words, the filesystem is not corrupted? [...]

No, the inode/file does not have to be deleted. There is a pending metadata modification (access time, modification time, file size, file mode) that should have been persisted (written to the SQLite database) but did not make it into the database.
File data is not corrupted, but some metadata of your files might not be in the correct state.

Nicolas Deschildre

Sep 11, 2020, 2:50:24 AM
to s3ql
Hello,

Thanks, OK, I now understand the code better.
InodeCache holds 100 inodes as a cache in a ring. When a new inode that is not in the cache is requested, an inode in the cache is flushed and evicted, and the new inode is stored in its place.
The bug: a race condition: two inodes are requested from InodeCache at the same time. Thread 1, requesting inode 1, flushes and removes inode 2 from the cache. Thread 2 got inode 2 before it was removed from the cache by Thread 1, but makes changes after it was removed and flushed. Thread 2 ends, there are no longer any references to inode 2, Python garbage-collects it, and this triggers the bug.
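
To illustrate the suspected failure mode with a toy model (hypothetical code, not the real InodeCache; the interleaving is simulated sequentially, with comments marking which thread would do what):

# Toy model of the suspected race (hypothetical, not the real InodeCache).
class ToyInode:
    def __init__(self, inode_id):
        self.id = inode_id
        self.dirty = False

    def __del__(self):
        if self.dirty:
            print('BUG ALERT: Dirty inode %d was destroyed!' % self.id)

class ToyCache:
    # FIFO ring of at most `size` inodes, like InodeCache but tiny.
    def __init__(self, size=2):
        self.size = size
        self.cached = {}   # inode id -> ToyInode
        self.order = []    # FIFO eviction order

    def __getitem__(self, inode_id):
        if inode_id not in self.cached:
            if len(self.order) >= self.size:
                victim_id = self.order.pop(0)
                victim = self.cached.pop(victim_id)
                victim.dirty = False   # "flush": persist attrs, mark clean
            self.cached[inode_id] = ToyInode(inode_id)
            self.order.append(inode_id)
        return self.cached[inode_id]

cache = ToyCache(size=2)
inode2 = cache[2]    # Thread 2: gets a reference to inode 2
cache[1]             # some other inode enters the cache
cache[3]             # Thread 1: cache miss evicts and "flushes" inode 2
inode2.dirty = True  # Thread 2: modifies its (now stale) inode 2; in the
                     # real code any attribute change marks it dirty
del inode2           # Thread 2 done: last reference gone, the dirty inode
                     # is destroyed and the __del__ check fires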

I see 2 possible solutions:
1/ https://github.com/s3ql/s3ql/pull/196 : the _Inode class keeps a reference to the InodeCache. On __del__, if we encounter the above bug, we flush ourselves (rough sketch after this list). The problem (and I'm not familiar enough with Python): I guess garbage collection could happen at shutdown, when the InodeCache SQL connection is no longer valid. Do you see a way to make this approach work?
2/ (Not coded yet): the InodeCache is a ring, first in, first out. What if we store the access time on InodeCache.__getitem__, and the inode to be evicted is the least recently accessed one? This should greatly reduce (but not eliminate) the race condition. What do you think?
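
The idea behind 1/ would be roughly this (hypothetical names, not the actual content of the pull request):

# Rough sketch of option 1 (hypothetical names, not the actual PR):
# the inode keeps a back-reference to its InodeCache and, as a last resort,
# flushes itself in __del__ instead of raising.
class _Inode:
    def __init__(self, inode_id, cache):
        self.id = inode_id
        self.cache = cache   # back-reference to the owning InodeCache
        self.dirty = False

    def __del__(self):
        if self.dirty:
            # Last-resort flush. Open question: at interpreter shutdown the
            # cache's database connection may already be closed, so this
            # write could fail.
            self.cache.flush(self)
            self.dirty = False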

Finally: I tried to reproduce locally, with an unencrypted, uncompressed local backend and mass parallel attribute changes, but I was not able to reproduce the bug.

Thanks,
Nicolas Deschildre

Nicolas Deschildre

Sep 17, 2020, 12:26:33 PM
to s3ql
Hello,

OK, I found it!! (Or at least, one of them.)
And a sure way to reproduce the crash.
All this time I was looking for an async/await bug... but in the end it was another kind of concurrency issue :-)

The bug:
It happens if you move a file from one folder to another while the inode cache is in a very specific state (the file to be moved uncached, the destination folder in cache and about to be evicted on the next inode cache miss, the origin folder in cache and about to be evicted right after).
--> fs.py:654 greatly reduces the occurrence of this bug by ensuring the origin and destination folders are in the inode cache
--> fs.py:658 self._lookup() loads the inode of the file to be moved, evicting the destination folder inode.
--> fs.py:719 self.inodes[id_p_old] loads the origin folder, which was still in cache.
--> fs.py:720 self.inodes[id_p_new] loads the destination folder, evicting the origin folder inode.
--> fs.py:721/723 changes are made on the origin folder inode, which is no longer tracked by the inode cache. We have the bug.


Thanks for the pointers!

Nicolas Deschildre

Nikolaus Rath

Sep 17, 2020, 3:07:32 PM
to s3...@googlegroups.com
Great work, thank you for tracking this down Nicolas!

Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«


Nicolas Deschildre

Nov 13, 2020, 5:29:49 AM
to s3ql
Hello,

A quick follow-up on this: since I put the patch live, I haven't had a single new BUG ALERT. Looks like that was the only one I was getting.

Cheers,
Nicolas