Greetings again.
(I haven't had time to follow up on my idea of partial object caching;
in the meantime, I've been using s3ql with a ridiculously small object
size as a workaround. But this is about another file system.)
Today I was rsyncing my NAS to the cloud when s3ql suddenly exited in
the middle of the transfer:
==== 8< ====
Aug 16 08:07:15 stratofortress.nexus.i.intelfx.name systemd[1]: Starting s3ql-mnt-b2-files.service...
Aug 16 08:07:18 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]: Using cached metadata.
Aug 16 08:07:18 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]: mount.s3ql[40268:MainThread] s3ql.mount.get_metadata: Using cached metadata.
Aug 16 08:07:18 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]: Loaded 0 entries from cache
Aug 16 08:07:18 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]: mount.s3ql[40268:MainThread] s3ql.block_cache.__init__: Loaded 0 entries from cache
Aug 16 08:07:18 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]: Mounting b2://intelfx-name-stratofortress-files at /mnt/b2/files...
Aug 16 08:07:18 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]: mount.s3ql[40268:MainThread] s3ql.mount.main_async: Mounting b2://intelfx-name-stratofortress-files at /mnt/b2/files...
Aug 16 08:07:19 stratofortress.nexus.i.intelfx.name systemd[1]: Started s3ql-mnt-b2-files.service.
Aug 16 08:07:19 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]: /usr/lib/python3.8/site-packages/_pyfuse3.py:30: TrioDeprecationWarning: trio.hazmat is deprecated since Trio 0.15.0; use trio.lowlevel instead (https://github.com/python-trio/trio/issues/476)
Aug 16 08:07:19 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]:   await fn(*args, **kwargs)
Aug 16 08:07:19 stratofortress.nexus.i.intelfx.name mount.s3ql[40268]: mount.s3ql[40268:MainThread] py.warnings._showwarnmsg: /usr/lib/python3.8/site-packages/_pyfuse3.py:30: TrioDeprecationWarning: trio.hazmat is deprecated since Trio 0.15.0; use trio.lowlevel instead (https://github.com/python-trio/trio/issues/476)
  await fn(*args, **kwargs)
Aug 16 08:09:16 stratofortress.nexus.i.intelfx.name systemd[1]: s3ql-mnt-b2-files.service: Succeeded.
Aug 16 08:09:16 stratofortress.nexus.i.intelfx.name systemd[1]: s3ql-mnt-b2-files.service: Consumed 1min 32.062s CPU time.
==== 8< ====
On the rsync side, it looked like this:
==== 8< ====
19.70G 1% 214.51MB/s 0:01:34 (xfr#19, to-chk=410/1602)
>f..t...... Backups/Hosts/able/borg/data/39/19782
rsync: [receiver] close failed on "/mnt/b2/files/Backups/Hosts/able/borg/data/39/.19781.C3Bufb": Transport endpoint is not connected (107)
rsync: [receiver] stat "/mnt/b2/files/Backups/Hosts/able/borg/data/39/.19781.C3Bufb" failed: Transport endpoint is not connected (107)
rsync: [receiver] rename "/mnt/b2/files/Backups/Hosts/able/borg/data/39/.19781.C3Bufb" -> "Backups/Hosts/able/borg/data/39/19781": Transport endpoint is not connected (107)
rsync error: error in file IO (code 11) at receiver.c(868) [receiver=v3.2.3]
rsync: [sender] write error: Broken pipe (32)
==== 8< ====
The s3ql command line (from the systemd unit):
==== 8< ====
$ systemctl cat s3ql-mnt-b2-files.service
# /etc/systemd/system/s3ql-mnt-b2-files.service
[Unit]
Wants=network-online.target
After=network.target network-online.target
[Service]
Type=notify
ExecStart=/usr/bin/mount.s3ql b2://intelfx-name-stratofortress-files /mnt/b2/files -o systemd,log=syslog,authfile=/etc/s3ql/authinfo2,cachedir=/var/tmp/s3ql,cachesize=104857600,max-cache-entries=524088,allow-other,compress=none,threads=16,backend-options=disable-versions,debug
ExecStop=/usr/bin/umount /mnt/b2/files
LimitNOFILE=524288
MemoryHigh=1G
==== 8< ====
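If a full traceback would help, I can also try re-running the mount in
the foreground, outside of systemd. Something along these lines (an
untested sketch; I'm assuming the same options can be passed as
long-form flags):
==== 8< ====
# run in the foreground with debug output for all modules, so that any
# traceback lands on stderr instead of being swallowed by the unit
$ /usr/bin/mount.s3ql --fg --debug \
      --authfile /etc/s3ql/authinfo2 --cachedir /var/tmp/s3ql \
      --cachesize 104857600 --max-cache-entries 524088 \
      --allow-other --compress none --threads 16 \
      --backend-options disable-versions \
      b2://intelfx-name-stratofortress-files /mnt/b2/files
==== 8< ====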
Here are the last few lines of the debug log:
==== 8< ====
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.fs.getxattr: started with 4113, b'security.capability'
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.block_cache._get_entry: started with 4113, 0
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.block_cache._get_entry: started with 4113, 0
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.block_cache._get_entry: started with 4113, 0
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.block_cache._get_entry: started with 4113, 0
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.block_cache._get_entry: started with 4113, 0
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.block_cache.with_event_loop: upload of 3439 failed
NoneType: None
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.block_cache._get_entry: started with 4113, 0
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.block_cache._get_entry: started with 4113, 0
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:MainThread] s3ql.fs.getxattr: started with 4113, b'security.capability'
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name mount.s3ql[43764]: mount.s3ql[43764:Thread-10] s3ql.mount.exchook: recording exception No pending requests
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name umount[44705]: umount: /mnt/b2/files: target is busy.
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name systemd[1]: s3ql-mnt-b2-files.service: Control process exited, code=exited, status=32/n/a
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name systemd[1]: s3ql-mnt-b2-files.service: Failed with result 'exit-code'.
Aug 16 08:28:27 stratofortress.nexus.i.intelfx.name systemd[1]: s3ql-mnt-b2-files.service: Consumed 1min 27.740s CPU time, read 2.3M from disk, written 70.6M to disk.
==== 8< ====
For some unknown reason, the first line of this excerpt
(s3ql.fs.getxattr) is spammed all over the log -- I'm not sure what is
issuing several hundred getxattr calls per second.
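Next time I can try checking with strace whether it's rsync itself
issuing them (it may also be the kernel querying security.capability on
each write, in which case strace would show nothing) -- a hypothetical
check, with <pid> being the rsync receiver:
==== 8< ====
# count xattr-related syscalls of the suspected process for a while
$ strace -f -c -e trace=getxattr,lgetxattr,fgetxattr -p <pid>
==== 8< ====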
Anyway, the actual log (17M):
https://intelfx.name/files/persistent/2020-08-16%20s3ql%20crash/s3ql-mnt-b2-files.log
I can reliably reproduce this crash by re-running the same rsync
command; it fails each time after writing a specific amount of data.
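If it would help to take rsync out of the picture, I can presumably
also try reproducing this with plain sequential writes -- a hypothetical
reproducer (untested; the path and size are just illustrative, picked to
roughly match the ~20G that rsync managed to write):
==== 8< ====
# stream ~20 GiB into a single file on the s3ql mount and watch whether
# the mount dies around the same write volume as with rsync
$ dd if=/dev/urandom of=/mnt/b2/files/testfile bs=1M count=20480 status=progress
==== 8< ====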
This also happened with another file system (the borg one), although I
have not tried to reproduce it there, as I need to repair the borg
repository first.
Any ideas about what could be going wrong, or how to debug this further?
Thanks,
--
Ivan Shapovalov / intelfx /