crash report: s3ql 5.4 plus metadata patch on debian trixie


Joseph Maher

Sep 13, 2025, 1:32:47 PM
to s3...@googlegroups.com

Just thought I'd report the following crash using 5.4 with the additional
patch for setting metadata-backup-interval=0. The error message is long,
so I have posted it at the end of this email. I don't know what it means; it
could be a network problem, but I didn't notice any other connection
problems with this machine.

I used stdeb to build a debian package for distribution across different
machines, you can see the result here:

https://www.maher.org.uk/~joseph/deb/

Running fsck.s3ql --force on this filesystem takes about 4 days; is there
a way to avoid doing this?

Thanks for all your work on s3ql!

Joseph


2025-09-13 11:11:18.400 1450804:Thread-3 (_upload_loop)
py.warnings._showwarnmsg:
/usr/lib/python3/dist-packages/s3ql/backends/gs.py:503: RuntimeWarning:
coroutine 'HTTPConnection.co_readall' was never awaited
exc = _parse_error_response(resp, self.conn.co_readall())

2025-09-13 11:11:18.426 1450804:Thread-3 (_upload_loop)
s3ql.mount.exchook: Unhandled exception in thread, terminating
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/s3ql/mount.py", line 93, in
run_with_except_hook
run_old(*args, **kw)
~~~~~~~^^^^^^^^^^^^^
File "/usr/lib/python3.13/threading.py", line 994, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/s3ql/block_cache.py", line 325, in
_upload_loop
self._do_upload(*tmp)
~~~~~~~~~~~~~~~^^^^^^
File "/usr/lib/python3/dist-packages/s3ql/block_cache.py", line 380, in
_do_upload
obj_size = backend.write_fh('s3ql_data_%d' % obj_id, el)
File "/usr/lib/python3/dist-packages/s3ql/backends/comprenc.py", line
279, in write_fh
return self.backend.write_fh(key, fh, meta_raw, len_=len_)
~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line 425, in
write_fh
return self._write_fh(key, fh, off, len_, metadata or {})
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/s3ql/backends/common.py", line 114,
in wrapped
return method(*a, **kw)
File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line 461, in
_write_fh
resp = self._do_request(
'POST',
...<3 lines>...
body=BodyFollowing(body_size),
)
File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line 606, in
_do_request
resp = self.conn.read_response()
File "/usr/lib/python3/dist-packages/s3ql/http.py", line 769, in
read_response
return eval_coroutine(self.co_read_response())
File "/usr/lib/python3/dist-packages/s3ql/http.py", line 1429, in
eval_coroutine
next(it)
~~~~^^^^
File "/usr/lib/python3/dist-packages/s3ql/http.py", line 787, in
co_read_response
raise StateError('Previous response not read completely')
s3ql.http.StateError: Previous response not read completely
2025-09-13 11:11:18.441 1450804:Thread-3 (_upload_loop)
__main__.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/s3ql/mount.py", line 93, in
run_with_except_hook
run_old(*args, **kw)
~~~~~~~^^^^^^^^^^^^^
File "/usr/lib/python3.13/threading.py", line 994, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/s3ql/block_cache.py", line 325, in
_upload_loop
self._do_upload(*tmp)
~~~~~~~~~~~~~~~^^^^^^
File "/usr/lib/python3/dist-packages/s3ql/block_cache.py", line 380, in
_do_upload
obj_size = backend.write_fh('s3ql_data_%d' % obj_id, el)
File "/usr/lib/python3/dist-packages/s3ql/backends/comprenc.py", line
279, in write_fh
return self.backend.write_fh(key, fh, meta_raw, len_=len_)
~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line 425, in
write_fh
return self._write_fh(key, fh, off, len_, metadata or {})
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/s3ql/backends/common.py", line 114,
in wrapped
return method(*a, **kw)
File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line 461, in
_write_fh
resp = self._do_request(
'POST',
...<3 lines>...
body=BodyFollowing(body_size),
)
File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line 606, in
_do_request
resp = self.conn.read_response()
File "/usr/lib/python3/dist-packages/s3ql/http.py", line 769, in
read_response
return eval_coroutine(self.co_read_response())
File "/usr/lib/python3/dist-packages/s3ql/http.py", line 1429, in
eval_coroutine
next(it)
~~~~^^^^
File "/usr/lib/python3/dist-packages/s3ql/http.py", line 787, in
co_read_response
raise StateError('Previous response not read completely')
s3ql.http.StateError: Previous response not read completely
2025-09-13 11:11:18.639 1450804:MainThread s3ql.mount.unmount: Unmounting
file system...
2025-09-13 11:11:34.981 1450804:MainThread __main__.excepthook: Uncaught
top-level exception:
+ Exception Group Traceback (most recent call last):
| File "/usr/bin/mount.s3ql", line 33, in <module>
| sys.exit(load_entry_point('s3ql==5.4.0', 'console_scripts',
'mount.s3ql')())
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
| File "/usr/lib/python3/dist-packages/s3ql/mount.py", line 170, in
main
| trio.run(main_async, options, stdout_log_handler)
| ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/lib/python3/dist-packages/trio/_core/_run.py", line 2423,
in run
| raise runner.main_task_outcome.error
| File "/usr/lib/python3/dist-packages/s3ql/mount.py", line 238, in
main_async
| async with trio.open_nursery() as nursery, AsyncExitStack() as cm:
| ~~~~~~~~~~~~~~~~~^^
| File "/usr/lib/python3/dist-packages/trio/_core/_run.py", line 1058,
in __aexit__
| raise combined_error_from_nursery
| ExceptionGroup: Exceptions from Trio nursery (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/usr/lib/python3/dist-packages/s3ql/mount.py", line 336, in
main_async
| raise exc_inst.with_traceback(exc_tb)
| File "/usr/lib/python3/dist-packages/s3ql/mount.py", line 93, in
run_with_except_hook
| run_old(*args, **kw)
| ~~~~~~~^^^^^^^^^^^^^
| File "/usr/lib/python3.13/threading.py", line 994, in run
| self._target(*self._args, **self._kwargs)
| ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/lib/python3/dist-packages/s3ql/block_cache.py", line
325, in _upload_loop
| self._do_upload(*tmp)
| ~~~~~~~~~~~~~~~^^^^^^
| File "/usr/lib/python3/dist-packages/s3ql/block_cache.py", line
380, in _do_upload
| obj_size = backend.write_fh('s3ql_data_%d' % obj_id, el)
| File "/usr/lib/python3/dist-packages/s3ql/backends/comprenc.py",
line 279, in write_fh
| return self.backend.write_fh(key, fh, meta_raw, len_=len_)
| ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line
425, in write_fh
| return self._write_fh(key, fh, off, len_, metadata or {})
| ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/lib/python3/dist-packages/s3ql/backends/common.py",
line 114, in wrapped
| return method(*a, **kw)
| File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line
461, in _write_fh
| resp = self._do_request(
| 'POST',
| ...<3 lines>...
| body=BodyFollowing(body_size),
| )
| File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line
606, in _do_request
| resp = self.conn.read_response()
| File "/usr/lib/python3/dist-packages/s3ql/http.py", line 769, in
read_response
| return eval_coroutine(self.co_read_response())
| File "/usr/lib/python3/dist-packages/s3ql/http.py", line 1429, in
eval_coroutine
| next(it)
| ~~~~^^^^
| File "/usr/lib/python3/dist-packages/s3ql/http.py", line 787, in
co_read_response
| raise StateError('Previous response not read completely')
| s3ql.http.StateError: Previous response not read completely
+------------------------------------

Nikolaus Rath

Sep 14, 2025, 1:37:12 PM
to noreply-spamdigest via s3ql
On Sat, 13 Sep 2025, at 18:32, 'Joseph Maher' via s3ql wrote:
> Just thought I'd report the following crash using 5.4 with the additional
> patch for setting metadata-backup-interval=0. The error message is long,
> so I have posted it at the end of this email. I don't know what it means; it
> could be a network problem, but I didn't notice any other connection
> problems with this machine.
[...]
> | File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line
> 461, in _write_fh
> | resp = self._do_request(
> | 'POST',
> | ...<3 lines>...
> | body=BodyFollowing(body_size),
> | )
> | File "/usr/lib/python3/dist-packages/s3ql/backends/gs.py", line
> 606, in _do_request
> | resp = self.conn.read_response()
> | File "/usr/lib/python3/dist-packages/s3ql/http.py", line 769, in
> read_response
> | return eval_coroutine(self.co_read_response())
> | File "/usr/lib/python3/dist-packages/s3ql/http.py", line 1429, in
> eval_coroutine
> | next(it)
> | ~~~~^^^^
> | File "/usr/lib/python3/dist-packages/s3ql/http.py", line 787, in
> co_read_response
> | raise StateError('Previous response not read completely')
> | s3ql.http.StateError: Previous response not read completely

Thanks for reporting this. This is a bug in S3QL. Unfortunately it will be very hard to track down, since the actual error happened *before* the exception was raised (the exception just detects that something must have gone wrong before), and because it's apparently rather rare (this is the first such report in many years).
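
In case it helps to understand the message: the check lives in s3ql/http.py, and conceptually it works something like the following (a deliberately simplified sketch, not the actual s3ql code).

class StateError(Exception):
    pass

class ToyConnection:
    """Simplified illustration of the state check behind the error message."""

    def __init__(self):
        self.pending_body = 0  # bytes of the previous response body not yet read

    def read_response(self):
        if self.pending_body:
            # Whatever left the body unread (an aborted upload, a swallowed error
            # response, a coroutine that was never awaited) happened *before* this
            # point; the exception only detects the stale state after the fact.
            raise StateError('Previous response not read completely')
        # ... parse the status line and headers of the next response ...

    def readall(self):
        # Draining the body returns the connection to a clean state.
        data = b'\0' * self.pending_body
        self.pending_body = 0
        return data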

It'd be great if you could file it at https://github.com/s3ql/s3ql/issues so we can keep track of it, but it's unlikely to be fixed until we can find a way to reproduce this.

> Running fsck.s3ql --force on this filesystem takes about 4 days, is there
> a way to avoid doing this?

Which steps take the longest? Depending on that, --fast may or may not help.


Best,
-Nikolaus

jos...@maher.org.uk

Sep 14, 2025, 9:36:28 PM
to s3ql

I'm currently mounting the file system with a fairly large cache size, so it tends to accumulate a large number of dirty blocks (about 1 million). I will try mounting with a much smaller cache size in future and see if that helps...

Joseph

Nikolaus Rath

Sep 15, 2025, 4:05:32 AM
to noreply-spamdigest via s3ql


On Mon, 15 Sep 2025, at 02:36, 'jos...@maher.org.uk' via s3ql wrote:
>>> Running fsck.s3ql --force on this filesystem takes about 4 days, is there
>>> a way to avoid doing this?
>>
>> Which steps take the longest? Depending on that, --fast may or may not help.
>
> I'm currently mounting the file system with a fairly large cache size, so it tends to accumulate a large number of dirty blocks (about 1 million). I will try mounting with a much smaller cache size in future and see if that helps...

That's an interesting datapoint, but doesn't answer the question. If you'd like advice on this, you need to tell us which fsck steps take the longest.

Best,
-Nikolaus

jos...@maher.org.uk

Sep 15, 2025, 10:53:47 AM
to s3ql
The current fsck seems to be taking about the same time as the previous one. The last one was using s3ql 5.3 with the --fast option, and the longest steps were:

Checking DB integrity,  approx 1 day

Uploading metadata, approx 3 days

I've appended an edited version of the fsck log at the end of this email, removing some hopefully unimportant info...

Here's the output from s3qlstat /mnt/s3ql from about a month ago, in case that's of any use:

Directory entries:    940256836
Inodes:               940256838
Data objects:         13035637
Total data size:      147 TB
After de-duplication: 2.61 TB (1.76% of total)
After compression:    1.23 TB (0.83% of total, 47.15% of de-duplicated)
Database size:        109 GiB (uncompressed)
Cache size:           89.8 MiB, 90 entries
Cache size (dirty):   0 bytes, 0 entries
Queued object removals: 0

Thanks!

Joseph

2025-09-06 11:42:12.567 1347769:MainThread s3ql.fsck.main: Starting fast fsck of gs://bucketname.../
2025-09-06 11:42:12.845 1347769:MainThread s3ql.fsck.main: Using cached metadata.
2025-09-06 11:42:12.852 1347769:MainThread s3ql.database.sqlite3_log: sqlite3: recovered 3 frames from WAL file /mnt/backup/c>
2025-09-06 11:42:12.860 1347769:MainThread s3ql.fsck.main: File system was not unmounted cleanly
2025-09-06 11:42:12.860 1347769:MainThread s3ql.fsck.main: Checking DB integrity...
2025-09-07 10:00:56.583 1347769:MainThread s3ql.fsck.check: Creating temporary extra indices...
2025-09-07 11:49:44.169 1347769:MainThread s3ql.fsck.check_lof: Checking lost+found...
2025-09-07 11:49:45.015 1347769:MainThread s3ql.fsck.check_cache: Checking for dirty cache objects...
2025-09-07 11:49:45.061 1347769:MainThread s3ql.fsck.check_names_refcount: Checking names (refcounts)...
2025-09-07 11:51:42.137 1347769:MainThread s3ql.fsck.check_contents_name: Checking contents (names)...
2025-09-07 11:57:56.590 1347769:MainThread s3ql.fsck.check_contents_inode: Checking contents (inodes)...
2025-09-07 12:19:18.962 1347769:MainThread s3ql.fsck.check_contents_parent_inode: Checking contents (parent inodes)...
2025-09-07 12:26:18.833 1347769:MainThread s3ql.fsck.check_objects_hash: Checking objects.hash...
2025-09-07 12:26:18.898 1347769:MainThread s3ql.fsck.check_objects_refcount: Checking objects (reference counts)...
2025-09-07 12:27:50.174 1347769:MainThread s3ql.fsck.check_objects_id: Checking objects (backend)...
2025-09-07 12:29:07.418 1347769:MainThread s3ql.fsck.check_objects_id: Processed 177475 objects so far
2025-09-07 13:38:40.142 1347769:MainThread s3ql.fsck.check_objects_id: Processed 13003855 objects so far
2025-09-07 13:42:02.701 1347769:MainThread s3ql.fsck.check_objects_phys_size: Checking objects (physical sizes)...
2025-09-07 13:42:03.605 1347769:MainThread s3ql.fsck.check_inode_blocks_obj_id: Checking block-object mapping...
2025-09-07 13:46:06.383 1347769:MainThread s3ql.fsck.check_inode_blocks_inode: Checking inode_blocks.inode...
2025-09-07 13:52:18.184 1347769:MainThread s3ql.fsck.check_inodes_refcount: Checking inodes (refcounts)...
2025-09-07 14:14:28.235 1347769:MainThread s3ql.fsck.check_inodes_size: Checking inodes (sizes)...
2025-09-07 19:48:56.131 1347769:MainThread s3ql.fsck.check_ext_attributes_name: Checking extended attributes (names)...
2025-09-07 19:48:56.161 1347769:MainThread s3ql.fsck.check_ext_attributes_inode: Checking extended attributes (inodes)...
2025-09-07 19:48:56.161 1347769:MainThread s3ql.fsck.check_symlinks_inode: Checking symlinks (inodes)...
2025-09-07 19:53:41.031 1347769:MainThread s3ql.fsck.check_loops: Checking directory reachability...
2025-09-07 21:29:46.610 1347769:MainThread s3ql.fsck.check_unix: Checking unix conventions...
2025-09-07 22:56:11.105 1347769:MainThread s3ql.fsck.check_foreign_keys: Checking referential integrity...
2025-09-07 23:22:00.714 1347769:MainThread s3ql.fsck.check: Dropping temporary indices...
2025-09-08 05:28:28.875 1347769:MainThread s3ql.database.upload_metadata: Uploading metadata...
2025-09-08 05:34:43.290 1347769:MainThread s3ql.database.upload_metadata: Uploaded 2459 out of ~1785367 dirty blocks (0%)
2025-09-11 05:45:36.418 1347769:MainThread s3ql.database.upload_metadata: Uploaded 1785367 out of ~1785367 dirty blocks (100%)
2025-09-11 05:45:36.518 1347769:MainThread s3ql.database.upload_metadata: Calculating metadata checksum...
2025-09-11 06:17:06.301 1347769:MainThread s3ql.fsck.main: Completed fsck of gs://bucketname.../


 

Nikolaus Rath

Sep 15, 2025, 12:49:16 PM
to noreply-spamdigest via s3ql

On Mon, 15 Sep 2025, at 15:53, 'jos...@maher.org.uk' via s3ql wrote:
[...]
> The current fsck seems to be taking about the same time as the previous one. The last one was using s3ql 5.3 with the --fast option, and the longest steps were:
>
> Checking DB integrity, approx 1 day

Nothing that can be done here, unfortunately. This is just SQLite doing its thing. The only thing that might help is a faster CPU or putting the DB on a faster disk.

> Uploading metadata, approx 3 days

That's only after an unclean unmount, right? It should be much faster if you run fsck.s3ql --force on a clean filesystem?

The problem is that fsck.s3ql needs to re-upload the entire database (109 GB in your case), because the information about which parts were modified was lost when mount.s3ql crashed.

Is the upload time (3 days for 109 GB, maybe 50 GB after compression) consistent with your bandwidth?

If bandwidth is the limiting factor, then in theory this could be sped up by calculating a checksum of each DB block, comparing it with the checksum of the block stored in remote storage, and only re-uploading if the checksum changed. That'd require someone to write the code, of course :-).
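
Very roughly, that idea might look like the following sketch (not s3ql code; the block size and the remote checksum map are assumptions for illustration):

import hashlib

BLOCKSIZE = 64 * 1024  # assumed metadata block size

def upload_changed_blocks(db_path, remote_checksums, upload_block):
    """Hash each fixed-size block of the local database and upload only the
    blocks whose checksum differs from what is already in remote storage.

    remote_checksums: dict mapping block number -> sha256 hex digest of the
    stored block (hypothetical); upload_block: callable(blockno, data)."""
    uploaded = 0
    with open(db_path, 'rb') as fh:
        blockno = 0
        while True:
            data = fh.read(BLOCKSIZE)
            if not data:
                break
            if hashlib.sha256(data).hexdigest() != remote_checksums.get(blockno):
                upload_block(blockno, data)
                uploaded += 1
            blockno += 1
    return uploaded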

Best,
-Nikolaus

jos...@maher.org.uk

Sep 15, 2025, 8:29:58 PM
to s3ql
Bandwidth does not seem to be the limiting factor; I'm getting roughly 100 KB/sec upload to Google storage, but I get 5 MB/sec or more copying data to/from other machines. In fact, nothing on the machine seems maxed out: top says CPU usage is less than 20% and memory usage less than 2%, and iotop does not show much disk usage, so I'm not sure where the bottleneck is...

Joseph

Nikolaus Rath

Sep 16, 2025, 3:53:22 AM
to noreply-spamdigest via s3ql


On Tue, 16 Sep 2025, at 01:29, 'jos...@maher.org.uk' via s3ql wrote:
[...]
> Bandwidth does not seem to be the limiting factor; I'm getting roughly 100 KB/sec upload to Google storage, but I get 5 MB/sec or more copying data to/from other machines. In fact, nothing on the machine seems maxed out: top says CPU usage is less than 20% and memory usage less than 2%, and iotop does not show much disk usage, so I'm not sure where the bottleneck is...

If you truly uploaded at 100 KB/sec, then it would take you about (50*1024^3) / (100*1024) / (60*60) = 145 hours to upload 50 GB. So either compression is a lot better than I'd expect, or you must be uploading more quickly than that.

What's your metadata block size (specified at mkfs.s3ql time)? I suspect it's relatively small, and since metadata is uploaded sequentially, you're limited by network latency rather than bandwidth. This will be addressed by https://github.com/s3ql/s3ql/issues/310, which I plan to work on if I ever find some free time...
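
As a back-of-envelope illustration of why latency rather than bandwidth would be the limit (assumed numbers, not measurements):

def effective_throughput(block_size, rtt, bandwidth, parallel=1):
    """block_size in bytes, rtt in seconds, bandwidth in bytes/sec."""
    per_block = rtt + block_size / bandwidth  # each block pays a full round trip
    return parallel * block_size / per_block

# 64 KiB blocks, an assumed 0.5 s per request, on a 5 MB/s link:
seq = effective_throughput(64 * 1024, 0.5, 5 * 1024 * 1024)
par = effective_throughput(64 * 1024, 0.5, 5 * 1024 * 1024, parallel=16)
print(f"sequential: {seq / 1024:.0f} KiB/s, 16-way parallel: {par / 1024:.0f} KiB/s")
# Sequential comes out around 125 KiB/s -- the same ballpark as the ~100 KB/s
# you observed -- while issuing requests in parallel would hide most of the latency.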

Best,
-Nikolaus


jos...@maher.org.uk

Sep 16, 2025, 12:36:53 PM
to s3ql
Not sure what's happening with bandwidth; maybe there are spikes of usage I'm missing...

I don't know what the metadata block size is, I'm afraid :( It looks like I made this filesystem in Sep 2021 using s3ql 3.7.3, and bash history just says mkfs.s3ql gs://bucketname/...

Joseph

Nikolaus Rath

Sep 16, 2025, 3:21:24 PM
to noreply-spamdigest via s3ql


On Tue, 16 Sep 2025, at 17:36, 'jos...@maher.org.uk' via s3ql wrote:
> I don't know what the metadata block size is, I'm afraid :( It looks like I made this filesystem in Sep 2021 using s3ql 3.7.3, and bash history just says mkfs.s3ql gs://bucketname/...

It's probably the default from "s3qladm upgrade", 64 kB.
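
For what it's worth, 64 kB is consistent with the numbers in your fsck log (my arithmetic, assuming that block size):

db_size = 109 * 1024**3       # ~109 GiB database, from the s3qlstat output above
block_size = 64 * 1024        # assumed default metadata block size
print(db_size // block_size)  # 1785856 -- close to the ~1785367 dirty blocks
                              # reported during "Uploading metadata" in the fsck log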

Best,
-Nikolaus


jos...@maher.org.uk

Sep 19, 2025, 8:38:56 AM
to s3ql
I think I got this one again - this time I ran mount.s3ql with the --debug option; logs are available here:


Joseph