S3QL crashes when uploading large files with inconclusive logs (yet again)

Ivan Shapovalov

Dec 27, 2020, 6:36:15 AM
to s3ql
Hey,

I'm hitting a problem very similar to the last report on this mailing
list. I'm using rsync to upload large files onto an S3QL filesystem
using the B2 backend, with no compression and no encryption. In the
middle of the upload, S3QL terminates with no conclusive error messages
in the log. The problem is reproducible every time, on the same very
large (>500GB) file.

After the crash, rsync either happily continues to copy the file into
nothing, or terminates with "Transport endpoint not connected". This
happens randomly and I could not work out what determines which outcome
I get.

Versions I use are:

-- 8< --
$ pacman -Qi s3ql-git | grep Version
Version : 3.5.1.15.g81868632-1

$ pacman -Qi python-pyfuse3 | grep Version
Version : 3.1.1-1

$ pacman -Qi python-trio | grep Version
Version : 0.17.0-3

$ pacman -Qi python3 | grep Version
Version : 3.9.1-1
-- 8< --

Full mount.s3ql invocation is:

-- 8< --
/usr/bin/mount.s3ql b2://<mybucket> /mnt/b2/files -o systemd,log=none,authfile=/etc/s3ql/authinfo2,cachedir=/var/tmp/s3ql,debug,allow-other,compress=none,cachesize=10485760,threads=8,keep-cache,backend-options=disable-versions
-- 8< --

Full s3ql log (captured stdout/stderr) is available here: 
https://intelfx.name/files/persistent/2020-12-27%20s3ql/s3ql.2.log.zst
https://intelfx.name/files/persistent/2020-12-27%20s3ql/s3ql.3.log.zst
(caution -- each log is ~150 MB unpacked, which is a separate concern
of mine but irrelevant to this problem.)

A manual excerpt from the s3ql log, relevant to the block that failed:
-- 8< --
mount.s3ql[117923]: 2020-12-27 13:02:03.135 117923 DEBUG MainThread s3ql.block_cache.upload_if_dirty: started with <Dirty CacheEntry, inode=11545, blockno=109>
mount.s3ql[117923]: 2020-12-27 13:02:03.135 117923 DEBUG MainThread s3ql.block_cache.upload_if_dirty: uploading <Dirty CacheEntry, inode=11545, blockno=109>..
mount.s3ql[117923]: 2020-12-27 13:02:03.980 117923 DEBUG MainThread s3ql.block_cache.upload_if_dirty: created new object 8516
mount.s3ql[117923]: 2020-12-27 13:02:03.980 117923 DEBUG MainThread s3ql.block_cache.upload_if_dirty: created new block 7638
mount.s3ql[117923]: 2020-12-27 13:02:03.980 117923 DEBUG MainThread s3ql.block_cache.upload_if_dirty: adding to upload queue
mount.s3ql[117923]: 2020-12-27 13:02:03.981 117923 DEBUG MainThread s3ql.block_cache.upload_if_dirty: no old block
mount.s3ql[117923]: 2020-12-27 13:02:03.981 117923 DEBUG Thread-5 s3ql.block_cache._upload_loop: got work
mount.s3ql[117923]: 2020-12-27 13:02:03.981 117923 DEBUG Thread-5 s3ql.backends.b2.b2_backend.open_write: started with s3ql_data_8516
mount.s3ql[117923]: 2020-12-27 13:02:03.981 117923 DEBUG Thread-5 s3ql.backends.b2.object_w.close: started with s3ql_data_8516
mount.s3ql[117923]: 2020-12-27 13:02:03.981 117923 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: started with POST /b2api/v2/b2_get_upload_url
mount.s3ql[117923]: 2020-12-27 13:02:03.982 117923 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: REQUEST: api003.backblazeb2.com POST /b2api/v2/b2_get_upload_url
mount.s3ql[117923]: 2020-12-27 13:02:03.982 117923 DEBUG Thread-5 s3ql.backends.common.wrapped: Average retry rate: 0.05 Hz
mount.s3ql[117923]: 2020-12-27 13:02:03.982 117923 DEBUG Thread-5 s3ql.backends.common.wrapped: Encountered ConnectionClosed (server closed connection), retrying B2Backend._request_upload_url_info (attempt 1)...
mount.s3ql[117923]: 2020-12-27 13:02:04.009 117923 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: started with POST /b2api/v2/b2_get_upload_url
mount.s3ql[117923]: 2020-12-27 13:02:04.009 117923 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: REQUEST: api003.backblazeb2.com POST /b2api/v2/b2_get_upload_url
mount.s3ql[117923]: 2020-12-27 13:02:04.205 117923 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 200 275
mount.s3ql[117923]: 2020-12-27 13:02:04.205 117923 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: started with POST /b2api/v2/b2_upload_file/e77141e88db425f876370514/c003_v0312000_t0001
mount.s3ql[117923]: 2020-12-27 13:02:04.206 117923 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: REQUEST: pod-031-2000-09.backblaze.com POST /b2api/v2/b2_upload_file/e77141e88db425f876370514/c003_v0312000_t0001
mount.s3ql[117923]: 2020-12-27 13:02:04.411 117923 DEBUG MainThread s3ql.block_cache.with_event_loop: upload of 8516 failed
mount.s3ql[117923]: NoneType: None
mount.s3ql[117923]: 2020-12-27 13:02:04.416 117923 DEBUG Thread-5 s3ql.mount.exchook: recording exception No pending requests
umount[135718]: umount: /mnt/b2/files: target is busy.
-- 8< --

Strangely, systemd tells me that mount.s3ql does not just exit
normally, but is terminated by SIGTERM:

-- 8< --
$ systemctl status s3...@mnt-b2-files.service
● s3...@mnt-b2-files.service - s3ql file system at /mnt/b2/files
Loaded: loaded (/etc/systemd/system/s3ql@.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Sun 2020-12-27 13:02:05 MSK; 1h 27min ago
Process: 117923 ExecStart=/usr/bin/mount.s3ql ${What} /mnt/b2/files -o systemd,log=none,authfile=/etc/s3ql/authinfo2,cachedir=/var/tmp/s3ql,${Options} (code=killed, signal=TERM)
Process: 135718 ExecStop=/usr/bin/umount /mnt/b2/files (code=exited, status=32)
Main PID: 117923 (code=killed, signal=TERM)
IP: 40.5M in, 8.5G out
CPU: 6min 11.868s
-- 8< --

I'd gladly go debug it myself, but I don't know where to start with
this async stuff. Any pointers? How do I read this log? What is this
strange "NoneType: None"?

Thanks,
--
Ivan Shapovalov / intelfx /

Nikolaus Rath

Dec 27, 2020, 6:44:40 AM
to s3...@googlegroups.com
On Dec 27 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
> Full mount.s3ql invocation is:
>
> -- 8< --
> /usr/bin/mount.s3ql b2://<mybucket> /mnt/b2/files -o systemd,log=none,authfile=/etc/s3ql/authinfo2,cachedir=/var/tmp/s3ql,debug,allow-other,compress=none,cachesize=10485760,threads=8,keep-cache,backend-options=disable-versions
> -- 8< --

Please also run this under gdb:

$ gdb python3
> run /usr/bin/mount.s3ql b2://<mybucket> /mnt/b2/files -o systemd,log=none,authfile=/etc/s3ql/authinfo2,cachedir=/var/tmp/s3ql,debug,allow-other,compress=none,cachesize=10485760,threads=8,keep-cache,backend-options=disable-versions


When the process terminates, run:

> thread apply all bt


>
> Full s3ql log (captured stdout/stderr) is available here: 
> https://intelfx.name/files/persistent/2020-12-27%20s3ql/s3ql.2.log.zst
> https://intelfx.name/files/persistent/2020-12-27%20s3ql/s3ql.3.log.zst
> (caution -- each log is ~150 MB unpacked, which is a separate concern
> of mine but irrelevant to this problem.)
>
> Manual excerpt from s3ql log relevant to the block that failed:
> -- 8< --
> mount.s3ql[117923]: 2020-12-27 13:02:03.135 117923 DEBUG MainThread
> s3ql.block_cache.upload_if_dirty: started with <Dirty CacheEntry,
> inode=11545, blockno=109>

Debug logging won't help with this problem; you can safely disable it.

> Strangely, systemd tells me that mount.s3ql does not just exit
> normally, but is terminated by SIGTERM:
>
> -- 8< --
> $ systemctl status s3...@mnt-b2-files.service
> ● s3...@mnt-b2-files.service - s3ql file system at /mnt/b2/files
> Loaded: loaded (/etc/systemd/system/s3ql@.service; enabled; vendor preset: disabled)
> Active: inactive (dead) since Sun 2020-12-27 13:02:05 MSK; 1h 27min ago
> Process: 117923 ExecStart=/usr/bin/mount.s3ql ${What} /mnt/b2/files -o systemd,log=none,authfile=/etc/s3ql/authinfo2,cachedir=/var/tmp/s3ql,${Options} (code=killed, signal=TERM)
> Process: 135718 ExecStop=/usr/bin/umount /mnt/b2/files (code=exited, status=32)
> Main PID: 117923 (code=killed, signal=TERM)
> IP: 40.5M in, 8.5G out
> CPU: 6min 11.868s
> -- 8< --

What does your kernel log say at this time (dmesg)?

Could it be that you're running out of memory, and the OOM killer is
killing mount.s3ql to free up memory?


The TERM signal does not make sense to me; this is a non-fatal signal
that should result in S3QL exiting gracefully.


Could you check what happens when you manually send SIGTERM to a running
mount.s3ql process? Does it terminate properly, with full logging until
the end?

> mount.s3ql[117923]: NoneType: None
>
> I'd gladly go debug it myself, but I don't know where to start with
> this async stuff. Any pointers? How do I read this log? What is this
> strange "NoneType: None"?

This doesn't make any sense to me either.

Normally I would suggest adding e.g. line numbers and file names to
https://github.com/s3ql/s3ql/blob/master/src/s3ql/logging.py#L86 to see
where this message is generated, but looking at it, it seems that it
should already include the function name. So whatever logs this is
bypassing the regular logging code path.


So, in summary:

- Run standalone under gdb (and not as a systemd service)
- Check kernel logs
- Check memory usage
- Try to send SIGTERM to a non-problematic mount



Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«

Ivan Shapovalov

Dec 27, 2020, 4:17:50 PM
to Nikolaus Rath, s3...@googlegroups.com
On 2020-12-27 at 11:44 +0000, Nikolaus Rath wrote:
> <snip>

Thank you for the reply, Nikolaus.

> What does your kernel log say at this time (dmesg)?
>
> Could it be that you're running out of memory, and the OOM killer is
> killing mount.s3ql to free up memory?

The kernel log is silent. It's definitely not an OOM kill (the OOM
killer would have sent SIGKILL, not SIGTERM).

> The TERM signal does not make sense to me, to this a non-fatal signal
> that should result in S3QL gracefully exiting.
>
>
> Could you try what happens when you manually send SIGTERM to a
> running
> mount.s3ql process? Does it terminate properly with full logging
> until
> the end?

Nope, it dies immediately. Which is sort of expected, because I see no
SIGTERM handler in s3ql.

And on that matter, I now see where the signal comes from. See below.

> So, in summary:
>
> - Run standalone under gdb (and not as a systemd service)
> - Check kernel logs
> - Check memory usage
> - Try to send SIGTERM to a non-problematic mount

OK, so I have not yet tried running s3ql under gdb, but I think I
(partially) know what happens.

Running mount.s3ql in a plain shell session:

-- 8< --
mount.s3ql b2://<mybucket> /mnt/b2/files -o fg,log=none,authfile=/etc/s3ql/authinfo2,cachedir=/var/tmp/s3ql,debug,allow-other,compress=none,cachesize=10485760,threads=8,keep-cache,backend-options=disable-versions
-- 8< --

Produces this log:

-- 8< --
2020-12-27 19:04:33.819 211867 DEBUG Thread-1 s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 400 97
2020-12-27 19:04:33.820 211867 DEBUG MainThread s3ql.block_cache.with_event_loop: upload of 8652 failed
NoneType: None
2020-12-27 19:04:33.827 211867 DEBUG Thread-1 s3ql.mount.exchook: recording exception 400 : bad_request - Checksum did not match data received
zsh: terminated mount.s3ql b2://<mybucket> /mnt/b2/files -o
-- 8< --

Leaving out the question of why journald eats the last line, the
situation is pretty clear. The backend (B2Backend._do_request) raises
an exception (B2Error) which is not considered a "temporary failure".
It bubbles all the way up through ObjectW.close(),
AbstractBackend.perform_write(), BlockCache._do_upload() and
BlockCache._upload_loop(), and is never caught.

Finally, exchook() from mount.py:setup_exchook() gets called and sends
SIGTERM to the mount process (mount.py:687).
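
In sketch form, the pattern I mean is roughly this (illustrative only,
not the actual mount.py code):

-- 8< --
import os
import signal
import sys

def setup_exchook():
    # Remember the first unhandled exception from any thread, then ask
    # the whole process to shut down by signalling itself. This is the
    # self-inflicted SIGTERM that systemd later reports as
    # code=killed, signal=TERM.
    exc_info = []

    def exchook(exc_type, exc_val, exc_tb):
        if not exc_info:
            exc_info[:] = (exc_type, exc_val, exc_tb)
            os.kill(os.getpid(), signal.SIGTERM)

    sys.excepthook = exchook
    # Exceptions raised inside threading.Thread.run() do not go through
    # sys.excepthook by default, which is presumably why mount.py also
    # wraps Thread.run() (run_with_except_hook).
    return exc_info
-- 8< --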

Does that sound plausible?

I have just patched up the error handling in the B2 backend to treat the
checksum mismatch as a transient failure (testing now). But I take it
the whole SIGTERM thing is also unexpected?

Ivan Shapovalov

Dec 27, 2020, 5:32:44 PM
to Nikolaus Rath, s3...@googlegroups.com
On 2020-12-28 at 00:17 +0300, Ivan Shapovalov wrote:

> <...>
> I have just patched up error handling in the B2 backend to consider
> the
> checksum mismatch a transient failure (testing now).
> <...>

OK, now it gets interesting.

-- 8< --
2020-12-28 00:39:58.695 246057 DEBUG MainThread s3ql.block_cache.upload_if_dirty: created new object 8704
2020-12-28 00:39:58.695 246057 DEBUG MainThread s3ql.block_cache.upload_if_dirty: created new block 7803
2020-12-28 00:39:58.695 246057 DEBUG MainThread s3ql.block_cache.upload_if_dirty: adding to upload queue
2020-12-28 00:39:58.696 246057 DEBUG MainThread s3ql.block_cache.upload_if_dirty: no old block
2020-12-28 00:39:58.696 246057 DEBUG Thread-5 s3ql.block_cache._upload_loop: got work
2020-12-28 00:39:58.696 246057 DEBUG Thread-5 s3ql.backends.b2.b2_backend.open_write: started with s3ql_data_8704
2020-12-28 00:39:58.697 246057 DEBUG Thread-5 s3ql.backends.b2.object_w.close: started with s3ql_data_8704
2020-12-28 00:39:58.697 246057 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: started with POST /b2api/v2/b2_get_upload_url
2020-12-28 00:39:58.697 246057 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: REQUEST: api003.backblazeb2.com POST /b2api/v2/b2_get_upload_url
2020-12-28 00:39:58.697 246057 DEBUG Thread-5 s3ql.backends.common.wrapped: Average retry rate: 0.23 Hz
2020-12-28 00:39:58.697 246057 DEBUG Thread-5 s3ql.backends.common.wrapped: Encountered ConnectionClosed (server closed connection), retrying B2Backend._request_upload_url_info (attempt 1)...
2020-12-28 00:39:58.724 246057 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: started with POST /b2api/v2/b2_get_upload_url
2020-12-28 00:39:58.724 246057 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: REQUEST: api003.backblazeb2.com POST /b2api/v2/b2_get_upload_url
2020-12-28 00:39:58.942 246057 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 200 275
2020-12-28 00:39:58.942 246057 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: started with POST /b2api/v2/b2_upload_file/e77141e88db425f876370514/c003_v0312004_t0021
2020-12-28 00:39:58.942 246057 DEBUG Thread-5 s3ql.backends.b2.b2_backend._do_request: REQUEST: pod-031-2004-17.backblaze.com POST /b2api/v2/b2_upload_file/e77141e88db425f876370514/c003_v0312004_t0021
2020-12-28 00:39:59.131 246057 DEBUG MainThread s3ql.block_cache.with_event_loop: upload of 8704 failed
NoneType: None
2020-12-28 00:39:59.136 246057 DEBUG Thread-5 s3ql.mount.exchook: recording exception No pending requests
-- 8< --

This exception comes from dugong.
Any ideas?

Also, I wonder if that "NoneType: None" is coming from `exc_info=True`
in the log.debug() statement in with_event_loop() inside
BlockCache._do_upload() at block_cache.py:343. I guess
trio.from_thread.run() does not magically carry sys.exc_info() over
between threads?
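
If that guess is right, the "NoneType: None" should be reproducible
outside of s3ql with something like this (standalone sketch, nothing
s3ql-specific):

-- 8< --
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger('demo')

# No exception is active in this thread, so logging falls back to
# sys.exc_info() == (None, None, None), which the traceback machinery
# renders as the mysterious "NoneType: None" line.
log.debug('upload of 8704 failed', exc_info=True)
-- 8< --

If I'm right, that should print the message followed by exactly that
"NoneType: None" line.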

Nikolaus Rath

Dec 28, 2020, 8:40:19 AM
to s3...@googlegroups.com
Hi Ivan,

Please do not Cc me, I am reading the list.


On Dec 28 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
> Finally, exchook() from mount.py:setup_exchook() gets called and sends
> SIGTERM to the mount process (mount.py:687).
>
> Does that sound plausible?

Ah, I completely forgot about this, good find!

When S3QL was using llfuse rather than pyfuse3, sending SIGTERM to
itself was the way to signal to libfuse to exit the main event loop. The
signal handler is installed by llfuse, not S3QL, so that's not obvious
from the code.

I suspect that this part needed a change when switching from llfuse to
pyfuse3 but was forgotten. I will try to take a look at it.

Nikolaus Rath

Dec 28, 2020, 8:41:49 AM
to s3...@googlegroups.com
On Dec 28 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
> 2020-12-27 19:04:33.819 211867 DEBUG Thread-1 s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 400 97
> 2020-12-27 19:04:33.820 211867 DEBUG MainThread s3ql.block_cache.with_event_loop: upload of 8652 failed
> NoneType: None
> 2020-12-27 19:04:33.827 211867 DEBUG Thread-1 s3ql.mount.exchook: recording exception 400
> : bad_request - Checksum did not match data received
> zsh: terminated mount.s3ql b2://<mybucket> /mnt/b2/files -o
> -- 8< --
>
> Leaving out the question of why journald eats the last line, the
> situation is pretty clear. The backend (B2Backend._do_request) raises
> an exception (B2Error) which is not considered a "temporary failure".
>
> I have just patched up error handling in the B2 backend to consider the
> checksum mismatch a transient failure (testing now).

Is B2 not using SSL for its data connection? That should make sure that
there are no checksum errors....

Nikolaus Rath

Dec 28, 2020, 10:48:52 AM
to s3...@googlegroups.com
On Dec 28 2020, Nikolaus Rath <Niko...@rath.org> wrote:
> Hi Ivan,
>
> Please do not Cc me, I am reading the list.
>
>
> On Dec 28 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
>> Finally, exchook() from mount.py:setup_exchook() gets called and sends
>> SIGTERM to the mount process (mount.py:687).
>>
>> Does that sound plausible?
>
> Ah, I completely forgot about this, good find!
>
> When S3QL was using llfuse rather than pyfuse3, sending SIGTERM to
> itself was the way to signal to libfuse to exit the main event loop. The
> signal handler is installed by llfuse, not S3QL, so that's not obvious
> from the code.
>
> I suspect that this part needed a change when switching from llfuse to
> pyfuse3 but was forgotten. I will try to take a look at it.

Ok, turns out we needed some extra building blocks in pyfuse3 for
this. I have just committed
https://github.com/libfuse/pyfuse3/commit/3b9c7dfd7f68b1dbac17b325deeb7d66a66a2b05
which adds a new pyfuse3.terminate() function that S3QL should use
instead of sending SIGTERM to itself.
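
In sketch form, the change in S3QL's exception hook would be something
like this (illustrative only; if the hook runs on a worker thread, the
call may need to be marshalled onto the trio event loop first):

-- 8< --
import os
import signal

import pyfuse3

def request_shutdown(use_terminate: bool = True):
    '''Sketch of the two shutdown styles; not the actual S3QL code.'''
    if use_terminate:
        # New pyfuse3 API from the commit above: ask the main loop to
        # exit cleanly.
        pyfuse3.terminate()
    else:
        # Old llfuse-era pattern: signal ourselves and rely on the
        # library's SIGTERM handler to stop the event loop.
        os.kill(os.getpid(), signal.SIGTERM)
-- 8< --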

As a first-order solution this replacement should make things better.
However, we should keep track of the fact that we terminated due to an
unhandled exception, and re-raise it after the main loop has terminated.
I filed https://github.com/s3ql/s3ql/issues/227 to track this.

Paul Tirk

Dec 28, 2020, 11:31:39 AM
to s3...@googlegroups.com


On Mon, Dec 28, 2020 at 1:41 PM, Nikolaus Rath
<Niko...@rath.org> wrote:
> On Dec 28 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
>> 2020-12-27 19:04:33.819 211867 DEBUG Thread-1
>> s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 400 97
>> 2020-12-27 19:04:33.820 211867 DEBUG MainThread
>> s3ql.block_cache.with_event_loop: upload of 8652 failed
>> NoneType: None
>> 2020-12-27 19:04:33.827 211867 DEBUG Thread-1 s3ql.mount.exchook:
>> recording exception 400
>> : bad_request - Checksum did not match data received
>> zsh: terminated mount.s3ql b2://<mybucket> /mnt/b2/files -o
>> -- 8< --
>>
>> Leaving out the question of why journald eats the last line, the
>> situation is pretty clear. The backend (B2Backend._do_request)
>> raises
>> an exception (B2Error) which is not considered a "temporary
>> failure".
>>
>> I have just patched up error handling in the B2 backend to consider
>> the
>> checksum mismatch a transient failure (testing now).
>
> Is B2 not using SSL for its data connection? That should make sure
> that
> there are no checksum errors....
>

I think this error refers to B2 not receiving the correct data. When an
object is uploaded, the checksum is provided through a header; B2 then
checks whether the received data has the same checksum as the one
provided.
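
Roughly speaking, the headers involved look like this (illustrative
sketch, not the backend's actual code; the helper name is made up):

-- 8< --
import hashlib

def b2_upload_headers(payload: bytes, file_name: str, auth_token: str) -> dict:
    '''Headers for a b2_upload_file call (sketch).'''
    return {
        'Authorization': auth_token,
        'X-Bz-File-Name': file_name,
        'Content-Type': 'application/octet-stream',
        'Content-Length': str(len(payload)),
        # B2 recomputes the SHA1 of the body it actually receives and
        # rejects the upload with "400 bad_request - Checksum did not
        # match data received" if it differs from this header.
        'X-Bz-Content-Sha1': hashlib.sha1(payload).hexdigest(),
    }
-- 8< --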

Since this is a case that should not happen (and it never happened to
me), it was not clear how to handle it. But it would probably be best to
retry it for some time as well, in case it really is just temporary.

Regards, Paul


Nikolaus Rath

Dec 28, 2020, 11:42:55 AM
to s3...@googlegroups.com
I think if this is not a result of the network connection flipping some
bits, then it is either a bug in S3QL (perhaps the data is being
modified while uploaded), or a faulty machine on the local side, or a
faulty machine on the receiving side.

I think in all these cases retrying is likely to succeed and will thus
result in the problem not being surfaced. I would, personally, prefer to
know about such issues so that I can replace RAM/disks/CPU, switch
storage providers, or find the S3QL bug *before* something more serious
happens and results in unfixable corruption.

Ivan Shapovalov

Dec 28, 2020, 12:01:06 PM
to s3...@googlegroups.com
On 2020-12-28 at 13:41 +0000, Nikolaus Rath wrote:
> On Dec 28 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
> > 2020-12-27 19:04:33.819 211867 DEBUG    Thread-1
> > s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 400  97
> > 2020-12-27 19:04:33.820 211867 DEBUG    MainThread
> > s3ql.block_cache.with_event_loop: upload of 8652 failed
> > NoneType: None
> > 2020-12-27 19:04:33.827 211867 DEBUG Thread-1 s3ql.mount.exchook:
> > recording exception 400
> > : bad_request - Checksum did not match data received
> > zsh: terminated  mount.s3ql b2://<mybucket> /mnt/b2/files -o
> > -- 8< --
> >
> > Leaving out the question of why journald eats the last line, the
> > situation is pretty clear. The backend (B2Backend._do_request)
> > raises
> > an exception (B2Error) which is not considered a "temporary
> > failure".
> >
> > I have just patched up error handling in the B2 backend to consider
> > the
> > checksum mismatch a transient failure (testing now).
>
> Is B2 not using SSL for its data connection? That should make sure
> that
> there are no checksum errors....

Indeed it does. I have added some proper exception logging and found
the actual problem, which is — unsurprisingly — a combination of user
error, unclear system requirements and broken logging.

The B2 backend creates a temporary file for each object that is being
uploaded. My s3ql instance has object size = 1 GiB, and with threads=8,
that means up to 8 GiB worth of temporary files at once. The thing is,
the temporary files are created in /tmp, which is a tmpfs with a size
limit.
-- 8< --
2020-12-28 16:39:12.924 340652 ERROR Thread-3 s3ql.mount.exchook: Unhandled exception in thread, terminating
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 279, in perform_write
return fn(fh)
File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 334, in do_write
fh.write(buf)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 36, in write
self.fh.write(buf)
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/s3ql/mount.py", line 58, in run_with_except_hook
run_old(*args, **kw)
File "/usr/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 319, in _upload_loop
self._do_upload(*tmp)
File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 376, in _do_upload
obj_size = backend.perform_write(do_write, 's3ql_data_%d'
File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 108, in wrapped
return method(*a, **kw)
File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 279, in perform_write
return fn(fh)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 79, in __exit__
self.close()
File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 108, in wrapped
return method(*a, **kw)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 64, in close
response = self.backend._do_upload_request(self.headers, self.fh)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/b2_backend.py", line 291, in _do_upload_request
response, response_body = self._do_request(upload_url_info['connection'], 'POST', upload_url_info['path'], headers, body)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/b2_backend.py", line 235, in _do_request
response = connection.read_response()
File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 790, in read_response
return eval_coroutine(self.co_read_response(), self.timeout)
File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 1531, in eval_coroutine
if not next(crt).poll(timeout=timeout):
File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 803, in co_read_response
raise StateError('No pending requests')
dugong.StateError: No pending requests
zsh: terminated mount.s3ql b2://<mybucket> /mnt/b2/files -o
-- 8< --

I guess this was causing the 400 as well, because we were sending the
temporary file including the last partial write, but the locally
computed hash did not include it.
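
As for the tmpfs part, my assumption is that the upload buffer is
created via Python's tempfile module, which picks its directory from
$TMPDIR before falling back to /tmp, so pointing TMPDIR at disk-backed
storage should be enough of a workaround:

-- 8< --
import os
import tempfile

# Sketch of the workaround I have in mind (assumes the backend goes
# through the tempfile module for its upload buffers).
os.environ['TMPDIR'] = '/var/tmp'
tempfile.tempdir = None         # drop the cached value so it is re-evaluated
print(tempfile.gettempdir())    # -> /var/tmp
-- 8< --

With object size = 1 GiB and threads=8, whatever directory that points
at needs roughly 8 GiB free.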

There is still the problem of getting a dugong.StateError instead of a
400. I don't know why it started giving me this, even after rolling back
all the tentative patches. Though I'm sure it will turn out to be
something trivial.

Nikolaus Rath

Dec 29, 2020, 4:39:46 AM
to s3...@googlegroups.com
Great, thanks for your help!

Could you tell me what exactly you changed to make the exception
information appear?

I consider this a bug in the B2 backend (and other backends may have the
same problem). If the backend raises an exception from write(), this
should not result in a second exception from close(). Either write()
should update the checksum to reflect the partial data that was written
(thus eliminating the checksum error on upload), or perhaps it should
set a flag that the object should not be uploaded at all on close().

https://github.com/s3ql/s3ql/issues/228
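
The flag-based option could look roughly like this (sketch only, not a
patch against the real ObjectW):

-- 8< --
class ObjectW:
    '''Illustration of the "mark the object as broken on write failure"
    idea.'''

    def __init__(self, fh):
        self.fh = fh
        self.write_failed = False

    def write(self, buf):
        try:
            self.fh.write(buf)
        except OSError:
            # The buffered data no longer matches the checksum we would
            # send, so close() must not attempt the upload.
            self.write_failed = True
            raise

    def close(self):
        if self.write_failed:
            self.fh.close()
            return
        # ... normal path: attach the checksum header and upload self.fh ...
        self.fh.close()
-- 8< --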


> File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 108, in wrapped
> return method(*a, **kw)
> File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 64, in close
> response = self.backend._do_upload_request(self.headers, self.fh)
> File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/b2_backend.py", line 291, in _do_upload_request
> response, response_body = self._do_request(upload_url_info['connection'], 'POST', upload_url_info['path'], headers, body)
> File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/b2_backend.py", line 235, in _do_request
> response = connection.read_response()
> File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 790, in read_response
> return eval_coroutine(self.co_read_response(), self.timeout)
> File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 1531, in eval_coroutine
> if not next(crt).poll(timeout=timeout):
> File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 803, in co_read_response
> raise StateError('No pending requests')
> dugong.StateError: No pending requests
> zsh: terminated mount.s3ql b2://<mybucket> /mnt/b2/files -o
> -- 8< --
>
> I guess this was causing the 400 as well, because we were sending the
> temporary file including the last partial write, but the locally
> computed hash did not include it.

Exactly.

> There is still a problem of getting the dugong.StateError instead of a
> 400. I don't know why it started giving me this, even after rolling
> back all the tentative patches. Though I'm sure it will be something
> trivial.

My guess is that dugong is expecting more data in the request (thus
the request is not considered complete and there is no response that
could be read). However, looking at the code, I do not immediately see
why this would be the case.

Ivan Shapovalov

Dec 29, 2020, 4:59:26 AM
to s3...@googlegroups.com
On 2020-12-29 at 09:39 +0000, Nikolaus Rath wrote:
> On Dec 28 2020, Ivan Shapovalov <int...@intelfx.name> wrote:
> > On 2020-12-28 at 13:41 +0000, Nikolaus Rath wrote:
> > > <snip>
> >
> > Indeed it does. I have added some proper exception logging and
> > found
> > the actual problem, which is — unsurprisingly — combination of user
> > error, unclear system requirements and broken logging.
>
> Great, thanks for your help!
>
> Could you tell what exactly you changed to make the exception
> information appear?

Sure, I'll send a PR fixing exception logging shortly.

>
> <snip>
>
> I consider this a bug in the B2 backend (and other backends may have
> the
> same problem). If the backend returns an exception from write(), this
> should not result in a second exception from close(). Either write()
> should update the checksum to reflect the partial data that was
> written
> (thus eliminating the checksum error on upload), or perhaps it should
> set a flag that this object should not be uploaded at all on close.
>
> https://github.com/s3ql/s3ql/issues/228
>

As far as I remember from my brief digging, all backends are
implemented similarly in this regard. Any exception in ObjectW.write()
will bubble up through do_write() in BlockCache._do_upload() to
AbstractBackend.perform_write(), which holds the ObjectW as a context
manager and exits it, thus calling ObjectW.close().
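
In sketch form, the control flow is essentially this (simplified,
stand-in names, not the real backend API):

-- 8< --
class SketchObjectW:
    '''Stand-in for a backend ObjectW, just to show the control flow.'''

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()        # runs even if write() below raised
        return False        # do not swallow the exception

    def write(self, buf):
        raise OSError(28, 'No space left on device')   # simulate ENOSPC

    def close(self):
        print('close() called despite the write() failure')

def perform_write(fn, key):
    '''Simplified perform_write(): hold the object writer as a context
    manager and hand it to the callback.'''
    with SketchObjectW() as fh:
        return fn(fh)

# perform_write(lambda fh: fh.write(b'data'), 's3ql_data_1')
# -> prints the close() message, then the OSError from write() propagates
-- 8< --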

> > <snip>
> >
> > I guess this was causing the 400 as well, because we were sending
> > the
> > temporary file including the last partial write, but the locally
> > computed hash did not include it.
>
> Exactly.
>
> > There is still a problem of getting the dugong.StateError instead
> > of a
> > 400. I don't know why it started giving me this, even after rolling
> > back all the tentative patches. Though I'm sure it will be
> > something
> > trivial.
>
> My guess is that dugong is expecting more data in the request (thus
> the request is not considered complete and there is no response that
> could be read). However, looking at the code, I do not immediately
> see
> why this would be the case.

Yeah, me neither.

> Best,
> -Nikolaus

Thanks,