On 2020-12-28 at 13:41 +0000, Nikolaus Rath wrote:
> On Dec 28 2020, Ivan Shapovalov <
int...@intelfx.name> wrote:
> > 2020-12-27 19:04:33.819 211867 DEBUG Thread-1
> > s3ql.backends.b2.b2_backend._do_request: RESPONSE: POST 400 97
> > 2020-12-27 19:04:33.820 211867 DEBUG MainThread
> > s3ql.block_cache.with_event_loop: upload of 8652 failed
> > NoneType: None
> > 2020-12-27 19:04:33.827 211867 DEBUG Thread-1 s3ql.mount.exchook:
> > recording exception 400
> > : bad_request - Checksum did not match data received
> > zsh: terminated mount.s3ql b2://<mybucket> /mnt/b2/files -o
> > -- 8< --
> >
> > Leaving out the question of why journald eats the last line, the
> > situation is pretty clear. The backend (B2Backend._do_request)
> > raises
> > an exception (B2Error) which is not considered a "temporary
> > failure".
> >
> > I have just patched up error handling in the B2 backend to consider
> > the
> > checksum mismatch a transient failure (testing now).
>
> Is B2 not using SSL for its data connection? That should make sure
> that
> there are no checksum errors....
Indeed it does. I have added some proper exception logging and found
the actual problem, which is — unsurprisingly — combination of user
error, unclear system requirements and broken logging.
The B2 backend creates a temporary file for each object that is being
uploaded. My s3ql instance has object size = 1 GiB, and with threads=8,
that means at most 8 GiB worth of temporary files at once. Thing is,
temporary files are created in /tmp, which is a tmpfs and has a size
limit.
-- 8< --
2020-12-28 16:39:12.924 340652 ERROR Thread-3 s3ql.mount.exchook: Unhandled exception in thread, terminating
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 279, in perform_write
return fn(fh)
File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 334, in do_write
fh.write(buf)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 36, in write
self.fh.write(buf)
OSError: [Errno 28] No space left on device
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/s3ql/mount.py", line 58, in run_with_except_hook
run_old(*args, **kw)
File "/usr/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 319, in _upload_loop
self._do_upload(*tmp)
File "/usr/lib/python3.9/site-packages/s3ql/block_cache.py", line 376, in _do_upload
obj_size = backend.perform_write(do_write, 's3ql_data_%d'
File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 108, in wrapped
return method(*a, **kw)
File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 279, in perform_write
return fn(fh)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 79, in __exit__
self.close()
File "/usr/lib/python3.9/site-packages/s3ql/backends/common.py", line 108, in wrapped
return method(*a, **kw)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/object_w.py", line 64, in close
response = self.backend._do_upload_request(self.headers, self.fh)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/b2_backend.py", line 291, in _do_upload_request
response, response_body = self._do_request(upload_url_info['connection'], 'POST', upload_url_info['path'], headers, body)
File "/usr/lib/python3.9/site-packages/s3ql/backends/b2/b2_backend.py", line 235, in _do_request
response = connection.read_response()
File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 790, in read_response
return eval_coroutine(self.co_read_response(), self.timeout)
File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 1531, in eval_coroutine
if not next(crt).poll(timeout=timeout):
File "/usr/lib/python3.9/site-packages/dugong/__init__.py", line 803, in co_read_response
raise StateError('No pending requests')
dugong.StateError: No pending requests
zsh: terminated mount.s3ql b2://<mybucket> /mnt/b2/files -o
-- 8< --
I guess this was causing the 400 as well, because we were sending the
temporary file including the last partial write, but the locally
computed hash did not include it.
There is still a problem of getting the dugong.StateError instead of a
400. I don't know why it started giving me this, even after rolling
back all the tentative patches. Though I'm sure it will be something
trivial.