Serious issues with S3QL and Amazon S3 storage backend - "Transport Endpoint is not Connected"


Brandon Orwell

Jun 17, 2016, 12:55:05 PM
to s3...@googlegroups.com
I've been using S3QL for a few days now, and whenever I copy over large amounts of data, the mount point seems to 'lock up' for a period of time (as does anything else trying to access it), and then I start getting 'transport endpoint not connected' errors. I umount the mount point, run fsck on it, and then continue archiving until the problem happens again. I have been using cp --update --archive, but I'm not quite sure this ensures that I end up with a complete replica of the source filesystem at the destination - and verifying by MD5 takes forever, since it has to transfer all the data back (not to mention, it's costly!). So, as you can imagine, this transport endpoint disconnection is causing many headaches. Right now, as I type this message, fsck is running and 'committing blocks' (correcting errors?) to the S3 backend.
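
As a quick sanity check, I've been thinking of comparing the two trees by metadata only (relative paths and sizes), which shouldn't need to pull any data back, since S3QL keeps all metadata in a local database. Something along these lines - a rough sketch in Python, with placeholder paths instead of my real ones:

    import os

    def tree_index(root):
        # Map relative path -> size for every regular file under root.
        index = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                if os.path.isfile(full):
                    index[os.path.relpath(full, root)] = os.path.getsize(full)
        return index

    src = tree_index('/data/archive')      # placeholder: copy source
    dst = tree_index('/mnt/s3-df/Backup')  # placeholder: S3QL mountpoint

    missing = sorted(set(src) - set(dst))
    differ = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    print('missing from destination: {}'.format(len(missing)))
    print('size mismatches: {}'.format(len(differ)))
    for path in (missing + differ)[:20]:   # show a sample
        print('  ' + path)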

Does anyone know what would cause these problems? The machine is connected via a dedicated 100 Mbit pipe and has low latency to the S3 servers. The problem only seems to occur when I'm copying large amounts of data. I have tried disabling threads, thinking perhaps the asynchronous writes were at fault, but this did not help. The backend is a regular old S3 bucket in the Oregon region.

The errors from 'cp' look like:

... successful copies ...
cp: cannot create directory ‘/mnt/s3-df/Backup/Movies/Old’: Transport endpoint is not connected
... etc etc ...

This issue is essentially making S3QL unusable, as it's so unstable that I cannot confidently mount a filesystem and leave it be. Just now, prior to the currently running fsck, I had mounted the S3QL filesystem a mere 45 seconds earlier and started a 'cp' of a 17 GB directory onto it when it locked up and I began receiving the Transport Endpoint Not Connected errors.

Any and all help and ideas are appreciated!

Thanks

Nikolaus Rath

Jun 17, 2016, 5:10:55 PM
to s3...@googlegroups.com
On Jun 17 2016, Brandon Orwell <x...@codeslum.org> wrote:
> I've been using S3QL for a few days now, and whenever I am copying over
> large amounts of data the mount point seems to 'lock up' for a period of
> time (as well as anything else trying to access it), and then I start
> getting 'transport endpoint not connected' errors. I "umount" the mount
> point, run fsck on it. and then continue archiving until the problem
> happens again.

I've heard this kind of story before, but it still amazes me. What train
of thought led you to this procedure, instead of reporting the problem?

> Does anyone know what would cause these problems?

Where did you look for the answer?
https://bitbucket.org/nikratio/s3ql/wiki/FAQ#!what-does-the-transport-endpoint-not-connected-error-mean
says:

,----
| What does the "Transport endpoint not connected" error mean?
|
| It means that the file system has crashed. Please check mount.log for a
| more useful error message and report a bug if appropriate. If you can't
| find any errors in mount.log, the mount process may have
| "segfaulted". To confirm this, look for a corresponding message in the
| dmesg output. If the mount process segfaulted, please try to obtain a C
| backtrace (see Providing Debugging Info) of the crash and file a bug
| report.
|
| To make the mountpoint available again (i.e., unmount the crashed file system), use the fusermount -u command.
|
| Before reporting a bug, please make sure that you're not just using the
| most recent S3QL version, but also the most-recent version of the most
| important dependencies (python-llfuse, python-apsw, python-lzma).
`----


Best,
-Nikolaus

--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«

Brandon

Jun 19, 2016, 7:47:35 AM
to s3ql
Hello,

Per the error - I realized that I was using the older version of S3QL distributed with Ubuntu, installed via its aptitude package manager. I have installed and upgraded to the latest release (2.18) and will be using it to see if I have the same issues. I had just removed the directory structure and re-copied it, with threading enabled per the default, and it successfully copied everything (for the first time!) without giving me the transport error. So, we'll see what happens.

Also, another question - what is the benefit of changing the maximum object size option during mkfs from the default 10MB? In one case my files average considerably more than 10MB - would I see some benefit by matching this option to the average file size? Does this option exist to keep the number of objects stored on the fs lower, so as to increase read speed for the filesystem or something? For instance, if a particular S3QL filesystem is going to store pretty much nothing but large files, in excess of 1GB each on average, should I set that size to 100MB to keep the number of data objects low, in order to get a performance increase of some kind?
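
To make the question concrete, here is the back-of-the-envelope I have in mind - assuming (and I may be wrong about this) that each file is split into objects of at most the configured size:

    import math

    MiB, GiB = 1024 ** 2, 1024 ** 3
    file_size = 1 * GiB  # the average file size from my example above

    for max_obj in (10 * MiB, 30 * MiB, 100 * MiB):
        objs = math.ceil(file_size / max_obj)
        print('{:>3} MiB max object size -> {:>3} objects per 1 GiB file'
              .format(max_obj // MiB, objs))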


Thank you!

Brandon

Brandon

Jun 19, 2016, 7:50:54 AM
to s3ql
Actually, ignore the max object size question - I read your response to my other question about billing and see the point about the (file size)/(object size) GET/PUT ratio. I will adjust according to the average size of the files I store, to keep object counts lower.

Thanks - and this is very good software you've created; it has basically answered my needs for encrypted cloud backup and real-time file storage/access.

Brandon

Brandon

Jun 19, 2016, 6:32:11 PM
to s3ql
Got a crash again - this time, after upgrading to 2.18, it was stable for quite a while. I had three simultaneous rsyncs running, backing up in excess of 5TB of data across two buckets, and it got to around 400GB or so before it crashed this time. I've attached the relevant portion of mount.log (the Python exception at the time of the crash) below. I unmounted, ran fsck, remounted, and everything is running smoothly again for the time being. Take a look at the crash log and let me know what you think.

Thanks
Brandon

-- SNIP SNIP --


2016-06-19 10:33:15.849 15572:Thread-5 s3ql.backends.common.wrapped: Encountered ConnectionClosed (found closed when trying to write), retrying ObjectW.close (attempt 3)...
2016-06-19 13:24:37.420 15572:Thread-8 s3ql.backends.common.wrapped: Encountered ConnectionClosed (found closed when trying to write), retrying ObjectW.close (attempt 3)...
2016-06-19 14:29:32.596 15572:Thread-10 s3ql.backends.common.wrapped: Encountered ConnectionClosed (found closed when trying to write), retrying ObjectW.close (attempt 3)...
2016-06-19 16:00:04.635 15572:Thread-12 root.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/mount.py", line 64, in run_with_except_hook
    run_old(*args, **kw)
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/block_cache.py", line 405, in _upload_loop
    self._do_upload(*tmp)
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/block_cache.py", line 432, in _do_upload
    % obj_id).get_obj_size()
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/common.py", line 108, in wrapped
    return method(*a, **kw)
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/common.py", line 340, in perform_write
    return fn(fh)
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/comprenc.py", line 346, in __exit__
    self.close()
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/comprenc.py", line 340, in close
    self.fh.close()
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/comprenc.py", line 505, in close
    self.fh.close()
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/common.py", line 108, in wrapped
    return method(*a, **kw)
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/s3c.py", line 906, in close
    headers=self.headers, body=self.fh)
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/s3c.py", line 460, in _do_request
    query_string=query_string, body=body)
  File "/usr/local/lib/python3.4/dist-packages/s3ql-2.18-py3.4-linux-x86_64.egg/s3ql/backends/s3c.py", line 695, in _send_request
    headers=headers, body=BodyFollowing(body_len))
  File "/usr/local/lib/python3.4/dist-packages/dugong/__init__.py", line 508, in send_request
    self.timeout)
  File "/usr/local/lib/python3.4/dist-packages/dugong/__init__.py", line 1396, in eval_coroutine
    if not next(crt).poll(timeout=timeout):
  File "/usr/local/lib/python3.4/dist-packages/dugong/__init__.py", line 603, in co_send_request
    yield from self._co_send(buf)
  File "/usr/local/lib/python3.4/dist-packages/dugong/__init__.py", line 619, in _co_send
    len_ = self._sock.send(buf)
  File "/usr/lib/python3.4/ssl.py", line 678, in send
    v = self._sslobj.write(data)
ssl.SSLError: [SSL: BAD_LENGTH] bad length (_ssl.c:1638)
2016-06-19 16:08:22.636 15572:MainThread s3ql.mount.unmount: Unmounting file system...
-- SNIP SNIP --




Nikolaus Rath

Jun 19, 2016, 7:52:08 PM
to s3...@googlegroups.com
On Jun 19 2016, Brandon <x...@codeslum.org> wrote:
> Actually, ignore the max object size question - I read your response to my
> other question about billing and see the point about the (file
> size)/(object size) GET/PUT ratio. I will adjust according to the average
> size of the files I store, to keep object counts lower.

That most likely does not make sense. Have you actually done the math?
If you intend to use the file system for archival purposes (i.e., write
once, keep around for a long time, and read again at some indefinite
point in the future), then the one-time cost for the PUT requests will
be completely irrelevant compared to the running cost for storage.
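
To put rough numbers on that (ballpark 2016 rates; check the current price list before relying on them), a sketch for the 45 TB discussed in this thread with the default 10 MB object size:

    MiB, GiB, TiB = 1024 ** 2, 1024 ** 3, 1024 ** 4

    data = 45 * TiB
    obj_size = 10 * MiB

    put_per_1k = 0.005      # ~$ per 1,000 PUT requests (2016 ballpark)
    storage_per_gb = 0.030  # ~$ per GB-month, standard storage (2016 ballpark)

    one_time_puts = (data // obj_size) / 1000 * put_per_1k
    monthly_storage = data / GiB * storage_per_gb

    print('one-time PUT cost: ${:,.2f}'.format(one_time_puts))    # ~ $23.59
    print('storage per month: ${:,.2f}'.format(monthly_storage))  # ~ $1,382.40

The one-time request cost comes out two orders of magnitude below a single month of storage, which is the point.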

Nikolaus Rath

Jun 19, 2016, 7:53:56 PM
to s3...@googlegroups.com
On Jun 19 2016, Brandon <x...@codeslum.org> wrote:
> Got a crash again - this time, after upgrading to 2.18, it was stable
> for
[...]

Thanks for reporting! Unfortunately this is a known problem, see
https://bitbucket.org/nikratio/s3ql/issues/87/retry-on-ssl-bad_write_retry.

The only known reliable workaround is not to use SSL (e.g., by passing the no-ssl backend option).

Brandon Orwell

Jun 19, 2016, 8:01:16 PM
to s3...@googlegroups.com

Right, but in this case, what's the difference? If I set it to a billion-gigabyte max object size, would it matter in any way aside from keeping the number of objects to a minimum? Does having a lower size help ensure there are no issues in the case of an interrupted transfer? And it would matter because if we're talking 45TB, which is what is being uploaded, and we had to recover from a catastrophic failure, the number of GETs to recover and the number of PUTs to store would in any case be high if the object size is low. So the math here is that it's very expensive if there are that many objects.

So again - is there any performance hit or any kind of logistical issue with a larger object size? If I set it to 1GB for our larger archive files, would there be some kind of issue?

Thanks for the fast response,

Brandon


Nikolaus Rath

Jun 19, 2016, 8:37:29 PM
to s3...@googlegroups.com
On Jun 19 2016, Brandon Orwell <x...@codeslum.org> wrote:
> Right, but in this case, what's the difference?

You should not deviate from the default without good reason. So if you
think it doesn't make a difference, that's an argument for going with
the default.

> Does having a lower size help ensure there are no issues in the case of
> an interrupted transfer?

Yes, among other things.

> And it would matter because if we're talking 45TB, which is what is
> being uploaded, and we had to recover from a catastrophic failure, the
> number of GETs to recover and the number of PUTs to store would in any
> case be high if the object size is low. So the math here is that it's
> very expensive if there are that many objects.

I highly recommend that you actually *do* the math, instead of relying
on your intuition of what the result will be.

Suppose you use standard storage. Transferring 45 TB out costs you about
$0.090/GB * 45 TB = $4,147. With the default object size, you need
4,718,592 GET requests, costing you $1.89 ($0.004 per 10,000 requests).
In other words, in the best case you will save less than $2 out of
$4,149.
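
For anyone who wants to re-run these numbers with their own figures, the same arithmetic in a few lines of Python (the prices are the 2016 rates quoted above, so substitute current ones):

    MiB, GiB, TiB = 1024 ** 2, 1024 ** 3, 1024 ** 4

    data = 45 * TiB          # total data in a full restore
    obj_size = 10 * MiB      # default max object size
    transfer_per_gb = 0.090  # $/GB transferred out
    get_per_10k = 0.004      # $ per 10,000 GET requests

    transfer_cost = data / GiB * transfer_per_gb
    n_gets = data // obj_size
    request_cost = n_gets / 10000 * get_per_10k

    print('transfer cost: ${:,.0f}'.format(transfer_cost))     # ~ $4,147
    print('{:,} GETs: ${:,.2f}'.format(n_gets, request_cost))  # ~ $1.89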

Brandon Orwell

Jun 20, 2016, 12:35:32 AM
to s3...@googlegroups.com
So - what are the 'other things' with reference to increasing the default object size? I have one filesystem set to 30MB. What problems could this cause? Please elaborate if you will.

Thanks again,

Brandon

Brandon Orwell

Jun 21, 2016, 8:35:54 PM
to s3...@googlegroups.com

Just so everyone knows: since disabling SSL via the backend option "no-ssl" on Amazon S3, I haven't had a single occurrence of the transport endpoint not connected error killing my mount. I have had three S3 buckets mounted simultaneously and have transferred well over 8TB now without the endpoint disconnecting. I got the idea after seeing an SSL error in mount.log following the last crash and decided to disable it. I don't know if this is a coincidence or if there's an issue with SSL, but I just wanted to let everyone know that, for me, disabling SSL did the trick. It's been nice and stable since.

Thanks for all the help!

Brandon

Nikolaus Rath

Jun 22, 2016, 10:49:07 AM
to s3...@googlegroups.com
Hi Brandon,

A: Because it confuses the reader.
Q: Why?
A: No.
Q: Should I write my response above the quoted reply?

..so please quote properly, as I'm doing in the rest of this mail:

On Jun 21 2016, Brandon Orwell <x...@codeslum.org> wrote:
> Just so everyone knows: since disabling SSL via the backend option
> "no-ssl" on Amazon S3, I haven't had a single occurrence of the transport
> endpoint not connected error killing my mount. I have had three S3
> buckets mounted simultaneously and have transferred well over 8TB now
> without the endpoint disconnecting. I got the idea after seeing an SSL
> error in mount.log following the last crash and decided to disable it.

I take that to mean you didn't read my answer to your report? In
http://article.gmane.org/gmane.comp.file-systems.s3ql/1469 I actually
told you about this very workaround.

> I don't know if this is a coincidence or if there's an issue with SSL,

No, that's not a coincidence. The problem is with Python's OpenSSL
integration, so disabling SSL ensures that this problem will not
happen.

Brandon Orwell

Jun 22, 2016, 11:38:14 AM
to s3...@googlegroups.com


On Jun 22, 2016 9:49 AM, "Nikolaus Rath" <Niko...@rath.org> wrote:
>
> Hi Brandon,
>
> A: Because it confuses the reader.
> Q: Why?
> A: No.
> Q: Should I write my response above the quoted reply?
>
> ..so please quote properly, as I'm doing in the rest of this mail:
>
> On Jun 21 2016, Brandon Orwell <x...@codeslum.org> wrote:
> > Just so everyone knows: since disabling SSL via the backend option
> > "no-ssl" on Amazon S3, I haven't had a single occurrence of the transport
> > endpoint not connected error killing my mount. I have had three S3
> > buckets mounted simultaneously and have transferred well over 8TB now
> > without the endpoint disconnecting. I got the idea after seeing an SSL
> > error in mount.log following the last crash and decided to disable it.
>
> I take that to mean you didn't read my answer to your report? In
> http://article.gmane.org/gmane.comp.file-systems.s3ql/1469 I actually
> told you about this very workaround.

No, I didn't see it. Google Groups is screwy, and so is its mail editor.

>
> > I don't know if this is a coincidence or if there's an issue with SSL,
>
> No, that's not a coincidence. The problem is with Python's OpenSSL
> integration, so disabling SSL ensures that this problem  will not
> happen.

Is it known whether this happens only with particular versions of OpenSSL, or is the problem in Python itself?

Nikolaus Rath

Jun 22, 2016, 12:18:43 PM
to s3...@googlegroups.com
On Jun 22 2016, Brandon Orwell <x...@codeslum.org> wrote:
>> No, that's not a coincidence. The problem is with Python's OpenSSL
>> integration, so disabling SSL ensures that this problem will not
>> happen.
>>
>
> Is it known whether this happens only with particular versions of
> OpenSSL, or is the problem in Python itself?

The problem is in Python, and it is possible that it only manifests in
combination with specific OpenSSL versions.