GCS 410 error during basebackup


Yun Guo

Jun 21, 2019, 9:43:43 AM
to wa...@googlegroups.com, Andreas Brandl, Anthony Sandoval

Hi,

We are using wal-e v1.1 to back up to GCS. The total backup is around 3.2T.
We noticed that the wal-e process sporadically fails with HTTP 410; the log is below.
Jun 21 02:30:57  wal_e.worker.upload INFO     MSG: beginning volume compression#012        DETAIL: Building volume 1142.#012        STRUCTURED: time=2019-06-21T02:30:57.666929-00 pid=37373
Jun 21 02:30:58  wal_e.worker.upload INFO     MSG: beginning volume compression#012        DETAIL: Building volume 1143.#012        STRUCTURED: time=2019-06-21T02:30:58.958880-00 pid=37373
Jun 21 02:31:13  wal_e.worker.upload INFO     MSG: beginning volume compression#012        DETAIL: Building volume 1144.#012        STRUCTURED: time=2019-06-21T02:31:13.820819-00 pid=37373
Jun 21 02:31:14  wal_e.operator.backup WARNING  MSG: blocking on sending WAL segments#012        DETAIL: The backup was not completed successfully, but we have to wait anyway.  See README: TODO about pg_cancel_backup#012        STRUCTURED: time=2019-06-21T02:31:14.716392-00 pid=37373
Jun 21 02:31:17  wal_e.main   CRITICAL MSG: An unprocessed exception has avoided all error handling#012        DETAIL: Traceback (most recent call last):#012          File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1041, in upload_from_file#012            size, num_retries, predefined_acl)#012          File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 957, in _do_upload#012            num_retries, predefined_acl)#012          File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 904, in _do_resumable_upload#012            response = upload.transmit_next_chunk(transport)#012          File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/requests/upload.py", line 396, in transmit_next_chunk#012            self._process_response(result, len(payload))#012          File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/_upload.py", line 574, in _process_response#012            self._get_status_code, callback=self._make_invalid)#012          File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/_helpers.py", line 93, in require_status_code#012            status_code, u'Expected one of', *status_codes)#012        google.resumable_media.common.InvalidResponse: ('Request failed with status code', 410, 'Expected one of', <HTTPStatus.OK: 200>, 308)#012        #012        During handling of the above exception, another exception occurred:#012        #012        Traceback (most recent call last):#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line 87, in shim#012            return f(*args, **kwargs)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 140, in put_file_helper#012            return self.blobstore.uri_put_file(self.creds, url, tf)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/blobstore/gs/utils.py", line 38, in uri_put_file#012            
blob.upload_from_file(fp, size=size, content_type=content_type)#012          File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1044, in upload_from_file#012            _raise_from_invalid_response(exc)#012          File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1914, in _raise_from_invalid_response#012            response.status_code, message, response=response)#012        google.api_core.exceptions.GoogleAPICallError: 410 PUT https://www.googleapis.com/upload/storage/v1/b/gitlab-gprd-postgres-backup/o?uploadType=resumable&upload_id=AEnB2UrKU4zHPqzF4fGPeEvhoxJ-2qeIK5xY9SI8O1NIhtOaDn1GC7Q_D4XQVFFXvMVVzuhCLJvUmzTkkKui6M8mpb3BedH15g: ('Request failed with status code', 410, 'Expected one of', <HTTPStatus.OK: 200>, 308)#012        #012        During handling of the above exception, another exception occurred:#012        #012        Traceback (most recent call last):#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/cmd.py", line 652, in main#012            pool_size=args.pool_size)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/operator/backup.py", line 197, in database_backup#012            **kwargs)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/operator/backup.py", line 500, in _upload_pg_cluster_dir#012            pool.put(tpart)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload_pool.py", line 108, in put#012            self._wait()#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload_pool.py", line 65, in _wait#012            raise val#012          File "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 145, in __call__#012            k = put_file_helper()#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line 101, in shim#012            
exc_processor_cxt=exc_processor_cxt)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line 139, in retry_with_count_internal#012            side_effect_func(exc_tup, exc_processor_cxt)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 135, in log_volume_failures_on_error#012            raise typ(value).with_traceback(tb)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line 87, in shim#012            return f(*args, **kwargs)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 140, in put_file_helper#012            return self.blobstore.uri_put_file(self.creds, url, tf)#012          File "/opt/wal-e/lib/python3.5/site-packages/wal_e/blobstore/gs/utils.py", line 38, in uri_put_file#012            blob.upload_from_file(fp, size=size, content_type=content_type)#012          File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1044, in upload_from_file#012            _raise_from_invalid_response(exc)#012          File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1914, in _raise_from_invalid_response#012            response.status_code, message, response=response)#012        google.api_core.exceptions.GoogleAPICallError: None 410 PUT https://www.googleapis.com/upload/storage/v1/b/gitlab-gprd-postgres-backup/o?uploadType=resumable&upload_id=AEnB2UrKU4zHPqzF4fGPeEvhoxJ-2qeIK5xY9SI8O1NIhtOaDn1GC7Q_D4XQVFFXvMVVzuhCLJvUmzTkkKui6M8mpb3BedH15g: ('Request failed with status code', 410, 'Expected one of', <HTTPStatus.OK: 200>, 308)#012        #012        STRUCTURED: time=2019-06-21T02:31:17.960909-00 pid=37373


Any idea what we can do to fix it?

Thanks


--

Yun Guo
Senior Database Engineer | GitLab


Samuel Kohonen

Jun 21, 2019, 11:27:33 AM
to yg...@gitlab.com, wal-e, Andreas Brandl, Anthony Sandoval
Hey,

It seems that the resumable upload session that the Google Python library uses for large files disappeared for some reason. I have no idea how common that is or why it would happen, but unfortunately the Google library no longer seems to retry those errors itself now that we have stopped using the deprecated num_retries parameter directly.

Are you open to hacking your wal-e installation a bit to see whether checking for GoogleAPICallError (and maybe specifically 410) and retrying fixes this? We can think about cleaner solutions afterwards. Checking for the exception somewhere in this if-branch (https://github.com/wal-e/wal-e/blob/master/wal_e/worker/upload.py#L119) and making sure it doesn't reach the else block and get raised should force wal-e to retry the upload. Is this something you could try adding to your local installation to see if it fixes the situation for you?
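
The retry-on-410 idea can be illustrated with a small standalone sketch. This is not wal-e's actual retry code: `TransientUploadError`, `is_retryable`, and `upload_with_retry` are hypothetical names, and the exception class is only a stand-in for `google.api_core.exceptions.GoogleAPICallError` so the example runs without the Google library installed. It assumes that the callable restarts the upload from the beginning on every attempt, since the old resumable session is presumably unusable after a 410.

```python
import time


class TransientUploadError(Exception):
    """Hypothetical stand-in for GoogleAPICallError; carries an HTTP status code."""

    def __init__(self, message, code=None):
        super().__init__(message)
        self.code = code


def is_retryable(exc):
    # Only HTTP 410 (resumable session gone) is treated as retryable here.
    return getattr(exc, "code", None) == 410


def upload_with_retry(do_upload, max_attempts=3, delay_seconds=1.0):
    """Call do_upload() until it succeeds or max_attempts is reached.

    do_upload is assumed to start a fresh upload on every call.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return do_upload()
        except Exception as exc:
            # Re-raise anything that is not a 410, and give up after the last attempt.
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

In wal-e itself the equivalent decision would live in the exception-processing branch linked above rather than in a separate wrapper; the sketch only shows the control flow.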

Cheers,
Samuel


Yun Guo

Jun 21, 2019, 2:29:57 PM
to Samuel Kohonen, wal-e, Andreas Brandl, Anthony Sandoval
Thanks Samuel!
Can you help me review whether the function below can be used to check for the 410 exception?

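# Import the module that actually defines GoogleAPICallError; fall back to
# None so the check is a no-op when the Google libraries are not installed.
try:
    import google.api_core.exceptions as gcs_exceptions
except ImportError:
    gcs_exceptions = None

def is_gcs_response_error(typ, value):
    """Return True if the exception is a GCS API error with HTTP status 410."""
    if gcs_exceptions is None:
        return False

    if not issubclass(typ, gcs_exceptions.GoogleAPICallError):
        return False

    # Use getattr in case the exception instance carries no status code.
    return getattr(value, 'code', None) == 410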

Samuel Kohonen

Jun 21, 2019, 2:56:44 PM
to yg...@gitlab.com, wal-e, Andreas Brandl, Anthony Sandoval
Looks OK at a quick glance. I tried to verify that this is the exception that bubbles up at the point in the code I marked in my first email. Unless I completely misread the traceback, that should be the correct exception to catch.

I hope you can test this in a test cluster before going to production. I guess the worst case for this change is that WAL uploads would start failing if there is a typo or something similar somewhere.

Yun Guo

Jun 21, 2019, 3:13:37 PM
to Samuel Kohonen, wal-e, Andreas Brandl, Anthony Sandoval
Thanks! I'll try to test it, but it only happens sporadically, so I'm not sure I'll be able to verify it in a test cluster.
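
Because the failure is sporadic, one option is to exercise the check with a hand-raised exception instead of waiting for a real 410. The sketch below is only an illustration under assumptions: `FakeAPICallError`, `is_gone_error`, and `simulate` are hypothetical names, and the fake class merely imitates an exception that carries an HTTP status in a `code` attribute. The base class is passed in as a parameter so the same logic can be tried without the Google library installed.

```python
import sys


class FakeAPICallError(Exception):
    """Hypothetical test double: an exception with an HTTP status in .code."""

    def __init__(self, message, code=None):
        super().__init__(message)
        self.code = code


def is_gone_error(typ, value, base_class):
    # Same shape as the (typ, value) check discussed in this thread:
    # the type must derive from base_class and the status must be 410.
    return issubclass(typ, base_class) and getattr(value, "code", None) == 410


def simulate(exc):
    """Raise exc, capture it via sys.exc_info(), and run the check on it."""
    try:
        raise exc
    except Exception:
        typ, value, _tb = sys.exc_info()
        return is_gone_error(typ, value, FakeAPICallError)


if __name__ == "__main__":
    print(simulate(FakeAPICallError("simulated", code=410)))
    print(simulate(FakeAPICallError("simulated", code=503)))
    print(simulate(ValueError("unrelated")))
```

This only confirms that the predicate behaves as intended on synthetic input; it says nothing about whether a retried upload actually succeeds against GCS.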