Invalid Backup causes Downtime


Samir Faci

Mar 23, 2023, 12:08:43 PM
to Postgres Operator
Hello all, 

I just wanted to report an issue I ran into. I have one cluster running version 5.2.0 of the operator, on GKE infrastructure.

I configured it with two backup repositories: repo1 is a local backup, while repo2 is a remote GCS backup.

```sh
cronjob.batch/dashboard-v5-2-0-repo1-diff 0 1 0-6 False 0 14h 2d1h
cronjob.batch/dashboard-v5-2-0-repo1-full 0 1 0 False 0 2d1h
cronjob.batch/dashboard-v5-2-0-repo2-diff 0 3 0-6 False 0 12h 2d1h
cronjob.batch/dashboard-v5-2-0-repo2-full 0 3 0 False 0 2d1h
```

I had an issue with the deployment where the wrong service key was deployed, and the backup failed. That in itself isn't a problem: without the right credentials, I would expect the backup to fail. But the failing backup jobs seem to bring down the entire cluster, and database connectivity no longer works.


I attached a copy of the manifest I used for the initial deployment for reference.

The event logs look like this:

```json

{
  "insertId": "gdvkpxfo9ayla",
  "jsonPayload": {
    "kind": "Event",
    "eventTime": null,
    "apiVersion": "v1",
    "message": "Readiness probe failed: HTTP probe failed with statuscode: 503",
    "type": "Warning",
    "source": {
      "host": "gke-prod2-dashboard-default-pool-3c07ebdf-tw9p",
      "component": "kubelet"
    },
    "metadata": {
      "creationTimestamp": "2023-03-23T12:50:25Z",
      "resourceVersion": "3368",
      "namespace": "postgres-operator",
      "name": "dashboard-v5-2-0-00-jdgd-0.174f0d506cf55d67",
      "managedFields": [
        {
          "time": "2023-03-23T12:50:25Z",
          "operation": "Update",
          "fieldsType": "FieldsV1",
          "manager": "kubelet",
          "apiVersion": "v1",
          "fieldsV1": {
            "f:count": {},
            "f:involvedObject": {},
            "f:reason": {},
            "f:source": {
              "f:host": {},
              "f:component": {}
            },
            "f:message": {},
            "f:type": {},
            "f:lastTimestamp": {},
            "f:firstTimestamp": {}
          }
        }
      ],
      "uid": "28889db4-2523-43ea-9151-e54d9e398ad2"
    },
    "reportingInstance": "",
    "involvedObject": {
      "apiVersion": "v1",
      "name": "dashboard-v5-2-0-00-jdgd-0",
      "namespace": "postgres-operator",
      "fieldPath": "spec.containers{database}",
      "uid": "c7fce83f-b0a8-4a78-aa41-cc8d74d18334",
      "resourceVersion": "1209718",
      "kind": "Pod"
    },
    "lastTimestamp": "2023-03-23T12:52:05Z",
    "reason": "Unhealthy",
    "reportingComponent": ""
  },
  "resource": {
    "type": "k8s_pod",
    "labels": {
      "namespace_name": "postgres-operator",
      "cluster_name": "prod2-dashboard",
      "project_id": "esnet-sd-dev",
      "location": "us-central1-c",
      "pod_name": "dashboard-v5-2-0-00-jdgd-0"
    }
  },
  "timestamp": "2023-03-23T12:52:05Z",
  "severity": "WARNING",
  "logName": "projects/esnet-sd-dev/logs/events",
  "receiveTimestamp": "2023-03-23T12:52:10.131326261Z"
}
```

```
(combined from similar events): command terminated with exit code 39: ERROR: [039]: HTTP request failed with 403 (Forbidden):
*** Path/Query ***:
GET /storage/v1/b/prod-dashboards/o/pgbackrest%2Fpostgres-operator%2Fdashboard-v5-2-0-gcs%2Fprod2%2Farchive%2Fdb%2Farchive.info?fields=size%2Cupdated
*** Request Headers ***:
authorization: <redacted>
content-length: 0
host: storage.googleapis.com
*** Response Headers ***:
cache-control: no-cache, no-store, max-age=0, must-revalidate
content-length: 598
content-type: application/json; charset=UTF-8
date: Thu, 23 Mar 2023 13:39:06 GMT
expires: Mon, 01 Jan 1990 00:00:00 GMT
pragma: no-cache
server: UploadServer
vary: Origin, X-Origin
x-guploader-uploadid: ADPycdtw4LkL_Sg2cr4K0H2fQBnbkXeHHz8_EO8JbyxANE8GsE32RIE2S3E9GBHUW_t-YfxNQ468McouKXmDP9OQJ7BjRA
*** Response Content ***:
{
  "error": {
    "code": 403,
    "message": "staging-bac...@esnet-sd-dev.iam.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).",
    "errors": [
      {
        "message": "staging-bac...@esnet-sd-dev.iam.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).",
        "domain": "global",
        "reason": "forbidden"
      }
    ]
  }
}
```
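
For anyone triaging similar events, most of the record above is metadata; only a handful of fields matter. The following is a minimal sketch in Python, assuming the entry has been exported as JSON in the same shape as the record above. The helper name `summarize_event` and the trimmed `SAMPLE` are illustrative and not part of the operator or of any Google Cloud tooling.

```python
import json

# Trimmed copy of the event above; only the fields read below are kept.
SAMPLE = """
{
  "jsonPayload": {
    "message": "Readiness probe failed: HTTP probe failed with statuscode: 503",
    "type": "Warning",
    "reason": "Unhealthy",
    "involvedObject": {
      "kind": "Pod",
      "name": "dashboard-v5-2-0-00-jdgd-0",
      "namespace": "postgres-operator",
      "fieldPath": "spec.containers{database}"
    }
  },
  "timestamp": "2023-03-23T12:52:05Z",
  "severity": "WARNING"
}
"""


def summarize_event(entry: dict) -> dict:
    """Reduce one exported event entry to the fields useful for triage.

    Hypothetical helper: missing keys yield None instead of raising.
    """
    payload = entry.get("jsonPayload", {})
    obj = payload.get("involvedObject", {})
    return {
        "time": entry.get("timestamp"),
        "severity": entry.get("severity"),
        "reason": payload.get("reason"),
        "object": f'{obj.get("kind")}/{obj.get("namespace")}/{obj.get("name")}',
        "container": obj.get("fieldPath"),
        "message": payload.get("message"),
    }


if __name__ == "__main__":
    summary = summarize_event(json.loads(SAMPLE))
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Reading the summary this way makes it clear that the failing readiness probe belongs to the `database` container of the Postgres pod, not to the backup job itself.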


I'm fairly sure I have resolved the issue now that the correct service account has been added, but the bug I'm reporting is that a failed backup job should not bring down a production database, no matter what the underlying issue was. Or am I wrong to make that assumption?


--

Samir Faci



postgres.yaml

drew.s...@crunchydata.com

Mar 29, 2023, 8:29:32 PM
to Postgres Operator, Samir Faci
Hello Samir,

Thank you for bringing this to our attention. Would you mind sharing some more information to help us get to the bottom of this? Would you please send me the operator logs as well as the postgres logs?

pgBackRest not only creates backups but is also involved in archiving WAL to the repos. So, if the repo was unavailable for a period of time, this could have affected archiving, which could be a factor in the issue you saw. If this is the case, we should see something in the logs, so we look forward to taking a look at those.
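
One quick way to see whether archiving was affected, before digging through the logs, is the standard `pg_stat_archiver` view on the primary. The following is a rough sketch in Python, not something the operator ships: the helper name `archiving_is_failing` is hypothetical, and the timestamps in the usage example are made up for illustration.

```python
from typing import Any, Mapping

# Query to run on the primary (for example via psql). pg_stat_archiver is a
# standard PostgreSQL statistics view; these are its documented columns.
ARCHIVER_QUERY = """
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
"""


def archiving_is_failing(row: Mapping[str, Any]) -> bool:
    """Return True when the most recent archive attempt was a failure.

    `row` is one result row of ARCHIVER_QUERY as a mapping. Timestamps may be
    datetime objects or strings in one consistent, sortable format.
    """
    last_failed = row.get("last_failed_time")
    last_archived = row.get("last_archived_time")
    if last_failed is None:
        return False  # no failure has been recorded
    if last_archived is None:
        return True  # failures recorded, but never a success
    return last_failed > last_archived


if __name__ == "__main__":
    # Illustrative values only, not taken from the affected cluster.
    row = {
        "last_archived_time": "2023-03-23 12:40:00+00",
        "last_failed_time": "2023-03-23 13:39:06+00",
    }
    print(archiving_is_failing(row))
```

A `True` result only says that the latest archive attempt failed; the operator and Postgres logs are still needed to tell whether that is related to the downtime.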

Thanks!

Regards,
Drew

Samir Faci

Mar 31, 2023, 1:47:58 PM
to drew.s...@crunchydata.com, Postgres Operator
Sure. I won't be working on that until next week or later, but I can try to recreate the environment and force that behavior again.
