Primary Pod Errors


Samir Faci

May 13, 2021, 11:53:07 AM
to Postgres Operator
This is happening on a new cluster I created recently. I had an issue yesterday where the database went down (the primary was down), so I manually failed over.

essentially running: 
pgo failover dashboard-prod-20-100 --target=gke-dashboard-beta-default-pool-dd4da635-3p3l

That seems to have fixed things and when I do a test it shows all green, but when I try to take a backup I get this error.

pgo backup --backup-opts="--type=full"  dashboard-prod-20-100
Error: primary pod is not in Ready state
I presume that by 'primary pod' we're talking about the primary instance of the database.

The logs in service/dashboard-prod-20-100 all look good.

2021-05-13 15:41:02,987 INFO: no action.  i am the leader with the lock
2021-05-13 15:41:12,980 INFO: Lock owner: dashboard-prod-20-100-tnki-74cb96ddc8-5blgx; I am dashboard-prod-20-100-tnki-74cb96ddc8-5blgx
2021-05-13 15:41:13,028 INFO: no action.  i am the leader with the lock
2021-05-13 15:41:22,980 INFO: Lock owner: dashboard-prod-20-100-tnki-74cb96ddc8-5blgx; I am dashboard-prod-20-100-tnki-74cb96ddc8-5blgx
2021-05-13 15:41:22,991 INFO: no action.  i am the leader with the lock

Replica Service also seems fine.

2021-05-13 15:43:43,030 INFO: no action.  i am a secondary and i am following a leader
2021-05-13 15:43:52,987 INFO: Lock owner: dashboard-prod-20-100-tnki-74cb96ddc8-5blgx; I am dashboard-prod-20-100-5694d79955-gsvs4
2021-05-13 15:43:52,987 INFO: does not have lock
2021-05-13 15:43:53,011 INFO: no action.  i am a secondary and i am following a leader
The API server gives me this error; I'm not sure if it is related:

2021/05/11 04:26:07 http: panic serving 127.0.0.1:60952: runtime error: index out of range [46] with length 46
goroutine 1003161 [running]:
net/http.(*conn).serve.func1(0xc0006830e0)
    /usr/lib/golang/src/net/http/server.go:1801 +0x147
panic(0x165e1a0, 0xc0007927e0)
    /usr/lib/golang/src/runtime/panic.go:975 +0x47a
github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions.parseBackupOpts(0xc0004937d0, 0x2e, 0x1973d60, 0xc00008aaa0, 0xc000470000)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions/backupoptionsutil.go:115 +0x665
github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions.convertBackupOptsToStruct(0xc0004937d0, 0x2e, 0x14e8b80, 0xc0005f5200, 0x0, 0x0, 0x7, 0xc000492510, 0x27, 0x1975180, ...)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions/backupoptionsutil.go:62 +0x50
github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions.ValidateBackupOpts(0xc0004937d0, 0x2e, 0x14e8b80, 0xc0005f5200, 0x1e, 0x0)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions/backupoptionsutil.go:49 +0x10a
github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice.validateDataSourceParms(0xc0005f5200, 0x0, 0x0)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice/clusterimpl.go:2371 +0x38e
github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice.CreateCluster(0xc0005f5200, 0xc0004af000, 0x3, 0xc000792d60, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice/clusterimpl.go:586 +0x47a
github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice.CreateClusterHandler(0x19a68a0, 0xc0000fa70
The operator looks fine; it just repeats '2021/05/13 00:41:28 INF   82 (localhost:4150) connecting to nsqd'. The event service looks fine as well, same for the scheduler.

Executing the backup command with no options also gives me the same error (i.e. pgo backup dashboard-prod-20-100).

Any thoughts on what is going on here?







Jonathan S. Katz

May 13, 2021, 6:04:17 PM
to Samir Faci, Postgres Operator
Hi Samir,

Thanks for submitting this.

Before getting into the issue: overall, I do appreciate the reproduction steps you provided; however, they were missing a few elements that are helpful in troubleshooting, e.g. which version of PGO this is, what platform you are running on, etc. In the future, I suggest following the bug reporting template -- anything that refers to a Go panic is definitely a bug:


The rest of my comments are inline:

On Thu, May 13, 2021 at 11:53 AM Samir Faci <sa...@es.net> wrote:
This is happening on a new cluster I created recently. I had an issue yesterday where the database went down (the primary was down), so I manually failed over.

essentially running: 
pgo failover dashboard-prod-20-100 --target=gke-dashboard-beta-default-pool-dd4da635-3p3l

That seems to have fixed things and when I do a test it shows all green, but when I try to take a backup I get this error.

pgo backup --backup-opts="--type=full"  dashboard-prod-20-100
Error: primary pod is not in Ready state
I presume that by 'primary pod' we're talking about the primary instance of the database.

The issue is in the above command, but the command as you pasted it would not actually trigger the crash. I am presuming some elements may have been redacted, but one of the giveaways was the index length in the error below.

The good news is the issue can be reproduced with the following:

pgo backup --backup-opts="--type=full " dashboard-prod-20-100
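To illustrate the failure mode, here is a simplified, self-contained sketch -- this is not the operator's actual parseBackupOpts code, just the general lookahead pattern -- showing how scanning an options string that ends in a space can read past the end:

package main

import "fmt"

// splitOpts is a hypothetical tokenizer used only for illustration; it is not
// the operator's parseBackupOpts. It scans the option string and peeks one
// character past each space to decide whether a new flag starts there.
func splitOpts(opts string) []string {
    fields := []string{}
    current := ""
    for i := 0; i < len(opts); i++ {
        // BUG: if the string ends in a space, i+1 == len(opts) and the
        // lookahead opts[i+1] panics with "index out of range".
        if opts[i] == ' ' && opts[i+1] == '-' {
            fields = append(fields, current)
            current = ""
            continue
        }
        current += string(opts[i])
    }
    if current != "" {
        fields = append(fields, current)
    }
    return fields
}

func main() {
    fmt.Println(splitOpts("--type=full --start-fast")) // ok: [--type=full --start-fast]
    fmt.Println(splitOpts("--type=full "))             // panics: index out of range [12] with length 12
}

In this sketch the fix would simply be to bounds-check the lookahead before reading opts[i+1].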
 
The logs in service/dashboard-prod-20-100 all look good.

2021-05-13 15:41:02,987 INFO: no action.  i am the leader with the lock
2021-05-13 15:41:12,980 INFO: Lock owner: dashboard-prod-20-100-tnki-74cb96ddc8-5blgx; I am dashboard-prod-20-100-tnki-74cb96ddc8-5blgx
2021-05-13 15:41:13,028 INFO: no action.  i am the leader with the lock
2021-05-13 15:41:22,980 INFO: Lock owner: dashboard-prod-20-100-tnki-74cb96ddc8-5blgx; I am dashboard-prod-20-100-tnki-74cb96ddc8-5blgx
2021-05-13 15:41:22,991 INFO: no action.  i am the leader with the lock

Replica Service also seems fine.

2021-05-13 15:43:43,030 INFO: no action.  i am a secondary and i am following a leader
2021-05-13 15:43:52,987 INFO: Lock owner: dashboard-prod-20-100-tnki-74cb96ddc8-5blgx; I am dashboard-prod-20-100-5694d79955-gsvs4
2021-05-13 15:43:52,987 INFO: does not have lock
2021-05-13 15:43:53,011 INFO: no action.  i am a secondary and i am following a leader
The API server gives me this error; I'm not sure if it is related:
 
The below log trace was very helpful, thank you.
2021/05/11 04:26:07 http: panic serving 127.0.0.1:60952: runtime error: index out of range [46] with length 46
goroutine 1003161 [running]:
net/http.(*conn).serve.func1(0xc0006830e0)
    /usr/lib/golang/src/net/http/server.go:1801 +0x147
panic(0x165e1a0, 0xc0007927e0)
    /usr/lib/golang/src/runtime/panic.go:975 +0x47a
github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions.parseBackupOpts(0xc0004937d0, 0x2e, 0x1973d60, 0xc00008aaa0, 0xc000470000)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions/backupoptionsutil.go:115 +0x665
github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions.convertBackupOptsToStruct(0xc0004937d0, 0x2e, 0x14e8b80, 0xc0005f5200, 0x0, 0x0, 0x7, 0xc000492510, 0x27, 0x1975180, ...)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions/backupoptionsutil.go:62 +0x50
github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions.ValidateBackupOpts(0xc0004937d0, 0x2e, 0x14e8b80, 0xc0005f5200, 0x1e, 0x0)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/backupoptions/backupoptionsutil.go:49 +0x10a
github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice.validateDataSourceParms(0xc0005f5200, 0x0, 0x0)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice/clusterimpl.go:2371 +0x38e
github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice.CreateCluster(0xc0005f5200, 0xc0004af000, 0x3, 0xc000792d60, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /opt/cdev/src/github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice/clusterimpl.go:586 +0x47a
github.com/crunchydata/postgres-operator/internal/apiserver/clusterservice.CreateClusterHandler(0x19a68a0, 0xc0000fa70
The operator looks fine; it just repeats '2021/05/13 00:41:28 INF   82 (localhost:4150) connecting to nsqd'. The event service looks fine as well, same for the scheduler.

Executing the backup command with no options also gives me the same error (i.e. pgo backup dashboard-prod-20-100).

Any thoughts on what is going on here?

I have a bug fix patch available that will handle the issue:


This will be applied in the upcoming release.

In the interim, I would suggest ensuring there is no stray leading or trailing whitespace in the "--backup-opts" value.
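For example, with no stray space inside the quotes, the backup command would look like:

pgo backup --backup-opts="--type=full" dashboard-prod-20-100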

If you continue to receive the "primary pod is not in Ready state" error, that could mean the primary is unhealthy, or that not all of the attributes used for primary detection have been set (primary role label, pod phase, etc.)
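If it comes to that, one quick sanity check is to look at the pods directly and confirm the primary reports Ready. The namespace and labels below assume a fairly standard PGO 4.x deployment, so adjust them for your environment:

kubectl get pods -n pgo -l pg-cluster=dashboard-prod-20-100 --show-labels
kubectl describe pod <primary-pod-name> -n pgo

The first command shows the READY column and the labels on each instance; the second shows the pod's conditions and recent events if something is off.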

Thanks for reporting!

Jonathan 

Jonathan S. Katz

May 13, 2021, 6:10:34 PM
to Samir Faci, Postgres Operator
Hi Samir,

Also, as we do credit bug reporters, if you have a GitHub handle I can credit, I would be happy to include it in the release notes.

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 


Samir Faci

May 13, 2021, 9:02:08 PM
to Jonathan S. Katz, Postgres Operator
Hello Jonathan, 

Oh, I'm happy to use GitHub if that's the preferred method. I wasn't sure if it was a bug or some flag I missed in the initial setup.


On Thu, May 13, 2021 at 3:10 PM Jonathan S. Katz <jonath...@crunchydata.com> wrote:
Hi Samir,

Also, as we do credit bug reporters, if you have a GitHub handle I can credit, I would be happy to include it in the release notes.

GH: safaci2000
 

Thanks,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 



On Thu, May 13, 2021 at 6:04 PM Jonathan S. Katz <jonath...@crunchydata.com> wrote:
Hi Samir,

Thanks for submitting this.

Before getting into the issue: overall, I do appreciate the reproduction steps you provided; however, they were missing a few elements that are helpful in troubleshooting, e.g. which version of PGO this is, what platform you are running on, etc. In the future, I suggest following the bug reporting template -- anything that refers to a Go panic is definitely a bug:


PGO version: 4.6.0
Platform: GCP, Kubernetes version: 1.18.17-gke.100
Well, as silly as that is, it's easy enough on my end to make sure there are no extra spaces. Thank you for the fix and for catching that!



If you continue to receive the "primary pod is not in Ready state" error, that could mean the primary is unhealthy, or that not all of the attributes used for primary detection have been set (primary role label, pod phase, etc.)

Thanks for reporting!

Jonathan 


Jonathan S. Katz

May 13, 2021, 9:02:59 PM
to Samir Faci, Postgres Operator
Hi Samir,

The mailing list is fine and generally preferred -- I was mostly referring to the template as a way to compile everything :-)

Thanks again,

Jonathan

Jonathan S. Katz
VP Platform Engineering

Crunchy Data
Enterprise PostgreSQL 


Jonathan S. Katz

May 17, 2021, 4:29:38 PM
to Samir Faci, Postgres Operator
This mailing list is public and archived.

Thanks,

Jonathan


On Mon, May 17, 2021 at 3:52 PM Samir Faci <sa...@es.net> wrote:
Hello Jonathan/CrunchyData Folks,

Sort of a random question: is the mailing list archived and public? When I do find a solution, I try to post it on the assumption that it will help others find it in the future, but if it isn't persisted, it can just live in my internal documentation flow rather than filling other people's inboxes.

--
Thank you,

Samir Faci

Samir Faci

May 17, 2021, 4:29:57 PM
to Jonathan S. Katz, Postgres Operator
Hello Jonathan/CrunchyData Folks,

Sort of a random question: is the mailing list archived and public? When I do find a solution, I try to post it on the assumption that it will help others find it in the future, but if it isn't persisted, it can just live in my internal documentation flow rather than filling other people's inboxes.

--
Thank you,

Samir Faci

