authentication fails (again) - cluster unstable

Just Marvin

unread,

Mar 29, 2019, 5:01:33 AM3/29/19

to OpenShift 4 Developer Preview

Hi,

This is the second time my install has failed after two days, in the same way: passwords no longer work. The "oc login" command produces a "error: EOF" message, and "oc whoami" says that I'm not authenticated. No configuration changes were done to the cluster itself since it was installed. Here is some verbose output from the login command:

[friar@oc6344180105 ocp4]$ oc login -u kubeadmin -p <kubeadmin password> --v=10

I0328 22:08:59.483578 28404 loader.go:359] Config loaded from file /home/friar/ocp4/auth/kubeconfig

I0328 22:08:59.484939 28404 loader.go:359] Config loaded from file /home/friar/ocp4/auth/kubeconfig

I0328 22:08:59.485298 28404 round_trippers.go:386] curl -k -v -XHEAD 'https://api.gatt.random.domain.name:6443/'

I0328 22:08:59.753744 28404 round_trippers.go:405] HEAD https://api.gatt.random.domain.name:6443/ 403 Forbidden in 268 milliseconds

I0328 22:08:59.753772 28404 round_trippers.go:411] Response Headers:

I0328 22:08:59.753777 28404 round_trippers.go:414] Audit-Id: ac56d8fe-c782-4fb9-946f-78f96a594298

I0328 22:08:59.753782 28404 round_trippers.go:414] Cache-Control: no-store

I0328 22:08:59.753786 28404 round_trippers.go:414] Content-Type: application/json

I0328 22:08:59.753790 28404 round_trippers.go:414] X-Content-Type-Options: nosniff

I0328 22:08:59.753794 28404 round_trippers.go:414] Content-Length: 186

I0328 22:08:59.753798 28404 round_trippers.go:414] Date: Fri, 29 Mar 2019 08:49:10 GMT

I0328 22:08:59.753856 28404 round_trippers.go:386] curl -k -v -XGET -H "X-Csrf-Token: 1" 'https://api.gatt.random.domain.name:6443/.well-known/oauth-authorization-server'

I0328 22:08:59.805970 28404 round_trippers.go:405] GET https://api.gatt.random.domain.name:6443/.well-known/oauth-authorization-server 200 OK in 52 milliseconds

I0328 22:08:59.805985 28404 round_trippers.go:411] Response Headers:

I0328 22:08:59.805990 28404 round_trippers.go:414] Cache-Control: no-store

I0328 22:08:59.805994 28404 round_trippers.go:414] Content-Type: application/json

I0328 22:08:59.805998 28404 round_trippers.go:414] Content-Length: 762

I0328 22:08:59.806002 28404 round_trippers.go:414] Date: Fri, 29 Mar 2019 08:49:10 GMT

I0328 22:08:59.806006 28404 round_trippers.go:414] Audit-Id: 64162c52-2ee8-444c-9220-1be36704ba52

I0328 22:08:59.806633 28404 round_trippers.go:386] curl -k -v -XHEAD 'https://openshift-authentication-openshift-authentication.apps.gatt.random.domain.name'

I0328 22:08:59.916533 28404 round_trippers.go:405] HEAD https://openshift-authentication-openshift-authentication.apps.gatt.random.domain.name in 109 milliseconds

I0328 22:08:59.916554 28404 round_trippers.go:411] Response Headers:

I0328 22:08:59.916563 28404 request_token.go:440] falling back to kubeconfig CA due to possible IO error: EOF

I0328 22:08:59.916636 28404 round_trippers.go:386] curl -k -v -XGET -H "X-Csrf-Token: 1" 'https://openshift-authentication-openshift-authentication.apps.gatt.random.domain.name/oauth/authorize?client_id=openshift-challenging-client&code_challenge=wWrXDvRHNpmma8YfWOBpSaMlzI_NMbp9vtA7NE7R7s0&code_challenge_method=S256&redirect_uri=https%3A%2F%2Fopenshift-authentication-openshift-authentication.apps.gatt.random.domain.name%2Foauth%2Ftoken%2Fimplicit&response_type=code'

I0328 22:09:00.021652 28404 round_trippers.go:405] GET https://openshift-authentication-openshift-authentication.apps.gatt.random.domain.name/oauth/authorize?client_id=openshift-challenging-client&code_challenge=wWrXDvRHNpmma8YfWOBpSaMlzI_NMbp9vtA7NE7R7s0&code_challenge_method=S256&redirect_uri=https%3A%2F%2Fopenshift-authentication-openshift-authentication.apps.gatt.random.domain.name%2Foauth%2Ftoken%2Fimplicit&response_type=code in 104 milliseconds

I0328 22:09:00.021730 28404 round_trippers.go:411] Response Headers:

F0328 22:09:00.021788 28404 helpers.go:119] error: EOF

And here is the install-config used for the install:

apiVersion: v1beta4

baseDomain: openshifttest.boomerangplatform.net

sshKey: ssh-rsa <key value>

compute:

- name: worker

platform:

aws:

type: c5.xlarge

rootVolume:

size: 50

type: gp2

replicas: 2

controlPlane:

name: master

platform:

aws:

type: m4.xlarge

rootVolume:

size: 80

type: gp2

replicas: 2

metadata:

creationTimestamp: null

name: gatt

networking:

clusterNetwork:

- cidr: 10.128.0.0/14

hostPrefix: 23

machineCIDR: 10.0.0.0/16

networkType: OpenShiftSDN

serviceNetwork:

- 172.30.0.0/16

platform:

aws:

region: us-west-1

pullSecret: <pull secret>

Is there way to get the cluster back on its feet, or am I forced to reinstall?

Regards,

Marvin

Just Marvin

unread,

Mar 29, 2019, 9:49:07 AM3/29/19

to OpenShift 4 Developer Preview

Hi,

When I ssh into the master, I see this:

---

[systemd]

Failed Units: 8

rdisc.service

tcsd.service

sssd-autofs.socket

sssd-nss.socket

sssd-pac.socket

sssd-pam-priv.socket

sssd-ssh.socket

sssd-sudo.socket

[core@ip-10-0-129-170 ~]$

What does that mean and how would I fix it?

Regards,

Marvin

W. Trevor King

unread,

Mar 29, 2019, 9:52:59 AM3/29/19

to Just Marvin, OpenShift 4 Developer Preview

On Fri, Mar 29, 2019, 06:49 Just Marvin wrote:

When I ssh into the master, I see this:

---
[systemd]
Failed Units: 8
rdisc.service
tcsd.service
sssd-autofs.socket
sssd-nss.socket
sssd-pac.socket
sssd-pam-priv.socket
sssd-ssh.socket
sssd-sudo.socket

Tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1692667

Cheers,

Trevor

Just Marvin

unread,

Mar 29, 2019, 10:03:30 AM3/29/19

to W. Trevor King, OpenShift 4 Developer Preview

Trevor,

The bug you pointed to has an emphasis on solving the problem of not being able to ssh in. I was able to ssh in. Is there cause to be concerned even when there is no “error” message being reported (as per the bug)?

Regards,

Marvin

Just Marvin

unread,

Mar 31, 2019, 12:41:20 PM3/31/19

to OpenShift 4 Developer Preview

Hi,

Quoting bug report ( https://bugzilla.redhat.com/show_bug.cgi?id=1693951 ) referenced by Trevor in the "start / stop" related thread:

<quote>But if you want to shut down nodes, you'll certainly want to wait after the initial install, for a whole day or however long it takes for the first in-cluster rotations to go through, to get certs with longer validity times before shutting down nodes. The auth/master teams may also have some advice for monitoring those rotations, even if it's just "grep the kube-apiserver-operator logs". Maybe there are Kubernetes Events you can watch for? I dunno.

Alternatively, you can just let the certs expire, and when the cluster comes back up, use SSH (which we don't expire/rotate) to go through and rebuild the x.509 chains.</quote>

I'm at my wits end here. I let the cluster up for atleast 36 hours, and when I tried logging in at that point, got an http 400 error response back. This was different from the "EOF" string I usually got when it had previously died. I thought that perhaps this was a different symptom and decided to bounce the nodes. But when the nodes came up, I got the familiar EOF error on login. I have asked before, and I'll repeat again: is there a way to get the cluster back to normal?

Otherwise I'm faced with another reinstall. And wondering why I'm bothering with a product that stops working after about a day.

Regards,

Marvin

On Friday, March 29, 2019 at 5:01:33 AM UTC-4, Just Marvin wrote:

W. Trevor King

unread,

Mar 31, 2019, 1:14:41 PM3/31/19

to Just Marvin, OpenShift 4 Developer Preview

On Sun, Mar 31, 2019, 09:41 Just Marvin wrote:

I'm at my wits end here. I let the cluster up for at least 36 hours, and when I tried logging in at that point, got an http 400 error response back.

Just to be clear, this cluster was running that whole time, without having nodes stopped or started? We currently have good CI coverage for the first hour post install. There's CI coverage for upgrades [1] which run a bit longer, but we still have some kinks to work out there. Some of those are upgrade-related, but some of them are from the volume of clusters we run straining our test infrastructure; quieter accounts should have higher success rates. We're in the process of building soak-tests to ensure clusters remain healthy for longer (initially a week or so). But if we had all the bugs fixed, we'd have had a general release ;). Until then, the more information we get into (new?) bug reports about symptoms, reproducers, error logs and such, the easier it is for us to fix the issue you've bumped into. Or wait until we've built out the CI to hit those bugs ourselves.

Cheers,

Trevor

[1]: https://openshift-release.svc.ci.openshift.org/

Just Marvin

unread,

Mar 31, 2019, 2:41:22 PM3/31/19

to OpenShift 4 Developer Preview

Trevor,

Thats correct. Cluster and cluster settings was left intact from the point if install for atleast 36 hours. I was able to get useful work done for about 12 - 14 of those hours. I will note that one significant change was to set up htpasswd auth as per the instructions (which was only possible because I had sufficient shell history and config yamls cached away - see my gripes in another thread about broken docs for more context). Anyway, as soon as I created one user per those instructions, I discovered that I could no longer login to kubeadmin via the console. So I had to define the kubeadmin as a user as well in the htpasswd secret.

I appreciate the fact that this is a developer preview, but not being able to survive beyond 36 hours puts this problem in a special class. Not sure when exactly it broke since I was asleep / away for the last 8 of those 36.

In the meantime, if there is a mechanism to reset the certs to a working state, that can get me up and running. In fact, that may be the only option for me since I have to demo the cluster, and it may take me 8 hours to get setup for the demo and atleast 24 hours for the demo to be scheduled.

If there are logs that need to be gathered so that diagnosis can be done by the redhat folks, please let me know. Otherwise, I plan to wipe the cluster tomorrow morning and try once more.

Regards,

Marvin

W. Trevor King

unread,

Mar 31, 2019, 3:35:46 PM3/31/19

to Just Marvin, OpenShift 4 Developer Preview

On Sun, Mar 31, 2019, 11:41 Just Marvin wrote:

In the meantime, if there is a mechanism to reset the certs to a working state, that can get me up and running.

Not yet, that's [1].

> If there are logs that need to be gathered so that diagnosis can be done by the redhat folks, please let me know. Otherwise, I plan to wipe the cluster tomorrow morning and try once more.

This isn't my space, so take these suggestions with a grain of salt, but my first question would be whether the breakage was killing containers or not. Checking container age (probably via SSH and crictl, with the Kubernetes API down) will tell you that. Then getting logs from the time of the breakage (maybe hard for you on this cluster, so long after the fact, but for next time) from containers that just died, or, if they didn't die, the kubelets, openshift-kube-apiserver pods, openshift-kube-controller pods, openshift-apiserver pods, and openshift-controller-manager pods. Maybe also check kubelet cert expirations following [2]. Again, I don't really know what you'll be looking for, so you'll need to sniff out anything that seems suspicious. And also again, we expect soaking CI soon, in which case the CI logs will presumably capture tge issue you're hitting without you needing to do any legwork.

Cheers,

Trevor

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1694079

[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1693951#c0

Just Marvin

unread,

Apr 2, 2019, 6:31:29 AM4/2/19

to OpenShift 4 Developer Preview

Trevor,

FYI - destroy appears to have challenges as well:

INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080546Z is now earlier than 20190402T102426Z (20190402T102926Z - 5 min.)

status code: 400, request id: 30c519d8-5532-11e9-9fbb-b1dd33773a42

INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080546Z is now earlier than 20190402T102426Z (20190402T102926Z - 5 min.)

status code: 400, request id: 30e943a9-5532-11e9-9fbb-b1dd33773a42

INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080547Z is now earlier than 20190402T102427Z (20190402T102927Z - 5 min.)

status code: 400, request id: 30fddd3d-5532-11e9-b8d3-a39c4cd9b6c1

INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080547Z is now earlier than 20190402T102427Z (20190402T102927Z - 5 min.)

status code: 400, request id: 310f4263-5532-11e9-b8d3-a39c4cd9b6c1

INFO SignatureDoesNotMatch: Signature expired: 20190401T080547Z is now earlier than 20190402T101427Z (20190402T102927Z - 15 min.)

status code: 403, request id: 3114244f-5532-11e9-a3aa-f34bdbe8f45a

INFO SignatureDoesNotMatch: Signature expired: 20190401T080547Z is now earlier than 20190402T101427Z (20190402T102927Z - 15 min.)

status code: 403, request id: 3118df57-5532-11e9-a3aa-f34bdbe8f45a

Regards,

Marvin

Just Marvin

unread,

Apr 2, 2019, 6:54:19 AM4/2/19

to OpenShift 4 Developer Preview

Nevermind - my KVM's date / time was waaaay out of synch.

W. Trevor King

unread,

Apr 2, 2019, 8:39:24 AM4/2/19

to Just Marvin, OpenShift 4 Developer Preview

On Tue, Apr 2, 2019, 03:31 Just Marvin wrote:

INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080546Z is now earlier than 20190402T102426Z (20190402T102926Z - 5 min.)

This means you and Amazon disagree on the current time. You can fix your clock, and maybe consider running an NTP daemon ;).

Cheers,

Trevor

Just Marvin

unread,

Apr 3, 2019, 3:23:07 PM4/3/19

to OpenShift 4 Developer Preview

Trevor,

Here we go again.....a little more than 24 hours after install, I see:

[friar@oc6344180105 app]$ oc get is,bc,dc,svc,route,pvc --export -l app=teamdb -o json

error: the server doesn't have a resource type "is"

[friar@oc6344180105 app]$ oc whoami

error: You must be logged in to the server (Unauthorized)

[friar@oc6344180105 app]$ oc login -u system:admin

Error from server (InternalError): Internal error occurred: unexpected response: 400

[friar@oc6344180105 app]$

Is build 16 expected to be any better in behavior?

Regards,

Marvin

Edward Callahan

unread,

Apr 3, 2019, 3:28:27 PM4/3/19

to OpenShift 4 Developer Preview

We are seeing this too. We build a cluster and start testing with it. The next morning it is unresponsive. `EOF` response from `oc login` and connection refused from browser.

Just Marvin

unread,

Apr 3, 2019, 3:58:04 PM4/3/19

to OpenShift 4 Developer Preview

Hi,

Funnily enough, the workloads that I deployed on the cluster are still working. These are using secrets, so some level of the API is functional. But everything "oc" is dead.

Regards,

Marvin

--
You received this message because you are subscribed to the Google Groups "OpenShift 4 Developer Preview" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openshift-4-dev-p...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openshift-4-dev-preview/9eab8626-f9fe-452a-91c3-c79a584c1dad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Corey Daley

unread,

Apr 3, 2019, 4:24:58 PM4/3/19

to Just Marvin, OpenShift 4 Developer Preview

I have had a similar issue and had to re-export my KUBECONFIG that was created during the install, and was then able to access the cluster again.

I did not try with the password.

To view this discussion on the web visit https://groups.google.com/d/msgid/openshift-4-dev-preview/CAAy1pe2eUSm0SJ20spt8LLPJotd7bKjocAcDYAyoeNqA0d5U8Q%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

--

Corey Daley

Senior Software Engineer, OpenShift Developer Experience

Red Hat

100 East Davie Street

Raleigh, NC 27601

cda...@redhat.com T: (919)-754-4623 M: (270)-996-3065

TRIED. TESTED. TRUSTED.

@redhatnews Red Hat Red Hat

Just Marvin

unread,

May 1, 2019, 8:42:41 PM5/1/19

to OpenShift 4 Developer Preview

Yesssss!! This problem is finally fixed. The cluster still goes bonkers if shut down within 24 hours after install. But if kept running for atleast 24 hours, it does seem to survive intact.

Regards,

Marvin

On Wednesday, April 3, 2019 at 3:58:04 PM UTC-4, Just Marvin wrote:

Hi,

Funnily enough, the workloads that I deployed on the cluster are still working. These are using secrets, so some level of the API is functional. But everything "oc" is dead.

Regards,
Marvin

On Wed, Apr 3, 2019 at 3:28 PM Edward Callahan <ed.ca...@lightbend.com> wrote:

We are seeing this too. We build a cluster and start testing with it. The next morning it is unresponsive. `EOF` response from `oc login` and connection refused from browser.

On Wednesday, April 3, 2019 at 12:23:07 PM UTC-7, Just Marvin wrote:
Trevor,

Here we go again.....a little more than 24 hours after install, I see:

[friar@oc6344180105 app]$ oc get is,bc,dc,svc,route,pvc --export -l app=teamdb -o json
error: the server doesn't have a resource type "is"
[friar@oc6344180105 app]$ oc whoami
error: You must be logged in to the server (Unauthorized)
[friar@oc6344180105 app]$ oc login -u system:admin
Error from server (InternalError): Internal error occurred: unexpected response: 400
[friar@oc6344180105 app]$

Is build 16 expected to be any better in behavior?

Regards,
Marvin

--
You received this message because you are subscribed to the Google Groups "OpenShift 4 Developer Preview" group.

To unsubscribe from this group and stop receiving emails from it, send an email to openshift-4-dev-preview+unsub...@googlegroups.com.

Reply all

Reply to author

Forward