authentication fails (again) - cluster unstable

Just Marvin

Mar 29, 2019, 5:01:33 AM
to OpenShift 4 Developer Preview
Hi,

    This is the second time my install has failed after two days, in the same way: passwords no longer work. The "oc login" command produces an "error: EOF" message, and "oc whoami" says that I'm not authenticated. No configuration changes were made to the cluster itself since it was installed. Here is some verbose output from the login command:

[friar@oc6344180105 ocp4]$ oc login -u kubeadmin -p <kubeadmin password> --v=10
I0328 22:08:59.483578   28404 loader.go:359] Config loaded from file /home/friar/ocp4/auth/kubeconfig
I0328 22:08:59.484939   28404 loader.go:359] Config loaded from file /home/friar/ocp4/auth/kubeconfig
I0328 22:08:59.485298   28404 round_trippers.go:386] curl -k -v -XHEAD  'https://api.gatt.random.domain.name:6443/'
I0328 22:08:59.753744   28404 round_trippers.go:405] HEAD https://api.gatt.random.domain.name:6443/ 403 Forbidden in 268 milliseconds
I0328 22:08:59.753772   28404 round_trippers.go:411] Response Headers:
I0328 22:08:59.753777   28404 round_trippers.go:414]     Audit-Id: ac56d8fe-c782-4fb9-946f-78f96a594298
I0328 22:08:59.753782   28404 round_trippers.go:414]     Cache-Control: no-store
I0328 22:08:59.753786   28404 round_trippers.go:414]     Content-Type: application/json
I0328 22:08:59.753790   28404 round_trippers.go:414]     X-Content-Type-Options: nosniff
I0328 22:08:59.753794   28404 round_trippers.go:414]     Content-Length: 186
I0328 22:08:59.753798   28404 round_trippers.go:414]     Date: Fri, 29 Mar 2019 08:49:10 GMT
I0328 22:08:59.753856   28404 round_trippers.go:386] curl -k -v -XGET  -H "X-Csrf-Token: 1" 'https://api.gatt.random.domain.name:6443/.well-known/oauth-authorization-server'
I0328 22:08:59.805970   28404 round_trippers.go:405] GET https://api.gatt.random.domain.name:6443/.well-known/oauth-authorization-server 200 OK in 52 milliseconds
I0328 22:08:59.805985   28404 round_trippers.go:411] Response Headers:
I0328 22:08:59.805990   28404 round_trippers.go:414]     Cache-Control: no-store
I0328 22:08:59.805994   28404 round_trippers.go:414]     Content-Type: application/json
I0328 22:08:59.805998   28404 round_trippers.go:414]     Content-Length: 762
I0328 22:08:59.806002   28404 round_trippers.go:414]     Date: Fri, 29 Mar 2019 08:49:10 GMT
I0328 22:08:59.806006   28404 round_trippers.go:414]     Audit-Id: 64162c52-2ee8-444c-9220-1be36704ba52
I0328 22:08:59.806633   28404 round_trippers.go:386] curl -k -v -XHEAD  'https://openshift-authentication-openshift-authentication.apps.gatt.random.domain.name'
I0328 22:08:59.916533   28404 round_trippers.go:405] HEAD https://openshift-authentication-openshift-authentication.apps.gatt.random.domain.name  in 109 milliseconds
I0328 22:08:59.916554   28404 round_trippers.go:411] Response Headers:
I0328 22:08:59.916563   28404 request_token.go:440] falling back to kubeconfig CA due to possible IO error: EOF
I0328 22:09:00.021730   28404 round_trippers.go:411] Response Headers:
F0328 22:09:00.021788   28404 helpers.go:119] error: EOF
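For what it's worth, when "oc login" dies with a bare EOF like this, one thing worth checking (a guess, not an official procedure) is whether the OAuth route still completes a TLS handshake, and what certificate it presents:

```shell
# Hedged diagnostic: inspect the cert served by the OAuth route. The
# hostname is the one from the log above; substitute your cluster's route.
HOST=openshift-authentication-openshift-authentication.apps.gatt.random.domain.name
echo | openssl s_client -connect "$HOST:443" -servername "$HOST" 2>/dev/null \
  | openssl x509 -noout -subject -dates
```

If the second openssl step prints nothing, the handshake itself failed, which would match an EOF on the client side; if it prints dates, check whether notAfter is already in the past.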

    And here is the install-config used for the install:

apiVersion: v1beta4
sshKey: ssh-rsa <key value>
compute:
- name: worker
  platform:
    aws:
      type: c5.xlarge
      rootVolume:
        size: 50
        type: gp2
  replicas: 2
controlPlane:
  name: master
  platform:
    aws:
      type: m4.xlarge
      rootVolume:
        size: 80
        type: gp2
  replicas: 2
metadata:
  creationTimestamp: null
  name: gatt
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
platform:
  aws:
    region: us-west-1
pullSecret: <pull secret>

    Is there a way to get the cluster back on its feet, or am I forced to reinstall?

Regards,
Marvin

Just Marvin

Mar 29, 2019, 9:49:07 AM
to OpenShift 4 Developer Preview
Hi,

    When I ssh into the master, I see this:

---
[systemd]
Failed Units: 8
  rdisc.service
  tcsd.service
  sssd-autofs.socket
  sssd-nss.socket
  sssd-pac.socket
  sssd-pam-priv.socket
  sssd-ssh.socket
  sssd-sudo.socket
[core@ip-10-0-129-170 ~]$ 

    What does that mean and how would I fix it?

Regards,
Marvin

W. Trevor King

Mar 29, 2019, 9:52:59 AM
to Just Marvin, OpenShift 4 Developer Preview
On Fri, Mar 29, 2019, 06:49 Just Marvin wrote:
    When I ssh into the master, I see this:

---
[systemd]
Failed Units: 8
  rdisc.service
  tcsd.service
  sssd-autofs.socket
  sssd-nss.socket
  sssd-pac.socket
  sssd-pam-priv.socket
  sssd-ssh.socket
  sssd-sudo.socket

Just Marvin

Mar 29, 2019, 10:03:30 AM
to W. Trevor King, OpenShift 4 Developer Preview
Trevor,

    The bug you pointed to has an emphasis on solving the problem of not being able to ssh in. I was able to ssh in. Is there cause to be concerned even when there is no “error” message being reported (as per the bug)?

Regards,
Marvin

Just Marvin

Mar 31, 2019, 12:41:20 PM
to OpenShift 4 Developer Preview
Hi,

    Quoting bug report ( https://bugzilla.redhat.com/show_bug.cgi?id=1693951 ) referenced by Trevor in the "start / stop" related thread:

<quote>But if you want to shut down nodes, you'll certainly want to wait after the initial install, for a whole day or however long it takes for the first in-cluster rotations to go through, to get certs with longer validity times before shutting down nodes. The auth/master teams may also have some advice for monitoring those rotations, even if it's just "grep the kube-apiserver-operator logs". Maybe there are Kubernetes Events you can watch for? I dunno.

Alternatively, you can just let the certs expire, and when the cluster comes back up, use SSH (which we don't expire/rotate) to go through and rebuild the x.509 chains.</quote>
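In case it helps anyone following along, here is a sketch of what monitoring those rotations might look like. The namespace and secret names are assumptions based on a 4.0-era cluster, not verified against this one:

```shell
# Sketch: watch for the first in-cluster cert rotation. Names below are
# assumptions; confirm them with "oc get secrets -A" on your own cluster.
export KUBECONFIG=./auth/kubeconfig

# "grep the kube-apiserver-operator logs", as the bug suggests:
oc -n openshift-kube-apiserver-operator logs \
  deployment/kube-apiserver-operator | grep -i rotat

# Decode a serving cert and print its validity window:
oc -n openshift-kube-apiserver get secret serving-cert \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
```

Once the notAfter date in that output is days rather than hours away, the longer-lived certs have presumably landed.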

    I'm at my wit's end here. I left the cluster up for at least 36 hours, and when I tried logging in at that point, I got an HTTP 400 error response back. This was different from the "EOF" string I usually got when it had previously died. I thought that perhaps this was a different symptom and decided to bounce the nodes. But when the nodes came back up, I got the familiar EOF error on login. I have asked before, and I'll repeat it again: is there a way to get the cluster back to normal?

    Otherwise I'm faced with another reinstall, and left wondering why I'm bothering with a product that stops working after about a day.

Regards,
Marvin


W. Trevor King

Mar 31, 2019, 1:14:41 PM
to Just Marvin, OpenShift 4 Developer Preview
On Sun, Mar 31, 2019, 09:41 Just Marvin wrote:
    I'm at my wits end here. I let the cluster up for at least 36 hours, and when I tried logging in at that point, got an http 400 error response back.

Just to be clear, this cluster was running that whole time, without having nodes stopped or started?  We currently have good CI coverage for the first hour post install.  There's CI coverage for upgrades [1] which run a bit longer, but we still have some kinks to work out there.  Some of those are upgrade-related, but some of them are from the volume of clusters we run straining our test infrastructure; quieter accounts should have higher success rates.  We're in the process of building soak-tests to ensure clusters remain healthy for longer (initially a week or so).  But if we had all the bugs fixed, we'd have had a general release ;).  Until then, the more information we get into (new?) bug reports about symptoms, reproducers, error logs and such, the easier it is for us to fix the issue you've bumped into.  Or wait until we've built out the CI to hit those bugs ourselves.

Cheers,
Trevor

Just Marvin

Mar 31, 2019, 2:41:22 PM
to OpenShift 4 Developer Preview
Trevor,

    That's correct. The cluster and its settings were left intact from the point of install for at least 36 hours. I was able to get useful work done for about 12 - 14 of those hours. I will note that one significant change was to set up htpasswd auth per the instructions (which was only possible because I had sufficient shell history and config YAMLs cached away - see my gripes in another thread about broken docs for more context). Anyway, as soon as I created one user per those instructions, I discovered that I could no longer log in as kubeadmin via the console. So I had to define kubeadmin as a user in the htpasswd secret as well.

    I appreciate the fact that this is a developer preview, but not being able to survive beyond 36 hours puts this problem in a special class. Not sure when exactly it broke since I was asleep / away for the last 8 of those 36.

    In the meantime, if there is a mechanism to reset the certs to a working state, that would get me up and running. In fact, that may be the only option for me, since I have to demo the cluster, and it may take me 8 hours to get set up for the demo and at least 24 hours for the demo to be scheduled.

    If there are logs that need to be gathered so that the Red Hat folks can diagnose this, please let me know. Otherwise, I plan to wipe the cluster tomorrow morning and try once more.

Regards,
Marvin

W. Trevor King

Mar 31, 2019, 3:35:46 PM
to Just Marvin, OpenShift 4 Developer Preview
On Sun, Mar 31, 2019, 11:41 Just Marvin wrote:
    In the meantime, if there is a mechanism to reset the certs to a working state, that can get me up and running.

Not yet, that's [1].

> If there are logs that need to be gathered so that diagnosis can be done by the redhat folks, please let me know. Otherwise, I plan to wipe the cluster tomorrow morning and try once more.

This isn't my space, so take these suggestions with a grain of salt, but my first question would be whether the breakage was killing containers or not.  Checking container age (probably via SSH and crictl, with the Kubernetes API down) will tell you that.  Then getting logs from the time of the breakage (maybe hard for you on this cluster, so long after the fact, but for next time) from containers that just died, or, if they didn't die, the kubelets, openshift-kube-apiserver pods, openshift-kube-controller pods, openshift-apiserver pods, and openshift-controller-manager pods.  Maybe also check kubelet cert expirations following [2].  Again, I don't really know what you'll be looking for, so you'll need to sniff out anything that seems suspicious.  And also again, we expect soaking CI soon, in which case the CI logs will presumably capture the issue you're hitting without you needing to do any legwork.
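A sketch of those two checks, for concreteness. The host below is the example master from earlier in the thread, and the kubelet cert path is the upstream kubelet default, which I haven't verified on RHCOS:

```shell
# Sketch of the diagnostics above (host and cert path are assumptions).
MASTER=ip-10-0-129-170

# 1. Container ages: did the breakage kill containers? (see CREATED column)
ssh core@"$MASTER" 'sudo crictl ps -a'

# 2. Kubelet client cert expiration:
ssh core@"$MASTER" 'sudo openssl x509 -noout -dates \
  -in /var/lib/kubelet/pki/kubelet-client-current.pem'
```

If the notAfter date in step 2 is earlier than the time the cluster went unresponsive, expired kubelet certs are a plausible culprit.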

Cheers,
Trevor

Just Marvin

Apr 2, 2019, 6:31:29 AM
to OpenShift 4 Developer Preview
Trevor,

    FYI - destroy appears to have challenges as well:

INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080546Z is now earlier than 20190402T102426Z (20190402T102926Z - 5 min.)
status code: 400, request id: 30c519d8-5532-11e9-9fbb-b1dd33773a42 
INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080546Z is now earlier than 20190402T102426Z (20190402T102926Z - 5 min.)
status code: 400, request id: 30e943a9-5532-11e9-9fbb-b1dd33773a42 
INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080547Z is now earlier than 20190402T102427Z (20190402T102927Z - 5 min.)
status code: 400, request id: 30fddd3d-5532-11e9-b8d3-a39c4cd9b6c1 
INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080547Z is now earlier than 20190402T102427Z (20190402T102927Z - 5 min.)
status code: 400, request id: 310f4263-5532-11e9-b8d3-a39c4cd9b6c1 
INFO SignatureDoesNotMatch: Signature expired: 20190401T080547Z is now earlier than 20190402T101427Z (20190402T102927Z - 15 min.)
status code: 403, request id: 3114244f-5532-11e9-a3aa-f34bdbe8f45a 
INFO SignatureDoesNotMatch: Signature expired: 20190401T080547Z is now earlier than 20190402T101427Z (20190402T102927Z - 15 min.)
status code: 403, request id: 3118df57-5532-11e9-a3aa-f34bdbe8f45a 

Regards,
Marvin

Just Marvin

Apr 2, 2019, 6:54:19 AM
to OpenShift 4 Developer Preview
Never mind - my KVM's date / time was waaaay out of sync.

W. Trevor King

Apr 2, 2019, 8:39:24 AM
to Just Marvin, OpenShift 4 Developer Preview
On Tue, Apr 2, 2019, 03:31 Just Marvin wrote:
INFO get tagged resources: InvalidSignatureException: Signature expired: 20190401T080546Z is now earlier than 20190402T102426Z (20190402T102926Z - 5 min.)

This means you and Amazon disagree on the current time.  You can fix your clock, and maybe consider running an NTP daemon ;).
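Decoding the timestamps in that error shows just how far off the clock was. A quick sketch, assuming bash and GNU date (nothing cluster-specific here):

```shell
# The AWS signature error compares the request's signing time with Amazon's
# clock. Reproducing the comparison locally with the values from the log:
sig=20190401T080546Z   # timestamp the client signed the request with
now=20190402T102426Z   # Amazon's "now", minus the 5-minute tolerance window
sig_s=$(date -u -d "${sig:0:8} ${sig:9:2}:${sig:11:2}:${sig:13:2}" +%s)
now_s=$(date -u -d "${now:0:8} ${now:9:2}:${now:11:2}:${now:13:2}" +%s)
echo "clock skew: $(( (now_s - sig_s) / 3600 )) hours"
# prints: clock skew: 26 hours
```

Anything beyond the 5-minute window gets the request rejected, so a skew this large fails every call.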

Cheers,
Trevor

Just Marvin

Apr 3, 2019, 3:23:07 PM
to OpenShift 4 Developer Preview
Trevor,

    Here we go again... a little more than 24 hours after install, I see:

[friar@oc6344180105 app]$ oc get is,bc,dc,svc,route,pvc --export -l app=teamdb -o json
error: the server doesn't have a resource type "is"
[friar@oc6344180105 app]$ oc whoami
error: You must be logged in to the server (Unauthorized)
[friar@oc6344180105 app]$ oc login -u system:admin
Error from server (InternalError): Internal error occurred: unexpected response: 400
[friar@oc6344180105 app]$ 

    Is build 16 expected to be any better in behavior?

Regards,
Marvin

Edward Callahan

Apr 3, 2019, 3:28:27 PM
to OpenShift 4 Developer Preview

We are seeing this too. We build a cluster and start testing with it. The next morning it is unresponsive. `EOF` response from `oc login` and connection refused from browser.

Just Marvin

Apr 3, 2019, 3:58:04 PM
to OpenShift 4 Developer Preview
Hi,

    Funnily enough, the workloads that I deployed on the cluster are still working. These are using secrets, so some level of the API is functional. But everything "oc" is dead.

Regards,
Marvin

--
You received this message because you are subscribed to the Google Groups "OpenShift 4 Developer Preview" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openshift-4-dev-p...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openshift-4-dev-preview/9eab8626-f9fe-452a-91c3-c79a584c1dad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Corey Daley

Apr 3, 2019, 4:24:58 PM
to Just Marvin, OpenShift 4 Developer Preview
I have had a similar issue; after re-exporting the KUBECONFIG that was created during the install, I was able to access the cluster again.
I did not try with the password.
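In other words (a sketch; the path is the one from Marvin's logs and will differ per install):

```shell
# Point oc back at the installer-generated kubeconfig, which authenticates
# with an admin client certificate rather than an oauth token, so it can
# keep working while password login is broken.
export KUBECONFIG="$HOME/ocp4/auth/kubeconfig"
oc whoami   # with a still-valid client cert this reports system:admin
```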




--
Corey Daley
Senior Software Engineer, OpenShift Developer Experience
Red Hat
100 East Davie Street
Raleigh, NC 27601
cda...@redhat.com    T: (919)-754-4623    M: (270)-996-3065

Just Marvin

May 1, 2019, 8:42:41 PM
to OpenShift 4 Developer Preview
Yesssss!! This problem is finally fixed. The cluster still goes bonkers if shut down within 24 hours of install. But if kept running for at least 24 hours, it does seem to survive intact.

Regards,
Marvin

