Install of OKD 4.9 on vSphere doesn't complete cluster creation

Gianluca Cecchi

Mar 9, 2022, 12:29:30 PM
to okd-wg
Hello,
I'm experimenting with installing OKD 4.9 using IPI on vSphere 7.0.2 with 3 ESXi hosts.
The command
./openshift-install create cluster --dir myocp_install --log-level=info

creates the expected objects, such as the Fedora CoreOS template, then the bootstrap VM, then the 3 masters (each one on a different ESXi host), and gets to the point where the bootstrap node has been destroyed as expected, but cluster creation doesn't complete and no worker node VM has been created yet.
In the output I get:

[INFO] Obtaining RHCOS image file from 'https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/34.20210626.3.1/x86_64/fedora-coreos-34.20210626.3.1-vmware.x86_64.ova?sha256='
[INFO] The file was found in cache: /home/g.cecchi/.cache/openshift-installer/image_cache/fedora-coreos-34.20210626.3.1-vmware.x86_64.ova. Reusing...
[INFO] Creating infrastructure resources...
[INFO] Waiting up to 20m0s for the Kubernetes API at https://api.myocp.localdomain.local:6443...
[INFO] API v1.22.1-1839+b93fd35dd03051-dirty up
[INFO] Waiting up to 30m0s for bootstrapping to complete...
[INFO] Destroying the bootstrap resources...
[INFO] Waiting up to 40m0s for the cluster at https://api.myocp.localdomain.local:6443 to initialize...
E0309 01:42:24.532792    3630 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.myocp.localdomain.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 192.168.161.111:6443: connect: connection refused
E0309 01:42:25.371765    3630 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.myocp.localdomain.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 192.168.161.111:6443: connect: connection refused
I0309 01:42:40.851815    3630 trace.go:205] Trace[569575566]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (09-Mar-2022 01:42:28.487) (total time: 12364ms):
Trace[569575566]: ---"Objects listed" 12364ms (01:42:00.851)
Trace[569575566]: [12.364463121s] [12.364463121s] END
[ERROR] Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
[ERROR] OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.myocp.localdomain.local in route oauth-openshift in namespace openshift-authentication
[ERROR] OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
[ERROR] OAuthServerDeploymentDegraded:
[ERROR] OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a host address
[ERROR] OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.124.149:443/healthz": dial tcp 172.30.124.149:443: connect: connection refused
...
[ERROR] updating grafana: waiting for Grafana Route to become ready failed: waiting for route openshift-monitoring/grafana: no status available
[ERROR] updating kube-state-metrics: reconciling kube-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/kube-state-metrics: got 1 unavailable replicas
[ERROR] updating openshift-state-metrics: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
[ERROR] updating prometheus-adapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: got 2 unavailable replicas
[INFO] Cluster operator network ManagementStateDegraded is False with :
[INFO] Cluster operator network Progressing is True with Deploying: Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
[ERROR] Cluster initialization failed because one or more operators are not functioning properly.
...

In .openshift_install.log it seems that 716 of 745 items are done within a few minutes, but then nothing else progresses:

time="2022-03-09T01:44:48+01:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-02-12-140851: 716 of 745 done (96% complete)"
time="2022-03-09T01:45:03+01:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-apiserver, monitoring"
time="2022-03-09T01:45:44+01:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-apiserver, monitoring"
time="2022-03-09T01:45:48+01:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-02-12-140851: 719 of 745 done (96% complete)"
time="2022-03-09T01:46:03+01:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.9.0-0.okd-2022-02-12-140851: 722 of 745 done (96% complete)"
time="2022-03-09T01:48:18+01:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, monitoring"
time="2022-03-09T02:22:19+01:00" level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nOAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.myocp.localdomain.local in route oauth-openshift in namespace openshift-authentication\nOAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication\nOAuthServerDeploymentDegraded: \nOAuthServerRouteEndpointAccessibleControllerDegraded: route \"openshift-authentication/oauth-openshift\": status does not have a host address\nOAuthServerServiceEndpointAccessibleControllerDegraded: Get \"https://172.30.124.149:443/healthz\": dial tcp 172.30.124.149:443: connect: connection refused\nOAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nWellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap \"oauth-openshift\" not found (check authentication operator, it is supposed to create this)"
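
With the kubeconfig that the installer writes under myocp_install/auth/ I can also query the cluster directly; for example (assuming the oc client is available), something along these lines:

export KUBECONFIG=myocp_install/auth/kubeconfig
oc get clusteroperators                                   # which operators are Degraded / still Progressing
oc get nodes                                              # only the 3 masters show up, no workers
oc -n openshift-machine-api get machines                  # worker Machine objects, presumably stuck provisioning
oc -n openshift-authentication get route oauth-openshift -o yaml   # the route from the errors above, with no admitted ingress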

One doubt: do the master nodes have to be able to reach the vCenter Server and the ESXi hosts by name?

I can ssh to the master nodes as the core user with my public key and run any debug/verification command if needed.
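
For example (the vCenter hostname below is just a placeholder for the real GCVE one), a quick check of name resolution and reachability from a master could be something like:

ssh core@<master-ip> 'cat /etc/resolv.conf'                                       # which DNS servers the node actually uses
ssh core@<master-ip> 'curl -skv https://vcsa.example.gve.goog/sdk -o /dev/null'   # a resolution failure shows up as "Could not resolve host"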

Thanks in advance,
Gianluca

Gianluca Cecchi

Mar 10, 2022, 6:02:04 AM
to okd-wg
On Wed, Mar 9, 2022 at 6:29 PM Gianluca Cecchi <gianluc...@gmail.com> wrote:
Hello,
I'm experimenting with installing OKD 4.9 using IPI on vSphere 7.0.2 with 3 ESXi hosts.
The command
./openshift-install create cluster --dir myocp_install --log-level=info

creates the expected objects, such as the Fedora CoreOS template, then the bootstrap VM, then the 3 masters (each one on a different ESXi host), and gets to the point where the bootstrap node has been destroyed as expected, but cluster creation doesn't complete and no worker node VM has been created yet.

[snip]

One doubt: do the master nodes have to be able to reach the vCenter Server and the ESXi hosts by name?



Ok, indeed that was the problem. The environment is inside GCVE and I had configured the vcsa and ESXi host entries only in /etc/hosts of the workstation from which I was running the install. This workstation also acts as the DHCP and DNS server for the OKD nodes.
I have now added the IPs of the private cloud DNS servers (shown in the GCVE dashboard) as forwarders in its BIND configuration; the master nodes can now resolve the vSphere names and cluster deployment completes successfully in 38 minutes.
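
Roughly, the change in named.conf is along these lines (the IPs below are placeholders for the real GCVE private cloud DNS servers):

options {
        ...
        forwarders {
                10.1.2.3;    # private cloud DNS server 1 (placeholder)
                10.1.2.4;    # private cloud DNS server 2 (placeholder)
        };
        ...
};
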
It would be nice to know how I could have debugged this from any log file on the masters...
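
(I suppose the relevant errors are actually in the machine-api controller logs rather than in the journals on the masters; with the cluster kubeconfig, something like this might have shown the failed connections to vCenter:)

oc -n openshift-machine-api get pods                                                     # the machine-api-controllers pod runs the vSphere machine controller
oc -n openshift-machine-api logs deploy/machine-api-controllers -c machine-controller    # container name from memory, adjust if needed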

Gianluca