IPI on VMware :: Problems with Operator authentication & console

I have been having an absolute nightmare for two months with OpenShift 4.7 running as a bare-metal install on a VM cluster.

I was getting a lot of timeouts and sockets stuck in TIME_WAIT on ports 443 and 6443.

I was constantly getting timeouts and failures in both the bootstrap and the install phases.

I have literally rebuilt the entire cluster from scratch about 20-30 times.

I first tried ESXi 7.0b as the hypervisor, but recently I got so fed up that I rebuilt the entire environment on Proxmox ... and I still get exactly the same problem.

It seems you just have to keep restarting the bootstrap and install phases repeatedly until you get lucky. When I check the packet captures I see HAProxy FIN the sessions (originating from the source), so I don't actually see packet loss as such! But I need to make sure I identify the correct sessions to trap.
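To trap the right sessions I'm capturing with something like this (the interface name and the VIP are placeholders taken from my environment, adjust for yours):

```shell
# Capture only API (6443) and ingress (443) traffic involving the load-balancer VIP.
# Interface name (ens192) and address are placeholders for my setup.
tcpdump -i ens192 -w api-lb.pcap \
  '(port 6443 or port 443) and host 192.168.100.203'

# Afterwards, check which side sends the FIN first:
tshark -r api-lb.pcap -Y 'tcp.flags.fin == 1'
```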

Anybody have any ideas why the problem could be the same on both ESXi and Proxmox?

The install fails with the following:

INFO Cluster operator network ManagementStateDegraded is False with :

ERROR Cluster initialization failed because one or more operators are not functioning properly.

ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,

ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html

ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation

FATAL failed to initialize the cluster: Working towards 4.7.6: 621 of 668 done (92% complete)
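For reference, the subcommand the installer mentions can be rerun from the asset directory like this (the `--dir` path is a placeholder for my install directory):

```shell
# Resume watching cluster initialization after a failed run:
openshift-install wait-for install-complete --dir=./mycluster --log-level=debug
```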

The log file barfs with:

time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator authentication Degraded is True with WellKnownReadyController_SyncError: WellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://192.168.100.203:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator authentication Progressing is True with OAuthServerDeployment_ReplicasNotReady::OAuthVersionDeployment_ReplicasNotReady: OAuthVersionDeploymentProgressing: Waiting for all OAuth server replicas to be ready (1 not ready, container is not ready in oauth-openshift-7c9cdb4c96-zblcr pod)\nOAuthServerDeploymentProgressing: Waiting for all OAuth server replicas to be ready (1 not ready, container is not ready in oauth-openshift-7c9cdb4c96-zblcr pod)"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator authentication Available is False with WellKnown_NotReady: WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://192.168.100.203:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
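I've been probing the endpoint the authentication operator is waiting on, and the operators the message points at, roughly like this (the kubeconfig path is the installer default for my install directory, adjust as needed):

```shell
# Manually hit the well-known OAuth endpoint from the log above
# (-k skips TLS verification, fine for a quick probe):
curl -k https://192.168.100.203:6443/.well-known/oauth-authorization-server

# Check the kube-apiserver rollout the message says to watch:
export KUBECONFIG=./mycluster/auth/kubeconfig
oc get clusteroperators kube-apiserver authentication
oc get pods -n openshift-kube-apiserver
```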

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.7.6"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator console Available is False with Deployment_FailedUpdate: DeploymentAvailable: 2 replicas ready at version 4.7.6"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator csi-snapshot-controller Progressing is True with Deploying: Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator csi-snapshot-controller Available is False with Deploying: Available: Waiting for Deployment to deploy csi-snapshot-controller pods"

time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator image-registry Degraded is True with ImagePrunerJobFailed::Removed: Degraded: The registry is removed\nImagePrunerDegraded: Job has reached the specified backoff limit"

time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller \"default\" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod \"router-default-6cf4c4788-gw5bb\" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.)"
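That last ingress error says the router pod has nowhere to schedule: three tainted masters plus one node with the ports already taken. One thing I still need to rule out is whether the worker CSRs were ever approved, since workers that never join would explain exactly this symptom (kubeconfig path is a placeholder for my install directory):

```shell
export KUBECONFIG=./mycluster/auth/kubeconfig

# Do any worker nodes show up at all?
oc get nodes

# Workers can't join until their CSRs are approved:
oc get csr
oc get csr -o name | xargs oc adm certificate approve
```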

/r/openshift Thread