I have been having an absolute nightmare for 2 months with OpenShift 4.7 deployed as a bare-metal install onto a VM cluster.
I was seeing a lot of connections stuck in TIME_WAIT, plus timeouts, on ports 443 and 6443.
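For anyone who wants to check for the same thing, something like this on the load balancer node shows the build-up (a rough sketch; the ports match my VIPs, adjust for yours):

# count sockets stuck in TIME_WAIT for the API and ingress ports
ss -tan state time-wait '( sport = :6443 or dport = :6443 or sport = :443 or dport = :443 )' | wc -l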
I was constantly getting timeouts and failures in both the bootstrap and the install phases.
I have literally rebuilt the entire cluster from scratch 20-30 times.
I first tried ESXi 7.0b as the hypervisor, but recently I got so fed up that I rebuilt the entire environment on Proxmox ... and I still get exactly the same problem.
It seems you just have to keep restarting the bootstrap and install phases until you get lucky and it succeeds. When I check the packet captures I see HAProxy sending the FIN to tear down the sessions (the FIN originates from the source side), so I don't actually see packet loss as such! But I need to make sure I identify the correct sessions to trap.
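This is the sort of capture I'm running on the HAProxy node to try to isolate the right sessions (192.168.100.203 is my API VIP, as per the logs below; swap in your own, then trace the FINs back in Wireshark):

# capture everything to/from the API and ingress ports on all interfaces
tcpdump -nn -i any -w bootstrap-api.pcap 'host 192.168.100.203 and (port 6443 or port 443)'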
Does anybody have any ideas why the problem could be identical on both ESXi and Proxmox?
The install fails with the output below:
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Working towards 4.7.6: 621 of 668 done (92% complete)
The log file barfs with:
time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator authentication Degraded is True with WellKnownReadyController\SyncError: WellKnownReadyControllerDegraded: kube-apiserver oauth endpoint) https://192.168.100.203:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance")
time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator authentication Progressing is True with OAuthServerDeployment\ReplicasNotReady::OAuthVersionDeployment_ReplicasNotReady: OAuthVersionDeploymentProgressing: Waiting for all OAuth server replicas to be ready (1 not ready, container is not ready in oauth-openshift-7c9cdb4c96-zblcr pod)\nOAuthServerDeploymentProgressing: Waiting for all OAuth server replicas to be ready (1 not ready, container is not ready in oauth-openshift-7c9cdb4c96-zblcr pod)")
time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator authentication Available is False with WellKnown\NotReady: WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint) https://192.168.100.203:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance")
time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform"
time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh\InProgress: SyncLoopRefreshProgressing: Working toward version 4.7.6")
time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator console Available is False with Deployment\FailedUpdate: DeploymentAvailable: 2 replicas ready at version 4.7.6")
time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator csi-snapshot-controller Progressing is True with \Deploying: Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods")
time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator csi-snapshot-controller Available is False with \Deploying: Available: Waiting for Deployment to deploy csi-snapshot-controller pods")
time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator image-registry Degraded is True with ImagePrunerJobFailed::Removed: Degraded: The registry is removed\nImagePrunerDegraded: Job has reached the specified backoff limit")
time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller \"default\" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod \"router-default-6cf4c4788-gw5bb\" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) had taint {)node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.")