IPI on VMware :: Problems with Operator authentication & console

I have been having an absolute nightmare for two months with OpenShift 4.7 running as a bare-metal install on a VM cluster.

I was getting a lot of timeouts and sockets stuck in TIME_WAIT on ports 443 and 6443.

I was constantly getting timeouts and failures in both the bootstrap and the install phases.

I have literally rebuilt the entire cluster from scratch about 20-30 times.

I first tried ESXi 7.0b as the hypervisor, but recently I got so fed up that I rebuilt the entire environment on Proxmox ... and I still get exactly the same problem.

It seems you just have to keep restarting the bootstrap and install phases repeatedly until you get lucky. When I check the packet captures I see HAProxy FIN the sessions (originating from the source), so I don't actually see packet loss as such! But I need to make sure I identify the correct sessions to trap.
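To trap the right sessions I'm capturing with something like this (the interface name and the VIP are placeholders taken from my environment, adjust for yours):

```shell
# Capture only API (6443) and ingress (443) traffic involving the load-balancer VIP.
# Interface name (ens192) and address are placeholders for my setup.
tcpdump -i ens192 -w api-lb.pcap \
  '(port 6443 or port 443) and host 192.168.100.203'

# Afterwards, check which side sends the FIN first:
tshark -r api-lb.pcap -Y 'tcp.flags.fin == 1'
```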

Anybody have any ideas why the problem could be the same on both ESXi and Proxmox?

The install fails with the following:

INFO Cluster operator network ManagementStateDegraded is False with :

ERROR Cluster initialization failed because one or more operators are not functioning properly.

ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,

ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html

ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation

FATAL failed to initialize the cluster: Working towards 4.7.6: 621 of 668 done (92% complete)
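For reference, the subcommand the installer mentions can be rerun from the asset directory like this (the `--dir` path is a placeholder for my install directory):

```shell
# Resume watching cluster initialization after a failed run:
openshift-install wait-for install-complete --dir=./mycluster --log-level=debug
```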

The log file barfs with:

time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator authentication Degraded is True with WellKnownReadyController_SyncError: WellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://192.168.100.203:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator authentication Progressing is True with OAuthServerDeployment_ReplicasNotReady::OAuthVersionDeployment_ReplicasNotReady: OAuthVersionDeploymentProgressing: Waiting for all OAuth server replicas to be ready (1 not ready, container is not ready in oauth-openshift-7c9cdb4c96-zblcr pod)\nOAuthServerDeploymentProgressing: Waiting for all OAuth server replicas to be ready (1 not ready, container is not ready in oauth-openshift-7c9cdb4c96-zblcr pod)"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator authentication Available is False with WellKnown_NotReady: WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://192.168.100.203:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)"
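I've been probing the endpoint the authentication operator is waiting on, and the operators the message points at, roughly like this (the kubeconfig path is the installer default for my install directory, adjust as needed):

```shell
# Manually hit the well-known OAuth endpoint from the log above
# (-k skips TLS verification, fine for a quick probe):
curl -k https://192.168.100.203:6443/.well-known/oauth-authorization-server

# Check the kube-apiserver rollout the message says to watch:
export KUBECONFIG=./mycluster/auth/kubeconfig
oc get clusteroperators kube-apiserver authentication
oc get pods -n openshift-kube-apiserver
```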

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.7.6"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator console Available is False with Deployment_FailedUpdate: DeploymentAvailable: 2 replicas ready at version 4.7.6"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator csi-snapshot-controller Progressing is True with Deploying: Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"

time="2021-04-21T22:38:08+01:00" level=info msg="Cluster operator csi-snapshot-controller Available is False with Deploying: Available: Waiting for Deployment to deploy csi-snapshot-controller pods"

time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator image-registry Degraded is True with ImagePrunerJobFailed::Removed: Degraded: The registry is removed\nImagePrunerDegraded: Job has reached the specified backoff limit"

time="2021-04-21T22:38:08+01:00" level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller \"default\" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod \"router-default-6cf4c4788-gw5bb\" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.)"
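That last ingress error says the router pod has nowhere to schedule: three tainted masters plus one node with the ports already taken. One thing I still need to rule out is whether the worker CSRs were ever approved, since workers that never join would explain exactly this symptom (kubeconfig path is a placeholder for my install directory):

```shell
export KUBECONFIG=./mycluster/auth/kubeconfig

# Do any worker nodes show up at all?
oc get nodes

# Workers can't join until their CSRs are approved:
oc get csr
oc get csr -o name | xargs oc adm certificate approve
```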

/r/openshift Thread