PS4 NA servers have been down almost 12 hrs WTF

Let me be more clear: when you're running a service, which is what ESO is, you're expected to have certain SLAs (service level agreements) with customers. These could be firm and enforced via a penalty or, in this case, more of a customer promise (i.e. you buy our product and we promise to deliver the service in good faith). Here on this forum, people talk about the NA or EU server. It is obviously a cluster of many nodes, geographically distributed and load balanced with geo-IP load balancing (though I always thought it was interesting that you could elect to go to the EU "server"). So when I say the NA servers are down, I'm referring to the service. If that caused confusion, sorry, but it's very obvious there is a widespread issue that is very easily detectable, either programmatically with monitoring and alarming, or via the number of incoming customer tickets in the queue, which should be monitored and should trigger an alarm if it goes above normal thresholds.
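To make the ticket-queue point concrete, here's a rough sketch of what "alarm if it goes above normal thresholds" could look like. I have no idea what ESO actually runs; the function name, the sample numbers, and the "mean plus three standard deviations" rule are all my own assumptions for illustration.

```python
from statistics import mean, stdev


def ticket_alarm(history, current, sigmas=3.0):
    """Return True when the current ticket count is abnormally high.

    history: recent per-interval ticket counts under normal conditions
             (hypothetical data, e.g. tickets per 10 minutes).
    current: ticket count for the interval being checked.
    sigmas:  how many standard deviations above the mean counts as abnormal
             (an assumed rule of thumb, not a known ESO setting).
    """
    if len(history) < 2:
        # Not enough data to estimate a "normal" range, so stay quiet.
        return False
    threshold = mean(history) + sigmas * stdev(history)
    return current > threshold


if __name__ == "__main__":
    normal_counts = [40, 42, 38, 45, 41, 39]
    print(ticket_alarm(normal_counts, 44))   # a busy but ordinary interval
    print(ticket_alarm(normal_counts, 400))  # a flood of tickets during an outage
```

A real setup would page someone instead of printing, but the idea is the same: a spike in tickets is a signal on its own, even if every other monitor is green.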

Building scalability and reliability into your service isn't easy for sure, but for a game of this size and scope it's necessary. In fact I'm sure they have a number of failsafes, but I suspect none of them is in the DNS/networking or BGP routing layer. Hopefully they're using more than one DNS service, because if that were attacked, for example, they would have a very bad single point of failure on their hands.

As an example, when Amazon had several issues in a couple of their datacenters in an AZ in us-east-1 a few years ago, it exposed that a lot of very large services such as Reddit had not properly added infrastructure redundancy and didn't follow best practices such as multi-AZ and multi-region deployments. And it took them a very long time to restore service.

While this is probably a bad route somewhere, it's affecting a large number of customers, and it shows that ESO's monitoring is not very robust, because they should have detected this either with monitors on the edge OR by alarming on the relatively lower number of incoming connection requests during the event.
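The "fewer incoming connections than usual" alarm is also simple in principle. A minimal sketch, assuming you already track a baseline connection rate for the same time of day; the function name and the 50% cutoff are made up for illustration and aren't anything I know about ESO's setup.

```python
def connection_drop_alarm(baseline_rate, current_rate, max_drop=0.5):
    """Return True when incoming connections fall too far below baseline.

    baseline_rate: expected connection requests per minute (assumed to be
                   known from historical data for this time of day).
    current_rate:  connection requests per minute observed right now.
    max_drop:      largest tolerated fractional drop (0.5 means 50%);
                   an assumed cutoff, to be tuned per service.
    """
    if baseline_rate <= 0:
        # No meaningful baseline, so a relative drop cannot be computed.
        return False
    drop = (baseline_rate - current_rate) / baseline_rate
    return drop > max_drop


if __name__ == "__main__":
    print(connection_drop_alarm(1000, 900))  # small dip, within tolerance
    print(connection_drop_alarm(1000, 300))  # most players can't connect
```

The point of alarming on a drop rather than on errors is that a bad route never reaches your servers at all, so there are no errors to count. The only symptom on your side is that traffic goes quiet.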

Given that this happens to folks who are doing wayshrines and delves, I suspect that they have a single point of failure where they may be dialing back home to a single cluster to fetch or store data, and those callbacks are failing.

tl;dr I want to be spending the big bucks on ESO right now but won't because their customer support took hours to acknowledge an incident.

/r/elderscrollsonline