Fabio Alessandro Locati (Fale)'s blog

Can you trust a cloud provider for HA?

May 23, 2022

We have seen a massive increase in the “real world” dependency on digital services in the last few years. This process will probably continue in the future, and we are not ready for it. Over the same period, we have seen many cases where digital services went offline or got hacked. In a society that relies more and more on digital services, we cannot afford for such services to be unavailable or insecure. Although security is essential, I want to focus on availability for now.

I often see companies building services that rely on a single cloud provider. Every time this happens, I start questioning them on how they ensure that their service is highly available and can meet the required SLAs. They often answer that their provider has granted them SLAs, so they can offer similar SLAs to their users. This way of reasoning is, more often than not, fallacious.

SLAs math

The reason for the fallacy lies in what I call “SLAs math”, or what the probability branch of mathematics studies. Let's take a simple service as an example: its infrastructure is composed of a public-facing connection, one application server node (stateless), one internal connection, and one database server (stateful). In this elementary example, there are five SLAs to consider.

Let's say the cloud provider guarantees monthly SLAs of 99.9% or 99.5%, depending on the component (values taken from a famous cloud provider but very similar to the values of many other providers).

To calculate the overall infrastructure SLA, we would need to know the correlation between the various SLAs, but I’ve not seen any provider state it. So, let’s calculate the two extreme correlation options: +1 and -1.

For +1, we assume complete correlation, so it suffices to take the lowest SLA. In our example, that is therefore 99.5%.

For the correlation of -1, we assume there will never be a fault that affects multiple services at the same time. For this, the calculation is slightly more complex: we need to multiply the various probabilities, so we have 0.999*0.995*0.999*0.995*0.995 ≈ 0.9831. The whole infrastructure therefore has a 98.31% SLA.

Considering the best-case and the worst-case scenarios that still comply with the SLAs, we could have between 3h 39m 8s (99.5%) and 12h 20m 42s (98.31%) of unguaranteed uptime every month. Also, the real correlation is probably much closer to 0 than to either extreme, so we could expect the actual SLA for the solution to sit between the two, at something like ~99% (~8h of unguaranteed uptime).
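To make these numbers reproducible, here is a minimal Python sketch of the calculation; the five SLA values are the ones multiplied above, and the month length is assumed to be an average month of 365.25/12 days.

```python
# Minimal sketch of the "SLAs math" above. The five values are the ones
# multiplied in the text; the month length (365.25/12 days) is an assumption.
from math import prod

slas = [0.999, 0.995, 0.999, 0.995, 0.995]
HOURS_PER_MONTH = 365.25 * 24 / 12  # ~730.5 hours in an average month


def monthly_downtime(availability: float) -> str:
    """Format the monthly time not covered by the given availability."""
    seconds = (1 - availability) * HOURS_PER_MONTH * 3600
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


best_case = min(slas)    # correlation +1: the weakest SLA dominates
worst_case = prod(slas)  # no overlapping faults: multiply the probabilities

print(f"best case : {best_case:.4%} -> {monthly_downtime(best_case)} per month")
print(f"worst case: {worst_case:.4%} -> {monthly_downtime(worst_case)} per month")
```

With these assumptions the output lands within a few seconds of the figures quoted above; the small difference comes from rounding 98.31%.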

Nowadays, services require infrastructures with far more than five components. If you try the same exercise for a real-world service (see the sketch below), you can quickly get a worst-case value well below 90%, i.e., more than three days of unguaranteed uptime per month.
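As a purely illustrative extension, keeping the same two SLA tiers and simply increasing the number of serially dependent components shows how quickly the worst case degrades; the component counts and the 99.9%/99.5% mix are assumptions, not figures from any provider. With a few dozen components, the worst case indeed drops below 90%, i.e., roughly three days per month.

```python
# Illustration only: worst-case composite SLA as the number of serially
# dependent components grows. The 99.9%/99.5% split is an assumption.
from math import prod

HOURS_PER_MONTH = 365.25 * 24 / 12

for n in (5, 10, 20, 40):
    slas = [0.999] * (n // 2) + [0.995] * (n - n // 2)
    worst = prod(slas)
    print(f"{n:>2} components: {worst:.2%} "
          f"(~{(1 - worst) * HOURS_PER_MONTH:.0f}h unguaranteed per month)")
```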

In addition to this, you have to remember that if the infrastructure your service runs on has a cumulative SLA of, for example, 95%, you cannot guarantee the same SLA to your customers, since you are not accounting for possible problems in your part of the stack. Hence, the SLA you can guarantee to your customers should always be lower than your architecture's SLA.
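A quick back-of-the-envelope check makes the point, assuming (purely as an illustration) a 99.9% availability for your own software layer on top of that 95% infrastructure SLA:

```python
# Hypothetical figures: the customer-facing SLA can never exceed the
# product of the infrastructure SLA and your own stack's availability.
infrastructure_sla = 0.95  # cumulative infrastructure SLA from the text
own_stack = 0.999          # assumed availability of your own software layer

print(f"maximum credible customer SLA: {infrastructure_sla * own_stack:.2%}")  # ~94.9%
```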

Increasing the infrastructure SLA

You might be wondering how it is possible that many services offer higher SLAs than this way of thinking would suggest. Simply enough, they make sensible use of redundancy. If the right architecture is chosen, the SLA of the infrastructure can be better than the SLAs of the single items that compose it.

Let's pick as an example two servers placed in two different Availability Zones of the same Region. The provider might guarantee that, in this case, you get a 99.99% SLA for at least one of the two machines being alive. If your service is designed to keep working with only one of the two machines alive, the redundancy compounds favorably for you.
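As a sketch of the math behind that figure (assuming, optimistically, that the two machines fail independently and reusing the 99.5% per-machine SLA from the earlier example):

```python
# Parallel redundancy: the service is up as long as at least one of the
# two machines is up. Independence between the two AZs is an assumption.
single_machine = 0.995  # per-machine SLA from the earlier example

at_least_one_up = 1 - (1 - single_machine) ** 2
print(f"{at_least_one_up:.4%}")  # 99.9975% -- better than either machine alone
```

Note that the 99.99% the provider quotes is lower than the fully independent result, which already hints that the two AZs are not completely uncorrelated.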

Single provider risks

You might be tricked into thinking that if you have three instances in three different Availability Zones, you can get something like a 99.999% SLA, but this is not the case.

In the example with no redundancies that we have seen, the more correlated the failures were, the better the overall SLA was. Here the opposite is true.

The issue with sourcing multiple resources for redundancy purposes from the same cloud provider is that their failures will not be completely uncorrelated: when one resource has an issue, the others are far more likely than usual to have issues as well.
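A toy model (with made-up numbers, not provider data) makes the effect visible: give every replica an independent failure probability, plus a single common-cause event, such as a site-wide incident, that takes all replicas down at once. No matter how many replicas you add, availability plateaus at the common-cause ceiling.

```python
# Toy model: replicas fail independently with probability p_independent,
# and a common-cause event with probability p_common takes them all down.
# Both numbers are assumptions chosen only to illustrate the plateau.
p_independent = 0.005  # per-replica independent failure probability
p_common = 0.0005      # probability of a common-cause (e.g. site-wide) outage

for replicas in (1, 2, 3, 5):
    p_down = p_common + (1 - p_common) * p_independent ** replicas
    print(f"{replicas} replica(s): {1 - p_down:.5%} available")
```

Adding the third or fifth replica barely moves the needle: the result converges to 1 - p_common, which is exactly why three AZs of the same provider do not automatically give you 99.999%.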

The correlation between them might be unclear at times, but it becomes obvious when an issue arises. An example of this is the OVH Cloud incident that happened on March 10th, 2021, when, due to a fire in the SBG2 data center, all SBG data centers went down.

As you can imagine, anyone who bought servers in multiple SBG data centers to serve a Highly Available service ended up with the service down for five days (best-case scenario), nine days, or with the data wholly lost (worst-case scenario). Even if some information has not been released to the public, we can safely say that some design flaws made the incident worse than it could have been and that customers were not aware of those design flaws before the accident. Even worse, multiple OpenStack Availability Zones (os-sbg1, os-sbg2, os-sbg3, os-sbg4) were located in the same physical data center (SBG2); therefore, a cluster spanning those AZs was annihilated in the accident.

Leaving aside the fact that it is the cloud providers themselves who define what the Availability Zones within a Region are, and who suggest using them for Highly Available deployments, AZs might still not be enough. You might infer from this that you should leverage multiple Regions to ensure that the servers are physically far away from each other. This approach is undoubtedly good hygiene if the latencies are acceptable for your use case, but it might still not be enough.

Multiple regions of the same cloud provider are still somehow related and might have a downtime correlation higher than 0. As an example, we can look at the AWS outage of December 7th, 2021: due to an outage in AWS's us-east-1 (Northern Virginia) region, AWS services went down globally for around four hours. This happened because us-east-1 is AWS's first and main region, and many global services depend on it.

The solution is to source the resources needed for your service from multiple cloud providers, which is the definition of a multi-cloud deployment.

The multi-cloud fallacy

As we have seen, sourcing resources from multiple cloud providers is critical for a real highly available setup, but it is not a guarantee of success.

People often compose the infrastructure by using a single provider per component, relying on the “best” provider for that component. It would be common, for example, to spread the four components of our earlier infrastructure across four different clouds, one provider each.

Such an infrastructure is asking to go down, since it is enough for one of the four clouds to have an issue for the whole service to go down.

The proper way to leverage multiple providers for a Highly Available environment is to replicate all the components of the infrastructure across all providers, so that even if one provider goes completely down, the service can still survive.
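The difference between the two shapes, one “best” provider per component versus a full replica on every provider, is easy to quantify under some assumed numbers: each cloud offers 99.9% availability, the providers fail independently of each other, and failover between replicas is instantaneous.

```python
# Illustrative comparison with assumed numbers: four clouds, each 99.9%
# available and failing independently; failover is assumed to be perfect.
from math import prod

cloud_sla = 0.999
clouds = 4

one_component_per_cloud = cloud_sla ** clouds           # every cloud must be up
full_replica_per_cloud = 1 - (1 - cloud_sla) ** clouds  # one surviving cloud suffices

print(f"one component per cloud: {one_component_per_cloud:.3%}")  # ~99.601%
print(f"full replica per cloud : {full_replica_per_cloud:.10%}")
```

In practice, cross-provider failover is never perfect, so the second number is only an upper bound, but the direction of the comparison is the point: chaining providers in series lowers availability, while replicating across them raises it.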

Conclusions

When designing a service that needs to be Highly Available, it is key to understand what can go wrong at every level and to combine multiple service providers to maximize the uptime of the service, usually by deploying the same components of the infrastructure across multiple providers.