Fabio Alessandro Locati (Fale)'s blog

Hyperscalers are not serious about Service Level Agreements (SLAs)

November 30, 2024

I often talk with people about Service Level Agreements (SLAs) in public cloud contexts, and I find that their idea of what those SLAs actually provide is frequently distorted.

I believe SLAs need to be approached with a healthy dose of skepticism. In reality, they often provide little meaningful recourse when things go awry. There are two big issues, in my opinion, with the SLAs provided by many companies, including the hyperscalers:

The first aspect is critical but often hard to measure and prove since the presence of other people’s networks in between makes it much harder to pinpoint the cause of a problem in a demonstrable way. The second aspect is much easier to understand and to quantify.

Let’s say that you have a service that consumes 4 CPUs (x86_64) and 16 GB of RAM and that, if it goes down, will have a real-world impact on your business. Let’s quantify that impact with the following table:

| Continuous downtime | Damage |
|---|---|
| 1 hour | $ 1000 |
| 1 day | $ 40000 |

Let’s say that you want to evaluate running this application on both AWS and Google Cloud; these would be the two options to run it:

| Cloud provider | Instance | Monthly cost |
|---|---|---|
| AWS | m7i.xlarge | $ 147 |
| Google Cloud | n4-standard-4 | $ 144 |

We will use these specific prices in the examples, but the two are so close that the choice of provider makes little difference to the outcome. Also, I took AWS and Google Cloud as examples, but you will arrive at very similar results if you do the same calculations for the other public clouds.

AWS SLA

AWS publishes the SLAs for all their services in a very convenient way. An interesting aspect of the published SLAs is that they all give you Service Credits when AWS misses the SLA, with the amount depending on how far it was missed.

Service Credits are calculated in slightly different ways for the various services, but they can be considered a percentage of the monthly bill (excluding one-time payments and upfront payments) for the service that did not meet the SLA.

If we take single EC2 instances as an example, we can see that AWS commits to a 99.5% monthly uptime and, if they miss it, gives their users the following percentages of Service Credits:

| Monthly Uptime Percentage | Service Credit Percentage |
|---|---|
| 99.0% - 99.5% | 10% |
| 95.0% - 99.0% | 30% |
| < 95.0% | 100% |
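
To make the tier lookup concrete, here is a minimal Python sketch that maps a monthly uptime percentage to the credit percentage in the table above. The function name `aws_credit_percent` is hypothetical, and treating each lower bound as inclusive (for example, exactly 99.0% landing in the 10% tier) is an assumption, since the table does not say which tier a boundary value belongs to.

```python
def aws_credit_percent(uptime_percent: float) -> int:
    """Return the Service Credit percentage for a single EC2 instance.

    Tiers are copied from the table above. Treating each lower bound as
    inclusive is an assumption; check the published SLA for exact wording.
    """
    if uptime_percent >= 99.5:
        return 0  # SLA met, no credit is due
    if uptime_percent >= 99.0:
        return 10
    if uptime_percent >= 95.0:
        return 30
    return 100


monthly_cost = 147.0  # m7i.xlarge monthly cost used in this post

# A few sample uptime values, one per tier.
for uptime in (99.86, 99.2, 96.68, 93.37):
    credit = monthly_cost * aws_credit_percent(uptime) / 100
    print(f"{uptime:.2f}% uptime -> ${credit:.2f} credit")
```

Even in the worst tier, the credit can never exceed the monthly bill for the instance itself.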

Google Cloud SLA

Google Cloud publishes its SLAs on different pages based on the service; the relevant one here is the one for Compute. Google Cloud Compute is the equivalent of AWS EC2, so it makes sense to compare them. First of all, it is interesting to see that in the GCP world, Single Compute Instances have an SLA only if they are on the Premium Network Tier. If they are on the Standard Network Tier, the SLA covers only Load Balancers and Multi-Zone Compute Instances.

The way they calculate the credits is similar to AWS, and they publish a comparable table of credit percentages:

| Monthly Uptime Percentage | Service Credit Percentage |
|---|---|
| 95.0% - 99.95% | 10% |
| 90.0% - 95.0% | 25% |
| < 90.0% | 100% |
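
Uptime percentages are easier to reason about when translated into time. The following Python sketch converts an uptime threshold into the minutes of downtime a month can absorb before any credit is due. It assumes a 30-day month (the post does not state a month length), and it takes 99.5% for AWS and 99.95% for Google Cloud as the thresholds, the latter being the upper bound of the first tier in the table above. The function name `downtime_budget_minutes` is hypothetical.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def downtime_budget_minutes(threshold_percent: float) -> float:
    """Minutes of downtime per month before uptime drops below the threshold."""
    return HOURS_PER_MONTH * 60 * (100 - threshold_percent) / 100


# Thresholds as read from the two tables in this post.
for name, threshold in (("AWS", 99.5), ("Google Cloud", 99.95)):
    minutes = downtime_budget_minutes(threshold)
    print(f"{name}: {minutes:.1f} minutes of downtime before any credit")
```

Under these assumptions, AWS can be down for about 216 minutes in a month before owing anything, while Google Cloud starts owing a credit after about 21.6 minutes.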

Damages calculation

Now that we have the damages we will incur in case of downtime, the costs, and the services' SLAs, we can work through a few scenarios. We will consider four cases: a month with one hour of downtime, a month with two hours of downtime, a month with one day of downtime, and a month with two days of downtime.

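| Downtime | Damages | Uptime | AWS Credits | Google Cloud Credits |
|---|---|---|---|---|
| 1x1 hour | $ 1000 | 99.86% | $ 0.00 | $ 14.40 |
| 2x1 hour | $ 2000 | 99.72% | $ 0.00 | $ 14.40 |
| 1x1 day | $ 40000 | 96.68% | $ 44.10 | $ 14.40 |
| 2x1 day | $ 80000 | 93.37% | $ 147.00 | $ 36.00 |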

As the table clearly shows, in most cases, the Service Credits that AWS and Google Cloud give you back are two to three orders of magnitude smaller than the damages you suffer.
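
The credits in the table can be reproduced with a short Python sketch. It assumes a 30-day month, which the post does not state, so the uptime figures it prints may differ from the table in the last decimal; the credit tiers each scenario lands in are the same. The names `credit` and `scenario_credits` are hypothetical, and treating each tier's lower bound as inclusive is also an assumption.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

# (lower bound of uptime %, credit %) pairs, highest tier first,
# copied from the AWS and Google Cloud tables in this post.
AWS_TIERS = [(99.5, 0), (99.0, 10), (95.0, 30), (0.0, 100)]
GCP_TIERS = [(99.95, 0), (95.0, 10), (90.0, 25), (0.0, 100)]

AWS_MONTHLY_COST = 147.0  # m7i.xlarge
GCP_MONTHLY_COST = 144.0  # n4-standard-4


def credit(uptime_percent, tiers, monthly_cost):
    """Return the credit in dollars for the first tier the uptime reaches."""
    for lower_bound, percent in tiers:
        if uptime_percent >= lower_bound:
            return monthly_cost * percent / 100
    return monthly_cost  # only reached for a negative uptime


def scenario_credits(downtime_hours):
    """Return (uptime %, AWS credit, Google Cloud credit) for one month."""
    uptime = 100 * (1 - downtime_hours / HOURS_PER_MONTH)
    return (
        uptime,
        credit(uptime, AWS_TIERS, AWS_MONTHLY_COST),
        credit(uptime, GCP_TIERS, GCP_MONTHLY_COST),
    )


# (label, total downtime in hours, damages in dollars) from the table above.
SCENARIOS = [
    ("1x1 hour", 1, 1000),
    ("2x1 hour", 2, 2000),
    ("1x1 day", 24, 40000),
    ("2x1 day", 48, 80000),
]

for label, hours, damages in SCENARIOS:
    uptime, aws, gcp = scenario_credits(hours)
    print(f"{label}: damages ${damages}, uptime {uptime:.2f}%, "
          f"AWS ${aws:.2f}, Google Cloud ${gcp:.2f}")
```

Changing the monthly cost or the damage figures shows how insensitive the conclusion is to the exact numbers: the credit is capped at the instance's monthly bill, while the damages are not.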

Conclusions

Although it is important that the hyperscalers provide SLAs, it is also important to remember that the credits you'll get if those SLAs are not met are so limited that they cannot be seriously considered a safeguard of any kind. For that reason, it is critical to design applications that survive the loss of some of the machines running them, so that the potential damages discussed above do not materialize even if one of the cloud instances they run on goes down.