Hyperscalers are not serious about Service Level Agreements (SLA)

 November 30, 2024

I often talk with people about Service Level Agreements (SLAs) in public cloud contexts, and I frequently find that their idea of what those SLAs actually guarantee is distorted.

I believe SLAs need to be approached with a healthy dose of skepticism. In reality, they often provide little meaningful recourse when things go awry. There are two big issues, in my opinion, with the SLAs offered by many companies, including the hyperscalers:

SLAs usually do not address service quality. An SLA might promise uptime, but poor performance, intermittent slowdowns, and subpar user experiences are often not counted as downtime. Focusing solely on uptime metrics can obscure broader service quality problems that are equally damaging.
SLAs are easy to promise on paper. Many SLAs promise “four nines” of availability (99.99%), seemingly offering a rock-solid uptime guarantee. But the catch is that, even if those targets are missed, the consequences for the provider are often minor. They may offer partial refunds, credits, or minor penalties, but these rarely compensate for the real-world impact of a service outage or failure. The true cost of downtime (lost revenue, damaged reputation, and disrupted operations) falls squarely on you.

The first aspect is critical but often hard to measure and prove since the presence of other people’s networks in between makes it much harder to pinpoint the cause of a problem in a demonstrable way. The second aspect is much easier to understand and to quantify.

Let’s say that you have a service that consumes 4 CPUs (x86_64) and 16 GB of RAM and that, if it goes down, will have a real-world impact on your business. Let’s quantify that impact with the following table:

Continuous downtime	Damage
1 hour	$ 1000
1 day	$ 40000

Let’s say that you want to evaluate running this application on both AWS and Google Cloud; these would be the two options to run it:

Cloud provider	Instance	Monthly cost
AWS	m7i.xlarge	$ 147
Google Cloud	n4-standard-4	$ 144

The two prices are very close, so using these specific figures makes little difference to the examples. Also, I took AWS and Google Cloud as examples, but you will arrive at very similar results if you do the same calculations for the other public clouds.

AWS SLA

AWS publishes the SLAs for all their services in a very convenient way. An interesting aspect of the published SLAs is that they all give you Service Credits back when the SLA is missed, with the amount depending on how badly it was missed.

Service Credits are calculated in slightly different ways for the various services, but they can be considered a percentage of the monthly bill (excluding one-time payments and upfront payments) for the service that did not meet the SLA.

If we pick single EC2 instances as an example, we can see that AWS commits to a 99.5% monthly uptime and that, if they miss it, they will give their users the following percentages of service credits:

Monthly Uptime Percentage	Service Credit Percentage
99.0% - 99.5%	10%
95.0% - 99.0%	30%
< 95.0%	100%

Google Cloud SLA

Google Cloud publishes its SLAs on different pages depending on the service; the relevant one here is the SLA for Compute. Google Cloud Compute is the equivalent of AWS EC2, so it makes sense to compare them. First of all, it is interesting to see that in the GCP world, Single Compute Instances have an SLA only if they are on the Premium Network Tier. If they are on the Standard Network Tier, the SLA covers only Load Balancers and Multi-Zone Compute Instances.

Credits are calculated in a way similar to AWS, and the SLA includes a comparable table of service credits:

Monthly Uptime Percentage	Service Credit Percentage
95.0% - 99.95%	10%
90.0% - 95.0%	25%
< 90.0%	100%
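The two credit tables above can be read as a simple lookup from monthly uptime to a percentage of the monthly bill. The following is a minimal sketch of that reading, not either provider's official calculation: the function and constant names are my own, and the boundary handling (lower bounds inclusive, no credit at or above the SLA target) is an assumption, since the tables only list ranges.

```python
# Tiers are (lower bound of monthly uptime in percent, credit in percent),
# ordered from highest to lowest bound. Values come from the tables above.
# Assumption: lower bounds are inclusive; at or above the first bound the
# SLA is considered met and no credit is due.
AWS_EC2_INSTANCE_TIERS = [(99.5, 0), (99.0, 10), (95.0, 30)]
GCP_COMPUTE_INSTANCE_TIERS = [(99.95, 0), (95.0, 10), (90.0, 25)]


def credit_percentage(uptime_percent, tiers):
    """Return the service credit as a percentage of the monthly bill."""
    for lower_bound, credit in tiers:
        if uptime_percent >= lower_bound:
            return credit
    # Below the lowest listed bound, both tables grant a 100% credit.
    return 100


def credit_amount(uptime_percent, tiers, monthly_cost):
    """Return the service credit in dollars, rounded to cents."""
    percentage = credit_percentage(uptime_percent, tiers)
    return round(monthly_cost * percentage / 100, 2)


if __name__ == "__main__":
    # Illustrative uptime values, one per tier of interest.
    for uptime in (99.7, 97.0, 93.0):
        aws = credit_amount(uptime, AWS_EC2_INSTANCE_TIERS, 147)
        gcp = credit_amount(uptime, GCP_COMPUTE_INSTANCE_TIERS, 144)
        print(f"{uptime:.1f}% uptime: AWS ${aws:.2f}, Google Cloud ${gcp:.2f}")
```

Keeping the tiers as data makes it easy to plug in the table of another provider or service without touching the lookup logic.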

Damages calculation

Since we now have the damages we will incur in case of downtime, the costs, and the services' SLAs, we can work through a few scenarios. We will consider four cases: a month with a single one-hour outage, a month with two one-hour outages, a month with a single one-day outage, and a month with two one-day outages. Uptime is computed over a 30-day month.

Downtime	Damages	Uptime	AWS Credits	Google Cloud Credits
1x1 hour	$ 1000	99.86%	$ 0.00	$ 14.40
2x1 hour	$ 2000	99.72%	$ 0.00	$ 14.40
1x1 day	$ 40000	96.67%	$ 44.10	$ 14.40
2x1 day	$ 80000	93.33%	$ 147.00	$ 36.00

As the table clearly shows, in most cases the Service Credits that AWS and Google Cloud give you back are two to three orders of magnitude smaller than the damages you suffer, and for the shorter outages AWS gives back nothing at all.
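To make the gap concrete, here is a small sketch that derives the uptime for each scenario and compares damages with credits. It assumes a 30-day billing month (uptime figures can shift slightly in the last decimal with a different month length), takes the damages and credits straight from the table above, and uses helper names of my own choosing.

```python
import math

HOURS_IN_MONTH = 30 * 24  # assumption: a 30-day billing month


def uptime_percent(downtime_hours, hours_in_month=HOURS_IN_MONTH):
    """Monthly uptime percentage for a given number of downtime hours."""
    return 100 * (1 - downtime_hours / hours_in_month)


def damage_to_credit_ratio(damage, credit):
    """How many times larger the damage is than the credit.

    Returns None when no credit is issued, since the ratio is undefined.
    """
    if credit == 0:
        return None
    return damage / credit


# (label, downtime hours, damage, AWS credit, Google Cloud credit),
# with damages and credits copied from the table above.
SCENARIOS = [
    ("1x1 hour", 1, 1000, 0.00, 14.40),
    ("2x1 hour", 2, 2000, 0.00, 14.40),
    ("1x1 day", 24, 40000, 44.10, 14.40),
    ("2x1 day", 48, 80000, 147.00, 36.00),
]

if __name__ == "__main__":
    for label, hours, damage, aws_credit, gcp_credit in SCENARIOS:
        print(f"{label}: uptime {uptime_percent(hours):.2f}%")
        for provider, credit in (("AWS", aws_credit), ("Google Cloud", gcp_credit)):
            ratio = damage_to_credit_ratio(damage, credit)
            if ratio is None:
                print(f"  {provider}: no credit at all")
            else:
                magnitude = math.log10(ratio)
                print(f"  {provider}: damage is {ratio:.0f}x the credit "
                      f"(about {magnitude:.1f} orders of magnitude)")
```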

Conclusions

Although it is important that the hyperscalers provide SLAs, it is also important to remember that the credits you’ll get if those SLAs are not met are so limited that they cannot be seriously considered a safeguard of any kind. For those reasons, it is critical to design applications that will survive the loss of some of the machines running them so that the potential damages discussed above will not materialize, even if one of the cloud instances where they are running goes down.



Comments

Federico on 2024-12-01 21:15:19 UTC said:

I usually consider the real SLA to be the one for which a 100% reimbursement is promised; the other levels are just suggestions of a hope. When you have a service hosted in Azure, relying on multiple Azure products each with their own SLA, if you multiply those SLAs you often discover that the total SLA is well under 90% (0.95^5 is about 0.77).

Fabio Alessandro Locati on 2024-12-02 07:01:06 UTC said:

I totally agree that an application’s “real” uptime needs to account for a more complex scenario in which many services need to be up to ensure that the application is up. As you pointed out, the right operation in those cases is multiplying the various SLAs (at a reimbursement level you deem appropriate).

On the other hand, though, it is also possible to make some systems redundant and available in multiple availability zones or regions so that the application’s uptime is higher than that of the individual components.
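Both effects discussed in this thread can be sketched in a few lines. This is a simplified model with function names of my own: it assumes that failures are independent, that a chain of dependencies is up only when every component is up, and that a redundant group is up as long as at least one replica is up. Real systems often have correlated failures, so treat the numbers as rough bounds rather than predictions.

```python
import math


def serial_availability(availabilities):
    """Availability of a chain where every component must be up.

    Assumes independent failures; availabilities are fractions (0.95 = 95%).
    """
    return math.prod(availabilities)


def redundant_availability(availability, replicas):
    """Availability of a group that is up if at least one replica is up.

    Assumes independent failures and identical replicas.
    """
    return 1 - (1 - availability) ** replicas


if __name__ == "__main__":
    # Five dependencies at 95% each, as in the comment above.
    chained = serial_availability([0.95] * 5)
    print(f"five services in series: {chained:.2%}")

    # Two replicas of an instance with 99.5% availability each.
    paired = redundant_availability(0.995, 2)
    print(f"two redundant instances: {paired:.4%}")
```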