
Hyperscalers are not serious about Service Level Agreements (SLAs)
November 30, 2024
I often talk with people about Service Level Agreements (SLAs) in public cloud contexts, and I discover that their idea of what those SLAs are is often distorted.
I believe SLAs need to be approached with a healthy dose of skepticism. In reality, they often provide little meaningful recourse when things go awry. There are two big issues, in my opinion, with the SLAs provided by many companies, including the hyperscalers:
SLAs usually do not address service quality. An SLA might promise uptime, but it rarely says anything about how well the service performs while it is up. Poor performance, intermittent slowdowns, and subpar user experiences are often not counted as downtime. Focusing solely on uptime metrics can obscure broader service quality problems that are equally damaging.
SLAs are easy to promise on paper. Many SLAs promise “four nines” of availability (99.99%), seemingly offering a rock-solid uptime guarantee. But the catch is that, even if those targets are missed, the consequences for the provider are often minor. They may offer partial refunds, credits, or minor penalties, but these rarely compensate for the real-world impact of a service outage or failure. The true cost of downtime (lost revenue, damaged reputation, and disrupted operations) falls squarely on you.
The first issue is critical but hard to measure and prove: the third-party networks sitting between you and the provider make it difficult to pinpoint the cause of a problem in a demonstrable way. The second issue is much easier to understand and to quantify.
Let’s say that you have a service that consumes 4 CPUs (x86_64) and 16 GB of RAM and that, if it goes down, will have a real-world impact on your business. Let’s quantify that impact with the following table:
Continuous downtime | Damage |
---|---|
1 hour | $ 1000 |
1 day | $ 40000 |
Let’s say that you want to evaluate running this application on both AWS and Google Cloud; these would be the two options to run it:
Cloud provider | Instance | Monthly cost |
---|---|---|
AWS | m7i.xlarge | $ 147 |
Google Cloud | n4-standard-4 | $ 144 |
We will use these specific prices in the examples, but the two are so close that the choice makes little difference. Also, I took AWS and Google Cloud as examples, but you will arrive at very similar results if you do the same calculations for the other public clouds.
AWS SLA
AWS publishes the SLAs for all their services in a very convenient way. An interesting aspect of the published SLAs is that they all give you Service Credits back if AWS breaks the SLA, with the amount depending on how far the service fell short of the target.
Service Credits are calculated in slightly different ways for the various services, but they can be considered a percentage of the monthly bill (excluding one-time payments and upfront payments) for the service that did not meet the SLA.
If we take single EC2 instances as an example, AWS commits to 99.5% monthly uptime and, if it misses that target, gives its users the following percentages of service credits:
Monthly Uptime Percentage | Service Credit Percentage |
---|---|
99.0% - 99.5% | 10% |
95.0% - 99.0% | 30% |
< 95.0% | 100% |
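The tier table above can be read as a simple lookup. The following is a minimal sketch, not AWS's own calculation: the function name is hypothetical, the lower bound of each tier is assumed to be inclusive, and the credit is assumed to be a flat percentage of the $147 monthly cost used in this post.

```python
def aws_credit_percent(uptime_percent: float) -> int:
    """Service credit percentage for a single EC2 instance, per the tiers above.

    Assumption: the lower bound of each tier is inclusive.
    """
    if uptime_percent >= 99.5:
        return 0  # SLA met, no credit
    if uptime_percent >= 99.0:
        return 10
    if uptime_percent >= 95.0:
        return 30
    return 100


MONTHLY_COST = 147.00  # m7i.xlarge monthly cost used in this post

for uptime in (99.86, 99.20, 96.68, 93.37):
    credit = MONTHLY_COST * aws_credit_percent(uptime) / 100
    print(f"{uptime:.2f}% uptime -> ${credit:.2f} credit")
```

Even the top tier is capped at the monthly bill for the affected service, so the credit can never exceed what you paid for the instance.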
Google Cloud SLA
Google Cloud publishes its SLAs on separate pages for each service; the one that matters here is the one for Compute. Google Cloud Compute is the equivalent of AWS EC2, so it makes sense to compare the two. First of all, it is interesting to see that in the GCP world, Single Compute Instances have an SLA only if they are on the Premium Network Tier. If they are on the Standard Network Tier, the SLA covers only Load Balancers and Multi-Zone Compute Instances.
Google Cloud calculates credits in much the same way as AWS and publishes a similar table of credit tiers:
Monthly Uptime Percentage | Credit Percentage |
---|---|
95.0% - 99.95% | 10% |
90.0% - 95.0% | 25% |
< 90.0% | 100% |
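Uptime percentages are easier to judge when converted into actual downtime. The sketch below does that conversion for the thresholds in the table above. It assumes a 30-day month, and the constant and function names are made up for illustration.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month

# (uptime threshold in percent, credit percentage once uptime drops below it),
# taken from the table above
GCP_TIERS = [(99.95, 10), (95.0, 25), (90.0, 100)]


def downtime_minutes(uptime_threshold: float) -> float:
    """Minutes of monthly downtime at which uptime falls to the given threshold."""
    return (100 - uptime_threshold) / 100 * MINUTES_PER_MONTH


for threshold, credit in GCP_TIERS:
    minutes = downtime_minutes(threshold)
    print(f"more than {minutes:.1f} min ({minutes / 60:.1f} h) down -> {credit}% credit")
```

Under that assumption, the 10% tier spans everything from roughly 22 minutes to 36 hours of downtime in a month, so a very short outage and a day-long one earn the same credit.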
Damages calculation
Now that we have the damages we will incur in case of downtime, the costs, and the services’ SLAs, we can work through a few scenarios. We will consider four cases: a month with one hour of downtime, a month with two hours of downtime, a month with one day of downtime, and a month with two days of downtime.
Downtime | Damages | Uptime | AWS Credits | Google Cloud Credits |
---|---|---|---|---|
1x1 hour | $ 1000 | 99.86% | $ 0.00 | $ 14.40 |
2x1 hour | $ 2000 | 99.72% | $ 0.00 | $ 14.40 |
1x1 day | $ 40000 | 96.68% | $ 44.10 | $ 14.40 |
2x1 day | $ 80000 | 93.37% | $ 147.00 | $ 36.00 |
As the table clearly shows, whenever a credit is issued at all, the Service Credits that AWS and Google Cloud give you back are two to three orders of magnitude smaller than the damages you incur.
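The numbers in the table can be reproduced with a few lines of Python. This is a sketch under stated assumptions rather than either provider's official calculation: it assumes a 30-day month, treats the lower bound of each tier as inclusive, and uses hypothetical names (`credit`, `evaluate`, `SCENARIOS`) throughout.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

# Tiers as (minimum uptime %, credit %), checked from the top down.
# Assumption: the lower bound of each tier is inclusive.
AWS_TIERS = [(99.5, 0), (99.0, 10), (95.0, 30), (0.0, 100)]
GCP_TIERS = [(99.95, 0), (95.0, 10), (90.0, 25), (0.0, 100)]

AWS_MONTHLY_COST = 147.0  # m7i.xlarge
GCP_MONTHLY_COST = 144.0  # n4-standard-4

# (label, hours of downtime in the month, damages in dollars)
SCENARIOS = [
    ("1x1 hour", 1, 1000),
    ("2x1 hour", 2, 2000),
    ("1x1 day", 24, 40000),
    ("2x1 day", 48, 80000),
]


def credit(uptime: float, monthly_cost: float, tiers) -> float:
    """Return the credit in dollars for a given monthly uptime percentage."""
    for min_uptime, percent in tiers:
        if uptime >= min_uptime:
            return monthly_cost * percent / 100
    return monthly_cost  # only reached for a negative uptime


def evaluate(hours_down: float):
    """Return (uptime %, AWS credit, Google Cloud credit) for a month."""
    uptime = 100 * (1 - hours_down / HOURS_PER_MONTH)
    return (
        uptime,
        credit(uptime, AWS_MONTHLY_COST, AWS_TIERS),
        credit(uptime, GCP_MONTHLY_COST, GCP_TIERS),
    )


def ratio(damages: float, credit_amount: float) -> str:
    """Damages divided by credit, or 'n/a' when no credit is issued."""
    return f"{damages / credit_amount:.0f}x" if credit_amount else "n/a"


for label, hours, damages in SCENARIOS:
    uptime, aws, gcp = evaluate(hours)
    print(
        f"{label}: uptime {uptime:.2f}%, "
        f"AWS ${aws:.2f} ({ratio(damages, aws)}), "
        f"Google Cloud ${gcp:.2f} ({ratio(damages, gcp)})"
    )
```

Because of the 30-day assumption, the computed uptime can differ from the table by a few hundredths of a percentage point, but each scenario lands in the same credit tier. The ratio column shows how many times larger the damages are than the credit.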
Conclusions
Although it is important that the hyperscalers provide SLAs, it is also important to remember that the credits you’ll get if those SLAs are not met are so limited that they cannot be seriously considered a safeguard of any kind. For that reason, it is critical to design applications that survive the loss of some of the machines running them, so that the potential damages discussed above do not materialize even if one of the cloud instances they run on goes down.