
Resource scarcity in Public Clouds
December 10, 2019
In the last couple of weeks, I’ve seen problems allocating resources in the Frankfurt region of the Google Cloud Platform. The problem seems to have been caused by the high demand from Google customers trying to sustain their businesses during the Thanksgiving, Black Friday, and Cyber Monday period. After some searching, I found out that this is not the first time it has happened, and that it is not only a GCP problem, since AWS and Azure have had similar incidents.
This issue made me think about the assumption of infinite resources that Public Cloud providers promote. Obviously, there is no such thing as infinite resources, no matter the size of the Public Cloud provider of choice. What the provider has to do is estimate the demand and provision slightly more resources than that estimate. If the provider estimates too high, it loses money on idle hardware; if it estimates too low, its customers will not be able to get the resources they want. To make the situation even harder, the provider has to make these estimates with little or no knowledge of the workloads that run in its data-centers.
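To make the trade-off concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the demand samples, the candidate capacities, and the two cost parameters (`idle_cost` for every unused unit, `shortfall_cost` for every unit a customer asked for and did not get) are invented numbers, not figures from any provider.

```python
def provisioning_cost(capacity, demand_samples, idle_cost=1.0, shortfall_cost=5.0):
    """Average cost of a capacity choice over a set of demand samples.

    idle_cost and shortfall_cost are hypothetical per-unit penalties.
    """
    total = 0.0
    for demand in demand_samples:
        if capacity >= demand:
            # Over-provisioned: pay for hardware nobody is using.
            total += (capacity - demand) * idle_cost
        else:
            # Under-provisioned: some customer requests cannot be served.
            total += (demand - capacity) * shortfall_cost
    return total / len(demand_samples)


def best_capacity(candidates, demand_samples, **costs):
    """Pick the candidate capacity with the lowest average cost."""
    return min(candidates, key=lambda c: provisioning_cost(c, demand_samples, **costs))


# Invented demand: four ordinary periods and one short peak.
demand_samples = [80, 90, 100, 110, 200]
candidates = [100, 120, 200]

if __name__ == "__main__":
    for shortfall in (5.0, 2.0):
        choice = best_capacity(candidates, demand_samples, shortfall_cost=shortfall)
        print(f"shortfall_cost={shortfall}: provision {choice} units")
```

With these made-up numbers, the cheapest choice covers the peak only when a shortfall is expensive for the provider. When a shortfall is cheap, the cheapest choice leaves the peak unserved.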
For most of the year, this is not a problem, since every business has its own high and low periods, and we could assume that in a data-center as large as those of the big Public Cloud providers, the load is going to level out naturally. This assumption, though, runs into a chain of problems.
The first problem is linked to the fact that all the major Public Cloud providers (at least today, maybe this will change in the future) allow (and force) you to choose a specific region to run your workloads. This means that businesses will try to place their workloads as close as possible to their offices or their customers, so a high percentage of the workloads in a region will have a similar nature and similar peak periods. For instance, a region in Frankfurt will have a high percentage of its resources used by services that are mainly used by people living in Germany and nearby countries. So, in periods of higher user interaction with Internet services, such as Christmas, the data-center load will increase dramatically.
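A toy calculation shows why workloads that peak together are harder to absorb than workloads that peak at different times. The load profiles below are invented, and `peak_to_average` is a hypothetical helper written for this illustration.

```python
def peak_to_average(profiles):
    """Ratio between the peak and the average of the combined load.

    Each profile is a list of load values, one per time slot.
    """
    combined = [sum(slot) for slot in zip(*profiles)]
    return max(combined) / (sum(combined) / len(combined))


# Four tenants whose peaks fall in different time slots.
staggered = [
    [4, 1, 1, 1],
    [1, 4, 1, 1],
    [1, 1, 4, 1],
    [1, 1, 1, 4],
]

# Four tenants that all peak in the same slot (same region, same customers).
aligned = [[1, 1, 1, 4] for _ in range(4)]

if __name__ == "__main__":
    print("staggered:", peak_to_average(staggered))
    print("aligned:  ", peak_to_average(aligned))
```

In the staggered case the combined load is flat, so the ratio is 1. In the aligned case the total work is the same, but the provider has to size the region for a peak more than twice the average.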
In addition to this, there are some “special” moments, such as Thanksgiving and the days around it, that by now have become widespread events even beyond the countries where they were originally celebrated. Probably, in the data-centers in areas where those festivities are observed (or at least where their commercial side is), the load reaches its annual peak because of the e-commerce websites.
To make the situation even worse, many Cloud customers are rewriting and improving their applications, making them more cloud-native. Now, you may wonder how cloud-native applications can make things worse. The reason is very simple: cloud-native applications scale. This means that during the off-peak season the applications drastically reduce their footprint, creating a false feeling of resource abundance, and then grow again all at once when the peak arrives.
This situation creates some problems, in my opinion.
First of all, since it’s very hard for the Public Cloud provider to estimate the load - and in the future, it will be even harder - we will have to live with frequent resource exhaustion in public clouds, which makes a single-cloud, single-region application fragile. And this is before even considering the economic aspect of the problem: there will be situations where it is not economically worthwhile for the Cloud Provider to provision enough resources to handle the peaks, since the additional provisioning cost would not be repaid during the short periods in which those resources are used.
On top of all those problems, no Public Cloud provider gives real insight into the amount of resources it places in the various regions and Availability Zones, or into the current and historical load in those locations. I do understand that this data is considered business critical, but the Public Clouds are becoming more and more critical for our society and should probably start to consider themselves more like utilities than companies operating in a free market.
So, what should be done to improve the situation? First of all, Public Cloud users should not assume that their provider has unlimited resources, and should therefore design their workloads to be able to scale across different regions, clouds (multi-cloud), or datacenters (hybrid-cloud). Secondly, the Public Cloud providers should be more transparent about their capacity and resource availability. Lastly, the Public Cloud providers should begin considering themselves more like utility companies and make sure they always have enough resources for all their customers’ needs.
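As a rough sketch of the first point, a workload can try a list of regions in order instead of failing when its preferred region is full. Nothing here is a real cloud API: `CapacityError`, `allocate_with_fallback`, and `fake_allocate` are hypothetical names, and the free-capacity table is invented. The region names are only examples.

```python
class CapacityError(Exception):
    """Hypothetical error raised when a region cannot satisfy a request."""


def allocate_with_fallback(allocate, regions, instances):
    """Try each region in order and return the first one that has capacity.

    `allocate` is any callable taking (region, instances) that raises
    CapacityError when the region is exhausted.
    """
    failures = {}
    for region in regions:
        try:
            return region, allocate(region, instances)
        except CapacityError as exc:
            # Remember why this region failed, then try the next one.
            failures[region] = str(exc)
    raise CapacityError(f"no capacity in any region: {failures}")


def fake_allocate(region, instances):
    """Stand-in for a provider call, backed by an invented capacity table."""
    free = {"europe-west3": 0, "europe-west4": 10}
    if free.get(region, 0) < instances:
        raise CapacityError(f"{region} exhausted")
    return [f"{region}-vm-{i}" for i in range(instances)]


if __name__ == "__main__":
    region, vms = allocate_with_fallback(
        fake_allocate, ["europe-west3", "europe-west4"], 3
    )
    print(region, vms)
```

The same pattern extends to a second cloud or an on-premises datacenter by adding entries to the list, as long as the application itself can run in more than one place.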