Why do Kubernetes Control Planes have an odd number of members?

 June 29, 2023

The question I am asked most often about Kubernetes concerns the number of Control Plane nodes. Sometimes it comes from curiosity about the “unusual number”; other times it is plainly confrontational, since the person would prefer a different number, usually 2.

The first thing to understand is that there are a couple of reasons to choose a certain number of Kubernetes Control Plane nodes over another, and those are:

Performance
Resiliency

The performance case is the easiest to address: increase the number of nodes and/or their size if you think your cluster will need more resources. This means that we need to look into resiliency to find the reasons for those odd numbers. More specifically, the Kubernetes Control Plane is composed of many components, one of which is etcd. Etcd is a distributed key-value store that keeps the state of the whole Kubernetes cluster. Since etcd is the pickiest component of the Kubernetes Control Plane in terms of the number of nodes, the Control Plane tends to have as many nodes as etcd does. This is not a strict requirement, though, since the Control Plane can include nodes where etcd is not running.

Etcd uses the Raft algorithm to ensure consensus within the cluster. Raft ensures consensus via a leader election. To be elected as leader, a node needs to secure the votes of more than half of the cluster’s nodes. For etcd to be readable and writeable, and therefore for Kubernetes to behave as expected, the cluster must be able to elect a leader. Therefore, more than half of the etcd nodes must be online and able to communicate.

With this information, we can quickly derive the number of nodes we can lose before losing capabilities for etcd clusters up to 5 nodes:

etcd with 1 node: min r/w = 1, max lose = 0
etcd with 2 nodes: min r/w = 2, max lose = 0
etcd with 3 nodes: min r/w = 2, max lose = 1
etcd with 4 nodes: min r/w = 3, max lose = 1
etcd with 5 nodes: min r/w = 3, max lose = 2
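
The table above follows directly from the "more than half" rule. As a minimal sketch (the function names `quorum` and `fault_tolerance` are chosen here for illustration and are not part of etcd), the same values can be derived like this:

```python
def quorum(members: int) -> int:
    """Smallest number of nodes that is more than half of the cluster."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of nodes that can be lost while a majority remains."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"etcd with {n} node(s): min r/w = {quorum(n)}, max lose = {fault_tolerance(n)}")
```

Note that going from an odd size to the next even size raises the quorum by one but leaves the number of tolerated failures unchanged.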

We can therefore calculate the probability of each cluster being in a fully functioning state. To do so, we need to make some assumptions:

all etcd nodes are fully independent of each other
all etcd nodes have the same probability of being down at any given time: let’s say 1% to make the math simple

We can use binomial probability to calculate the expected uptime for the various cluster sizes:

etcd with 1 node: 99.0000%
etcd with 2 nodes: 98.0100%
etcd with 3 nodes: 99.9702%
etcd with 4 nodes: 99.9408%
etcd with 5 nodes: 99.9990%
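
For readers who want to reproduce these figures, here is a minimal sketch of the binomial calculation under the two assumptions above (independent nodes, each down 1% of the time). The function name `expected_uptime` is illustrative only. The cluster counts as up when at least a majority of its nodes is up:

```python
from math import comb


def expected_uptime(members: int, node_failure: float = 0.01) -> float:
    """Probability that a majority of independent nodes is up."""
    quorum = members // 2 + 1
    node_up = 1.0 - node_failure
    # Sum the binomial probabilities of having exactly k nodes up,
    # for every k that still forms a majority.
    return sum(
        comb(members, k) * node_up**k * node_failure ** (members - k)
        for k in range(quorum, members + 1)
    )


for n in range(1, 6):
    print(f"etcd with {n} node(s): {expected_uptime(n):.4%}")
```

Changing `node_failure` lets you rerun the comparison with a failure rate closer to your own environment.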

Looking at the expected uptime, we can see that it tends to increase with the number of nodes. However, a cluster with an even number of nodes has a lower expected uptime than the configuration with one node fewer. This can also be deduced logically: both clusters can lose the same number of nodes, but the larger one has one more node that can fail.

It is important to note that those numbers are here just to explain the concept and are not representative of the expected uptime of any real-world etcd cluster, since the two assumptions need to be adjusted for every situation. I would like to put particular emphasis on the first point: “all etcd nodes are fully independent of each other”. This is critical because it describes a very abstract situation, which is very hard to find in real life.

This assumption often becomes a contentious point when people have two data centers and want to put two etcd nodes in one and one in the other. This approach creates a fairly strong correlation between the two co-located etcd nodes: if their data center goes down, both are lost at once and the remaining node cannot form a majority. This heavily impacts the expected uptime.
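
To put a rough number on this, here is a sketch of the two-plus-one layout under an extra assumption that is mine and purely illustrative: each data center is itself unavailable 1% of the time, independently of the other and of the individual node failures. The function name and the `site_failure` parameter are hypothetical.

```python
def uptime_two_plus_one(node_failure: float = 0.01, site_failure: float = 0.01) -> float:
    """Expected uptime of 3 etcd nodes split 2 + 1 across two data centers.

    Assumes (hypothetically) that sites fail independently of each other
    and of the nodes they host.
    """
    node_up = 1.0 - node_failure
    site_up = 1.0 - site_failure
    # Both sites up: any 2 of the 3 nodes form a majority.
    both_sites = node_up**3 + 3 * node_up**2 * node_failure
    # Only the two-node site up: both of its nodes must be up.
    only_big_site = node_up**2
    # Two-node site down: at most one node is left, so there is never a majority.
    return site_up * (site_up * both_sites + site_failure * only_big_site)


print(f"2 + 1 across two data centers: {uptime_two_plus_one():.4%}")
```

With these illustrative values the result is roughly 98.95%, which is lower than the 99% of a single independent node, because the whole cluster can never be more available than the data center hosting two of its three nodes.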

Another typical example of something that breaks independence is the Software Defined Network (SDN). If your etcd nodes communicate via an SDN and the SDN goes down, all your etcd nodes will be isolated and therefore unable to run elections properly. As a result, the SDN’s expected uptime directly impacts your etcd cluster’s expected uptime.
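
As a small sketch of that effect, assume the SDN fails independently of the etcd nodes and has a hypothetical uptime of 99.9% (both the figure and the function name are illustrative, not measurements). The combined uptime is then the product of the two:

```python
def uptime_with_shared_network(cluster_uptime: float, network_uptime: float) -> float:
    """Uptime of a cluster that only works while its shared network is up.

    Assumes the network fails independently of the cluster nodes.
    """
    return cluster_uptime * network_uptime


three_node_uptime = 0.999702  # 3-node figure from the list above
sdn_uptime = 0.999            # hypothetical SDN uptime
print(f"{uptime_with_shared_network(three_node_uptime, sdn_uptime):.4%}")
```

Whatever the cluster size, the result can never exceed the uptime of the shared network itself.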

I hope that this simple math and these examples help you evaluate the expected uptime of your etcd and Kubernetes clusters and, eventually, design clusters that match your uptime expectations.
