Avoid two groups of servers making independent decisions by requiring majority for taking every decision.
In a distributed system, whenever a server takes any action, it needs to ensure that in the event of a crash the results of the actions are available to the clients. This can be achieved by replicating the result to other servers in the cluster. But that leads to the question: How many other servers need to confirm the replication before the original server can be confident that the update is fully recognized? If the original server waits for too many replications, then it will respond slowly—reducing liveness. But if it doesn't have enough replications, then the update could be lost—a failure of safety. It's critical to balance between the overall system performance and system integrity.
A cluster agrees that it's received an update when a majority of the nodes in the cluster have acknowledged the update. We call this number a quorum. So if we have a cluster of five nodes, we need a quorum of 3. For a cluster of n nodes, the quorum is n / 2 + 1.
The need for a quorum indicates how many failures can be tolerated—which is the size of the cluster minus the quorum. A cluster of five nodes can tolerate two of them failing. In general, if we want to tolerate f failures we need a cluster size of 2f + 1.
for more details go to Chapter 08 of the online ebook at oreilly.com
This pattern is part of Patterns of Distributed Systems
23 November 2023