This pattern is part of Patterns of Distributed Systems

Paxos

Use two consensus building phases to reach safe consensus even when nodes disconnect

05 January 2022

Problem

When multiple nodes share state, they often need to agree between themselves on a particular value. With Leader and Followers, the leader decides and passes its value to the followers. But if there is no leader, then the nodes need to determine a value themselves. (Even with a leader-follower, they may need to do this to elect a leader.)

A leader can ensure replicas safely acquire an update by using Two Phase Commit, but without a leader we can have competing nodes attempt to gather a Quorum. This process is further complicated because any node may fail or disconnect. A node may achieve quorum on a value, but disconnect before it is able to communicate this value to the entire cluster.

Solution

The Paxos algorithm was developed by Leslie Lamport, published in his 1998 paper The Part-Time Parliament. Paxos works in three phases to make sure multiple nodes agree on the same value in spite of partial network or node failures. The first two phases act to build consensus around a value, the last phase then communicates that consensus to the remaining replicas.

  • Prepare phase: establish the latest Generation Clock and gather any already accepted values.
  • Accept phase: propose a value for this generation for replicas to accept.
  • Commit Phase: let all the replicas know that a value has been chosen.

In the first phase (called prepare phase), the node proposing a value (called a proposer) contacts all the nodes in the cluster (called acceptors) and asks them if they will promise to consider its value. Once a quorum of acceptors return such a promise, the proposer moves onto the second phase. In the second phase (called the accept phase) the proposer sends out a proposed value, if a quorum [1] of nodes accepts this value then the value is chosen. In the final phase (called the commit phase), the proposer can then commit the chosen value to all the nodes in the cluster.

Flow of the Protocol

Paxos is a difficult protocol to understand. We'll start by showing an example of a typical flow of the protocol, and then dig into some of the details of how it works. We intend this explanation to provide an intuitive sense of how the protocol works, but not as a comprehensive description to base an implementation upon.

Here's a very brief summary of the protocol.

Proposer

Acceptor

Obtains the next generation number from a Generation Clock. Sends a prepare request with this generation number to all acceptors.

If the generation number of the prepare request is later than its promised generation variable, it updates its promise generation with this later value and returns a promise response. If it has already accepted a proposal it returns this proposal.

When it receives promises from quorum of acceptors, it looks to see if any of these responses contain accepted values. If so it changes its own proposed value to that of the returned proposal with the highest generation number. Sends accept requests to all acceptors with its generation number and proposed value.

If the generation number of the accept request is later than its promised generation variable it stores the proposal as its accepted proposal and responds that it has accepted the request.

When it receives a successful response from a quorum of acceptors, it records the value as chosen and sends commit messages to all nodes.

Those are basic rules for paxos, but it's very hard to understand how they combine for an effective behavior. So here's an example to show how this works.

Consider a cluster of five nodes: Athens, Byzantium, Cyrene, Delphi, and Ephesus. A client contacts the Athens node, requesting to set the name to "alice". The Athens node now needs to initiate a Paxos interaction to see if all the nodes will agree to this change. Athens is called the proposer, in that Athens will propose to all the other nodes that the name of the cluster become "alice". All the nodes in the cluster (including Athens) are "acceptors", meaning they are capable of accepting proposals.

At the same time that Athens is proposing "alice", the node Ephesus gets a request to set the name to "elanor". This makes Ephesus also be a proposer.

In the prepare phase the proposers begin by sending some prepare requests, which all include a generation number. Since Paxos is intended to avoid single points of failure, we don't take this from a single generation clock. Instead each node maintains its own generation clock where it combines a generation number with a node ID. The node ID is used to break ties, so [2,a] > [1,e] > [1,a]. Each acceptor records the latest promise it's seen so far.

NodeAthensByzantiumCyreneDelphiEphesus
promised generation1,a1,a01,e1,e
accepted valuenonenonenonenonenone

Since they haven't seen any requests before this, they all return a promise to the calling proposer. We call the returned value a "promise" because it indicates that the acceptor promises to not consider any messages with an earlier generation clock than the promised one.

Athens sends its prepare message to Cyrene. When it receives a promise in return, this means it has now got promises from from three of the five nodes, which represents a Quorum. Athens now shifts from sending prepare messages to sending accept messages.

It is possible that Athens fails to receive a promise from a majority of the cluster nodes. In that case Athens retries the prepare request by incrementing the generation clock.

NodeAthensByzantiumCyreneDelphiEphesus
promised generation1,a1,a1,a1,e1,e
accepted valuenonenonenonenonenone

Athens now starts sending accept messages, containing the generation and the proposed value. Athens and Byzantium accept the proposal.

NodeAthensByzantiumCyreneDelphiEphesus
promised generation1,a1,a1,a1,e1,e
accepted valuealicealicenonenonenone

Ephesus now sends a prepare message to Cyrene. Cyrene had sent a promise to Athens, but Ephesus's request has a higher generation, so it takes precedence. Cyrene sends back a promise to Ephesus.

Cyrene now gets an accept request from Athens but rejects it as the generation number is behind its promise to Ephesus.

NodeAthensByzantiumCyreneDelphiEphesus
promised generation1,a1,a1,e1,e1,e
accepted valuealicealicenonenonenone

Ephesus has now got a quorum from its prepare messages, so can move on to sending accepts. It sends accepts to itself and to Delphi but then crashes before it can send any more accepts.

NodeAthensByzantiumCyreneDelphiEphesus
promised generation1,a1,a1,e1,e1,e
accepted valuealicealicenoneelanorelanor

Meanwhile, Athens has to deal with the rejection of its accept request from Cyrene. This indicates that its quorum is no longer promised to it and thus its proposal will fail. This will always happen to a proposer who loses its initial quorum like this; for another proposer to achieve quorum at least one member of the first proposer's quorum will defect.

In a situation with a simple two phase commit, we would then expect Ephesus to just go on and get its value chosen, but such a scheme would now be in trouble since Ephesus has crashed. If it had a lock on a quorum of acceptors, its crash would deadlock the whole proposal process. Paxos, however, expects this kind of thing to happen, so Athens will make another try, this time with a higher generation.

It sends prepare messages again, but this time with a higher generation number. As with the first round, it gets back a trio of promises, but with an important difference. Athens already accepted "alice" earlier, and Delphi had accepted "elanor". Both of these acceptors return a promise, but also the value that they already accepted, together with the generation number of that accepted proposal. When they return that value, they update their promised generation to [2,a] to reflect the promise they made to Athens.

NodeAthensByzantiumCyreneDelphiEphesus
promised generation2,a1,a2,a2,a1,e
accepted valuealicealicenoneelanorelanor

Athens, with a quorum, must now move onto the accept phase, but it must propose the already-accepted value with the highest generation, which is "elanor", who was accepted by Delphi with a generation of [1,e], which is greater than Athens's acceptance of "alice" with [1,a].

Athens starts to send out accept requests, but now with "elanor" and its current generation. Athens sends an accept request to itself, which is accepted. This is a crucial acceptance because now there are three nodes accepting "elanor", which is a quorum for "elanor", therefore we can consider "elanor" to be the chosen value.

NodeAthensByzantiumCyreneDelphiEphesus
promised generation2,a1,a2,a2,a1,e
accepted valueelanoralicenoneelanorelanor

But although "elanor" is now the chosen value, nobody is yet aware of it. Within the accept stage Athens only knows itself having "elanor" as the value, which isn't a quorum and Ephesus is offline. All Athens needs to do is have a couple more accept requests accepted and it will be able to commit. But now Athens crashes.

At this point Athens and Ephesus have now crashed. But the cluster still has a quorum of nodes operating, so they should be able to keep working, and indeed by following the protocol they can discover that "elanor" is the chosen value.

Cyrene gets a request to set the name to "carol", so it becomes a proposer. It's seen generation [2,a] so it kicks off a prepare phase with generation [3,c]. While it wishes to propose "carol" as the name, for the moment it's just issuing prepare requests.

Cyrene sends prepare messages to the remaining nodes in the cluster. As with Athens's earlier prepare phase, Cyrene gets accepted values back, so "carol" never gets proposed as a value. As before, Delphi's "elanor" is later than Byzantium's "alice", so Cyrene starts an accept phase with "elanor" and [3,c].

NodeAthensByzantiumCyreneDelphiEphesus
promised generation2,a3,c3,c3,c1,e
accepted valueelanoralicenoneelanorelanor

While I could continue to crash and wake up nodes, it's clear now that "elanor" will win out. As long as a quorum of nodes are up, at least one of them will have "elanor" as its value, and any node attempting a prepare will have to contact one node that's accepted "elanor" in order to get a quorum for its prepare phase. So we'll finish with Cyrene sending out commits.

At some point Athens and Ephesus will come back online and they will discover what the quorum has chosen.

Requests don't need to be rejected

In the example above, we saw acceptors rejecting requests with an aged generation. But the protocol does not require an explicit rejection like this. As formulated, an acceptor may just ignore an out-of-date request. If this is the case, then the protocol will still converge on a single consensus value. This is an important feature of the protocol because, as this is a distributed system, connections can be lost at any time, so it's beneficial to not be dependent on rejections to ensure the safety of the protocol. (Safety here meaning that the protocol will choose only one value, and once chosen, it won't be overwritten.)

Sending rejections, however, is still useful as it improves performance. The quicker proposers find out they are old, the sooner they can start another round with a higher generation.

Competing proposers may fail to choose

One way this protocol can go wrong is if two (or more) proposers get into a cycle.

  • alice is accepted by athens and byzantium
  • elanor is prepared by all nodes, preventing alice from gaining quorum
  • elanor is accepted by delphi and ephesus
  • alice is prepared by all nodes, preventing elanor from gaining quorum.
  • alice is accepted by athens and byzantium
  • ... and so on, a situation called a livelock

The FLP Impossibility Result shows that even a single faulty node can stop a cluster from ever choosing a value.

We can reduce the chances of this livelock happening by ensuring that whenever a proposer needs to choose a new generation, it must wait a random period of time. This randomness makes it likely that one proposer will be able to get a quorum accepted before the other sends a prepare request to the full quorum.

But we can never ensure that livelock can't happen. This is a fundamental trade-off: we can either ensure safety or liveness, but not both. Paxos ensures safety first.

An example key-value store

The Paxos protocol explained here, builds consensus on a single value (often called as single-decree Paxos). Most practical implementations used in mainstream products like Cosmos DB or Spanner use a modification of paxos called multi-paxos which is implemented as a Replicated Log.

But a simple key-value store can be built using basic Paxos. [cassandra] uses basic Paxos in a similar way to implement it's light-weight transactions.

The key-value store maintains Paxos instance per key.

class PaxosPerKeyStore…

  int serverId;
  public PaxosPerKeyStore(int serverId) {
      this.serverId = serverId;
  }

  Map<String, Acceptor> key2Acceptors = new HashMap<String, Acceptor>();
  List<PaxosPerKeyStore> peers;

The Acceptor stores the promisedGeneration, acceptedGeneration and acceptedValue.

class Acceptor…

  public class Acceptor {
      MonotonicId promisedGeneration = MonotonicId.empty();
  
      Optional<MonotonicId> acceptedGeneration = Optional.empty();
      Optional<Command> acceptedValue = Optional.empty();
  
      Optional<Command> committedValue = Optional.empty();
      Optional<MonotonicId> committedGeneration = Optional.empty();
  
      public AcceptorState state = AcceptorState.NEW;
      private BiConsumer<Acceptor, Command> kvStore;

When the key and value is put in the kv store, it runs the Paxos protocol.

class PaxosPerKeyStore…

  int maxKnownPaxosRoundId = 1;
  int maxAttempts = 4;
  public void put(String key, String defaultProposal) {
      int attempts = 0;
      while(attempts <= maxAttempts) {
          attempts++;
          MonotonicId requestId = new MonotonicId(maxKnownPaxosRoundId++, serverId);
          SetValueCommand setValueCommand = new SetValueCommand(key, defaultProposal);

          if (runPaxos(key, requestId, setValueCommand)) {
              return;
          }

          Uninterruptibles.sleepUninterruptibly(ThreadLocalRandom.current().nextInt(100), MILLISECONDS);
          logger.warn("Experienced Paxos contention. Attempting with higher generation");
      }
      throw new WriteTimeoutException(attempts);
  }

  private boolean runPaxos(String key, MonotonicId generation, Command initialValue) {
      List<Acceptor> allAcceptors = getAcceptorInstancesFor(key);
      List<PrepareResponse> prepareResponses = sendPrepare(generation, allAcceptors);
      if (isQuorumPrepared(prepareResponses)) {
          Command proposedValue = getValue(prepareResponses, initialValue);
          if (sendAccept(generation, proposedValue, allAcceptors)) {
              sendCommit(generation, proposedValue, allAcceptors);
          }
          if (proposedValue == initialValue) {
              return true;
          }
      }
      return false;
  }

  public Command getValue(List<PrepareResponse> prepareResponses, Command initialValue) {
      PrepareResponse mostRecentAcceptedValue = getMostRecentAcceptedValue(prepareResponses);
      Command proposedValue
              = mostRecentAcceptedValue.acceptedValue.isEmpty() ?
              initialValue : mostRecentAcceptedValue.acceptedValue.get();
      return proposedValue;
  }

  private PrepareResponse getMostRecentAcceptedValue(List<PrepareResponse> prepareResponses) {
      return prepareResponses.stream().max(Comparator.comparing(r -> r.acceptedGeneration.orElse(MonotonicId.empty()))).get();
  }

class Acceptor…

  public PrepareResponse prepare(MonotonicId generation) {

      if (promisedGeneration.isAfter(generation)) {
          return new PrepareResponse(false, acceptedValue, acceptedGeneration, committedGeneration, committedValue);
      }
      promisedGeneration = generation;
      state = AcceptorState.PROMISED;
      return new PrepareResponse(true, acceptedValue, acceptedGeneration, committedGeneration, committedValue);

  }

class Acceptor…

  public boolean accept(MonotonicId generation, Command value) {
      if (generation.equals(promisedGeneration) || generation.isAfter(promisedGeneration)) {
          this.promisedGeneration = generation;
          this.acceptedGeneration = Optional.of(generation);
          this.acceptedValue = Optional.of(value);
          return true;
      }
      state = AcceptorState.ACCEPTED;
      return false;
  }

The value is stored in the kvstore only when it can be successfully committed.

class Acceptor…

  public void commit(MonotonicId generation, Command value) {
      committedGeneration = Optional.of(generation);
      committedValue = Optional.of(value);
      state = AcceptorState.COMMITTED;
      kvStore.accept(this, value);
  }

class PaxosPerKeyStore…

  private void accept(Acceptor acceptor, Command command) {
      if (command instanceof SetValueCommand) {
          SetValueCommand setValueCommand = (SetValueCommand) command;
          kv.put(setValueCommand.getKey(), setValueCommand.getValue());
      }
      acceptor.resetPaxosState();
  }

The paxos state needs to be persisted. It can be easily done by using a Write-Ahead Log.

Handling multiple values.

It is important to note that Paxos is specified and proven to work on single value. So handling multiple values with the single value Paxos protocol needs to be done outside of the protocol specification. One alternative is to reset the state, and store committed values separately to make sure they are not lost.

class Acceptor…

  public void resetPaxosState() {
      //This implementation has issues if committed values are not stored
      //and handled separately in the prepare phase.
      //See Cassandra implementation for details.
      //https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SystemKeyspace.java#L1232
      promisedGeneration = MonotonicId.empty();
      acceptedGeneration = Optional.empty();
      acceptedValue = Optional.empty();
  }

There is an alternative, as suggested in [gryadka], which slightly modifies the basic Paxos to allow setting multiple values. This need for executing steps beyond the basic algorithm is the reason that in practice Replicated Log is preferred.

Reading the values

Paxos relies on the prepare phase to detect any uncommitted values. So if basic Paxos is used to implement a key-value store as shown above, the read operation also needs to run the full Paxos algorithm.

class PaxosPerKeyStore…

  public String get(String key) {
      int attempts = 0;
      while(attempts <= maxAttempts) {
          attempts++;
          MonotonicId requestId = new MonotonicId(maxKnownPaxosRoundId++, serverId);
          Command getValueCommand = new NoOpCommand(key);
          if (runPaxos(key, requestId, getValueCommand)) {
              return kv.get(key);
          }

          Uninterruptibles.sleepUninterruptibly(ThreadLocalRandom.current().nextInt(100), MILLISECONDS);
          logger.warn("Experienced Paxos contention. Attempting with higher generation");

      }
      throw new WriteTimeoutException(attempts);
  }

Examples

[cassandra] uses Paxos to implement light-weight transaction.

All the consensus algorithms like Raft use basic concepts similar to the basic Paxos. The use of Two Phase Commit, Quorum and Generation Clock is used in a similar manner.

Notes

1: Flexible Paxos

The original description of Paxos requires majority Quorum in both the prepare and the accept phases. Some recent work by Heidi Howard and others show that the main requirement of Paxos is to have overlap in the quorums of the prepare and the accept phase. As long as this requirement is fulfilled, it does not require a majority Quorum in both the phases.

Significant Revisions

05 January 2022: