Rao, Shekita and Tata (2011). Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore [Spinnaker]

Mon Feb 08 2021


Overview of the paper

Partitioning and load balancing

Two-way replication doesn't work. Three-way replication works, but more complicated failures become possible. The Paxos family of protocols is widely considered to be the only proven solution when there are 3 or more replicas. Paxos solves the general problem of reaching consensus on the state of 2F+1 replicas while tolerating up to F failures.
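As a sketch of the 2F+1 arithmetic (my own illustration, not from the paper): any two majorities of 2F+1 replicas must intersect, so a majority quorum survives up to F failures.

```python
# Minimal illustration of Paxos quorum sizes: with 2F+1 replicas,
# any two majorities intersect, so up to F failures are tolerated.
def majority(n_replicas):
    """Smallest quorum size such that any two quorums overlap."""
    return n_replicas // 2 + 1

for f in range(1, 4):
    n = 2 * f + 1               # 2F+1 replicas
    q = majority(n)             # quorum needed to commit
    print(f"replicas={n} quorum={q} tolerated failures={n - q}")
```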

Idea: three nodes replicate the keys of a particular key range.

A cohort of a key range [i, j] is the set of nodes that replicates that key range. So the cohort of the key range [0, 199] is the set {A, B, C}.
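The range-to-cohort mapping can be sketched like this (the ranges and node names are illustrative, matching the [0, 199] → {A, B, C} example; the real assignment comes from Spinnaker's metadata):

```python
# Hypothetical sketch of key-based range partitioning: each key range
# maps to a cohort, the three nodes that replicate that range.
RANGES = [(0, 199), (200, 399), (400, 599)]
COHORTS = {
    (0, 199):   {"A", "B", "C"},
    (200, 399): {"B", "C", "D"},
    (400, 599): {"C", "D", "E"},
}

def cohort_for_key(key):
    """Return the set of nodes responsible for `key`."""
    for (lo, hi) in RANGES:
        if lo <= key <= hi:
            return COHORTS[(lo, hi)]
    raise KeyError(key)

print(cohort_for_key(150))  # the cohort of range [0, 199]
```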

Each node in Spinnaker contains several components. All components are thread safe.

There's a single write-ahead log, shared amongst all the cohorts on a node; each log record is uniquely identified by an LSN. Each cohort on a node uses its own logical LSNs.

The commit queue is a main-memory data structure that is used to track pending writes. Writes are committed only after receiving a sufficient number of acks from a cohort.

Committed writes are placed in a memtable, which is sorted and flushed to an immutable disk structure called an SSTable. (Sorted String table)
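A minimal sketch of the memtable-to-SSTable hand-off, under my own simplifying assumptions (no compaction, no on-disk encoding; a sorted tuple stands in for the immutable file):

```python
# Sketch: writes land in a sorted in-memory table; a flush emits an
# immutable sorted run (the SSTable) and resets the memtable.
class MemTable:
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def flush(self):
        """Produce an immutable, key-sorted run and start fresh."""
        sstable = tuple(sorted(self.data.items()))
        self.data = {}
        return sstable

mt = MemTable()
mt.put("b", 2)
mt.put("a", 1)
print(mt.flush())  # (('a', 1), ('b', 2))
```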

What is a write-ahead log?

Write-ahead log: before you perform an action, you save it to disk first

The MemTable holds the most recent state; on recovery, we want to restore that state, which is what the log makes possible.
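A toy illustration of the log-then-apply discipline and recovery (a Python list stands in for the on-disk log; the LSNs are the sequence numbers):

```python
# Hedged sketch of write-ahead logging: append the record (with its
# LSN) to the log before touching the memtable, so recovery can replay.
log = []        # stands in for the durable on-disk log
memtable = {}

def put(lsn, key, value):
    log.append((lsn, key, value))   # 1. durably log first
    memtable[key] = value           # 2. then apply in memory

def recover():
    """Rebuild the memtable by replaying the log in LSN order."""
    rebuilt = {}
    for _, key, value in sorted(log):
        rebuilt[key] = value
    return rebuilt

put(1, "x", "v1")
put(2, "x", "v2")
assert recover() == {"x": "v2"}
```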

How does SSTable work?

If it's immutable, how are updates handled?

What is Zookeeper?

Zookeeper is a "fault-tolerant, distributed coordination service".

The only messages exchanged between Zookeeper and a Spinnaker node are heartbeats.

Why don't we use Zookeeper for the entirety of Spinnaker?

What is the difference between strong consistency, eventual consistency, and timeline consistency?

Strong consistency vs eventual consistency

Strong consistency guarantees that all replicas appear identical to applications.


Dynamo uses eventual consistency to provide high availability and partition tolerance. Dynamo is AP, so applications may see multiple versions of the same data.

Spinnaker is CA.

Spinnaker uses the following:

  • key based range partitioning,
  • 3 way replication,
  • transactional get-put API with the option to choose either strong or timeline consistency on reads.

Spinnaker is not robust to network partitions. It assumes deployment within a single datacenter, where network partitions are rare.

"A data partition will be available for reads and writes as long as a majority of the nodes containing its replicas are alive."

Bigtable is CP apparently.

Optimistic concurrency control

OCC: the idea that data races are unlikely. Assumption: low data contention, so operations proceed without locks and are validated by version at commit time.

To increment a counter c,

c = get(key, "c", consistent = true)
ret = conditionalPut(key, "c", c.value+1, c.version)

and the conditionalPut fails if c.version no longer matches the stored version.
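The get/conditionalPut pair above is naturally wrapped in a retry loop. A self-contained sketch, where `get` and `conditional_put` are hypothetical in-memory stand-ins for Spinnaker's API:

```python
import collections

# Hedged sketch of OCC counter increment with retry; the store and API
# here are stand-ins, not Spinnaker's actual implementation.
Versioned = collections.namedtuple("Versioned", ["value", "version"])
store = {"c": Versioned(0, 0)}

def get(col):
    return store[col]

def conditional_put(col, value, expected_version):
    cur = store[col]
    if cur.version != expected_version:
        return False                    # someone else won the race
    store[col] = Versioned(value, cur.version + 1)
    return True

def increment(col):
    while True:                         # retry until our version wins
        cur = get(col)
        if conditional_put(col, cur.value + 1, cur.version):
            return

increment("c")
increment("c")
assert store["c"].value == 2
```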

Two-phase commit

A type of atomic protocol:

  1. Commit-request phase (or voting phase)
  2. If all vote yes, then commit.

Two-phase commit is overkill here: if any single node fails, the transaction aborts, so it is not fault-tolerant.

2PC blocks when the coordinator fails.

Replication protocol

Two phases (in Paxos): leader election, then quorum.

When a client submits a write W, the leader appends W to its log, forwards it to the followers, and tracks it in the commit queue.

When the leader gets an ack from at least one follower (so a majority, 2 of 3, have logged W), it commits W and applies it to its MemTable.

Periodically, the leader sends the followers an async commit message asking them to apply all pending writes up to a certain LSN to their memtables, effectively committing those writes.

Key guarantee: Anything written to the MemTable is guaranteed not to be lost.
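A simplified sketch of the leader's commit rule (leader election, the log, and the real commit-queue structure are elided; the majority arithmetic is the point):

```python
# Simplified sketch of the write path: the leader logs W, forwards it,
# and commits once a majority (leader + at least 1 of 2 followers, for
# a 3-node cohort) has logged it.
class Leader:
    def __init__(self, cohort_size=3):
        self.cohort_size = cohort_size
        self.commit_queue = {}   # lsn -> [write, acks so far]

    def submit(self, lsn, write):
        # The leader's own log append counts as one ack.
        self.commit_queue[lsn] = [write, 1]

    def on_follower_ack(self, lsn):
        entry = self.commit_queue[lsn]
        entry[1] += 1
        if entry[1] >= self.cohort_size // 2 + 1:  # majority: 2 of 3
            return entry[0]      # committed: apply to the MemTable
        return None              # still pending

leader = Leader()
leader.submit(1, "W")
assert leader.on_follower_ack(1) == "W"  # one follower ack -> majority
```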

Strong consistency vs timeline consistency

Strongly consistent reads are always routed to the cohort's leader, so they are guaranteed to see the latest value for W. However, timeline consistent reads are routed to any node in the cohort, so they may see a stale value for W.
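The routing rule, as a toy sketch (node names are illustrative):

```python
import random

# Sketch of read routing: strong reads always go to the cohort's
# leader; timeline reads may hit any replica and can return stale data.
COHORT = {"leader": "A", "followers": ["B", "C"]}

def route_read(consistent):
    if consistent:
        return COHORT["leader"]   # guaranteed to see the latest value
    return random.choice([COHORT["leader"]] + COHORT["followers"])
```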

Why is Spinnaker not partition-tolerant?

Where is the leader info stored?

How do we ensure that all committed writes are saved?

Leader election

Done on a per-cohort basis, coordinated via shared state stored in Zookeeper.

Two guarantees:

  1. It will always terminate with an elected leader
  2. No committed writes are lost

How do we know when the leader is down?

Zookeeper tracks liveness via heartbeats: when the leader's session expires, its "ephemeral znode" at /r/leader is deleted.
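A toy simulation of ephemeral-znode election (not the real Zookeeper API, just the session/znode lifecycle as I understand it):

```python
# Toy model: the first node to create /r/leader wins; when its session
# expires, the ephemeral znode vanishes and survivors race to recreate it.
znodes = {}   # path -> (owner session, node name)

def try_become_leader(path, session, node):
    if path in znodes:
        return False                  # someone already holds the znode
    znodes[path] = (session, node)    # ephemeral: tied to the session
    return True

def session_expired(session):
    """Zookeeper deletes the ephemeral znodes of an expired session."""
    for path in [p for p, (s, _) in znodes.items() if s == session]:
        del znodes[path]

assert try_become_leader("/r/leader", "s1", "A")
assert not try_become_leader("/r/leader", "s2", "B")  # A still alive
session_expired("s1")                 # A's heartbeats stop
assert try_become_leader("/r/leader", "s2", "B")      # B takes over
```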

How does the leader election in Zookeeper differ from the leader election in Raft?

A cohort will not lose committed data even if 2 out of its 3 nodes permanently fail.

What is partition tolerance really about?

Is a majority of nodes failing an availability thing or a network partition thing?

What is availability?

Every request received by a non-failing node in the system must result in a response.

Which part of

Is Zookeeper a CA system? Why don't you use Zookeeper as the data store?

What are the guarantees of Zookeeper?