分布式系统学习

开场白

水了那么久的博客，这次来点稍微硬核的吧。

来读一下Leslie Lamport 的论文《Paxos Made Simple》。

全程坚持读下来，我能保证读者明白什么是Paxos。

开干！

(～￣▽￣)～

Paxos算法注解

参阅 Leslie Lamport 的论文《Paxos Made Simple》。

三类角色

proposers
acceptors
learners

算法要求(requirements)

An acceptor can accept a proposal numbered n if it has not responded to a prepare request having a number greater than n.
For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.
(optimization) An acceptor needs to remember only the highest numbered proposal that it has ever accepted and the number of the highest numbered prepare request to which it has responded, while the proposer can always abandon a proposal and forget all about it——as long as it never tries to issue another proposal with the same number.
(optimization) If an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, then it should probably inform the proposer, who should then abandon its proposal.

Acceptor可以接受number为n的proposal，如果它之前没有回复任何number大于n的prepare request。

对于任意的v和n，如果有一个value为v，number为n的proposal被提出了，那么就存在一个由majority个acceptor组成的集合S。且S满足以下条件之一：(a) S中没有一个acceptor接受过number小于n的proposal；(b) v是S中的acceptor接受过的、所有满足“number小于n”的proposal中，number最大的proposal的value。

(优化) Acceptor只需要记住它接受过的、number最大的proposal以及它回复过的、number最大的prepare request的number。而proposer总能随时放弃一个proposal，只要它不再发送另一个具有相同number的proposal。

(优化) 如果一个proposal因为已经接收^[*]了(此处为received而非accepted)更高number的proposal而忽略处于prepare request或accept request阶段的proposal时，那么它应该通知对应的proposer放弃该proposal。

满足要求前需要满足的假设(assumptions)

Different proposals have different numbers.
The assumption of non-Byzantine failures.

不同proposal要能够有不同的序列号。

不存在拜占庭式故障。

算法两阶段表示

Phase 1.

(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.

(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

(a) Proposer选择一个number为n的proposal并发送包含其信息的prepare request给acceptors的majority集合。

(b) 如果一个acceptor接收了一个number为n的prepare request，且n大于以往它回复的perpare request的number，那么它将承诺不再接受^[*]任何number小于n的proposal，并且回复其已经接受过的(如果有的话)最大number的proposal。

Phase 2.

(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.

(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

(a) 如果proposer接收了来自majority对其number为n的proposal的回应，那么它接下来将给包含在majority里的acceptors发送一个value为v，number为n的accept request，其中v要么是其在perpare request阶段收到的回复中number最大proposal里的value，要么是由proposer自己设定的值(在回复中没有proposal的情况下)。

(b) 当acceptor接收了一个number为n的accept request时，如果它没有对number大于n的prepare request进行过回复，那么就接受该accept request。

Learners要做的

The acceptors could respond with their acceptances to some set of distinguished learners, each of which can then inform all the learners when a value has been chosen. Using a larger set of distinguished learners provides greater reliability at the cost of greater communication complexity.

每个acceptor将它接受的proposal发送给其设定好的一个distinguished learners的集合，这个集合将会在任何一个value被选中时通知所有的learner。

Progress取得进展

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen. Proposer p completes phase 1 for a proposal number n1. Another proposer q then completes phase 1 for a proposal number n2 > n1. Proposer p’s phase 2 accept requests for a proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2. So, proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored. And so on.

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals.

If enough of the system (proposer, acceptors, and communication net work) is working properly, liveness can therefore be achieved by electing a single distinguished proposer.

如果两个或以上的proposer轮流发送number比对方更大的提案，有可能导致最终它们都未被选中。

为了保证流程的顺利进行，必须选出一个distinguished proposer，作为唯一的proposal发送者。

在各组件正常工作的情况下，通过选出单一的distinguished proposer就能保持系统的活力。

Implementation实现(针对存储)

An acceptor records its intended response in stable storage before actually sending the response.

Different proposers choose their numbers from disjoint sets of numbers, so two different proposers never issue a proposal with the same number. Each proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue, and begins phase 1 with a higher proposal number than any it has already used.

Acceptor会在真正发送回复前将其记录在stable storage中。

不同的proposer会在各自不相交的集合中选择number来保证不会出现带有相同number的proposal被两个proposer同时提出。每个proposer会在stable storage里记住它提出的number最大的proposal，并只可能提出number更大的proposal开始phase 1。

State Machine状态机(Paxos算法的实现)

An implementation that uses a single central server fails if that server fails. We therefore instead use a collection of servers, each one independently implementing the state machine. Because the state machine is deterministic, all the servers will produce the same sequences of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.

一种基于单一中心server的(分布式系统)实现会在该server发生故障时整体故障。

我们因此使用一个server的集合进行代替，其中每个server由一个state machine独立实现。

由于state machine是deterministic(确定性的)，所有server会在执行完相同序列的命令后产生相同的状态和输出。此时client可以在提交某个command后使用其中任意一个server产生的输出。

To guarantee that all servers execute the same sequence of state machine commands, we implement a sequence of separate instances of the Paxos consensus algorithm, the value chosen by the i^th instance being the i^th state machine command in the sequence. Each server plays all the roles (proposer, acceptor, and learner) in each instance of the algorithm. For now, I assume that the set of servers is fixed, so all instances of the consensus algorithm use the same sets of agents.

为了保证所有的server运行相同的命令序列，我们基于Paxos算法创建一个序列的单独实例，其中每个实例的value呈现为一个状态机命令。

在处理任一个实例时，每个server同时扮演所有角色(proposer, acceptor, and learner)。

现在假设这个server的集合是固定的，也即所有实例在同一套agents下运行。

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm. Clients send commands to the leader, who decides where in the sequence each command should appear. If the leader decides that a certain client command should be the 135^th command, it tries to have that command chosen as the value of the 135^th instance of the consensus algorithm.

通常情况下，只有单一server会被选为leader，并充当distinguished proposer。

Client发送命令给leader，leader决定这个命令应该放在序列的哪个位置执行。

例如，如果leader决定将这个命令在序列里排第135，那么它将成为第135个实例的value。

Essentially Optimal算法优越性

Since failure of the leader and election of a new one should be rare events, the effective cost of executing a state machine command——that is, of achieving consensus on the command/value——is the cost of executing only phase 2 of the consensus algorithm. It can be shown that phase 2 of the Paxos consensus algorithm has the minimum possible cost of any algorithm for reaching agreement in the presence of faults. Hence, the Paxos algorithm is essentially optimal.

由于leader的故障和选举都是小概率事件，一致性算法的主要开销在于phase 2的执行。

Paxos一致性算法在所有“能在出现故障情况下达到一致的实现”中有最小的可能开销。

因此，Paxos算法是非常优越的。

初学者容易踩坑的点

Proposers会发出两种request：prepare request和accept request，分别对应准备和接受两个阶段。

算法的两阶段表示是针对一个proposal而言的，即一个proposal从提出到最终被接受需要经过prepare request和accept request两个阶段。

关于算法要求(requirements)里的第二点，其实可以理解为最新proposal可以向前兼容旧的proposal，保留已有的value。(这个要求保证了一致性。)

Value应该是一个抽象，可以实现为某个具体的变量(不单指变量的值，还应该包含这个变量到目前为止经历的完整的生命周期)，也可以是某个状态机命令，还可以表示为某个文件的记录。在上文的State Machine状态机部分中就是以状态机命令形式出现的。

关于算法要求(requirements)里的第三点，注意acceptor要记住的分别是一个proposal和一个number，其中proposal是已经经过accept request被接受了的，它包含以往所有的value的集合；而number则对应在prepare request阶段答应的proposal，它用于判断是否回复其它处于prepare request的proposal以及判断是否接受其它处于accept request的proposal。

算法里acceptor的“接受”(或“接收”)行为可以认为是prepare request和accept request两个阶段共用的(比如算法要求(requirements)里的第四点以及算法两阶段表示的 Phase 1. 的 (b) ，已使用上标“^[*]”进行标注)，也就是说它选择接受处于prepare request或accept request两个阶段的proposal所带有的number必然同时大于其当前保存的proposal的number以及其当前保存的number。(这样的结果在原文里是使用数学归纳法进行证明的。)

Leader的存在保证的是progress(取得进展)，也就是说即使是对于场上不存在leader的情况、或者由于选举过程产生冲突从而导致多个leader同时存在的情况，Paxos算法的一致性和安全性依然能得到保留。

在State Machine状态机部分里，要明确client输入的command与运行实例的value的关系。每一个已输入的command会对应一个经过leader决定成为实例的value，而反之则不然。例如，当一个新的leader被选举出来后，若其已知了第1-135、138-140的实例的value，其只能执行前1-135的value对应的命令而不能运行第138-140的部分，因为第136、137个实例的value还未知(这就是gap)。此时，我们就通过发送两个“noop”空指令(作为实例的value)来填充这个gap，从而使得第1-140的实例都能正常运行。

在phase 1阶段，leader作为distinguished proposer发送proposal是可以“并行”的，即同时发送多个的实例，这些实例使用同一个number。这样做的好处在于，acceptor在第一次收到这些消息时，由于没有需要返回的proposal，因此回复的消息只包含一个“OK”，很大程度提升了效率。

论文算法修改

Leslie Lamport本人承认在该论文中有一个错误，这个错误错误包含在：

A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.) Let’s call this an accept request.

括号里的内容是有额外条件的，需要做一个算法优化：

在acceptor接受一个accept request时，将其记录的number更新为该accept request携带的number。

这个优化也与上标“^[*]”标注部分相对应，这样能使得两个阶段不需要选择完全一致的acceptor集合进行发送，这也就是Leslie Lamport本人在《The Part-Time Parliament》中描述的算法。