# Message Requests
A BFT protocol can tolerate losing only a limited number of messages. In Plenum, messages can legitimately be lost in the following cases:
- A node received more messages than the maximum size of a ZMQ queue.
- A node discarded messages because it was unable to process them.
- A node was disconnected for a short time.
Plenum uses the `MessageReq` class for message requests and `MessageRep` for replies. `MessageReqService` manages receiving, sending and processing of request and reply messages.
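The request/reply pairing can be illustrated with a toy model. This is not Plenum's code: the dataclasses below only loosely mirror the idea that a request names a message type plus parameters identifying the wanted message, and that the reply echoes those parameters alongside the message itself. The field names `msg_type`, `params` and `msg`, the type tag `"PREPREPARE"`, and the `ToyMessageReqService` class with its per-type lookup functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


# Toy stand-ins for MessageReq / MessageRep; field names are assumptions.
@dataclass
class MessageReq:
    msg_type: str           # hypothetical type tag, e.g. "PREPREPARE"
    params: Dict[str, Any]  # identifies the wanted message, e.g. view_no and pp_seq_no


@dataclass
class MessageRep:
    msg_type: str
    params: Dict[str, Any]  # echoed back so the requester can match the reply
    msg: Optional[Any]      # the requested message, or None if the responder lacks it


Lookup = Callable[[Dict[str, Any]], Optional[Any]]


class ToyMessageReqService:
    """Hypothetical service: answers a MessageReq using a per-type lookup function."""

    def __init__(self) -> None:
        self._lookups: Dict[str, Lookup] = {}

    def register(self, msg_type: str, lookup: Lookup) -> None:
        self._lookups[msg_type] = lookup

    def process_message_req(self, req: MessageReq) -> MessageRep:
        lookup = self._lookups.get(req.msg_type)
        # Unknown types and unknown messages both produce a reply with msg=None.
        msg = lookup(req.params) if lookup is not None else None
        return MessageRep(req.msg_type, req.params, msg)
```

For example, registering `lambda p: store.get((p["view_no"], p["pp_seq_no"]))` for `"PREPREPARE"` lets the service answer requests for PrePrepares it has stored and return an empty reply for the rest.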
### Three-phase commit messages
If a node loses many three-phase commit messages, ordering may stall on that node until a checkpoint triggers a catch-up. The message request mechanism addresses this problem: a node requests lost messages from other nodes in the following cases:
- All PrePrepares must be applied sequentially, without gaps. If a node receives a PrePrepare while one or more earlier PrePrepares have not been received or were lost, it requests PrePrepares, Prepares and Commits for every pp_seq_no after the last one it has, up to the received PrePrepare's pp_seq_no - 1. If the number of messages to request exceeds CHK_FREQ (the checkpoint interval), nothing is requested: the node is falling behind and will start a catch-up anyway once it has a quorum of stashed stable checkpoints.
- A node receives a Prepare for which it does not have the corresponding PrePrepare: it requests the PrePrepare for the pp_seq_no of the received Prepare.
- A node receives a PrePrepare that references requests that are not finalized (requests for which the node does not have a quorum of Propagates): it requests Propagates for those requests.
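A minimal sketch of the gap rule from the first case, assuming `last_pp_seq_no` is the highest pp_seq_no the node already has and that the CHK_FREQ limit is compared against the number of missing pp_seq_nos (one PrePrepare/Prepare/Commit set per pp_seq_no). The function name and the default `CHK_FREQ = 100` are illustrative only; the real value comes from Plenum's configuration.

```python
from typing import List

CHK_FREQ = 100  # hypothetical checkpoint interval; Plenum reads the real value from config


def pp_seq_nos_to_request(last_pp_seq_no: int, received_pp_seq_no: int,
                          chk_freq: int = CHK_FREQ) -> List[int]:
    """Return the pp_seq_nos whose PrePrepare/Prepare/Commit should be re-requested."""
    first = last_pp_seq_no + 1
    last = received_pp_seq_no - 1
    if last < first:
        return []  # no gap: the received PrePrepare is the next expected one
    if last - first + 1 > chk_freq:
        return []  # too far behind: wait for the checkpoint-driven catch-up instead
    return list(range(first, last + 1))
```

For instance, a node that has pp_seq_no 5 and receives pp_seq_no 9 would request 6, 7 and 8, while a node 199 batches behind with `chk_freq=100` requests nothing.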
For security, the following rules apply to message requests:
- Lost PrePrepares are requested only from the primary node.
- Lost Commits, Prepares and Propagates are requested from all nodes.
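The three triggers and the routing rules above can be combined into a small planning step. This is a hypothetical sketch, not Plenum's API: `PlannedRequest`, the planning functions and the type tags (`"PREPREPARE"`, `"PREPARE"`, `"COMMIT"`, `"PROPAGATE"`) are invented for illustration, and `recipients=None` stands for "send to all nodes".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Set


@dataclass
class PlannedRequest:
    msg_type: str                   # hypothetical type tag
    params: Dict[str, object]
    recipients: Optional[List[str]]  # None means "send to all nodes"


def recipients_for(msg_type: str, primary: str) -> Optional[List[str]]:
    # PrePrepares are requested only from the primary; everything else from all nodes.
    return [primary] if msg_type == "PREPREPARE" else None


def plan_for_gap(view_no: int, pp_seq_nos: Sequence[int], primary: str) -> List[PlannedRequest]:
    """Case 1: a gap before a received PrePrepare."""
    plans = []
    for seq in pp_seq_nos:
        for msg_type in ("PREPREPARE", "PREPARE", "COMMIT"):
            params = {"view_no": view_no, "pp_seq_no": seq}
            plans.append(PlannedRequest(msg_type, params, recipients_for(msg_type, primary)))
    return plans


def plan_for_orphan_prepare(view_no: int, pp_seq_no: int, primary: str) -> PlannedRequest:
    """Case 2: a Prepare arrived without its PrePrepare."""
    params = {"view_no": view_no, "pp_seq_no": pp_seq_no}
    return PlannedRequest("PREPREPARE", params, recipients_for("PREPREPARE", primary))


def plan_for_unfinalized(req_digests: Sequence[str], finalized: Set[str],
                         primary: str) -> List[PlannedRequest]:
    """Case 3: a PrePrepare references requests without a quorum of Propagates."""
    return [PlannedRequest("PROPAGATE", {"digest": d}, recipients_for("PROPAGATE", primary))
            for d in req_digests if d not in finalized]
```

Keeping the routing decision in one function (`recipients_for`) makes the "PrePrepares only from the primary" rule hard to bypass accidentally, since no trigger chooses recipients on its own.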
### Catchup messages
Lost messages may delay catchup for a long time. The message request mechanism is used in the following cases:
- At the start of catchup, a node requests LedgerStatus messages from other nodes to compare them with its own ledger.
  - Together with the first LedgerStatus request, the node schedules a new request after `config.LedgerStatusTimeout` in case no answer to the first request arrives.
- A node requests a ConsistencyProof up to the median of the ledger sizes reported by other nodes.
  - Together with the first ConsistencyProof request, the node schedules a new request after `config.ConsistencyProofTimeout` in case no answer to the first request arrives.
- If the node collects a quorum (n - f - 1) of LedgerStatuses or a quorum (f + 1) of ConsistencyProofs, the corresponding scheduled request is canceled.
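The schedule-then-cancel behavior and the two quorums can be sketched as below. This is a simplified model, not Plenum's catchup code: `ScheduledReRequest`, `target_ledger_size` and the quorum helpers are hypothetical names, the timer firing is simulated by calling `on_timeout()` directly, and reading "median" as Python's `statistics.median_low` (which always returns one of the reported sizes) is an assumption.

```python
import statistics
from typing import List, Set


def ledger_status_quorum(n: int, f: int) -> int:
    return n - f - 1  # LedgerStatus quorum as stated above


def consistency_proof_quorum(f: int) -> int:
    return f + 1  # ConsistencyProof quorum as stated above


def target_ledger_size(reported_sizes: List[int]) -> int:
    # Assumption: median_low, so the target is a size some node actually reported.
    return statistics.median_low(reported_sizes)


class ScheduledReRequest:
    """Hypothetical pending re-request, cancelled once `quorum` distinct nodes answer."""

    def __init__(self, quorum: int) -> None:
        self.quorum = quorum
        self.senders: Set[str] = set()
        self.cancelled = False

    def on_reply(self, sender: str) -> None:
        self.senders.add(sender)  # duplicate replies from one node count once
        if len(self.senders) >= self.quorum:
            self.cancelled = True

    def on_timeout(self) -> bool:
        """Called when the timeout fires; True means the request should be re-sent."""
        return not self.cancelled
```

In a 4-node pool with f = 1, the LedgerStatus re-request is cancelled after answers from 2 distinct nodes; a repeated answer from the same node does not count twice.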
### ViewChange messages
A lost NewView message prevents the view change from finishing successfully. A node requests NewView messages from all nodes once NEW_VIEW_TIMEOUT has elapsed since the view change started.
- If an answer is received from the expected primary, the node uses it and finishes the view change.
- Otherwise, it uses a quorum (f + 1) of NewView responses from other nodes, finishes the view change and starts catchup.
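The decision rule for NewView responses can be sketched as follows, assuming responses are compared by equality (for example by digest) and that a quorum means f + 1 identical responses. `resolve_new_view` is a hypothetical helper, not a Plenum function; it returns the chosen NewView and whether catchup must follow, or `None` while undecided.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Tuple


def resolve_new_view(replies: Dict[str, Hashable], expected_primary: str,
                     f: int) -> Optional[Tuple[Hashable, bool]]:
    """Return (new_view, start_catchup) once a decision is possible, else None.

    `replies` maps each replying node's name to the NewView it returned.
    """
    if expected_primary in replies:
        return replies[expected_primary], False  # use the primary's own NewView
    if not replies:
        return None
    value, count = Counter(replies.values()).most_common(1)[0]
    if count >= f + 1:
        return value, True  # quorum from other nodes: finish, then catch up
    return None  # keep waiting for more responses
```

A quorum of f + 1 guarantees that at least one honest node vouches for the chosen NewView, which is why the node can proceed without the primary's answer but still catches up afterwards.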
A lost ViewChange message can also prevent the view change from finishing successfully. After receiving a NewView message, a node requests any missing ViewChange messages from all nodes.
- If an answer is received from the owner of the requested ViewChange, the node uses it and finishes the view change.
- Otherwise, it uses a quorum (f + 1) of ViewChange responses from other nodes and finishes the view change.
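A sketch of how missing ViewChange messages might be detected and accepted. It assumes the NewView lists the ViewChange messages it is based on as (owner, digest) pairs and that replies are compared by digest; both are assumptions, and `missing_view_changes` and `accept_view_change` are hypothetical helpers rather than Plenum functions.

```python
from typing import Dict, List, Sequence, Tuple


def missing_view_changes(referenced: Sequence[Tuple[str, str]],
                         have: Dict[str, str]) -> List[Tuple[str, str]]:
    """(owner, digest) pairs referenced by NewView that were not received with that digest."""
    return [(owner, digest) for owner, digest in referenced if have.get(owner) != digest]


def accept_view_change(owner: str, digest: str, replies: Dict[str, str], f: int) -> bool:
    """`replies` maps each replier to the digest of the ViewChange it returned for `owner`."""
    if replies.get(owner) == digest:
        return True  # the owner itself returned the expected ViewChange
    matching = sum(1 for replier, d in replies.items()
                   if replier != owner and d == digest)
    return matching >= f + 1  # otherwise require f + 1 matching copies from other nodes
```

Checking the digest listed in the NewView, rather than trusting any single reply, keeps a faulty non-owner from substituting a forged ViewChange.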