High Availability
Dev Random edited this page 2021-01-16 11:45:37 -08:00

Initial thoughts about high availability (original author @devrandom).

A highly available installation should have replication at these layers:

  • Signer - there should be a quorum of signers that have to agree for channel state advancement
  • Channel Monitor - there should be at least one Monitor replica reacting to on-chain events
  • Network handler (everything other than Signer/Monitor)

The network handler should probably be in an active+hot-standby configuration, though active+active or partitioning by channel may also be possible.

The Signer requires a quorum (majority) for consistency. If one subset of signers revokes a tx, a different subset must not be able to sign it for broadcast. Because any two majorities share at least one replica, requiring a majority yields a single final consensus on each state change.
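As a rough illustration of the majority rule, the sketch below checks whether a set of acknowledging replicas forms a quorum and counts how many replicas two sets share. The names `majority`, `has_quorum`, and `overlap`, and the use of `u32` replica IDs, are assumptions made for this example and are not part of any existing signer API.

```rust
use std::collections::HashSet;

/// Smallest number of replicas that forms a majority of `total`.
/// (Hypothetical helper, not an existing API.)
pub fn majority(total: usize) -> usize {
    total / 2 + 1
}

/// True when the distinct replicas in `acks` form a majority of `total`.
pub fn has_quorum(acks: &HashSet<u32>, total: usize) -> bool {
    acks.len() >= majority(total)
}

/// Number of replicas present in both sets. For any two majorities of the
/// same replica set this is at least 1, which is what prevents one subset
/// from revoking a tx while a disjoint subset signs it.
pub fn overlap(a: &HashSet<u32>, b: &HashSet<u32>) -> usize {
    a.intersection(b).count()
}
```

With three replicas, `{0, 1}` and `{1, 2}` are both quorums and they share replica 1, so that replica would see both the revocation and the later signing request.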

Signer state changes include: funding, revocation, signing of a holder commitment tx, signing of a counterparty commitment tx, cooperative close, detection of a counterparty broadcast of a commitment tx and completion of a sweep.
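One possible way to model these state changes is as a plain enum that replicas can vote on. This is only a sketch: the type name `StateChange`, the variant names, and the `label` method are invented here for illustration and do not correspond to existing LDK types.

```rust
/// Hypothetical enumeration of the signer state changes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum StateChange {
    Funding,
    Revocation,
    SignHolderCommitment,
    SignCounterpartyCommitment,
    CooperativeClose,
    CounterpartyCommitmentBroadcastDetected,
    SweepCompleted,
}

impl StateChange {
    /// Every variant, in the order given in the text.
    pub const ALL: [StateChange; 7] = [
        StateChange::Funding,
        StateChange::Revocation,
        StateChange::SignHolderCommitment,
        StateChange::SignCounterpartyCommitment,
        StateChange::CooperativeClose,
        StateChange::CounterpartyCommitmentBroadcastDetected,
        StateChange::SweepCompleted,
    ];

    /// Short human-readable label, e.g. for logs.
    pub fn label(self) -> &'static str {
        match self {
            StateChange::Funding => "funding",
            StateChange::Revocation => "revocation",
            StateChange::SignHolderCommitment => "sign holder commitment tx",
            StateChange::SignCounterpartyCommitment => "sign counterparty commitment tx",
            StateChange::CooperativeClose => "cooperative close",
            StateChange::CounterpartyCommitmentBroadcastDetected => {
                "counterparty commitment broadcast detected"
            }
            StateChange::SweepCompleted => "sweep completed",
        }
    }
}
```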

The signer layer should have a TEMPFAIL result that defers the operation so it can be retried later. For example, one channel may lack a quorum because of replica failures while another channel can still go ahead. Implementing the deferral requires some additional code. TEMPFAIL is different from "internal errors", such as missing key material, hardware errors, or wrong state, which should alert an operator to a production problem with the replica.
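The deferral logic might look something like the following sketch. Everything here is hypothetical: `SignerResult`, `Operation`, and `Deferrals` are names made up for this example, channel IDs are simplified to `u64`, and a real implementation would also need persistence and backoff. The point is only to show TEMPFAIL being queued for retry while internal errors are routed to an operator alert instead.

```rust
use std::collections::VecDeque;

/// Hypothetical result type for a signer operation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SignerResult {
    Ok,
    /// No quorum right now: defer the operation and retry later.
    TempFail,
    /// Replica-level production problem: alert an operator, do not retry.
    InternalError(String),
}

/// Hypothetical pending operation, keyed by a simplified channel ID.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Operation {
    pub channel_id: u64,
    pub description: String,
}

/// Holds deferred operations and operator alerts.
#[derive(Default)]
pub struct Deferrals {
    pending: VecDeque<Operation>,
    alerts: Vec<String>,
}

impl Deferrals {
    /// Routes one result. Returns true if the operation completed.
    pub fn handle(&mut self, op: Operation, result: SignerResult) -> bool {
        match result {
            SignerResult::Ok => true,
            SignerResult::TempFail => {
                self.pending.push_back(op);
                false
            }
            SignerResult::InternalError(msg) => {
                self.alerts.push(format!("channel {}: {}", op.channel_id, msg));
                false
            }
        }
    }

    /// Retries each deferred operation once via `attempt`. Operations that
    /// return TempFail again are re-queued. Returns how many completed.
    pub fn retry_all<F: FnMut(&Operation) -> SignerResult>(&mut self, mut attempt: F) -> usize {
        let mut completed = 0;
        for _ in 0..self.pending.len() {
            if let Some(op) = self.pending.pop_front() {
                let result = attempt(&op);
                if self.handle(op, result) {
                    completed += 1;
                }
            }
        }
        completed
    }

    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }

    pub fn alerts(&self) -> &[String] {
        &self.alerts
    }
}
```

Because deferral is tracked per operation, a channel without a quorum sits in the queue while operations on other channels complete normally.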

Since the blockchain enforces consistency on reactions to on-chain events (a UTXO can only be spent once), the Monitors don't need to coordinate, but they could do so to reduce load.
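The single-spend argument can be shown with a toy model. The `Chain` type below is purely an assumption for illustration: it stands in for the blockchain's UTXO set and represents an outpoint as a `(txid, vout)` pair with a simplified `u64` txid. Two uncoordinated Monitor replicas that react to the same event both try to spend the same output, and only the first attempt is accepted.

```rust
use std::collections::HashSet;

/// Toy stand-in for the chain's UTXO set (hypothetical, for illustration).
pub struct Chain {
    unspent: HashSet<(u64, u32)>,
}

impl Chain {
    /// Creates a chain whose unspent set contains the given outpoints.
    pub fn new(outpoints: &[(u64, u32)]) -> Self {
        Chain {
            unspent: outpoints.iter().copied().collect(),
        }
    }

    /// Accepts the spend only if the outpoint is still unspent.
    /// Returns true if accepted, false if it was already spent or unknown.
    pub fn try_spend(&mut self, outpoint: (u64, u32)) -> bool {
        self.unspent.remove(&outpoint)
    }
}
```

If Monitor A and Monitor B each call `try_spend` on the same outpoint, the second call returns `false`, so duplicate reactions cost some extra work but cannot produce conflicting outcomes.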

TODO: once this page matures, it should be folded into LDK docs