This commit moves calls to link.DisableAdds to outside the
link.OnCommitOnce callbacks. This is done so that if a user requests
shutdown, we can immediately block any new outgoing HTLCs. Only the
sending of the Shutdown message needs to wait until after any pending
CommitSig message is sent.
In this commit, the `ChannelUpdateHandler`'s `EnableAdds` and
`DisableAdds` methods are adjusted to return booleans instead of
errors. Today, any error returned by these methods only means that the
proposed update has already been applied, and so all callers do is log
it. But if, in future, these methods are updated to return actual
errors that need to be handled, we might forget to handle them
correctly at the various call sites. So we instead change the
signature of the methods to return a boolean. If we do need to return
an error in future, we will have to inspect every call site in any
case to fix compilation, and then we can be sure we are handling the
errors correctly.
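For illustration, a minimal sketch of the new shape; the
`LinkDirection` parameter and the internal state here are assumptions,
not lnd's actual types:

```go
package sketch

// LinkDirection says whether an update applies to the incoming or the
// outgoing direction of a link.
type LinkDirection bool

const (
	Incoming LinkDirection = true
	Outgoing LinkDirection = false
)

// flowControl is a toy stand-in for the link's update-handling state.
type flowControl struct {
	addsDisabled map[LinkDirection]bool
}

func newFlowControl() *flowControl {
	return &flowControl{addsDisabled: make(map[LinkDirection]bool)}
}

// DisableAdds blocks new HTLC adds in the given direction. It returns
// true if the state changed and false if adds were already disabled,
// rather than returning an error the caller would only ever log.
func (f *flowControl) DisableAdds(dir LinkDirection) bool {
	if f.addsDisabled[dir] {
		return false
	}
	f.addsDisabled[dir] = true
	return true
}

// EnableAdds is the mirror image: it reports whether the call actually
// re-enabled adds.
func (f *flowControl) EnableAdds(dir LinkDirection) bool {
	if !f.addsDisabled[dir] {
		return false
	}
	f.addsDisabled[dir] = false
	return true
}
```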
In this commit, we make all calls to disconnect after a ping/pong
violation is detected in the `PingManager` async. We do this to avoid
the circular waiting that may occur if the disconnect callback ends up
waiting on the peer goroutine to be torn down: the peer goroutine
would be blocked on the ping manager fully tearing down, which in turn
is blocked on the peer disconnect succeeding.
This is a similar class of issue to ones we've dealt with recently as
pertains to the peer and the server: synchronous callback execution
must not lead to a circular waiting loop.
Fixes #8379
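A rough sketch of the pattern, assuming an `OnPongFailure`-style
callback field; the surrounding types are illustrative, not lnd's
actual ones:

```go
package sketch

import "log"

// pingManagerCfg carries the callback invoked when a peer violates the
// ping/pong protocol. The field name is illustrative only.
type pingManagerCfg struct {
	// OnPongFailure is called when a pong is malformed or times out.
	OnPongFailure func(error)
}

// peer is a toy stand-in for the brontide peer.
type peer struct{}

func (p *peer) Disconnect(err error) {
	// In the real code this may block until the peer's goroutines,
	// including the ping manager itself, have fully torn down.
	log.Printf("disconnecting peer: %v", err)
}

// newPingConfig wires the disconnect callback asynchronously: the ping
// manager only spawns a goroutine and returns, so its own teardown can
// never end up waiting on the disconnect completing.
func newPingConfig(p *peer) *pingManagerCfg {
	return &pingManagerCfg{
		OnPongFailure: func(err error) {
			go p.Disconnect(err)
		},
	}
}
```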
We don't need to try a link shutdown when the chan closer is fetched
since, as of this commit, the only call site manages the shutdown
semantics. After removing the call to tryLinkShutdown, we no longer
need the function at all.
In order to handle shutdown requests when there are still HTLCs on
the link, we have to manage the shutdown process via the link's
lifecycle hooks. This means we can't use the simple `tryLinkShutdown`
anymore and instead queue a `Shutdown` message at the next opportunity
to do so -- when we send our next `CommitSig`.
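A rough sketch of how such a lifecycle hook can be used; `OnCommitOnce`
follows the wording above, everything else is an assumed stand-in:

```go
package sketch

import "sync"

// link is a toy model of a channel link that exposes a lifecycle hook
// which fires once, right after the next CommitSig is sent.
type link struct {
	mu          sync.Mutex
	commitHooks []func()
}

// OnCommitOnce registers fn to run exactly once after the next
// CommitSig goes out.
func (l *link) OnCommitOnce(fn func()) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.commitHooks = append(l.commitHooks, fn)
}

// sentCommitSig is called by the link after it sends a CommitSig, and
// drains the one-shot hooks.
func (l *link) sentCommitSig() {
	l.mu.Lock()
	hooks := l.commitHooks
	l.commitHooks = nil
	l.mu.Unlock()

	for _, fn := range hooks {
		fn()
	}
}

// requestShutdown blocks new adds immediately, but defers queueing the
// wire Shutdown message until the pending CommitSig has gone out.
func requestShutdown(l *link, disableAdds func() bool, queueShutdown func()) {
	// Stop new outgoing HTLCs right away.
	disableAdds()

	// Only the Shutdown message itself waits for the next CommitSig.
	l.OnCommitOnce(queueShutdown)
}
```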
This commit moves over the last two methods, `RegisterChannel` and
`BackupState`, from the `Client` to the `Manager` interface. With this
change, we no longer need to pass the individual clients around and
instead only need to pass the manager around.
To make this change, all the goroutines that handle channel closes and
closable sessions needed to be moved to the Manager, and so a large
part of this commit is just moving this code from the TowerClient to
the Manager.
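For illustration, a hedged sketch of the consolidated interface; the
method names follow this commit message, but the parameter types are
simplified assumptions rather than lnd's actual signatures:

```go
package sketch

// ChannelID is a simplified stand-in for the channel identifier type.
type ChannelID [32]byte

// stateBackup is a simplified stand-in for a revoked-state backup.
type stateBackup struct {
	chanID       ChannelID
	commitHeight uint64
}

// Manager is the consolidated interface: callers register channels and
// hand off state backups to the manager, which multiplexes to the
// per-tower clients internally.
type Manager interface {
	// RegisterChannel tells the manager to start tracking the channel.
	RegisterChannel(chanID ChannelID) error

	// BackupState hands the latest revoked state to the manager, which
	// routes it to the appropriate tower client.
	BackupState(backup *stateBackup) error
}
```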
This is to avoid a potential race on WriteMessage and Flush internals.
Because there is no locking around WriteMessage and Flush, if we allow
writeMessage calls in Start after the writeHandler has started, those
calls may invoke WriteMessage/Flush at the same time as the
writeMessage calls made from the writeHandler. Since there is no
locking, internals like b.nextHeaderSend can race and cause panics.
Without this, the following could happen:
* InboundPeerConnected is called while we already have an inbound
connection with the peer. This calls removePeer which calls Disconnect.
* If the peer is starting up in Start, it may be sending messages
synchronously via SendMessage(true, ...). This eventually calls the
writeMessage function which will exit if disconnect is set to 1.
* Since Disconnect was called, disconnect will be 1 and writeMessage
will exit, causing writeHandler to exit.
* If there is more than 1 message being sent, later messages will
queue in queueHandler but be unable to get into sendQueue as the
writeHandler goroutine has exited.
* The synchronous sends will be waiting on the errChan indefinitely
and startReady will never get closed meaning Disconnect will never
proceed.
The end result is that the server's mutex will be held until shutdown.
Avoid this by using writeMessage to bypass the writeHandler goroutine.
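A simplified sketch of the idea, with assumed types and fields (the
real write path involves the noise machine and buffer pools):

```go
package sketch

import (
	"fmt"
	"sync/atomic"
)

// peer is a toy model of the brontide peer's write path.
type peer struct {
	disconnect int32 // set to 1 by Disconnect.

	sendQueue chan []byte // consumed by writeHandler once it starts.
}

// writeMessage writes directly to the wire. It is only safe to call
// from one goroutine at a time, which is why Start uses it before the
// writeHandler goroutine exists, never concurrently with it.
func (p *peer) writeMessage(msg []byte) error {
	if atomic.LoadInt32(&p.disconnect) == 1 {
		return fmt.Errorf("peer disconnecting")
	}
	// ... encrypt and flush msg to the connection ...
	return nil
}

// startupSend is the pattern described above: during Start, bypass the
// queueHandler/writeHandler pipeline entirely so that a concurrent
// Disconnect cannot leave a synchronous send waiting forever on a
// writeHandler that has already exited.
func (p *peer) startupSend(msgs ...[]byte) error {
	for _, msg := range msgs {
		if err := p.writeMessage(msg); err != nil {
			return err
		}
	}
	return nil
}
```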
Prior to this commit, taproot channels had a bug:
- If a disconnect happened before peer.AddNewChannel was called,
then the subsequent reconnect would call peer.AddNewChannel and
attempt the ChannelReestablish dance.
- peer.AddNewChannel would call NewLightningChannel with
populated nonce ChannelOpts. This in turn would call
InitRemoteMusigNonces which would create a new musig pair session
and set the channel's pendingVerificationNonce to nil.
- During the reestablish dance, ProcessChanSyncMsg would be called.
This would also call InitRemoteMusigNonces, except it would fail
since pendingVerificationNonce was set to nil in the previous
invocation.
To fix this, we add a new functional option to signal to the init
logic that it doesn't need to call InitRemoteMusigNonces in
ProcessChanSyncMsg.
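A minimal sketch of such a functional option, using illustrative names
rather than lnd's actual option and channel state:

```go
package sketch

// chanSyncState is a toy stand-in for the channel state consulted
// during the reestablish dance.
type chanSyncState struct {
	// remoteNoncesInitialized records that InitRemoteMusigNonces was
	// already run (e.g. via AddNewChannel), so ProcessChanSyncMsg must
	// not call it a second time.
	remoteNoncesInitialized bool
}

// ChanOpt is a functional option applied when loading a channel.
type ChanOpt func(*chanSyncState)

// WithSkipNonceInit signals that the remote musig2 nonces were already
// initialized, mirroring the option described above.
func WithSkipNonceInit() ChanOpt {
	return func(s *chanSyncState) {
		s.remoteNoncesInitialized = true
	}
}

// processChanSync sketches the guard: only initialize the remote
// nonces if no earlier code path has done so.
func processChanSync(s *chanSyncState, initRemoteNonces func() error) error {
	if s.remoteNoncesInitialized {
		return nil
	}
	if err := initRemoteNonces(); err != nil {
		return err
	}
	s.remoteNoncesInitialized = true
	return nil
}
```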
This commit refactors some of the bookkeeping around the ping logic
inside of the Brontide. If the pong response is noncompliant with
the spec or if it times out, we disconnect from the peer.
This change makes the generation of the ping payload a no-arg
closure parameter, relieving the pingHandler of having to
directly monitor the chain state. This makes use of the
BestBlockView that was introduced in earlier commits.
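A hedged sketch of what handing the ping manager a no-arg payload
closure can look like; the field names, the payload layout, and the
view type are assumptions, not lnd's actual API:

```go
package sketch

import (
	"encoding/binary"
	"sync/atomic"
)

// bestBlockView is a toy stand-in for the BestBlockView mentioned
// above: something that can cheaply report the latest header bytes.
type bestBlockView struct {
	latestHeader atomic.Value // []byte
}

// pingManagerCfg shows the shape of the change: the ping manager is
// handed a no-arg closure and never touches chain state directly.
type pingManagerCfg struct {
	// NewPingPayload returns the payload to embed in the next ping.
	NewPingPayload func() []byte
}

// newPingPayloadFunc builds that closure from the view. The layout
// used here (height followed by header bytes) is illustrative only.
func newPingPayloadFunc(view *bestBlockView, height *atomic.Uint32) func() []byte {
	return func() []byte {
		var payload []byte

		var h [4]byte
		binary.BigEndian.PutUint32(h[:], height.Load())
		payload = append(payload, h[:]...)

		if hdr, ok := view.latestHeader.Load().([]byte); ok {
			payload = append(payload, hdr...)
		}
		return payload
	}
}
```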
This PR is a follow-up to a [follow
up](https://github.com/lightningnetwork/lnd/pull/7938) of an [initial
concurrency issue](https://github.com/lightningnetwork/lnd/pull/7856)
fixed in the peer goroutine.
In #7938, we noticed that the introduction of `p.startReady` can cause
`Disconnect` to block. This happens as `Disconnect` cannot be called
until `p.startReady` has been closed. `Disconnect` is also called from
`InboundPeerConnected` (the case of concurrent peers, so we need to
remove one of the connections) while the main server mutex is held. If
`p.Start` blocks for any reason, then this leads to a deadlock: we
can't disconnect until we've finished starting, and we can't finish
starting because we need the disconnect caller to exit, as it holds
the mutex.
In this commit, we now make the call to `prunePersistentPeerConnection`
async. The call to `prunePersistentPeerConnection` eventually wants to
grab the server mutex, which triggers the circular waiting scenario
above.
The main learning here is that no calls on the main server mutex path
can block from `p.Start`. This is more or less a stopgap to resolve
the issue initially introduced in v0.16.4. Assuming we want to move
forward with this fix, we should reexamine `p.startReady` altogether,
and also revisit the attempt to refactor this section of the code to
eliminate the mega mutex in the server in favor of a dedicated event
loop.
* lnwallet: fix log output msg
The log message is off by one.
* htlcswitch: fail channel when revoking it fails.
When the revocation of a channel state fails after receiving a new
CommitmentSigned msg we have to fail the channel otherwise we
continue with an unclean state.
* docs: update release-docs
* htlcswitch: tear down connection if revocation processing fails
If we couldn't revoke due to a DB error, then we want to also tear
down the connection, as we don't want the other party to continue to
send updates, which may lead to de-sync'd state and an eventual force
close. Instead, the database might be able to recover come the next
reconnection attempt.
* kvdb: use sql.LevelSerializable for all backends
In this commit, we modify the default isolation level to be
`sql.LevelSerializable`. This is the strictest isolation level for
postgres. For sqlite, there's only ever a single writer, so this
doesn't apply directly.
* kvdb/sqlbase: add randomized exponential backoff for serialization failures
In this commit, we add randomized exponential backoff for
serialization failures. For postgres, we'll hit this any time a
transaction set fails to be serialized. For sqlite, we'll hit this if
we have many writers trying to grab the write lock at the same time,
manifesting as a `SQLITE_BUSY` error code.
As is, we'll retry up to 10 times, waiting a minimum of 50
milliseconds between each attempt, with the delay capped at 5 seconds.
For sqlite, this is also bounded by the busy timeout set, which
applies on top of this retry logic (block for the busy timeout, then
apply this backoff logic).
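A rough sketch of a retry loop along these lines, using the constants
from this message (10 attempts, 50 ms minimum wait, 5 s cap); the
error detection and the exact jitter scheme are assumptions:

```go
package sketch

import (
	"errors"
	"math/rand"
	"time"
)

const (
	maxRetries   = 10
	retryMinWait = 50 * time.Millisecond
	retryMaxWait = 5 * time.Second
)

// errSerializationFailure stands in for the postgres serialization
// failure or sqlite SQLITE_BUSY errors mentioned above; real detection
// would inspect the driver error codes.
var errSerializationFailure = errors.New("serialization failure")

// retryTx runs txn, retrying on serialization failures with randomized
// exponential backoff: double the delay each attempt, add jitter, and
// cap it at retryMaxWait.
func retryTx(txn func() error) error {
	wait := retryMinWait
	for i := 0; i < maxRetries; i++ {
		err := txn()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errSerializationFailure) {
			return err
		}

		// Randomize the delay to de-synchronize competing writers.
		jitter := time.Duration(rand.Int63n(int64(wait)))
		time.Sleep(wait/2 + jitter)

		wait *= 2
		if wait > retryMaxWait {
			wait = retryMaxWait
		}
	}
	return errSerializationFailure
}
```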
* docs/release-notes: add entry for sqlite/postgres tx retry
---------
Co-authored-by: ziggie <ziggie1984@protonmail.com>
In this commit, we attempt to fix a circular waiting scenario
inadvertently introduced when [fixing a race condition
scenario](https://github.com/lightningnetwork/lnd/pull/7856). With
that PR, we added a new channel that would block `Disconnect` and
`WaitForDisconnect` to ensure that those calls are only allowed to
succeed once the `Start` method has finished.
The issue is that if the server is trying to disconnect a peer due to a
concurrent connection, but `Start` is blocked on `maybeSendNodeAnn`,
which then wants to grab the main server mutex, then `Start` can never
exit, which causes `startReady` to never be closed, which then causes
the server to be blocked.
This PR attempts to fix the issue by calling `maybeSendNodeAnn` in a
goroutine, so it can grab the server mutex and not block the `Start`
method.
Fixes https://github.com/lightningnetwork/lnd/issues/7924
Fixes https://github.com/lightningnetwork/lnd/issues/7928
Fixes https://github.com/lightningnetwork/lnd/issues/7866
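A simplified sketch of the shape of the fix; the types and fields are
illustrative stand-ins for the real peer:

```go
package sketch

import "sync"

// peer is a toy model of the brontide peer's startup path.
type peer struct {
	wg         sync.WaitGroup
	startReady chan struct{}
}

func newPeer() *peer {
	return &peer{startReady: make(chan struct{})}
}

// maybeSendNodeAnn ultimately wants the main server mutex, so it must
// not run inline in Start while a Disconnect caller holds that mutex.
func (p *peer) maybeSendNodeAnn() {
	// ... grabs the server mutex, possibly blocking ...
}

// Start kicks off maybeSendNodeAnn asynchronously so that Start can
// finish and close startReady, which in turn unblocks any pending
// Disconnect/WaitForDisconnect callers.
func (p *peer) Start() error {
	defer close(p.startReady)

	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		p.maybeSendNodeAnn()
	}()

	return nil
}
```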
In this commit, we add support for the new musig2 channel funding
flow. This flow is identical to the existing flow, but now both sides
need to exchange local nonces up front, and the signatures sent are
now partial signatures instead of regular signatures.
The funding manager also gains some new state: the local nonces it
needs to generate in order to send the funding locked message, and
also to process the funding locked message from the remote party.
In order to allow the funding manager to generate the nonces that need
to be applied to each channel, the AddNewChannel method has been
modified to accept a set of options that the peer will then use to
bind the nonces to a new channel.
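A hedged sketch of what nonce-binding options passed to AddNewChannel
can look like; the option names and the nonce type are assumptions,
not lnd's actual API:

```go
package sketch

// musig2Nonce is a simplified stand-in for a musig2 public nonce.
type musig2Nonce [66]byte

// chanOpts collects the per-channel options the peer applies when it
// creates the new lightning channel.
type chanOpts struct {
	localNonce  *musig2Nonce
	remoteNonce *musig2Nonce
}

// ChannelOpt mirrors the functional-option style described above.
type ChannelOpt func(*chanOpts)

// WithLocalMusig2Nonce binds the nonce the funding manager generated
// for our own funding locked message.
func WithLocalMusig2Nonce(n *musig2Nonce) ChannelOpt {
	return func(o *chanOpts) { o.localNonce = n }
}

// WithRemoteMusig2Nonce binds the nonce received from the remote
// party.
func WithRemoteMusig2Nonce(n *musig2Nonce) ChannelOpt {
	return func(o *chanOpts) { o.remoteNonce = n }
}

// AddNewChannel shows the new shape: the peer applies the options
// before constructing the channel machine, so the nonces are bound to
// the channel up front.
func AddNewChannel(opts ...ChannelOpt) *chanOpts {
	o := &chanOpts{}
	for _, opt := range opts {
		opt(o)
	}
	return o
}
```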
In this commit, we update the co-op close flow to support the new musig2
keyspend flow. We'll use some new functional options to allow a caller
to pass in an active musig2 session. If this is present, then we'll use
that to complete the musig2 flow by signing with a partial signature,
and then ultimately combining the signatures at the end.
In this commit, we modify the starting logic to not attempt to add a
tower client for taproot channels. Instead, we'll just log that this
isn't available yet.
This commit now sends messages to `chanStream` for both pending and
active channels. If the message is sent to a pending channel, it will be
queued in `chanStream`. Once the channel link becomes active, the early
messages will be processed.
This commit adds a new interface method, `RemovePendingChannel`, to be
used when the funding flow is failed after calling `AddPendingChannel`
such that the Brontide has the most up-to-date view of the active
channels.
This commit adds a new channel `newPendingChannel` and its dedicated
handler `handleNewPendingChannel` to keep track of pending open
channels. This should not affect the original handling of new active
channels, except `addedChannels` is now updated in
`handleNewPendingChannel` such that this new pending channel won't be
reestablished in the link.
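A simplified sketch of the new queue and handler, with assumed types:

```go
package sketch

// chanID is a simplified channel identifier.
type chanID [32]byte

// newChannelMsg pairs a channel with an error channel so the caller
// can wait for the peer to finish processing it.
type newChannelMsg struct {
	chanID chanID
	err    chan error
}

// peer sketches the two queues: the existing one for active channels
// and the new newPendingChannel one for channels still pending.
type peer struct {
	newActiveChannel  chan *newChannelMsg
	newPendingChannel chan *newChannelMsg

	// addedChannels tracks every channel this peer already knows
	// about so the reestablish logic skips them.
	addedChannels map[chanID]struct{}
}

func newPeer() *peer {
	return &peer{
		newActiveChannel:  make(chan *newChannelMsg),
		newPendingChannel: make(chan *newChannelMsg),
		addedChannels:     make(map[chanID]struct{}),
	}
}

// handleNewPendingChannel records the pending channel in addedChannels
// so no ChannelReestablish is attempted for it, but starts no link yet.
func (p *peer) handleNewPendingChannel(msg *newChannelMsg) {
	p.addedChannels[msg.chanID] = struct{}{}
	msg.err <- nil
}
```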
In this commit, we make sure that all the `wg.Add(1)` calls complete
before we attempt to wait on the shutdown of all the goroutines. Under
rare scheduling scenarios, if both `Start` and `Disconnect` are called
concurrently, this internal race can be hit, causing a panic.
Fixes https://github.com/lightningnetwork/lnd/issues/7853
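A minimal sketch of the ordering guarantee described here, with
assumed fields: Start performs every `wg.Add` before signalling
readiness, and Disconnect waits for that signal before calling
`wg.Wait`, so Add can never race with Wait.

```go
package sketch

import "sync"

type peer struct {
	wg         sync.WaitGroup
	startReady chan struct{}
	quit       chan struct{}
	stopOnce   sync.Once
}

func newPeer() *peer {
	return &peer{
		startReady: make(chan struct{}),
		quit:       make(chan struct{}),
	}
}

func (p *peer) Start() {
	// All Add calls happen before startReady is closed.
	p.wg.Add(2)
	go p.readHandler()
	go p.writeHandler()

	close(p.startReady)
}

func (p *peer) readHandler()  { defer p.wg.Done(); <-p.quit }
func (p *peer) writeHandler() { defer p.wg.Done(); <-p.quit }

func (p *peer) Disconnect() {
	// Only wait on the goroutines once Start has finished adding them.
	<-p.startReady

	p.stopOnce.Do(func() { close(p.quit) })
	p.wg.Wait()
}
```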
As new pending channels will be tracked in the following commits,
`newChannels` is now renamed to `newActiveChannel` to explicitly refer
to active, non-pending channels.