Commit Graph

141 Commits

Author SHA1 Message Date
Matt Morehouse
f0ae5b2456
peer: add test for startup race on writeMessage
The test reliably detects
https://github.com/lightningnetwork/lnd/issues/8184.
2023-11-16 17:33:44 -06:00
Matt Morehouse
08fff28504
peer: enable mockMessageConn to detect data races
We use unsynchronized counters to trigger a report under the race
detector if multiple reads or writes happen concurrently.
2023-11-16 17:32:19 -06:00
Matt Morehouse
c4e0daa274
peer: add missing Close() method to mockMessageConn 2023-11-16 17:31:18 -06:00
Eugene Siegel
7556402ea4
peer: send reestablish, shutdown messages before starting writeHandler
This is to avoid a potential race on WriteMessage and Flush internals.
Because there is no locking on WriteMessage and Flush, if we allow
writeMessage calls in Start after the writeHandler has started,
the writeMessage calls may call WriteMessage/Flush at the same time
that writeMessage calls from the writeHandler does. Since there is
no locking, internals like b.nextHeaderSend can race and cause
panics.
2023-11-16 12:07:10 -05:00
Elle
ed179e3e7d
Merge pull request #8180 from carlaKC/8128-brontideflake
peer/test: fix race in TestHandleNewPendingChannel
2023-11-16 10:10:24 +02:00
Carla Kirk-Cohen
a8a86b290c
peer/test: setup clean peer for each TestHandleNewPendingChannel case
Running test cases in parallel with shared state means that they
can intermingle if they're run at the same time and set up incorrect
values when we assert on the number of channels we have. Moving the
peer setup into the parallel test run fixes this because no single
test case can interfere with the other.
2023-11-14 15:45:19 -05:00
Elle
6122452509
Merge pull request #8117 from ellemouton/removeDBFromChannelDBSchemas
channeldb+refactor: remove `kvdb.Backend` from channel DB schemas
2023-11-09 07:47:11 +02:00
Eugene Siegel
955a345153
peer: send messages in Start via writeMessage to bypass writeHandler
Without this the following could happen:

* InboundPeerConnected is called while we already have an inbound
connection with the peer. This calls removePeer which calls Disconnect.
* If the peer is starting up in Start, it may be sending messages
synchronously via SendMessage(true, ...). This eventually calls the
writeMessage function which will exit if disconnect is set to 1.
* Since Disconnect was called, disconnect will be 1 and writeMessage
will exit, causing writeHandler to exit.
* If there is more than 1 message being sent, later messages will
queue in queueHandler but be unable to get into sendQueue as the
writeHandler goroutine has exited.
* The synchronous sends will be waiting on the errChan indefinitely
and startReady will never get closed meaning Disconnect will never
proceed.

The end result is that the server's mutex will be held until shutdown.

Avoid this by using writeMessage to bypass the writeHandler goroutine.
2023-11-08 16:36:02 -05:00
Elle Mouton
84cdcd6847
multi: move DB schemas to channeldb/models
This commit moves the ChannelEdgePolicy, ChannelEdgeInfo,
ChanelAuthProof and CachedEdgePolicy structs to the `channeldb/models`
package.
2023-11-08 14:50:35 +02:00
Eugene Siegel
dc42b160a0
multi: skip InitRemoteMusigNonces if we've already called it
Prior to this commit, taproot channels had a bug:

- If a disconnect happened before peer.AddNewChannel was called,
  then the subsequent reconnect would call peer.AddNewChannel and
  attempt the ChannelReestablish dance.

- peer.AddNewChannel would call NewLightningChannel with
  populated nonce ChannelOpts. This in turn would call
  InitRemoteMusigNonces which would create a new musig pair session
  and set the channel's pendingVerificationNonce to nil.

- During the reestablish dance, ProcessChanSyncMsg would be called.
  This would also call InitRemoteMusigNonces, except it would fail
  since pendingVerificationNonce was set to nil in the previous
  invocation.

To fix this, we add a new functional option to signal to the init logic
that it doesn't need to call InitRemoteMusigNonces in   in
ProcessChanSyncMsg.
2023-10-31 10:10:35 -07:00
Keagan McClelland
012cc6af8c peer: eliminate unnecessary log spam from received pong msgs 2023-10-19 09:26:50 -07:00
Keagan McClelland
99226e37ef peer: Add machinery to track the state and validity of remote pongs
This commit refactors some of the bookkeeping around the ping logic
inside of the Brontide. If the pong response is noncompliant with
the spec or if it times out, we disconnect from the peer.
2023-10-19 09:26:45 -07:00
Keagan McClelland
1eab7826d2 peer: abstract out ping payload generation from the pingHandler
This change makes the generation of the ping payload a no-arg
closure parameter, relieving the pingHandler of having to
directly monitor the chain state. This makes use of the
BestBlockView that was introduced in earlier commits.
2023-10-19 09:22:07 -07:00
Keagan McClelland
ac265812fc peer+lnd: make BestBlockView available to Brontide 2023-10-19 09:22:07 -07:00
Oliver Gugger
82838e4a62
Merge pull request #8090 from ziggie1984/add-additional-log-chan-closure
Add more information when a co-op close is failing.
2023-10-17 10:27:21 +00:00
ziggie
e4dd911778
multi: clarify co-op closure failures. 2023-10-13 20:35:08 +02:00
yyforyongyu
839b6271e5
peer: fix unit test flake 2023-10-13 13:50:07 +08:00
Olaoluwa Osuntokun
eae9dd07f0
peer: launch persistent peer pruning in background goroutine
This PR is a follow up, to a [follow
up](https://github.com/lightningnetwork/lnd/pull/7938) of an [initial
concurrency issue](https://github.com/lightningnetwork/lnd/pull/7856)
fixed in the peer goroutine.

In #7938, we noticed that the introduction of `p.startReady` can cause
`Disconnect` to block. This happens as `Disconnect` cannot be called
until `p.startReady` has been closed. `Disconnect` is also called from
`InboundPeerConnected` (the case of concurrent peers, so we need to
remove one of the connections) while the main server mutex is held. If
`p.Start` blocks for any reason, then this leads to the deadlock as: we
can't disconnect until we've finished starting, and we can't finish
starting as we need the disconnect caller to exit as it has the mutex.

In this commit, we now make the call to `prunePersistentPeerConnection`
async. The call to `prunePersistentPeerConnection` eventually wants to
grab the server mutex, which triggers the circular waiting scenario
above.

The main learning here is that no calls to the main server mutex path
can block from `p.Start`. This is more or less a stop gap to resolve the
issue initially introduced in v0.16.4. Assuming we want to move forward
with this fix, we should reexamine `p.startReady` all together, and also
revisit attempt to refactor this section of the code to eliminate the
mega mutex in the server in favor of a dedicated event loop.
2023-09-27 21:24:50 -05:00
Olaoluwa Osuntokun
01c64712a3
multi: ensure link is always torn down due to db failures, add exponential back off for sql-kvdb failures (#7927)
* lnwallet: fix log output msg

The log message is off by one.

* htlcswitch: fail channel when revoking it fails.

When the revocation of a channel state fails after receiving a new
CommitmentSigned msg we have to fail the channel otherwise we
continue with an unclean state.

* docs: update release-docs

* htlcswitch: tear down connection if revocation processing fails

If we couldn't revoke due to a DB error, then we want to also tear down
the connection, as we don't want the other party to continue to send
updates. That may lead to de-sync'd state an eventual force close.
Otherwise, the database might be able to recover come the next
reconnection attempt.

* kvdb: use sql.LevelSerializable for all backends

In this commit, we modify the default isolation level to be
`sql.LevelSerializable. This is the strictness isolation type for
postgres. For sqlite, there's only ever a single writer, so this doesn't
apply directly.

* kvdb/sqlbase: add randomized exponential backoff for serialization failures

In this commit, we add randomized exponential backoff for serialization
failures. For postgres, we''ll his this any time a transaction set fails
to be linearized. For sqlite, we'll his this if we have many writers
trying to grab the write lock at time same time, manifesting as a
`SQLITE_BUSY` error code.

As is, we'll retry up to 10 times, waiting a minimum of 50 miliseconds
between each attempt, up to 5 seconds without any delay at all. For
sqlite, this is also bounded by the busy timeout set, which applies on
top of this retry logic (block for busy timeout seconds, then apply this
back off logic).

* docs/release-notes: add entry for sqlite/postgres tx retry

---------

Co-authored-by: ziggie <ziggie1984@protonmail.com>
2023-08-30 16:48:00 -07:00
Olaoluwa Osuntokun
eb0d8af645
peer: eliminate circular waiting by calling maybeSendNodeAnn async (#7938)
In this commit, we attempt to fix circular waiting scenario introduced
inadvertently when [fixing a race condition
scenario](https://github.com/lightningnetwork/lnd/pull/7856). With that
PR, we added a new channel that would block `Disconnect`, and
`WaitForDisconnect` to ensure that only until the `Start` method has
finished would those calls be allowed to succeed.

The issue is that if the server is trying to disconnect a peer due to a
concurrent connection, but `Start` is blocked on `maybeSendNodeAnn`,
which then wants to grab the main server mutex, then `Start` can never
exit, which causes `startReady` to never be closed, which then causes
the server to be blocked.

This PR attempts to fix the issue by calling `maybeSendNodeAnn` in a
goroutine, so it can grab the server mutex and not block the `Start`
method.

Fixes https://github.com/lightningnetwork/lnd/issues/7924

Fixes https://github.com/lightningnetwork/lnd/issues/7928

Fixes https://github.com/lightningnetwork/lnd/issues/7866
2023-08-30 15:47:45 -07:00
Olaoluwa Osuntokun
bed9455d60
lnwallet: fix taproot co-op close nonce bug
The local nonce needs to be the one that's finalized to simulate us
receiving the remote nonce, then generating the local nonce in a JIT
style.
2023-08-22 16:33:50 -07:00
Olaoluwa Osuntokun
15978a8691
funding+peer: add support for new musig2 channel funding flow
In this commit, we add support for the new musig2 channel funding flow.
This flow is identical to the existing flow, but not both sides need to
exchange local nonces up front, and then signatures sent are now partial
signatures instead of regular signatures.

The funding manager also gains some new state of the local nonces it
needs to generate in order to send the funding locked message, and also
process the funding locked message from the remote party.

In order to allow the funding manger to generate the nonces that need to
be applied to each channel, then AddNewChannel method has been modified
to accept a set of options that the peer will then use to bind the
nonces to a new channel.
2023-08-22 16:32:07 -07:00
Olaoluwa Osuntokun
aaba144804
multi: fix linter warnings 2023-08-22 16:32:00 -07:00
Olaoluwa Osuntokun
3879138018
lnwallet: update internal co-op close flow to support musig2 keyspend
In this commit, we update the co-op close flow to support the new musig2
keyspend flow. We'll use some new functional options to allow a caller
to pass in an active musig2 session. If this is present, then we'll use
that to complete the musig2 flow by signing with a partial signature,
and then ultimately combining the signatures at the end.
2023-08-22 16:31:38 -07:00
Olaoluwa Osuntokun
a5d8f69516
peer: soft-disable towers with taproot channels
In this commit, we modify the starting logic to note attempt to add a
tower client for taproot channels. Instead, we'll just log that this
isn't available yet.
2023-08-22 16:31:13 -07:00
Olaoluwa Osuntokun
9a65806c09
input+wallet: extract musig2 session management into new module
In this commit, we extract the musig2 session management into a new
module. This allows us to re-use the session logic elsewhere in unit
tests so we don't need to instantiate the entire wallet.
2023-08-22 16:30:39 -07:00
Oliver Gugger
fb36806724
peer: load initial forwarding policy from channel DB 2023-08-22 06:22:34 +08:00
Oliver Gugger
d5c504c8de
multi: use fwding policy from models pkg 2023-08-22 06:22:33 +08:00
yyforyongyu
f9d4212ecc
peer: send msgs to chanStream for both active and pending channels
This commit now sends messages to `chanStream` for both pending and
active channels. If the message is sent to a pending channel, it will be
queued in `chanStream`. Once the channel link becomes active, the early
messages will be processed.
2023-08-09 01:29:19 +08:00
yyforyongyu
6b41289538
multi: patch unit tests for handling pending channels 2023-08-09 01:29:19 +08:00
yyforyongyu
d28242c664
peer: return an error from updateNextRevocation and patch unit tests
This commit makes the `updateNextRevocation` to return an error and
further feed it through the request's error chan so it'd be handled by
the caller.
2023-08-09 01:29:18 +08:00
yyforyongyu
927572583b
multi: remove pending channel from Brontide when funding flow failed
This commit adds a new interface method, `RemovePendingChannel`, to be
used when the funding flow is failed after calling `AddPendingChannel`
such that the Brontide has the most up-to-date view of the active
channels.
2023-08-09 01:29:18 +08:00
yyforyongyu
3eb7f54a6d
peer: add method handleLinkUpdateMsg to handle channel update msgs 2023-08-09 00:17:23 +08:00
yyforyongyu
f39c568c94
peer: track pending channels in Brontide
This commit adds a new channel `newPendingChannel` and its dedicated
handler `handleNewPendingChannel` to keep track of pending open
channels. This should not affect the original handling of new active
channels, except `addedChannels` is now updated in
`handleNewPendingChannel` such that this new pending channel won't be
reestablished in link.
2023-08-09 00:17:22 +08:00
Olaoluwa Osuntokun
325e844a96
peer: ensure the peer has been started before Disconnect can be called
In this commit, we make sure that all the `wg.Add(1)` calls succeed
before we attempt to wait on the shutdown of all the goroutines. Under
rare scheduling scenarios, if both `Start` and `Disconnect` are called
concurrently, then this internal race error can be hit, causing the
panic to occur.

Fixes https://github.com/lightningnetwork/lnd/issues/7853
2023-07-27 10:59:39 -07:00
Olaoluwa Osuntokun
798821db17
peer: remove unused queueQuit channel
In this commit, we remove an unused `queueQuit` channel.
2023-07-27 10:59:29 -07:00
Matt Morehouse
e2f2b07149
peer: combine handleWarning and handleError
The two methods were nearly identical, so we can simplify by combining
them.
2023-07-19 12:40:16 -05:00
yyforyongyu
195e951e58
peer: rename newChannels to newActiveChannel
As new pending channels will be tracked in the following commits, the
`newChannels` is now renamed to `newActiveChannel` to explicitly refer
to active, non-pending channels.
2023-06-27 20:19:37 +08:00
yyforyongyu
22d2819489
peer: add handleNewChannelMsg method to handle new channel request 2023-06-27 20:19:37 +08:00
yyforyongyu
60627f676f
peer: stop considering pending channels as active
This commit changes the method `isActiveChannel` such that the pending
channels are not consider as active.
2023-06-27 20:19:37 +08:00
yyforyongyu
ccb86a9d1d
peer: update and use SyncMap in Brontide
This commit applies methods `ForEach` and `Len` to `Brontide`.
2023-06-27 20:19:37 +08:00
yyforyongyu
bacb49ddba
peer+server: provide more verbose logging 2023-06-19 14:04:24 +08:00
Olaoluwa Osuntokun
4d633f04e3
htlcswitch: add new LinkFailureDisconnect action
In this commit, we add a new LinkFailureDisconnect action that'll be
used if we detect that the remote party hasn't sent a revoke and ack
when it actually should.

Before this commit, we would log our action, tear down the link, but
then not actually force a connection recycle, as we assumed that if the
TCP connection was actually stale, then the read/write timeout would
expire.

In practice this doesn't always seem to be the case, so we make a strong
action here to actually force a disconnection in hopes that either side
will reconnect and keep the good times rollin' 🕺.
2023-05-23 12:25:11 -07:00
Olaoluwa Osuntokun
cb5fc71659
htlcswitch: add new LinkFailureAction enum
In this commit, we add a new LinkFailureAction enum to take over the old
force close bool. Force closing isn't the only thing we might want to do
when we decide to fail the link, so this is a prep refactoring for an
upcoming change.
2023-05-23 12:24:55 -07:00
Carla Kirk-Cohen
3f9f0ea5d1
multi/refactor: separate methods for get and set node announcement
In preparation for a more complex function signature for set node
announcement, separate get and set so that readonly callers don't need
to handle the extra arguments.
2023-05-04 10:35:42 -04:00
Oliver Gugger
5fbaed6b06
peer: add TODO about initial forwarding policy
The right way to solve the problem of the link not being up to date with
custom user set forwarding policies once the channel is announced would
be to pass in those custom values when the link is created initially.
This requires a bit more of a refactor and is not addressed in this bug
fix.
2023-04-18 11:47:03 +02:00
Oliver Gugger
ec5b95c9a9
Merge pull request #7517 from yyforyongyu/fix-funding-locked
Replace `FundingLocked` with `ChannelReady`
2023-03-30 17:22:17 +02:00
yyforyongyu
2b8e9a0d36
multi: add more trace logs regarding link activate flow 2023-03-21 10:18:49 +07:00
yyforyongyu
f8a8326141
multi: replace FundingLocked and funding_locked strings
This commit is created by running the following commands,
```shell
find . -name "*.go" -exec sed -i '' 's/\"FundingLocked/\"ChannelReady/g' {} \;
find . -name "*.go" -exec sed -i '' 's/FundingLocked\"/ChannelReady\"/g' {} \;
find . -name "*.go" -exec sed -i '' 's/\ funding_locked/\ channel_ready/g' {} \;
```
2023-03-17 18:21:59 +08:00
yyforyongyu
539cae1999
multi: rename fundingLockedMsg to channelReadyMsg
This commit is created by running,
```shell
gofmt -d -w -r 'fundingLockedMsg -> channelReadyMsg' .
```
2023-03-17 18:21:59 +08:00