Got complaints about us hanging up on some nodes because they don't respond
to pings in a timely manner (e.g. ACINQ?), but that turned out to be something
else.
Nonetheless, we've had reports in the past of LND badly prioritizing gossip
traffic, and thus important messages can get queued behind gossip dumps!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: connectd: give busy peers more time to respond to pings.
This was fixed in 1c495ca5a8 ("connectd:
fix accidental handling of old reconnections.") and then reverted by
the rework in "connectd: avoid use-after-free upon multiple
reconnections by a peer".
The latter made the race much less likely, since we cleaned up the
reconnecting struct once the connection was hung up by the remote
node, but it's still theoretically possible.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
`peer_reconnected` was freeing a `struct peer_reconnected` instance
while a pointer to that instance was registered to be passed as an
argument to the `retry_peer_connected` callback function. This caused a
use-after-free crash when `retry_peer_connected` attempted to reparent
the instance to the temporary context.
Instead, never have `peer_reconnected` free a `struct peer_reconnected`
instance, and only ever allow such an instance to be freed after the
`retry_peer_connected` callback has finished with it. To ensure that the
instance is freed even if the connection is closed before the callback
can be invoked, parent the instance to the connection rather than to the
daemon.
Absent the need to free `struct peer_reconnected` instances outside of
the `retry_peer_connected` callback, there is no use for the
`reconnected` hashtable, so remove it as well.
See: https://github.com/ElementsProject/lightning/issues/5282#issuecomment-1141454255Fixes: #5282Fixes: #5284
Changelog-Fixed: connectd no longer crashes when peers reconnect.
The crash below from @zerofeerouting left me confused. The invalid
value in fmt_wireaddr_internal is a telltale sign of use-after-free.
This backtrace shows us destroying the conn *twice*: what's happening?
Well, tal carefully protects against destroying twice: it's not that
unusual to free something in a destructor which has already been freed.
So this indicates that there are *two* io_conn hanging off one
struct connecting, which isn't supposed to happen! We deliberately
call try_connect_one_addr() initially, then inside the io_conn destructor.
But due to races in connectd vs lightningd connection state, we added
a fix which allows a connect command to sit around while the peer is
cleaning up (6cc9f37cab) and get fired
off when it's done.
But what if, in the chaos, we are already connecting again? Now we'll
end up with *two* connections.
Fortunately, we have a `conn` pointer inside struct connecting, which
(with a bit of additional care) we can ensure is only non-NULL while
we're actually trying to connect. This lets us check that before
firing off a new connection attempt in peer_conn_closed.
```
lightning_connectd: FATAL SIGNAL 6 (version v0.11.2rc2-2-g8f7e939)
0x5614a4915ae8 send_backtrace
common/daemon.c:33
0x5614a4915b72 crashdump
common/daemon.c:46
0x7ffa14fcd72f ???
???:0
0x7ffa14dc87bb ???
???:0
0x7ffa14db3534 ???
???:0
0x5614a491fc71 fmt_wireaddr_internal
common/wireaddr.c:255
0x5614a491fc7a fmt_wireaddr_internal_
common/wireaddr.c:257
0x5614a491ea6b type_to_string_
common/type_to_string.c:32
0x5614a490beaa destroy_io_conn
connectd/connectd.c:754
0x5614a494a2f1 destroy_conn
ccan/ccan/io/poll.c:246
0x5614a494a313 destroy_conn_close_fd
ccan/ccan/io/poll.c:252
0x5614a4953804 notify
ccan/ccan/tal/tal.c:240
0x5614a49538d6 del_tree
ccan/ccan/tal/tal.c:402
0x5614a4953928 del_tree
ccan/ccan/tal/tal.c:412
0x5614a4953e07 tal_free
ccan/ccan/tal/tal.c:486
0x5614a4908b7a try_connect_one_addr
connectd/connectd.c:870
0x5614a490bef1 destroy_io_conn
connectd/connectd.c:759
0x5614a494a2f1 destroy_conn
ccan/ccan/io/poll.c:246
0x5614a494a313 destroy_conn_close_fd
ccan/ccan/io/poll.c:252
0x5614a4953804 notify
ccan/ccan/tal/tal.c:240
0x5614a49538d6 del_tree
ccan/ccan/tal/tal.c:402
0x5614a4953e07 tal_free
ccan/ccan/tal/tal.c:486
0x5614a4948f08 io_close
ccan/ccan/io/io.c:450
0x5614a4948f59 do_plan
ccan/ccan/io/io.c:401
0x5614a4948fe1 io_ready
ccan/ccan/io/io.c:417
0x5614a494a8e6 io_loop
ccan/ccan/io/poll.c:453
0x5614a490c12f main
connectd/connectd.c:2164
0x7ffa14db509a ???
???:0
0x5614a4904e99 ???
???:0
0xffffffffffffffff ???
???:0
```
Fixes: #5339
Changelog-Fixed: connectd: occasional crash when we reconnect to a peer quickly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It doesn't get the right errno, and it says "create" not "bind".
```
2022-05-20T03:04:46.498Z DEBUG connectd: Failed to create 2 socket: Success
2022-05-20T03:04:46.500Z DEBUG connectd: REPLY WIRE_CONNECTD_INIT_REPLY with 0 fds
2022-05-20T03:04:46.501Z DEBUG connectd: connectd_init_done
2022-05-20T03:04:46.503Z **BROKEN** connectd: Failed to bind socket for 127.0.0.1:37871: Address already in use
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I have a test which reproduces this, too, and it's been seen in the
wild. It seems we can add a subd as we're closing, which causes
this assert to trigger.
Fixes: #5254
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We had multiple reports of channels being unilaterally closed because
it seemed like the peer was sending old revocation numbers.
Turns out, it was actually old reestablish messages! When we have a
reconnection, we would put the new connection aside, and tell lightningd
to close the current connection: when it did, we would restart
processing of the initial reconnection.
However, we could end up with *multiple* "reconnecting" connections,
while waiting for an existing connection to close. Though the
connections were long gone, there could still be messages queued
(particularly the channel_reestablish message, which comes early on).
Eventually, a normal reconnection would cause us to process one of
these reconnecting connections, and channeld would see the (perhaps
very old!) messages, and get confused.
(I have a test which triggers this, but it also hangs the connect
command, due to other issues we will fix in the next release...)
Fixes: #5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This seems to prevent broad propagation, due to LND not allowing it. See
https://github.com/lightningnetwork/lnd/issues/6432
We still announce it if you disable deprecated-apis, so tests still work,
and hopefully we can enable it in future.
Fixes: #5196
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: Protocol: disabled websocket announcement due to LND propagation issues
I was seeing a strange crash:
Connectd gave bad CONNECT_PEER_CONNECTED message
The message is indeed mangled, around the remote_addr!
A quick review of the code revealed that we were not making a copy
when it was a reconnect, and so the remote_addr pointer was pointing
to memory which was freed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We do this (send warnings) in almost all cases anyway, so mainly this
is a textual update, but there are some changes:
1. Send ERROR not WARNING if they send a malformed commitment secret.
2. Send WARNING not ERROR if they get the shutdown_scriptpubkey wrong (vs upfront)
3. Send WARNING not ERROR if they send a bad shutdown_scriptpubkey (e.g. p2pkh in future)
4. Rename some vars 'err' to 'warn' to make it clear we send a warning.
This means test_option_upfront_shutdown_script can be made reliable, too,
and it now warns and doesn't automatically close channel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Gossipd didn't actually suppress all gossip, resulting in a flake!
Doing it in connectd now makes much more sense.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Either because lightningd tells us it wants to talk, or because the peer
says something about a channel.
We also introduce a behavior change: we disconnect after a failed open.
We might want to modify this later, but we it's a side-effect of openingd
not holding onto idle connections.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We would return success from connect even though the peer was closing;
this is technically correct but fairly undesirable. Better is to pass
every connect attempt to connectd, and have it block if the peer is
exiting (and retry), otherwise tell us it's already connected.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The message from lightningd simply acknowleges that we are allowed to
discard the peer (because no subdaemons are talking to it anymore).
This difference becomes more stark once connectd holds on to idle
peers.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For example, if you do:
```
./lightningd/lightningd --network=regtest --experimental-websocket-port=19846
```
Then you're trying to reuse the normal port as the websocket port, but this
only fails at *listen* time, when we activate connectd. Catch this too.
Fixes incorrect fatal() message, too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
By accessing `addr` after the loop, it's possible that it's one which
failed, in complex scenarios.
Also gives us a chance to warn if they specify a websocket but don't
actually end up advertizing it (you *must* advertize a normal addr as
well).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We always added to both arrays, might as well just keep one.
We make mayfail an explicit flag, rather than relying on the presence
of errstr, which is never NULL now.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This lets us give a better error message if listen fails, and also
moved the callback closer to where it's needed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Gossipd now simply gets told by channeld when peers arrive or leave.
(it only needs to know for the seeker).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is neater than what we had before, and slightly more general.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: JSON_RPC: `sendcustommsg` now works with any connected peer, even when shutting down a channel.
We don't need to log msgs from subds, but we do our own, and we weren't.
1. Rename queue_peer_msg to inject_peer_msg for clarity, make it do logging
2. In the one place where we're relaying, call msg_queue() directly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We want to stream gossip through this, but currently connectd treats the
fd as synchronous. While we work on getting rid of that, it's easiest to
have two fds.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We used to shut down peers atomically, but now we flush the
connections there's a delay. If we are asked to connect in that time,
we ignore it, as we are already connected, but that's wrong: we need
to remember that we were told to connect and reconnect.
This should solve a few weird test failures where "connect" would hang
indefinitely.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is critical in the common case where peer sends an error and
hangs up: we almost never get to relay the error to the subd in time.
This also applies in the other direction: we need to flush the queue
to the peer when the subd closes. Note we only free the actual peer
struct when lightningd reaps us with connectd_peer_disconnected().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We would lose packets sometimes due to this previously, but it
doesn't happen over localhost so our tests didn't notice. However,
now we have connectd being sole thing talking to peers, we can do
a more elegant shutdown, which should fix closing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: Protocol: Always flush sockets to increase chance that final message get to peer (esp. error packets).
msg_queue was originally designed for inter-daemon comms, and so it has
a special mechanism to mark that we're trying to send an fd. Unfortunately,
a peer could also send such a message, confusing us!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
dev_blackhole_fd was a hack, and doesn't work well now we are async
(it worked for sync comms in per-peer daemons, but now we could sneak
through a read before we get to the next write).
So, make explicit flags and use them. This is much easier now we
have all peer comms in one place.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's weird to have connectd ask gossipd, when lightningd can just do it
and hand all the addresses together.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We now let gossipd do it.
This also means there's nothing left in 'struct per_peer_state' to
send across the wire (the fds are sent separately), so that gets
removed from wire messages too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>