Commit Graph

90 Commits

Author SHA1 Message Date
Rusty Russell
002dc60b33 Gossip: BOLT catch, remove initial_routing_sync.
Everyone sends a gossip_timestamp_filter message these days to start gossip.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-06-19 15:54:24 +09:30
Rusty Russell
06cf5ac841 Doc: update bolts to assume gossip_queries under the new meaning.
Everyone understands gossip_queries now, but peers leave it unset to indicate
they have nothing useful to say.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-06-19 15:54:24 +09:30
Rusty Russell
5d061c4cf4 global: remove tags from BOLT quotes now dual-funding is in master
A few of them had minor wording changes, too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-05-09 16:14:23 -05:00
Rusty Russell
d3dbcf03fa channeld: close an unimportant connection when fds get low.
We use a crude heuristic: if we were trying to contact them, it's a
"deliberate" connection, and should be preserved.

Changelog-Changed: connectd: prioritize peers with channels (and log!) if we run low on file descriptors.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-05-09 01:23:46 -05:00
Rusty Russell
3bfe622413 connectd: log when we fail to receive an fd from lightningd.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-05-09 01:23:46 -05:00
Rusty Russell
e0e879c003 common: remove type_to_string files altogther.
This means including <common/utils.h> where it was indirectly included.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-03-20 13:51:48 +10:30
Rusty Russell
07cd4a809b gossipd: remove spam handling.
We weakened this progressively over time, and gossip v1.5 makes spam
impossible by protocol, so we can wait until then.

Removing this code simplifies things a great deal!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Removed: Protocol: we no longer ratelimit gossip messages by channel, making our code far simpler.
2024-02-04 09:24:44 +10:30
Rusty Russell
db6f0da3b3 connectd: separate routine to inject message without closing connection.
We will want this to send private channel_updates direct to peer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-01-31 14:47:33 +10:30
niftynei
fa8458c00a dualfund: add test to make sure that tx-sigs sent before commitment
results in an error.
2023-11-02 19:32:05 +10:30
Rusty Russell
043c4ae5eb pytest: fix flake in test_even_sendcustommsg
Make sure plugin has got message to connectd before sending!

```
    def test_even_sendcustommsg(node_factory):
        l1, l2 = node_factory.get_nodes(2, opts={'log-level': 'io',
                                                 'allow_warning': True})
        l1.connect(l2)
    
        # Even-numbered message
        msg = hex(43690)[2:] + ('ff' * 30) + 'bb'
    
        # l2 will hang up when it gets this.
        l1.rpc.sendcustommsg(l2.info['id'], msg)
        l2.daemon.wait_for_log(r'\[IN\] {}'.format(msg))
        l1.daemon.wait_for_log('Invalid unknown even msg')
        wait_for(lambda: l1.rpc.listpeers(l2.info['id'])['peers'] == [])
    
        # Now with a plugin which allows it
        l1.connect(l2)
        l2.rpc.plugin_start(os.path.join(os.getcwd(), "tests/plugins/allow_even_msgs.py"))
    
        l1.rpc.sendcustommsg(l2.info['id'], msg)
        l2.daemon.wait_for_log(r'\[IN\] {}'.format(msg))
>       l2.daemon.wait_for_log(r'allow_even_msgs.*Got message 43690')

tests/test_misc.py:3623: 
...
>                   raise TimeoutError('Unable to find "{}" in logs.'.format(exs))
E                   TimeoutError: Unable to find "[re.compile('allow_even_msgs.*Got message 43690')]" in logs.

contrib/pyln-testing/pyln/testing/utils.py:327: TimeoutError
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-10-24 09:51:43 +02:00
Rusty Russell
ad7dcf381e lightningd: tell connectd about the custom messages.
We re-send whenever a plugin which allows them starts/finishes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-10-24 11:50:57 +10:30
Rusty Russell
4216affe90 connectd: fix forwarding after tx_abort.
If we get a WIRE_TX_ABORT then another message, we send the other message to the same
subd (even though the tx abort causes it to shutdown).  This means we effectively
lose the next message, and timeout (see below from CI, reproduced locally).

So, have connectd ignore the subd after it forwards the WIRE_TX_ABORT.  The next
message will, correctly, cause a fresh subdaemon to be spawned.

```
    @unittest.skipIf(TEST_NETWORK != 'regtest', 'elementsd doesnt yet support PSBT features we need')
    @pytest.mark.openchannel('v2')
    def test_v2_rbf_multi(node_factory, bitcoind, chainparams):
        l1, l2 = node_factory.get_nodes(2,
                                        opts={'may_reconnect': True,
                                              'dev-no-reconnect': None,
                                              'allow_warning': True})
    
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
        amount = 2**24
        chan_amount = 100000
        bitcoind.rpc.sendtoaddress(l1.rpc.newaddr()['bech32'], amount / 10**8 + 0.01)
        bitcoind.generate_block(1)
        # Wait for it to arrive.
        wait_for(lambda: len(l1.rpc.listfunds()['outputs']) > 0)
    
        res = l1.rpc.fundchannel(l2.info['id'], chan_amount)
        chan_id = res['channel_id']
        vins = bitcoind.rpc.decoderawtransaction(res['tx'])['vin']
        assert(only_one(vins))
        prev_utxos = ["{}:{}".format(vins[0]['txid'], vins[0]['vout'])]
    
        # Check that we're waiting for lockin
        l1.daemon.wait_for_log(' to DUALOPEND_AWAITING_LOCKIN')
    
        # Attempt to do abort, should fail since we've
        # already gotten an inflight
        with pytest.raises(RpcError):
            l1.rpc.openchannel_abort(chan_id)
    
        rate = int(find_next_feerate(l1, l2)[:-5])
        # We 4x the feerate to beat the min-relay fee
        next_feerate = '{}perkw'.format(rate * 4)
    
        # Initiate an RBF
        startweight = 42 + 172  # base weight, funding output
        initpsbt = l1.rpc.utxopsbt(chan_amount, next_feerate, startweight,
                                   prev_utxos, reservedok=True,
                                   min_witness_weight=110,
                                   excess_as_change=True)
    
        # Do the bump
        bump = l1.rpc.openchannel_bump(chan_id, chan_amount,
                                       initpsbt['psbt'],
                                       funding_feerate=next_feerate)
    
        # Abort this open attempt! We will re-try
        aborted = l1.rpc.openchannel_abort(chan_id)
        assert not aborted['channel_canceled']
        # We no longer disconnect on aborts, because magic!
        assert only_one(l1.rpc.listpeers()['peers'])['connected']
    
        # Do the bump, again, same feerate
>       bump = l1.rpc.openchannel_bump(chan_id, chan_amount,
                                       initpsbt['psbt'],
                                       funding_feerate=next_feerate)

tests/test_opening.py:668: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
contrib/pyln-client/pyln/client/lightning.py:1206: in openchannel_bump
    return self.call("openchannel_bump", payload)
contrib/pyln-testing/pyln/testing/utils.py:718: in call
    res = LightningRpc.call(self, method, payload, cmdprefix, filter)
contrib/pyln-client/pyln/client/lightning.py:398: in call
    resp, buf = self._readobj(sock, buf)
contrib/pyln-client/pyln/client/lightning.py:315: in _readobj
    b = sock.recv(max(1024, len(buff)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pyln.client.lightning.UnixSocket object at 0x7f34675aae80>
length = 1024

    def recv(self, length: int) -> bytes:
        if self.sock is None:
            raise socket.error("not connected")
    
>       return self.sock.recv(length)
E       Failed: Timeout >1200.0s
```
2023-10-23 21:57:51 +10:30
Rusty Russell
1d61edfe0c connectd: use shutdown() not close() on TCP sockets for dev-disconnect.
close() is allowed to lose data, and I saw this in CI:

```
2023-10-22T05:12:36.6576005Z ____________________________ test_permfail_htlc_out ____________________________
2023-10-22T05:12:36.6608511Z [gw2] linux -- Python 3.8.18 /home/runner/.cache/pypoetry/virtualenvs/cln-meta-project-AqJ9wMix-py3.8/bin/python
2023-10-22T05:12:36.6611663Z 
2023-10-22T05:12:36.6614768Z node_factory = <pyln.testing.utils.NodeFactory object at 0x7f381039a5e0>
2023-10-22T05:12:36.6623694Z bitcoind = <pyln.testing.utils.BitcoinD object at 0x7f38103c0400>
2023-10-22T05:12:36.6627092Z executor = <concurrent.futures.thread.ThreadPoolExecutor object at 0x7f38103c0ee0>
2023-10-22T05:12:36.6627701Z 
2023-10-22T05:12:36.6628051Z     def test_permfail_htlc_out(node_factory, bitcoind, executor):
2023-10-22T05:12:36.6631192Z         # Test case where we fail with unsettled outgoing HTLC.
2023-10-22T05:12:36.6634154Z         disconnects = ['+WIRE_REVOKE_AND_ACK', 'permfail']
2023-10-22T05:12:36.6635106Z         l1 = node_factory.get_node(options={'dev-no-reconnect': None})
2023-10-22T05:12:36.6637321Z         # Feerates identical so we don't get gratuitous commit to update them
2023-10-22T05:12:36.6642691Z         l2 = node_factory.get_node(disconnect=disconnects,
2023-10-22T05:12:36.6644734Z                                    feerates=(7500, 7500, 7500, 7500))
2023-10-22T05:12:36.6647205Z     
2023-10-22T05:12:36.6649671Z         l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
2023-10-22T05:12:36.6650460Z         l2.daemon.wait_for_log('Handed peer, entering loop')
2023-10-22T05:12:36.6654865Z         l2.fundchannel(l1, 10**6)
2023-10-22T05:12:36.6655305Z     
2023-10-22T05:12:36.6657810Z         # This will fail at l2's end.
2023-10-22T05:12:36.6660554Z         t = executor.submit(l2.pay, l1, 200000000)
2023-10-22T05:12:36.6662947Z     
2023-10-22T05:12:36.6665147Z         l2.daemon.wait_for_log('dev_disconnect permfail')
2023-10-22T05:12:36.6668530Z         l2.wait_for_channel_onchain(l1.info['id'])
2023-10-22T05:12:36.6671588Z         bitcoind.generate_block(1)
2023-10-22T05:12:36.6674510Z >       l1.daemon.wait_for_log('Their unilateral tx, old commit point')
2023-10-22T05:12:36.6675001Z 
2023-10-22T05:12:36.6675212Z tests/test_closing.py:3027: 
...
2023-10-22T05:12:36.8784390Z lightningd-2 2023-10-22T04:41:04.448Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: dev_disconnect: +WIRE_REVOKE_AND_ACK (WIRE_REVOKE_AND_ACK)
2023-10-22T05:12:36.8786260Z lightningd-2 2023-10-22T04:41:04.452Z INFO    0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: Peer connection lost
2023-10-22T05:12:36.8788076Z lightningd-2 2023-10-22T04:41:04.453Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: Status closed, but not exited. Killing
2023-10-22T05:12:36.8789915Z lightningd-1 2023-10-22T04:41:04.454Z INFO    022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Peer connection lost
```

Note that l1 doesn't receive WIRE_REVOKE_AND_ACK!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-10-23 15:48:50 +10:30
Rusty Russell
cf2ebed567 lightningd: close channel ourselves, if we receive an error.
Previously, we would forward the message to a subd, but now we have
the case where the subd is gone, but we're still connected.  If the
peer anything but a reestablish in that state, we drop the connection.

Instead, an error should always make us fail the channel.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-10-23 15:48:50 +10:30
Rusty Russell
08f0a54fdc connectd: don't disconnect automatically on sending warning or error.
This changes some tests which expected us to disconnect.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-10-23 15:48:50 +10:30
Rusty Russell
798cf27cb4 connectd: give subds a chance to drain when lightningd says to disconnect.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-10-23 15:48:50 +10:30
Rusty Russell
0ff91e65dc connectd: remove #if DEVELOPER
We still refuse to run dev commands if lightningd sends it to us
despite us not being in developer mode, but that's mainly paranoia.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-09-21 20:08:24 +09:30
Rusty Russell
df80a2c056 doc: update to BOLT 3747ba83022cd385093df2696ed342f1e41e31b3 "Remove requirements to disconnect on warnings"
Now we don't do that anymore (at least, for sending) we can update bolt quotes to match.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-09-20 13:56:46 +09:30
Rusty Russell
468d3fd387 connectd: also don't disconnect on "all-channel" warnings.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-09-20 13:56:46 +09:30
Dusty Daemon
4628e3ace8 channeld: Code to implement splicing
Update the lightningd <-> channeld interface with lots of new commands to needed to facilitate spicing.

Implement the channeld splicing protocol leveraging the interactivetx protocol.

Implement lightningd’s channel_control to support channeld in its splicing efforts.

Changelog-Added: Added the features to enable splicing & resizing of active channels.
2023-07-31 21:00:22 +09:30
Rusty Russell
6c23349c72 channeld: allow stfu based on peer features, not EXPERIMENTAL_FEATURES.
Changelog-EXPERIMENTAL: Config: `--experimental-quiesce` enables queiescence, for testing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-05-23 09:34:08 +09:30
Alex Myers
bec8586dce connectd: remove handling of push only gossip
This is now handled by gossipd on initial connection to peer.
2023-04-13 08:48:50 -07:00
Rusty Russell
6a446a94c6 connectd: implement timestamp-as-trinary.
This implements the proposal to simply use timestamp as "all", "none"
or "stream".  There's also a rough spec draft which I will post soon.

This *also* removes the last place where we would sometimes sweep the
entire gossip_store looking for their given timestamps.

We could also get rid of the actual timestamp filtering logic in
gossip_store_next if we want to, as it's now basically unused.

Changelog-Changed: Protocol: Simplify gossip_timestamp_filter handling to "all", "none" or "recent" instead of exact timestamp.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-04-13 08:48:50 -07:00
Rusty Russell
00f75d6ee1 connectd: no longer stream our own generated gossip, let gossipd do it.
This removes the sweep logic as soon as they connect.  This should save
connectd a significant number of CPU cycles and make @whitslack finally
stop hitting me.

Changelog-Changed: `connectd` no longer sweeps gossip_store file when peer connects, saving CPU for large nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-04-13 08:48:50 -07:00
Rusty Russell
ed58c24bc7 connectd: log broken if TCP_CORK fails.
But not if we're a developer using dev_disconnect, which substitutes the fd.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-04-10 09:41:56 +09:30
Rusty Russell
295557ac50 connectd: don't try to set TCP_CORK on websocket pipe.
Most of this is piping the flag through so we know it's a websocket!

Reported-by: @ShahanaFarooqui
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-04-10 09:41:56 +09:30
adi2011
709ff01fd2 connectd: make exception for peer storage msgs. 2023-02-08 08:37:59 -06:00
adi2011
5f481aaf96 wire: Add patch file for peer storage bkp
Add msg type peer_storage and your_peer_storage
2023-02-08 08:37:59 -06:00
Rusty Russell
2209d0149f connectd: add new start_shutdown message.
We stop listening, and also refuse to send "connectd_peer_spoke" to create
new subdaemons.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-02-05 20:40:47 +01:00
Michael Schmoock
ad249607d6 dual-fund: update extracted CSVs to latest bolt draft
Changelog-None
2023-02-04 15:31:16 +10:30
Rusty Russell
153b7bf192 common/gossip_store: move subdaemon-only routines to connectd.
connectd is the only one who uses these routines now.  The
rest can be linked into a plugin.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-01-30 15:15:41 -06:00
Rusty Russell
81e57dce52 connectd: ensure htables are always tal objects.
We want to change the htable allocator to use tal, which will need
this.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-01-12 11:44:10 +10:30
Rusty Russell
15d0a8bec8 connectd: don't spam logs when we're under load.
This happens a lot with my node with rc2, so drop it to debug.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-11-30 19:31:38 +01:00
Rusty Russell
41ef85318d onionmessages: remove obsolete onion message parsing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-09-29 16:10:57 +09:30
Rusty Russell
1b30ea4b82 doc: update BOLTs to bc86304b4b0af5fd5ce9d24f74e2ebbceb7e2730
This contains the zeroconf stuff, with funding_locked renamed to
channel_ready.  I change that everywhere, and try to fix up the
comments.

Also the `alias` field is called `short_channel_id`.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Protocol: `funding_locked` is now called `channel_ready` as per latest BOLTs.
2022-09-12 09:34:52 +09:30
Rusty Russell
a3c4908f4a lightningd: don't explicitly tell connectd to disconnect, have it do it on sending error/warning.
Connectd already does this when we *receive* an error or warning, but
now do it on send.  This causes some slight behavior change: we don't
disconnect when we close a channel, for example (our behaviour here
has been inconsistent across versions, depending on the code).

When connectd is told to disconnect, it now does so immediately, and
doesn't wait for subds to drain etc.  That simplifies the manual
disconnect case, which now cleans up as it would from any other
disconnection when connectd says it's disconnected.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
c415c80d48 connectd: spelling and typo fixes.
From @niftynei.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
719d1384d1 connectd: give connections a chance to drain when lightningd says to disconnect, or peer disconnects.
We want to avoid lost messages in the common cases.

This generalizes our drain code, by giving the subds each 5 seconds to
close themselves, but continue to allow them to send us traffic (if
peer is still connected) and continue to send them traffic.

We continue to send traffic *out* to the peer (if it's still
connected), until all subds are gone.  We still have a 5 second timer
to close the connection to peer.

On reconnects, we don't do this "drain period" on reconnects: we kill
immediately.

We fix up one test which was looking for the "disconnect" message
explicitly.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
6a2817101d connectd: don't move parent while we're being freed.
A subtle case I hadn't come across before: if a child tal_resizes()
its parent while the parent is being deleted, tal gets confused.

The subd destructor does this using tal_arr_remove() on peer->subds,
which is currently being freed:

```
==61056== Invalid read of size 8
==61056==    at 0x185632: del_tree (tal.c:417)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x185957: tal_free (tal.c:486)
==61056==    by 0x1183BC: peer_discard (connectd.c:1861)
==61056==    by 0x11869E: recv_req (connectd.c:1942)
==61056==    by 0x12774B: handle_read (daemon_conn.c:35)
==61056==    by 0x173453: next_plan (io.c:59)
==61056==    by 0x17405B: do_plan (io.c:407)
==61056==    by 0x17409D: io_ready (io.c:417)
==61056==    by 0x176390: io_loop (poll.c:453)
==61056==    by 0x118A68: main (connectd.c:2082)
==61056==  Address 0x4bd8850 is 16 bytes inside a block of size 48 free'd
==61056==    at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==61056==    by 0x1860E6: tal_resize_ (tal.c:699)
==61056==    by 0x1373DD: tal_arr_remove_ (utils.c:184)
==61056==    by 0x11D508: destroy_subd (multiplex.c:930)
==61056==    by 0x1850A4: notify (tal.c:240)
==61056==    by 0x1855BB: del_tree (tal.c:402)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x185957: tal_free (tal.c:486)
==61056==    by 0x1183BC: peer_discard (connectd.c:1861)
==61056==    by 0x11869E: recv_req (connectd.c:1942)
==61056==    by 0x12774B: handle_read (daemon_conn.c:35)
```

So simply make the subds children of `peer` not the `peer->subds`
array.  The only effect is that drain_peer() can't simply free the
subds array but must free the subds one at a time.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
d31420211a connectd: add counters to each peer connection.
This allows us to detect when lightningd hasn't seen our latest
disconnect/reconnect; in particular, we would hit the following pattern:

1. lightningd says to connect a subd.
2. connectd disconnects and reconnects.
3. connectd reads message, connects subd.
4. lightningd reads disconnect and reconnect, sends msg to connect to subd again.
5. connectd asserts because subd is alreacy connected.

This way connectd can tell if lightningd is talking about the previous
connection, and ignoere it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
41b379ed89 lightningd: hand fds to connectd, not receive them from connectd.
Before this patch:
1. connectd says it's connected (peer_connected)
2. we tell connectd we want to talk about each channel (peer_make_active)
3. connectd gives us an fd for each channel, and we connect it to a subd (peer_active)
4. OR, connectd says it sent something about a channel we didn't tell it about, with an fd (peer_active)

Now:
1. connectd says it's connected (peer_connected)
2. we start all appropriate subds and tell connectd to what channels/fds (peer_connect_subd).
3. if connectd says it sent something about a channel we didn't tell it about, we either tell
   it to hang up (peer_final_msg), or connect a new opening daemon (peer_connect_subd).

This is the minimal-size patch, which is why we create socket pairs in
so many places to use the existing functions.  Many cleanups are
possible, since the new flow is so simple.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
ab0e5d30ee connectd: don't io_halfclose()
We don't io_halfclose() the other side, we io_sock_shutdown(), which can leave both
sides unset:

```
lightningd-2: 2022-06-07T11:00:05.053Z **BROKEN** connectd: FATAL SIGNAL 6 (version 57e1af2)
lightningd-2: 2022-06-07T11:00:05.053Z **BROKEN** connectd: backtrace: common/daemon.c:38 (send_backtrace) 0x563b9b603af7
lightningd-2: 2022-06-07T11:00:05.053Z **BROKEN** connectd: backtrace: common/daemon.c:46 (crashdump) 0x563b9b603b4b
lightningd-2: 2022-06-07T11:00:05.053Z **BROKEN** connectd: backtrace: /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 ((null)) 0x7fe6e8d4f08f
lightningd-2: 2022-06-07T11:00:05.053Z **BROKEN** connectd: backtrace: ../sysdeps/unix/sysv/linux/raise.c:51 (__GI_raise) 0x7fe6e8d4f00b
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79 (__GI_abort) 0x7fe6e8d2e858
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: /build/glibc-SzIz7B/glibc-2.31/assert/assert.c:92 (__assert_fail_base) 0x7fe6e8d2e728
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: /build/glibc-SzIz7B/glibc-2.31/assert/assert.c:101 (__GI___assert_fail) 0x7fe6e8d3ffd5
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:65 (next_plan) 0x563b9b64fd7e
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x563b9b6508f0
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:423 (io_ready) 0x563b9b650984
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x563b9b652c25
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: connectd/connectd.c:2037 (main) 0x563b9b5f5793
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: ../csu/libc-start.c:308 (__libc_start_main) 0x7fe6e8d30082
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x563b9b5ebf6d
lightningd-2: 2022-06-07T11:00:05.054Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0xffffffffffffffff
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
7b0c11efb4 connectd: don't let peer close take forever.
Sending any pending messages to peer before hanging up is a courtesy:
give it 5 seconds before simply closing.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
8678c5efb3 connectd: release peer soon as lightingd tells us.
Now we have separate peer draining logic, we can simply use it when
connectd tells us to release the peer, without waiting.  (We could
simply free the peer, but that's a bit rude, as messages can get
lost).

This removes various complex flags and logic we had before.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: `connectd`: various crashes and issues fixed by simplification and rewrite.
2022-07-18 20:50:04 -05:00
Rusty Russell
9dc3880360 connectd: put peer into "draining" mode when we want to close it.
This removes it from the hashtable, and forces it to do nothing but
send out any remaining packets, then close.

It is, in effect, reduced to a stub, with no further interactions
with the rest of the system (all subds are freed already).

Also removes the need for an explicit "final_msg" too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
37ff013c2c connectd: fix subd tal parents.
This came out in a later patch: freeing the peer->subds doesn't actually
free the subds, because they're reparented onto subd->conn, which is
a child of peer itself.


This breaks because when the peer is finally freed, destroy_subd is
called, and expects to find itself in peer->subds (but we made that
NULL when we manually freed it!).

Fix this, and make it obvious that we tal_steal it.

```
ightning_connectd: FATAL SIGNAL 11 (version v0.11.0.1-25-gbf025aa-modded)
0x55de2a1b8b94 send_backtrace
	common/daemon.c:33
0x55de2a1b8c3e crashdump
	common/daemon.c:46
0x7fe2be2fc08f ???
	/build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x55de2a1af41e destroy_subd
	connectd/multiplex.c:1119
0x55de2a217686 notify
	ccan/ccan/tal/tal.c:240
0x55de2a217b9d del_tree
	ccan/ccan/tal/tal.c:402
0x55de2a217bef del_tree
	ccan/ccan/tal/tal.c:412
0x55de2a217bef del_tree
	ccan/ccan/tal/tal.c:412
0x55de2a217f39 tal_free
	ccan/ccan/tal/tal.c:486
0x55de2a1aa116 peer_discard
	connectd/connectd.c:1834
0x55de2a1aa38d recv_req
	connectd/connectd.c:1903
0x55de2a1b9121 handle_read
	common/daemon_conn.c:31
0x55de2a205a35 next_plan
	ccan/ccan/io/io.c:59
0x55de2a20663d do_plan
	ccan/ccan/io/io.c:407
0x55de2a20667f io_ready
	ccan/ccan/io/io.c:417
0x55de2a208972 io_loop
	ccan/ccan/io/poll.c:453
0x55de2a1aa736 main
	connectd/connectd.c:2042
0x7fe2be2dd082 __libc_start_main
	../csu/libc-start.c:308
0x55de2a1a085d ???
	???:0
0xffffffffffffffff ???
	???:0
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-18 20:50:04 -05:00
Rusty Russell
6fd8fa4d95 connectd: optimize requests for "recent" gossip.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-15 21:18:29 +09:30
Rusty Russell
92fe871467 connectd: optimize case where peer doesn't want gossip.
LND and us send 0xFFFFFFFF to turn off gossip.  LDK and Eclair don't
seem to turn off gossip at all, but that's OK.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-07-15 21:18:29 +09:30
Rusty Russell
06e1e119aa pytest: fix test_gossip_no_empty_announcements flake.
This is a side-effect of fixing aging: sometimes, we age our
rcvd_filter cache too fast, and thus re-xmit.  This breaks
our test, since it used dev-disconnect on the channel_announce,
but that closes to l3, not l1!

```
>       assert l1.rpc.listchannels()['channels'] == []
E       AssertionError: assert [{'active': T...ags': 1, ...}] == []
E         Left contains 2 more items, first extra item: {'active': True, 'amount_msat': 100000000msat, 'base_fee_millisatoshi': 1, 'channel_flags': 0, ...}
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: #5403
2022-07-12 21:41:19 +09:30
Rusty Russell
7dd8e27862 connectd: don't insist on ping replies when other traffic is flowing.
Got complaints about us hanging up on some nodes because they don't respond
to pings in a timely manner (e.g. ACINQ?), but that turned out to be something
else.

Nonetheless, we've had reports in the past of LND badly prioritizing gossip
traffic, and thus important messages can get queued behind gossip dumps!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: connectd: give busy peers more time to respond to pings.
2022-07-09 12:27:05 +09:30