This basically means moving the code which handles these queries from
gossipd to connectd.
This gives connectd finer control over rate-limiting them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is more efficient in a few ways:
1. It's trivial to get to the end of the gossip_store: we don't have
to iterate.
2. It tends to be mmapped, so we don't have to call pread().
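For illustration, a minimal sketch of why this is cheaper (hypothetical struct and function names, not the real gossip_store API): once the file is mapped, its length gives us the end directly, and records are read straight from memory instead of via pread():
```
/* Sketch only, with hypothetical names: not the real gossip_store API. */
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>

struct gossip_map {
	const char *mem;	/* mapped file contents */
	size_t len;		/* file length, i.e. the end of the store */
};

static bool gossip_map_open(struct gossip_map *m, int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0 || st.st_size == 0)
		return false;
	m->len = st.st_size;
	/* Records are read straight out of m->mem, and "the end" is
	 * simply m->len: no pread() loop from the start.  (A real
	 * implementation would fall back to pread() if mmap() fails.) */
	m->mem = mmap(NULL, m->len, PROT_READ, MAP_SHARED, fd, 0);
	return m->mem != MAP_FAILED;
}
```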
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently stream gossip as fast as we can, even if the peer asks
for everything from timestamp 0. Instead, use a simple token bucket
filter and only let them have 1MB per second (500 bytes per second
for testing).
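A minimal token-bucket sketch (names and constants illustrative, not the actual connectd implementation): tokens accrue at the per-second rate, capped at one second's worth of burst, and a gossip write only proceeds if it can pay for its bytes:
```
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define GOSSIP_TOKENS_PER_SEC (1024 * 1024)	/* 500 under test */

struct tokbuf {
	double tokens;		/* bytes we may still send */
	struct timespec last;	/* last refill time (init to "now") */
};

static bool tokbuf_take(struct tokbuf *tb, size_t bytes)
{
	struct timespec now;
	double elapsed;

	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed = (now.tv_sec - tb->last.tv_sec)
		+ (now.tv_nsec - tb->last.tv_nsec) / 1e9;
	tb->last = now;

	/* Refill at the configured rate, capping burst at one second. */
	tb->tokens += elapsed * GOSSIP_TOKENS_PER_SEC;
	if (tb->tokens > GOSSIP_TOKENS_PER_SEC)
		tb->tokens = GOSSIP_TOKENS_PER_SEC;

	if (tb->tokens < bytes)
		return false;	/* caller defers this gossip message */
	tb->tokens -= bytes;
	return true;
}
```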
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Protocol: connectd: we now throttle outgoing gossip at 1MB/second per peer.
We were getting the following message in test_feerate_stress:
```
2024-07-08T02:15:45.5663941Z lightningd-2 2024-07-08T02:13:45.696Z **BROKEN** 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Peer did not close, forcing close
```
I can reproduce it locally if I run the test enough, and finally found
the issue by printing the status of the fd when we time it out (using
routines from connectd.c).
The peer fd alternates between reading and writing. When we go to
discard it, we wake the write queue, so write_to_peer() gets called.
It won't shut down the socket if there are still subds attached, and
will wait again for a read.
The last subd exit also has to wake the write queue if we're draining,
so it can do the io_sock_shutdown. Otherwise, we hit the timeout,
causing the message above.
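In sketch form (field and helper names are hypothetical; the real logic lives in connectd.c), the fix is a wake in the subd destructor:
```
#include <ccan/io/io.h>		/* for io_wake() */
#include <stdbool.h>
#include <stddef.h>

struct msg_queue;

struct peer {
	bool draining;			/* flushing final messages */
	size_t num_subds;		/* hypothetical count of subds */
	struct msg_queue *peer_outq;	/* queue write_to_peer() waits on */
};

struct subd {
	struct peer *peer;
};

/* Sketch of the fix: destructor run when a subd connection closes. */
static void destroy_subd(struct subd *subd)
{
	struct peer *peer = subd->peer;

	if (--peer->num_subds > 0)
		return;

	/* Last subd is gone: if we're draining, write_to_peer() must run
	 * again so it can io_sock_shutdown() instead of waiting forever
	 * for a read -- and eventually hitting the timeout above. */
	if (peer->draining)
		io_wake(peer->peer_outq);
}
```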
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Everyone understands gossip_queries now, but peers leave it unset to indicate
they have nothing useful to say.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Currently, anything which doesn't have a live channel is considered transient.
We free these first under stress, and also if they're still connecting.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we don't find one searching from our random spot in the peer table,
we're supposed to wrap, not crash!
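The intended scan, sketched with hypothetical table accessors: indices are taken modulo the table size, so running past the end wraps to slot 0 instead of crashing:
```
#include <stdbool.h>
#include <stddef.h>

struct peer;					/* opaque here */
extern bool is_transient(const struct peer *p);	/* hypothetical */

static struct peer *find_transient_from(struct peer **tbl, size_t n,
					size_t start)
{
	for (size_t i = 0; i < n; i++) {
		/* (start + i) % n wraps back to slot 0 rather than
		 * running off the end of the table. */
		struct peer *p = tbl[(start + i) % n];
		if (p && is_transient(p))
			return p;
	}
	return NULL;	/* a full lap found nothing suitable */
}
```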
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't actually support it yet, but this threads through the type change,
puts it in "decode" etc.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use a crude heuristic: if we were trying to contact them, it's a
"deliberate" connection, and should be preserved.
Changelog-Changed: connectd: prioritize peers with channels (and log!) if we run low on file descriptors.
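In sketch form (hypothetical fields), the heuristic is just a predicate over how the connection came about:
```
#include <stdbool.h>

struct peer {
	bool has_live_channel;	/* hypothetical field */
	bool we_initiated;	/* hypothetical: we tried to contact them */
};

/* When low on fds, evict transient peers first; a connection we dialed
 * out for ourselves is "deliberate", so keep it if we can. */
static bool should_preserve(const struct peer *peer)
{
	return peer->has_live_channel || peer->we_initiated;
}
```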
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I thought I was going to want a convenient way of counting
these, but it turns out to be unnecessary. Still, this is slightly
more efficient and simpler, so I am including it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This can happen if we're totally out of fds, but previously we gave
no log message indicating this!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This has the benefit of being shorter, as well as more reliable (you
will get a link error if we can't print it, not a runtime one!).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This code was trying to check that the address type is not one of the ADDR_TYPE_TOR*
types, but the is_toraddr() function checks a domain name! The cast should have been
a clue that this was wrong!
Anyway, wireaddr_to_addrinfo() aborts on these cases already, so the asserts here are
superfluous.
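Schematically (types simplified here, not the real wireaddr definitions), the bug and the check it meant to perform look like this; the valgrind report below shows strlen() walking off into uninitialised memory:
```
#include <assert.h>
#include <stdbool.h>

bool is_toraddr(const char *fqdn);	/* parses a domain *string* */

enum wire_addr_type { ADDR_TYPE_IPV4, ADDR_TYPE_TOR_V2, ADDR_TYPE_TOR_V3 };
struct wireaddr {
	enum wire_addr_type type;
	unsigned char addr[32];	/* raw bytes, NOT a NUL-terminated string */
};

static void check_not_tor(const struct wireaddr *a)
{
	/* Buggy: hands raw address bytes to a string parser; the cast
	 * was the clue.  strlen() reads uninitialised memory. */
	assert(!is_toraddr((char *)a->addr));

	/* What it meant to check -- but wireaddr_to_addrinfo() already
	 * aborts on these types, so the assert can simply be deleted. */
	assert(a->type != ADDR_TYPE_TOR_V2 && a->type != ADDR_TYPE_TOR_V3);
}
```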
Found in unrelated CI run:
```
Valgrind error file: valgrind-errors.20610
==20610== Conditional jump or move depends on uninitialised value(s)
==20610== at 0x484ED28: strlen (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==20610== by 0x138FA3: is_toraddr (wireaddr.c:344)
==20610== by 0x11499B: conn_init (connectd.c:729)
==20610== by 0x28FD73: next_plan (io.c:59)
==20610== by 0x28FF94: io_new_conn_ (io.c:116)
==20610== by 0x11531B: try_connect_one_addr (connectd.c:927)
==20610== by 0x1182A8: try_connect_peer (connectd.c:1781)
==20610== by 0x11834E: connect_to_peer (connectd.c:1797)
==20610== by 0x119241: recv_req (connectd.c:2074)
==20610== by 0x12836F: handle_read (daemon_conn.c:35)
==20610== by 0x28FD73: next_plan (io.c:59)
==20610== by 0x2909A8: do_plan (io.c:407)
==20610==
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This happens if:
1. The peer sets a timestamp filter to non-zero, and
2. We have a channel_announcement without a channel_update.
The timestamp is 0 as a placeholder, part of the recent gossip rework
(we used to hold these channel_announcements in memory, which was complex).
But this means we won't send the channel_announcement in this case, and
if we later send the channel_update, CI will complain about 'Bad gossip order'.
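A sketch of the fix (hypothetical filter struct; the real state comes from the peer's gossip_timestamp_filter): treat a zero timestamp as "no channel_update yet" and always pass it:
```
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-peer filter state, set by gossip_timestamp_filter. */
struct gossip_filter {
	uint32_t first_timestamp;
	uint32_t timestamp_range;
};

static bool timestamp_filter_passes(const struct gossip_filter *f,
				    uint32_t timestamp)
{
	/* 0 is our placeholder for "channel_announcement without a
	 * channel_update yet": always let it through, or a later
	 * channel_update would arrive without its announcement
	 * ("Bad gossip order"). */
	if (timestamp == 0)
		return true;
	return timestamp >= f->first_timestamp
		&& timestamp - f->first_timestamp < f->timestamp_range;
}
```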
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We weakened this progressively over time, and gossip v1.5 makes spam
impossible by protocol, so we can wait until then.
Removing this code simplifies things a great deal!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Removed: Protocol: we no longer ratelimit gossip messages by channel, making our code far simpler.
Make sure the plugin has gotten its message to connectd before sending!
```
    def test_even_sendcustommsg(node_factory):
        l1, l2 = node_factory.get_nodes(2, opts={'log-level': 'io',
                                                 'allow_warning': True})
        l1.connect(l2)
        # Even-numbered message
        msg = hex(43690)[2:] + ('ff' * 30) + 'bb'
        # l2 will hang up when it gets this.
        l1.rpc.sendcustommsg(l2.info['id'], msg)
        l2.daemon.wait_for_log(r'\[IN\] {}'.format(msg))
        l1.daemon.wait_for_log('Invalid unknown even msg')
        wait_for(lambda: l1.rpc.listpeers(l2.info['id'])['peers'] == [])
        # Now with a plugin which allows it
        l1.connect(l2)
        l2.rpc.plugin_start(os.path.join(os.getcwd(), "tests/plugins/allow_even_msgs.py"))
        l1.rpc.sendcustommsg(l2.info['id'], msg)
        l2.daemon.wait_for_log(r'\[IN\] {}'.format(msg))
>       l2.daemon.wait_for_log(r'allow_even_msgs.*Got message 43690')
tests/test_misc.py:3623:
...
> raise TimeoutError('Unable to find "{}" in logs.'.format(exs))
E TimeoutError: Unable to find "[re.compile('allow_even_msgs.*Got message 43690')]" in logs.
contrib/pyln-testing/pyln/testing/utils.py:327: TimeoutError
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we get a WIRE_TX_ABORT followed by another message, we send the other
message to the same subd (even though the tx_abort causes it to shut down).
This means we effectively lose the next message and time out (see the CI
log below; reproduced locally).
So, have connectd ignore the subd after it forwards the WIRE_TX_ABORT. The next
message will, correctly, cause a fresh subdaemon to be spawned.
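Sketched below (fromwire_peektype() and WIRE_TX_ABORT are real; the subd fields and detach_subd() helper are hypothetical):
```
/* Sketch, not the actual connectd multiplexing code. */
static void forward_to_subd(struct subd *subd, const u8 *msg)
{
	msg_enqueue(subd->outq, msg);

	/* WIRE_TX_ABORT makes this subd shut down, so anything we
	 * forward after it would be lost.  Detach now (hypothetical
	 * helper): the next peer message then spawns a fresh
	 * subdaemon, which handles it correctly. */
	if (fromwire_peektype(msg) == WIRE_TX_ABORT)
		detach_subd(subd->peer, subd);
}
```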
```
    @unittest.skipIf(TEST_NETWORK != 'regtest', 'elementsd doesnt yet support PSBT features we need')
    @pytest.mark.openchannel('v2')
    def test_v2_rbf_multi(node_factory, bitcoind, chainparams):
        l1, l2 = node_factory.get_nodes(2,
                                        opts={'may_reconnect': True,
                                              'dev-no-reconnect': None,
                                              'allow_warning': True})
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
        amount = 2**24
        chan_amount = 100000
        bitcoind.rpc.sendtoaddress(l1.rpc.newaddr()['bech32'], amount / 10**8 + 0.01)
        bitcoind.generate_block(1)
        # Wait for it to arrive.
        wait_for(lambda: len(l1.rpc.listfunds()['outputs']) > 0)
        res = l1.rpc.fundchannel(l2.info['id'], chan_amount)
        chan_id = res['channel_id']
        vins = bitcoind.rpc.decoderawtransaction(res['tx'])['vin']
        assert(only_one(vins))
        prev_utxos = ["{}:{}".format(vins[0]['txid'], vins[0]['vout'])]
        # Check that we're waiting for lockin
        l1.daemon.wait_for_log(' to DUALOPEND_AWAITING_LOCKIN')
        # Attempt to do abort, should fail since we've
        # already gotten an inflight
        with pytest.raises(RpcError):
            l1.rpc.openchannel_abort(chan_id)
        rate = int(find_next_feerate(l1, l2)[:-5])
        # We 4x the feerate to beat the min-relay fee
        next_feerate = '{}perkw'.format(rate * 4)
        # Initiate an RBF
        startweight = 42 + 172  # base weight, funding output
        initpsbt = l1.rpc.utxopsbt(chan_amount, next_feerate, startweight,
                                   prev_utxos, reservedok=True,
                                   min_witness_weight=110,
                                   excess_as_change=True)
        # Do the bump
        bump = l1.rpc.openchannel_bump(chan_id, chan_amount,
                                       initpsbt['psbt'],
                                       funding_feerate=next_feerate)
        # Abort this open attempt! We will re-try
        aborted = l1.rpc.openchannel_abort(chan_id)
        assert not aborted['channel_canceled']
        # We no longer disconnect on aborts, because magic!
        assert only_one(l1.rpc.listpeers()['peers'])['connected']
        # Do the bump, again, same feerate
>       bump = l1.rpc.openchannel_bump(chan_id, chan_amount,
                                       initpsbt['psbt'],
                                       funding_feerate=next_feerate)
tests/test_opening.py:668:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
contrib/pyln-client/pyln/client/lightning.py:1206: in openchannel_bump
return self.call("openchannel_bump", payload)
contrib/pyln-testing/pyln/testing/utils.py:718: in call
res = LightningRpc.call(self, method, payload, cmdprefix, filter)
contrib/pyln-client/pyln/client/lightning.py:398: in call
resp, buf = self._readobj(sock, buf)
contrib/pyln-client/pyln/client/lightning.py:315: in _readobj
b = sock.recv(max(1024, len(buff)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pyln.client.lightning.UnixSocket object at 0x7f34675aae80>
length = 1024
    def recv(self, length: int) -> bytes:
        if self.sock is None:
            raise socket.error("not connected")
>       return self.sock.recv(length)
E       Failed: Timeout >1200.0s
```
Previously, we would forward the message to a subd, but now we have
the case where the subd is gone but we're still connected. If the
peer sends anything but a reestablish in that state, we drop the connection.
Instead, an error should always make us fail the channel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
On Mac, most tests report BROKEN because sodium creates an untracked fd pointing to /dev/random; dev_report_fd() finds it at teardown and reports a BROKEN message.
We now allow a single "char special" fd without reporting it as broken, improving QOL for Mac developers.
While we're here, we add the fd mode to the log to help with future rogue fd issues.
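A sketch of the teardown check (function name hypothetical): fstat() the leaked fd and forgive exactly one character-special file:
```
#include <stdbool.h>
#include <sys/stat.h>

/* Sketch: tolerate exactly one character-special fd (sodium's
 * /dev/random on Mac); st.st_mode is also what we now log so future
 * rogue fds are easier to identify. */
static bool fd_is_forgivable(int fd, int *num_char_special)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return false;
	if (S_ISCHR(st.st_mode))
		return (*num_char_special)++ == 0;
	return false;
}
```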
Changelog-None
This makes it easier to use outside simple subds, and now lightningd can
simply dump to log rather than returning JSON.
JSON formatting was a lot of work, and we only did it for lightningd, not for
subdaemons. Easier to use the logs in all cases.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We still refuse to run dev commands if lightningd sends it to us
despite us not being in developer mode, but that's mainly paranoia.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Also requires us to expose memleak when !DEVELOPER; however, we only
ever used the memleak tracking when the LIGHTNINGD_DEV_MEMLEAK
environment variable was set, so keep that.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Update the lightningd <-> channeld interface with lots of new commands needed to facilitate splicing.
Implement the channeld splicing protocol, leveraging the interactivetx protocol.
Implement lightningd's channel_control to support channeld in its splicing efforts.
Changelog-Added: Added the features to enable splicing & resizing of active channels.
Fixes: #6368
Changelog-Fixed: Protocol: we no longer gossip about recently-closed channels (Eclair gets upset with this).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Otherwise we will access the freed connection to gossipd. This is weird to track
down when the *actual* issue is that gossipd died!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I never really liked this hack: websockets are useful, advertising
them not so much.
Note that we never actually documented that we would advertise these!
Changelog-EXPERIMENTAL: Protocol: Removed support for advertising websocket addresses in gossip.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>