tmpctx may not get cleaned immediately, so the timeout (a child of
the struct early_peer at this point) can still outlast the conn.
Do the clearer thing, and explicitly free the timeout.
Changelog-Fixed: connectd: crash on erroneous timeout.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Things are often equivalent but different types:
1. u8 arrays in libwally.
2. sha256
3. Secrets derived via sha256
4. txids
Rather than open-coding a BUILD_ASSERT & memcpy, create a macro to do it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
However fast we can handle them, it's antisocial to allow others to
make us spam the rest of the network.
Changelog-Protocol: onion messages: we limit incoming to 4 per second, allowing a little burst.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is now permitted in the offers PR, so we should support it. But
we can't just look up in the gossmap, since the "short_channel_id"
could be an alias. So we get lightningd to tell us all scid->peer
mappings, and look up in that.
Changelog-Added: Protocol: onion messages can now be forwarded by short_channel_id.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Unlike "sendonionmessage" which instructs us to send to a peer, this
process it locally (presumably, it contains the next hop). This is
useful because it allows us to process an onion message which starts
with us (a legal case for a blinded path supplied by someone else!).
It also opens the door to bolt12 self-pay.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Allows for caller to log, but more importantly, when we add a command to
inject onion messages, allows for us to capture the error.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A bit tricky, since we get more than one message at a time. However,
this just means we go over quota for a bit, and will get caught when
those are sent (we do this for a single message already, so it's not
that much worse).
Note: this not only limits sending, but it limits the actuall query
processing, which is nice.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This basically means moving the code from gossipd to connectd to handle
these queries.
This will get connectd have finer control over ratelimiting them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is more efficient in a few ways:
1. It's trivial to get to the end of the gossip_store, we don't have
to iterate.
2. It tends to be mmaped so we don't have to call pread().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently stream gossip as fast as we can, even if they start at
timestamp 0. Instead, use a simple token bucket filter and only let
them have 1MB per second (500 bytes per second for testing).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Protocol: connectd: we now throttle outgoing gossip at 1MB/second per peer.
We were getting the following message in test_feerate_stress:
```
2024-07-08T02:15:45.5663941Z lightningd-2 2024-07-08T02:13:45.696Z **BROKEN** 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Peer did not close, forcing close
```
I can reproduce it locally if I run the test enough, and finally found
the issue by printing the status of the fd when we time it out (using
routines from connectd.c).
The peer fd alternates between reading and writing. When we go to
discard it, we wake the write queue, so write_to_peer() get called.
It won't shutdown the socket if there are still subds attached, and
will wait again for a read.
The last subd exit has to also wake the write queue if we're draining,
so it can do the io_sock_shutdown. Otherwise, we hit the timeout,
causing the message above.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Everyone understands gossip_queries now, but peers leave it unset to indicate
they have nothing useful to say.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Currently, anything which doesn't have a live channel is considered transient.
We free this first under stress, and also if they're still connecting.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we don't find one searching from our random spot in the peer table,
we're supposed to wrap, not crash!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't actually support it yet, but this threads through the type change,
puts it in "decode" etc.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use a crude heuristic: if we were trying to contact them, it's a
"deliberate" connection, and should be preserved.
Changelog-Changed: connectd: prioritize peers with channels (and log!) if we run low on file descriptors.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I thought I was going to want to have a convenient way of counting
these, but it turns out unnecessary. Still, this is slightly more
efficient and simple, so I am including it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This can happen if we're totally out of fds, but previously we gave
no log message indicating this!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This has the benefit of being shorter, as well as more reliable (you
will get a link error if we can't print it, not a runtime one!).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This code was trying to check that the address type is not one of the ADDR_TYPE_TOR*
types, but the is_toraddr() function checks a domain name! The cast should have been
a clue that this was wrong!
Anyway, wireaddr_to_addrinfo() aborts on these cases already, so the asserts here are
superfluous.
Found in unrelated CI run:
```
Valgrind error file: valgrind-errors.20610
==20610== Conditional jump or move depends on uninitialised value(s)
==20610== at 0x484ED28: strlen (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==20610== by 0x138FA3: is_toraddr (wireaddr.c:344)
==20610== by 0x11499B: conn_init (connectd.c:729)
==20610== by 0x28FD73: next_plan (io.c:59)
==20610== by 0x28FF94: io_new_conn_ (io.c:116)
==20610== by 0x11531B: try_connect_one_addr (connectd.c:927)
==20610== by 0x1182A8: try_connect_peer (connectd.c:1781)
==20610== by 0x11834E: connect_to_peer (connectd.c:1797)
==20610== by 0x119241: recv_req (connectd.c:2074)
==20610== by 0x12836F: handle_read (daemon_conn.c:35)
==20610== by 0x28FD73: next_plan (io.c:59)
==20610== by 0x2909A8: do_plan (io.c:407)
==20610==
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This happens if:
1. The peer sets a timestamp filter to non-zero, and
2. We have a channel_announcement without a channel_update.
The timestamp is 0 as a placeholder as part of the recent gossip rework
(we used to hold these channel_announcement in memory, which was complex).
But this means we won't send it in this case, and if we later send the
channel_update, CI will complain about 'Bad gossip order'.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We weakened this progressively over time, and gossip v1.5 makes spam
impossible by protocol, so we can wait until then.
Removing this code simplifies things a great deal!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Removed: Protocol: we no longer ratelimit gossip messages by channel, making our code far simpler.
Make sure plugin has got message to connectd before sending!
```
def test_even_sendcustommsg(node_factory):
l1, l2 = node_factory.get_nodes(2, opts={'log-level': 'io',
'allow_warning': True})
l1.connect(l2)
# Even-numbered message
msg = hex(43690)[2:] + ('ff' * 30) + 'bb'
# l2 will hang up when it gets this.
l1.rpc.sendcustommsg(l2.info['id'], msg)
l2.daemon.wait_for_log(r'\[IN\] {}'.format(msg))
l1.daemon.wait_for_log('Invalid unknown even msg')
wait_for(lambda: l1.rpc.listpeers(l2.info['id'])['peers'] == [])
# Now with a plugin which allows it
l1.connect(l2)
l2.rpc.plugin_start(os.path.join(os.getcwd(), "tests/plugins/allow_even_msgs.py"))
l1.rpc.sendcustommsg(l2.info['id'], msg)
l2.daemon.wait_for_log(r'\[IN\] {}'.format(msg))
> l2.daemon.wait_for_log(r'allow_even_msgs.*Got message 43690')
tests/test_misc.py:3623:
...
> raise TimeoutError('Unable to find "{}" in logs.'.format(exs))
E TimeoutError: Unable to find "[re.compile('allow_even_msgs.*Got message 43690')]" in logs.
contrib/pyln-testing/pyln/testing/utils.py:327: TimeoutError
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we get a WIRE_TX_ABORT then another message, we send the other message to the same
subd (even though the tx abort causes it to shutdown). This means we effectively
lose the next message, and timeout (see below from CI, reproduced locally).
So, have connectd ignore the subd after it forwards the WIRE_TX_ABORT. The next
message will, correctly, cause a fresh subdaemon to be spawned.
```
@unittest.skipIf(TEST_NETWORK != 'regtest', 'elementsd doesnt yet support PSBT features we need')
@pytest.mark.openchannel('v2')
def test_v2_rbf_multi(node_factory, bitcoind, chainparams):
l1, l2 = node_factory.get_nodes(2,
opts={'may_reconnect': True,
'dev-no-reconnect': None,
'allow_warning': True})
l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
amount = 2**24
chan_amount = 100000
bitcoind.rpc.sendtoaddress(l1.rpc.newaddr()['bech32'], amount / 10**8 + 0.01)
bitcoind.generate_block(1)
# Wait for it to arrive.
wait_for(lambda: len(l1.rpc.listfunds()['outputs']) > 0)
res = l1.rpc.fundchannel(l2.info['id'], chan_amount)
chan_id = res['channel_id']
vins = bitcoind.rpc.decoderawtransaction(res['tx'])['vin']
assert(only_one(vins))
prev_utxos = ["{}:{}".format(vins[0]['txid'], vins[0]['vout'])]
# Check that we're waiting for lockin
l1.daemon.wait_for_log(' to DUALOPEND_AWAITING_LOCKIN')
# Attempt to do abort, should fail since we've
# already gotten an inflight
with pytest.raises(RpcError):
l1.rpc.openchannel_abort(chan_id)
rate = int(find_next_feerate(l1, l2)[:-5])
# We 4x the feerate to beat the min-relay fee
next_feerate = '{}perkw'.format(rate * 4)
# Initiate an RBF
startweight = 42 + 172 # base weight, funding output
initpsbt = l1.rpc.utxopsbt(chan_amount, next_feerate, startweight,
prev_utxos, reservedok=True,
min_witness_weight=110,
excess_as_change=True)
# Do the bump
bump = l1.rpc.openchannel_bump(chan_id, chan_amount,
initpsbt['psbt'],
funding_feerate=next_feerate)
# Abort this open attempt! We will re-try
aborted = l1.rpc.openchannel_abort(chan_id)
assert not aborted['channel_canceled']
# We no longer disconnect on aborts, because magic!
assert only_one(l1.rpc.listpeers()['peers'])['connected']
# Do the bump, again, same feerate
> bump = l1.rpc.openchannel_bump(chan_id, chan_amount,
initpsbt['psbt'],
funding_feerate=next_feerate)
tests/test_opening.py:668:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
contrib/pyln-client/pyln/client/lightning.py:1206: in openchannel_bump
return self.call("openchannel_bump", payload)
contrib/pyln-testing/pyln/testing/utils.py:718: in call
res = LightningRpc.call(self, method, payload, cmdprefix, filter)
contrib/pyln-client/pyln/client/lightning.py:398: in call
resp, buf = self._readobj(sock, buf)
contrib/pyln-client/pyln/client/lightning.py:315: in _readobj
b = sock.recv(max(1024, len(buff)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pyln.client.lightning.UnixSocket object at 0x7f34675aae80>
length = 1024
def recv(self, length: int) -> bytes:
if self.sock is None:
raise socket.error("not connected")
> return self.sock.recv(length)
E Failed: Timeout >1200.0s
```
Previously, we would forward the message to a subd, but now we have
the case where the subd is gone, but we're still connected. If the
peer anything but a reestablish in that state, we drop the connection.
Instead, an error should always make us fail the channel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
On Mac most tests report BROKEN because sodium creating an untracked fd pointing to /dev/random. dev_report_fd’s finds it at tear down and reports a BROKEN message.
We allow a single “char special” fd without reporting it as broken improving QOL for Mac developers.
While we’re here we added the fd mode to the log to help with future rogue fd issues.
ChangeLog-None
This makes it easier to use outside simple subds, and now lightningd can
simply dump to log rather than returning JSON.
JSON formatting was a lot of work, and we only did it for lightningd, not for
subdaemons. Easier to use the logs in all cases.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We still refuse to run dev commands if lightningd sends it to us
despite us not being in developer mode, but that's mainly paranoia.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Also requires us to expose memleak when !DEVELOPER, however we only
ever used the memleak tracking when the LIGHTNINGD_DEV_MEMLEAK
environment variable was set, so keep that.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>