Everyone understands gossip_queries now, but peers leave it unset to indicate
they have nothing useful to say.
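As a sketch of what that interpretation means in practice (a hand-rolled bit test for illustration; the real code uses the common feature helpers):
```
#include <stdbool.h>
#include <stddef.h>

/* Illustrative only: BOLT #9 feature bitmaps are big-endian, and
 * gossip_queries is feature bits 6 (compulsory) / 7 (optional). */
static bool feature_bit_set(const unsigned char *features, size_t len,
			    size_t bit)
{
	if (bit / 8 >= len)
		return false;
	return (features[len - 1 - bit / 8] >> (bit % 8)) & 1;
}

/* Everyone understands gossip_queries now, so an unset bit no longer
 * means "old software": it means "nothing useful to ask me". */
static bool peer_has_useful_gossip(const unsigned char *features, size_t len)
{
	return feature_bit_set(features, len, 6) ||
	       feature_bit_set(features, len, 7);
}
```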
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use a crude heuristic: if we were trying to contact them, it's a
"deliberate" connection, and should be preserved.
Changelog-Changed: connectd: prioritize peers with channels (and log!) if we run low on file descriptors.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We weakened this progressively over time, and gossip v1.5 makes spam
impossible by protocol, so we can wait until then.
Removing this code simplifies things a great deal!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Removed: Protocol: we no longer ratelimit gossip messages by channel, making our code far simpler.
Make sure the plugin's message has reached connectd before we send!
```
def test_even_sendcustommsg(node_factory):
    l1, l2 = node_factory.get_nodes(2, opts={'log-level': 'io',
                                             'allow_warning': True})
    l1.connect(l2)

    # Even-numbered message
    msg = hex(43690)[2:] + ('ff' * 30) + 'bb'

    # l2 will hang up when it gets this.
    l1.rpc.sendcustommsg(l2.info['id'], msg)
    l2.daemon.wait_for_log(r'\[IN\] {}'.format(msg))
    l1.daemon.wait_for_log('Invalid unknown even msg')
    wait_for(lambda: l1.rpc.listpeers(l2.info['id'])['peers'] == [])

    # Now with a plugin which allows it
    l1.connect(l2)
    l2.rpc.plugin_start(os.path.join(os.getcwd(), "tests/plugins/allow_even_msgs.py"))
    l1.rpc.sendcustommsg(l2.info['id'], msg)
    l2.daemon.wait_for_log(r'\[IN\] {}'.format(msg))
>   l2.daemon.wait_for_log(r'allow_even_msgs.*Got message 43690')
tests/test_misc.py:3623:
...
> raise TimeoutError('Unable to find "{}" in logs.'.format(exs))
E TimeoutError: Unable to find "[re.compile('allow_even_msgs.*Got message 43690')]" in logs.
contrib/pyln-testing/pyln/testing/utils.py:327: TimeoutError
```
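One way to close the race, sketched with hypothetical helper names (the real lightningd plumbing differs): don't complete the plugin's registration until connectd has acknowledged the updated allow-list, so a following sendcustommsg cannot overtake it.
```
struct lightningd;
struct plugin;

/* Hypothetical helpers, for illustration only: bodies elided. */
static void send_custommsg_allowlist(struct lightningd *ld) { (void)ld; }
static void await_connectd_ack(struct lightningd *ld) { (void)ld; }
static void plugin_registration_done(struct plugin *p) { (void)p; }

static void register_even_msgs(struct lightningd *ld, struct plugin *plugin)
{
	/* Push the plugin's even-message allow-list to connectd... */
	send_custommsg_allowlist(ld);
	/* ...and only declare the plugin ready once connectd has
	 * acked, so a following sendcustommsg cannot race ahead. */
	await_connectd_ack(ld);
	plugin_registration_done(plugin);
}
```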
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we get a WIRE_TX_ABORT then another message, we send the other message to the same
subd (even though the tx abort causes it to shut down). This means we effectively
lose the next message and time out (see the CI log below; reproduced locally).
So, have connectd ignore the subd after it forwards the WIRE_TX_ABORT. The next
message will, correctly, cause a fresh subdaemon to be spawned.
```
@unittest.skipIf(TEST_NETWORK != 'regtest', 'elementsd doesnt yet support PSBT features we need')
@pytest.mark.openchannel('v2')
def test_v2_rbf_multi(node_factory, bitcoind, chainparams):
    l1, l2 = node_factory.get_nodes(2,
                                    opts={'may_reconnect': True,
                                          'dev-no-reconnect': None,
                                          'allow_warning': True})
    l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
    amount = 2**24
    chan_amount = 100000
    bitcoind.rpc.sendtoaddress(l1.rpc.newaddr()['bech32'], amount / 10**8 + 0.01)
    bitcoind.generate_block(1)
    # Wait for it to arrive.
    wait_for(lambda: len(l1.rpc.listfunds()['outputs']) > 0)
    res = l1.rpc.fundchannel(l2.info['id'], chan_amount)
    chan_id = res['channel_id']
    vins = bitcoind.rpc.decoderawtransaction(res['tx'])['vin']
    assert(only_one(vins))
    prev_utxos = ["{}:{}".format(vins[0]['txid'], vins[0]['vout'])]
    # Check that we're waiting for lockin
    l1.daemon.wait_for_log(' to DUALOPEND_AWAITING_LOCKIN')
    # Attempt to do abort, should fail since we've
    # already gotten an inflight
    with pytest.raises(RpcError):
        l1.rpc.openchannel_abort(chan_id)
    rate = int(find_next_feerate(l1, l2)[:-5])
    # We 4x the feerate to beat the min-relay fee
    next_feerate = '{}perkw'.format(rate * 4)
    # Initiate an RBF
    startweight = 42 + 172  # base weight, funding output
    initpsbt = l1.rpc.utxopsbt(chan_amount, next_feerate, startweight,
                               prev_utxos, reservedok=True,
                               min_witness_weight=110,
                               excess_as_change=True)
    # Do the bump
    bump = l1.rpc.openchannel_bump(chan_id, chan_amount,
                                   initpsbt['psbt'],
                                   funding_feerate=next_feerate)
    # Abort this open attempt! We will re-try
    aborted = l1.rpc.openchannel_abort(chan_id)
    assert not aborted['channel_canceled']
    # We no longer disconnect on aborts, because magic!
    assert only_one(l1.rpc.listpeers()['peers'])['connected']
    # Do the bump, again, same feerate
>   bump = l1.rpc.openchannel_bump(chan_id, chan_amount,
                                   initpsbt['psbt'],
                                   funding_feerate=next_feerate)
tests/test_opening.py:668:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
contrib/pyln-client/pyln/client/lightning.py:1206: in openchannel_bump
return self.call("openchannel_bump", payload)
contrib/pyln-testing/pyln/testing/utils.py:718: in call
res = LightningRpc.call(self, method, payload, cmdprefix, filter)
contrib/pyln-client/pyln/client/lightning.py:398: in call
resp, buf = self._readobj(sock, buf)
contrib/pyln-client/pyln/client/lightning.py:315: in _readobj
b = sock.recv(max(1024, len(buff)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pyln.client.lightning.UnixSocket object at 0x7f34675aae80>
length = 1024
    def recv(self, length: int) -> bytes:
        if self.sock is None:
            raise socket.error("not connected")
>       return self.sock.recv(length)
E       Failed: Timeout >1200.0s
```
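The fix, sketched (the flag is illustrative, not the actual field name):
```
#include <stdbool.h>

#define WIRE_TX_ABORT 74	/* interactive-tx abort message type */

/* Hypothetical subd state, for illustration. */
struct subd {
	bool ignored;	/* set once WIRE_TX_ABORT has been forwarded */
};

/* Returns true if the message should go to this subd; false means the
 * caller must spawn a fresh subdaemon to handle it. */
static bool can_forward_to_subd(struct subd *subd, int msgtype)
{
	if (subd->ignored)
		return false;
	if (msgtype == WIRE_TX_ABORT)
		subd->ignored = true;	/* it's shutting down: last message */
	return true;
}
```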
Previously, we would forward the message to a subd, but now we have
the case where the subd is gone, but we're still connected. If the
peer sends anything but a reestablish in that state, we drop the connection.
Instead, an error should always make us fail the channel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We still refuse to run dev commands if lightningd sends one to us
despite us not being in developer mode, but that's mainly paranoia.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Update the lightningd <-> channeld interface with lots of new commands needed to facilitate splicing.
Implement the channeld splicing protocol, leveraging the interactivetx protocol.
Implement lightningd's channel_control to support channeld in its splicing efforts.
Changelog-Added: Added the features to enable splicing & resizing of active channels.
This implements the proposal to simply use the timestamp as "all", "none"
or "recent". There's also a rough spec draft which I will post soon.
This *also* removes the last place where we would sometimes sweep the
entire gossip_store looking for their given timestamps.
We could also get rid of the actual timestamp filtering logic in
gossip_store_next if we want to, as it's now basically unused.
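A sketch of the resulting classification (enum names and the exact cutoffs are illustrative, not the real code's):
```
#include <stdint.h>

enum filter_mode { FILTER_ALL, FILTER_NONE, FILTER_RECENT };

static enum filter_mode classify_filter(uint32_t first_timestamp, uint32_t now)
{
	if (first_timestamp == 0xFFFFFFFF)
		return FILTER_NONE;	/* "send me no gossip" */
	if (first_timestamp < now)
		return FILTER_ALL;	/* any past timestamp: dump it all */
	return FILTER_RECENT;		/* otherwise just stream new gossip */
}
```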
Changelog-Changed: Protocol: Simplify gossip_timestamp_filter handling to "all", "none" or "recent" instead of exact timestamp.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This removes the logic that swept the entire gossip_store as soon as a peer
connects. This should save connectd a significant number of CPU cycles and
make @whitslack finally stop hitting me.
Changelog-Changed: `connectd` no longer sweeps gossip_store file when peer connects, saving CPU for large nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Most of this is piping the flag through so we know it's a websocket!
Reported-by: @ShahanaFarooqui
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This contains the zeroconf stuff, with funding_locked renamed to
channel_ready. I change that everywhere, and try to fix up the
comments.
Also the `alias` field is called `short_channel_id`.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Protocol: `funding_locked` is now called `channel_ready` as per latest BOLTs.
Connectd already does this when we *receive* an error or warning, but
now do it on send. This causes a slight behaviour change: we don't
disconnect when we close a channel, for example (our behaviour here
has been inconsistent across versions, depending on the code).
When connectd is told to disconnect, it now does so immediately, and
doesn't wait for subds to drain etc. That simplifies the manual
disconnect case, which now cleans up as it would from any other
disconnection when connectd says it's disconnected.
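The send side can mirror the receive side with something like this sketch (state layout simplified; only the message-type constants are real BOLT #1 values):
```
#include <stdbool.h>

#define WIRE_WARNING	1	/* BOLT #1 message types */
#define WIRE_ERROR	17

/* Hypothetical peer state, for illustration. */
struct peer {
	bool close_after_send;
};

static void queue_peer_msg(struct peer *peer, int msgtype)
{
	/* ...append message to the peer's output queue here... */

	/* Sending an error or warning now ends the connection too,
	 * once the message itself has actually gone out. */
	if (msgtype == WIRE_ERROR || msgtype == WIRE_WARNING)
		peer->close_after_send = true;
}
```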
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We want to avoid lost messages in the common cases.
This generalizes our drain code: each subd gets 5 seconds to close
itself, but we continue to allow them to send us traffic (if the
peer is still connected) and continue to send them traffic.
We continue to send traffic *out* to the peer (if it's still
connected), until all subds are gone. We still have a 5 second timer
to close the connection to peer.
We don't do this "drain period" on reconnects: we kill
immediately.
We fix up one test which was looking for the "disconnect" message
explicitly.
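In outline (the timer helpers are stand-ins for the real ccan/timer plumbing):
```
#define DRAIN_SECONDS 5

struct peer;
struct subd;

/* Hypothetical helpers, for illustration. */
extern void start_timer(int seconds, void (*cb)(void *), void *arg);
extern void force_close_subd(void *subd);
extern void force_close_peer_conn(void *peer);

static void start_drain(struct peer *peer, struct subd **subds, int num_subds)
{
	/* Each subd gets 5 seconds to close itself; until then traffic
	 * keeps flowing both ways if the peer is still connected. */
	for (int i = 0; i < num_subds; i++)
		start_timer(DRAIN_SECONDS, force_close_subd, subds[i]);

	/* The peer connection itself also has a 5-second deadline. */
	start_timer(DRAIN_SECONDS, force_close_peer_conn, peer);
}
```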
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A subtle case I hadn't come across before: if a child tal_resizes()
its parent while the parent is being deleted, tal gets confused.
The subd destructor does this using tal_arr_remove() on peer->subds,
which is currently being freed:
```
==61056== Invalid read of size 8
==61056== at 0x185632: del_tree (tal.c:417)
==61056== by 0x18560D: del_tree (tal.c:412)
==61056== by 0x185957: tal_free (tal.c:486)
==61056== by 0x1183BC: peer_discard (connectd.c:1861)
==61056== by 0x11869E: recv_req (connectd.c:1942)
==61056== by 0x12774B: handle_read (daemon_conn.c:35)
==61056== by 0x173453: next_plan (io.c:59)
==61056== by 0x17405B: do_plan (io.c:407)
==61056== by 0x17409D: io_ready (io.c:417)
==61056== by 0x176390: io_loop (poll.c:453)
==61056== by 0x118A68: main (connectd.c:2082)
==61056== Address 0x4bd8850 is 16 bytes inside a block of size 48 free'd
==61056== at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==61056== by 0x1860E6: tal_resize_ (tal.c:699)
==61056== by 0x1373DD: tal_arr_remove_ (utils.c:184)
==61056== by 0x11D508: destroy_subd (multiplex.c:930)
==61056== by 0x1850A4: notify (tal.c:240)
==61056== by 0x1855BB: del_tree (tal.c:402)
==61056== by 0x18560D: del_tree (tal.c:412)
==61056== by 0x18560D: del_tree (tal.c:412)
==61056== by 0x185957: tal_free (tal.c:486)
==61056== by 0x1183BC: peer_discard (connectd.c:1861)
==61056== by 0x11869E: recv_req (connectd.c:1942)
==61056== by 0x12774B: handle_read (daemon_conn.c:35)
```
So simply make the subds children of `peer`, not of the `peer->subds`
array. The only effect is that drain_peer() can't simply free the
subds array, but must free the subds one at a time.
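In sketch form (real ccan/tal calls, simplified structures):
```
#include <ccan/tal/tal.h>

struct subd;
struct peer {
	/* The subds are tal children of `peer` itself, NOT of this
	 * array, so freeing the array never re-enters their destructors. */
	struct subd **subds;
};

static void drain_peer_subds(struct peer *peer)
{
	/* Can't just tal_free(peer->subds): free each subd one at a
	 * time, letting its destructor remove itself from the array. */
	while (tal_count(peer->subds) != 0)
		tal_free(peer->subds[0]);
}
```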
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This allows us to detect when lightningd hasn't seen our latest
disconnect/reconnect; in particular, we would hit the following pattern:
1. lightningd says to connect a subd.
2. connectd disconnects and reconnects.
3. connectd reads message, connects subd.
4. lightningd reads disconnect and reconnect, sends msg to connect to subd again.
5. connectd asserts because subd is already connected.
This way connectd can tell if lightningd is talking about the previous
connection, and ignore it.
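A sketch of the counter idea (names hypothetical):
```
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state, for illustration. */
struct peer {
	uint64_t connection_counter;	/* bumped on every (re)connection */
};

/* lightningd echoes back the counter from the connection it saw;
 * connectd ignores requests that refer to an earlier connection. */
static bool msg_is_for_current_connection(const struct peer *peer,
					  uint64_t echoed_counter)
{
	return echoed_counter == peer->connection_counter;
}
```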
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Before this patch:
1. connectd says it's connected (peer_connected)
2. we tell connectd we want to talk about each channel (peer_make_active)
3. connectd gives us an fd for each channel, and we connect it to a subd (peer_active)
4. OR, connectd says it sent something about a channel we didn't tell it about, with an fd (peer_active)
Now:
1. connectd says it's connected (peer_connected)
2. we start all appropriate subds and tell connectd which channels/fds to use (peer_connect_subd).
3. if connectd says it sent something about a channel we didn't tell it about, we either tell
it to hang up (peer_final_msg), or connect a new opening daemon (peer_connect_subd).
This is the minimal-size patch, which is why we create socket pairs in
so many places to use the existing functions. Many cleanups are
possible, since the new flow is so simple.
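The socketpair trick is the standard one; a minimal sketch:
```
#include <sys/socket.h>

/* Create a connected pair so the existing fd-passing helpers can be
 * reused unchanged: one end stays in connectd, the other goes to the
 * freshly-spawned subdaemon. */
static int make_subd_fds(int *connectd_fd, int *subd_fd)
{
	int fds[2];

	if (socketpair(AF_LOCAL, SOCK_STREAM, 0, fds) != 0)
		return -1;
	*connectd_fd = fds[0];
	*subd_fd = fds[1];
	return 0;
}
```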
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Sending any pending messages to peer before hanging up is a courtesy:
give it 5 seconds before simply closing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now we have separate peer draining logic, we can simply use it when
connectd tells us to release the peer, without waiting. (We could
simply free the peer, but that's a bit rude, as messages can get
lost).
This removes various complex flags and logic we had before.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: `connectd`: various crashes and issues fixed by simplification and rewrite.
This removes it from the hashtable, and forces it to do nothing but
send out any remaining packets, then close.
It is, in effect, reduced to a stub, with no further interactions
with the rest of the system (all subds are freed already).
This also removes the need for an explicit "final_msg".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This came out in a later patch: freeing the peer->subds doesn't actually
free the subds, because they're reparented onto subd->conn, which is
a child of peer itself.
This breaks because when the peer is finally freed, destroy_subd is
called, and expects to find itself in peer->subds (but we made that
NULL when we manually freed it!).
Fix this, and make it obvious that we tal_steal it.
```
lightning_connectd: FATAL SIGNAL 11 (version v0.11.0.1-25-gbf025aa-modded)
0x55de2a1b8b94 send_backtrace
common/daemon.c:33
0x55de2a1b8c3e crashdump
common/daemon.c:46
0x7fe2be2fc08f ???
/build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x55de2a1af41e destroy_subd
connectd/multiplex.c:1119
0x55de2a217686 notify
ccan/ccan/tal/tal.c:240
0x55de2a217b9d del_tree
ccan/ccan/tal/tal.c:402
0x55de2a217bef del_tree
ccan/ccan/tal/tal.c:412
0x55de2a217bef del_tree
ccan/ccan/tal/tal.c:412
0x55de2a217f39 tal_free
ccan/ccan/tal/tal.c:486
0x55de2a1aa116 peer_discard
connectd/connectd.c:1834
0x55de2a1aa38d recv_req
connectd/connectd.c:1903
0x55de2a1b9121 handle_read
common/daemon_conn.c:31
0x55de2a205a35 next_plan
ccan/ccan/io/io.c:59
0x55de2a20663d do_plan
ccan/ccan/io/io.c:407
0x55de2a20667f io_ready
ccan/ccan/io/io.c:417
0x55de2a208972 io_loop
ccan/ccan/io/poll.c:453
0x55de2a1aa736 main
connectd/connectd.c:2042
0x7fe2be2dd082 __libc_start_main
../csu/libc-start.c:308
0x55de2a1a085d ???
???:0
0xffffffffffffffff ???
???:0
```
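The explicit reparenting, as a sketch (simplified types, real tal call):
```
#include <ccan/tal/tal.h>

struct conn;
struct subd;

static void subd_attach_to_conn(struct conn *conn, struct subd *subd)
{
	/* Make the ownership transfer obvious: subd now lives under
	 * conn, so freeing the peer->subds array alone will never free
	 * it, and destroy_subd() must still find peer->subds intact. */
	tal_steal(conn, subd);
}
```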
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Like LND, we send 0xFFFFFFFF to turn off gossip. LDK and Eclair don't
seem to turn off gossip at all, but that's OK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is a side-effect of fixing aging: sometimes, we age our
rcvd_filter cache too fast, and thus re-transmit. This breaks
our test, since it used dev-disconnect on the channel_announcement,
but that closes the connection to l3, not l1!
```
> assert l1.rpc.listchannels()['channels'] == []
E AssertionError: assert [{'active': T...ags': 1, ...}] == []
E Left contains 2 more items, first extra item: {'active': True, 'amount_msat': 100000000msat, 'base_fee_millisatoshi': 1, 'channel_flags': 0, ...}
```
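The two-generation aging scheme, sketched with a hypothetical set type: entries survive at least one aging pass and at most two, so aging at the wrong cadence makes them vanish early.
```
#include <stdbool.h>

/* Hypothetical set API, for illustration. */
struct set;
extern bool set_contains(const struct set *s, const void *key);
extern void set_clear(struct set *s);

struct gen_cache {
	struct set *cur, *old;
};

static bool seen_recently(const struct gen_cache *c, const void *key)
{
	return set_contains(c->cur, key) || set_contains(c->old, key);
}

/* Age too often and entries expire early, causing the spurious
 * re-transmission this commit's test tripped over. */
static void age_cache(struct gen_cache *c)
{
	struct set *tmp = c->old;

	c->old = c->cur;
	set_clear(tmp);
	c->cur = tmp;
}
```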
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: #5403
Got complaints about us hanging up on some nodes because they don't respond
to pings in a timely manner (e.g. ACINQ?), but that turned out to be something
else.
Nonetheless, we've had reports in the past of LND badly prioritizing gossip
traffic, and thus important messages can get queued behind gossip dumps!
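One plausible shape for the change (numbers illustrative, not the actual constants):
```
#include <stdint.h>

#define PING_TIMEOUT_BASE_SEC 15

/* A peer busy draining a large gossip dump may honestly take a while
 * to answer a ping, so scale the deadline with how much we still have
 * queued for it instead of hanging up after a fixed interval. */
static uint32_t ping_timeout_sec(uint64_t bytes_queued_to_peer)
{
	return PING_TIMEOUT_BASE_SEC + bytes_queued_to_peer / 4096;
}
```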
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: connectd: give busy peers more time to respond to pings.