Update the lightningd <-> channeld interface with lots of new commands needed to facilitate splicing.
Implement the channeld splicing protocol leveraging the interactivetx protocol.
Implement lightningd’s channel_control to support channeld in its splicing efforts.
Changelog-Added: Added features to enable splicing & resizing of active channels.
Fixes: #6368
Changelog-Fixed: Protocol: we no longer gossip about recently-closed channels (Eclair gets upset with this).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Otherwise we will access the freed connection to gossipd. This is weird to track
down when the *actual* issue is that gossipd died!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I never really liked this hack: websockets are useful, advertising
them not so much.
Note that we never actually documented that we would advertise these!
Changelog-EXPERIMENTAL: Protocol: Removed support for advertising websocket addresses in gossip.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1. Make it the standard "return the error" pattern.
2. Rather than flags to indicate what types are allowed, have the callers
check the return explicitly.
3. Document the APIs.
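A hedged sketch of the resulting pattern, with invented names (the real API surely differs in detail): the function returns an error string (or NULL on success), and callers check the returned type explicitly rather than passing allowed-type flags.
```c
/* Illustrative only: the "return the error" shape. */
char *err = parse_address(tmpctx, arg, &addr);
if (err)
	return err;
/* Callers now check allowed types explicitly, no flag arguments: */
if (addr.type == ADDR_DNS)
	return tal_fmt(tmpctx, "%s: DNS addresses not allowed here", arg);
```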
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This contained cut & paste code, and it wasn't clear to me that
the first loop included DNS entries with IPv6 entries.
Instead, allow the iterator to take multiple types, and use
a switch statement so compilation will break as new types are added.
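A hedged sketch of the resulting shape (types simplified, not the real wireaddr API): the iterator yields entries matching a mask of types, and each caller switches on what it gets.
```c
#include <stdio.h>

enum addr_type { ADDR_IPV4, ADDR_IPV6, ADDR_DNS };
struct addr { enum addr_type type; };

/* No `default` case: building with -Werror=switch breaks as soon
 * as a new enum addr_type member is added but not handled here. */
static void handle_addr(const struct addr *a)
{
	switch (a->type) {
	case ADDR_IPV4:
	case ADDR_IPV6:
		printf("ip entry\n");
		break;
	case ADDR_DNS:
		printf("dns entry\n");
		break;
	}
}
```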
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
After the first iteration of the loop, we call memmem with a buflen that
points past the end of buf.
In practice we probably never read the uninitialized memory since we
guarantee the buffer ends with "\r\n", and since most/all libc
implementations probably read the haystack sequentially. But maybe
there's some libc with a crazy optimization out there. It's good to use
an accurate buflen just in case.
Discovered this while running some unit tests with MSan.
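A minimal sketch of the bug pattern (not the actual parser): the buggy version passed the original `buflen` to memmem() on every iteration, so the haystack extended past the end of `buf` once `p` advanced.
```c
#define _GNU_SOURCE /* memmem */
#include <stdio.h>
#include <string.h>

/* The remaining length must shrink as `p` advances; passing the
 * original buflen every iteration makes memmem() scan past the
 * end of buf after the first line is consumed. */
static void scan_lines(const char *buf, size_t buflen)
{
	const char *p = buf, *crlf;

	while ((crlf = memmem(p, buflen - (p - buf), "\r\n", 2)) != NULL) {
		printf("line: %.*s\n", (int)(crlf - p), p);
		p = crlf + 2;
	}
}
```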
The push bit was convenient for connectd to send our own gossip
to peers upon connecting by naively traversing the gossip_store
and sending anything flagged `push`. This function is now
performed by gossipd, leaving no use for the push bit.
Changelog-Changed: `gossipd`: gossip_store PUSH bit is no longer set.
This implements the proposal to simply use timestamp as "all", "none"
or "stream". There's also a rough spec draft which I will post soon.
This *also* removes the last place where we would sometimes sweep the
entire gossip_store looking for their given timestamps.
We could also get rid of the actual timestamp filtering logic in
gossip_store_next if we want to, as it's now basically unused.
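A hedged sketch of the classification (enum and names invented; the 0xFFFFFFFF convention matches the LND note later in this series):
```c
#include <stdint.h>

enum gossip_stream_mode {
	GOSSIP_NONE,   /* 0xFFFFFFFF: send no gossip */
	GOSSIP_ALL,    /* 0: dump the entire store */
	GOSSIP_RECENT, /* anything else: stream only fresh gossip */
};

/* The peer's first_timestamp now selects one of three behaviours,
 * instead of honoring arbitrary timestamp ranges. */
static enum gossip_stream_mode filter_mode(uint32_t first_timestamp)
{
	if (first_timestamp == 0xFFFFFFFF)
		return GOSSIP_NONE;
	if (first_timestamp == 0)
		return GOSSIP_ALL;
	return GOSSIP_RECENT;
}
```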
Changelog-Changed: Protocol: Simplify gossip_timestamp_filter handling to "all", "none" or "recent" instead of exact timestamp.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This removes the logic that sweeps the entire gossip_store as soon as a peer connects. This should save
connectd a significant number of CPU cycles and make @whitslack finally
stop hitting me.
Changelog-Changed: `connectd` no longer sweeps gossip_store file when peer connects, saving CPU for large nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We accept that we will fail to listen if we bind both IPv6 and IPv4 to
the same port on a dual-stack machine (e.g. normal Linux), but we weren't
closing the fd.
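A minimal sketch of the fix, assuming the usual dual-stack pattern (helper name invented):
```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* On a dual-stack host, binding IPv4 to a port the IPv6 wildcard
 * socket already covers fails with EADDRINUSE; that's tolerated,
 * but the fd must still be closed. */
static int try_bind_ipv4(uint16_t port)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
		close(fd);	/* previously leaked on this expected failure */
		return -1;
	}
	return fd;
}
```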
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Most of this is piping the flag through so we know it's a websocket!
Reported-by: @ShahanaFarooqui
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This changes connectd to use `status_fail()` on Tor problems during startup
instead of `err()`. Using `err()` did not write to the logfile.
To find out about Tor problems during startup, the user needed to stop the
system daemon and call `lightningd` manually in the console to see the error.
`status_fail()` logs and exits, but also prints a whole stacktrace,
which is a bit too much imho for config errors. But currently there is
no `status_SOMETHING` method that logs, prints and exits on an error
without a stacktrace.
Changelog-None
ccan/io stores the context pointer for io_new_conn, but we were using
`daemon->listeners` which we reallocate, so it can use a stale pointer.
```
0x3e1700 call_error
ccan/ccan/tal/tal.c:93
0x3e1700 check_bounds
ccan/ccan/tal/tal.c:165
0x3e1700 to_tal_hdr
ccan/ccan/tal/tal.c:174
0x3e1211 to_tal_hdr_or_null
ccan/ccan/tal/tal.c:186
0x3e1211 tal_alloc_
ccan/ccan/tal/tal.c:426
0x3db8f4 io_new_conn_
ccan/ccan/io/io.c:91
0x3dd2e1 accept_conn
ccan/ccan/io/poll.c:277
0x3dd2e1 io_loop
ccan/ccan/io/poll.c:444
0x3419fa main
connectd/connectd.c:2081
```
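The shape of the bug, reduced to plain C (illustrative, plain malloc rather than ccan/tal; the fix is to hand io_new_conn a tal context that is never reallocated):
```c
#include <stdlib.h>

/* A pointer kept across a realloc of its container dangles. The bug
 * had the same shape: io_new_conn() stored daemon->listeners as its
 * tal context, then the listeners array was reallocated, leaving
 * tal's bounds checks reading freed memory. */
int main(void)
{
	char **arr = malloc(2 * sizeof(*arr));
	char **stored = arr;	/* like the saved tal context */
	char **grown = realloc(arr, 1024 * sizeof(*arr)); /* like tal_resize() */

	/* `stored` (and `arr`) may now dangle; any use is undefined
	 * behaviour -- exactly what check_bounds() tripped over. */
	(void)stored;
	free(grown ? grown : arr);
	return 0;
}
```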
Fixes: #6060
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The `tmpctx` is free'd before the error is read out/sent over the wire;
there's a call that will copy the array before sending it, so let's use
that instead and take() the object.
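A rough sketch of the take() shape being suggested, with a hypothetical sending helper (only `towire_warningfmt()` and `take()` are real names here):
```c
/* tmpctx-owned message: cleaned every io_loop() iteration. */
u8 *msg = towire_warningfmt(tmpctx, NULL, "some error");

/* Because of take(), the (hypothetical) helper copies or steals the
 * message up front, so clean_tmpctx() can't invalidate the bytes
 * while the write is still pending. */
send_msg_to_peer(conn, take(msg));
```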
```
------------------------------- Valgrind errors --------------------------------
Valgrind error file: valgrind-errors.2181501
==2181501== Syscall param write(buf) points to unaddressable byte(s)
==2181501== at 0x49E4077: write (write.c:26)
==2181501== by 0x1C79A3: do_write (io.c:189)
==2181501== by 0x1C80AB: do_plan (io.c:394)
==2181501== by 0x1C81BA: io_ready (io.c:423)
==2181501== by 0x1CA45B: io_loop (poll.c:453)
==2181501== by 0x118593: main (connectd.c:2053)
==2181501== Address 0x4afb158 is 40 bytes inside a block of size 140 free'd
==2181501== at 0x483F0C3: free (vg_replace_malloc.c:872)
==2181501== by 0x1D103C: del_tree (tal.c:421)
==2181501== by 0x1D130A: tal_free (tal.c:486)
==2181501== by 0x1364B8: clean_tmpctx (utils.c:172)
==2181501== by 0x1266DD: daemon_poll (daemon.c:87)
==2181501== by 0x1CA334: io_loop (poll.c:420)
==2181501== by 0x118593: main (connectd.c:2053)
==2181501== Block was alloc'd at
==2181501== at 0x483C855: malloc (vg_replace_malloc.c:381)
==2181501== by 0x1D0AC5: allocate (tal.c:250)
==2181501== by 0x1D1086: tal_alloc_ (tal.c:428)
==2181501== by 0x1D124F: tal_alloc_arr_ (tal.c:471)
==2181501== by 0x126204: cryptomsg_encrypt_msg (cryptomsg.c:161)
==2181501== by 0x11335F: peer_connected (connectd.c:318)
==2181501== by 0x118A8A: peer_init_received (peer_exchange_initmsg.c:135)
==2181501== by 0x1C751E: next_plan (io.c:59)
==2181501== by 0x1C8126: do_plan (io.c:407)
==2181501== by 0x1C8168: io_ready (io.c:417)
==2181501== by 0x1CA45B: io_loop (poll.c:453)
==2181501== by 0x118593: main (connectd.c:2053)
==2181501==
{
<insert_a_suppression_name_here>
Memcheck:Param
write(buf)
fun:write
fun:do_write
fun:do_plan
fun:io_ready
fun:io_loop
fun:main
}
--------------------------------------------------------------------------------
```
We were stressing the DNS servers if a node cannot be found. Only do the
lookup on manual connect commands.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: lightningd: Only use DNS server address lookup on manual `connect` commands, not normal reconnection attempts.
We had a scheme where lightningd itself would put a per-node secret in
the blinded path, then we'd tell the caller when it was used. Then it
simply checks the alias to determine if the correct path was used.
But this doesn't work when we start to offer multiple blinded paths.
So go for a far simpler scheme, where the secret is generated (and
stored) by the caller, and hand it back to them.
We keep the split "with secret" or "without secret" API, since I'm
sure callers who don't care about the secret won't check that it
doesn't exist! And without that, someone can use a blinded path for a
different message and get a response which may reveal the node.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Instead of open coding in connectd/onion_message, we move it to common
with a nice API.
This lets us process the BOLT test vectors.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Mainly, field name changes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: Protocol: Support for forwarding blinded payments (as per latest draft)
Add memleak_ignore_children() so callers can do exclusions themselves.
Having two exclusions was always such a hack!
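A hedged usage sketch, assuming a signature like `memleak_ignore_children(struct htable *memtable, const void *p)` (the exclusion target is illustrative):
```c
static void memleak_daemon_cb(struct htable *memtable, struct daemon *daemon)
{
	/* Peers are reachable via our own tables, not leaks; exclude
	 * them here instead of the old hard-coded two exclusions. */
	memleak_ignore_children(memtable, daemon->peers);
}
```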
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This allows GDB to print values, but also allows us to use them in
'case' statements. This wasn't allowed before because they're not
constant terms.
This also made it clear there's a clash between two error codes,
so move one.
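A sketch of why the enum form helps (names and values illustrative): a `static const` integer is not a constant expression in C, so it cannot appear as a case label, while an enum member can, and GDB prints it symbolically.
```c
/* Before (cannot be a case label):
 *	static const errcode_t FUNDING_PEER_NOT_CONNECTED = 301;
 * After: */
enum jsonrpc_errcode {
	FUNDING_MAX_EXCEEDED = 300,
	FUNDING_PEER_NOT_CONNECTED = 301,
};

static const char *describe(enum jsonrpc_errcode code)
{
	switch (code) {
	case FUNDING_MAX_EXCEEDED:
		return "amount too large";
	case FUNDING_PEER_NOT_CONNECTED:
		return "peer not connected";
	}
	return "other";
}
```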
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: JSON-RPC: Error code from bcli plugin changed from 400 to 500.
This contains the zeroconf stuff, with funding_locked renamed to
channel_ready. I change that everywhere, and try to fix up the
comments.
Also the `alias` field is called `short_channel_id`.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Protocol: `funding_locked` is now called `channel_ready` as per latest BOLTs.
We used to tell connectd to remember our connect delay, and hand it
back (increased if necessary).
Instead, simply record when we last tried to connect. If it was less
than 10 minutes ago, double delay (up to 5 minutes max), otherwise
reset delay to 1 second.
This covers all scenarios: whether we reconnect then immediately
disconnect, or never successfully connect, it doesn't matter.
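The rule, as a hedged sketch (names and exact comparisons invented):
```c
#include <stdint.h>
#include <time.h>

#define INITIAL_WAIT_SECONDS	1
#define MAX_WAIT_SECONDS	(60 * 5)
#define RESET_AFTER_SECONDS	(60 * 10)

/* Double the delay on rapid retries, cap at 5 minutes, and reset
 * to 1 second once attempts are more than 10 minutes apart. */
static uint32_t next_connect_delay(uint32_t prev_delay,
				   time_t last_attempt, time_t now)
{
	if (now - last_attempt >= RESET_AFTER_SECONDS)
		return INITIAL_WAIT_SECONDS;
	if (prev_delay >= MAX_WAIT_SECONDS / 2)
		return MAX_WAIT_SECONDS;
	return prev_delay * 2;
}
```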
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: #5453
Don't log_io final messages twice (multiplex_final_message already does
this, so it's confusing to see us send e.g. WIRE_ERROR twice!).
And report that the peer has failed to connect out *before* telling
lightningd, otherwise we get a very confusing ordering, e.g.:
```
2022-07-23T05:17:36.096Z DEBUG 027d0de66d08f956a8d606c0d1c34e59bda38c05a3b1cc738fdd6378716c644997-lightningd: Reconnecting in 4 seconds
2022-07-23T05:17:36.096Z DEBUG 027d0de66d08f956a8d606c0d1c34e59bda38c05a3b1cc738fdd6378716c644997-lightningd: Will try reconnect in 4 seconds
2022-07-23T05:17:36.096Z DEBUG 027d0de66d08f956a8d606c0d1c34e59bda38c05a3b1cc738fdd6378716c644997-connectd: Failed connected out:
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Connectd already does this when we *receive* an error or warning, but
now do it on send. This causes some slight behavior change: we don't
disconnect when we close a channel, for example (our behaviour here
has been inconsistent across versions, depending on the code).
When connectd is told to disconnect, it now does so immediately, and
doesn't wait for subds to drain etc. That simplifies the manual
disconnect case, which now cleans up as it would from any other
disconnection when connectd says it's disconnected.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We want to avoid lost messages in the common cases.
This generalizes our drain code, by giving the subds each 5 seconds to
close themselves, but continue to allow them to send us traffic (if
peer is still connected) and continue to send them traffic.
We continue to send traffic *out* to the peer (if it's still
connected), until all subds are gone. We still have a 5 second timer
to close the connection to peer.
On reconnects, we don't do this "drain period": we kill immediately.
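A hedged sketch of the policy with invented types (not connectd's actual structures): two things end the grace period, all subds exiting or the 5-second timer.
```c
#include <stdbool.h>
#include <stddef.h>

#define DRAIN_GRACE_SECONDS 5

struct peer {
	bool draining;
	size_t num_subds;
};

static void close_peer_conn(struct peer *peer);

/* Called when the peer must go: keep relaying both ways, but start
 * the 5-second clock (a real implementation arms a timer calling
 * close_peer_conn() here). */
static void start_draining(struct peer *peer)
{
	peer->draining = true;
}

/* Called as each subd closes: once all are gone, close early. */
static void subd_gone(struct peer *peer)
{
	peer->num_subds--;
	if (peer->draining && peer->num_subds == 0)
		close_peer_conn(peer);
}
```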
We fix up one test which was looking for the "disconnect" message
explicitly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A subtle case I hadn't come across before: if a child tal_resizes()
its parent while the parent is being deleted, tal gets confused.
The subd destructor does this using tal_arr_remove() on peer->subds,
which is currently being freed:
```
==61056== Invalid read of size 8
==61056== at 0x185632: del_tree (tal.c:417)
==61056== by 0x18560D: del_tree (tal.c:412)
==61056== by 0x185957: tal_free (tal.c:486)
==61056== by 0x1183BC: peer_discard (connectd.c:1861)
==61056== by 0x11869E: recv_req (connectd.c:1942)
==61056== by 0x12774B: handle_read (daemon_conn.c:35)
==61056== by 0x173453: next_plan (io.c:59)
==61056== by 0x17405B: do_plan (io.c:407)
==61056== by 0x17409D: io_ready (io.c:417)
==61056== by 0x176390: io_loop (poll.c:453)
==61056== by 0x118A68: main (connectd.c:2082)
==61056== Address 0x4bd8850 is 16 bytes inside a block of size 48 free'd
==61056== at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==61056== by 0x1860E6: tal_resize_ (tal.c:699)
==61056== by 0x1373DD: tal_arr_remove_ (utils.c:184)
==61056== by 0x11D508: destroy_subd (multiplex.c:930)
==61056== by 0x1850A4: notify (tal.c:240)
==61056== by 0x1855BB: del_tree (tal.c:402)
==61056== by 0x18560D: del_tree (tal.c:412)
==61056== by 0x18560D: del_tree (tal.c:412)
==61056== by 0x185957: tal_free (tal.c:486)
==61056== by 0x1183BC: peer_discard (connectd.c:1861)
==61056== by 0x11869E: recv_req (connectd.c:1942)
==61056== by 0x12774B: handle_read (daemon_conn.c:35)
```
So simply make the subds children of `peer` not the `peer->subds`
array. The only effect is that drain_peer() can't simply free the
subds array but must free the subds one at a time.
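A hedged sketch of the new parenting (types simplified; `destroy_subd` as in the trace above):
```c
/* Subds are now children of `peer` itself, so a subd destructor
 * that calls tal_arr_remove() never mutates an array that tal is
 * in the middle of freeing. */
static struct subd *new_subd(struct peer *peer)
{
	struct subd *subd = tal(peer, struct subd); /* was: tal(peer->subds, ...) */

	tal_arr_expand(&peer->subds, subd);
	tal_add_destructor(subd, destroy_subd);
	return subd;
}
```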
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This allows us to detect when lightningd hasn't seen our latest
disconnect/reconnect; in particular, we would hit the following pattern:
1. lightningd says to connect a subd.
2. connectd disconnects and reconnects.
3. connectd reads message, connects subd.
4. lightningd reads disconnect and reconnect, sends msg to connect to subd again.
5. connectd asserts because subd is already connected.
This way connectd can tell if lightningd is talking about the previous
connection, and ignore it.
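A hedged sketch of the check (field and function names invented): every new connection bumps a counter that both daemons echo in their messages.
```c
static void handle_peer_connect_subd(struct peer *peer,
				     uint64_t counter, int fd)
{
	if (counter != peer->connectd_counter) {
		close(fd);
		return;	/* lightningd meant the *previous* connection */
	}
	/* ...safe to attach the subd fd to this connection... */
}
```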
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Before this patch:
1. connectd says it's connected (peer_connected)
2. we tell connectd we want to talk about each channel (peer_make_active)
3. connectd gives us an fd for each channel, and we connect it to a subd (peer_active)
4. OR, connectd says it sent something about a channel we didn't tell it about, with an fd (peer_active)
Now:
1. connectd says it's connected (peer_connected)
2. we start all appropriate subds and tell connectd which channels/fds to use (peer_connect_subd).
3. if connectd says it sent something about a channel we didn't tell it about, we either tell
it to hang up (peer_final_msg), or connect a new opening daemon (peer_connect_subd).
This is the minimal-size patch, which is why we create socket pairs in
so many places to use the existing functions. Many cleanups are
possible, since the new flow is so simple.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't have to put aside a peer which is reconnecting and wait for
lightningd to remove the old peer, we can now simply free the old
and add the new.
Fixes: #5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Sending any pending messages to the peer before hanging up is a courtesy:
give it 5 seconds before simply closing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that we have separate peer draining logic, we can simply use it when
connectd tells us to release the peer, without waiting. (We could
simply free the peer, but that's a bit rude, as messages can get
lost).
This removes various complex flags and logic we had before.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: `connectd`: various crashes and issues fixed by simplification and rewrite.
This removes it from the hashtable, and forces it to do nothing but
send out any remaining packets, then close.
It is, in effect, reduced to a stub, with no further interactions
with the rest of the system (all subds are freed already).
This also removes the need for an explicit "final_msg".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This came out in a later patch: freeing the peer->subds doesn't actually
free the subds, because they're reparented onto subd->conn, which is
a child of peer itself.
This breaks because when the peer is finally freed, destroy_subd is
called, and expects to find itself in peer->subds (but we made that
NULL when we manually freed it!).
Fix this, and make it obvious that we tal_steal it.
```
lightning_connectd: FATAL SIGNAL 11 (version v0.11.0.1-25-gbf025aa-modded)
0x55de2a1b8b94 send_backtrace
common/daemon.c:33
0x55de2a1b8c3e crashdump
common/daemon.c:46
0x7fe2be2fc08f ???
/build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x55de2a1af41e destroy_subd
connectd/multiplex.c:1119
0x55de2a217686 notify
ccan/ccan/tal/tal.c:240
0x55de2a217b9d del_tree
ccan/ccan/tal/tal.c:402
0x55de2a217bef del_tree
ccan/ccan/tal/tal.c:412
0x55de2a217bef del_tree
ccan/ccan/tal/tal.c:412
0x55de2a217f39 tal_free
ccan/ccan/tal/tal.c:486
0x55de2a1aa116 peer_discard
connectd/connectd.c:1834
0x55de2a1aa38d recv_req
connectd/connectd.c:1903
0x55de2a1b9121 handle_read
common/daemon_conn.c:31
0x55de2a205a35 next_plan
ccan/ccan/io/io.c:59
0x55de2a20663d do_plan
ccan/ccan/io/io.c:407
0x55de2a20667f io_ready
ccan/ccan/io/io.c:417
0x55de2a208972 io_loop
ccan/ccan/io/poll.c:453
0x55de2a1aa736 main
connectd/connectd.c:2042
0x7fe2be2dd082 __libc_start_main
../csu/libc-start.c:308
0x55de2a1a085d ???
???:0
0xffffffffffffffff ???
???:0
```
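A hedged sketch of the explicit ownership transfer (arguments simplified):
```c
/* The subd's lifetime follows its conn (itself a child of peer);
 * saying so explicitly beats relying on an implicit reparent: */
subd->conn = io_new_conn(peer, fd, subd_conn_init, subd);
tal_steal(subd->conn, subd);
```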
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LND and we send 0xFFFFFFFF to turn off gossip. LDK and Eclair don't
seem to turn off gossip at all, but that's OK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is a side-effect of fixing aging: sometimes, we age our
rcvd_filter cache too fast, and thus re-xmit. This breaks
our test, since it used dev-disconnect on the channel_announce,
but that closes the connection to l3, not l1!
```
> assert l1.rpc.listchannels()['channels'] == []
E AssertionError: assert [{'active': T...ags': 1, ...}] == []
E Left contains 2 more items, first extra item: {'active': True, 'amount_msat': 100000000msat, 'base_fee_millisatoshi': 1, 'channel_flags': 0, ...}
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: #5403
Got complaints about us hanging up on some nodes because they don't respond
to pings in a timely manner (e.g. ACINQ?), but that turned out to be something
else.
Nonetheless, we've had reports in the past of LND badly prioritizing gossip
traffic, and thus important messages can get queued behind gossip dumps!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: connectd: give busy peers more time to respond to pings.
This will be used to decouple internal use of gossip from what is
passed to gossip peers. Updates GOSSIP_STORE_VERSION to 10.
Changelog-Changed: gossip_store updated to version 10.
```
VALGRIND=1 valgrind -q --error-exitcode=7 --track-origins=yes --leak-check=full --show-reachable=yes --errors-for-leak-kinds=all connectd/test/run-netaddress > /dev/null
==2483395== 16 bytes in 1 blocks are still reachable in loss record 1 of 15
==2483395== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2483395== by 0x10D59A: autodata_register_ (autodata.c:20)
==2483395== by 0x10EB26: register_autotype_type_to_string (type_to_string.h:77)
==2483395== by 0x10EB6B: register_one_type_to_string0 (type_to_string.c:8)
==2483395== by 0x188C0C: __libc_csu_init (in /home/rusty/devel/cvs/lightning/connectd/test/run-netaddress)
==2483395== by 0x4A3A00F: (below main) (libc-start.c:264)
==2483395==
==2483395== 40 bytes in 1 blocks are still reachable in loss record 2 of 15
==2483395== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
...
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This was fixed in 1c495ca5a8 ("connectd:
fix accidental handling of old reconnections.") and then reverted by
the rework in "connectd: avoid use-after-free upon multiple
reconnections by a peer".
The latter made the race much less likely, since we cleaned up the
reconnecting struct once the connection was hung up by the remote
node, but it's still theoretically possible.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
`peer_reconnected` was freeing a `struct peer_reconnected` instance
while a pointer to that instance was registered to be passed as an
argument to the `retry_peer_connected` callback function. This caused a
use-after-free crash when `retry_peer_connected` attempted to reparent
the instance to the temporary context.
Instead, never have `peer_reconnected` free a `struct peer_reconnected`
instance, and only ever allow such an instance to be freed after the
`retry_peer_connected` callback has finished with it. To ensure that the
instance is freed even if the connection is closed before the callback
can be invoked, parent the instance to the connection rather than to the
daemon.
Absent the need to free `struct peer_reconnected` instances outside of
the `retry_peer_connected` callback, there is no use for the
`reconnected` hashtable, so remove it as well.
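A hedged sketch of the two ownership points (simplified):
```c
/* 1. On reconnect, the state is a child of the conn, so it dies
 *    with the connection if the callback never gets to run: */
pr = tal(conn, struct peer_reconnected);

/* 2. Inside retry_peer_connected(), ownership moves to the
 *    temporary context, so pr is freed only after the callback
 *    has finished with it: */
tal_steal(tmpctx, pr);
```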
See: https://github.com/ElementsProject/lightning/issues/5282#issuecomment-1141454255
Fixes: #5282
Fixes: #5284
Changelog-Fixed: connectd no longer crashes when peers reconnect.
The crash below from @zerofeerouting left me confused. The invalid
value in fmt_wireaddr_internal is a telltale sign of use-after-free.
This backtrace shows us destroying the conn *twice*: what's happening?
Well, tal carefully protects against destroying twice: it's not that
unusual to free something in a destructor which has already been freed.
So this indicates that there are *two* io_conn hanging off one
struct connecting, which isn't supposed to happen! We deliberately
call try_connect_one_addr() initially, then inside the io_conn destructor.
But due to races in connectd vs lightningd connection state, we added
a fix which allows a connect command to sit around while the peer is
cleaning up (6cc9f37cab) and get fired
off when it's done.
But what if, in the chaos, we are already connecting again? Now we'll
end up with *two* connections.
Fortunately, we have a `conn` pointer inside struct connecting, which
(with a bit of additional care) we can ensure is only non-NULL while
we're actually trying to connect. This lets us check that before
firing off a new connection attempt in peer_conn_closed.
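A hedged sketch of the guard (`try_connect_one_addr` appears in the trace below; the lookup helper and exact logic are assumed):
```c
static void peer_conn_closed(struct peer *peer)
{
	struct connecting *connect = find_connecting(peer->daemon, &peer->id);

	/* `conn` is non-NULL only while an attempt is in flight: if
	 * one is, don't fire a second io_conn at the same struct
	 * connecting. */
	if (connect && connect->conn == NULL)
		try_connect_one_addr(connect);
}
```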
```
lightning_connectd: FATAL SIGNAL 6 (version v0.11.2rc2-2-g8f7e939)
0x5614a4915ae8 send_backtrace
common/daemon.c:33
0x5614a4915b72 crashdump
common/daemon.c:46
0x7ffa14fcd72f ???
???:0
0x7ffa14dc87bb ???
???:0
0x7ffa14db3534 ???
???:0
0x5614a491fc71 fmt_wireaddr_internal
common/wireaddr.c:255
0x5614a491fc7a fmt_wireaddr_internal_
common/wireaddr.c:257
0x5614a491ea6b type_to_string_
common/type_to_string.c:32
0x5614a490beaa destroy_io_conn
connectd/connectd.c:754
0x5614a494a2f1 destroy_conn
ccan/ccan/io/poll.c:246
0x5614a494a313 destroy_conn_close_fd
ccan/ccan/io/poll.c:252
0x5614a4953804 notify
ccan/ccan/tal/tal.c:240
0x5614a49538d6 del_tree
ccan/ccan/tal/tal.c:402
0x5614a4953928 del_tree
ccan/ccan/tal/tal.c:412
0x5614a4953e07 tal_free
ccan/ccan/tal/tal.c:486
0x5614a4908b7a try_connect_one_addr
connectd/connectd.c:870
0x5614a490bef1 destroy_io_conn
connectd/connectd.c:759
0x5614a494a2f1 destroy_conn
ccan/ccan/io/poll.c:246
0x5614a494a313 destroy_conn_close_fd
ccan/ccan/io/poll.c:252
0x5614a4953804 notify
ccan/ccan/tal/tal.c:240
0x5614a49538d6 del_tree
ccan/ccan/tal/tal.c:402
0x5614a4953e07 tal_free
ccan/ccan/tal/tal.c:486
0x5614a4948f08 io_close
ccan/ccan/io/io.c:450
0x5614a4948f59 do_plan
ccan/ccan/io/io.c:401
0x5614a4948fe1 io_ready
ccan/ccan/io/io.c:417
0x5614a494a8e6 io_loop
ccan/ccan/io/poll.c:453
0x5614a490c12f main
connectd/connectd.c:2164
0x7ffa14db509a ???
???:0
0x5614a4904e99 ???
???:0
0xffffffffffffffff ???
???:0
```
Fixes: #5339
Changelog-Fixed: connectd: occasional crash when we reconnect to a peer quickly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It doesn't get the right errno, and it says "create" not "bind".
```
2022-05-20T03:04:46.498Z DEBUG connectd: Failed to create 2 socket: Success
2022-05-20T03:04:46.500Z DEBUG connectd: REPLY WIRE_CONNECTD_INIT_REPLY with 0 fds
2022-05-20T03:04:46.501Z DEBUG connectd: connectd_init_done
2022-05-20T03:04:46.503Z **BROKEN** connectd: Failed to bind socket for 127.0.0.1:37871: Address already in use
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Before this fix, a DEVELOPER=1 node would announce non-public addresses
on mainnet if detected. Since there are some nodes on the internet that
falsely report local addresses, we move this 'testing feature' to
'dev-allow-localhost' nodes.
Changelog-None
Instead of doing an allocation per entry, put the entry in directly.
This means only 30 bit resolution on 32-bit machines, but if a bit
of gossip gets accidentally suppressed that's ok.
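A hedged sketch of storing the entry directly (hash choice illustrative, not the real one):
```c
#include <stddef.h>
#include <stdint.h>

/* Store the hash value itself as the set entry instead of
 * tal-allocating a copy per message. uintptr_t keeps it
 * pointer-sized, hence only ~30 usable bits on 32-bit machines; a
 * rare false positive just suppresses a little gossip. */
static uintptr_t msg_key(const uint8_t *msg, size_t len)
{
	uint64_t h = 14695981039346656037ULL;	/* FNV-1a, illustrative */

	for (size_t i = 0; i < len; i++)
		h = (h ^ msg[i]) * 1099511628211ULL;
	return (uintptr_t)h;
}
```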
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we moved gossip filtering to connectd, this aging got lost.
Without this, we hit the 10,000 entry limit before expiring full
gossip anti-echo cache. This is under 1M in allocations per peer, but
in DEVELOPER mode each allocation includes adds 3 notifiers (32 bytes
each) and a backtrace child (40 + 40 + 256 bytes), making it almost
10MB per peer, plus allocation overhead.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: connectd: large memory usage with many peers fixed.
Per BIP-0171, the signature map is of pubkey to "The signature as would
be pushed to the stack from a scriptSig or witness".
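A sketch of the required encoding (buffer handling illustrative): the map value is the DER signature with the sighash byte appended, i.e. exactly the bytes a scriptSig/witness would push, not the bare DER signature.
```c
uint8_t *sig_push = tal_arr(ctx, uint8_t, der_len + 1);
memcpy(sig_push, der_sig, der_len);
sig_push[der_len] = SIGHASH_ALL;	/* e.g. 0x01 */
```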
Fixes: #5298
Changelog-Fixed: PSBT: Fix signature encoding to comply with BIP-0171.
Signed-off-by: Jon Griffiths <jon_p_griffiths@yahoo.com>
Now that it's formatted properly, we don't need the patch.
But we need to explicitly marshal/unmarshal into a byte stream,
which involves some code rearrangement.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I have a test which reproduces this, too, and it's been seen in the
wild. It seems we can add a subd as we're closing, which causes
this assert to trigger.
Fixes: #5254
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We had multiple reports of channels being unilaterally closed because
it seemed like the peer was sending old revocation numbers.
Turns out, it was actually old reestablish messages! When we have a
reconnection, we would put the new connection aside, and tell lightningd
to close the current connection: when it did, we would restart
processing of the initial reconnection.
However, we could end up with *multiple* "reconnecting" connections,
while waiting for an existing connection to close. Though the
connections were long gone, there could still be messages queued
(particularly the channel_reestablish message, which comes early on).
Eventually, a normal reconnection would cause us to process one of
these reconnecting connections, and channeld would see the (perhaps
very old!) messages, and get confused.
(I have a test which triggers this, but it also hangs the connect
command, due to other issues we will fix in the next release...)
Fixes: #5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This seems to prevent broad propagation, due to LND not allowing it. See
https://github.com/lightningnetwork/lnd/issues/6432
We still announce it if you disable deprecated-apis, so tests still work,
and hopefully we can enable it in future.
Fixes: #5196
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: Protocol: disabled websocket announcement due to LND propagation issues
We seem to have made node_announcement propagation *worse*, not
better. Explorers don't see my node's updates.
At least some LND nodes never send us timestamp_filter, so we never
actually stream *any* gossip. We should send gossip about
ourselves, even if they haven't set a filter (yet).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we more aggressively send our own gossip, to improve propagation chances.
I was seeing a strange crash:
Connectd gave bad CONNECT_PEER_CONNECTED message
The message is indeed mangled, around the remote_addr!
A quick review of the code revealed that we were not making a copy
when it was a reconnect, and so the remote_addr pointer was pointing
to memory which was freed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Mostly comments and docs: some places are actually paths, which
I have avoided changing. We may migrate them slowly, particularly
when they're user-visible.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We do this (send warnings) in almost all cases anyway, so mainly this
is a textual update, but there are some changes:
1. Send ERROR not WARNING if they send a malformed commitment secret.
2. Send WARNING not ERROR if they get the shutdown_scriptpubkey wrong (vs upfront)
3. Send WARNING not ERROR if they send a bad shutdown_scriptpubkey (e.g. p2pkh in future)
4. Rename some vars 'err' to 'warn' to make it clear we send a warning.
This means test_option_upfront_shutdown_script can be made reliable, too,
and it now warns and doesn't automatically close the channel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Gossipd didn't actually suppress all gossip, resulting in a flake!
Doing it in connectd now makes much more sense.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I removed these prematurely: we *haven't* had a release since
introducing them!
This consists of reverting d15d629b8b ("plugins/fetchinvoice:
remove obsolete string-based API.") and the commit "onion_messages:
remove obs2 support.".
Some minor changes due to updated fromwire_tlv API since they
were removed, but not much.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: REVERT: Removed backwards compat with onion messages from v0.10.1.
Requiring the caller to allocate them is ugly, and differs from
other types.
This means we need a context arg if we don't have one already.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>