Took me a while (stressing under valgrind) to reproduce this,
then longer to figure out how it happened.
Turns out io_new_conn() can fail if the init function fails.
In our case, this can happen if connect() immediately returns
an error (inside io_connect). But we've already set the finish
function, which (if this was the last address) will free connect,
making the assignment `connect->conn = ...` write to a freed address.
Either way, if it fails, try_connect_one_addr() has taken care to
update connect->conn or to free connect, and the caller should not do it again.
Here's the valgrind trace:
```
==384981== Invalid write of size 8
==384981== at 0x11127C: try_connect_one_addr (connectd.c:880)
==384981== by 0x112BA1: destroy_io_conn (connectd.c:708)
==384981== by 0x141459: destroy_conn (poll.c:244)
==384981== by 0x14147F: destroy_conn_close_fd (poll.c:250)
==384981== by 0x149EB9: notify (tal.c:240)
==384981== by 0x149F8B: del_tree (tal.c:402)
==384981== by 0x14A51A: tal_free (tal.c:486)
==384981== by 0x140036: io_close (io.c:450)
==384981== by 0x1400B3: do_plan (io.c:401)
==384981== by 0x140134: io_ready (io.c:423)
==384981== by 0x141A57: io_loop (poll.c:445)
==384981== by 0x112CB0: main (connectd.c:1703)
==384981== Address 0x4d67020 is 64 bytes inside a block of size 160 free'd
==384981== at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==384981== by 0x14A020: del_tree (tal.c:421)
==384981== by 0x14A51A: tal_free (tal.c:486)
==384981== by 0x1110C5: try_connect_one_addr (connectd.c:806)
==384981== by 0x112BA1: destroy_io_conn (connectd.c:708)
==384981== by 0x141459: destroy_conn (poll.c:244)
==384981== by 0x14147F: destroy_conn_close_fd (poll.c:250)
==384981== by 0x149EB9: notify (tal.c:240)
==384981== by 0x149F8B: del_tree (tal.c:402)
==384981== by 0x14A51A: tal_free (tal.c:486)
==384981== by 0x140036: io_close (io.c:450)
==384981== by 0x1405DC: io_connect_ (io.c:345)
==384981== Block was alloc'd at
==384981== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==384981== by 0x149CF1: allocate (tal.c:250)
==384981== by 0x14A3C6: tal_alloc_ (tal.c:428)
==384981== by 0x1114F2: try_connect_peer (connectd.c:1526)
==384981== by 0x111717: connect_to_peer (connectd.c:1558)
==384981== by 0x1124F5: recv_req (connectd.c:1627)
==384981== by 0x1188B2: handle_read (daemon_conn.c:31)
==384981== by 0x13FBCB: next_plan (io.c:59)
==384981== by 0x140076: do_plan (io.c:407)
==384981== by 0x140113: io_ready (io.c:417)
==384981== by 0x141A57: io_loop (poll.c:445)
==384981== by 0x112CB0: main (connectd.c:1703)
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: Occasional crash in connectd due to use-after-free
Fixes: #4343
From #clightning:
(11:24:10) andytoshi: hiya, i'm trying to set up a new lightningd node, and when i run lightningd --network=bitcoin --log-level=debug --daemon
(11:24:17) andytoshi: i get errors of the form fetchinvoice: Malformed JSON reply '2021-01-25T00:51:16.655Z DEBUG plugin-offers: disabled itself at init: offers not enabled in config
(11:24:43) andytoshi: there are a couple variants of this, but always some form of "something: failed to parse <a log line> as json"
Indeed, we close stdout, and it ends up being reused for some plugin.
But the real problem is that we log to stdout by default, which doesn't
make sense. If they really want to discard logs, they can use
--log-file=/dev/null.
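To make the failure mode concrete, here is a minimal Python sketch of the
descriptor-reuse mechanism (a demonstration only, not lightningd code; the
file name is made up):
```
import os

# POSIX hands out the lowest free descriptor: once fd 1 (stdout) is
# closed, the next open() takes it over, so anything still logging to
# "stdout" scribbles into whatever now occupies fd 1.
os.close(1)
fd = os.open("plugin-pipe.txt", os.O_CREAT | os.O_WRONLY, 0o644)
assert fd == 1
print("this log line now lands inside plugin-pipe.txt", flush=True)
```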
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: JSON failures when --daemon is used without --log-file.
Rearranges the `peer_connected_hook_payload` definition to the location
where it is used in the file.
Fixes certain blank lines and line breaks to make the code look nicer.
Nit: the underscore in "openchannel_hook" is wrong, because the name of
the hook is just "openchannel". The "_hook" suffix implied it to be part
of the name.
Changelog-None
We are printing `repr(obj)`, which is not pretty-printed, hard to read,
and can't even be copied into JSON tools for inspection. We now print the
JSONified and indented calls and responses for easier debugging based
solely on the logs (useful for CI!).
Changelog-Added: pyln-testing: The RPC client will now pretty-print requests and responses to facilitate log-based debugging.
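Roughly the difference, with a hypothetical request object (a sketch, not
the pyln-testing code itself):
```
import json

# Hypothetical RPC request; the field values are illustrative only.
request = {"jsonrpc": "2.0", "id": 1, "method": "getinfo", "params": {}}

# Before: repr() is one line, single-quoted, and not valid JSON, so it
# cannot be pasted into JSON tooling.
print(repr(request))

# After: indented JSON that copies straight into JSON tools.
print(json.dumps(request, indent=2))
```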
We don't have a good way of referring to the configuration that
failed, so let's give each one a numeric ID. Particularly useful for the
artifacts that'd be overwritten otherwise.
The test was not considering that concurrent sendrawtx of the same tx
is not stable: either endpoint may submit it first. Now we just check
state transitions and the mempool.
If we're quick (or the node is slow) we end up reconnecting before our
counterparty has realized the state transition, resulting in an
unexpected re-establish.
We were sometimes waiting only 5 seconds, which is way too short on a
heavily loaded machine such as CI. Making it 30 seconds and collecting
it in a single place so we can adjust more easily.
We were getting a couple of starvations, so we need a fair filelock. I
also wasn't too happy with the lock as is, so I hand-coded it quickly.
Should be correct, but the overall timeout will tell us how well we
are doing on CI.
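For flavour, a minimal sketch of one way to build a fair (FIFO) lock out
of ticket files; this is illustrative only, not the code that landed, and
all names are made up. A real version would also have to reap tickets
left behind by crashed processes:
```
import os
import time

class FairFileLock:
    """Ticket lock over a shared directory: lowest ticket runs first."""

    def __init__(self, lockdir):
        self.lockdir = lockdir
        os.makedirs(lockdir, exist_ok=True)
        self.ticket = None

    def acquire(self):
        # Atomically claim the next free ticket via O_EXCL; on collision
        # somebody else got that number first, so retry with a higher one.
        while self.ticket is None:
            nxt = max([int(f) for f in os.listdir(self.lockdir)],
                      default=-1) + 1
            try:
                fd = os.open(os.path.join(self.lockdir, "%010d" % nxt),
                             os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                self.ticket = nxt
            except FileExistsError:
                pass
        # FIFO fairness: proceed only once ours is the lowest ticket left.
        while min(int(f) for f in os.listdir(self.lockdir)) != self.ticket:
            time.sleep(0.05)

    def release(self):
        os.remove(os.path.join(self.lockdir, "%010d" % self.ticket))
        self.ticket = None
```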
We also make the logic a bit nicer to read. The failure was due to
more than one status message being present if we look at the wrong
time:
```
arr = ['CLOSINGD_SIGEXCHANGE:We agreed on a closing fee of 20334 satoshi for tx:17f1e9d377840edf79d8b6f1ed0faba59bb307463461...9b98', 'CLOSINGD_SIGEXCHANGE:Waiting for another closing fee offer: ours was 20334 satoshi, theirs was 20332 satoshi,']

    def only_one(arr):
        """Many JSON RPC calls return an array; often we only expect a single entry
        """
>       assert len(arr) == 1
E       AssertionError
```
We were getting a number of incompatibility warnings due to the
dependencies being expressed too rigidly. This loosens the requirement
definitions to being compatible with a known good version, and while
we're at it we also bump all outdated requirements.
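Concretely, this is the difference between an exact pin and pip's
compatible-release operator; the package and versions here are just an
example:
```
# Before: exact pin, conflicts with anything wanting a newer release.
cryptography==3.2
# After: "~=" accepts any release compatible with the known good one.
cryptography~=3.3
```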
We weren't waiting for the transactions to enter the mempool, which
could cause all of our fine-tuned block counts to be off. Now we just
wait for the expected number of txs.
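The waiting pattern is along these lines (a sketch built on the
pyln-testing helpers; the count of 2 and the fixture names are
illustrative):
```
from pyln.testing.utils import wait_for

# Block until both transactions are in bitcoind's mempool before mining,
# so the fine-tuned block counts in the test stay accurate.
wait_for(lambda: len(bitcoind.rpc.getrawmempool()) == 2)
bitcoind.generate_block(1)
```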