We noticed bogus behavior where (with 200 blocks added at once)
lightningd didn't see a DF channel open: this is because we told
lightningd about the new blocks (and it timed out channel) before
the watches for the transaction are fired.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We had this assertion fail, and I can't see a clear reason why. Remove it
(i.e. don't crash because we're having trouble with creating invoice routehints)
and add logging.
```
Assertion failed: a->c->rr_number < b->c->rr_number (lightningd/invoice.c: cmp_rr_number: 623)
lightningd: FATAL SIGNAL 6 (version v0.10.2-modded)
0x5654f1c40061 send_backtrace
common/daemon.c:33
0x5654f1c400e9 crashdump
common/daemon.c:46
0x7efd87da6c8a ???
???:0
```
There are several possible causes for this:
1. We have two channels with the same rr_number. A quick audit shows we always set that rr_number to a unique value (and 64 bits, so wrap is not possible between the release and now!).
2. It's theoretically possible that sort() could compare a value with itself, but that would be really dumb: it doesn't that I've ever seen, but then, we've never seen this assert() hit, either.
3. listincoming has given us the same channel twice. I don't see how that is possible: we had a race where channels could momentarily vanish, but never be duplicated (emailed crash.log shows no duplicates!).
4. General corruption/freed memory access. If a channel we've just looked up is gone but still in the hash table, this is possible but would cause lots of random behavior and crashes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes 25699994e5 (pytest: fix flake in test_node_reannounce).
Converting ->set->list does not give deterministic order, so we can
end up with the two lists differing due to order:
```
May send its own announcement *twice*, since it always spams us.
msgs2 = list(set(msgs2))
> assert msgs == msgs2
E AssertionError: assert ['01012ff5580...000000000000'] == ['01014973d81...000000000000']
E At index 0 diff: '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000' != '01014973d8160dd8fc28e8fb25c40b9d5c68aed8dfb36af9fc13e4d2040fb3718553051a188ce98239c0bed138e1f8713a64acc7de98c183c9597fa58bf37f0b89bb0007800000080269a262bbd16c022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59022d2253494c454e544152544953542d336333626132392d6d6f6464656400000000000000'
E Full diff:
E [
E + '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000',
E '01014973d8160dd8fc28e8fb25c40b9d5c68aed8dfb36af9fc13e4d2040fb3718553051a188ce98239c0bed138e1f8713a64acc7de98c183c9597fa58bf37f0b89bb0007800000080269a262bbd16c022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59022d2253494c454e544152544953542d336333626132392d6d6f6464656400000000000000',
E - '01012ff55800f5b9492021372d74df4d6547bb0d32aec8d4c932a8c3b044e4bd983c429154e73091b0a2aff1cf9bbf16b37e6e9dd10ce4c2d949217366472acd341b0007800000080269a262bbd1750266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c035180266e453454e494f524245414d000000000000000000000000000000000000000000000000',
E ]
``
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Got complaints about us hanging up on some nodes because they don't respond
to pings in a timely manner (e.g. ACINQ?), but that turned out to be something
else.
Nonetheless, we've had reports in the past of LND badly prioritizing gossip
traffic, and thus important messages can get queued behind gossip dumps!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: connectd: give busy peers more time to respond to pings.
Caused a crash in CI, reproduced under valgrind by calling
any_channel_by_scid from io_poll_lightningd:
```
==2422524== Conditional jump or move depends on uninitialised value(s)
==2422524== at 0x12C98D: any_channel_by_scid (channel.c:606)
==2422524== by 0x14FF75: io_poll_lightningd (lightningd.c:682)
==2422524== by 0x225FDE: io_loop (poll.c:420)
==2422524== by 0x14A914: io_loop_with_timers (io_loop_with_timers.c:22)
==2422524== by 0x150C4E: main (lightningd.c:1193)
==2422524== Uninitialised value was created by a heap allocation
==2422524== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2422524== by 0x234F61: allocate (tal.c:250)
==2422524== by 0x235522: tal_alloc_ (tal.c:428)
==2422524== by 0x12B500: new_unsaved_channel (channel.c:203)
==2422524== by 0x13B77A: json_openchannel_init (dual_open_control.c:2610)
==2422524== by 0x14C78D: command_exec (jsonrpc.c:630)
==2422524== by 0x14CD9F: rpc_command_hook_final (jsonrpc.c:765)
==2422524== by 0x181DDA: plugin_hook_call_ (plugin_hook.c:278)
==2422524== by 0x14D198: plugin_hook_call_rpc_command (jsonrpc.c:853)
==2422524== by 0x14D6A0: parse_request (jsonrpc.c:957)
==2422524== by 0x14DAFE: read_json (jsonrpc.c:1054)
==2422524== by 0x2231C8: next_plan (io.c:59)
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Because we expire cache every flush, in DEVELOPER mode that can happen in
just over a second. If gossipd takes a while to process the gossip,
this can mean we actually forget we received it from the peer.
Easiest fix is to run this test in non-DEVELOPER mode.
``` # With DEVELOPER, this is long enough for gossip flush.
time.sleep(2)
> assert not l3.daemon.is_in_log(r'\[OUT\] 0100')
E AssertionError: assert not '2022-06-30T06:00:31.031Z 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: [OUT] 0100294834e94aa2407ace65373c84c2a8056d67a6778c22bd43f03a9000d8a01f075c3996ab6c88951bd9805adbfde78292022d7514ffcec37d5e466ef5ebd235c87ffa8a8567904502cd65ae913dbb8014509f442a6e37b00fd59fa3a06811a4965f45df1612d3a32b2566001430cfec4346cc021bd0a1bf6a87417dbc29e0715f16c4f9dc5a01bff07368ba8f54cf9813c3fc8f9f8b8c8005da2a18021f322a9a168af8f82d9deba49ced51bb26a9abc001bbad314a15baccc87c685e9302533c2e86a7b9dfe90bd4cb6be5c2da375be8c4a535a63cacff0544957a34ca865fdb61f9198d19dfd990a0fcbf8de39fa2c0fc54435c16e74df2b49fe3b6e905d599000006226e46111a0b59caaf126043eb5bbf28c34f3a5e332a1fc7b2b73cf188910f0000670000010001022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d590266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c0351802e3bd38009866c9da8ec4aa99cc4ea9c6c0dd46df15c61ef0ce1f271291714e5702324266de8403b3ab157a09f1f784d587af61831c998c151bcc21bb74c2b2314b'
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It actually only sets the prefix for the lightningd core log messages;
the other logs have their own prefix.
Make it a real, process-wide prefix which actually goes in front of the timestamp.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: options: `log-prefix` now correctly prefixes *all* log messages.
Since we now log directly, we don't prepend the prefix ourselves, making it really
hard to tell *which* lightningd the log applies to!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Looking at the CI logs, it seems like it took over 5 seconds, so
the unilateral close occurred instead of the expected rejection
of the WIRE_SHUTDOWN reply. Make it bulletproof.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
That's useful for "tell me everything about this node" debugging.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: #5348
Changelog-Added: lightningd: `log-level=debug:<partial-nodeid>` supported to get debug-level logs for everything about a peer.
If we can't broadcast the tx, confirm that it didn't end up in the
mempool or the utxo set before throwing an error.
Note that this doesn't protect us in the case where the funding
output has already been *spent*... but that's extremely rare, right?
Fixes#5296
Reported-By: @rustyrussell
Collab-With: @vincenzopalazzo
Check funding_outnum validity first to avoid reading invalid outputs
Changelog-Fixed: Fixed a potential issue if the number of outputs decreases in a dualopen RBF or splice.
This grows the routing state in order to index both okay-to-broadcast
and rate-limited gossip. The gossip_store also logs the rate-limited
gossip if useful. This allows the broadcast of the last non-rate-limited
gossip.
routing.c now flags rate-limited gossip as it enters the gossip_store but
makes use of it in updating the routing graph. Flagged gossip is not
rebroadcast to gossip peers.
Changelog-Changed: gossipd: now accepts spam gossip, but squelches it for
peers.
This will be used to decouple internal use of gossip from what is
passed to gossip peers. Updates GOSSIP_STORE_VERION to 10.
Changelog-Changed: gossip_store updated to version 10.
I had to parse quite a few of these files debugging zeroconf, so I
thought it might be nice to have direct access here.
Changelog-Added: pyln-testing: Added utilities to read and parse `gossip_store` file for nodes.
With zeroconf we have to duplicate the `local_channel_announcement`
since we locally announce the aliased version, and then on the first
confirmation we also add the funding scid version.
We used to agree up on the `minimum_depth` with the peer, thus when
they told us that the funding locked we'd be sure we either have a
scid or we'd trigger the state transition when we do. However if we
had a scid, and we got a funding_locked we'd trust them not to have
sent it early. Now we explicitly track the depth in the channel while
waiting for the funding to confirm.
Changelog-Fixed: channeld: Enforce our own `minimum_depth` beyond just confirming
The spec explicitly asks for the first point, while we were using the
most recent one. This worked fine before zeroconf, but with zeroconf
it can happen.
Not only can the outgoing edge be a zeroconf channel, it can also be
the incoming channel. So we revert to the usual trick of using the
local alias if the short_channel_id isn't known yet.
We use the LOCAL alias instead of the REMOTE alias even though the
sender likely used the REMOTE alias to refer to the channel. This is
because we control the LOCAL alias, and we keep it stable during the
lifetime of the channel, whereas the REMOTE one could change or not be
there yet.
We use this in a couple of places, when we want to refer to a channel
by its `short_channel_id`, I'm moving this into a separate function
primarily to have a way to mark places where we do that.
The direction only depends on the ordering between node_ids, not the
short_channel_id, so we can include it and it won't change. This was
causing some trouble loading the `channel_hints` in the `pay` plugin.
We don't trigger on depth=0 since that'd give us bogus blockheights
and pointers into the chain, instead we defer until we get a first
confirmation. This extracts some of the logic from `lockin_complete`,
into the depth change listener (not the remote funding locked since at
that point we're certainly locked in and we don't really care about
that for bookkeeping anyway).
This is needed for us to transition to CHANNELD_NORMAL for zeroconf
channels, i.e., channels where we don't have a short channel ID yet.
We'll have to call lockin_complete a second time, once we learn the
real scid.