After analyzing various weird cases where we ended up with duplicate
gossip_store entries, the most likely explanation is that we were not
fully processing the gossip store.
It's not clear that my assumption that we would always see our own writes
is true: technically this may require an fsync(). So we now add a
check, and if it fails, do an fsync() and try again.
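A minimal sketch of that check (hypothetical helper, not the actual
gossip_store code): after appending a record, pread() it back at the
expected offset; if that comes up short, fsync() and check once more.
```
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: after appending `len` bytes at `off`, make
 * sure our own read sees them; if not, fsync() and re-check. */
static bool write_is_visible(int fd, off_t off, size_t len)
{
	char buf[4096];
	size_t want = len < sizeof(buf) ? len : sizeof(buf);

	if (pread(fd, buf, want, off) == (ssize_t)want)
		return true;

	/* Perhaps the write simply isn't visible to us yet. */
	if (fsync(fd) != 0)
		return false;
	return pread(fd, buf, want, off) == (ssize_t)want;
}
```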
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
We had at least one report of overwriting the gossip_store file at
offset 1. Make sure this doesn't happen.
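A sketch of the kind of guard this adds (names hypothetical): appends
must land at the recorded end of the file, never back at offset 1
(the first byte after the version header), which would clobber the
whole store.
```
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: refuse to append anywhere but where we
 * believe the file ends. */
static bool append_offset_sane(int fd, off_t expected_end)
{
	return lseek(fd, 0, SEEK_END) == expected_end;
}
```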
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's actually the only one that uses it. We also tweak the way
gossip_store handles failure: gossmap_manage now tells it when to
reset the corrupted store.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Default goes to stderr for LOG_UNUSUAL and higher.
We do have to whitelist more cases in map_catchup, though, so we don't
spam the logs with perfectly-expected (but ignored) messages.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We only use it in one place, and that is simply to share an fd between
gossipd writing and gossipd reading, which may be causing our zfs problem
anyway.
In fact, removing it fixes a race if we don't have HAVE_PWRITEV.
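To see the race: without pwritev() the emulation must lseek() then
writev(), and on a shared fd the file offset is shared state, so a
concurrent reader can move it between the two calls. Roughly:
```
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Without HAVE_PWRITEV we emulate pwritev() with lseek()+writev().
 * On a *shared* fd the file offset is shared too: another reader on
 * the same fd can move it between these two calls, and the write
 * lands at the wrong place. */
static ssize_t emulated_pwritev(int fd, const struct iovec *iov,
				int iovcnt, off_t offset)
{
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;
	/* <-- a reader sharing this fd can move the offset here */
	return writev(fd, iov, iovcnt);
}
```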
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
test_closing_different_fees fails:
```
2024-10-14T08:43:30.2733614Z
2024-10-14T08:43:30.2734133Z # Now wait for them all to hit normal state, do payments
2024-10-14T08:43:30.2735205Z > l1.daemon.wait_for_logs(['update for channel .* now ACTIVE'] * num_peers
2024-10-14T08:43:30.2736233Z + ['to CHANNELD_NORMAL'] * num_peers)
2024-10-14T08:43:30.2736725Z
2024-10-14T08:43:30.2736903Z tests/test_closing.py:230:
...
2024-10-14T08:43:30.2761325Z E TimeoutError: Unable to find "[re.compile('update for channel .* now ACTIVE')]" in logs.
```
For some reason one of the channel_update injections does *not* evoke this message
from gossipd...
Changelog-None: debug!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This caused a "flake" in testing, because it's wrong:
```
_____________________________ test_gossip_pruning ______________________________
[gw2] linux -- Python 3.10.16 /home/runner/.cache/pypoetry/virtualenvs/cln-meta-project-AqJ9wMix-py3.10/bin/python
node_factory = <pyln.testing.utils.NodeFactory object at 0x7f0267530490>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7f0267532b30>
def test_gossip_pruning(node_factory, bitcoind):
    """ Create channel and see it being updated in time before pruning
    """
    l1, l2, l3 = node_factory.get_nodes(3, opts={'dev-fast-gossip-prune': None,
                                                 'allow_bad_gossip': True,
                                                 'autoconnect-seeker-peers': 0})
    l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
    l2.rpc.connect(l3.info['id'], 'localhost', l3.port)
    scid1, _ = l1.fundchannel(l2, 10**6)
    scid2, _ = l2.fundchannel(l3, 10**6)
    mine_funding_to_announce(bitcoind, [l1, l2, l3])
    wait_for(lambda: l1.rpc.listchannels(source=l1.info['id'])['channels'] != [])
    l1_initial_cupdate_timestamp = only_one(l1.rpc.listchannels(source=l1.info['id'])['channels'])['last_update']
    # Get timestamps of initial updates, so we can ensure they change.
    # Channels should be activated locally
>   wait_for(lambda: [c['active'] for c in l1.rpc.listchannels()['channels']] == [True] * 4)
```
Here you can see it has pruned:
```
lightningd-1 2025-01-24T07:39:40.873Z DEBUG gossipd: Pruning channel 105x1x0 from network view (ages 1737704380 and 0)
...
lightningd-1 2025-01-24T07:39:50.941Z UNUSUAL lightningd: Bad gossip order: could not find channel 105x1x0 for peer's channel update
```
Changelog-Fixed: Protocol: we were overzealous in pruning channels if we hadn't seen one side's gossip update yet.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The updated API requires typed htables to explicitly state whether they
allow duplicates: for most cases we don't, but we've had issues in the
past.
This is a big patch, but mainly mechanical.
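For illustration, here's roughly what a declaration looks like after
the sweep (assuming the updated CCAN macro names; the element type
and helpers here are hypothetical):
```
#include <ccan/htable/htable_type.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative element type and helpers. */
struct chan {
	unsigned long long scid;
};

static const unsigned long long *chan_scid(const struct chan *c)
{
	return &c->scid;
}

static size_t hash_scid(const unsigned long long *scid)
{
	return (size_t)*scid;
}

static bool chan_eq_scid(const struct chan *c,
			 const unsigned long long *scid)
{
	return c->scid == *scid;
}

/* The duplicate policy is now explicit in the name: this table
 * refuses duplicate keys (a _DUPS_ variant exists for the rare
 * tables that want them). */
HTABLE_DEFINE_NODUPS_TYPE(struct chan, chan_scid, hash_scid,
			  chan_eq_scid, chan_map);
```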
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
See: https://github.com/ElementsProject/lightning/issues/7899
A node with 23 connections gets far too many debug messages.
Changelog-Fixed: `gossipd` now does logging at trace, not debug level.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When testing autoconnect in --offline mode, autoconnect never functions
if the seeker has not gotten out of the startup state. This commit
overloads the counter to start the autoconnect attempt on the second
round through seeker_check.
Changelog-None
In fact, only 951 of 17419 (5%) node announcements are missing an address
(and gossipd doesn't know if we can connect to Tor addresses anyway) so
just check it *has* a node_announcement.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Let lightningd feed us hints to try first, but we can extract the
addresses from node_announcement messages ourselves.
(Lightningd used to ask gossipd on our behalf: this is far simpler!)
One side effect of this is that we don't hand back address hints given to us
by lightningd: it would use these again for reconnecting. This breaks
test_sendpay_grouping, so we disable it temporarily.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't want to refresh the gossmap internally: this could invalidate the
gossmap held by the current callers.
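A sketch of the hazard (gossmap_find_node is the real lookup;
refresh_map() and use_node() are hypothetical stand-ins): a refresh
can grow/remap the underlying store, leaving previously-returned
pointers dangling.
```
#include <common/gossmap.h>

/* Hypothetical stand-ins, purely to illustrate why a query function
 * must not refresh internally. */
void refresh_map(struct gossmap *map);
void use_node(const struct gossmap *map, const struct gossmap_node *n);

static void hazard(struct gossmap *map, const struct node_id *id)
{
	struct gossmap_node *n = gossmap_find_node(map, id);

	refresh_map(map);	/* may grow/remap the underlying store... */

	if (n)
		use_node(map, n);	/* ...so n may now be dangling! */
}
```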
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This does not validate a node announcement and address, but it
does select a node at random from the gossmap and asks lightningd
to attempt a connection to it.
Gossipd uses this to ask lightningd -> connectd to initiate
a connection to a new gossip peer. This can be used when
there are insufficient peers already connected to gossip with.
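A sketch of the selection side (gossmap_random_node and
gossmap_node_get_id are gossmap helpers; the wire message and daemon
plumbing here are hypothetical):
```
#include <ccan/short_types/short_types.h>
#include <ccan/take/take.h>
#include <common/daemon_conn.h>
#include <common/gossmap.h>

/* Hypothetical wire helper: lightningd forwards this to connectd. */
u8 *towire_gossipd_connect_to_peer(const tal_t *ctx,
				   const struct node_id *id);

static void seek_new_gossip_peer(struct daemon_conn *master,
				 struct gossmap *map)
{
	struct node_id id;
	struct gossmap_node *n = gossmap_random_node(map);

	if (!n)
		return;
	gossmap_node_get_id(map, n, &id);
	daemon_conn_send(master,
			 take(towire_gossipd_connect_to_peer(NULL, &id)));
}
```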
Changelog-Changed: Gossipd can now request connections to additional nodes for improved gossip sync
This can help us backfill any missing gossip if our current
peers haven't been the most reliable.
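For context, the "full sync" request is just a gossip_timestamp_filter
covering all time (towire_gossip_timestamp_filter is the generated
wire helper; the send helper here is a hypothetical stand-in for
gossipd's peer-send path):
```
#include <bitcoin/chainparams.h>
#include <ccan/take/take.h>
#include <stdint.h>
#include <wire/peer_wiregen.h>

struct peer;

/* Hypothetical stand-in for gossipd's peer-send path. */
void send_to_peer(struct peer *peer, const u8 *msg);

/* Called from an hourly timer against one randomly-chosen peer. */
static void ask_for_full_gossip(struct peer *peer)
{
	/* first_timestamp 0 + range UINT32_MAX = "send me everything". */
	send_to_peer(peer,
		     take(towire_gossip_timestamp_filter(NULL,
				&chainparams->genesis_blockhash,
				0, UINT32_MAX)));
}
```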
Changelog-Changed: Gossipd requests a full sync from a random peer every hour.
Based on gossip sync data from random network peers, listening to only 5
peers will not reliably catch all gossip traffic. For now, add extra
redundancy.
This may help the cases we see where gossipd doesn't realize channels
are closed (because we shut down before it processed the close).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: `gossipd` will no longer miss some channel closes on restart.
No code changes, just catching up with the BOLT changes which rework our
blinded path terminology (for the better!).
Another patch will sweep the rest of our internal names, this tries only to
make things compile and fix up the BOLT quotes.
1. Inside payload: current_blinding_point -> current_path_key
2. Inside update_add_htlc TLV: blinding_point -> blinded_path
3. Inside blinded_path: blinding -> first_path_key
4. Inside onion_message: blinding -> path_key
5. Inside encrypted_data_tlv: next_blinding_override -> next_path_key_override
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The "init_cupdate" message is for gossipd to tell lightningd about our *own* latest
channel_update messages, not the remote ones! The "remote_channel_update" message
is for messages from the peer.
This appeared as an occasional BROKEN message in CI:
```
**BROKEN** 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-chan#4: gossipd gave us channel_update for channel in gossip_state CGOSSIP_NEED_PEER_SIGS
```
Where we had sent (and not received) announcement_signatures, and restarted: the peer had meanwhile
sent us their channel_announcement and channel_update.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: Protocol: we could get confused on restart and not re-transmit our own channel_updates.
Since we don't compact the gossmap on the fly (FIXME!) we can
easily surpass 4GB in the gossmap, and 32-bit offsets are not
sufficient.
I'm a bit surprised we don't crash immediately, but we've definitely
seen issues.
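The failure mode is plain truncation; a standalone illustration (not
the gossmap code itself):
```
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t file_offset = (1ULL << 32) + 1234;  /* record past 4GB */
	uint32_t off32 = (uint32_t)file_offset;      /* silently truncates */

	/* Prints 1234: offsets past 4GB alias early records, hence
	 * the move to 64-bit offsets. */
	printf("%u\n", off32);
	return 0;
}
```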
Changelog-Fixed: gossipd: crash errors with large gossip_store (>4GB) growth on longer-running nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is a hack, but we've had nodes missing gossip. LDK does this,
so until we get a better workaround, use this one.
Changelog-Changed: Protocol: we now always ask the first peer for all its gossip.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This basically means moving the code from gossipd to connectd to handle
these queries.
This will give connectd finer control over ratelimiting them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This was removed from the spec on Apr 25, 2022. We stopped ever sending them
in 0.12.0 (2022-08-23). Now we remove receive support.
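At the wire level the change is small: the encoded short_channel_id
arrays in query messages begin with a one-byte encoding type
(0 = uncompressed, 1 = zlib), and we now accept only the uncompressed
form. A sketch of the receive-side check (helper name hypothetical):
```
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SCID_ENCODING_UNCOMPRESSED 0x00
#define SCID_ENCODING_ZLIB 0x01	/* no longer accepted */

/* Hypothetical sketch: only uncompressed arrays are valid now, and
 * the payload must be whole 8-byte short_channel_ids. */
static bool scid_array_ok(const uint8_t *encoded, size_t len)
{
	if (len < 1 || encoded[0] != SCID_ENCODING_UNCOMPRESSED)
		return false;
	return (len - 1) % 8 == 0;
}
```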
Changelog-Protocol: Removed support for zlib-compressed short-channel-ids in query responses.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Everyone understands gossip_queries now, but peers leave it unset to indicate
they have nothing useful to say.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We only write these in two places: one where we get a message from lightningd about
our own channel, and one where we get a reply from lightningd about a txout check.
In the former case we explicitly check that we don't already have it in
gossmap, so add checks to the latter case, and give verbose detail if
it's found.
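A sketch of that check (gossmap_find_chan is the real lookup,
signature approximate; the logging is illustrative): before appending
the channel_announcement from a txout reply, make sure the gossmap
doesn't already know the channel.
```
#include <common/gossmap.h>
#include <common/status.h>

/* Sketch: refuse (and complain loudly) if the channel is already in
 * the gossmap; real code logs much more detail about both records. */
static bool can_add_channel(const struct gossmap *map,
			    const struct short_channel_id *scid)
{
	if (gossmap_find_chan(map, scid)) {
		status_broken("channel %llu already in gossmap!",
			      (unsigned long long)scid->u64);
		return false;
	}
	return true;
}
```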
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't actually support it yet, but this threads through the type change,
puts it in "decode" etc.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We now *never* consider asking them anything, even if they are
capable (e.g. enabling full gossip stream). This aligns with
the latest spec proposal, where lack of `gossip_queries` means
"we don't have anything useful to say".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's a u64, so we should pass it by copy. This is a big sweeping change,
but mainly mechanical (change one, compile, fix breakage, repeat).
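For illustration, the shape of the change (short_channel_id really is
a one-field u64 struct; the function here is hypothetical):
```
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;
struct gossmap;

/* short_channel_id is just a wrapped u64. */
struct short_channel_id {
	u64 u64;
};

/* Before (illustrative):
 *   bool have_channel(const struct gossmap *map,
 *                     const struct short_channel_id *scid);
 * After: a u64-sized struct passes in a register just fine. */
bool have_channel(const struct gossmap *map, struct short_channel_id scid);
```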
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>