79 commits

Author SHA1 Message Date
Rusty Russell
2b4b1479ed gossipd: check that gossmap code sees updates from gossip_store writes.
After analyzing various weird cases where we ended up with duplicate
gossip_store entries, it could be explained by us not fully processing
the gossip store.

It's not clear that my assumption that we would always see our own writes
is true: technically this may require an fsync().  So we now add the
check, and if it fails, do an fsync and try again.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
2025-02-11 15:11:47 -06:00
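
The check-then-fsync-then-retry idea in 2b4b1479ed can be sketched roughly as follows. This is only an illustration under stated assumptions: the real reader is the mmap-based gossmap code, whereas this sketch reads back through a second file descriptor with pread(), and the helper names `reader_sees` and `append_and_verify` are hypothetical, not functions from the tree.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: does reading [off, off+len) via read_fd give back
 * exactly the bytes we expect? */
static bool reader_sees(int read_fd, off_t off, const void *expect, size_t len)
{
	const unsigned char *want = expect;
	unsigned char buf[4096];

	while (len > 0) {
		size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t r = pread(read_fd, buf, chunk, off);

		/* Short file, read error, or mismatching bytes: not visible yet. */
		if (r <= 0 || memcmp(buf, want, (size_t)r) != 0)
			return false;
		want += r;
		off += r;
		len -= (size_t)r;
	}
	return true;
}

/* Hypothetical helper: append a record, check the reader can see it, and
 * if not, fsync() and check once more. */
static bool append_and_verify(int write_fd, int read_fd,
			      const void *rec, size_t len)
{
	off_t off;

	assert(rec != NULL);
	off = lseek(write_fd, 0, SEEK_END);
	if (off < 0 || write(write_fd, rec, len) != (ssize_t)len)
		return false;

	if (reader_sees(read_fd, off, rec, len))
		return true;

	/* The write was not visible: force it out and retry once. */
	if (fsync(write_fd) != 0)
		return false;
	return reader_sees(read_fd, off, rec, len);
}
```

A caller would open the store once for writing and once for reading, then call `append_and_verify(wfd, rfd, record, record_len)` for each record and log loudly if it returns false.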
Rusty Russell
1df1300cc9 gossip_store: don't need to check for truncated amounts.
That's actually caught by the gossmap load now.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-02-11 15:11:47 -06:00
Rusty Russell
9d98740e18 gossmap: stricter checks when gossipd itself loads the gossip_store.
This means we will correctly reset the store if it has redundant
records, for example.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-02-11 15:11:47 -06:00
Rusty Russell
05bc4ca5f3 gossmap: use mmap directly to check checksums.
Instead of making a copy.

To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.

	No checksum check: 194.52s
	Copying for checksum check: 202.81s
	Zero-copy checksum check: 194.40s

But these numbers proved noisy.  Still, doesn't hurt.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-02-11 15:11:47 -06:00
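
The difference between the copying and zero-copy checks timed in 05bc4ca5f3 and 4b5e5b27ae can be pictured with a small sketch. Assumptions: `map` stands in for the pointer returned by mmap() on the gossip_store, the checksum is a plain bitwise CRC-32C over the given byte range, and the function names are hypothetical. The actual record layout and the exact bytes covered by the stored checksum are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *p, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= p[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Copying variant: duplicate the record bytes, then checksum the copy. */
static uint32_t checksum_with_copy(const uint8_t *map, size_t off, size_t len)
{
	uint8_t *copy = malloc(len ? len : 1);
	uint32_t crc;

	assert(copy != NULL);
	memcpy(copy, map + off, len);
	crc = crc32c(copy, len);
	free(copy);
	return crc;
}

/* Zero-copy variant: checksum the bytes where they sit in the mapping. */
static uint32_t checksum_in_place(const uint8_t *map, size_t off, size_t len)
{
	return crc32c(map + off, len);
}
```

Both variants return the same value; the in-place one simply skips the allocation and memcpy per record.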
Rusty Russell
4b5e5b27ae gossmap: check checksums.
We assume if it's incorrect, we simply need to wait.  If this proves incorrect,
we will see a stream of BROKEN log messages.

To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.

	Before: 194.52s
	After: 202.81s

So it's marginal.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-02-11 15:11:47 -06:00
Rusty Russell
5e2f6c5028 gossmap: don't stop reading if we hit a redundant channel_announce.
While this shouldn't happen, it does (pending other fixes), and we stop reading the
gossip store until next time.  The result is partial gossip, demonstrated beautifully
by NicolasDorier's report:

```
lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!"
```

Gossipd stalled there and didn't make any more progress.  So gossipd itself
doesn't see the entire gossip_store.

Then things get really batshit:

```
2025-02-04T05:53:28.582Z DEBUG   gossipd: Store compact time: 1429910 msec
```

This took 1429 seconds to process.  Why?

Because it hasn't been processing the gossip store fully, gossipd kept adding "new" records to the end:

```
2025-02-04T05:53:28.583Z DEBUG   gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true)
```

It has 31GB of gossip in there!  No wonder it took so long...

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/8035
Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.
2025-02-11 15:11:47 -06:00
Rusty Russell
fdfc7ce62f gossmap: add (and use) logging hook.
Default goes to stderr for LOG_UNUSUAL and higher.

We have to whitelist more cases in map_catchup so we don't spam the logs
with perfectly-expected (but ignored) messages though.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-02-11 15:11:47 -06:00
Rusty Russell
607b14fe12 common/gossmap: remove open-by-fd.
We only use it in one place, and that was simply to share an fd between
gossipd writing and gossipd reading, which may be causing our zfs problem
anyway.

In fact, it fixes a race if we don't have HAVE_PWRITEV.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-02-11 15:11:47 -06:00
Rusty Russell
927d062b04 gossmap: don't crash if we hit a zero-length record.
We have a report of this happening under ZFS.  We cannot do much if
this really is a problem where we can't read back what we write, but
this avoids the immediate crash.

Fixes: https://github.com/ElementsProject/lightning/issues/7971
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
2025-02-11 15:11:47 -06:00
Rusty Russell
b6c1ffa359 ccan/htable: update to explicit DUPS/NODUPS types.
The updated API requires typed htables to explicitly state whether they
allow duplicates: for most cases we don't, but we've had issues in the
past.

This is a big patch, but mainly mechanical.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-01-21 09:18:25 +10:30
Rusty Russell
69c252e06f gossmap: implement gossmap_random_node(), use it in gossipd.
It's easy for gossmap, since it has access to the htable.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-11-22 15:21:45 +10:30
Rusty Russell
0baac77a1c gossmap: allow gossmap_chan_get_update_details on locally-modified channels.
In particular, this lets you find the exact htlc_maximum_msat/htlc_minimum_msat
values.

This means we actually create real channel_updates for local mods, which
requires a second "local" scratch region.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
4ee9d1d2f2 gossmap: include cltv_expiry_delta in gossmap_chan_get_update_details for completeness.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
d067066b17 common/gossmap: use u64 for all offsets.
Since we don't compact the gossmap on the fly (FIXME!) we can
easily surpass 4GB in the gossmap, and 32 bit offsets are not
sufficient.

I'm a bit surprised we don't crash immediately, but we've definitely
seen issues.

Changelog-Fixed: gossipd: crash errors with large gossip_store (>4GB) growth on longer-running nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-08 09:50:17 +02:00
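
Why 32-bit offsets break past 4GB (d067066b17) can be shown in a few lines. This is a minimal sketch; `store_as_u32` and `fits_in_u32` are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: what a 32-bit offset field ends up holding. */
static uint32_t store_as_u32(uint64_t off)
{
	return (uint32_t)off;	/* wraps modulo 2^32 */
}

/* True if the offset survives a round trip through a 32-bit field. */
static bool fits_in_u32(uint64_t off)
{
	return (uint64_t)store_as_u32(off) == off;
}
```

An offset of 4294967296 + 10 bytes stored this way comes back as 10, so the reader would silently look at the wrong record near the start of the file rather than failing outright.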
Rusty Russell
5052f0763f gossmap: keep capacity for locally-generated channels as well.
It was weird not to have a capacity associated with localmods channels, and
fixing it has some very nice side effects.

Now the gossmap_chan_get_capacity() call never fails (we prevented reading
of channels from gossmap in the partially-written case already), so we
make it return the capacity.  We do this in msat, because that's what
all the callers want.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
a65e325b13 gossmap: implement partial updates.
This is actually what we want in several places: to only override one or
two fields in a channel_update.

We add a gossmap_local_setchan() with a similar API to the old
gossmap_local_updatechan(), for the case where we want to set every
field.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
bc1aabb014 gossmap: don't crash on localmods on non-existent channels.
We allow adding them, but crash when we remove the localmods.  Yet
this could theoretically happen if a channel we modified was removed
from the gossmap, anyway.

Reported-by: Lagrang3 <lagrang3@protonmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
e11bab8bbb gossmap: don't process channel_announcement until amount is present.
This simplifies the callers significantly: all channel_announcements now
have an amount, so gossmap_chan_get_capacity() only fails on a local
modification.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-07 20:35:30 +09:30
Rusty Russell
15fb37f6d1 common: fix endless loop in gossmap iteration.
If we need to iterate forward to find a timestamp (only happens if we have gossip older than
2 hours), we didn't exit the loop, as it didn't actually move the offset.

Fixes: https://github.com/ElementsProject/lightning/issues/7462
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-07-25 12:30:42 -07:00
Rusty Russell
b29b96aae8 common: hoist scidd->pubkey conversion function into gossmap.
We will want to use it in the pay plugin too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-07-18 10:53:55 +09:30
Rusty Russell
ba2bb5531d gossmap: add linear streaming interface.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-07-10 12:21:19 +09:30
Rusty Russell
6b91497223 common: make gossmap ignore redundant channel_announcements.
This seems to be happening to some people, so don't panic.  Unfortunately we don't have
a good error callback here, so msg to stderr.

Fixes: https://github.com/ElementsProject/lightning/issues/7249
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-05-23 20:23:36 +02:00
Rusty Russell
744116e501 gossipd: make extra-sure we don't put in redundant channel_announcement messages.
We only write these in two places: one where we get a message from lightningd about
our own channel, and one where we get a reply from lightningd about a txout check.

The former case we explicitly check that we don't already have it in gossmap, so
add checks to the latter case, and give verbose detail if it's found.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-05-23 20:23:36 +02:00
Rusty Russell
9450d46db1 bitcoin/short_channel_id: pass by copy everywhere.
It's a u64, we should pass by copy.  This is a big sweeping change,
but mainly mechanical (change one, compile, fix breakage, repeat).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-03-20 13:51:48 +10:30
Rusty Russell
e0e879c003 common: remove type_to_string files altogether.
This means including <common/utils.h> where it was indirectly included.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-03-20 13:51:48 +10:30
Rusty Russell
0a7e6211df common: fix uninitialized member in gossmap.
Wrote a test program which passed num_channel_updates_rejected as NULL
(which we don't usually do), and valgrind complained:

```
==1048302== Conditional jump or move depends on uninitialised value(s)
==1048302==    at 0x118B90: update_channel (gossmap.c:550)
==1048302==    by 0x119EEE: map_catchup (gossmap.c:663)
==1048302==    by 0x11A299: load_gossip_store (gossmap.c:726)
==1048302==    by 0x11A352: gossmap_load (gossmap.c:1052)
==1048302==    by 0x125362: main (run-route-infloop.c:90)
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-03-07 14:09:14 +01:00
Rusty Russell
87f6ceb721 gossmap: fix OpenBSD crash.
Thanks to amazing debugging assistance from grubles, we figured out
that indeed, my memory was correct: write and mmap are not consistent
on all platforms.  The easiest fix is to disable mmap on OpenBSD for now:
the better fix is to do in-place updates using the mmap, and only rely
on write() for append (which always causes a remap anyway before it's accessed).

Fixes: https://github.com/ElementsProject/lightning/issues/7109
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-02-27 15:33:04 +01:00
Rusty Russell
5135658805 common: add gossmap_chan_is_dying() helper to check flags.
And fix up gossip_store backwards comment!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-02-12 11:43:33 +01:00
Rusty Russell
e7ceffd565 gossipd: remove zombie handling.
We never enabled it, because we seemed to be eliminating valid
channels.  We discard zombie-marked records on loading.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-02-04 09:24:44 +10:30
Rusty Russell
ce39309c0c common: optional gossmap callbacks for better failure handling.
In particular, allow callers to see unknown records we ignore (and let
them fail as a result), and get called if we can't pack a
channel_update into our internal format.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-02-04 09:24:44 +10:30
Rusty Russell
f2cf353431 common: gossmap method to load fd directly, not filename.
And helpers to tell if a node_announcement exists, and get a
full channel_update.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-02-04 09:24:44 +10:30
Rusty Russell
37ccca5d69 common/gossmap: remove now-unused private flag.
The only way you'll see private channel_updates is if you put them
there yourself with localmods.

I also renamed the confusing gossmap_chan_capacity to gossmap_chan_has_capacity.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-02-04 09:24:44 +10:30
Rusty Russell
8454e4910a topology: don't call gossmap for locally added channels.
This happens in deprecated mode, and we get bogus results.  Valgrind caught it!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-01-31 13:39:23 +10:30
Rusty Russell
4b92c773df common: gossmap now always ignores private gossip_store messages.
In the next PR, they'll be removed, but for now all our code doesn't
want them.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-12-14 09:16:56 +10:30
Rusty Russell
f2fff4de55 gossmap: insert temporary per-caller flag to turn off private gossip.
This lets us convert one user at a time.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-12-14 09:16:56 +10:30
Rusty Russell
be3a59c7c3 gossmap: fix false valgrind uninitialized error on arm64, ppc.
Doesn't happen on x86, but struct gossmap_chan defines:

```
	u32 private: 1;
	u32 plus_scid_off: 31;
```

And complains when we initialize plus_scid_off and access it later:

```
VALGRIND=1 valgrind -q --error-exitcode=7 --track-origins=yes --leak-check=full --show-reachable=yes --errors-for-leak-kinds=all plugins/renepay/test/run-mcf > /dev/null
==186886== Conditional jump or move depends on uninitialised value(s)
==186886==    at 0x10076388: chan_iter (gossmap.c:1098)
==186886==    by 0x100797F3: gossmap_next_chan (gossmap.c:1112)
==186886==    by 0x1008C5AF: main (run-mcf.c:309)
==186886==  Uninitialised value was created by a heap allocation
==186886==    at 0x40F0A44: malloc (vg_replace_malloc.c:431)
==186886==    by 0x10072BAF: allocate (tal.c:256)
==186886==    by 0x100737A7: tal_alloc_ (tal.c:463)
==186886==    by 0x100738DF: tal_alloc_arr_ (tal.c:506)
==186886==    by 0x10079507: load_gossip_store (gossmap.c:690)
==186886==    by 0x10079667: gossmap_load (gossmap.c:978)
==186886==    by 0x1008C4AF: main (run-mcf.c:295)
```

Reported-by: @grubles
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: #6557
2023-08-18 16:21:57 +09:30
Rusty Russell
2005ca436e common/gossmap: don't memcpy NULL, 0, and don't add 0 to NULL pointer.
Of course, NULL and length 0 are natural partners, but We Can't Have Nice Things.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-04-05 06:12:24 +09:30
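
The two pitfalls named in 2005ca436e are both undefined behaviour in C even when the length is zero. A guarded version might look like the sketch below; the wrapper names `memcpy_maybe_empty` and `ptr_add` are hypothetical and are not claimed to match what the commit actually added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper: skip memcpy() entirely for zero lengths, so a
 * NULL source or destination never reaches it. */
static void *memcpy_maybe_empty(void *dst, const void *src, size_t len)
{
	if (len == 0)
		return dst;
	assert(dst != NULL && src != NULL);
	return memcpy(dst, src, len);
}

/* Hypothetical wrapper: avoid doing arithmetic on a NULL pointer, even
 * when the amount added is zero. */
static const uint8_t *ptr_add(const uint8_t *base, size_t off)
{
	return off == 0 ? base : base + off;
}
```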
Vincenzo Palazzo
a104380e49 fix: fixes FATAL SIGNAL 11 on gossmap node
This fixes a crash that I triggered on armv7.  Looking inside
the coredump with gdb (after adding an assert that n must
not be NULL), I get the following stacktrace:

```
(gdb) bt
\#0  0x00000000 in ?? ()
\#1  0x0043a038 in send_backtrace (why=0xbe9e3600 "FATAL SIGNAL 11") at common/daemon.c:36
\#2  0x0043a0ec in crashdump (sig=11) at common/daemon.c:46
\#3  <signal handler called>
\#4  0x00406d04 in node_announcement (map=0x938ecc, nann_off=495146) at common/gossmap.c:586
\#5  0x00406fec in map_catchup (map=0x938ecc, num_rejected=0xbe9e3a40) at common/gossmap.c:643
\#6  0x004073a4 in load_gossip_store (map=0x938ecc, num_rejected=0xbe9e3a40) at common/gossmap.c:697
\#7  0x00408244 in gossmap_load (ctx=0x0, filename=0x4e16b8 "gossip_store", num_channel_updates_rejected=0xbe9e3a40) at common/gossmap.c:976
\#8  0x0041a548 in init (p=0x93831c, buf=0x9399d4 "\n\n{\"jsonrpc\":\"2.0\",\"id\":\"cln:init#25\",\"method\":\"init\",\"params\":{\"options\":{},\"configuration\":{\"lightning-dir\":\"/home/vincent/.lightning/testnet\",\"rpc-file\":\"lightning-rpc\",\"startup\":true,\"network\":\"te"..., config=0x939cdc) at plugins/topology.c:622
\#9  0x0041e5d0 in handle_init (cmd=0x938934, buf=0x9399d4 "\n\n{\"jsonrpc\":\"2.0\",\"id\":\"cln:init#25\",\"method\":\"init\",\"params\":{\"options\":{},\"configuration\":{\"lightning-dir\":\"/home/vincent/.lightning/testnet\",\"rpc-file\":\"lightning-rpc\",\"startup\":true,\"network\":\"te"..., params=0x939c8c)
    at plugins/libplugin.c:1208
\#10 0x0041fc04 in ld_command_handle (plugin=0x93831c, toks=0x939bec) at plugins/libplugin.c:1572
\#11 0x00420050 in ld_read_json_one (plugin=0x93831c) at plugins/libplugin.c:1667
\#12 0x004201bc in ld_read_json (conn=0x9391c4, plugin=0x93831c) at plugins/libplugin.c:1687
\#13 0x004cb82c in next_plan (conn=0x9391c4, plan=0x9391d8) at ccan/ccan/io/io.c:59
\#14 0x004cc67c in do_plan (conn=0x9391c4, plan=0x9391d8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
\#15 0x004cc6dc in io_ready (conn=0x9391c4, pollflags=1) at ccan/ccan/io/io.c:417
\#16 0x004cf8cc in io_loop (timers=0x9383c4, expired=0xbe9e3ce4) at ccan/ccan/io/poll.c:453
\#17 0x00420af4 in plugin_main (argv=0xbe9e3eb4, init=0x41a46c <init>, restartability=PLUGIN_STATIC, init_rpc=true, features=0x0, commands=0x6167e8 <commands>, num_commands=4, notif_subs=0x0, num_notif_subs=0, hook_subs=0x0, num_hook_subs=0, notif_topics=0x0, num_notif_topics=0) at plugins/libplugin.c:1891
\#18 0x0041a6f8 in main (argc=1, argv=0xbe9e3eb4) at plugins/topology.c:679
```

I do not know if this is the right solution, because I do not know
when we can end up parsing a node announcement for a node that
is no longer in the gossip map.

So, I hope this is at least useful for @rustyrussell

Changelog-Fixed: fixes `FATAL SIGNAL 11` on gossmap node announcement parsing.

Signed-off-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
2023-02-13 17:51:41 -06:00
Rusty Russell
0274d88bad common/gossip_store: clean up header.
It's actually two separate u16 fields, so actually treat it as
such!

Cleans up zombie handling code a bit too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-01-30 15:15:41 -06:00
Alex Myers
1bae8cd28a gossipd: zombify inactive channels instead of pruning
Though BOLT 7 says a channel may be pruned when one side becomes inactive
and fails to refresh their channel_update, in practice, the
channel_announcement can be difficult to recover if deleted entirely.
Here the channel_announcement is tagged as zombie such that gossip_store
consumers may safely ignore it, but it may be retained should the channel
come back online in the future. Node_announcements and channel_updates may
also be retained in such a fashion until the channel is ready to be
resurrected.

Changelog-Fixed: Pruned channels are more reliably restored.
2023-01-30 16:33:03 +10:30
Rusty Russell
5dfcd15782 all: no longer need to call htable_clear to free htable contents.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-01-12 11:44:10 +10:30
Rusty Russell
4a570c9419 gossmap: ensure htables are always tal objects.
We want to change the htable allocator to use tal, which will need
this.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2023-01-12 11:44:10 +10:30
Rusty Russell
4bc10579e6 listincoming: add htlc_min_msat, public and peer_features fields.
This is needed for offers to generate blinded paths.

No documentation changes since listincoming is an undocumented
internal hack interface which topology presents for production
of routehints.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-11-09 15:08:03 +01:00
Rusty Russell
82d98e4b96 gossmap: move gossmap_guess_node_id to pay plugin.
This removes a point32 dependency.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-10-26 11:29:06 +10:30
Rusty Russell
bed905a394 lightningd: use 33 byte pubkeys internally.
We still use 32 bytes on the wire, but internally don't use x-only.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-10-26 11:29:06 +10:30
Rusty Russell
bb49e1bea5 common: assume htlc_maximum_msat, don't check bit any more.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-09-24 15:22:27 +09:30
Rusty Russell
253b25522b BOLT: update to version which requires option_channel_htlc_max.
We will now simply reject old-style ones as invalid.  Turns out the
only trace we could find is a channel between two nodes unconnected to
the rest of the network.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Protocol: We now require all channel_update messages include htlc_maximum_msat (as per latest BOLTs)
2022-09-24 15:22:27 +09:30
Rusty Russell
6338758018 gossmap: make API more robust against future changes.
Many changes to gossmap (including the pending ones!) don't actually
concern readers, as long as they obey certain rules:

1. Ignore unknown messages.
2. Treat all 16 upper bits of length as flags, ignore unknown ones.

So now we split the version byte into MAJOR and MINOR, and you can
ignore MINOR changes.

We don't expose the internal version (for creating the map)
programmatically: you should really hardcode what major version you
understand!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-09-24 15:22:27 +09:30
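
The two reader rules from 6338758018 can be sketched as below. Assumptions, none of them confirmed by the commit message: the version byte is split as top 3 bits MAJOR and low 5 bits MINOR, the length word is 32 bits with flags in its upper 16 bits, and `FLAG_DELETED` is a hypothetical flag value. All names here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed split of the version byte: top 3 bits major, low 5 bits minor. */
#define VERSION_MAJOR_MASK 0xE0u
#define VERSION_MINOR_MASK 0x1Fu

/* Hypothetical flag living in the upper 16 bits of the length word. */
#define FLAG_DELETED 0x8000u

static unsigned version_major(uint8_t version)
{
	return (version & VERSION_MAJOR_MASK) >> 5;
}

static unsigned version_minor(uint8_t version)
{
	return version & VERSION_MINOR_MASK;
}

/* A reader hardcodes the major version it understands and ignores minor. */
static bool reader_can_load(uint8_t version, unsigned understood_major)
{
	return version_major(version) == understood_major;
}

/* Upper 16 bits of the length word are flags... */
static uint16_t record_flags(uint32_t len_word)
{
	return (uint16_t)(len_word >> 16);
}

/* ...and only the lower 16 bits are the actual record length. */
static uint32_t record_len(uint32_t len_word)
{
	return len_word & 0xFFFFu;
}

/* Test only the flag we know about; unknown flag bits are ignored. */
static bool record_is_deleted(uint32_t len_word)
{
	return (record_flags(len_word) & FLAG_DELETED) != 0;
}
```

Under these assumptions, a store written with a newer minor version still loads, and a record carrying a flag bit the reader has never heard of still yields the correct length.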
Rusty Russell
fd71dfc7f7 gossmap: optimize asserts().
They are surprisingly expensive!

Running `time ./plugins/renepay/test/run-not_mcf-gossmap gossip_store-sgl.rustcorp.com.au-2022-04-19 024b9a1fa8e006f1e3937f65f66c408e6da8e1ca728ea43222a7381df1cc449605 02ebb3b8a2316b3e876ea3f3d8124a3ab97f30b128f619608eb06b5251235dc2d9 10000000000 0.1`:

Before (-Og):
	real	0m1.495s
Before (no opt):
	real	0m2.552s

After (-Og):
	real	0m0.579s
After (no opt):
	real	0m1.061s

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-09-19 10:18:55 +09:30
Rusty Russell
4cdb4167d2 gossmap: make local_addchan create private channel_announcement in correct order.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2022-09-19 10:18:55 +09:30