After analyzing various weird cases where we ended up with duplicate
gossip_store entries, they could all be explained by us not fully
processing the gossip store.
It's not clear that my assumption that we would always see our own writes
is true: technically this may require an fsync(). So we now add a check,
and if it fails, do an fsync() and try again.
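A minimal sketch of the read-back check, using a hypothetical helper and plain
POSIX calls (the real logic lives in gossipd's store-writing code):
```
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not the actual gossipd function: append a record at
 * `off`, then check we can read back what we just wrote; if not, fsync()
 * and check once more before giving up. */
static bool store_write_verified(int fd, off_t off, const void *rec, size_t len)
{
	char *buf;
	bool ok;

	if (pwrite(fd, rec, len, off) != (ssize_t)len)
		return false;

	buf = malloc(len);
	if (!buf)
		return false;
	ok = pread(fd, buf, len, off) == (ssize_t)len
		&& memcmp(buf, rec, len) == 0;
	if (!ok) {
		/* Maybe our own write isn't visible yet: flush and retry. */
		fsync(fd);
		ok = pread(fd, buf, len, off) == (ssize_t)len
			&& memcmp(buf, rec, len) == 0;
	}
	free(buf);
	return ok;
}
```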
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
Do the checksum check in place, instead of making a copy.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
No checksum check: 194.52s
Copying for checksum check: 202.81s
Zero-copy checksum check: 194.40s
But these numbers proved noisy. Still, doesn't hurt.
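In case it helps, a sketch of the zero-copy shape (assuming a crc32c-style
checksum; exactly what the checksum covers, and its seed, is whatever
gossip_store defines):
```
#include <ccan/crc32c/crc32c.h>
#include <ccan/short_types/short_types.h>
#include <stdbool.h>
#include <stddef.h>

/* Run the CRC directly over the mmap'd bytes rather than memcpy'ing the
 * record into a scratch buffer first. */
static bool csum_ok_in_place(const u8 *mapped, size_t off, size_t msglen,
			     u32 stored_csum)
{
	return crc32c(0, mapped + off, msglen) == stored_csum;
}
```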
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We assume if it's incorrect, we simply need to wait. If this proves incorrect,
we will see a stream of BROKEN log messages.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
Before: 194.52s
After: 202.81s
So it's marginal.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
While this shouldn't happen, it does (pending other fixes), and we stop reading the
gossip store until next time. The result is partial gossip, demonstrated beautifully
by NicolasDorier's report:
```
lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!
```
Gossipd stalls there and doesn't make more progress. So gossipd itself
doesn't see the entire gossip_store.
Then things get really batshit:
```
2025-02-04T05:53:28.582Z DEBUG gossipd: Store compact time: 1429910 msec
```
This took 1429 seconds to process. Why?
Because it hadn't been processing the gossip store fully, gossipd kept appending "new" records to the end:
```
2025-02-04T05:53:28.583Z DEBUG gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true)
```
It has 31GB of gossip in there! No wonder it took so long...
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/8035
Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.
The default goes to stderr for LOG_UNUSUAL and higher.
We do have to whitelist more cases in map_catchup, though, so we don't spam
the logs with perfectly-expected (but ignored) messages.
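For illustration, a hypothetical shape of that default (the real callback
signature is whatever common/gossmap.h declares):
```
#include <stdio.h>
#include <common/status_levels.h>

/* Hypothetical default: LOG_UNUSUAL and above go to stderr, anything
 * quieter is dropped. */
static void default_logcb(void *unused, enum log_level level, const char *msg)
{
	if (level >= LOG_UNUSUAL)
		fprintf(stderr, "gossmap: %s\n", msg);
}
```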
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We only use it in one place, and that was simply to share an fd between
gossipd writing and gossipd reading, which may be causing our ZFS problem
anyway.
In fact, it fixes a race if we don't have HAVE_PWRITEV.
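For context, a sketch of the hazard (assuming the non-pwritev fallback is
lseek() followed by write(), as the HAVE_PWRITEV guard suggests):
```
#include <unistd.h>

/* The seek and the write are two separate calls on the shared fd, so
 * anyone else using that fd can move the file offset in between them.
 * Separate fds (or a real pwritev()) make the position independent. */
static ssize_t emulated_pwrite(int fd, const void *buf, size_t len, off_t off)
{
	if (lseek(fd, off, SEEK_SET) != off)
		return -1;
	return write(fd, buf, len);
}
```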
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have a report of this happening under ZFS. We cannot do much if
this really is a problem where we can't read back what we write, but
this avoids the immediate crash.
Fixes: https://github.com/ElementsProject/lightning/issues/7971
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
The updated API requires typed htables to explicitly state whether they
allow duplicates: for most cases we don't, but we've had issues in the
past.
This is a big patch, but mainly mechanical.
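For illustration (treat the exact macro spellings as an assumption about the
updated CCAN API):
```
#include <ccan/htable/htable_type.h>

/* Where we previously wrote
 *	HTABLE_DEFINE_TYPE(struct thing, thing_key, thing_hash, thing_eq, thing_map);
 * we now have to state the duplicate policy explicitly: */
HTABLE_DEFINE_NODUPS_TYPE(struct thing, thing_key, thing_hash, thing_eq, thing_map);
```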
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, this lets you find the exact htlc_maximum_msat/htlc_minimum_msat
values.
This means we actually create real channel_updates for local mods, which
requires a second "local" scratch region.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since we don't compact the gossmap on the fly (FIXME!) we can
easily surpass 4GB in the gossmap, and 32 bit offsets are not
sufficient.
I'm a bit surprised we don't crash immediately, but we've definitely
seen issues.
Changelog-Fixed: gossipd: crash errors with large gossip_store (>4GB) growth on longer-running nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It was weird not to have a capacity associated with localmods channels, and
fixing it has some very nice side effects.
Now the gossmap_chan_get_capacity() call never fails (we prevented reading
of channels from gossmap in the partially-written case already), so we
make it return the capacity. We do this in msat, because that's what
all the callers want.
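A sketch of a caller after the change (assuming the updated function simply
returns a struct amount_msat):
```
#include <common/amount.h>
#include <common/gossmap.h>

/* No failure path and no out-parameter: the capacity is just a value. */
static bool chan_large_enough(const struct gossmap *map,
			      const struct gossmap_chan *chan,
			      struct amount_msat amount_to_send)
{
	struct amount_msat cap = gossmap_chan_get_capacity(map, chan);
	return amount_msat_greater_eq(cap, amount_to_send);
}
```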
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is actually what we want in several places: to only override one or
two fields in a channel_update.
We add a gossmap_local_setchan() with a similar API to the old
gossmap_local_updatechan(), for the case where we want to set every
field.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We allow adding them, but crash when we remove the localmods. Yet
this could theoretically happen anyway, if a channel we modified was
removed from the gossmap.
Reported-by: Lagrang3 <lagrang3@protonmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This simplifies the callers significantly: all channel_announcements now
have an amount, so gossmap_chan_get_capacity() only fails on a local
modification.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we needed to iterate forward to find a timestamp (which only happens if we
have gossip older than 2 hours), we never exited the loop, as the iteration
didn't actually move the offset.
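Schematically (all names hypothetical, not the real gossmap.c code):
```
/* The scan over too-old records never advanced the offset it was testing,
 * so the loop could not terminate; the fix is to actually move forward and
 * stop once we stop making progress. */
while (timestamp_too_old(map, off)) {
	size_t next = next_record_off(map, off);
	if (next == off || next == 0)
		break;
	off = next;
}
```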
Fixes: https://github.com/ElementsProject/lightning/issues/7462
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This seems to be happening to some people, so don't panic. Unfortunately we
don't have a good error callback here, so we just print a message to stderr.
Fixes: https://github.com/ElementsProject/lightning/issues/7249
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We only write these in two places: one where we get a message from lightningd about
our own channel, and one where we get a reply from lightningd about a txout check.
In the former case we explicitly check that we don't already have it in
gossmap, so add the same check to the latter case, and give verbose detail
if it is found.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's a u64; we should pass it by copy. This is a big sweeping change,
but mainly mechanical (change one, compile, fix breakage, repeat).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Wrote a test program which passed num_channel_updates_rejected as NULL
(which we don't usually do), and valgrind complained:
```
==1048302== Conditional jump or move depends on uninitialised value(s)
==1048302== at 0x118B90: update_channel (gossmap.c:550)
==1048302== by 0x119EEE: map_catchup (gossmap.c:663)
==1048302== by 0x11A299: load_gossip_store (gossmap.c:726)
==1048302== by 0x11A352: gossmap_load (gossmap.c:1052)
==1048302== by 0x125362: main (run-route-infloop.c:90)
```
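A hypothetical illustration of the fix shape (not the exact gossmap.c change):
```
/* Keep the running count in a value that is always initialised, and only
 * write through the caller's pointer if one was actually supplied. */
static void report_rejects(size_t rejected_so_far,
			   size_t *num_channel_updates_rejected)
{
	if (num_channel_updates_rejected)
		*num_channel_updates_rejected = rejected_so_far;
}
```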
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Thanks to amazing debugging assistance from grubles, we figured out
that indeed, my memory was correct: write and mmap are not consistent
on all platforms. The easiest fix is to disable mmap on OpenBSD for now:
the better fix is to do in-place updates using the mmap, and only rely
on write() for append (which always causes a remap anyway before it's accessed).
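A sketch of the "disable mmap" approach (field names approximate, and assuming
gossmap already falls back to pread() when there is no mapping):
```
#ifdef __OpenBSD__
	/* Never map the file: every read goes through the pread() fallback,
	 * which stays consistent with what write() appended. */
	map->mmap = NULL;
#else
	map->mmap = mmap(NULL, map->map_size, PROT_READ, MAP_SHARED, map->fd, 0);
	if (map->mmap == MAP_FAILED)
		map->mmap = NULL;
#endif
```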
Fixes: https://github.com/ElementsProject/lightning/issues/7109
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We never enabled it, because we seemed to be eliminating valid
channels. We discard zombie-marked records on loading.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, allow callers to see unknown records we ignore (and let
them fail as a result), and get called if we can't pack a
channel_update into our internal format.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The only way you'll see private channel_updates is if you put them
there yourself with localmods.
I also renamed the confusing gossmap_chan_capacity to gossmap_chan_has_capacity.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Doesn't happen on x86, but struct gossmap_chan defines:
```
u32 private: 1;
u32 plus_scid_off: 31;
```
And valgrind complains when we initialize plus_scid_off and access it later:
```
VALGRIND=1 valgrind -q --error-exitcode=7 --track-origins=yes --leak-check=full --show-reachable=yes --errors-for-leak-kinds=all plugins/renepay/test/run-mcf > /dev/null
==186886== Conditional jump or move depends on uninitialised value(s)
==186886== at 0x10076388: chan_iter (gossmap.c:1098)
==186886== by 0x100797F3: gossmap_next_chan (gossmap.c:1112)
==186886== by 0x1008C5AF: main (run-mcf.c:309)
==186886== Uninitialised value was created by a heap allocation
==186886== at 0x40F0A44: malloc (vg_replace_malloc.c:431)
==186886== by 0x10072BAF: allocate (tal.c:256)
==186886== by 0x100737A7: tal_alloc_ (tal.c:463)
==186886== by 0x100738DF: tal_alloc_arr_ (tal.c:506)
==186886== by 0x10079507: load_gossip_store (gossmap.c:690)
==186886== by 0x10079667: gossmap_load (gossmap.c:978)
==186886== by 0x1008C4AF: main (run-mcf.c:295)
```
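A rough illustration of why valgrind minds, and the hypothetical fix shape:
```
/* Storing into the 31-bit field is a read-modify-write of the whole
 * underlying word, so the neighbouring `private` bit stays undefined on a
 * fresh allocation.  Defining every bit before the word is read keeps
 * valgrind quiet: */
chan->private = false;		/* previously left uninitialised */
chan->plus_scid_off = scid_off;
```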
Reported-by: @grubles
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: #6557
This will fix a crash that I caused on armv7.
By looking inside the coredump with gdb (after adding an assert that n must
not be null), I get the following stacktrace:
```
(gdb) bt
#0 0x00000000 in ?? ()
#1 0x0043a038 in send_backtrace (why=0xbe9e3600 "FATAL SIGNAL 11") at common/daemon.c:36
#2 0x0043a0ec in crashdump (sig=11) at common/daemon.c:46
#3 <signal handler called>
#4 0x00406d04 in node_announcement (map=0x938ecc, nann_off=495146) at common/gossmap.c:586
#5 0x00406fec in map_catchup (map=0x938ecc, num_rejected=0xbe9e3a40) at common/gossmap.c:643
#6 0x004073a4 in load_gossip_store (map=0x938ecc, num_rejected=0xbe9e3a40) at common/gossmap.c:697
#7 0x00408244 in gossmap_load (ctx=0x0, filename=0x4e16b8 "gossip_store", num_channel_updates_rejected=0xbe9e3a40) at common/gossmap.c:976
#8 0x0041a548 in init (p=0x93831c, buf=0x9399d4 "\n\n{\"jsonrpc\":\"2.0\",\"id\":\"cln:init#25\",\"method\":\"init\",\"params\":{\"options\":{},\"configuration\":{\"lightning-dir\":\"/home/vincent/.lightning/testnet\",\"rpc-file\":\"lightning-rpc\",\"startup\":true,\"network\":\"te"..., config=0x939cdc) at plugins/topology.c:622
#9 0x0041e5d0 in handle_init (cmd=0x938934, buf=0x9399d4 "\n\n{\"jsonrpc\":\"2.0\",\"id\":\"cln:init#25\",\"method\":\"init\",\"params\":{\"options\":{},\"configuration\":{\"lightning-dir\":\"/home/vincent/.lightning/testnet\",\"rpc-file\":\"lightning-rpc\",\"startup\":true,\"network\":\"te"..., params=0x939c8c)
at plugins/libplugin.c:1208
#10 0x0041fc04 in ld_command_handle (plugin=0x93831c, toks=0x939bec) at plugins/libplugin.c:1572
#11 0x00420050 in ld_read_json_one (plugin=0x93831c) at plugins/libplugin.c:1667
#12 0x004201bc in ld_read_json (conn=0x9391c4, plugin=0x93831c) at plugins/libplugin.c:1687
#13 0x004cb82c in next_plan (conn=0x9391c4, plan=0x9391d8) at ccan/ccan/io/io.c:59
#14 0x004cc67c in do_plan (conn=0x9391c4, plan=0x9391d8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
#15 0x004cc6dc in io_ready (conn=0x9391c4, pollflags=1) at ccan/ccan/io/io.c:417
#16 0x004cf8cc in io_loop (timers=0x9383c4, expired=0xbe9e3ce4) at ccan/ccan/io/poll.c:453
#17 0x00420af4 in plugin_main (argv=0xbe9e3eb4, init=0x41a46c <init>, restartability=PLUGIN_STATIC, init_rpc=true, features=0x0, commands=0x6167e8 <commands>, num_commands=4, notif_subs=0x0, num_notif_subs=0, hook_subs=0x0, num_hook_subs=0, notif_topics=0x0, num_notif_topics=0) at plugins/libplugin.c:1891
#18 0x0041a6f8 in main (argc=1, argv=0xbe9e3eb4) at plugins/topology.c:679
```
I do not know if this is the right solution, because I do not know
when we can parse a node announcement for a node that
is no longer in the gossip map.
So, I hope this is at least useful for @rustyrussell.
Changelog-Fixed: fixes `FATAL SIGNAL 11` on gossmap node announcement parsing.
Signed-off-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
It's actually two separate u16 fields, so treat it as
such!
Cleans up zombie handling code a bit too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Though BOLT 7 says a channel may be pruned when one side becomes inactive
and fails to refresh their channel_update, in practice, the
channel_announcement can be difficult to recover if deleted entirely.
Here the channel_announcement is tagged as zombie such that gossip_store
consumers may safely ignore it, but it may be retained should the channel
come back online in the future. Node_announcements and channel_updates may
also be retained in such a fashion until the channel is ready to be
resurrected.
Changelog-Fixed: Pruned channels are more reliably restored.
This is needed for offers to generate blinded paths.
No documentation changes since listincoming is an undocumented
internal hack interface which topology presents for production
of routehints.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We will now simply reject old-style ones as invalid. Turns out the
only trace we could find is a channel between two nodes unconnected to
the rest of the network.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Protocol: We now require all channel_update messages include htlc_maximum_msat (as per latest BOLTs)
Many changes to gossmap (including the pending ones!) don't actually
concern readers, as long as they obey certain rules:
1. Ignore unknown messages.
2. Treat all 16 upper bits of length as flags, ignore unknown ones.
So now we split the version byte into MAJOR and MINOR, and you can
ignore MINOR changes.
We don't expose the internal version (for creating the map)
programmatically: you should really hardcode what major version you
understand!
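A sketch of what a reader does (the real constants and bit split live in
gossip_store.h; assume here the top bits are MAJOR and the low bits MINOR):
```
#define STORE_MAJOR(verbyte)	((verbyte) >> 5)
#define STORE_MINOR(verbyte)	((verbyte) & 0x1F)

/* Refuse only on a MAJOR we don't know; an unknown MINOR is fine, because
 * rules 1 and 2 above make those changes safe to ignore. */
if (STORE_MAJOR(version) != STORE_MAJOR(version_we_understand))
	return false;
```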
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
They are surprisingly expensive!
Running `time ./plugins/renepay/test/run-not_mcf-gossmap gossip_store-sgl.rustcorp.com.au-2022-04-19 024b9a1fa8e006f1e3937f65f66c408e6da8e1ca728ea43222a7381df1cc449605 02ebb3b8a2316b3e876ea3f3d8124a3ab97f30b128f619608eb06b5251235dc2d9 10000000000 0.1`:
Before (-Og):
real 0m1.495s
Before (no opt):
real 0m2.552s
After (-Og):
real 0m0.579s
After (no opt):
real 0m1.061s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>