Reported by Grubles on ARM64:
```
VALGRIND=1 valgrind -q --error-exitcode=7 --track-origins=yes --leak-check=full --show-reachable=yes --errors-for-leak-kinds=all common/test/run-tal_arr_randomize > /dev/null
==151138== Source and destination overlap in memcpy(0x4d69f08, 0x4d69f08, 8)
==151138== at 0x48CB68C: __GI_memcpy (vg_replace_strmem.c:1147)
==151138== by 0x41B50B: tal_arr_randomize_ (pseudorand.c:84)
==151138== by 0x41BB07: main (run-tal_arr_randomize.c:166)
==151138==
make: *** [Makefile:750: unittest/common/test/run-tal_arr_randomize] Error 7
```
It is correct: you can't overlap src and dst in memcpy. It *probably* works in
this case, but it's undefined!
Fixes: https://github.com/ElementsProject/lightning/issues/7030
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
After analyzing various weird cases where we ended up with duplicate
gossip_store entries, it could be explained by us not fully processing
the gossip store.
It's not clear that my assumptions that we would always see our own writes
are true: technically this may require an fsync(). So we now add the
check, and do an fsync and try again.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
Instead of making a copy.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
No checksum check: 194.52s
Copying for checksum check: 202.81s
Zero-copy checksum check: 194.40s
But these numbers proved noisy. Still, doesn't hurt.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We assume if it's incorrect, we simply need to wait. If this proves incorrect,
we will see a stream of BROKEN log messages.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
Before: 194.52s
After: 202.81s
So it's marginal.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
While this shouldn't happen, it does (pending other fixes), and we stop reading the
gossip store until next time. The result is partial gossip, demonstrated beautifully
by NicolasDorier's report:
```
lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!"
```
Gossipd stalld there and don't make more progress. So gossipd itself
doesn't see the entire gossip_store.
Then things get really batshit:
```
2025-02-04T05:53:28.582Z DEBUG gossipd: Store compact time: 1429910 msec
```
This took 1429 seconds to process. Why?
Because it hasn't been processing the gossip store fully, gossipd kept adding "new" records to the end:
```
2025-02-04T05:53:28.583Z DEBUG gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true)
```
It has 31GB of gossip in there! No wonder it took so long...
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/8035
Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.
Default goes to stderr for LOG_UNUSUAL and higher.
We have to whitelist more cases in map_catchup so we don't spam the logs
with perfectly-expected (but ignored) messages though.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We only use it in one place, and that was simply to share an fd between
gossipd writing and gossipd reading, which may be causing our zfs problem
anyway.
In fact, it fixes a race if we don't have HAVE_PWRITEV.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have a report of this happening under ZFS. We cannot do much if
this really is a problem where we can't read back what we write, but
this avoids the immediate crash.
Fixes: https://github.com/ElementsProject/lightning/issues/7971
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
I'd been resisting this, but some tests really do want "hang up after
*receiving* this", and it's more reliable than getting the peer to
"hang up afer *sending* this" which seems not always work.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The updated API requires typed htables to explicitly state whether they
allow duplicates: for most cases we don't, but we've had issues in the
past.
This is a big patch, but mainly mechanical.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I wanted to make sure we didn't have a bug in our htable routine,
so I wrote this test. We don't, but leave it in.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We ratelimited DEBUG messages, but that can be annoying and cause us to miss things.
We demoted the worst offenders in the last release, to TRACE level.
Now, only log trace if it's wanted, and never suppress DEBUG.
Changelog-Changed: Logging: we no longer suppress DEBUG messages from subdaemons.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/7917
header sys/errno.h gets re-directed to errno.h leading to warning and
then failure so instead directly referencing the header
Changelog-None: change header
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
See: https://github.com/ElementsProject/lightning/issues/7899
A node with 23 connections gets far too many debug messages.
Changelog-Fixed: `gossipd` now does logging at trace, not debug level.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In fact, there are several places where we try to decode old invoices,
and they should all work. The only place we should enforce expiration is
when we're going to pay.
This also revealed that xpay wasn't checking bolt11 expiries!
Reported-by: hMsats
Fixes: https://github.com/ElementsProject/lightning/issues/7869
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: JSON-RPC: `decode` refused to decode expired bolt12 invoices.
It turns out that under some circumstances we end up clearing the
pointee of `current` but not the pointer. Thus when we select the next
slot we can end up reusing the same slot, making it its own parent.
We forcefull break these cycles by enforcing that `current` should
never be returned and be set as its own parent.
Changelog-None
Trace spans form a tree, but we don't actually check that the
structure doesn't break. Breakage can for example come if we use the
same key accidentally, making a new span its own ancestor.
We have the space in memory set aside anyway, so let's just copy the
`trace_id` into the span itself, rather than resolving the `root` at
time of emission.
After adding the DB query instrumentation we ran into a couple of
issues, with spans not being resumed correctly, and it was rather hard
to identify the problem. This adds debug statements so we can trace
the tracing (traception if you will).
Changelog-None
It is possible for prevtx to be larger than max packet size, so for shared outputs (currently only the funding tx) we add support for sending the `txid` only across the wire and filling in the prevtx locally.
Changelog-None
Added and updated error messages when splicing to make it more clear to the user why a splice is failing.
Changelog-Changed: Improved error messaging for splice commands.
We actually pruned before we got all the channels. Extend the pruning time,
which unfortunately makes the test slower.
```
2024-11-18T02:13:11.7013278Z node_factory = <pyln.testing.utils.NodeFactory object at 0x7ff72969e820>
2024-11-18T02:13:11.7014386Z bitcoind = <pyln.testing.utils.BitcoinD object at 0x7ff72968fe20>
2024-11-18T02:13:11.7014996Z
2024-11-18T02:13:11.7015271Z def test_gossip_pruning(node_factory, bitcoind):
2024-11-18T02:13:11.7016222Z """ Create channel and see it being updated in time before pruning
2024-11-18T02:13:11.7017037Z """
2024-11-18T02:13:11.7017871Z l1, l2, l3 = node_factory.get_nodes(3, opts={'dev-fast-gossip-prune': None,
2024-11-18T02:13:11.7018971Z 'allow_bad_gossip': True})
2024-11-18T02:13:11.7019634Z
2024-11-18T02:13:11.7020236Z l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
2024-11-18T02:13:11.7021153Z l2.rpc.connect(l3.info['id'], 'localhost', l3.port)
2024-11-18T02:13:11.7021806Z
2024-11-18T02:13:11.7022226Z scid1, _ = l1.fundchannel(l2, 10**6)
2024-11-18T02:13:11.7022886Z scid2, _ = l2.fundchannel(l3, 10**6)
2024-11-18T02:13:11.7023458Z
2024-11-18T02:13:11.7023907Z mine_funding_to_announce(bitcoind, [l1, l2, l3])
2024-11-18T02:13:11.7025183Z l1_initial_cupdate_timestamp = only_one(l1.rpc.listchannels(source=l1.info['id'])['channels'])['last_update']
2024-11-18T02:13:11.7026179Z
2024-11-18T02:13:11.7027358Z # Get timestamps of initial updates, so we can ensure they change.
2024-11-18T02:13:11.7028171Z # Channels should be activated locally
2024-11-18T02:13:11.7029326Z > wait_for(lambda: [c['active'] for c in l1.rpc.listchannels()['channels']] == [True] * 4)
```
We can see in logs, it actually started pruning already:
```
2024-11-18T02:13:11.9622477Z lightningd-1 2024-11-18T01:52:03.570Z DEBUG gossipd: Pruning channel 105x1x0 from network view (ages 1731894723 and 0)
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The `send_outreq` function is a good place to suspend and resume
traces, since these are usually the places where we hand off control
back to the `io_loop`. This assumes that we do not continue doing
heavy liftin after we have queued an `outreq` call, but that is most
likely the case anyway. This frees us from having to track suspensions
whenever we call the RPC from a plugin.
And use it for `exposesecret-passphrase`. This is probably overly
cautious, but it makes me feel a little better that we won't leak it
to someone with read-only access.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The command called “splice” can take a json payload or a ‘splice script’, process it into a list of ‘actions’ and then execute those actions.
These actions include or will include everything you would want to do with a splice:
* Splice into a channel
* Splice out of a channel
* Fund from wallet
* Deposit to wallet
* Send funds to bitcoin address
Changelog-Added: A new magic “splice” command is added that can take a ‘splice script’ or json payload and perform any complex splice across multiple channels merging the result into a single transaction. Some features are disabled and will be added in time.
This is the sister command of addpsbtoutput.
Adds inputs equal to or greater than the amount requests, reservers them, and reports important information back out to the user.
Changelog-Added: New low-level RPC command addpsbtinput to fund PSBTs directly and help with complex splices & dual-opens.