[Added version tag into documentation, to show it was added in 25.02 --RR]
Changelog-Added: Plugins: new nofitications `plugin_stopped` and `plugin_started`
Otherwise it appears to be a leak:
==612637== 11,264 bytes in 1 blocks are still reachable in loss record 1 of 1
==612637== at 0x484D953: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==612637== by 0x1301F2: trace_init (trace.c:153)
==612637== by 0x13065D: trace_span_start (trace.c:263)
==612637== by 0x173968: db_open_ (utils.c:367)
==612637== by 0x17AE43: create_test_wallet (run-wallet.c:1313)
==612637== by 0x17C726: test_shachain_crud (run-wallet.c:1548)
==612637== by 0x18300E: main (run-wallet.c:2329)
Changelog-None
In the time it takes connectd to flush the log message for
, gossipd can already have
the announcement sent.
lightningd-1 2025-02-11T15:26:06.745Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: peer_in WIRE_ANNOUNCEMENT_SIGNATURES
lightningd-2 2025-02-11T15:26:06.887Z DEBUG 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: peer_in WIRE_ANNOUNCEMENT_SIGNATURES
lightningd-1 2025-02-11T15:26:06.897Z TRACE gossipd: Received node_announcement for node 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518
lightningd-1 2025-02-11T15:26:06.915Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: peer_out WIRE_ANNOUNCEMENT_SIGNATURES
Changelog-None
Reported by Grubles on ARM64:
```
VALGRIND=1 valgrind -q --error-exitcode=7 --track-origins=yes --leak-check=full --show-reachable=yes --errors-for-leak-kinds=all common/test/run-tal_arr_randomize > /dev/null
==151138== Source and destination overlap in memcpy(0x4d69f08, 0x4d69f08, 8)
==151138== at 0x48CB68C: __GI_memcpy (vg_replace_strmem.c:1147)
==151138== by 0x41B50B: tal_arr_randomize_ (pseudorand.c:84)
==151138== by 0x41BB07: main (run-tal_arr_randomize.c:166)
==151138==
make: *** [Makefile:750: unittest/common/test/run-tal_arr_randomize] Error 7
```
It is correct: you can't overlap src and dst in memcpy. It *probably* works in
this case, but it's undefined!
Fixes: https://github.com/ElementsProject/lightning/issues/7030
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
After analyzing various weird cases where we ended up with duplicate
gossip_store entries, it could be explained by us not fully processing
the gossip store.
It's not clear that my assumptions that we would always see our own writes
are true: technically this may require an fsync(). So we now add the
check, and do an fsync and try again.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
We had at least one report of overwriting the gossip_store file at
offset 1. Make sure this doesn't happen.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's actually the only one that uses it. We also tweak the way
gossip_store handles failure: gossmap_manage now tells it when to
reset the corrupted store.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Instead of making a copy.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
No checksum check: 194.52s
Copying for checksum check: 202.81s
Zero-copy checksum check: 194.40s
But these numbers proved noisy. Still, doesn't hurt.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We assume if it's incorrect, we simply need to wait. If this proves incorrect,
we will see a stream of BROKEN log messages.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
Before: 194.52s
After: 202.81s
So it's marginal.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
While this shouldn't happen, it does (pending other fixes), and we stop reading the
gossip store until next time. The result is partial gossip, demonstrated beautifully
by NicolasDorier's report:
```
lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!"
```
Gossipd stalld there and don't make more progress. So gossipd itself
doesn't see the entire gossip_store.
Then things get really batshit:
```
2025-02-04T05:53:28.582Z DEBUG gossipd: Store compact time: 1429910 msec
```
This took 1429 seconds to process. Why?
Because it hasn't been processing the gossip store fully, gossipd kept adding "new" records to the end:
```
2025-02-04T05:53:28.583Z DEBUG gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true)
```
It has 31GB of gossip in there! No wonder it took so long...
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/8035
Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.
Default goes to stderr for LOG_UNUSUAL and higher.
We have to whitelist more cases in map_catchup so we don't spam the logs
with perfectly-expected (but ignored) messages though.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We only use it in one place, and that was simply to share an fd between
gossipd writing and gossipd reading, which may be causing our zfs problem
anyway.
In fact, it fixes a race if we don't have HAVE_PWRITEV.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have a report of this happening under ZFS. We cannot do much if
this really is a problem where we can't read back what we write, but
this avoids the immediate crash.
Fixes: https://github.com/ElementsProject/lightning/issues/7971
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
In 4e7ba96729 (ightningd: don't kill onchaind
if we are forcing a disconnect.) actually was something which happened
in our generate examples script.
This updates that.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
plugins/Makefile has target/${RUST_PROFILE}/cln-grpc depend on the
generated files via $(MSGGEN_GENALL), but cln-rpc/Makefile adds to
that variable, so needs to be included first.
Here's an example build error:
```
Combining schemas from /home/rusty/lightning-ltest/doc/schemas into /home/rusty/lightning-ltest/contrib/msggen/msggen/schema.json
Created /home/rusty/lightning-ltest/contrib/msggen/msggen/schema.json from 2 files
error: failed to run custom build command for `cln-grpc v0.3.0 (/home/rusty/lightning-ltest/cln-grpc)`
Caused by:
process didn't exit successfully: `/home/rusty/lightning-ltest/target/debug/build/cln-grpc-95489e3ba33c0ab3/build-script-build` (exit status: 101)
--- stdout
cargo:rerun-if-changed=proto/node.proto
cargo:rerun-if-changed=proto
--- stderr
thread 'main' panicked at cln-grpc/build.rs:7:10:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: "protoc failed: node.proto:134:52: \"AskreneageResponse\" is not defined.\nnode.proto:135:23: \"GetroutesRequest\" is not defined.\nnode.proto:135:50: \"GetroutesResponse\" is not defined.\nnode.proto:136:32: \"AskrenedisablenodeRequest\" is not defined.\nnode.proto:136:68: \"AskrenedisablenodeResponse\" is not defined.\nnode.proto:137:34: \"AskreneinformchannelRequest\" is not defined.\nnode.proto:137:72: \"AskreneinformchannelResponse\" is not defined.\nnode.proto:138:34: \"AskrenecreatechannelRequest\" is not defined.\nnode.proto:138:72: \"AskrenecreatechannelResponse\" is not defined.\nnode.proto:139:34: \"AskreneupdatechannelRequest\" is not defined.\nnode.proto:139:72: \"AskreneupdatechannelResponse\" is not defined.\nnode.proto:140:32: \"AskrenebiaschannelRequest\" is not defined.\nnode.proto:140:68: \"AskrenebiaschannelResponse\" is not defined.\nnode.proto:141:37: \"AskrenelistreservationsRequest\" is not defined.\nnode.proto:141:78: \"AskrenelistreservationsResponse\" is not defined.\nnode.proto:142:32: \"InjectpaymentonionRequest\" is not defined.\nnode.proto:142:68: \"InjectpaymentonionResponse\" is not defined.\nnode.proto:143:18: \"XpayRequest\" is not defined.\nnode.proto:143:40: \"XpayResponse\" is not defined.\nnode.proto:145:33: \"StreamBlockAddedRequest\" is not defined.\nnode.proto:145:74: \"BlockAddedNotification\" is not defined.\nnode.proto:146:40: \"StreamChannelOpenFailedRequest\" is not defined.\nnode.proto:146:88: \"ChannelOpenFailedNotification\" is not defined.\nnode.proto:147:36: \"StreamChannelOpenedRequest\" is not defined.\nnode.proto:147:80: \"ChannelOpenedNotification\" is not defined.\nnode.proto:148:30: \"StreamConnectRequest\" is not defined.\nnode.proto:148:68: \"PeerConnectNotification\" is not defined.\nnode.proto:149:32: \"StreamCustomMsgRequest\" is not defined.\nnode.proto:149:72: \"CustomMsgNotification\" is not defined.\n" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
make: *** [plugins/Makefile:305: target/debug/cln-grpc] Error 101
make: *** Waiting for unfinished jobs....
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We didn't update this when we extended the timeout to 120 seconds in
ee3133f198 ("lightningd: increase
startup time for plugins to 120 seconds.")
```
def test_failing_plugins(directory):
fail_plugins = [
os.path.join(os.getcwd(), 'contrib/plugins/fail/failtimeout.py'),
os.path.join(os.getcwd(), 'contrib/plugins/fail/doesnotexist.py'),
]
for p in fail_plugins:
> with pytest.raises(subprocess.CalledProcessError):
E Failed: DID NOT RAISE <class 'subprocess.CalledProcessError'>
tests/test_plugin.py:420: Failed
----------------------------- Captured stdout call -----------------------------
{'github_repository': 'ElementsProject/lightning', 'github_sha': '83dca18c5e9610bfaac766f957387b9a1ec48f50', 'github_ref': 'refs/pull/7887/merge', 'github_ref_name': 'HEAD', 'github_run_id': 13253210143, 'github_head_ref': 'guilt/bolt-updates-after-24.11', 'github_run_number': 12237, 'github_base_ref': 'master', 'github_run_attempt': '2', 'testname': 'test_failing_plugins', 'start_time': 1739239278, 'end_time': 1739239340, 'outcome': 'fail'}
=========================== short test summary info ============================
FAILED tests/test_plugin.py::test_failing_plugins - Failed: DID NOT RAISE <class 'subprocess.CalledProcessError'>
============= 1 failed, 80 passed, 2 skipped in 855.37s (0:14:15) ==============
```
We can actually delete it before counters are updated:
```
wait_for(lambda: len(l3.rpc.listinvoices()['invoices']) == 2)
> assert l3.rpc.autoclean_status()['autoclean']['expiredinvoices']['cleaned'] == 3
E assert 1 == 3
tests/test_plugin.py:3266: AssertionError
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We thought it was a good idea to terminate channel subds, but we included onchaind as well!
```
2025-01-29T21:31:46.053Z UNUSUAL 03…00-channeld-chan#255683: Adding HTLC 1737 too slow: killing connection
2025-01-29T21:31:46.053Z INFO 03…00-chan#255683: Peer transient failure in CHANNELD_NORMAL: Adding HTLC timed out: killed connection
2025-01-29T21:31:46.053Z DEBUG 03…00-channeld-chan#255683: Status closed, but not exited. Killing
2025-01-29T21:31:46.054Z DEBUG 03…00-chan#255683: Failing HTLC 1738 due to peer death
2025-01-29T21:31:46.058Z DEBUG 03…00-chan#255683: Failing HTLC 1737 due to peer death
2025-01-29T21:31:46.058Z DEBUG 03…00-chan#255673: Forcing disconnect due to One channel had an error
2025-01-29T21:31:46.059Z DEBUG 03…00-onchaind-chan#255673: Status closed, but not exited. Killing
2025-01-29T21:31:46.093Z DEBUG 03…00-connectd: disconnect_peer
2025-01-29T21:31:46.093Z DEBUG 03…00-lightningd: peer_disconnect_done
```
Reported-by: @whitslack
Fixes: https://github.com/ElementsProject/lightning/issues/8055
Changelog-Fixed: onchaind: don't die if we fail an unrelated channel with the same peer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Raspberry Pi, with bitcoind running and full gossip topology may actually hit
this, and we have a report in practice.
Note that the comment is wrong, so fix that too.
Fixes: https://github.com/ElementsProject/lightning/issues/7724
Reported-by: m-schmoock
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Currently, pyln tests fail if the `lightning-` prefix is removed from schema/*.json files. In this release, we will update pyln to remove its reliance on this prefix, and in the next release, we will remove the prefixes from the files as well.
Changelog-None.
Test flake where the balance for lightning-2 went negative
```
> assert account_balance(l2, channel_id) == 0
tests/test_closing.py:1314:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/utils.py:183: in account_balance
m_sum -= Millisatoshi(m['debit_msat'])
contrib/pyln-client/pyln/client/lightning.py:193: in __sub__
return Millisatoshi(int(self) - int(other))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = -10000msat, v = -10000
```
Led me to look into this test. lightning-2 should go negative since we
roll back the amounts it's received by going to a prior database state.
Rather than trying to do the right thing with obviously broken node
records, instead we just stop trying to account for them correctly
(impossible).
I also noticed that the anchor tests were failing the utxo output
matchup, which we should be asserting on it. The HTLC RBF that our
anchor code creates was causing an issue by creating another wallet
deposit utxo under the HTLC output. We now optionally add this utxo
in the case that anchors are turned on.
Changelog-None: Fix test flake
It's really hard to tell what on earth went wrong when a coin movement
check fails, since we dont' return good error info.
Here we replace almost every `assert` with a proper check + error with
message to help make debugging easier.
cc @rustyrussell
Changelog-None: improve failure messages