While the `k=8` value worked for the current main network tests with the
amounts in those tests, it wasn't robust across a wider range of values
(as demonstrated when other test changes broke tests!).
Time to do this properly: calculate the ratio at the time we combine them,
using median values.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Even after the previous fix, we still occasionally increase fees when my increases.
This is due to the difference between MCF's linear fees, and actual fees, and
is unavoidable, but add a check if it somehow happens.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I noticed this in the logs:
plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%...
plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%...
We would expect increasing mu to *reduce* the fee!
Turns out that our linear fee is a bad terrible approximation, because I
was using base_fee_penalty of 10.0.
|
| / __ <- real fee, with base: fee = base + propfee * amount.
| / __/
| _//
| __/
| __/_/
|/ _/
| _/ <- linearized fee: fee = linear * amount
|/
+-----------------------------------
These cross over where linear = propfee + base / amount. Assume we split the
payment into 10 parts, this implies that the base_fee_penalty should be 10 / amount
(this gives a slight penalty to the normal case, but that's ok).
This gives better results, too: we get down to 650099 sats in fees, vs 801613
before.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
During "test_real_data", then only successes with reduced fees were 92 on "mu=10", and only
1 on "mu=30": the rest went to mu=100 and failed.
I tried numerous approaches, and in the end, opted for the simplest:
The typical range of probability costs looks likes:
min = 0, max = 924196240, mean = 10509.4, stddev = 1.9e+06
The typical range of linear fee costs looks like:
min = 0, max = 101000000, mean = 81894.6, stddev = 2.6e+06
This implies a k factor of 8 makes the two comparable.
This makes the two numbers comparable, and thus makes "mu" much more
effective. Here are the number of different mu values we succeeded at:
87 mu=0
90 mu=10
42 mu=20
24 mu=30
17 mu=40
19 mu=50
19 mu=60
11 mu=70
95 mu=80
19 mu=90
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The current prob_cost_factor setting does not seem to make mu very
effective, in fact, it gives strange results:
plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%...
plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%...
We would expect increasing mu to *reduce* the fee!
As a first step, simplify (it can't be infinite, and the -1 are weird).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We ask it again, but reduce fees by 1msat from the previous answer.
This is really nasty, as it frequently exercises the case where we
only go over fee when we do the refinement step.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I tested with a really large gossmap (hacked to be 4GB), and when we
keep retrying to minimize cost (calling minflow 11 times), and we
don't free tmpctx.
Due to an issue with how gossmap estimates the index sizes, we ended
up running out of memory. This fixes it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Flow cycles can occur if we have arc zero arc costs.
The previous path construction from the flow in the network assumed the
absence of such cycles and would enter an infinite loop if it hit one.
With his patch wee add cycle detection and removal during the path
construction phase.
Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
Changelog-EXPERIMENTAL: `askrene` infinite loop fixed
This happens in the coming "real network" test!
We add fees and hit htlc_max, but don't have another flow to add to.
Rather than MCF again, we split the flow into two.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The fp16_t values are approximations (overestimate for htlc_max,
underestimate for htlc_min), so in the refinement step we should use
the exact values.
This also fixes a logic bug: flow_remaining_capacity returned the
total capacity, not the additional capacity!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `askrene` now honors exact htlc_maximum_msat limits.
In particular, this lets you find the exact htlc_maximum_msat/htlc_minimum_msat
values.
This means we actually create real channel_updates for local mods, which
requires a second "local" scratch region.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `getroutes` now applies `auto.sourcefree` layer in the order specified, so doesn't alter channels changed in later layers.
Rather than adding to the gossmap modifications directly, populate
the layer and have the normal layer application logic do it.
This is consistent when we query layers in the next patch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
And unify logging for better debugging.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `askrene` now has better logging, gives notifications of progress.
The "init_cupdate" message is for gossipd to tell lightningd about our *own* latest
channel_update messages, not the remote ones! The "remote_channel_update" message
is for messages from the peer.
This appeared as an occasional BROKEN message in CI:
```
**BROKEN** 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-chan#4: gossipd gave us channel_update for channel in gossip_state CGOSSIP_NEED_PEER_SIGS
```
Where we had sent (and not received) announcement_signatures, and restarted: the peer had meanwhile
sent us their channel_announcement and channel_update.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: Protocol: we could get confused on restart and not re-transmit our own channel_updates.
tmpctx may not get cleaned immediately, so the timeout (a child of
the struct early_peer at this point) can still outlast the conn.
Do the clearer thing, and explicitly free the timeout.
Changelog-Fixed: connectd: crash on erroneous timeout.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Whenever we build without `make clean` first, this file gets
overwritten. It asks the user for input because it is not run with
`-f`. This causes the build to hang until the user inputs `y`.
Here is an example build:
```
$ make -j7 && sudo make install
CC: cc -DCLN_NEXT_VERSION="v24.08" -DPKGLIBEXECDIR="/usr/local/libexec/c-lightning" -DBINDIR="/usr/local/bin" -DPLUGINDIR="/usr/local/libexec/c-lightning/plugins" -DCCAN_TAL_NEVER_RETURN_NULL=1 -Wall -Wundef -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wold-style-definition -Werror -Wno-maybe-uninitialized -Wshadow=local -std=gnu11 -g -fstack-protector-strong -Og -I ccan -I external/libwally-core/include/ -I external/libwally-core/src/secp256k1/include/ -I external/jsmn/ -I external/libbacktrace/ -I external/gheap/ -I external/build-x86_64-linux-gnu/libbacktrace-build -I external/libsodium/src/libsodium/include -I external/libsodium/src/libsodium/include/sodium -I external/build-x86_64-linux-gnu/libsodium-build/src/libsodium/include -I . -I/usr/local/include -I/usr/include/postgresql -DSHACHAIN_BITS=48 -DJSMN_PARENT_LINKS -DCOMPAT_V052=1 -DCOMPAT_V060=1 -DCOMPAT_V061=1 -DCOMPAT_V062=1 -DCOMPAT_V070=1 -DCOMPAT_V072=1 -DCOMPAT_V073=1 -DCOMPAT_V080=1 -DCOMPAT_V081=1 -DCOMPAT_V082=1 -DCOMPAT_V090=1 -DCOMPAT_V0100=1 -DCOMPAT_V0121=1 -c -o
LD: cc -Og config.vars -Lexternal/build-x86_64-linux-gnu -lwallycore -lsecp256k1 -ljsmn -lbacktrace -lsodium -L/usr/local/include -lm -lsqlite3 -L/usr/lib/x86_64-linux-gnu -lpq -o
mv: replace 'tools/headerversions', overriding mode 0755 (rwxr-xr-x)? cc lightningd/test/run-check_node_announcement.c
cc lightningd/test/run-find_my_abspath.c
cc lightningd/test/run-invoice-select-inchan.c
cc lightningd/test/run-jsonrpc.c
cc lightningd/test/run-log_filter.c
cc lightningd/test/run-log-pruning.c
cc lightningd/test/run-shuffle_fds.c
ld lightningd/test/run-find_my_abspath
ld lightningd/test/run-log-pruning
ld lightningd/test/run-check_node_announcement
ld lightningd/test/run-log_filter
ld lightningd/test/run-jsonrpc
ld lightningd/test/run-shuffle_fds
ld lightningd/test/run-invoice-select-inchan
^Cmake: *** [tools/Makefile:16: tools/headerversions] Interrupt
```
One workaround is to just know that you need to enter `y` here.
But the best solution is probably to update the Makefile like so.
Changelog-None
Multiple places in the pay lifecycle depend on mods to be set. By
setting the mods directly after the first listpeerchannels call,
subsequent calls to listpeerchannels are avoided.
Changelog-Fixed: pay-plugin: only call listpeerchannels once during a
payment lifecycle.
Since we don't compact the gossmap on the fly (FIXME!) we can
easily surpass 4GB in the gossmap, and 32 bit offsets are not
sufficient.
I'm a bit surprised we don't crash immediately, but we've definitely
seen issues.
Changelog-Fixed: gossipd: crash errors with large gossip_store (>4MB) growth on longer-running nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For nodes with many channels this is a tremendous improvement in pay
performance. PR #7611 improves payment performance from 15 seconds to
13.5 seconds on one of our nodes. This commit improves payment
performance from 13.5 seconds to 5.7 seconds.
Changelog-Fixed: Improved pathfinding speed for large nodes.
This minimizes the need to convert back and forth from and to sat
values, and it also removes a new instance of sats in the public
interface (`channel_hints`).
Suggested-By: Rusty Russell <@rustyrussell>
We were ignoring more reliable information because of the
stricter-is-better logic. This meant that we were also ignoring local
channel information, despite knowing better.
By simplifying the logic to pick the one with the newer timestamp we
ensure that later observations always trump earlier ones. There is a
bit of interplay between the relaxation of the constraints updating
the timestamp, and comparing to real observation timestamps, but that
should not impact us for local channels.
It was failing because the channel_hint from one attempt would prevent
us from retrying. By changing the amounts so that the channel_hints do
not concern them (value smaller than estimate) we can make things work
as before again.
A failing payment would doom all subsequent ones. Now we step down the
amount a single satoshi so any prior channel_hints do not doom the
payment outright.
Changelog-None
We were using `channel_hint` to temporarily tweak the graph inside of
a payment. However, these ad-hoc `channel_hints` are stickier than
their predecessors, in that they outlive the payment attempt itself,
and interfere with later ones.
Changelog-Changed: pay: Discarding an overly long or expensive route does not blocklist channels anymore.
If we have a large channel, fail to send through a small amount, and
then add a `channel_hint`, then it can happen that the call to
`channel_hint_set_update` is already late enough that it refills the
1msat we removed from the attempted amount, thus making the edge we
just excluded eligible again, which can lead into an infinite loop.
We slow down the updating of the channel_hints to once every
hysteresis timeout. This allows us to set tight constraints, while not
incurring in the accidental relaxation issue.