Commit graph

31 commits

Author SHA1 Message Date
Lagrang3
05514b46e3 Askrene: change median factor to 1.
The ratio of the median of the fees and probability cost is overall not
a bad factor to combine these two features. This is what the
test_real_data shows.

Changelog-None

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
2b3fd67dfb askrene: don't skip fee_fallback test
The fee_fallback test would fail after fixing the computation of the
median. Now we can restore it by making the probability cost factor
1000x higher than the ratio of the median. This shows how hard it is to
combine fee and probability costs and why the current approach is so
fragile.

Changelog-None

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
44c9609f3a askrene: add arbitrary precision flow unit
Changelog-none: askrene: add arbitrary precision flow unit

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
4dc1a44cd9 askrene: fix the median
The calculation of the median values of probability and fee cost in the
linear approximation had a bug by counting on non-existing arcs.

Changelog-none: askrene: fix the median

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
ee623616d2 askrene: fix CI
Check the return value of scanf in askrene unit tests.

Changelog-none: askrene: fix CI

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
937cf7a554 askrene: use the new MCF solver
Changelog-none: askrene: use the new MCF solver

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Rusty Russell
b2dcf7248d askrene: add askrene-bias-channel.
This lets you place annotated biases on channels, to influence routing.

Uses include avoiding TOR nodes, slow channels or other local preferences.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-None: askrene is new anyway.
2024-11-08 21:48:55 +10:30
Rusty Russell
2a0f09fc2d askrene: calculate k value dynamically, using medians.
While the `k=8` value worked for the current main network tests with the
amounts in those tests, it wasn't robust across a wider range of values
(as demonstrated when other test changes broke tests!).

Time to do this properly: calculate the ratio at the time we combine them,
using median values.
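
A minimal sketch of the idea, in Python for brevity (the plugin itself is C).
The names `median_ratio_k` and `combined_cost`, the direction of the ratio
(fee median over probability median) and the blending formula are all
assumptions for illustration; the message only says the ratio is computed
from medians at the time the two costs are combined.

```python
from statistics import median

def median_ratio_k(fee_costs, prob_costs):
    """Return k such that k * prob_cost is on the same scale as fee_cost.

    Assumption: k = median(fee) / median(prob). Falls back to 1.0 when the
    probability median is zero, to avoid dividing by zero.
    Both lists must be non-empty and cover only arcs that really exist.
    """
    fee_med = median(fee_costs)
    prob_med = median(prob_costs)
    if prob_med == 0:
        return 1.0
    return fee_med / prob_med

def combined_cost(fee_cost, prob_cost, k, mu):
    """Hypothetical blend: mu is 0..100, higher mu weights fees more."""
    return mu * fee_cost + (100 - mu) * k * prob_cost
```

Computing k per query rather than hard-coding it is what makes the blend
independent of the payment amounts used in any one test.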

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
32aa79a1e2 askrene: debug and check we actually reduce fees when mu increases.
Even after the previous fix, we still occasionally increase fees when mu increases.

This is due to the difference between MCF's linear fees, and actual fees, and
is unavoidable, but add a check if it somehow happens.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
08df93cb25 askrene: fix base fee.
I noticed this in the logs:

	plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%...
	plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%...

We would expect increasing mu to *reduce* the fee!

Turns out that our linear fee is a terrible approximation, because I
was using base_fee_penalty of 10.0.

 |
 |          /   __ <- real fee, with base: fee = base + propfee * amount.
 |         / __/
 |       _//
 |    __/
 | __/_/
 |/  _/
 | _/ <- linearized fee: fee = linear * amount
 |/
 +-----------------------------------

These cross over where linear = propfee + base / amount.  If we assume the
payment is split into 10 parts, this implies that the base_fee_penalty should
be 10 / amount (this gives a slight penalty to the normal case, but that's ok).

This gives better results, too: we get down to 650099 sats in fees, vs 801613
before.
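
A worked sketch of the crossover, in Python for brevity (the plugin itself is
C). The function names and the ppm units are assumptions for illustration;
only the relation linear = propfee + base / amount and the 10-part split come
from the message above.

```python
def real_fee(base_msat, prop_ppm, amount_msat):
    """Actual channel fee: base plus proportional part (ppm of amount)."""
    return base_msat + prop_ppm * amount_msat / 1_000_000

def linear_fee_rate(base_msat, prop_ppm, total_amount_msat, parts=10):
    """Per-msat linear rate that equals the real fee at total / parts.

    Crossover: linear = propfee + base / amount, evaluated at the expected
    part size, which is where the 'parts / amount' penalty comes from.
    """
    part = total_amount_msat / parts
    return prop_ppm / 1_000_000 + base_msat / part
```

For base 1000 msat, 100 ppm and a 10,000,000 msat payment, the linear rate
times one 1,000,000 msat part gives the same fee as `real_fee` for that part;
smaller parts are undercharged by the linear model and larger ones
overcharged.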

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
6273adbe47 askrene: calculate prob_cost_factor using ratio of typical mainnet channel.
During "test_real_data", the only successes with reduced fees were 92 on "mu=10", and only
1 on "mu=30": the rest went to mu=100 and failed.

I tried numerous approaches, and in the end, opted for the simplest:

The typical range of probability costs looks like:
	min = 0, max = 924196240, mean = 10509.4, stddev = 1.9e+06

The typical range of linear fee costs looks like:
	min = 0, max = 101000000, mean = 81894.6, stddev = 2.6e+06

This implies a k factor of 8 makes the two comparable.
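
The message does not say how 8 was derived. One reading consistent with the
figures quoted above is the ratio of the two means, sketched here in Python
(variable names are illustrative):

```python
mean_prob_cost = 10509.4   # mean probability cost quoted above
mean_fee_cost = 81894.6    # mean linear fee cost quoted above

# Scale factor that brings the probability cost up to the fee cost's scale.
k = mean_fee_cost / mean_prob_cost
print(round(k, 2))  # 7.79, i.e. roughly 8
```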

With the two numbers on the same scale, "mu" becomes much more
effective.  Here is how many times we succeeded at each mu value:

     87  mu=0
     90  mu=10
     42  mu=20
     24  mu=30
     17  mu=40
     19  mu=50
     19  mu=60
     11  mu=70
     95  mu=80
     19  mu=90

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
4897286c25 mcf: simplify mu -> cost translation.
The current prob_cost_factor setting does not seem to make mu very
effective, in fact, it gives strange results:

	plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%...
	plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%...

We would expect increasing mu to *reduce* the fee!

As a first step, simplify (it can't be infinite, and the -1 are weird).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
f17c5f5a6b askrene: don't use tmpctx in minflow()
I tested with a really large gossmap (hacked to be 4GB), and when we
keep retrying to minimize cost (calling minflow 11 times), we
don't free tmpctx.

Due to an issue with how gossmap estimates the index sizes, we ended
up running out of memory.  This fixes it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Lagrang3
bd8cc1fb1f askrene: detect and cancel flow cycles
Flow cycles can occur if we have zero arc costs.
The previous path construction from the flow in the network assumed the
absence of such cycles and would enter an infinite loop if it hit one.

With this patch we add cycle detection and removal during the path
construction phase.
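
A minimal sketch of cycle cancellation during path construction, in Python
for brevity (the plugin itself is C). The function name and the dictionary
representation of the flow are assumptions for illustration, not the
patch's data structures.

```python
def cancel_cycles_and_extract_path(flow, source, sink):
    """Walk from source to sink along arcs with positive flow.

    `flow` maps (u, v) -> amount and is modified in place. If the walk
    revisits a node, the arcs since its first visit form a cycle: subtract
    the cycle's bottleneck from each of them and resume from that node.
    Returns the nodes of one source->sink path, or None on a dead end.
    """
    path = [source]
    index = {source: 0}  # node -> position in path
    while path[-1] != sink:
        u = path[-1]
        nxt = next((v for (a, v), f in flow.items() if a == u and f > 0), None)
        if nxt is None:
            return None  # no flow leaves u
        if nxt in index:
            start = index[nxt]
            cycle_nodes = path[start:] + [nxt]
            arcs = list(zip(cycle_nodes, cycle_nodes[1:]))
            bottleneck = min(flow[a] for a in arcs)
            for a in arcs:
                flow[a] -= bottleneck  # at least one arc drops to zero
            for n in path[start + 1:]:
                del index[n]
            del path[start + 1:]  # rewind the walk to the repeated node
            continue
        index[nxt] = len(path)
        path.append(nxt)
    return path
```

Each cancellation zeroes at least one arc, so the walk cannot loop forever
even when the cycle costs nothing.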

Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
Changelog-EXPERIMENTAL: `askrene` infinite loop fixed
2024-10-15 09:58:04 +10:30
Rusty Russell
bb3663c4a0 askrene: ignore disabled channels for min-cost-flow.
We also set htlc_max to 0 when disabling, so the tests worked, but
this is correct.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
0a23c63d37 askrene: optimize, by calling tal_count less.
I like the clarity, but this is a hot path.  Fortunately these arrays
have very well defined lengths.

Before: 5.81 seconds
After: 1.06 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 08:47:53 +09:30
Rusty Russell
9f0c0e1cca askrene: use a simple array as our queue.
We only ever visit each node once, so we can just use an array.  This
avoids calling tal() all the time, which is *especially* slow when we're
memory tracking.
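
A minimal sketch of the technique, in Python for brevity (the plugin itself
is C). It assumes a breadth-first traversal over an adjacency list; the
message does not say which traversal is involved, and `bfs_order` is a
hypothetical name.

```python
def bfs_order(adjacency, source):
    """Breadth-first visit using a fixed-size list as the queue.

    Each node is enqueued at most once, so len(adjacency) slots with
    head/tail indexes suffice: no per-item allocation is needed.
    `adjacency[u]` lists the neighbours of node u.
    """
    n = len(adjacency)
    queue = [0] * n
    visited = [False] * n
    head = tail = 0
    queue[tail] = source
    tail += 1
    visited[source] = True
    while head < tail:
        u = queue[head]
        head += 1
        for v in adjacency[u]:
            if not visited[v]:
                visited[v] = True
                queue[tail] = v
                tail += 1
    return queue[:tail]  # nodes in visit order; unreachable nodes are absent
```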

I had an old canned gossmap which I benchmarked for these (and in
particular one node was unreachable, and that was slow):

Before: 17.27 seconds
After: 5.80 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 08:47:53 +09:30
Lagrang3
0aa52b7fdd askrene: remove unused function
Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-09-19 12:16:53 +09:30
Rusty Russell
db29a2d6b5 askrene: don't have get_flow_paths() handle htlc_max, htlc_min and extra millisats.
We don't actually hit the htlc_max cases, since the flow code already
constrains us to that.

And handling htlc_min is better done in the caller, where diagnostics
are better (basically, we should eliminate them, and if that means no
route, give a clear error message).

And the refinement step can handle any extra millisats from rounding.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-09-19 12:16:53 +09:30
Rusty Russell
f0331cd82e askrene: add a "refining" step to add fees and handle corner cases.
This is the root cause of the problem worked around in 50949b7b9c
"askrene: hack in some padding so we don't overflow capacities."

When adding fees to flows, we didn't recheck the boundary conditions: in
renepay this is done by routebuilder.

Fortunately, we can use our "reservations" infrastructure to temporarily
use capacity as we process flows, so we handle the cases where they are
not independent correctly.

My assumption is that the resulting errors are small, so we divide
them between the remaining flows based on highest-to-least
probability.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-09-19 12:16:53 +09:30
Rusty Russell
5883aa85ca askrene: rename struct flow amount to delivers.
This is clearer: it's the final amount, not the amount we send!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-09-19 12:16:53 +09:30
Rusty Russell
829954ac71 askrene: remove struct flow probability member.
Simply calculate it when we need it, which means we don't have to keep it
up-to-date as we tweak the flow.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-09-19 12:16:53 +09:30
Rusty Russell
50949b7b9c askrene: hack in some padding so we don't overflow capacities.
Of course, we still will, since spendable is for a single HTLC, but
this also shows why we should treat *minimum* as the incorrect answer
if they cross, too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/7563
2024-08-23 18:52:15 +09:30
Rusty Russell
fafda82d82 askrene: fix up our handling of htlc_max.
It seems we didn't handle it correctly: we need to cap the first
segment as well as the others, as far as I can tell.

Also, it can be less than the maximum capacity.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-23 18:52:15 +09:30
Rusty Russell
79ceb59d7a plugins/askrene: remove local contexts.
In general, we should be using tmpctx unless there's a specific reason not to.
It's clear, and simplifies the code somewhat.

If tmpctx is not cleaned often enough, we can look at a per-MCF context, but this
seems like premature optimization.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-07 20:35:30 +09:30
Rusty Russell
b1817b6c52 askrene: include the mcf and flow routines.
This makes the code use askrene's "struct route_query".

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-07 20:35:30 +09:30
Rusty Russell
1db5cf6dea askrene: simply fail if a flow amount exceeds 64 bits.
Rather than handling failure, simply report and exit the plugin.
Simplifies error handling.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-07 20:35:30 +09:30
Rusty Russell
7bf399cac5 askrene: remove code which tries to handle tal failures.
tal does not fail: the default handler (which we use) aborts.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-07 20:35:30 +09:30
Rusty Russell
e4b84f1ffb askrene: copy flow and dijkstra from renepay.
Still don't actually try compiling them.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-07 20:35:30 +09:30
Rusty Russell
d109fcb568 askrene: simplify minflow()
We let the caller choose mu, and iterate if necessary: it can also
check its limits for fees, etc.  Rationalize it to 0-100 inclusive for
human consumption.

This means we don't loop internally, and in fact there's only one
failure mode: we cannot find enough capacity.
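
A minimal sketch of the caller-side loop, in Python for brevity (the plugin
itself is C). `find_flows_within_fee` and the `minflow(mu)` callable returning
`(flows, fee_msat)` or `None` are hypothetical stand-ins, and the step of 10
is an assumption based on the retry messages logged elsewhere in this series.

```python
def find_flows_within_fee(minflow, max_fee_msat, step=10):
    """Try mu = 0, step, ..., 100 until the fee fits the budget.

    Returns (mu, flows) on success, None if capacity is insufficient or
    no mu value brings the fee under max_fee_msat.
    """
    for mu in range(0, 101, step):
        result = minflow(mu)
        if result is None:
            return None  # the single failure mode: not enough capacity
        flows, fee_msat = result
        if fee_msat <= max_fee_msat:
            return mu, flows
    return None
```

With a step of 10 the loop makes at most 11 calls, one for each mu from 0 to
100 inclusive.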

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-07 20:35:30 +09:30
Rusty Russell
5999467dce askrene: copy mcf.[ch] from renepay with minimal modifications.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-08-07 20:35:30 +09:30