Commit graph

112 commits

Author SHA1 Message Date
Rusty Russell
fba738b65e askrene: really fix race between layer creation and persistent layer loading.
Using jsonrpc_request_sync, layers are loaded before we finish init,
so we never can be asked to create a layer before we've loaded it
(xpay creates a layer immediately on startup).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-None: persistent layers new this release.
2024-11-26 16:04:13 +10:30
Rusty Russell
d256e11108 askrene: reorder functions.
No code changes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-11-26 16:04:13 +10:30
Rusty Russell
af629e600e askrene: don't re-save layers as we restore them!
Create lower-level versions of routines to create biases, layers,
constraints, etc and only save the ones called from the public APIs.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-None: persistent layers were only added in this release
2024-11-26 16:04:13 +10:30
Lagrang3
05514b46e3 Askrene: change median factor to 1.
The ratio of the median of the fees and probability cost is overall not
a bad factor to combine these two features. This is what the
test_real_data shows.

Changelog-None

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
2b3fd67dfb askrene: don't skip fee_fallback test
The fee_fallback test would fail after fixing the computation of the
median. Now by we can restore it by making the probability cost factor
1000x higher than the ratio of the median. This shows how hard it is to
combine fee and probability costs and why is the current approach so
fragile.

Changelog-None

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
9fdcc26d1d askrene: bugfix queue overflow
Changelog-none

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
9966969d4c askrene: remove allocation checks
Rusty: "allocations don't fail"

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
460a28bb32 askrene: add compiler flag ASKRENE_UNITTEST
Rusty: "We don't generally use NDEBUG in our code"

Instead use a compile time flag ASKRENE_UNITTEST to make checks on unit
tests that we don't normally need on release code.

Changelog-none

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
b1cd26373b askrene: small fixes suggested by Rusty Russell
- use graph_max_num_arcs/nodes instead of tal_count in bound checks,
- don't use ccan/lqueue, use instead a minimalistic queue
  implementation with an array,
- add missing const qualifiers to temporary tal allocators,
- check preconditions with assert,
- remove inline specifier for static functions,

Changelog-None

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
44c9609f3a askrene: add arbitrary precision flow unit
Changelog-none: askrene: add arbitrary precision flow unit

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
4dc1a44cd9 askrene: fix the median
The calculation of the median values of probability and fee cost in the
linear approximation had a bug by counting on non-existing arcs.

Changelog-none: askrene: fix the median

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
ee623616d2 askrene: fix CI
check the return value of scanf in askrene unit tests,

Changelog-none: askrene: fix CI

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
937cf7a554 askrene: use the new MCF solver
Changelog-none: askrene: use the new MCF solver

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
84a9476311 askrene: add mcf_refinement to the public API
Changelog-none: askrene: add mcf_refinement to the public API

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
2142094e43 askrene: fix bug, not all arcs exists
We use an arc "array" in the graph structure, but not all arc indexes
correspond to real topological arcs. We must be careful when iterating
through all arcs, and check if they are enabled before making operations
on them.

Changelog-None: askrene: fix bug, not all arcs exists

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
e655fe7bbd askrene: add bigger test for MCF
Using zlib to read big test case file.

Changelog-None: askrene: add bigger test for MCF

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
2ea8e49683 askrene: add a MCF refinement
Add a new function to compute a MCF using a more general description of
the problem. I call it mcf_refinement because it can start with a
feasible flow (though this is not necessary) and adapt it to achieve
optimality.

Changelog-None: askrene: add a MCF refinement

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
1dfa562cd9 askrene algorithm add helper for flow conservation
Changelog-None: askrene algorithm add helper for flow conservation

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
42d075cc97 askrene: add a simple MCF solver
Changelog-EXPERIMENTAL: askrene: add a simple MCF solver

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
8558299dec askrene: add algorithm to compute feasible flow
Changelog-EXPERIMENTAL: askrene: add algorithm to compute feasible flow

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
f4f2985bdf askrene: add dijkstra algorithm
Changelog-EXPERIMENTAL: askrene: add dijkstra algorithm

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
507153a1cd askrene: add graph algorithms module
Changelog-EXPERIMENTAL: askrene: add graph algorithms module

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
59ce410699 askrene: add priorityqueue
It is just a copy-paste of "dijkstra" but the name
implies what it actually is. Not an implementation of minimum cost path
Dijkstra algorithm, but a helper data structure.
I keep the old "dijkstra.h/c" files for the moment to avoid breaking the
current code.

Changelog-EXPERIMENTAL: askrene: add priorityqueue

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Lagrang3
32548cf02b askrene: add a new graph abstraction
Changelog-EXPERIMENTAL: askrene new graph abstraction

Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
2024-11-21 16:17:52 +10:30
Rusty Russell
1b2d5acf16 askrene: don't create duplicate layers if xpay creates layer before we load them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-11-19 17:51:18 +10:30
Rusty Russell
d85dcc0ce4 askrene: persistent layer support.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-11-08 21:48:55 +10:30
Rusty Russell
b2dcf7248d askrene: add askrene-bias-channel.
This lets you place annotated biases on channels, to influence routing.

Uses include avoiding TOR nodes, slow channels or other local preferences.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-None: askrene is new anyway.
2024-11-08 21:48:55 +10:30
Rusty Russell
3f09e503ec askrene: fix false positive memleak since we didn't scan local_updates.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-11-08 21:48:55 +10:30
Rusty Russell
c797b6fb20 libplugin: add method string to jsonrpc callbacks, implement generic helpers.
Without knowing what method was called, we can't have useful general logging
methods, so go through the pain of adding "const char *method" everywhere,
and add:

1. ignore_and_complete - we're done when jsonrpc returned
2. log_broken_and_complete - we're done, but emit BROKEN log.
3. plugin_broken_cb - if this happens, fail the plugin.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-11-07 17:04:35 +10:30
Rusty Russell
4f4ec9aefd askrene: make sure we depend on libplugin.h
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-11-07 17:04:35 +10:30
Rusty Russell
c5099b1647 libplugin: clean up API.
When we used to allow cmd to be NULL, we had to hand the plugin
everywhere.  We no longer do.

1. Various jsonrpc_ functions no longer need the plugin arg.
2. send_outreq no longer needs a plugin arg.
3. The init function takes a command, not a plugin.
4. Remove command_deprecated_in_nocmd_ok.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-11-07 17:04:35 +10:30
Rusty Russell
318e49e9c7 askrene: more logging in explain_failure.
Lagrang3 doesn't like the logging in here at all, but he suggested we at
least be consistent!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
95c5fda79f askrene: remove flowset_probability() now refine step calculates it.
Now we've checked it gives the same answers, we can remove a lot of
work in flow.c.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
5501e4b13d askrene: use refine step to calculate flowset probability.
Since we know the total reservations on each hop, we can more easily
determine probabilities than using flowset_probability() which has to
replicate this collision detection.

We leave both in place for now, to check.  The results are not
identical, due to slightly different calculation methods.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
4b6a38fe0a askrene: fix bug with reservations used during refinement.
We were trying to get the max capacity of a flow to see if we could add some
more sats, and hit an assertion:

tests/test_askrene.py:707: 

```
 DEBUG   plugin-cln-askrene: notify msg info: Flow reduced to deliver 88070161msat not 90008000msat, because 107x1x0/1 has remaining capacity 88071042msat
 DEBUG   plugin-cln-askrene: notify msg info: Flow reduced to deliver 284138158msat not 284787000msat, because 108x1x0/1 has remaining capacity 284141000msat
 **BROKEN** plugin-cln-askrene: Flow delivers 129565000msat but max only 56506138msat
 INFO    plugin-cln-askrene: Killing plugin: exited during normal operation
```

We need to *unreserve* our flow before asking for max capacity.  We were
also missing a few less important cases where we altered flows without altering
the reservation, so fix those too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
bcc8bd59c8 askrene: don't *completely* ignore fees to start.
I noticed that increasing mu a little bit sometimes made a big difference,
because by completely ignoring fees we were choosing the worst of two channels
in some cases.

Start at 1% fees; this saves a lot on initial fees in this test!

Here's the new stats on mu levels:

     96  mu=1
     90  mu=10
     41  mu=20
     30  mu=30
     24  mu=40
     19  mu=50
     22  mu=60
      8  mu=70
     95  mu=80
     19  mu=90

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `askrene` is now better at finding low-fee paths.
2024-10-15 09:58:04 +10:30
Rusty Russell
2a0f09fc2d askrene: calculate k value dynamically, using medians.
While the `k=8` value worked for the current main network tests with the
amounts in those tests, it wasn't robust across a wider range of values
(as demonstrated when other test changes broke tests!).

Time to do this properly: calculate the ratio at the time we combine them,
using median values.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
32aa79a1e2 askrene: debug and check we actually reduce fees when mu increase.
Even after the previous fix, we still occasionally increase fees when my increases.

This is due to the difference between MCF's linear fees, and actual fees, and
is unavoidable, but add a check if it somehow happens.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
08df93cb25 askrene: fix base fee.
I noticed this in the logs:

	plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%...
	plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%...

We would expect increasing mu to *reduce* the fee!

Turns out that our linear fee is a bad terrible approximation, because I
was using base_fee_penalty of 10.0.

 |
 |          /   __ <- real fee, with base: fee = base + propfee * amount.
 |         / __/
 |       _//
 |    __/
 | __/_/
 |/  _/
 | _/ <- linearized fee: fee = linear * amount
 |/
 +-----------------------------------

These cross over where linear = propfee + base / amount.  Assume we split the
payment into 10 parts, this implies that the base_fee_penalty should be 10 / amount
(this gives a slight penalty to the normal case, but that's ok).

This gives better results, too: we get down to 650099 sats in fees, vs 801613
before.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
6273adbe47 askrene: calculate prob_cost_factor using ratio of typical mainnet channel.
During "test_real_data", then only successes with reduced fees were 92 on "mu=10", and only
1 on "mu=30": the rest went to mu=100 and failed.

I tried numerous approaches, and in the end, opted for the simplest:

The typical range of probability costs looks likes:
	min = 0, max = 924196240, mean = 10509.4, stddev = 1.9e+06

The typical range of linear fee costs looks like:
	min = 0, max = 101000000, mean = 81894.6, stddev = 2.6e+06

This implies a k factor of 8 makes the two comparable.

This makes the two numbers comparable, and thus makes "mu" much more
effective.  Here are the number of different mu values we succeeded at:

     87  mu=0
     90  mu=10
     42  mu=20
     24  mu=30
     17  mu=40
     19  mu=50
     19  mu=60
     11  mu=70
     95  mu=80
     19  mu=90

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
4897286c25 mcf: simplify mu -> cost translation.
The current prob_cost_factor setting does not seem to make mu very
effective, in fact, it gives strange results:

	plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%...
	plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%...

We would expect increasing mu to *reduce* the fee!

As a first step, simplify (it can't be infinite, and the -1 are weird).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
83eee64fda pytest: test askrene with worse maxfee argument.
We ask it again, but reduce fees by 1msat from the previous answer.
This is really nasty, as it frequently exercises the case where we
only go over fee when we do the refinement step.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
f17c5f5a6b askrene: don't use tmpctx in minflow()
I tested with a really large gossmap (hacked to be 4GB), and when we
keep retrying to minimize cost (calling minflow 11 times), and we
don't free tmpctx.

Due to an issue with how gossmap estimates the index sizes, we ended
up running out of memory.  This fixes it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Lagrang3
bd8cc1fb1f askrene: detect and cancel flow cycles
Flow cycles can occur if we have arc zero arc costs.
The previous path construction from the flow in the network assumed the
absence of such cycles and would enter an infinite loop if it hit one.

With his patch wee add cycle detection and removal during the path
construction phase.

Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
Changelog-EXPERIMENTAL: `askrene` infinite loop fixed
2024-10-15 09:58:04 +10:30
Rusty Russell
d08a3bb9e6 askrene: don't give up if we hit htlc_max and have no other flows.
This happens in the coming "real network" test!

We add fees and hit htlc_max, but don't have another flow to add to.
Rather than MCF again, we split the flow into two.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
1b82a3ad5b askrene: constrain to exact htlc_min/htlc_max values.
The fp16_t values are approximations (overestimate for htlc_max,
underestimate for htlc_min), so in the refinement step we should use
the exact values.

This also fixes a logic bug: flow_remaining_capacity returned the
total capacity, not the additional capacity!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `askrene` now honors exact htlc_maximum_msat limits.
2024-10-15 09:58:04 +10:30
Rusty Russell
22e7a57557 askrene: make auto.sourcefree a real layer, too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `getroutes` now applies `auto.sourcefree` layer in the order specified, so doesn't alter channels changed in later layers.
2024-10-15 09:58:04 +10:30
Rusty Russell
3321ad5883 askrene: populate auto.localchans layer properly.
Rather than adding to the gossmap modifications directly, populate
the layer and have the normal layer application logic do it.

This is consistent when we query layers in the next patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
1230f1b832 askrene: give notifications back to caller as we go.
And unify logging for better debugging.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `askrene` now has better logging, gives notifications of progress.
2024-10-15 09:58:04 +10:30
Rusty Russell
b8acd3b37c askrene: more code tweaks on feedback from Lagrang3.
1. describe_disabled should point out if node itself is disabled.
2. Hoist constraint check for neater if branching.
3. Use amount_msat_max/min for greater clarity.
4. Simply disable channels, don't zero htlc_min/max when node disabled.

I also fixed the diagnostic of htlc_max correctly, which removes a FIXME.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30