Commit graph

1725 commits

Author SHA1 Message Date
Rusty Russell
dc18f3cd7b BOLTs: update which renames blinding terminology.
No code changes, just catching up with the BOLT changes which rework our
blinded path terminology (for the better!).

Another patch will sweep the rest of our internal names, this tries only to
make things compile and fix up the BOLT quotes.

1. Inside payload: current_blinding_point -> current_path_key
2. Inside update_add_htlc TLV: blinding_point -> blinded_path
3. Inside blinded_path: blinding -> first_path_key
4. Inside onion_message: blinding -> path_key.
5. Inside encrypted_data_tlv: next_blinding_override -> next_path_key_override

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-16 07:14:32 +10:30
Rusty Russell
f944e03fca BOLT update: remove INVALID_REALM error.
This is obsolete (since modern onions) and so removed from spec.
We should not set it, and don't need to handle it specially.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-16 07:14:32 +10:30
Rusty Russell
318e49e9c7 askrene: more logging in explain_failure.
Lagrang3 doesn't like the logging in here at all, but he suggested we at
least be consistent!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
95c5fda79f askrene: remove flowset_probability() now refine step calculates it.
Now we've checked it gives the same answers, we can remove a lot of
work in flow.c.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
5501e4b13d askrene: use refine step to calculate flowset probability.
Since we know the total reservations on each hop, we can more easily
determine probabilities than using flowset_probability() which has to
replicate this collision detection.

We leave both in place for now, to check.  The results are not
identical, due to slightly different calculation methods.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
4b6a38fe0a askrene: fix bug with reservations used during refinement.
We were trying to get the max capacity of a flow to see if we could add some
more sats, and hit an assertion:

tests/test_askrene.py:707: 

```
 DEBUG   plugin-cln-askrene: notify msg info: Flow reduced to deliver 88070161msat not 90008000msat, because 107x1x0/1 has remaining capacity 88071042msat
 DEBUG   plugin-cln-askrene: notify msg info: Flow reduced to deliver 284138158msat not 284787000msat, because 108x1x0/1 has remaining capacity 284141000msat
 **BROKEN** plugin-cln-askrene: Flow delivers 129565000msat but max only 56506138msat
 INFO    plugin-cln-askrene: Killing plugin: exited during normal operation
```

We need to *unreserve* our flow before asking for max capacity.  We were
also missing a few less important cases where we altered flows without altering
the reservation, so fix those too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
bcc8bd59c8 askrene: don't *completely* ignore fees to start.
I noticed that increasing mu a little bit sometimes made a big difference,
because by completely ignoring fees we were choosing the worst of two channels
in some cases.

Start at 1% fees; this saves a lot on initial fees in this test!

Here's the new stats on mu levels:

     96  mu=1
     90  mu=10
     41  mu=20
     30  mu=30
     24  mu=40
     19  mu=50
     22  mu=60
      8  mu=70
     95  mu=80
     19  mu=90

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `askrene` is now better at finding low-fee paths.
2024-10-15 09:58:04 +10:30
Rusty Russell
2a0f09fc2d askrene: calculate k value dynamically, using medians.
While the `k=8` value worked for the current main network tests with the
amounts in those tests, it wasn't robust across a wider range of values
(as demonstrated when other test changes broke tests!).

Time to do this properly: calculate the ratio at the time we combine them,
using median values.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
32aa79a1e2 askrene: debug and check we actually reduce fees when mu increase.
Even after the previous fix, we still occasionally increase fees when my increases.

This is due to the difference between MCF's linear fees, and actual fees, and
is unavoidable, but add a check if it somehow happens.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
08df93cb25 askrene: fix base fee.
I noticed this in the logs:

	plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%...
	plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%...

We would expect increasing mu to *reduce* the fee!

Turns out that our linear fee is a bad terrible approximation, because I
was using base_fee_penalty of 10.0.

 |
 |          /   __ <- real fee, with base: fee = base + propfee * amount.
 |         / __/
 |       _//
 |    __/
 | __/_/
 |/  _/
 | _/ <- linearized fee: fee = linear * amount
 |/
 +-----------------------------------

These cross over where linear = propfee + base / amount.  Assume we split the
payment into 10 parts, this implies that the base_fee_penalty should be 10 / amount
(this gives a slight penalty to the normal case, but that's ok).

This gives better results, too: we get down to 650099 sats in fees, vs 801613
before.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
6273adbe47 askrene: calculate prob_cost_factor using ratio of typical mainnet channel.
During "test_real_data", then only successes with reduced fees were 92 on "mu=10", and only
1 on "mu=30": the rest went to mu=100 and failed.

I tried numerous approaches, and in the end, opted for the simplest:

The typical range of probability costs looks likes:
	min = 0, max = 924196240, mean = 10509.4, stddev = 1.9e+06

The typical range of linear fee costs looks like:
	min = 0, max = 101000000, mean = 81894.6, stddev = 2.6e+06

This implies a k factor of 8 makes the two comparable.

This makes the two numbers comparable, and thus makes "mu" much more
effective.  Here are the number of different mu values we succeeded at:

     87  mu=0
     90  mu=10
     42  mu=20
     24  mu=30
     17  mu=40
     19  mu=50
     19  mu=60
     11  mu=70
     95  mu=80
     19  mu=90

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
4897286c25 mcf: simplify mu -> cost translation.
The current prob_cost_factor setting does not seem to make mu very
effective, in fact, it gives strange results:

	plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%...
	plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%...

We would expect increasing mu to *reduce* the fee!

As a first step, simplify (it can't be infinite, and the -1 are weird).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
83eee64fda pytest: test askrene with worse maxfee argument.
We ask it again, but reduce fees by 1msat from the previous answer.
This is really nasty, as it frequently exercises the case where we
only go over fee when we do the refinement step.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
f17c5f5a6b askrene: don't use tmpctx in minflow()
I tested with a really large gossmap (hacked to be 4GB), and when we
keep retrying to minimize cost (calling minflow 11 times), and we
don't free tmpctx.

Due to an issue with how gossmap estimates the index sizes, we ended
up running out of memory.  This fixes it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Lagrang3
bd8cc1fb1f askrene: detect and cancel flow cycles
Flow cycles can occur if we have arc zero arc costs.
The previous path construction from the flow in the network assumed the
absence of such cycles and would enter an infinite loop if it hit one.

With his patch wee add cycle detection and removal during the path
construction phase.

Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
Changelog-EXPERIMENTAL: `askrene` infinite loop fixed
2024-10-15 09:58:04 +10:30
Rusty Russell
d08a3bb9e6 askrene: don't give up if we hit htlc_max and have no other flows.
This happens in the coming "real network" test!

We add fees and hit htlc_max, but don't have another flow to add to.
Rather than MCF again, we split the flow into two.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
1b82a3ad5b askrene: constrain to exact htlc_min/htlc_max values.
The fp16_t values are approximations (overestimate for htlc_max,
underestimate for htlc_min), so in the refinement step we should use
the exact values.

This also fixes a logic bug: flow_remaining_capacity returned the
total capacity, not the additional capacity!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `askrene` now honors exact htlc_maximum_msat limits.
2024-10-15 09:58:04 +10:30
Rusty Russell
4ee9d1d2f2 gossmap: include cltv_expiry_delta in gossmap_chan_get_update_details for completeness.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
22e7a57557 askrene: make auto.sourcefree a real layer, too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `getroutes` now applies `auto.sourcefree` layer in the order specified, so doesn't alter channels changed in later layers.
2024-10-15 09:58:04 +10:30
Rusty Russell
3321ad5883 askrene: populate auto.localchans layer properly.
Rather than adding to the gossmap modifications directly, populate
the layer and have the normal layer application logic do it.

This is consistent when we query layers in the next patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-15 09:58:04 +10:30
Rusty Russell
1230f1b832 askrene: give notifications back to caller as we go.
And unify logging for better debugging.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `askrene` now has better logging, gives notifications of progress.
2024-10-15 09:58:04 +10:30
Se7enZ
f9e28b9bfa keysend: Add maxfee to keysend for consistency with pay. ([#7227])
Changelog-Added: keysend: Add `maxfee` to keysend for consistency with pay. ([#7227])
2024-10-14 11:58:00 +02:00
Jesse de Wit
8330e3a0df pay-plugin: set gossmods directly
Multiple places in the pay lifecycle depend on mods to be set. By
setting the mods directly after the first listpeerchannels call,
subsequent calls to listpeerchannels are avoided.

Changelog-Fixed: pay-plugin: only call listpeerchannels once during a
payment lifecycle.
2024-10-08 19:26:14 +02:00
Jesse de Wit
281c639b57 pay-plugin: direct_pay only destination channels
The direct_pay payment modifier would query all peer channels, while
only the channels of the given peer suffices.

Changelog-None
2024-10-08 19:26:14 +02:00
Rusty Russell
d067066b17 common/gossmap: use u64 for all offsets.
Since we don't compact the gossmap on the fly (FIXME!) we can
easily surpass 4GB in the gossmap, and 32 bit offsets are not
sufficient.

I'm a bit surprised we don't crash immediately, but we've definitely
seen issues.

Changelog-Fixed: gossipd: crash errors with large gossip_store (>4MB) growth on longer-running nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-08 09:50:17 +02:00
Jesse de Wit
cc1362ead3 libplugin-pay: use map for channel hints
For nodes with many channels this is a tremendous improvement in pay
performance. PR #7611 improves payment performance from 15 seconds to
13.5 seconds on one of our nodes. This commit improves payment
performance from 13.5 seconds to 5.7 seconds.

Changelog-Fixed: Improved pathfinding speed for large nodes.
2024-10-07 15:16:46 +02:00
Christian Decker
a6a7dd8f71 pay: Switch to msat for total_capacity
This minimizes the need to convert back and forth from and to sat
values, and it also removes a new instance of sats in the public
interface (`channel_hints`).

Suggested-By: Rusty Russell <@rustyrussell>
2024-10-07 14:05:47 +02:00
Christian Decker
3845b84dbc pay: Simplify the channel_hint update logic
We were ignoring more reliable information because of the
stricter-is-better logic. This meant that we were also ignoring local
channel information, despite knowing better.

By simplifying the logic to pick the one with the newer timestamp we
ensure that later observations always trump earlier ones. There is a
bit of interplay between the relaxation of the constraints updating
the timestamp, and comparing to real observation timestamps, but that
should not impact us for local channels.
2024-10-07 14:05:47 +02:00
Christian Decker
f6d410d924 pay: Remove use of temporary local channel_hint
We were using `channel_hint` to temporarily tweak the graph inside of
a payment. However, these ad-hoc `channel_hints` are stickier than
their predecessors, in that they outlive the payment attempt itself,
and interfere with later ones.

Changelog-Changed: pay: Discarding an overly long or expensive route does not blocklist channels anymore.
2024-10-07 14:05:47 +02:00
Christian Decker
0a62416c4b pay: Inject channel_hints we receive via plugin notifications
Making sure that we don't accidentally create an endless loop.
2024-10-07 14:05:47 +02:00
Christian Decker
30d2a57f50 pay: Log when and why we exclude a channel from the route 2024-10-07 14:05:47 +02:00
Christian Decker
91ffa8e424 pay: Add channel_hint_set_count primitive 2024-10-07 14:05:47 +02:00
Christian Decker
30a2933a94 pay: Add a hysteresis for channel_hint relaxation
If we have a large channel, fail to send through a small amount, and
then add a `channel_hint`, then it can happen that the call to
`channel_hint_set_update` is already late enough that it refills the
1msat we removed from the attempted amount, thus making the edge we
just excluded eligible again, which can lead into an infinite loop.

We slow down the updating of the channel_hints to once every
hysteresis timeout. This allows us to set tight constraints, while not
incurring in the accidental relaxation issue.
2024-10-07 14:05:47 +02:00
Christian Decker
50a0321759 pay: Use the global channel_hint_set and remember across payments 2024-10-07 14:05:47 +02:00
Christian Decker
3ad0085478 route: Change the type of the funding capacity to amount_sat
Keeping it in `amount_msat` made the comparisons easier, but it was
the wrong type for this.
2024-10-07 14:05:47 +02:00
Christian Decker
904eb3795c pay: Subscribe to the channel_hint_update notifications 2024-10-07 14:05:47 +02:00
Christian Decker
cf09314b3b pay: Rename overall_capacity to just capacity
Suggested-by: Rusty Russell <@rustyrussell>
2024-10-07 14:05:47 +02:00
Christian Decker
37a204df41 plugin: Split out the struct channel_hint handling
We're getting serious about how we manage the channel_hints, so let's
give them a proper home.
2024-10-07 14:05:47 +02:00
Christian Decker
b897b4365d pay: Make the channel_hints global
We attach the hints to the plugin, so they get shared across multiple
payments.
2024-10-07 14:05:47 +02:00
Christian Decker
5225218094 pay: Use the total_mast amount as the upper limit for channel_hints 2024-10-07 14:05:47 +02:00
Christian Decker
1eb878be82 route: Add the total capacity to route_hops
We need to know the overall channel capacity, i.e., the amount_msat
that the channel was funded with, in order to relax the channel_hint
to refill over time.
2024-10-07 14:05:47 +02:00
Christian Decker
d60db28c41 pay: Add a function to update channel_hints based on their age
We relax constraints from the `channel_hint` through a linear refill.
2024-10-07 14:05:47 +02:00
Jesse de Wit
538a0076ca topology: call listpeerchannels_done directly
When cln is run without deprecated apis, the `listchannels` call would
still query `listpeerchannels`, even though it wouldn't use the result.
This adds unnecessary time to the call. Especially in case the node has
many channels. This commit skips the listpeerchannels call in case the
outcome won't be used.

Changelog-Fixed: Improve listchannels performance
2024-10-04 17:51:55 +02:00
Aditya Sharma
4c775f16aa plugins/chanbackup: Add RPC to fetch data from emergency.recover file. 2024-10-03 18:59:10 -07:00
Rusty Russell
b8acd3b37c askrene: more code tweaks on feedback from Lagrang3.
1. describe_disabled should point out if node itself is disabled.
2. Hoist constraint check for neater if branching.
3. Use amount_msat_max/min for greater clarity.
4. Simply disable channels, don't zero htlc_min/max when node disabled.

I also fixed the diagnostic of htlc_max correctly, which removes a FIXME.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
e3c9bc6d3a askrene: trivial changes to avoid -O3 compiler warnings.
The code is a bit too complex for gcc to track it:

```
In file included from ccan/ccan/tal/str/str.h:7,
                 from plugins/askrene/askrene.c:11:
plugins/askrene/askrene.c: In function ‘do_getroutes’:
ccan/ccan/tal/tal.h:324:23: error: ‘routes’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  324 | #define tal_count(p) (tal_bytelen(p) / sizeof(*p))
      |                       ^~~~~~~~~~~
plugins/askrene/askrene.c:476:24: note: ‘routes’ was declared here
  476 |         struct route **routes;
      |                        ^~~~~~
plugins/askrene/askrene.c:475:29: error: ‘amounts’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  475 |         struct amount_msat *amounts;
      |                             ^~~~~~~
plugins/askrene/askrene.c:488:69: error: ‘probability’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  488 |         json_add_u64(response, "probability_ppm", (u64)(probability * 1000000));
      |                                                        ~~~~~~~~~~~~~^~~~~~~~~~
cc plugins/askrene/dijkstra.c
cc1: all warnings being treated as errors
```

On my local machine, it also warns in param_dev_channel, so I fixed that too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
630ec6a566 askrene: give better feedback when we can't find a suitable route.
This turns out to be critical for users: also stops them from
bothering us when their node is offline or has insufficient capacity!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
e54c0f8ded askrene: don't replace constraints, simply accumulate.
Lagrang3 points out it's less useful (when we time them out), and probably
a premature optimization anyway.

Suggested-by: Lagrang3 <lagrang3@protonmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
c307b77d2f askrene: split askrene-create-channel into create-channel and update-channel.
This allows for explicit partial updates to channels (e.g. just change
fees, or just disable) without haveing to set the other fields.

This generalizes askrene-disable-channel, which is removed.

We also take the chance to use the proper BOLT 7 terms in the API:

- htlc_minimum_msat
- htlc_maximum_msat
- cltv_expiry_delta
- fee_base_msat
- fee_proportional_millionths

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30
Rusty Russell
5052f0763f gossmap: keep capacity for locally-generated channels as well.
It was weird not to have a capacity associated with localmods channels, and
fixing it has some very nice side effects.

Now the gossmap_chan_get_capacity() call never fails (we prevented reading
of channels from gossmap in the partially-written case already), so we
make it return the capacity.  We do this in msat, because that's what
all the callers want.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2024-10-04 11:27:53 +09:30