Utilizing the results of probes sent once a minute to a random node
in the network for a random amount (within a reasonable range), we
were able to analyze the accuracy of our resulting success
probability estimation with various PDFs across the historical and
live-bounds models.
For each candidate PDF (as well as other parameters, including the
histogram bucket weight), we used the
`min_zero_implies_no_successes` fudge factor in
`success_probability` as well as a total probability multiple fudge
factor to get both the historical success model and the a priori
model to be neither too optimistic nor too pessimistic (as measured
by the relative log-loss between succeeding and failing hops in our
sample data).
We then compared the resulting log-loss for the historical success
model and selected the candidate PDF with the lowest log-loss,
skipping a few candidates with similar resulting log-loss but with
more extreme constants (such as a power of 11 with a higher
`min_zero_implies_no_successes` penalty).
Somewhat surprisingly (to me at least), the (fairly strongly)
preferred model was one where the bucket weights in the historical
histograms are exponentiated. In the current design, the weights
are effectively squared as we multiply the minimum- and maximum-
histogram buckets together before adding the weight*probabilities
together.
Here we multiply the weights yet again before addition. While the
simulation runs seemed to prefer a slightly stronger weight than
the 4th power we do here, the difference wasn't substantial
(log-loss 0.5058 to 0.4941), so we do the simpler single extra
multiply here.
Note that if we did this naively we'd run out of bits in our
arithmetic operations: we have 16-bit buckets, which when raised
to the 4th power can fully fill a 64-bit int. Additionally, when looking
at the 0th min-bucket we occasionally add up to 32 weights together
before multiplying by the probability, requiring an additional five
bits.
Instead, we move to using floats during our histogram walks, which
further avoids some float -> int conversions because it allows for
retaining the floats we're already using to calculate probability.
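The bit-width concern above can be illustrated with a small sketch. This is not the scorer's actual code: the function names are hypothetical, and the sketch only shows that one 4th-power weight fits in a `u64` while a sum of 32 of them does not.

```rust
/// Raises a 16-bit bucket weight to the 4th power using 64-bit integers.
/// (2^16 - 1)^4 is just below 2^64, so a single product still fits.
fn weight_pow4_int(weight: u16) -> u64 {
    let w = weight as u64;
    w * w * w * w
}

/// Sums `count` copies of the 4th-power weight, reporting overflow as `None`.
fn sum_weights_int(weight: u16, count: u32) -> Option<u64> {
    let mut total: u64 = 0;
    for _ in 0..count {
        total = total.checked_add(weight_pow4_int(weight))?;
    }
    Some(total)
}

/// The same accumulation in floating point, trading exactness for range.
fn sum_weights_float(weight: u16, count: u32) -> f64 {
    let w = weight as f64;
    (w * w * w * w) * count as f64
}
```

With maximal weights, the integer sum overflows on the second addition, while the float version simply yields a (rounded) value larger than `u64::MAX`.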
Across the last handful of commits, the increased pessimism more
than makes up for the increased runtime complexity, leading to a
40-45% pathfinding speedup on a Xeon Silver 4116 and a 25-45%
speedup on a Xeon E5-2687W v3.
Thanks to @twood22 for being a sounding board and helping analyze
the resulting PDF.
In the next commit we'll want to return floats or ints from
`success_probability` depending on the callsite, so instead of
duplicating the calculation logic, here we split the linear (which
always uses int math) and nonlinear (which always uses float math)
into separate methods, allowing us to write trivial
`success_probability` wrappers that return the desired type.
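A minimal sketch of the split described above, under loud assumptions: the signatures, the placeholder cubic CDF, and the fixed-point scale are all hypothetical and chosen only to keep the example short. The point is the shape: one int-only method, one float-only method, and thin wrappers that convert to the type the callsite wants.

```rust
/// Linear model: integer math only. Returns (numerator, denominator).
fn linear_success_probability(amount: u64, min: u64, max: u64) -> (u64, u64) {
    debug_assert!(min <= amount && amount < max);
    (max - amount, max - min + 1)
}

/// Nonlinear model: float math only. The cubic CDF here is a placeholder,
/// not the PDF selected in these commits.
fn nonlinear_success_probability(amount: u64, min: u64, max: u64, capacity: u64) -> (f64, f64) {
    let cap = capacity as f64;
    let cdf = |x: u64| {
        let v = x as f64 / cap - 0.5;
        v * v * v
    };
    (cdf(max) - cdf(amount), cdf(max) - cdf(min))
}

/// Wrapper for callsites that want integers.
fn success_probability(amount: u64, min: u64, max: u64, capacity: u64, linear: bool) -> (u64, u64) {
    if linear {
        linear_success_probability(amount, min, max)
    } else {
        // Assumed fixed-point scale used to turn the float ratio into ints.
        const SCALE: f64 = (1u64 << 30) as f64;
        let (num, den) = nonlinear_success_probability(amount, min, max, capacity);
        ((num * SCALE) as u64, (den * SCALE) as u64)
    }
}

/// Wrapper for callsites that want floats.
fn success_probability_float(amount: u64, min: u64, max: u64, capacity: u64, linear: bool) -> (f64, f64) {
    if linear {
        let (num, den) = linear_success_probability(amount, min, max);
        (num as f64, den as f64)
    } else {
        nonlinear_success_probability(amount, min, max, capacity)
    }
}
```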
Utilizing the results of probes sent once a minute to a random node
in the network for a random amount (within a reasonable range), we
were able to analyze the accuracy of our resulting success
probability estimation with various PDFs across the historical and
live-bounds models.
For each candidate PDF (as well as other parameters, to be tuned in
the coming commits), we used the `min_zero_implies_no_successes`
fudge factor in `success_probability` as well as a total
probability multiple fudge factor to get both the historical
success model and the a priori model to be neither too optimistic
nor too pessimistic (as measured by the relative log-loss between
succeeding and failing hops in our sample data).
Across the simulation runs, for a given PDF and other parameters,
we nearly always did better with a shorter half-life (even as short
as 1ms, i.e. only learning per-probe rather than across probes).
While this likely makes sense for nodes which do live probing, not
all nodes do, and thus we should avoid over-biasing on the dataset
we have.
While it may make sense to only learn per-payment and not across
payments, I can't fully rationalize this result and thus want to
avoid over-tuning, so here we reduce the half-life from 6 hours to
30 minutes.
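To give a feel for what the shorter half-life means, here is a sketch of generic exponential decay. It assumes learned information is discounted by half for every half-life elapsed; the `decay` helper is hypothetical and is not how the scorer stores its state.

```rust
use std::time::Duration;

/// Discounts `value` by one half for every `half_life` that has elapsed.
fn decay(value: f64, elapsed: Duration, half_life: Duration) -> f64 {
    let half_lives = elapsed.as_secs_f64() / half_life.as_secs_f64();
    value * 0.5f64.powf(half_lives)
}
```

Under this assumption, information that is six hours old keeps half its weight with the old setting, but only 1/4096 of its weight (twelve half-lives) with a 30 minute half-life.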
If the liquidity penalty multipliers in the scoring config are both
0 (as is now the default), the corresponding liquidity penalties
will be 0. Thus, we should avoid doing the work to calculate them
if we're ultimately just going to get a value of zero anyway, which we
do here.
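The early-out can be sketched as follows. `PenaltyConfig` is a hypothetical stand-in for the scoring config, the penalty formula (including the 2^20 amount divisor) is an assumption for illustration, and the probability calculation is passed as a closure so the sketch can show that it is never invoked when both multipliers are zero.

```rust
/// Hypothetical stand-in for the relevant part of the scoring config.
struct PenaltyConfig {
    liquidity_penalty_multiplier_msat: u64,
    liquidity_penalty_amount_multiplier_msat: u64,
}

/// Computes a liquidity penalty, skipping the (potentially expensive)
/// success probability calculation when it cannot affect the result.
fn liquidity_penalty_msat<F: FnOnce() -> f64>(
    cfg: &PenaltyConfig, amount_msat: u64, success_probability: F,
) -> u64 {
    if cfg.liquidity_penalty_multiplier_msat == 0
        && cfg.liquidity_penalty_amount_multiplier_msat == 0
    {
        // Both terms below would be multiplied by zero, so skip the work.
        return 0;
    }
    let neg_log10 = -success_probability().log10();
    let base = neg_log10 * cfg.liquidity_penalty_multiplier_msat as f64;
    // Assumed scaling of the amount-dependent term; illustrative only.
    let per_amount = neg_log10 * cfg.liquidity_penalty_amount_multiplier_msat as f64
        * amount_msat as f64 / (1u64 << 20) as f64;
    (base + per_amount).round() as u64
}
```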
Utilizing the results of probes sent once a minute to a random node
in the network for a random amount (within a reasonable range), we
were able to analyze the accuracy of our resulting success
probability estimation with various PDFs across the historical and
live-bounds models.
For each candidate PDF (as well as other parameters, to be tuned in
the coming commits), we used the `min_zero_implies_no_successes`
fudge factor in `success_probability` as well as a total
probability multiple fudge factor to get both the historical
success model and the a priori model to be neither too optimistic
nor too pessimistic (as measured by the relative log-loss between
succeeding and failing hops in our sample data).
We then compared the resulting log-loss for the historical success
model and selected the candidate PDF with the lowest log-loss,
skipping a few candidates with similar resulting log-loss but with
more extreme constants (such as a power of 11 with a higher
`min_zero_implies_no_successes` penalty).
In every case, the historical model performed substantially better
than the live-bounds model, so here we simply disable the
live-bounds model by default and use only the historical model.
Further, we use the calculated total probability multiple fudge
factor (0.7886892844179266) to choose the ratio between the
historical model and the per-hop penalty (as multiplying each hop's
probability by 78% is equivalent to adding a per-hop penalty of
log10(0.78) of our probabilistic penalty).
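The equivalence used above follows from `-log10(p * c) = -log10(p) - log10(c)`. The sketch below checks it numerically, assuming the probabilistic penalty is proportional to `-log10` of the success probability; the helper names are hypothetical. Note the added per-hop term has magnitude `-log10(c)`, roughly 0.103 for the fudge factor above.

```rust
/// The total probability multiple fudge factor quoted in the text.
const PROBABILITY_MULTIPLE: f64 = 0.7886892844179266;

/// Penalty in units of "one decade of probability" (assumed form).
fn penalty(probability: f64) -> f64 {
    -probability.log10()
}

/// Option 1: scale every hop's probability before taking the penalty.
fn penalty_with_scaled_probability(probability: f64) -> f64 {
    penalty(probability * PROBABILITY_MULTIPLE)
}

/// Option 2: leave the probability alone and add a constant per-hop term.
fn penalty_with_per_hop_constant(probability: f64) -> f64 {
    penalty(probability) + penalty(PROBABILITY_MULTIPLE)
}
```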
We take this opportunity to bump the penalties up a bit as well, as
anecdotally LDK users are willing to pay more than they do today to
get more successful paths.
Fixes #3040
Utilizing the results of probes sent once a minute to a random node
in the network for a random amount (within a reasonable range), we
were able to analyze the accuracy of our resulting success
probability estimation with various PDFs.
For each candidate PDF (as well as other parameters, to be tuned in
the coming commits), we used the `min_zero_implies_no_successes`
fudge factor in `success_probability` as well as a total
probability multiple fudge factor to get both the historical
success model and the a priori model to be neither too optimistic
nor too pessimistic (as measured by the relative log-loss between
succeeding and failing hops in our sample data).
We then compared the resulting log-loss for the historical success
model and selected the candidate PDF with the lowest log-loss,
skipping a few candidates with similar resulting log-loss but with
more extreme constants (such as a power of 11 with a higher
`min_zero_implies_no_successes` penalty).
This resulted in a PDF of `128 * (1/256 + 9*(x - 0.5)^8)` with a
`min_zero_implies_no_successes` probability multiplier of 64/78.
Thanks to @twood22 for being a sounding board and helping analyze
the resulting PDF.
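As a sanity check on the PDF above, its antiderivative is `128 * (x/256 + (x - 0.5)^9)`, which integrates to exactly 1 over `[0, 1]`. The sketch below assumes `x` is the liquidity position normalized to channel capacity and that the success probability is the usual ratio of CDF differences; it omits the `min_zero_implies_no_successes` multiplier, and the function names are hypothetical.

```rust
/// The selected PDF over the normalized liquidity position x in [0, 1].
fn pdf(x: f64) -> f64 {
    128.0 * (1.0 / 256.0 + 9.0 * (x - 0.5).powi(8))
}

/// Antiderivative of `pdf` (up to a constant, which cancels in differences).
fn cdf(x: f64) -> f64 {
    128.0 * (x / 256.0 + (x - 0.5).powi(9))
}

/// Probability that the liquidity is at least `amount`, given it is known
/// to lie within [min, max] (all values normalized to capacity).
fn success_probability(amount: f64, min: f64, max: f64) -> f64 {
    (cdf(max) - cdf(amount)) / (cdf(max) - cdf(min))
}
```

The density is 5 at either edge of the channel and 0.5 in the middle, i.e. liquidity is assumed to sit near one end of a channel far more often than near the center.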
Previously, we wouldn't set the field as we aren't yet making use of it.
Here, we start setting the field. To this end, we make `best_block` an
`RwLock<Option<BestBlock>>` rather than `Option<RwLock<BestBlock>>`.
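The reason for the type change can be sketched as follows: with the `Option` inside the lock, the field can go from unset to set through a shared reference. `BestBlock` and `Service` here are simplified, hypothetical stand-ins.

```rust
use std::sync::RwLock;

/// Simplified stand-in for the real `BestBlock` type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct BestBlock {
    height: u32,
}

struct Service {
    /// `None` until the first block is connected.
    best_block: RwLock<Option<BestBlock>>,
}

impl Service {
    fn new() -> Self {
        Self { best_block: RwLock::new(None) }
    }

    /// Setting works through `&self`; with `Option<RwLock<BestBlock>>` a
    /// `None` could only be replaced via `&mut self`.
    fn set_best_block(&self, block: BestBlock) {
        *self.best_block.write().unwrap() = Some(block);
    }

    fn best_block(&self) -> Option<BestBlock> {
        *self.best_block.read().unwrap()
    }
}
```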
When a peer misbehaves or sends bogus data, we reply with an error
message and add the peer to the ignored list.
Here, we avoid having this list grow unboundedly over time by removing
peers again once they disconnect, allowing them a second chance upon
reconnection.
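A minimal sketch of the bounded ignored list, with hypothetical type and method names: peers are added when they send bogus data and dropped again when they disconnect.

```rust
use std::collections::HashSet;

/// Hypothetical peer identifier (e.g. a serialized node public key).
type PeerId = [u8; 33];

#[derive(Default)]
struct PeerTracker {
    ignored_peers: HashSet<PeerId>,
}

impl PeerTracker {
    /// Called after we replied with an error to a misbehaving peer.
    fn ignore_peer(&mut self, peer: PeerId) {
        self.ignored_peers.insert(peer);
    }

    fn is_ignored(&self, peer: &PeerId) -> bool {
        self.ignored_peers.contains(peer)
    }

    /// Removing the entry on disconnect keeps the set bounded by the number
    /// of connected peers and gives the peer a fresh start on reconnection.
    fn peer_disconnected(&mut self, peer: &PeerId) {
        self.ignored_peers.remove(peer);
    }
}
```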
We should update the return types on the signing methods here as
well, but we should at least start by documenting which methods are
async and which are not.
Once we complete async support for `get_per_commitment_point`, we
can change the return types as most things in the channel signing
traits will be finalized.
Prior to bcaba29f92, the
`chanmon_consistency` fuzzer checked that payments sent either
succeeded or failed as expected by looking at the `APIError` which
we received as a result of calling the send method.
bcaba29f92 removed the legacy send
method during fuzzing, so it attempted to replicate the old logic by
checking for events which contained the legacy `APIError` value.
While this was plenty thorough, it was somewhat brittle in that it
made expectations about the event state of a `ChannelManager` which
turned out to not be true.
Instead, here, we validate the send correctness by examining the
`RecentPaymentDetails` list from a `ChannelManager` immediately
after sending.
bcaba29f92 started returning
pre-built `Route`s from the router in the `chanmon_consistency`
fuzzer. In doing so, it didn't properly fill in the `route_params`
field which is expected to match the requested parameters. This
causes a debug assertion when sending.
Here we fix this by setting the correct `route_params`.
bcaba29f92 introduced a deadlock in
the `chanmon_consistency` fuzzer by holding a lock on the route
expectations before sending a payment, which ultimately tries to
lock the route expectations. Here we fix this deadlock.
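The general shape of this kind of deadlock and its fix can be sketched as below. This is not the fuzzer's code: `Router`, `find_route` and `send_payment` are hypothetical, and the sketch only shows releasing the guard before calling into code that takes the same lock.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct Router {
    /// Routes the test expects the router to hand out next.
    expectations: Mutex<VecDeque<u64>>,
}

impl Router {
    /// Takes the `expectations` lock internally.
    fn find_route(&self) -> Option<u64> {
        self.expectations.lock().unwrap().pop_front()
    }
}

fn send_payment(router: &Router, amount: u64) -> bool {
    {
        // Hold the lock only while recording the expectation...
        let mut expectations = router.expectations.lock().unwrap();
        expectations.push_back(amount);
    } // ...and drop the guard here, before sending.

    // If the guard were still alive, this call would try to take the same
    // (non-reentrant) lock on the same thread and never return.
    router.find_route() == Some(amount)
}
```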
This context is included in static invoice's blinded message paths, provided
back to us in HeldHtlcAvailable onion messages for blinded path authentication.
In future work, we will check if this context is valid and respond with a
ReleaseHeldHtlc message to release the upstream payment if so.
We also add creation methods for the hmac used for authenticating said blinded
path.
While LDK/`ChannelManager` should already introduce an upper-bound on
the number of peers, here we assert that our `PeerState` map can't
grow unboundedly. To this end, we simply return an `Internal error` and
abort when we would hit the limit of 100000 peers.
We include any `OutboundJITChannel` that has not made it further than
`PendingInitialPayment` in the per-peer request limit, and will of
course prune it once it expires.
In addition to pruning expired requests on peer disconnection we also
regularly prune for all peers on block connection, and also remove the
entire `PeerState` if it's empty after pruning (i.e., has no pending
requests or in-flight channels left).
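The limit and pruning behaviour described above can be sketched roughly as follows. All names are hypothetical, peers are keyed by a plain integer, and expiry is modelled as a block height purely as an assumption to keep the example small.

```rust
use std::collections::HashMap;

/// Limit named in the text; `max_peers` below is configurable only so the
/// sketch can be exercised with small numbers.
const MAX_TOTAL_PEERS: usize = 100_000;

#[derive(Default)]
struct PeerState {
    /// Expiry heights of pending requests (assumed representation).
    pending_request_expiries: Vec<u32>,
    inflight_channels: usize,
}

impl PeerState {
    fn prune_expired(&mut self, height: u32) {
        self.pending_request_expiries.retain(|expiry| *expiry > height);
    }
    fn is_empty(&self) -> bool {
        self.pending_request_expiries.is_empty() && self.inflight_channels == 0
    }
}

struct Service {
    per_peer_state: HashMap<u64, PeerState>,
    max_peers: usize,
}

impl Service {
    fn new(max_peers: usize) -> Self {
        Self { per_peer_state: HashMap::new(), max_peers }
    }

    /// Rejects requests that would create a new entry beyond the limit.
    fn add_request(&mut self, peer: u64, expiry_height: u32) -> Result<(), &'static str> {
        if !self.per_peer_state.contains_key(&peer) && self.per_peer_state.len() >= self.max_peers {
            return Err("Internal error");
        }
        self.per_peer_state.entry(peer).or_default().pending_request_expiries.push(expiry_height);
        Ok(())
    }

    /// On block connection: prune every peer and drop now-empty entries.
    fn prune_all(&mut self, height: u32) {
        self.per_peer_state.retain(|_, state| {
            state.prune_expired(height);
            !state.is_empty()
        });
    }
}
```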
Now that the core features required for `async_signing` are in
place, we can go ahead and expose it publicly (rather than behind a
`cfg`-flag). We still don't have full async support for
`get_per_commitment_point`, but only one case in channel
reconnection remains. The overall logic may still have some
hiccups, but it's been in use in production at a major LDK user for
some time now. Thus, it doesn't really make sense to hide behind a
`cfg`-flag, even if the feature is only 99% complete. Further, the
new paths exposed are very restricted to signing operations that
run async, so the risk for existing users should be incredibly low.
This deduplicates the common `if during_startup { push background event }
else { apply ChannelMonitorUpdate }` pattern by simply inlining it
in `handle_new_monitor_update`.
One of the largest gaps in our async persistence functionality has
been preimage (claim) updates to closed channels. Here we finally
implement support for this (for updates which are generated during
startup).
Thanks to all the work we've built up over the past many commits,
this is a fairly straightforward patch, removing the
immediate-completion logic from `claim_mpp_part` and adding the
required in-flight tracking logic to
`apply_post_close_monitor_update`.
Like in the during-runtime case in the previous commit, we sadly
can't use the `handle_new_monitor_update` macro wholesale as it
handles the `Channel` resumption as well which we don't do here.
On startup, we walk the preimages and payment HTLC sets on all our
`ChannelMonitor`s, re-claiming all payments which we recently
claimed. This ensures all HTLCs in any claimed payments are claimed
across all channels.
In doing so, we expect to see the same payment multiple times;
after all, it may have been received as multiple HTLCs across
multiple channels. In such cases, there's no reason to redundantly
claim the same set of HTLCs again and again. In the current code,
doing so may lead to redundant `PaymentClaimed` events, and in a
coming commit will instead cause an assertion failure.
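One way to avoid the redundant claims is to remember which payments have already been handled while walking the monitors. The sketch below is an illustration with hypothetical names, not the actual startup code.

```rust
use std::collections::HashSet;

type PaymentHash = [u8; 32];

/// Given the claimed payments found in each monitor, returns every payment
/// exactly once, in first-seen order.
fn payments_to_reclaim(claimed_per_monitor: &[Vec<PaymentHash>]) -> Vec<PaymentHash> {
    let mut seen = HashSet::new();
    let mut to_reclaim = Vec::new();
    for monitor_payments in claimed_per_monitor {
        for payment_hash in monitor_payments {
            // `insert` returns false if another monitor already listed
            // this payment, in which case we skip the redundant claim.
            if seen.insert(*payment_hash) {
                to_reclaim.push(*payment_hash);
            }
        }
    }
    to_reclaim
}
```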
One of the largest gaps in our async persistence functionality has
been preimage (claim) updates to closed channels. Here we finally
implement support for this (for updates at runtime).
Thanks to all the work we've built up over the past many commits,
this is a well-contained patch within `claim_mpp_part`, pushing
the generated `ChannelMonitorUpdate`s through the same pipeline we
use for open channels.
Sadly we can't use the `handle_new_monitor_update` macro wholesale
as it handles the `Channel` resumption as well which we don't do
here.
In d1c340a0e1 we added support in
`handle_new_monitor_update!` for handling updates without dropping
locks.
In the coming commits we'll start handling `ChannelMonitorUpdate`s
"like normal" for updates against closed channels. Here we set up
the first step by adding a new `POST_CHANNEL_CLOSE` variant on
`handle_new_monitor_update!` which attempts to handle the
`ChannelMonitorUpdate` and handles completion actions if it
finishes immediately, just like the pre-close variant.
In c99d3d785d we added a new
`apply_post_close_monitor_update` method which takes a
`ChannelMonitorUpdate` (possibly) for a channel which has been
closed, sets the `update_id` to the right value to keep our updates
well-ordered, and then applies it.
Setting the `update_id` at application time here is fine - updates
don't really have an order after the channel has been closed, they
can be applied in any order - and was done for practical reasons
as calculating the right `update_id` at generation time takes a
bit more work on startup, and was impossible without new
assumptions during claim.
In the previous commit we added exactly the new assumption we need
at claiming (as it's required for the next few commits anyway), so
now the only thing stopping us is the extra complexity.
In the coming commits, we'll move to tracking post-close
`ChannelMonitorUpdate`s as in-flight like any other updates, which
requires having an `update_id` at generation-time so that we know
what updates are still in-flight.
Thus, we go ahead and eat the complexity here, creating
`update_id`s when the `ChannelMonitorUpdate`s are generated for
closed-channel updates, like we do for channels which are still
live.
We also ensure that we always insert `ChannelMonitorUpdate`s in the
pending updates set when we push the background event, avoiding a
race where we push an update as a background event, then while it's
processing another update finishes and the post-update actions get
run.
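A simplified picture of why generation-time `update_id`s matter for in-flight tracking: once each update has an id when it is created, the set of outstanding ids tells us when it is safe to run post-update actions. The types below are hypothetical and much smaller than the real ones.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-channel bookkeeping for a closed channel.
struct ClosedChannelUpdates {
    latest_update_id: u64,
    in_flight: BTreeSet<u64>,
}

impl ClosedChannelUpdates {
    fn new(latest_update_id: u64) -> Self {
        Self { latest_update_id, in_flight: BTreeSet::new() }
    }

    /// Assigns the id when the update is generated and records it as
    /// in-flight immediately, before it is queued or applied.
    fn generate_update(&mut self) -> u64 {
        self.latest_update_id += 1;
        self.in_flight.insert(self.latest_update_id);
        self.latest_update_id
    }

    /// Marks an update as persisted. Returns true only when nothing else is
    /// still in flight, i.e. when post-update actions may run.
    fn update_completed(&mut self, update_id: u64) -> bool {
        self.in_flight.remove(&update_id);
        self.in_flight.is_empty()
    }
}
```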
Here we make a test that disables a channel signer's ability
to return commitment points upon being first derived for a channel.
We also fit in a couple cleanups: removing a comment referencing a
previous design with a `HolderCommitmentPoint::Uninitialized` variant,
as well as adding coverage for updating channel maps in async closing
signed.
Here we handle the case where our signer is pending the next commitment
point when we try to send channel ready. We set a flag to remember to
send this message when our signer is unblocked. This follows the same
general pattern as everywhere else where we're waiting on a commitment
point from the signer in order to send a message.
Similar to `open_channel`, if a signer cannot provide a commitment point
immediately, we set a flag to remember we're waiting for a point to send
`accept_channel`. We make sure to get the first two points before moving
on, so when we advance our commitment we always have a point available.
For all of our async signing logic in channel establishment v1, we set
signer flags in the method where we create the raw lightning message
object. To keep things consistent, this commit moves setting the signer
flags to where we create funding_created, since this was being set
elsewhere before.
While we're doing this cleanup, this also slightly refactors our
funding_signed method to move some code out of an indent, and
removes a log to fix a nit from #3152.
In the event that a signer cannot provide a commitment point
immediately, we set a flag to remember we're waiting for this before we
can send `open_channel`. We make sure to get the first two commitment
points, so when we advance commitments, we always have a commitment
point available.
When initializing a context, we set the `signer_pending_open_channel`
flag to false, and leave setting this flag for where we attempt to
generate a message.
When checking to send messages when a signer is unblocked, we must
handle both when we haven't gotten any commitment point, as well as when
we've gotten the first but not the second point.
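The flag-and-retry pattern can be sketched as below. Everything here is a hypothetical simplification: `MockSigner` stands in for an async signer that may not have a point ready, and the "message" is just a string.

```rust
type Point = [u8; 33];

/// Stand-in signer that can only produce the first `points_available` points.
struct MockSigner {
    points_available: u64,
}

impl MockSigner {
    fn get_per_commitment_point(&self, index: u64) -> Option<Point> {
        if index < self.points_available { Some([index as u8; 33]) } else { None }
    }
}

struct ChannelContext {
    first_point: Option<Point>,
    second_point: Option<Point>,
    signer_pending_open_channel: bool,
}

impl ChannelContext {
    fn new() -> Self {
        // The flag starts false; it is only set where we try to build the message.
        Self { first_point: None, second_point: None, signer_pending_open_channel: false }
    }

    /// Tries to build `open_channel`, requiring both points up front.
    fn get_open_channel(&mut self, signer: &MockSigner) -> Option<&'static str> {
        if self.first_point.is_none() {
            self.first_point = signer.get_per_commitment_point(0);
        }
        if self.first_point.is_some() && self.second_point.is_none() {
            self.second_point = signer.get_per_commitment_point(1);
        }
        let ready = self.first_point.is_some() && self.second_point.is_some();
        self.signer_pending_open_channel = !ready;
        if ready { Some("open_channel") } else { None }
    }

    /// Called when the signer reports it is unblocked; covers both the
    /// "no point yet" and "first but not second point" cases.
    fn signer_maybe_unblocked(&mut self, signer: &MockSigner) -> Option<&'static str> {
        if self.signer_pending_open_channel { self.get_open_channel(signer) } else { None }
    }
}
```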
Following a previous commit adding `HolderCommitmentPoint` elsewhere, we
make the transition to use those commitment points and remove the
existing one.