When an HTLC fails, we currently rely on the scorer learning the
failed channel and assigning an infinite (`u64::max_value()`)
penalty to the channel so as to avoid retrying over the exact same
path (if there's only one available path). This is common when
trying to pay a mobile client behind an LSP if the mobile client is
currently offline.
This leads to the scorer being overly conservative in some cases -
returning `u64::max_value()` when a given path hasn't been tried
for a given payment may not be the best decision, even if that
channel failed 50 minutes ago.
By tracking channels which failed on a payment part level and
explicitly refusing to route over them we can relax the
requirements on the scorer, allowing it to make different decisions
on how to treat channels that failed relatively recently without
causing payments to retry the same path forever.
This does have the drawback that it could allow two separate part
of a payment to traverse the same path even though that path just
failed, however this should only occur if the payment is going to
fail anyway, at least as long as the scorer is properly learning.
Closes#1241, superseding #1252.
This is mostly motivated by the fact that payments may happen while the
latest `ChannelUpdate` indicating our new `ChannelConfig` is still
propagating throughout the network. By temporarily allowing the previous
config, we can help reduce payment failures across the network.
ChannelConfig now has its static fields removed. We introduce a new
LegacyChannelConfig struct that maintains the serialization as
previously defined by ChannelConfig to remain backwards compatible with
clients running 0.0.107 and earlier.
In the near future, we plan to allow users to update their
`ChannelConfig` after the initial channel handshake. In order to reuse
the same struct and expose it to users, we opt to move out all static
fields that cannot be updated after the initial channel handshake.
P2PGossipSync logs before delegating to NetworkGraph in its
EventHandler. In order to share this handling with RapidGossipSync,
NetworkGraph needs to take a logger so that it can implement
EventHandler instead.
NetGraphMsgHandler implements RoutingMessageHandler to handle gossip
messages defined in BOLT 7 and maintains a view of the network by
updating NetworkGraph. Rename it to P2PGossipSync, which better
describes its purpose, and to contrast with RapidGossipSync.
A NetworkUpdate indicating ChannelClosed actually corresponds to a
channel failure as described in BOLT 4:
0x2000 (NODE): node failure (otherwise channel)
Rename the enum variant to ChannelFailure and rename NetworkGraph
methods close_channel_from_update and fail_node to channel_failed and
node_failed, respectively.
`ChannelManager::fail_htlc_backwards`' bool return value is quite
confusing - just because it returns false doesn't mean the payment
wasn't (already) failed. Worse, in some race cases around shutdown
where a payment was claimed before an unclean shutdown and then
retried on startup, `fail_htlc_backwards` could return true even
though (a duplicate copy of the same payment) was claimed, but the
claim event has not been seen by the user yet.
While its possible to use it correctly, its somewhat confusing to
have a return value at all, and definitely lends itself to misuse.
Instead, we should push users towards a model where they don't care
if `fail_htlc_backwards` succeeds - either they've locally marked
the payment as failed (prior to seeing any `PaymentReceived`
events) and will fail any attempts to pay it, or they have not and
the payment is still receivable until its timeout time is reached.
We can revisit this decision based on user feedback, but will need
to very carefully document the potential failure modes here if we
do.
As additional sanity checks, before claiming a payment, we check
that we have the full amount available in `claimable_htlcs` that
the payment should be for. Concretely, this prevents one
somewhat-absurd edge case where a user may receive an MPP payment,
wait many *blocks* before claiming it, allowing us to fail the
pending HTLCs and the sender to retry some subset of the payment
before we go to claim. More generally, this is just good
belt-and-suspenders against any edge cases we may have missed.
While the HTLC-claim process happens across all MPP parts under one
lock, this doesn't imply that they are claimed fully atomically on
disk. Ultimately, an application can crash after persisting one
`ChannelMonitorUpdate` out of multiple monitor updates needed for
the full claim.
Previously, this would leave us in a very bad state - because of
the all-channels-available check in `claim_funds` we'd refuse to
claim the payment again on restart (even though the
`PaymentReceived` event will be passed to the user again), and we'd
end up having partially claimed the payment!
The fix for the consistency part of this issue is pretty
straightforward - just check for this condition on startup and
complete the claim across all channels/`ChannelMonitor`s if we
detect it.
This still leaves us in a confused state from the perspective of
the user, however - we've actually claimed a payment but when they
call `claim_funds` we return `false` indicating it could not be
claimed.
We have a bunch of fancy infrastructure to ensure we can connect
blocks using all our different connection interfaces, but we only
bother to use it in a few select tests.
This expands our use of `ConnectStyle` to most of our tests by
simply randomizing the style in each test. This makes our tests
non-deterministic, but we print the connection style at start so
that it's easy to reproduce a failure deterministically.
This update also includes a minor refactor. The return type of
`pending_monitor_events` has been changed to a `Vec` tuple with the
`OutPoint` type. This associates a `Vec` of `MonitorEvent`s with a
funding outpoint.
We've also renamed `source/sink_channel_id` to `prev/next_channel_id` in
the favour of clarity.
When we fail an HTLC which was destined for a channel that the HTLC
sender didn't know the real SCID for, we should ensure we continue
to use the alias in the channel_update we provide them. Otherwise
we will leak the channel's real SCID to HTLC senders.
This makes tests slightly more realistic by delivering
`channel_update`s to `ChannelManager`s, ensuring we have
forwarding data stored locally for all channels, including public
ones.
... by calling it both before and after every chain event in
testing and fuzzing.
This requires fixing some blockchain inconsistencies in
`do_test_onchain_htlc_reorg`, `do_retry_with_no_persist`, and
`do_test_dup_htlc_onchain_fails_on_reload` where we'd connect
conflicting transactions in the same chain.
`cargo bench` sets `cfg(test)`, causing us to hit some test-only
code in the router when benchmarking, throwing off our benchmarks
substantially. Here we swap from the `unstable` feature to a more
clearly internal feature (`_bench_unstable`) and also checking for
it when enabling test-only code.
The spec actually requires we never send `announcement_signatures`
(and, thus, `channel_announcement`s) until after six confirmations.
However, we would happily have sent them prior to that as long as
we exchange `funding_locked` messages with our countarparty. Thanks
to re-broadcasting this issue is largely harmless, however it could
have some negative interactions with less-robust peers. Much more
importantly, this represents an important step towards supporting
0-conf channels, where `funding_locked` messages may be exchanged
before we even have an SCID to construct the messages with.
Because there is no ACK mechanism for `announcement_signatures` we
rely on existing channel updates to stop rebroadcasting them - if
we sent a `commitment_signed` after an `announcement_signatures`
and later receive a `revoke_and_ack`, we know our counterparty also
received our `announcement_signatures`. This may resolve some rare
edge-cases where we send a `funding_locked` which our counterparty
receives, but lose connection before the `announcement_signatures`
(usually the very next message) arrives.
Sadly, because the set of places where an `announcement_signatures`
may now be generated more closely mirrors where `funding_locked`
messages may be generated, but they are now separate, there is a
substantial amount of code motion providing relevant parameters
about current block information and ensuring we can return new
`announcement_signatures` messages.