With MPP and Trampoline (and particularly the combination of the two),
we need to keep track of multiple amounts, recipients and fees.
There's a trampoline fee and a fee to reach the first trampoline node.
The trampoline nodes must appear in the route, but not as payment recipients.
Adding new fields to payment events and DB structs lets us distinguish those.
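A minimal sketch of the distinction, with illustrative field names (not eclair's actual ones): keeping both the amount received by the final recipient and the total amount sent lets us derive all fees, without ever treating a trampoline node as a recipient.

```scala
// Illustrative sketch: a sent-payment event that tracks both amounts.
case class PaymentSentSketch(
    recipientNodeId: String,   // the final recipient, never a trampoline node
    recipientAmountMsat: Long, // amount actually received by the recipient
    amountSentMsat: Long       // total amount sent, including all fees
) {
  // total fees = trampoline fee + fee to reach the first trampoline node
  def feesPaidMsat: Long = amountSentMsat - recipientAmountMsat
}
```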
We also relax the spec requirement about feature graph dependencies.
The requirement to include `var_onion_optin` in invoice feature bits
was added after the first Phoenix release.
Phoenix users will thus have non-spec-compliant invoices in their
payment history.
We accept invoices that don't set this field; this is a harmless
spec violation (as long as we set it in new invoices).
There was a rounding issue with the availableForSend/Receive calculation.
Because the CommitTx fee and the HTLC fees were computed separately,
and each was individually rounded down to satoshis, we could
end up with an off-by-one error.
This made it impossible to send/receive the maximum available amount.
We now allow computing fees in msat, which removes rounding issues.
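A small numeric illustration of the problem (the feerate is an arbitrary example; the weights are the usual commit-tx and HTLC weights): flooring each fee to satoshis separately loses up to a satoshi compared to computing in msat.

```scala
object FeeRoundingDemo extends App {
  val feeratePerKw = 997L  // sat per kiloweight (arbitrary example value)
  val commitWeight = 724L
  val htlcWeight = 172L

  // Old approach: each fee individually rounded down to satoshis.
  val commitFeeSat = feeratePerKw * commitWeight / 1000 // 721 sat (truncated)
  val htlcFeeSat = feeratePerKw * htlcWeight / 1000     // 171 sat (truncated)
  println(s"separate rounding: ${commitFeeSat + htlcFeeSat} sat") // 892 sat

  // New approach: compute in msat, so no intermediate rounding occurs.
  val totalFeeMsat = feeratePerKw * (commitWeight + htlcWeight) // 893,312 msat
  println(s"msat precision: $totalFeeMsat msat") // 893.312 sat: off by one
}
```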
c-lightning fails to decode empty arrays of scids or timestamps with an encoding type set to COMPRESSED_ZLIB.
The spec is not explicit about whether this is valid or not, so we'll set the encoding type of empty arrays to UNCOMPRESSED.
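A sketch of the workaround (types simplified): whatever encoding is preferred, empty arrays always go out uncompressed.

```scala
sealed trait EncodingType
case object Uncompressed extends EncodingType
case object CompressedZlib extends EncodingType

// Empty arrays are always encoded as UNCOMPRESSED so that c-lightning
// can decode them; non-empty arrays keep the preferred encoding.
def encodingFor(ids: Seq[Long], preferred: EncodingType): EncodingType =
  if (ids.isEmpty) Uncompressed else preferred
```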
When paying an invoice, we weren't properly checking our own features:
if the invoice advertised MPP support, we would always use MPP.
We now default to a legacy payment when MPP isn't enabled in our own features.
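In essence (feature names as plain strings for brevity, a simple set-based check standing in for eclair's feature types):

```scala
// Use MPP only if both the invoice and our own features advertise it;
// otherwise default to a legacy (single-part) payment.
def useMultiPart(invoiceFeatures: Set[String], ourFeatures: Set[String]): Boolean =
  invoiceFeatures.contains("basic_mpp") && ourFeatures.contains("basic_mpp")
```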
Add new errors that let senders know they need to raise the trampoline fee/cltv.
When the error is downstream, select the best error to forward.
Implement retries with higher fees for trampoline payments.
This process is currently quite manual: the sender decides upfront on
each attempt's fee/cltv.
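A hedged sketch of what "manual" means here (names and values are illustrative): the fee/cltv schedule is fixed upfront and attempts are tried in order.

```scala
case class TrampolineAttempt(feeMsat: Long, cltvExpiryDelta: Int)

// The sender decides on the whole schedule before the first attempt.
val schedule = Seq(
  TrampolineAttempt(feeMsat = 1000, cltvExpiryDelta = 144),
  TrampolineAttempt(feeMsat = 3000, cltvExpiryDelta = 288),
  TrampolineAttempt(feeMsat = 5000, cltvExpiryDelta = 576)
)

// Retry with the next (more generous) attempt until one succeeds.
def pay(send: TrampolineAttempt => Boolean): Boolean = schedule.exists(send)
```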
If our initial random reconnection delay is 0 (unlikely, but possible), then all "exponential backoff" reconnection delays will be 0 too, so we set a minimum value of 200 milliseconds.
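A minimal sketch of the resulting schedule (everything except the 200 ms floor is illustrative):

```scala
import scala.concurrent.duration._
import scala.util.Random

// Randomized initial delay, floored at 200 ms: a 0 ms draw would otherwise
// make every subsequent doubled delay 0 as well.
def initialReconnectionDelay(maxRandomMs: Int): FiniteDuration =
  200.millis.max(Random.nextInt(maxRandomMs).millis)

// Exponential backoff: double the previous delay, capped at a maximum.
def nextReconnectionDelay(previous: FiniteDuration, max: FiniteDuration): FiniteDuration =
  (previous * 2).min(max)
```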
lnd expects id ranges in reply_channel_range messages to strictly follow each other, without gaps.
For example, using block heights and not ids, [1,2,4,5] would be split into (first=1, num=2, [1,2]) :: (first=3, num=3, [4, 5]).
This is arguably a limitation of lnd (c-lightning does not require this, and it's not needed to properly process replies), but it is easy to implement.
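A sketch of the adjustment (heights stand in for short channel ids, and the structure is simplified): each chunk's range starts right where the previous one ended.

```scala
case class Chunk(firstBlock: Long, numBlocks: Long, heights: Seq[Long])

// Rewrite chunk boundaries so consecutive ranges are contiguous: each
// chunk starts at the block right after the previous chunk's range ends,
// and extends far enough to cover all the ids it contains.
def makeContiguous(chunks: Seq[Chunk]): Seq[Chunk] = chunks match {
  case Seq() => Seq.empty
  case head +: tail =>
    tail.scanLeft(head) { (prev, cur) =>
      val first = prev.firstBlock + prev.numBlocks // no gap with previous
      Chunk(first, cur.heights.max - first + 1, cur.heights)
    }
}
```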
This is needed to make sure we broadcast our own gossip.
Otherwise we would try to gossip at the beginning of the connection,
before the peer has set any timestamp, and our gossip would be dropped.
See https://github.com/lightningnetwork/lightning-rfc/pull/684
Otherwise eclair-mobile can't pay using MPP.
This heuristic was only here to help Trampoline nodes with a lot of
channels relay using MPP, but we disabled that in #1271 anyway.
We will reactivate Trampoline-MPP once split is done inside the router.
* Add test to check that we split short channel ids correctly
reply_channel_range messages should not overlap, i.e. different replies should not contain
channel ids that have the same block height.
The test in this commit fails, because our 'split' function needs to be updated.
* Channel Queries: make sure that our replies match the request range (fixes #1269)
Even though it's not completely explicit in the spec, we should make sure that
the [firstBlock, numBlocks] range covered by our replies matches the requested
[firstBlock, numBlocks] range, instead of being computed from the ids that we
actually have.
* Make sure that serialised replies stay below the 65Kb limit
We prune short channel id chunks to make sure that serialised replies stay below the 65 Kb limit.
The pruning algo is very simple: for each chunk we randomly keep either the first or the last 3200 ids.
Selection is random so peers that re-connect will eventually receive all channel info.
The limit of 3200 was chosen for the worst case, where replies are not compressed and include timestamps and checksums.
It is a fairly conservative bound: the highest number of public channels in a single block so far is <300, and
3200 is roughly the currently observed number of transactions in a "full" block.
* Set default ids chunk size to 1500
Having smaller chunks (smaller than 3200 / 2) reduces the probability of merging two chunks and having to prune the result because the encoded reply would be over 65Kb.
* Smarter algo for enforcing max chunk size policy
Instead of keeping either the first or last items, we use a random offset. This way peers will eventually receive info about all channels even if chunks are much larger than the max chunk size and are pruned.
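A sketch of the random-offset pruning (the 65Kb check itself is elided):

```scala
import scala.util.Random

// When a chunk is too big, keep a window of `maxSize` ids starting at a
// random offset. Because the offset is random, reconnecting peers will
// eventually see every channel id even though each reply is pruned.
def enforceMaxChunkSize(ids: Seq[Long], maxSize: Int): Seq[Long] =
  if (ids.size <= maxSize) ids
  else {
    val offset = Random.nextInt(ids.size - maxSize + 1)
    ids.slice(offset, offset + maxSize)
  }
```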
There is currently a backwards-compatibility issue with eclair-mobile.
Eclair-mobile mistakes feature bit 15 (payment_secret) for the
gossip_queries_ex prototype (which is incompatible with the version that was eventually spec-ed).
To temporarily avoid this issue (until eclair-mobile is patched and all users have updated),
we never advertise those ambiguous bits in Init.
They're only really needed in the invoice, so that's ok.
Implement https://github.com/lightningnetwork/lightning-rfc/pull/666
Keep the global/local split in Commitments to avoid backwards incompatibility in the codec.
Remove allowMultiPart API field: we instead rely on the MPP feature being set in nodeParams.
That means MPP-enabled nodes need to update their reference.conf.
Rework features:
* Add types to allow cleaner dependency validation.
* Most of the time we don't care whether a feature is activated as optional or mandatory, which caused duplicate code. This is now handled more cleanly.
* It also paves the way to annotate features with the places they should be advertised (Init vs NodeAnn vs ChannelAnn vs invoice).
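A hedged sketch of the shape this takes (names are illustrative, not eclair's actual types): features become values of a sealed type, dependencies are declared in one place, and validation ignores the optional/mandatory distinction.

```scala
sealed trait Feature { def rfcName: String }
case object VariableLengthOnion extends Feature { val rfcName = "var_onion_optin" }
case object PaymentSecret extends Feature { val rfcName = "payment_secret" }
case object BasicMultiPartPayment extends Feature { val rfcName = "basic_mpp" }

// Feature dependencies declared in a single place (e.g. MPP implies payment secret).
val dependencies: Map[Feature, Set[Feature]] = Map(
  PaymentSecret -> Set(VariableLengthOnion),
  BasicMultiPartPayment -> Set(PaymentSecret)
)

// Valid iff every activated feature has all its dependencies activated,
// whether features were activated as optional or mandatory.
def validateFeatureGraph(activated: Set[Feature]): Either[String, Unit] = {
  val missing = activated.flatMap(f => dependencies.getOrElse(f, Set.empty[Feature]) -- activated)
  if (missing.isEmpty) Right(())
  else Left(s"missing dependencies: ${missing.map(_.rfcName).mkString(", ")}")
}
```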
This is safer for now since the splitting algorithm isn't working
well on nodes with a large number of channels and we don't
expect too many payments from Phoenix to non-Phoenix to
actually need MPP in the short term.
Mockito sometimes throws an unnecessary-stubbing exception; it's unclear whether the test is faulty or mockito has issues with our parallel setup.
Rewriting the switchboard tests without mockito makes them more flexible.
If they randomly fail, we should get more useful data to help with troubleshooting.
Start relaying trampoline payments with multi-part aggregation (disabled by default,
must be enabled with config).
Recovery after a restart is correctly handled, even if payments were being forwarded.
No DB schema update in this commit.
The trampoline UX will be somewhat rough because many improvements and polish are still missing.
Some shortcuts were taken and a few hacks here and there need to be fixed, but nothing too scary.
Those improvements will be done in separate commits before the next release.
Randomization is necessary: otherwise, if two peers attempt to reconnect
to each other in a synchronized fashion, they will enter a
disconnect-reconnect loop.
We already had randomization for the initial reconnection attempt, but
further reconnection attempts were using a deterministic schedule
following an exponential backoff curve.
Fixes #1238.
* Add a configurable time-out to onchain fee provider requests
We configure a timeout of 5 seconds, applicable to all fee providers. If a provider times out, we switch to the next one in our list.
Our mobile app needs a feerate to start properly and currently waits too long when a fee provider is online but very slow to respond.
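A simplified sketch of the behaviour (the trait and blocking style are illustrative; eclair's actual provider API may differ):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

trait FeeProvider { def getFeerate: Future[Long] }

// Try each provider in order, giving each one `timeout` to respond;
// a slow or failing provider just makes us move on to the next one.
def firstFeerate(providers: List[FeeProvider], timeout: FiniteDuration = 5.seconds): Long =
  providers match {
    case Nil => throw new RuntimeException("all fee providers failed")
    case p :: rest =>
      try Await.result(p.getFeerate, timeout)
      catch { case scala.util.control.NonFatal(_) => firstFeerate(rest, timeout) }
  }
```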
With the current algorithm, we can't guarantee that the last HTLC won't be
a small one (the leftovers): for example, splitting 1,000,000 msat into
450,000 msat parts leaves a final 100,000 msat HTLC.
If we see that happen in real scenarios, we'll need to add heuristics to avoid it.
Using the `max()` aggregate function on outgoing payments'
timestamps, we can ensure that the non-aggregated columns
for the outgoing payments contain the most recent (and thus most pertinent) data.
If a chain reorg happened and a new ShortChannelId was assigned,
the `Relayer` kept both entries (new and old).
This resulted in an incorrect balance because we effectively counted this channel twice.
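A simplified sketch of the fix, with made-up types: keying by the stable channelId means a reorg-assigned scid replaces the stale entry instead of being added next to it.

```scala
case class ChannelInfoSketch(channelId: String, shortChannelId: Long, balanceMsat: Long)

// Index by the permanent channelId, not the shortChannelId: after a reorg
// assigns a new scid, the update overwrites the old entry instead of
// leaving two entries for the same channel.
def update(channels: Map[String, ChannelInfoSketch], info: ChannelInfoSketch): Map[String, ChannelInfoSketch] =
  channels + (info.channelId -> info)

def totalBalanceMsat(channels: Map[String, ChannelInfoSketch]): Long =
  channels.values.map(_.balanceMsat).sum
```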
While #1222 was being reviewed, a new unit test was added to OnionCodecsSpec.
It didn't cause any file conflict, so GitHub didn't warn when #1222 was merged.
However, this test needed to be updated to the new truncated-int format.
The spec defines tu64 (and friends) without the length prefix.
Multi-part uses a tu64 without a length prefix inside the PaymentData record.
Our previous implementation only supported using tu64 alone in a TLV record.
We make this more flexible by separating the length encoding.
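A minimal sketch of a truncated uint64 with the length handled separately (assumptions: big-endian, minimal encoding, zero encodes to zero bytes):

```scala
// Encode: big-endian bytes with all leading zero bytes dropped (tu64
// requires minimal encodings, so 0 encodes to an empty byte string).
def encodeTu64(value: Long): Array[Byte] = {
  require(value >= 0, "tu64 values are unsigned")
  BigInt(value).toByteArray.dropWhile(_ == 0)
}

// Decode: the caller decides how many bytes belong to the value (a TLV
// record length, or the remaining bytes of the PaymentData record).
def decodeTu64(bytes: Array[Byte]): Long =
  bytes.foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))
```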
MPP implies payment secret.
Avoid raising exceptions in PaymentInitiator: validate the invoice instead of using a `require`.
This way senders always get a response.
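A sketch of the approach (types simplified): validation returns an error value that can be sent back to the sender, instead of throwing inside the actor.

```scala
case class InvoiceSketch(paymentSecret: Option[String], allowsMultiPart: Boolean)

// Returns Left(error) instead of throwing: the PaymentInitiator can turn
// this into a failure response, so senders always hear back.
def validateInvoice(invoice: InvoiceSketch): Either[String, InvoiceSketch] =
  if (invoice.allowsMultiPart && invoice.paymentSecret.isEmpty)
    Left("multi-part payments require a payment secret") // MPP implies payment secret
  else Right(invoice)
```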
We previously had some logic where we would fail incoming HTLCs
for which we were the final recipient when a channel came online.
That made sense before MPP, but with MPP we cannot do that:
we would risk failing HTLCs that the MPP FSM considers received.
Instead we need to use the CommandBuffer when we are the final recipient.
This way pending commands cannot be lost, and HTLCs are cleaned up on restart.
This includes a bit of refactoring in `MultiPartPaymentLifecycle`. Note
that we can't use the `onTermination` handler to finish the spans,
because it is asynchronous and may not be called for a long time.
That's why we use a dedicated `myStop` function.
In Kamon 2.0, by default spans are automatically generated for tracked
actors, which we don't want because we define our own spans. That's why
there is an additional configuration in `application.conf`.
MPP split/retry improvements:
* Only use public channels when sending to remote node
* Don't retry when sending to direct peer
* Blacklist channels that are a bad route prefix
When paying a multi-part payment, we tell the PaymentLifecycle to use a route prefix that contains the first hop (for example a -> b via channel 1).
We also need to tell the router to ignore the nodes in the route prefix; otherwise, when retrying, it may try some completely dumb routes that have no chance of succeeding.
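A sketch of the route request (field names are made up): the nodes of the route prefix are also fed to the router's ignore list.

```scala
case class RouteRequestSketch(target: String, routePrefix: Seq[String], ignoreNodes: Set[String])

// When a route prefix (e.g. Seq("a", "b")) is imposed, ignore its nodes
// during path-finding so retries can't route back through them.
def buildRouteRequest(target: String, routePrefix: Seq[String]): RouteRequestSketch =
  RouteRequestSketch(target, routePrefix, ignoreNodes = routePrefix.toSet)
```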