AFAICT it's not possible to flip back from bein synced_to_chain, so we
remove the underlying call that could reflect this. The method is moved
into the test file since it's still used to test correctness of other
portions of the flow.
Rather than performing this call in the SyncManager, we give each
gossipSyncer the ability to mark the first sync completed. This permits
pinned syncers to contribute towards the rpc-level synced_to_graph
value, allowing the value to be true after the first pinned syncer or
regular syncer complets. Unlinke regular syncers, pinned syncers can
proceed in parallel possibly decreasing the waiting time if consumers
rely on this field before proceeding to load their application.
A pinned syncer is an ActiveSyncer that is configured to always remain
active for the lifetime of the connection. Pinned syncers do not count
towards the total NumActiveSyncer count, which are rotated periodically.
This features allows nodes to more tightly synchronize their routing
tables by ensuring they are always receiving gossip from distinguished
subset of peers.
We do this instead of using the source of the AnnounceSignatures
message, as we filter out the source when broadcasting any
announcements, leading to the remote node not receiving our channel
update. Note that this is done more for the sake of correctness and to
address a flake within the integration tests, as channel updates are
sent directly and reliably to channel counterparts.
As similarly done with premature channel announcements, we'll no longer
allow premature channel updates to be rebroadcast once mature. This is
no longer necessary as channel announcements that we're not aware of are
usually broadcast to us with their accompanying channel updates.
In this commit, we add a new option to toggle gossip rate limiting. This
new option can be useful in contexts that require near instant
propagation of gossip messages like integration tests.
This change was largely motivated by an increase in high disk usage as a
result of channel update spam. With an in memory graph, this would've
gone mostly undetected except for the increased bandwidth usage, which
this doesn't aim to solve yet. To minimize the effects to disks, we
begin to rate limit channel updates in two ways. Keep alive updates,
those which only increase their timestamps to signal liveliness, are now
limited to one per lnd's rebroadcast interval (current default of 24H).
Non keep alive updates are now limited to one per block per direction.
This allows for a 1000 different validation operations to proceed
concurrently. Now that we are batching operations at the db level, the
average number of outstanding requests will be higher since the commit
latency has increased. To compensate, we allow for more outstanding
requests to keep the gossiper busy while batches are constructed.
Similarly as with kvdb.View this commits adds a reset closure to the
kvdb.Update call in order to be able to reset external state if the
underlying db backend needs to retry the transaction.
This commit adds a reset() closure to the kvdb.View function which will
be called before each retry (including the first) of the view
transaction. The reset() closure can be used to reset external state
(eg slices or maps) where the view closure puts intermediate results.
This commit moves all localized instances of mock implementations of
the Signer interface to the lntest/mock package. This allows us to
remove a lot of code and have it housed under a single interface in
many cases.
Modifies syncer.replyChanRangeQuery method to use the LastBlockHeight
method on the query. LastBlockHeight safely calculates the ending
block height and prevents an overflow of start_block + num_blocks.
Prior to this change, query messages that had a start_block +
num_blocks that overflows uint32_max would return zero results in the
reply message.
Tests are added to fix the bug and ensure proper start and end values
are supplied to the channel graph filter.
This reworks the locking behavior of the Gossiper so that a race
condition on channel updates and block notifications doesn't cause any
loss of messages.
This fixes an issue that manifested mostly as flakes on itests during
WaitForNetworkChannelOpen calls.
The previous behavior allowed ChannelUpdates to be missed if they
happened concurrently to block notifications. The
processNetworkAnnoucement call would check for the current block height,
then lock the gossiper and add the msg to the prematureAnnoucements
list. New blocks would trigger an update to the current block height
then a lock and check of the aforementioned list.
However, specially during itests it could happen that the missing lock
before checking the height could case a race condition if the following
sequence of events happened:
- A new ChannelUpdate message was received and started processing on a
separate goroutine
- The isPremature() call was made and verified that the ChannelUpdate
was in fact premature
- The goroutine was scheduled out
- A new block started processing in the gossiper. It updated the block
height, asked and was granted the lock for the gossiper and verified
there was zero premature announcements. The lock was released.
- The goroutine processing the ChannelUpdate asked for the gossiper lock
and was granted it. It added the ChannelUpdate in the
prematureAnnoucements list. This can never be processed now.
The way to fix this behavior is to ensure that both isPremature checks
done inside processNetworkAnnoucement and best block updates are made
inside the same critical section (i.e. while holding the same lock) so
that they can't both check and update the prematureAnnoucements list
concurrently.
The linter complains about not checking the return value from
WipeChannel in certain places. Instead of checking we simply remove the
returned error because the in-memory modifications cannot fail.
If the provided ChainHash in a QueryChannelRange message does not match
that of our current chain, then we should send a blank response, rather
than reply with channels for the wrong chain.
We move from our legacy way of interpreting ReplyChannelRange messages
which was incorrect. Previously, we'd rely on the Complete field of the
ReplyChannelRange message to determine when our peer had sent all of
their replies. Now, we properly adhere to the specification by
interpreting the block ranges of these messages as intended.
Due to the large number of nodes deployed with the previous method, we
still maintain and detect when we are communicating with them, such that
we are still able to sync with them for backwards compatibility.
It's not possible to send another reply once all replies have been sent
without another request. The purpose of the check is also done within
another test, TestGossipSyncerReplyChanRangeQueryNoNewChans, so it can
be removed from here.
In order to properly adhere to the spec, when handling a
QueryChannelRange message, we must reply with a series of
ReplyChannelRange messages, that when consumed together cover the
entirety of the block range requested.
In this commit we fix in a bug in `lnd` that could cause other
implementations which implement a strict version of the spec to
disconnect when trying to sync their channel graph using the gossip
query feature. Before this commit, we would embed the request to a
`QueryChannelRange` in the response, causing some clients to reject the
response as the `FirstBlockHeight` and `NumBlocks` field would be
identical for each chunk of the response.
In order to remedy this, we now properly set these two fields with each
returned chunk. Note that even after this commit, we keep our existing
behavior surrounding the `Complete` field as is. Otherwise, current
`lnd` clients which rely on this field (rather than the two
aforementioned fields) wouldn't be able to properly detect when a set of
responses to their query was "complete".
Partially fixes#3728.
The policy update logic that resided part in the gossiper and
part in the rpc server is extracted into its own object.
This prepares for additional validation logic to be added for policy
updates that would otherwise make the gossiper heavier.
It is also a small first step towards separation of our own channel data
from the rest of the graph.
As a preparation for making the gossiper less responsible for validating
and supplementing local channel policy updates, this commits moves the
on-the-fly max htlc migration up the call tree. The plan for a follow up
commit is to move it out of the gossiper completely for local channel
updates, so that we don't need to return a list of final applied policies
anymore.
In this commit, we fix a bug where if a user updates a forwarding policy to be
zero, the update will be applied to the policy correctly on-disk, but not
in-memory.
We solve this issue by having the gossiper return the list of on-disk updated
policies and passing these policies to the switch, so the switch can assume
that zero-valued fields are intentional and not just uninitialized.
There's no need to broadcast these as we assume that online nodes have
already received them. For nodes that were offline, they should receive
them as part of their initial graph sync.
This ensures that the graph synced status is marked true at some point
once a historical sync has completed. Before this commit, a stalled
historical sync could cause us to never mark the graph as synced.
In this commit, we modify the `SampleNodeAddrs` method to no longer
retry itself. Instead, we'll now leave this task to the caller of the
this method. Additionally, we'll no longer return with an error if we
can't hit a particular seed. Instead, we'll log the error and move onto
the next seed. Finally, we'll also no longer require that the DNS seed
has a secondary seed in order to support a wider array of DNS seeds.
We do this to ensure the node announcement propagates to our channel
counterparty. At times, the node announcement does not propagate to them
when opening our first channel due to a race condition between
IsPublicNode and processing announcement signatures. This isn't
necessary for channel updates and announcement signatures as we send
those to our channel counterparty directly through the reliable sender.
TestSyncManagerHistoricalSyncOnReconnect tests that the sync manager will
re-trigger a historical sync when a new peer connects after a historical
sync has completed, but we have lost all peers.
To handle the case where we have been without peers, and get a new
connection, we reset the historical scan booleans when the first active
syncer is connected to trigger another historical sync.
Prior to this change, the numQueryResponses that we calculated would be
one more than what we actually wanted since it didn't account for the
initial QueryChannelRange msg. This resulted in the test sending one
extra delayed query than was configured. This doesn't fundamentally
impact the test, but does make what happens in the test more reflective
of the configuration.
This commit makes all replies in the gossip syncer synchronous, meaning
that they will wait for each message to be successfully written to the
remote peer before attempting to send the next. This helps throttle
messages the remote peer has requested, preventing unintended
disconnects when the remote peer is slow to process messages. This
changes also helps out congestion in the peer by forcing the syncer to
buffer the messages instead of dumping them into the peer's queue.
This commit creates a distinct replyHandler, completely isolating the
requesting state machine from the processing of queries from the remote
peer. Before the two were interlaced, and the syncer could only reply to
messages in certain states. Now the two will be complete separated,
which is preliminary step to make the replies synchronous (as otherwise
we would be blocking our own requesting state machine).
With this changes, the channelGraphSyncer of each peer will drive the
replyHanlder of the other. The two can now operate independently, or
even spun up conditionally depending on advertised support for gossip
queries, as shown below:
A B
channelGraphSyncer ---control-msg--->
replyHandler
channelGraphSyncer <--control-msg----
gossiper <--gossip-msgs----
<--control-msg---- channelGraphSyncer
replyHandler
---control-msg---> channelGraphSyncer
---gossip-msgs---> gossiper
In this commit, we begin to queue any active syncers until the initial
historical sync has completed. We do this to ensure we can properly
handle any new channel updates at tip. This is required for fresh nodes
that are syncing the channel graph for the first time. If we begin
accepting updates at tip while the initial historical sync is still
ongoing, then we risk not processing certain updates since we've yet to
learn of the channels themselves.
In this commit, we add logic to handle a peer with whom we're performing
an initial historical sync disconnecting. This is required to ensure we
get as much of the graph as possible when starting a fresh node. It will
also serve useful to ensure we do not get stalled once we prevent active
GossipSyncers from starting until the initial historical sync has
completed.