This reworks the locking behavior of the Gossiper so that a race
condition on channel updates and block notifications doesn't cause any
loss of messages.
This fixes an issue that manifested mostly as flakes on itests during
WaitForNetworkChannelOpen calls.
The previous behavior allowed ChannelUpdates to be missed if they
happened concurrently to block notifications. The
processNetworkAnnoucement call would check for the current block height,
then lock the gossiper and add the msg to the prematureAnnoucements
list. New blocks would trigger an update to the current block height
then a lock and check of the aforementioned list.
However, specially during itests it could happen that the missing lock
before checking the height could case a race condition if the following
sequence of events happened:
- A new ChannelUpdate message was received and started processing on a
separate goroutine
- The isPremature() call was made and verified that the ChannelUpdate
was in fact premature
- The goroutine was scheduled out
- A new block started processing in the gossiper. It updated the block
height, asked and was granted the lock for the gossiper and verified
there was zero premature announcements. The lock was released.
- The goroutine processing the ChannelUpdate asked for the gossiper lock
and was granted it. It added the ChannelUpdate in the
prematureAnnoucements list. This can never be processed now.
The way to fix this behavior is to ensure that both isPremature checks
done inside processNetworkAnnoucement and best block updates are made
inside the same critical section (i.e. while holding the same lock) so
that they can't both check and update the prematureAnnoucements list
concurrently.
The policy update logic that resided part in the gossiper and
part in the rpc server is extracted into its own object.
This prepares for additional validation logic to be added for policy
updates that would otherwise make the gossiper heavier.
It is also a small first step towards separation of our own channel data
from the rest of the graph.
As a preparation for making the gossiper less responsible for validating
and supplementing local channel policy updates, this commits moves the
on-the-fly max htlc migration up the call tree. The plan for a follow up
commit is to move it out of the gossiper completely for local channel
updates, so that we don't need to return a list of final applied policies
anymore.
In this commit, we fix a bug where if a user updates a forwarding policy to be
zero, the update will be applied to the policy correctly on-disk, but not
in-memory.
We solve this issue by having the gossiper return the list of on-disk updated
policies and passing these policies to the switch, so the switch can assume
that zero-valued fields are intentional and not just uninitialized.
There's no need to broadcast these as we assume that online nodes have
already received them. For nodes that were offline, they should receive
them as part of their initial graph sync.
We do this to ensure the node announcement propagates to our channel
counterparty. At times, the node announcement does not propagate to them
when opening our first channel due to a race condition between
IsPublicNode and processing announcement signatures. This isn't
necessary for channel updates and announcement signatures as we send
those to our channel counterparty directly through the reliable sender.
Since ActiveSync GossipSyncers no longer synchronize our state with the
remote peers, none of the logic surrounding the round-robin is required
within the SyncManager.
In this commit, we extend the gossiper with support for external callers
to provide optional fields that can serve as useful when processing a
specific network announcement. This will serve useful for light clients,
which are unable to obtain the channel point and capacity for a given
channel, but can provide them manually for their own set of channels.
In this commit, we modify the main loop in `processChanPolicyUpdate` to
send updates for private channels directly to the remote peer via the
reliable message sender. This fixes a prior issue where the remote peer
wouldn't receive new updates as this method doesn't go through the
traditional path for channel updates.
In this commit, we introduce a new type: SyncerType. This type denotes
the type of sync a GossipSyncer is currently under. We only introduce
the two possible entry states, ActiveSync and PassiveSync. An ActiveSync
GossipSyncer will exchange channels with the remote peer and receive new
graph updates from them, while a PassiveSync GossipSyncer will not and
will only response to the remote peer's queries.
This commit does not modify the behavior and is only meant to be a
refactor.
In this commit, we address an assumption of the gossiper's recently
introduce reliable sender. The reliable sender is currently only used
for messages of unannounced channels. This makes sense as peers should
be able to retrieve messages from the network if they've previously
announced. However, within isMsgStale, we assumed that the reliable
sender would be used for every ChannelUpdate being sent, even if the
channel is already announced. Due to this, checking if the policy is
stale was unnecessary. But since this isn't the case, we should actually
be checking whether it is stale to prevent sending it later on.
In this commit, we leverage the recently introduced zombie edge index to
quickly reject announcements for edges we've previously deemed as
zombies. Care has been taken to ensure we don't reject fresh updates for
edges we've considered zombies.
In this commit, we also allow channel updates for our channels to be
sent reliably to our channel counterparty. This is especially crucial
for private channels, since they're not announced, in order to ensure
each party can receive funds from the other side.
In this commit, we alter the ValidateChannelUpdateAnn function in
ann_validation to validate a remote ChannelUpdate's message flags
and max HTLC field. If the message flag is set but the max HTLC
field is not set or vice versa, the ChannelUpdate fails validation.
Co-authored-by: Johan T. Halseth <johanth@gmail.com>
In this commit:
* we partition lnwire.ChanUpdateFlag into two (ChanUpdateChanFlags and
ChanUpdateMsgFlags), from a uint16 to a pair of uint8's
* we rename the ChannelUpdate.Flags to ChannelFlags and add an
additional MessageFlags field, which will be used to indicate the
presence of the optional field HtlcMaximumMsat within the ChannelUpdate.
* we partition ChannelEdgePolicy.Flags into message and channel flags.
This change corresponds to the partitioning of the ChannelUpdate's Flags
field into MessageFlags and ChannelFlags.
Co-authored-by: Johan T. Halseth <johanth@gmail.com>
In this commit, we allow the gossiper to also broadcast the
corresponding node announcements, if we know of them, of a channel when
constructing its full proof. We do this to ensure peers (other than our
remote peer) receive all the relevant announcements for a channel.
The tests changes were made to ensure the new behavior introduced works
as intended. Previously, the node announcements for each test channel
announcement were not processed, so they never existed from the
gossiper's point of view.
This also addresses an existing flake in the integration test
`testNodeAnnouncement`. This problem arose due to the node announcement
being sent before the connection between Dave (node announcement sender)
and Alice (node announcement receiver) was initiated and the full
channel proof was constructed.
This commit passes the peer's quit signal to the
gossipSyncer when attempt to hand off gossip query
messages. This allows a rate-limited peer's read
handler to break out immediately, which would
otherwise remain stuck until the rate-limited
gossip syncer pulled the message.
This commit restructures the delivery of gossip
query related messages, such that they are delivered
directly to the gossip syncers. Gossip query rate
limiting was introduced in #1824 on a per-peer basis.
However, since all gossip query messages were being
delivered in the main event loop, the end result is
that one rate-limited peer could stall all other
peers.
In addition, since no other peers would be able to
submit gossip-related messages through the blocked
event loop, the back pressure would eventually rate
limit the read handlers of all peers as well.
The end result would be lengthy delays in reading
messages related to htlc forwarding.
The fix is to lift the delivery of gossip query
messages outside of the main event loop. With
this change, the rate limiting backpressure is
delivered only to the intended peer.
In this commit, we modify the gossiper to no longer broadcast
NodeAnnouncements of nodes who intend to remain private. We do this to
prevent leaking their information to the greater network.
Previously, gossiper was the only object that validated channel
updates. Because updates can also be received as part of a
failed payment session in the routing package, validation logic
needs to be available there too. Gossiper already depends on
routing and having routing call the validation logic inside
gossiper would be a circular dependency. Therefore the validation
was moved to routing.
This commit removes the fallback in fetchGossipSyncer
that creates a gossip syncer if one is not registered
w/in the gossiper. Now that we register gossip syncers
explicitly before reading any gossip query messages,
this should not longer be required. The fallback also
did not honor the cfg.NoChanUpdates flag, which may
have led to inconsistencies between configuration and
actual behavior.
restransmitStaleChannels
In this commit, we add an additional error check for
ErrNoGraphEdgesFound when restransmitting stale channels during the
gossiper's startup. We do this to prevent benign log messages as we'll
log that we were unable to retransmit stale channels when we didn't have
any channels in our graph to begin with.
In this commit, we aim to resolve an issue with nodes requesting for
channel announcements when receiving a channel update for a channel
they're not aware of. This can happen if a node is not caught up with
the chain or if they receive updates for zombie channels. This would
lead to a spam issue, as if a node is not caught up with the chain,
every new update they receive is premature, causing them to manually
request the backing channel announcement. Ideally, we should be able to
detect this as a potential DoS vector and ban the node responsible, but
for now we'll simply remove this functionality.
In this commit, we select on the peer's QuitSignal to allow the caller
to unblock if the peer itself is disconnecting. With this change, we now
ensure that it isn't possible for a peer to block on this method and
prevent a graceful exit.
Previosuly we would immediately return nil on the error channel for
premature ChannelUpdates, which would break the expection that a a
returned non-error meant the update was successfully added to the
database. This meant that the caller would believe the update was added
to the database, while it is actually still in volatile memory and can
be lost during restarts.
This change makes us handle premature ChannelUpdates as we handle other
premature announcements within the gossiper, by deferring sending on the
error channel until we have reprocessed the update.
Previously we wouldn't return anything in the case where the
announcement were meant for a chain we didn't recognize. After this
change we should return an error on the error channel in all flows
within the gossiper.
Corrects an instance that holds a reference to a boltdb
byte slice after returning from the transaction. This
can cause panics under certain conditions, which is
avoided by creating a copy of the key.
In this commit, we allow the gossiper syncer to store the chunk size for
its respective encoding type. We do this to prevent a race condition
that would arise within the unit tests by modifying the values of the
encodingTypeToChunkSize map to allow for easier testing.
In this commit, we fix the logging when adding new gossip syncers. The
old log would log the byte array, rather than the byte slice. We fix
this by slicing before logging.
This commit changes the gossiper to direct messages to
peer objects, instead of sending them through the
server every time. The primary motivation is to reduce
contention on the server's mutex and, more importantly,
avoid deadlocks in the Triangle of Death.
In this commit, we fix an existing deadlock in the
gossiper->server->peer pipeline by ensuring that we're not holding the
syncer mutex while we attempt to have a syncer filter out the rest of
gossip messages.
In this commit we fix an existing bug caused by a scheduling race
condition. We'll now ensure that if we get a gossip message from a peer
before we create an instance for it, then we create one on the spot so
we can service the message. Before this commit, we would drop the first
message, and therefore never sync up with the peer at all, causing them
to miss channel announcements.
In this commit, we extend the AuthenticatedGossiper to take advantage of
the new query features in the case that it gets a channel update w/o
first receiving the full channel announcement. If this happens, we'll
attempt to find a syncer that's fully synced, and request the channel
announcement from it.
In this commit, we create a new concrete implementation for the new
discovery.ChannelGraphTimeSeries interface. We also export the
createChannelAnnouncement method to allow the chanSeries struct to
re-use the existing code for creating wire messages from the database
structs.
In this commit, we update the logic in the AuthenticatedGossiper to
ensure that can properly create, manage, and dispatch messages to any
gossipSyncer instances created by the server.
With this set of changes, the gossip now has complete knowledge of the
current set of peers we're conneted to that support the new range
queries. Upon initial connect, InitSyncState will be called by the
server if the new peer understands the set of gossip queries. This will
then create a new spot in the peerSyncers map for the new syncer. For
each new gossip query message, we'll then attempt to dispatch the
message directly to the gossip syncer. When the peer has disconnected,
we then expect the server to call the PruneSyncState method which will
allow us to free up the resources.
Finally, when we go to broadcast messages, we'll send the messages
directly to the peers that have gossipSyncer instances active, so they
can properly be filtered out. For those that don't we'll broadcast
directly, ensuring we skip *all* peers that have an active gossip
syncer.
In this commit, introduce a new struct, the gossipSyncer. The role of
this struct is to encapsulate the state machine required to implement
the new gossip query range feature recently added to the spec. With this
change, each peer that knows of this new feature will have a new
goroutine that will be managed by the gossiper.
Once created and started, the gossipSyncer will start to progress
through each possible state, finally ending at the chansSynced stage. In
this stage, it has synchronized state with the remote peer, and is
simply awaiting any new messages from the gossiper to send directly to
the peer. Each message will only be sent if the remote peer actually has
a set update horizon, and the message isn't before or after that
horizon.
A set of unit tests has been added to ensure that two state machines
properly terminate and synchronize channel state.
In this commit, we update the testUpdateChannelPolicy to exercise the
recent set of changes within the switch. If one applies this test to a
fresh branch (without those new changes) it should fail. This is due to
the fact that before, Bob would attempt to apply the constraints of the
incoming link (which we updated) instead of the outgoing link. With the
recent set of changes, the test now properly passes.
In this commit, we fix an existing deadlock in the
processChanPolicyUpdate method. Before this commit, within
processChanPolicyUpdate, we would directly call updateChannel *within*
the ForEachChannel closure. This would at times result in a deadlock, as
updateChannel will itself attempt to create a write transaction in order
to persist the newly updated channel.
We fix this deadlock by simply performing another loop once we know the
set of channels that we wish to update. This second loop will actually
update the channels on disk.
In this commit, we reduce the amount of unnecessary work that the
gossiper can carry out. When CPU profiling some nodes, I noticed that
we’d spend a lot of time validating the signatures for an announcement,
only to realize that the router already had it.
To remedy this, we’ll use the new methods added to the channel router
in order to avoid unnecessarily validating an announcement that is
actually stale. This should reduce memory usage (since it uses big
int’s under the scenes), and also idle CPU usage.
In order to reduce high CPU utilization during the initial network view
sync, we slash down the total number of active in-flight jobs that can
be launched.
This commit uses the multimutex.Mutex to esure database
state stays consistent when handling an announcement, by
restricting access to one goroutine per channel ID.
This fixes a bug where the goroutine would read the
database, make some decisions based on what was read,
then write its data to the database, but the read data
would be outdated at this point. For instance, an
AuthProof could have been added between reading the
database and when the decision whether to announce
the channel is made, making it not announce it.
Similarly, when receiving the AuthProof, the edge
policy could be added between reading the edge state
and adding the proof to the database, also resulting
in the edge not being announced.
This commit fixes a bug that could cause annoucements
to get lost, and resultet in flaky integration tests.
After a set of announcements was broadcastet, we would
reset (clear) the announcement batch, making any
annoucement that was added between the call to Emit()
and Reset() to be deleted, without ever being broadcast.
We can just remove the Reset() call, as the batch will
actually be reset within the call to Emit(), making
the previous call only delete those messages we hadn't
sent yet.
We no longer need to hand off new channels that come online as the
chainWatcher will be persistent, and always have an active signal for
the entire lifetime of the channel.
This commit ensures that we always increment the timestamp of
ChannelUpdates we send telling the network about changes to
our channel policy. We do this because it could happen
(especially during tests) that we issued an update, but the
ChannelUpdate would have the same timestamp as our last
ChannelUpdate, and would be ignored by the network.
This commit makes the gossiper aware of the timestamps
of ChannelUpdates and NodeAnnouncements, such that it
only keeps the newest message when deduping. Earlier
it would always keep the last received message, which
in some cases could be outdated.
This commit makes sure we are not attempting to create a
channel announcement with a nil ChannelAuthProof, as that
could cause a crash at startup whe the gossiper would
attempt to reprocess an edge coming from the fundingmanager.
It also makes sure we check the correct error returned from
processRejectedEdge.
In this commit, we make an incremental step towards page of the new
feature of deDupedAnnoucnements to return the set of senders for each
message. All methods the process new channel announcements, will now
return an instance of networkMsg rather than lnwire.Message. This will
allow passing the returned announcement directly into
deDupedAnnoucnements.AddMsg().
In this commit, we modify the deDupedAnnouncements struct slightly. The
element of this struct will now keep track of the set of senders that
sent a particular message. Each time a message is added, we’ll replace
the new message with the old (as normal), but we’ll also add the new
sender to the set of known senders.
With this new feature, we’ll be able to avoid re-sending a message to
the peer that sent it to us in the first place.
This commit makes the gossiper track the state of a local
AnnounceSignature message, such that it can retry sending
it to the remote peer if needed. It will also persist this
state in the WaitingProofStore, such that it can resume
from this state at startup.
This commit makes the gossiper store received ChannelUpdates
that is not for any known channel in a map, such that they
can be reprocessed when the ChannelAnnouncement arrives.
This is done to handle the case where we receive a ChannelUpdate
from our channel counterparty before we have been able to process
our own local ChannelAnnouncement.
This commit adds some comments and does some cleanup
of the logic that makes sure non-public channels
(channels with no AuthProof) are not broadcasted
to the network.
This commit fixes an existing bug wherein we would blank out a node’s
color instead of properly setting the field when syncing graph state
with another node This would cause the node to reject the node
announcement and we would generate an we would invalidate the signature
of the node announcement. We fix this simply by properly setting the
node announcement.
In this commit, we fix an existing bug when processing new node
announcement. Before this commit, we wouldn’t also copy over the color
field of a node’s announcement. As a result, when went to synchronize
our graph state with that of a connecting peer, we would generate an
invalid node announcement. We fix this by properly setting the color
field of a node.
In this commit, we now properly examine the Flag field within the
ChannelUpdate message as a bitfield. Before this commit we would
manually check the flags for zero or one. This was incorrect as a their
bit has now been defined. To properly dispatch the messages, we’ll now
treat it properly as a bitmask.
Add option to set trickleDelay for AuthenticatedGossiper in
command line, with default value of 300 milliseconds. Pass this
value to newServer, which uses it when creating a new instance of
AuthenticatedGossiper. Also set this value to 300 milliseconds when
creating nodes in integration tests.
For Part 1 of Issue #275. Create isolated private struct in
networkHandler goroutine that will de-duplicate
announcements added to the batch. The struct contains maps
for each of channel announcements, channel updates, and
node announcements to keep track of unique announcements.
The struct has a Reset method to reset stored announcements, an
AddMsg(lnwire.Message) method to add a new message to the current
batch, and a Batch method to return the set of de-duplicated
announcements.
Also fix a few minor typos.
In this commit, we fix an existing bug that could result in a panic if
we received a ChannelUpdate message with an unknown set of flags. If
the flag wasn’t set to zero or one, then the pubKey parameter would be
still nil when we attempted to validate it, causing an error to occur.
We remedy this by instead returning an error if the flags are unknown.
In a future commit, we will properly handle the set of flags that
indicates the channel should be disabled.
This commit refactors the SynchronizeNode logic such that
it can be called without interacting with the gossiper's
main execution loop. This method does not require access
to any of the gossiper's internal state, making the change
fairly straightforward. The primary motivation behind
this change is to minimize the possibility of introducing
deadlock scenarios between the gossiper and server.
This commit fixes a slight bug in the announcement processing logic
within the AuthenticatedGossiper. Before this commit, it was possible
for us to ignore one of our now announcements due to it being
pre-mature, rendering (atm) the channel unusable by the ChannelRouter
itself. To fix this, we know only check for a pre mature announcement
iff the message is coming from a remote node.
This commit adds an additional return value to the updateChannel
method. We also now return the original ChannelAnnouncement, this can
be useful as it let’s the caller ensure that the channel announcement
will be broadcast along side the new channel update.
This commit does two things. First we fix a deadlock bug within boltdb
that could arise when the channel router attempted to open a
transaction, while the retransmission loop was attempting to send a
message to the channel router (circular wait). Second, we’ve refactored
out the retransmission into its own function. This allows us to kick
off retransmission as soon as the AuthenticatedGossiper is created.
This commit modifies the recently modified logic for self-channel
retransmission to exclude pruning *our* channels which haven’t been
updated since the broadcastInterval. Instead, we only re-broadcast
channels of ours that haven’t been updated in 24 hours.