Because we handle messages (which can take some time, persisting
things to disk or validating cryptographic signatures) while holding
the top-level read lock, but require the top-level write lock to
connect new peers or handle disconnection, we are particularly
sensitive to writer starvation issues.
Rust's libstd RwLock does not provide any fairness guarantees,
using whatever the OS provides as-is. On Linux, pthreads defaults
to starving writers, which Rust's RwLock exposes to us (without
any configurability).
Here we work around that issue by blocking readers if there are
pending writers, optimizing for readable code over
perfectly-optimized blocking.
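
The workaround described above can be sketched roughly as follows. This
is a minimal illustration, not necessarily the code in this change: the
type name `FairRwLock` and the field name `waiting_writers` are
assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{LockResult, RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Hypothetical wrapper which makes new readers queue behind pending writers.
pub struct FairRwLock<T> {
    lock: RwLock<T>,
    /// Number of threads currently blocked waiting for the write lock.
    waiting_writers: AtomicUsize,
}

impl<T> FairRwLock<T> {
    pub fn new(value: T) -> Self {
        Self { lock: RwLock::new(value), waiting_writers: AtomicUsize::new(0) }
    }

    pub fn write(&self) -> LockResult<RwLockWriteGuard<'_, T>> {
        // Announce that a writer is pending before blocking on the lock.
        self.waiting_writers.fetch_add(1, Ordering::Relaxed);
        let res = self.lock.write();
        self.waiting_writers.fetch_sub(1, Ordering::Relaxed);
        res
    }

    pub fn read(&self) -> LockResult<RwLockReadGuard<'_, T>> {
        if self.waiting_writers.load(Ordering::Relaxed) != 0 {
            // A writer is pending: briefly take (and immediately drop) the
            // write lock so that this reader waits in the same queue as the
            // writers instead of jumping ahead of them.
            let _write_queue_lock = self.lock.write();
        }
        self.lock.read()
    }
}
```

Routing blocked readers through the write lock is not the cheapest
option, but it keeps the wrapper short and easy to reason about.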
Only one instance of PeerManager::process_events can run at a time,
and each run always finishes all available work before returning.
Thus, having several threads blocked on the process_events lock
doesn't accomplish anything but blocking more threads.
Here we limit the number of blocked calls on process_events to two
- one processing events and one blocked at the top which will
process all available events after the first completes.
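
A rough sketch of the two-caller limit, assuming a mutex guarding the
processing itself plus a flag marking that one caller is already
waiting. The names `EventProcessor`, `processing_lock` and
`waiter_present` are hypothetical and chosen only for this example.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Mutex;

pub struct EventProcessor {
    /// Held by the single thread that is currently processing events.
    processing_lock: Mutex<()>,
    /// Set while one additional thread is blocked waiting for its turn.
    waiter_present: AtomicBool,
    /// Counts completed processing runs (only here to observe behaviour).
    runs: AtomicUsize,
}

impl EventProcessor {
    pub fn new() -> Self {
        Self {
            processing_lock: Mutex::new(()),
            waiter_present: AtomicBool::new(false),
            runs: AtomicUsize::new(0),
        }
    }

    /// Returns `false` if the call returned early because another thread was
    /// already queued to process all available events.
    pub fn process_events(&self) -> bool {
        let first_try = self.processing_lock.try_lock().ok();
        let _guard = match first_try {
            Some(guard) => guard,
            None => {
                // Someone is processing. Only one thread may wait behind them.
                if self
                    .waiter_present
                    .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
                    .is_err()
                {
                    return false;
                }
                let guard = self.processing_lock.lock().unwrap();
                // We are now the processor, so the waiting slot is free again.
                self.waiter_present.store(false, Ordering::Release);
                guard
            }
        };
        // ... process all available events here ...
        self.runs.fetch_add(1, Ordering::Relaxed);
        true
    }

    pub fn runs(&self) -> usize {
        self.runs.load(Ordering::Relaxed)
    }
}
```

Any caller that finds both slots taken can return immediately, since
the waiting thread will pick up whatever work that caller would have
done.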
Because the peers write lock "blocks the world", and happens after
each read event, always taking the write lock has pretty severe
impacts on parallelism. Instead, here, we only take the global
write lock if we have to disconnect a peer.
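
As an illustration of this locking pattern, the sketch below processes
a message under the global read lock plus a per-peer mutex, and only
escalates to the global write lock when the peer has to be removed.
The `Peers` and `Peer` types, the `u64` peer id and the `msg_ok` flag
are stand-ins invented for the example.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

pub struct Peer {
    pub messages_handled: usize,
}

pub struct Peers {
    peers: RwLock<HashMap<u64, Mutex<Peer>>>,
}

impl Peers {
    pub fn new() -> Self {
        Self { peers: RwLock::new(HashMap::new()) }
    }

    /// Connecting changes the map itself, so it needs the write lock.
    pub fn connect(&self, id: u64) {
        self.peers.write().unwrap().insert(id, Mutex::new(Peer { messages_handled: 0 }));
    }

    /// Returns `true` if the message was handled, `false` if the peer is
    /// unknown or had to be disconnected.
    pub fn read_event(&self, id: u64, msg_ok: bool) -> bool {
        {
            // Common case: only the read lock and this peer's own mutex.
            let peers = self.peers.read().unwrap();
            let peer_mutex = match peers.get(&id) {
                Some(p) => p,
                None => return false,
            };
            let mut peer = peer_mutex.lock().unwrap();
            if msg_ok {
                peer.messages_handled += 1;
                return true;
            }
        } // Both guards are dropped here, before the write lock is taken.
        // Rare case: take the "blocks the world" write lock to disconnect.
        self.peers.write().unwrap().remove(&id);
        false
    }

    pub fn peer_count(&self) -> usize {
        self.peers.read().unwrap().len()
    }

    pub fn messages_handled(&self, id: u64) -> Option<usize> {
        let peers = self.peers.read().unwrap();
        let handled = peers.get(&id).map(|p| p.lock().unwrap().messages_handled);
        handled
    }
}
```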
Users are required to only ever call `read_event` serially
per-peer, thus we actually don't need any locks while we're
processing messages - we can only be processing messages in one
thread per-peer.
That said, we do need to ensure that another thread doesn't
disconnect the peer we're processing messages for, as that could
result in a peer_disconnected call while we're processing a
message for the same peer - somewhat nonsensical.
This significantly improves parallelism, especially during gossip
processing as it avoids waiting on the entire set of individual
peer locks to forward a gossip message while several other threads
are validating gossip messages with their individual peer locks
held.
This adds the required locking to process messages from different
peers simultaneously in `PeerManager`. Note that channel messages
are still processed under a global lock in `ChannelManager`, and
most work is still processed under a global lock in gossip message
handling, but parallelizing message deserialization and message
decryption is somewhat helpful.
Type aliases are now more robustly being exported in the C bindings
generator, which requires ensuring we don't include some type
aliases which make no sense in bindings.
On connection, if our peer supports gossip queries, and we never
send a `gossip_timestamp_filter`, our peer is supposed to never
send us gossip outside of explicit queries. Thus, we'll end up
always having stale gossip information after the first few
connections we make to peers.
The solution is to send a dummy `gossip_timestamp_filter`
immediately after connecting to peers.
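
For reference, a sketch of what encoding such a message could look
like, following the BOLT 7 layout of `gossip_timestamp_filter` (type
265, then `chain_hash`, `first_timestamp` and `timestamp_range`). The
struct and `encode` method are illustrative only. Which field values
the dummy filter should carry is not stated here, so they are left to
the caller.

```rust
/// Message type of `gossip_timestamp_filter` as defined in BOLT 7.
pub const GOSSIP_TIMESTAMP_FILTER_TYPE: u16 = 265;

/// Illustrative stand-in for the real message struct.
pub struct GossipTimestampFilter {
    pub chain_hash: [u8; 32],
    pub first_timestamp: u32,
    pub timestamp_range: u32,
}

impl GossipTimestampFilter {
    /// Serializes the message with its 2-byte type prefix, big-endian.
    pub fn encode(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(2 + 32 + 4 + 4);
        out.extend_from_slice(&GOSSIP_TIMESTAMP_FILTER_TYPE.to_be_bytes());
        out.extend_from_slice(&self.chain_hash);
        out.extend_from_slice(&self.first_timestamp.to_be_bytes());
        out.extend_from_slice(&self.timestamp_range.to_be_bytes());
        out
    }
}
```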
It's somewhat strange to have a trait method which is named after
the intended action, rather than the action that occurred, leaving
it up to the implementor what action they want to take.
In 2d3a210897, we increased the
default ping timer in `lightning-background-processor` to ten
seconds from five. However, we didn't change the timer count at
which we disconnect peers if they're not responding, which we
likely should have done. We do so here, as well as update the
documentation for `PeerManager::timer_tick_occurred` to suggest
always ticking the timer every ten seconds instead of five.
Quite some time ago, `UnknownRequiredFeature` was only used when a
gossip message has a missing required feature. These days, it's also
used for any required TLV which we do not understand in any
message. However, the handling of it was never updated in
`PeerManager`, leaving it printing a warning about gossip and
ignoring the message entirely.
Instead, we send a warning message and disconnect.
Closes #1236, as caught by @jkczyz.
Even if our gossip hasn't changed, we should be willing to
re-broadcast it to our peers. All our peers may have been
disconnected the last time we broadcasted it.
When a `ChannelUpdate` message is generated for broadcast as a part
of a `BroadcastChannelAnnouncement` event, it may be newer than our
previous `ChannelUpdate` and need to be broadcast. However, if the
`ChannelAnnouncement` had already been seen we wouldn't
re-broadcast either message as the `handle_channel_announcement`
call would fail, short-circuiting the condition to broadcast both.
Instead, we split the broadcast of each message as well as the
conditional so that we always attempt to handle each message and
update our local graph state, then broadcast the message if its
update was processed successfully.
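
The difference between the old short-circuiting condition and the
split version can be sketched with a toy graph. Everything here (the
`Graph` type, its fields and the string labels) is hypothetical and
only mirrors the control flow described above.

```rust
use std::collections::{HashMap, HashSet};

pub struct Graph {
    announced: HashSet<u64>,
    latest_update: HashMap<u64, u32>,
}

impl Graph {
    pub fn new() -> Self {
        Self { announced: HashSet::new(), latest_update: HashMap::new() }
    }

    /// Fails if the announcement has already been seen.
    pub fn handle_channel_announcement(&mut self, scid: u64) -> Result<(), &'static str> {
        if self.announced.insert(scid) { Ok(()) } else { Err("already seen") }
    }

    /// Fails unless the update is newer than the one we already have.
    pub fn handle_channel_update(&mut self, scid: u64, timestamp: u32) -> Result<(), &'static str> {
        match self.latest_update.get(&scid) {
            Some(&prev) if prev >= timestamp => Err("stale update"),
            _ => {
                self.latest_update.insert(scid, timestamp);
                Ok(())
            }
        }
    }
}

/// Returns the names of the messages that would be broadcast.
pub fn broadcast_channel_announcement(graph: &mut Graph, scid: u64, timestamp: u32) -> Vec<&'static str> {
    let mut to_broadcast = Vec::new();
    // Each message is handled and judged on its own, rather than
    // `announcement.is_ok() && update.is_ok()` gating both.
    if graph.handle_channel_announcement(scid).is_ok() {
        to_broadcast.push("channel_announcement");
    }
    if graph.handle_channel_update(scid, timestamp).is_ok() {
        to_broadcast.push("channel_update");
    }
    to_broadcast
}
```

With this shape, a newer update for an already-announced channel is
still broadcast, while the repeated announcement is skipped.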
NetworkGraph is owned by NetGraphMsgHandler, but DefaultRouter requires
a reference to it. Introduce shared ownership to NetGraphMsgHandler so
that both can use the same NetworkGraph.
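
One way to picture the shared ownership is with `Arc`, as in the
sketch below. The concrete mechanism used by this change is not
spelled out here, and all types in the example (`GossipHandler`,
`Router`, and the simplified `NetworkGraph`) are stand-ins rather than
the real ones.

```rust
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the real network graph.
pub struct NetworkGraph {
    channels: Mutex<Vec<u64>>,
}

impl NetworkGraph {
    pub fn new() -> Self {
        Self { channels: Mutex::new(Vec::new()) }
    }
    pub fn add_channel(&self, scid: u64) {
        self.channels.lock().unwrap().push(scid);
    }
    pub fn channel_count(&self) -> usize {
        self.channels.lock().unwrap().len()
    }
}

/// Updates the graph from gossip messages.
pub struct GossipHandler {
    graph: Arc<NetworkGraph>,
}

impl GossipHandler {
    pub fn new(graph: Arc<NetworkGraph>) -> Self {
        Self { graph }
    }
    pub fn handle_channel_announcement(&self, scid: u64) {
        self.graph.add_channel(scid);
    }
}

/// Reads the same graph when finding routes.
pub struct Router {
    graph: Arc<NetworkGraph>,
}

impl Router {
    pub fn new(graph: Arc<NetworkGraph>) -> Self {
        Self { graph }
    }
    pub fn known_channels(&self) -> usize {
        self.graph.channel_count()
    }
}
```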
This ensures we don't let a hung connection stick around forever if
the peer never completes the initial handshake.
This also resolves a race where, on receiving a second connection
from a peer, we may reset their_node_id to None to prevent sending
messages even though the `channel_encryptor`
`is_ready_for_encryption()`. Sending pings only checks the
`channel_encryptor` status, not `their_node_id` resulting in an
`unwrap` on `None` in `enqueue_message`.
Associated types in C bindings is somewhat of a misnomer - we
concretize each trait to a single struct. Thus, different trait
implementations must still have the same type, which defeats the
point of associated types.
In this particular case, however, we can reasonably special-case
the `Infallible` type, as an instance of it existing implies
something has gone horribly wrong.
In order to help our bindings code figure out how to do so when
referencing a parent trait's associated type, we specify the
explicit type in the implementation method signature.
When we landed custom messages, we used the empty tuple for the
custom message type for `IgnoringMessageHandler`. This was fine,
except that we also implemented `Writeable` to panic when writing
a `()`. Later, we added support for anchor output construction in
CommitmentTransaction, signified by setting a field to `Some(())`,
which is serialized as-is.
This causes us to panic when writing a `CommitmentTransaction`
with `opt_anchors` set. Note that we never set it inside of LDK,
but downstream users may.
Instead, we implement `Writeable` to write nothing for `()` and use
`core::convert::Infallible` for the default custom message type as
it is, appropriately, unconstructable.
This also makes it easier to implement various things in bindings,
as we can always assume `Infallible`-conversion logic is
unreachable.
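
A condensed sketch of the idea, using a simplified `Writeable` trait
that appends to a `Vec<u8>`. The real trait signature differs, and the
one-byte presence marker used for `Option` here is an assumption made
purely for illustration.

```rust
use core::convert::Infallible;

/// Simplified stand-in for the real serialization trait.
pub trait Writeable {
    fn write(&self, out: &mut Vec<u8>);
}

/// The unit type carries no data, so writing it emits nothing.
impl Writeable for () {
    fn write(&self, _out: &mut Vec<u8>) {}
}

/// `Infallible` has no values, so this body can never actually run.
impl Writeable for Infallible {
    fn write(&self, _out: &mut Vec<u8>) {
        match *self {}
    }
}

/// Hypothetical encoding: one presence byte followed by the value.
impl<T: Writeable> Writeable for Option<T> {
    fn write(&self, out: &mut Vec<u8>) {
        match self {
            Some(value) => {
                out.push(1);
                value.write(out);
            }
            None => out.push(0),
        }
    }
}
```

With these impls, a field set to `Some(())` serializes without
panicking, and a type which needs an unconstructable message type can
use `Infallible` instead of `()`.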
In order to avoid significant malloc traffic, messages previously
explicitly stated their serialized length allowing for Vec
preallocation during the message serialization pipeline. This added
some amount of complexity in the serialization code, but did avoid
some realloc() calls.
Instead, here, we drop all the complexity in favor of a fixed 2KiB
buffer for all message serialization. This should not only be
simpler with a similar reduction in realloc() traffic, but also
may reduce heap fragmentation by allocating identically-sized
buffers more often.
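
In outline, the fixed-size approach looks something like the
following. The constant name `MSG_BUFF_SIZE` and the `encode_msg`
helper are made up for this sketch.

```rust
/// Capacity reserved up front for every serialized message (2KiB).
pub const MSG_BUFF_SIZE: usize = 2048;

/// Serializes a 2-byte big-endian message type followed by the payload into
/// a buffer preallocated with the fixed capacity.
pub fn encode_msg(msg_type: u16, payload: &[u8]) -> Vec<u8> {
    // No per-message length calculation: messages under 2KiB never
    // reallocate, and larger ones simply fall back to Vec's normal growth.
    let mut buf = Vec::with_capacity(MSG_BUFF_SIZE);
    buf.extend_from_slice(&msg_type.to_be_bytes());
    buf.extend_from_slice(payload);
    buf
}
```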
MessageSendEvent::PaymentFailureNetworkUpdate served as a hack to pass
an HTLCFailChannelUpdate from ChannelManager to NetGraphMsgHandler via
PeerManager. Instead, remove the event entirely and move the contained
data (renamed NetworkUpdate) to Event::PaymentFailed to be processed by
an event handler.
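
The resulting data flow might look like the sketch below, where an
event handler applies the update to the graph directly. The enum
variants, fields and `Graph` type are simplified placeholders and
should not be read as the actual definitions.

```rust
/// Placeholder for the renamed update type.
pub enum NetworkUpdate {
    ChannelClosed { short_channel_id: u64 },
    NodeFailure { node_id: u64 },
}

/// Placeholder event carrying the optional update.
pub enum Event {
    PaymentFailed { payment_hash: [u8; 32], network_update: Option<NetworkUpdate> },
}

#[derive(Default)]
pub struct Graph {
    pub closed_channels: Vec<u64>,
    pub failed_nodes: Vec<u64>,
}

/// Event handler which forwards any contained update to the graph.
pub fn handle_event(graph: &mut Graph, event: &Event) {
    match event {
        Event::PaymentFailed { network_update: Some(update), .. } => match update {
            NetworkUpdate::ChannelClosed { short_channel_id } => {
                graph.closed_channels.push(*short_channel_id)
            }
            NetworkUpdate::NodeFailure { node_id } => graph.failed_nodes.push(*node_id),
        },
        // A failure without an update leaves the graph untouched.
        Event::PaymentFailed { network_update: None, .. } => {}
    }
}
```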