There's actually a bug in our closing tx size estimation; I'll do
a separate patch for this, though.
Seems this used to be flaky, now we always flush queues, so it's
more reliably caught.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We used to shut down peers atomically, but now we flush the
connections there's a delay. If we are asked to connect in that time,
we ignore it, as we are already connected, but that's wrong: we need
to remember that we were told to connect and reconnect.
This should solve a few weird test failures where "connect" would hang
indefinitely.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We seem to hit a race between manual reconnect (with address hint) and an automatic
reconnection attempt which fails:
```
> l4.rpc.connect(l3.info['id'], 'localhost', l3.port)
...
E pyln.client.lightning.RpcError: RPC call failed: method: connect, payload: {'id': '035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d', 'host': 'localhost', 'port': 41285}, error: {'code': 401, 'message': 'All addresses failed: 127.0.0.1:36678: Connection establishment: Connection refused. '}
```
See how it didn't even try the given address?
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In the case where the peer sends an error (and hangs up) immediately
after init, connectd *doesn't actually read the error* (even after all the
previous fixes so it actually receives the error!).
This is because to tried to first write WIRE_CHANNEL_REESTABLISH, and
that fails, so it never tries to read. Generally, we should ignore
write failures; we'll find out if the socket is closed when we read
nothing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
l1 might split in a commitment_signed before it notices the disconnect, and this test fails:
```
for i in range(0, len(disconnects)):
with pytest.raises(RpcError):
l1.rpc.sendpay(route, rhash, payment_secret=inv['payment_secret'])
> l1.rpc.waitsendpay(rhash)
E Failed: DID NOT RAISE <class 'pyln.client.lightning.RpcError'>
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is critical in the common case where peer sends an error and
hangs up: we almost never get to relay the error to the subd in time.
This also applies in the other direction: we need to flush the queue
to the peer when the subd closes. Note we only free the actual peer
struct when lightningd reaps us with connectd_peer_disconnected().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We would lose packets sometimes due to this previously, but it
doesn't happen over localhost so our tests didn't notice. However,
now we have connectd being sole thing talking to peers, we can do
a more elegant shutdown, which should fix closing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: Protocol: Always flush sockets to increase chance that final message get to peer (esp. error packets).
msg_queue was originally designed for inter-daemon comms, and so it has
a special mechanism to mark that we're trying to send an fd. Unfortunately,
a peer could also send such a message, confusing us!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
dev_blackhole_fd was a hack, and doesn't work well now we are async
(it worked for sync comms in per-peer daemons, but now we could sneak
through a read before we get to the next write).
So, make explicit flags and use them. This is much easier now we
have all peer comms in one place.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's weird to have connectd ask gossipd, when lightningd can just do it
and hand all the addresses together.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We now let gossipd do it.
This also means there's nothing left in 'struct per_peer_state' to
send across the wire (the fds are sent separately), so that gets
removed from wire messages too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We actually intercept the gossip_timestamp_filter, so the gossip_store
mechanism inside the per-peer daemon never kicks off for normal connections.
The gossipwith tool doesn't set OPT_GOSSIP_QUERIES, so it gets both, but
that only effects one place.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
channeld can't do it any more: it's using local sockets. Connectd
can do it, and simply does it by type.
Amazingly, on my machine the timing change *always* caused
test_channel_receivable() to fail, due to a latent race.
Includes feedback from @cdecker.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
As connectd handles more packets itself, or diverts them to/from gossipd,
it's the only place we can implement the dev_disconnect logic.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now connectd is doing the crypto, we can use normal wire io. We
create helper functions to clearly differentiate between "peer" comms
and intra-daemon comms though.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We temporarily hack to sync_crypto_write/sync_crypto_read functions to
not do any crypto, and do it all in connectd.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Instead of passing the incoming socket to lightningd for the
subdaemon, create a new one and simply shuffle data between them,
keeping connectd in the loop.
For the moment, we don't decrypt at all, just shuffle. This means our
buffer code is kind of a hack, but that goes away once we start
actually decrypting and understanding message boundaries.
This implementation is naive: it closes the socket to the local daemon
as soon as the peer closes the socket to us. This is fixed in a
successive patch series (along with many other similar issues).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This test started mostly failing (in non-DEVELOPER mode) after the
next patch, due to timing issues.
Handle both cases for now, and we'll add more enhancements later to
things we should be handling more consistently.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
connectd will be keeping the conn open, so it needs to free this
"conn_timeout" timer. Pass it through, so we can do that.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Procedure redefinition is a very cool feature but when other libraries redefine the same function can be very tricky to avoid compilation error.
This PR proposed a change of logic and use a customizzation function definition to read from the stdin, so we can avoid future error at compile time.
However, we could be check also the os or compiler that cause the error and redefine the missing function.
Changelog-None: Fixed compilation error due redefinition procedure.
Signed-off-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
I also got an error under CI; it seems the sleep() was insufficient.
So try adding a sleep inside the check_coin_moves, which should cover
everyone.
```
acct_moves = acct_moves[number_moves:]
else:
> if not move_matches(m, acct_moves[0]):
E IndexError: list index out of range
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1. tal_strndup(.., str, strlen(str)) == tal_strdup()
2. tal_strdup also takes(), so document that.
3. Avoid passing 'struct sha256' on the stack: use ptr.
4. Generally, structures shouldn't keep pointers to things they don't own.
In this case, mvt->node_id.
5. Make payment_hash a pointer, since NULL is more natural than an all-zero
hash.
And add NON_NULL_ARGS() to the functions; it's cumbersome, but make it
fairly clear what params are optional.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
local_chan was mainly around so we could "soft" disable channels (and
really disable them once we used the channel_update in an error
message).
Instead we introduce the idea of a "deferred_update": it's either
deferred indefinitely (a peer goes offline, if we need to send it in
an error we'll apply it immediatly), or simply delayed to avoid
spamming everyone.
The resulting rewrite is much clearer, IMHO.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Increasingly we want to know is it local, and get the direction: it's
more efficient to do both at once.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>