Clean restart of daemon after a tx-abort is a nice way to work around
the 'persistent' disconnect that we t-bast noticed.
Changelog-Fixed: `dualopend`: Fix behavior for tx-aborts. No longer hangs, appropriately continues re-init of RBF requests without reconnction msg exchange.
`struct log` becomes `struct logger`, and the member which points to the
`struct log_book` becomes `->log_book` not `->lr`.
Also, we don't need to keep the log_book in struct plugin, since it has
access to ld's log_book.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Be more graceful in shutting down: this should fix the issue where
bookkeeper gets upset that its commands are rejected during shutdown,
and generally make things more graceful.
1. Stop any new RPC connections.
2. Stop any per-peer daemons (channeld, etc).
3. Shut down plugins.
4. Stop all existing RPC connections.
5. Stop global daemons.
6. Free up peer, chanen HTLC datastructures.
7. Close database.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Plugins: RPC operations are now still available during shutdown.
We get some memleak reports because ld owns the subd, but once
the peer/channel is freed, there's no reference for the brief time
until the subd exits.
This happens for both opening and closingd. For openingd, the
peer owns it, for others (including dualopend) the channel owns it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Instead of doing this weird chaining, just call them all at once and
use a reference counter.
To make it simpler, we return the subd_req so we can hang a destructor
off it which decrements after the request is complete.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Otherwise we get weird effects, as htlcs are being freed:
```
2022-01-26T05:07:37.8774610Z lightningd-1: 2022-01-26T04:47:48.770Z DEBUG 030eeb52087b9dbb27b7aec79ca5249369f6ce7b20a5684ce38d9f4595a21c2fda-chan#8: Failing HTLC 18446744073709551615 due to peer death
2022-01-26T05:07:37.8775287Z lightningd-1: 2022-01-26T04:47:48.770Z **BROKEN** 030eeb52087b9dbb27b7aec79ca5249369f6ce7b20a5684ce38d9f4595a21c2fda-chan#8: Neither origin nor in?
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
lightningd would race with the subd destructor to do the waitpid(),
resulting in UNUSUAL log messages, but also us missing if a plugin
was killed via a signal.
We can also get rid of the gratuitous waitpid() in test_subdaemons.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1. Freeing an unconfirmed channel already releases the subd, so don't
do that explicitly.
2. Use channel->owner to transfer ownership where possible, using
channel_set_owner() which handles all the cases.
This simplifies the code and makes it more readable, IMHO.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This avoids subdaemons complaining about malformed messages from us,
or doing the completely wrong thing, if they are really the wrong
version.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
only needed for moving the subd->channel from an uncommitted_channel to
a channel; we removed uncommitted_channel from dual_open so it's no
longer necessary
This is in line with the warnings draft, where all-zeroes in a
channel_id is no longer special (i.e. it will be ignored).
But gossipd would send these if it got upset with us, so it's best
practice to ignore them for now anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we treat error messages from peer which refer to "all channels" as warnings, not errors.
Back in the days before dual-funding, the `channel` struct on subd was
only every one type per daemon (either struct channel or struct
uncommitted_channel)
The RBF requirement on dualopend means that dualopend's channel,
however, can now be two different things -- either channel or
uncommitted_channel.
To track the difference/disambiguate, we now track the channel type on a
flag on the subd. It gets updated when we swap out the channel.
The spec says to close the channel if they send us an error, but we
need to be more lenient to preserve channels with other
implementations.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Encapsulating the peer state was a win for lightningd; not surprisingly,
it's even more of a win for the other daemons, especially as we want
to add a little gossip information.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
These are always handed to subdaemons as a set, so group them. This makes
it easier to add an fd (in the next patch).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This currently just invokes GDB, but we could generalize it (though
pdb doesn't allow attaching to a running process, other python
debuggers seem to).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This way there's no need for a context pointer, and freeing a msg_queue
frees its contents, as expected.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently just ignore them. This is one reason the hsm (in some places)
explicitly calls log_broken so we get some idea.
This was the only subdaemon which had a NULL msgcb and msgname, so eliminate
those checks in subd.c.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Code changes:
1. Expose daemon_poll() so lightningd can call it directly, which avoids us
having store a global and document it.
2. Remove the (undocumented, unused, forgotten) --rpc-file="" option to disable
JSON RPC.
3. Move the ickiness of finding the executable path into subd.c, so it doesn't
distract from lightningd.c overview.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I crashed the HSMD, and it gave no output at all. That's because we
were only reading the status fd when we were waiting for a reply.
Fix this by using a separate request fd and status fd, which also means
that hsm_sync_read() is no longer required.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This was sitting in my gossip-enchancement patch queue, but it simplifies
this set too, so I moved it here).
In 94711969f we added an explicit gossip_index so when gossipd gets
peers back from other daemons, it knows what gossip it has sent (since
gossipd can send gossip after the other daemon is already complete).
This solution is insufficient for the more general case where gossipd
wants to send other messages reliably, so replace it with the other
solution: have gossipd drain the "gossip fd" which the daemon returns.
This turns out to be quite simple, and is probably how I should have
done it originally :(
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means the caller needs to supply an explicit log to base the
subd log on, and also a callback for error handling.
The callback is kind of ugly, but it gets reworked towards the end
of this series.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is a bit messier than I'd like, but we want to clearly remove all
dev code (not just have it uncalled), so we remove fields and functions
altogether rather than stub them out. This means we put #ifdefs in callers
in some places, but at least it's explicit.
We still run tests, but only a subset, and we run with NO_VALGRIND under
Travis to avoid increasing test times too much.
See-also: #176
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There are now only two kinds of subdaemons: global ones (hsmd, gossipd) and
per-peer ones. We can handle many callbacks internally now.
We can have a handler to set a new peer owner, and automatically do
the cleanup of the old one if necessary, since we now know which ones
are per-peer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This change is really to allow us to have a --dev-fail-on-subdaemon-fail option
so we can handle failures from subdaemons generically.
It also neatens handling so we can have an explicit callback for "peer
did something wrong" (which matters if we want to close the channel in
that case).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We had a terrible hack in gossip when a peer didn't exist. Formalize
a pattern when code+200 is a failure (with no fds passed), and use it
here.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use a file descriptor, so when we consume an entry, we move past it
(and everyone shares a file offset, so this works).
The file contains packet names prefixed by - (treat fd as closed when
we try to write this packet), + (write the packet then ensure the file
descriptor fails), or @ ("lose" the packet then ensure the file
descriptor fails).
The sync and async peer-write functions hook this in automatically.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Header from folded patch 'test-run-cryptomsg__fix_compilation.patch':
test/run-cryptomsg: fix compilation.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>