lightningd would race with the subd destructor to do the waitpid(),
resulting in UNUSUAL log messages, but also us missing if a plugin
was killed via a signal.
We can also get rid of the gratuitous waitpid() in test_subdaemons.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
msg_queue was originally designed for inter-daemon comms, and so it has
a special mechanism to mark that we're trying to send an fd. Unfortunately,
a peer could also send such a message, confusing us!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We now let gossipd do it.
This also means there's nothing left in 'struct per_peer_state' to
send across the wire (the fds are sent separately), so that gets
removed from wire messages too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
As connectd handles more packets itself, or diverts them to/from gossipd,
it's the only place we can implement the dev_disconnect logic.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1. Freeing an unconfirmed channel already releases the subd, so don't
do that explicitly.
2. Use channel->owner to transfer ownership where possible, using
channel_set_owner() which handles all the cases.
This simplifies the code and makes it more readable, IMHO.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
And turn "" includes into full-path (which makes it easier to put
config.h first, and finds some cases check-includes.sh missed
previously).
config.h sets _GNU_SOURCE which really needs to be done before any
'#includes': we mainly got away with it with glibc, but other platforms
like Alpine may have stricter requirements.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
```
cc lightningd/subd.c
lightningd/subd.c:216:7: error: expected identifier or '('
int stdout = STDOUT_FILENO, stderr = STDERR_FILENO;
^
/usr/include/stdio.h:198:17: note: expanded from macro 'stdout'
^
lightningd/subd.c:216:7: error: expected ')'
/usr/include/stdio.h:198:17: note: expanded from macro 'stdout'
^
lightningd/subd.c:216:7: note: to match this '('
/usr/include/stdio.h:198:16: note: expanded from macro 'stdout'
^
lightningd/subd.c:224:12: error: cannot take the address of an rvalue of type 'FILE *' (aka 'struct __sFILE *')
fds[1] = &stdout;
^~~~~~~
lightningd/subd.c:225:12: error: cannot take the address of an rvalue of type 'FILE *' (aka 'struct __sFILE *')
fds[2] = &stderr;
^~~~~~~
4 errors generated.
gmake: *** [Makefile:279: lightningd/subd.o] Error 1
```
Changelog-None: introduced since last release.
Fixes: #4914
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> ```
It's both complex and flawed, as ZmnSCPxj points out. Make a generic
fd ordering routine, and use it.
Plus, test it!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This also inadvertently fixes a latent bug: before this patch, in the
`subd` function in `lightningd/subd.c`, we would close `execfail[1]`
*before* doing an `exec`.
We use an EOF on `execfail[1]` as a signal that `exec` succeeded (the
fd is marked CLOEXEC), and otherwise use it to pump `errno` to the
parent.
The intent is that this fd should be kept open until `exec`, at which
point CLOEXEC triggers and close that fd and sends the EOF, *or* if
`exec` fails we can send the `errno` to the parent process vua that
pipe-end.
However, in the previous version, we end up closing that fd *before*
reaching `exec`, either in the loop which `dup2`s passed-in fds (by
overwriting `execfail[1]` with a `dup2`) or in the "close everything"
loop, which does not guard against `execfail[1]`, only
`dev_disconnect_fd`.
Before:
Ten builds, laptop -j5, no ccache:
```
real 0m36.686000-38.956000(38.608+/-0.65)s
user 2m32.864000-42.253000(40.7545+/-2.7)s
sys 0m16.618000-18.316000(17.8531+/-0.48)s
```
Ten builds, laptop -j5, ccache (warm):
```
real 0m8.212000-8.577000(8.39989+/-0.13)s
user 0m12.731000-13.212000(12.9751+/-0.17)s
sys 0m3.697000-3.902000(3.83722+/-0.064)s
```
After:
Ten builds, laptop -j5, no ccache: 8% faster
```
real 0m33.802000-35.773000(35.468+/-0.54)s
user 2m19.073000-27.754000(26.2542+/-2.3)s
sys 0m15.784000-17.173000(16.7165+/-0.37)s
```
Ten builds, laptop -j5, ccache (warm): 1% faster
```
real 0m8.200000-8.485000(8.30138+/-0.097)s
user 0m12.485000-13.100000(12.7344+/-0.19)s
sys 0m3.702000-3.889000(3.78787+/-0.056)s
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This avoids subdaemons complaining about malformed messages from us,
or doing the completely wrong thing, if they are really the wrong
version.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
You still shouldn't do this (you could get some transient failures),
but at least you have a decent chance if you reinstall over a running
daemon, instead of getting confusing internal errors if message
formats have changed.
Changelog-Added: lightningd: we now try to restart if subdaemons are upgraded underneath us.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: #4346
only needed for moving the subd->channel from an uncommitted_channel to
a channel; we removed uncommitted_channel from dual_open so it's no
longer necessary
This is in line with the warnings draft, where all-zeroes in a
channel_id is no longer special (i.e. it will be ignored).
But gossipd would send these if it got upset with us, so it's best
practice to ignore them for now anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we treat error messages from peer which refer to "all channels" as warnings, not errors.
Back in the days before dual-funding, the `channel` struct on subd was
only every one type per daemon (either struct channel or struct
uncommitted_channel)
The RBF requirement on dualopend means that dualopend's channel,
however, can now be two different things -- either channel or
uncommitted_channel.
To track the difference/disambiguate, we now track the channel type on a
flag on the subd. It gets updated when we swap out the channel.
Changelog-Added: lightningd: Added --subdaemon command to allow alternate subdaemons.
[ Wow, that was mammoth; 44 comments over 12 commits. Feels almost unfair to squash it into one commit, so I wanted to note @ksedgwic's perseverence here! --RR ]
This simplifies our tests, too, since we don't need a magic option to
enable io logging in subdaemons.
Note that test_bad_onion still takes too long, due to a separate minor
bug, so that's marked and left dev-only for now.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1. Printed form is always "[<nodeid>-]<prefix>: <string>"
2. "jcon fd %i" becomes "jsonrpc #%i".
3. "jsonrpc" log is only used once, and is removed.
4. "database" log prefix is use for db accesses.
5. "lightningd(%i)" becomes simply "lightningd" without the pid.
6. The "lightningd_" prefix is stripped from subd log prefixes, and pid removed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-changed: Logging: formatting made uniform: [NODEID-]SUBSYSTEM: MESSAGE
Changelog-removed: `lightning_` prefixes removed from subdaemon names, including in listpeers `owner` field.
We had a separate logbook for each peer, and copy log entries above
the printable log level into the master logbook. This didn't always
work well, since we didn't dump it on crash for example.
Keep a single global logbook instead, and remove this infrastructure.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A log can have a default node_id, which can be overridden on a per-entry
basis. This changes the format of logging, so some tests need rework.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There are some more #if DEVELOPER one-liners coming, this makes them
clear, but still lets them stand out.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We will soon generalize the DB, so directly reaching into the `struct db`
instance to talk to the sqlite3 connection is bad anyway. This increases
flexibility and allows us to tailor the actual implementation to the
underlying DB.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
We currently end up sleeping for 1 second for channeld and gossipd:
better to use a normal blocking waitpid and an alarm to wake us in
case they don't exit.
This speeds up `lightning-cli stop` on my machine from 2.008s to 0.008s:
a 286 times speedup!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The spec says to close the channel if they send us an error, but we
need to be more lenient to preserve channels with other
implementations.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Encapsulating the peer state was a win for lightningd; not surprisingly,
it's even more of a win for the other daemons, especially as we want
to add a little gossip information.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
These are always handed to subdaemons as a set, so group them. This makes
it easier to add an fd (in the next patch).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This currently just invokes GDB, but we could generalize it (though
pdb doesn't allow attaching to a running process, other python
debuggers seem to).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This way there's no need for a context pointer, and freeing a msg_queue
frees its contents, as expected.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently just ignore them. This is one reason the hsm (in some places)
explicitly calls log_broken so we get some idea.
This was the only subdaemon which had a NULL msgcb and msgname, so eliminate
those checks in subd.c.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Code changes:
1. Expose daemon_poll() so lightningd can call it directly, which avoids us
having store a global and document it.
2. Remove the (undocumented, unused, forgotten) --rpc-file="" option to disable
JSON RPC.
3. Move the ickiness of finding the executable path into subd.c, so it doesn't
distract from lightningd.c overview.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We're going to call out to subds for memleak detection, and the disabler
looks like a memleak if we're inside a callback.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I crashed the HSMD, and it gave no output at all. That's because we
were only reading the status fd when we were waiting for a reply.
Fix this by using a separate request fd and status fd, which also means
that hsm_sync_read() is no longer required.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(This was sitting in my gossip-enchancement patch queue, but it simplifies
this set too, so I moved it here).
In 94711969f we added an explicit gossip_index so when gossipd gets
peers back from other daemons, it knows what gossip it has sent (since
gossipd can send gossip after the other daemon is already complete).
This solution is insufficient for the more general case where gossipd
wants to send other messages reliably, so replace it with the other
solution: have gossipd drain the "gossip fd" which the daemon returns.
This turns out to be quite simple, and is probably how I should have
done it originally :(
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>