I was examining a test_onchain_timeout failure, and realized that we
were forgetting a peer even though we'd just spent the HTLC_TIMEOUT_TX!
This reveals that we weren't resolving an output when we stole the preimage
from it, like we should.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since we panic when we see our root reorg out, even if we're not doing
anything yet, restoring the 100 block margin is the simplest fix.
Unfortunately this means adding a 100-block spacer in the tests, so things
don't get confused.
Fixes: #511
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We stopped too early; we should continue and make sure it all goes well.
This means we have to fix the tests to be deterministic: generating 2
blocks at once in test_htlc_in_timeout made fulfill and timeout race on
the HTLC. Now it's always fulfilled.
Also, fixed confusing comments: l1 doesn't drop to chain.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
OUR_HTLC_TIMEOUT_TO_US = normal tx, used to timeout htlc in their commit tx.
OUR_HTLC_TIMEOUT_TX = dual-sig tx with delay, used to timeout htlc in our commit tx.
Only one test looks at that string, so fix that too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The test_reconnect_normal test is failing rather consistently on 32-bit
architectures, so disable it to reduce noise. Issue #468 tracks progress.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
With python-bitcoinlib==0.9.0 it appears that the URL based auth
information is no longer used, so we fall back to reading the config
file for the bitcoind daemon instead.
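A minimal sketch of the kind of fallback meant here, parsing rpcuser/rpcpassword
out of bitcoin.conf by hand (the function name and default path are illustrative,
not the framework's actual code):

    import os

    def read_rpc_credentials(btc_conf_file=None):
        # Fall back to parsing bitcoin.conf when URL-based auth isn't used.
        if btc_conf_file is None:
            btc_conf_file = os.path.expanduser('~/.bitcoin/bitcoin.conf')
        conf = {}
        with open(btc_conf_file) as f:
            for line in f:
                line = line.split('#', 1)[0].strip()  # strip comments
                if '=' in line:
                    k, v = line.split('=', 1)
                    conf[k.strip()] = v.strip()
        return conf.get('rpcuser'), conf.get('rpcpassword')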
Signed-off-by: Christian Decker <decker.christian@gmail.com>
Relatively simple: until we reach funding-depth the channels should be
known locally, so we can already route through them, but they should
not be announced to peers to which the connection is non-local.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
If send_htlc_out() fails, it doesn't initialize pc->out; that can
make us think it's still in progress.
Reported-by: Jonas Nick
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
All peers come from gossipd, and maintain an fd to talk to it. Sometimes
we hand the peer back, but to avoid a race, we always recreated it.
The race was that a daemon closed the gossip_fd, which made gossipd
forget the peer, then master handed the peer back to gossipd. We stop
the race by never closing the gossipfd, but hand it back to gossipd
for closing.
Now gossipd has to accept two fds, but the handling of peers is far
clearer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We should not disconnect from a peer just because it fails opening; we
should return it to gossipd, and give a meaningful error.
Closes: #401
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't use it yet, but now we'll decode correctly.
See: https://github.com/lightningnetwork/lightning-rfc/pull/317
lightning-rfc commit: ef053c09431442697ab46e83f9d3f86e3510a18e
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If you run locally, it fails occasionally, presumably because it
sees previous funds. Use a random HSM key for that test.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I noticed some breakage with git master:
1. getinfo no longer supported (for us, use getblockchaininfo)
2. generate no longer supported (use generatetoaddress)
Both these options are supported at least in 0.15, too.
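For reference, the replacements look roughly like this, assuming a `rpc`
proxy object exposing bitcoind's JSON-RPC methods (the proxy itself is
illustrative):

    def get_blockcount(rpc):
        # 'getinfo' is gone upstream; 'getblockchaininfo' carries the height.
        return rpc.getblockchaininfo()['blocks']

    def generate_blocks(rpc, numblocks):
        # 'generate' is gone too; mine to a fresh wallet address instead.
        addr = rpc.getnewaddress()
        return rpc.generatetoaddress(numblocks, addr)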
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This seems to happen when we manage to check between the
channel_announcement and the channel_update being processed.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
Add two simple tests: one for a single direct payment and one with
hundreds of parallel payments, reusing the same route.
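The parallel case is roughly shaped like this sketch, assuming framework
helpers `l1.rpc.sendpay` and `l2.rpc.invoice` (argument details are
illustrative):

    from concurrent import futures

    def pay_many(l1, l2, route, count=100):
        # Reuse one route, fire off many payments concurrently, wait for all.
        with futures.ThreadPoolExecutor(max_workers=20) as executor:
            fs = []
            for i in range(count):
                inv = l2.rpc.invoice(1000, 'lbl-%d' % i, 'desc')
                fs.append(executor.submit(l1.rpc.sendpay, route,
                                          inv['payment_hash']))
            for f in futures.as_completed(fs):
                f.result()  # re-raises if any payment failed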
Signed-off-by: Christian Decker <decker.christian@gmail.com>
We are announcing that we are willing to accept incoming payments with
current_height + min_final_cltv_expiry + slack, assuming that the
sender adds some slack. In particular we'd reject the payment if
slack=0 which is allowed by the spec.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
Raise the exception about non-zero exit values in the teardown
instead. It was previously masking the valgrind errors. Now the
priority is: valgrind errors > crash errors > non-zero return value.
Still hoping to catch that elusive [7, 0] return value on travis.
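In rough outline, and only as a sketch (the file names under the node's
log directory are assumptions), the ordering looks like:

    import os

    def check_node(logdir, returncode):
        # Priority: valgrind errors, then crash logs, then exit status.
        verrors = os.path.join(logdir, 'valgrind-errors')
        if os.path.exists(verrors) and os.stat(verrors).st_size > 0:
            with open(verrors) as f:
                raise ValueError('Valgrind errors:\n' + f.read())
        if os.path.exists(os.path.join(logdir, 'crash.log')):
            raise ValueError('Node crashed, see crash.log')
        if returncode != 0:
            raise ValueError('Non-zero return value: %d' % returncode)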
Signed-off-by: Christian Decker <decker.christian@gmail.com>
These need to be different for testing the example in BOLT 11.
We also use the cltv_final instead of deadline_blocks in the final hop:
various tests assumed 5 was OK, so we tweak utils.py.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I run with ulimit -c unlimited, and valgrind leaves core files like
valgrind-errors.22114.core.22114 which test_lightning.py tries to
parse as log files.
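The fix boils down to skipping anything that looks like a core dump when
globbing for valgrind logs; a small sketch:

    import glob, os, re

    def valgrind_error_files(logdir):
        # valgrind-errors.<pid> is a log; valgrind-errors.<pid>.core.<pid>
        # is a core dump left by 'ulimit -c unlimited' and must be skipped.
        files = glob.glob(os.path.join(logdir, 'valgrind-errors.*'))
        return [f for f in files if not re.search(r'\.core\.\d+$', f)]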
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It crashes under valgrind, causing a valgrind error: valgrind gives us a
backtrace anyway, so we don't need it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
And we report these through the getpeers JSON RPC again (carefully: in
our reconnect tests we can get duplicates which this patch now filters
out).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is a bit messier than I'd like, but we want to clearly remove all
dev code (not just have it uncalled), so we remove fields and functions
altogether rather than stub them out. This means we put #ifdefs in callers
in some places, but at least it's explicit.
We still run tests, but only a subset, and we run with NO_VALGRIND under
Travis to avoid increasing test times too much.
See-also: #176
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Makes it easier to compare before/after failures. Ideally, we should
run under Travis both with this option and with the seed based on the
entire tmp path (which is still reproducible with determination, but
not fixed every run like this is).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There are now only two kinds of subdaemons: global ones (hsmd, gossipd) and
per-peer ones. We can handle many callbacks internally now.
We can have a handler to set a new peer owner, and automatically do
the cleanup of the old one if necessary, since we now know which ones
are per-peer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now the flow is much simpler from a lightningd POV:
1. If we want to connect to a peer, just send gossipd `gossipctl_reach_peer`.
2. Every new peer, gossipd hands up to lightningd, with global/local features
and the peer fd and a gossip fd using `gossip_peer_connected`
3. If lightningd doesn't want it, it just hands the peerfd and global/local
features back to gossipd using `gossipctl_handle_peer`
4. If a peer sends a non-gossip msg (eg `open_channel`) the gossipd sends
it up using `gossip_peer_nongossip`.
5. If lightningd wants to fund a channel, it simply calls `release_channel`.
Notes:
* There's no more "unique_id": we use the peer id.
* For the moment, we don't ask gossipd when we're told to list peers, so
connected peers without a channel don't appear in the JSON getpeers API.
* We add a `gossipctl_peer_addrhint` for the moment, so you can connect to
a specific ip/port, but using other sources is a TODO.
* We now (correctly) only give up on reaching a peer after we exchange init
messages, which changes the test_disconnect case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently assume the daemon gives up; gossipd won't, and we want to
test it there too.
This reveals a bug (returning io_close() is bad if the call is to
duplex()), and breaks a test which now continues after dropping a
packet.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Useful if we want to drop & suppress, for example. We change '=' to mean
do nothing to the packet.
We use this to clean up the test_reconnect_sender_add test.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We're going to make the ip/port optional, so they should go at the end.
In addition, using ip:port is nicer, for gethostbyaddr().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have to do a dance when we get a reconnect in openingd, because we
don't normally expect to free both owner and peer. It's a layering
violation: freeing a peer should clean up the owner's pointer to it,
to avoid a double free, and we can eliminate this dance.
The free order is now different, and test_reconnect_openingd was
over-precise.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that we have HTLC persistence we'd also like to test it. This
kills the second node in the middle of an HTLC; it should recover and
finish the flow.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
This broke somewhere in the recent changes, because we override
TailableProc's stop(). Break out a log extractor.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Moved the flagging for allowed failures into the factory getter, and
renamed it to `may_fail`. Also stopped the teardown of a node from
throwing an exception if we are allowed to exit non-cleanly.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
A failed returncode check could result in the cleanup for other
lightningds being skipped. Now make sure to clean them all up and then
rethrow an exception that contains all the returncodes.
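Sketched, the teardown loop now looks something like this (assuming node
objects with a stop() that returns the exit code):

    def killall(nodes):
        # Always attempt every node's cleanup, remember how each one exited,
        # and only raise once we've been through them all.
        rcs = []
        failed = False
        for n in nodes:
            try:
                rcs.append(n.stop())
            except Exception:
                failed = True
                rcs.append(None)
        if failed or any(rc not in (0, None) for rc in rcs):
            raise ValueError('At least one lightningd exited badly: {}'.format(rcs))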
Signed-off-by: Christian Decker <decker.christian@gmail.com>
We used to simply kill the daemon, which in some cases could result in
half-written crashlogs and similar artifacts such as half-completed
RPC calls. Now we ask lightningd to stop nicely, give it some time and
only then kill it. We also return the returncode of the daemon.
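In essence (the `rpc.stop` call and `daemon.proc` handle are the
framework's; treat this as a sketch):

    import subprocess

    def stop(node, timeout=10):
        try:
            node.rpc.stop()               # ask lightningd to shut down nicely
        except Exception:
            pass                          # it may already be gone
        try:
            node.daemon.proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            node.daemon.proc.kill()       # it didn't stop in time
            node.daemon.proc.wait()
        return node.daemon.proc.returncode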
Signed-off-by: Christian Decker <decker.christian@gmail.com>
peer_fail_permanent() frees peer->owner, but for bad_peer() we're
being called by the sd->badpeercb(), which then goes on to
io_close(conn) which is a child of sd.
We need to detach the two for this case, so neither tries to free the
other.
This leads to a corner case when the subd exits after the peer is gone:
subd->peer is NULL, so we have to handle that too.
Fixes: #282
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Note that it should really be a flag to daemon on construction, too,
but that may interfere with another concurrent branch so I've deferred.
Suggested-by: Christian Decker
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We re-use the value for reasonable_depth given by the master, and we
tell it when our timeout transactions reach that depth.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In the next test, we wait for multiple 'sendrawtx exit 0' messages,
which doesn't work because we use a set, not a list, so the current
code would match several waits against the same entry. The result was
that we didn't wait for the final sendrawtransaction, and occasionally
had test failures as a result.
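The gist of the fix: keep the wanted regexes in a list and tick each one
off against a distinct log line, so duplicates need duplicate matches. A
sketch over a plain list of log lines:

    import re

    def wait_for_logs(lines, regexs):
        # Copy into a list we can consume; two identical regexes now need
        # two separate matching lines instead of both hitting the first one.
        exs = [re.compile(r) for r in regexs]
        for line in lines:
            for ex in exs:
                if ex.search(line):
                    exs.remove(ex)
                    break
            if not exs:
                return True
        return False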
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we sent out an HTLC-Timeout or HTLC-Success tx, we need to spend
it after the timeout so it's safely in our wallet.
We generalize the tx_type OUR_UNILATERAL_TO_US_RETURN_TO_WALLET to
OUR_DELAYED_RETURN_TO_WALLET, since we use it for HTLC transactions too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we see an offered HTLC onchain, we need to use the preimage if we
know it. So we dump all the known HTLC preimages at startup, and send
new ones as we discover them.
This doesn't cover preimages we know because we're the final
recipient; that can happen if an HTLC hasn't been irrevocably
committed yet. We'll do that in a followup patch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
lightningd can crash on shutdown if it's in the middle of getchaintips;
we free the conn, the finished callback is called (process_chaintips),
and it reports that it received an empty result.
The simplest fix is to set a flag in the struct bitcoind destructor,
and avoid the callback.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Either when it exits with a signal, or sends an error status message.
Then we make test_lightningd.py use it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We simply kill lightningd; we should stop it properly and have a timeout
to kill it if that fails. However, that's beyond my python skills :(
So we just look for crash.log. Unfortunately, we usually kill
lightningd before it's finished writing it. So we look for it and
don't kill lightningd, just wait in this case.
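Roughly (the crash.log location and process handle are assumptions about
the test framework):

    import os

    def shutdown(node):
        crashlog = os.path.join(node.daemon.lightning_dir, 'crash.log')
        if os.path.exists(crashlog):
            # lightningd is busy writing its crash log: don't kill it,
            # just wait for it to finish on its own.
            node.daemon.proc.wait()
        else:
            node.daemon.proc.kill()
            node.daemon.proc.wait()
        return node.daemon.proc.returncode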
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is the step where we broadcast the transaction to the network and
a nice place to extract the change from the transaction.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
For the permfail tests the sendpay call is supposed to fail, so this
was printing stacktraces upon success. Running in futures captures any
thrown exceptions and rethrows them when calling `result()`; in our
case we just ignore them.
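Sketch of the pattern (executor and RPC names are illustrative):

    from concurrent import futures

    def sendpay_expect_failure(executor, rpc, route, payment_hash):
        f = executor.submit(rpc.sendpay, route, payment_hash)
        try:
            f.result()      # re-raises whatever sendpay threw in the worker
        except Exception:
            pass            # the permfail tests expect this to fail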
Signed-off-by: Christian Decker <decker.christian@gmail.com>
So far we were always using the deadline in the announcements, that's
obviously not good, so this introduces the parameter as per spec.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
We weren't killing it. Eventually it would die, and peer_owner_finished()
would access subd->peer->owner, but that peer was freed already.
Closes: #261
Reported-by: Christian Decker <decker.christian@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To reproduce the next bug, I had to ensure that one node keeps thinking it's
disconnected, then the other node reconnects, then the first node realizes
it's disconnected.
This code does that, adding a '0' dev-disconnect modifier. That means
we fork off a process which (due to pipebuf) will accept a little
data, but when the dev_disconnect file is truncated (a hacky, but
effective, signalling mechanism) will exit, as if the socket finally
realized it's not connected any more.
The python tests hang waiting for the daemon to terminate if you leave
the blackhole around; to give a clue as to what's happening in this
case I moved the log dump to before killing the daemon.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
jl777 reported a crash when we try to pay past reserve. Fix that (and
a whole class of related bugs) and add tests.
In test_lightning.py I had to make the non-async path for sendpay()
non-threaded to get the exception passed through for testing.
Closes: #236
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I was hoping to defer HTLC updates until we actually store HTLCs, but
we need to flush to DB whenever balances update as well.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
This is the big one, and it's completely anticlimactic: it loads all
channels that have reached opening and are not marked as
closingd_complete into memory; that's it.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
I was hoping to trigger on more things from the bitcoind process, but
stuff like mempool is hard to trigger on. Reducing to info so we can
work a bit more easily with pdb and the log becomes less noisy.
We'll need this for testing nodes going down during payment.
However, there's no good way to silence the threads, as far as I can
tell, so we get a nasty backtrace from it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I tracked down a bug, and couldn't figure out why valgrind wasn't
finding it.
From man valgrind:
--log-file=<filename>
Specifies that Valgrind should send all of its messages to the
specified file. If the file name is empty, it causes an abort.
There are three special format specifiers that can be used in the
file name.
%p is replaced with the current process ID. This is very useful for
program that invoke multiple processes. WARNING: If you use
--trace-children=yes and your program invokes multiple processes OR
your program forks without calling exec afterwards, and you don't
use this specifier (or the %q specifier below), the Valgrind output
from all those processes will go into one file, possibly jumbled
up, and possibly incomplete.
"possibly incomplete" indeed!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We weren't registering reconnecting peers for broadcasts. Just
starting a timer is enough. Also added an integration test to check
that the gossip sync is being resumed.
This is a transitional state, while we're waiting to see the
closing tx onchain (which is To Be Implemented).
The simplest way to do re-transmission is to re-use closingd, and just
disallow any updates.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We actually don't need to transition if we're reconnecting, and the logic
to go to CHANNELD_NORMAL was wrong: we checked that we'd seen the funding
tx locked, but not that we'd received a msg from the remote peer.
We need to fix the tests now that we no longer double-transition, too.
Fixes: #188
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we drop an HTLC_ADD packet, sometimes the commit timer fires
before we try to read from the fd. In this case, the payment is
considered committed and we don't fail.
We need manual commit to work around this, and also we'd need to do
the pay command asynchronously from python because it will block.
That's a bit out of scope for now, so just handle either way.
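So the test just tolerates both outcomes; conceptually (helper names
illustrative):

    def pay_tolerating_race(l1, route, payment_hash):
        # If the commit timer fires before the dropped packet is noticed,
        # the HTLC is committed and sendpay succeeds; otherwise it fails.
        # Either outcome is acceptable here.
        try:
            l1.rpc.sendpay(route, payment_hash)
        except Exception:
            pass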
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We keep the scriptpubkey to send until after a commitment_signed (or,
in the corner case, if there's no pending commitment). When we
receive a shutdown from the peer, we pass it up to the master.
It's up to the master not to add any more HTLCs, which works because
we move from CHANNELD_NORMAL to CHANNELD_SHUTTING_DOWN.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Currently it's fairly ad-hoc, but we need to tell it to channeld when
it restarts, so we define it as the non-HTLC balance.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When adding their HTLCs, it needs all the information. When failing,
it needs the id as key and the failure reason. When fulfilling, it
needs the id and payment preimage.
It also needs to know when we have received a revoke_and_ack or a
commitment_signed, to place in the database.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I have a test which waits for multiple occurrences of the same string,
but doesn't want them to overlap. Make wait_for_log() do the right thing,
so that it only looks for log entries since the last wait_for_log.
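Conceptually the daemon wrapper just remembers how far into its log the
previous search got; a minimal sketch of that behaviour:

    import re
    import time

    class LogWaiter(object):
        def __init__(self):
            self.logs = []            # appended to by the log-tailing thread
            self.logsearch_start = 0  # first line not yet searched

        def wait_for_log(self, regex, timeout=60):
            ex = re.compile(regex)
            deadline = time.time() + timeout
            while time.time() < deadline:
                while self.logsearch_start < len(self.logs):
                    line = self.logs[self.logsearch_start]
                    self.logsearch_start += 1
                    if ex.search(line):
                        return line   # next call resumes after this line
                time.sleep(0.25)
            raise TimeoutError('Did not see "{}" in log'.format(regex))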
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
On my laptop under load, 5 seconds was no longer enough for legacy.
But this breaks async (they all see mempool increase, and fire
prematurely), so stop doing that.
I can't get this test to work at all, in fact, without this patch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I actually hit this very-hard-to-reproduce race: if we haven't processed
the channeld message when block #6 comes in, we won't send the gossip
message. We wait for logs, but don't generate new blocks, and time out
on l1.daemon.wait_for_log('peer_out WIRE_ANNOUNCEMENT_SIGNATURES').
The solution, which also tests that we don't send announcement signatures
immediately, is to generate a single block, wait for CHANNELD_NORMAL,
then (in gossip tests), generate 5 more.
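In test form that ordering looks roughly like this (helper names such as
`bitcoind.generate_block` and the exact log strings are the framework's,
used here illustratively):

    # One block: wait until channeld's message has been processed and we
    # are in CHANNELD_NORMAL, with no announcement signatures sent yet.
    bitcoind.generate_block(1)
    l1.daemon.wait_for_log('to CHANNELD_NORMAL')

    # Only then add the remaining depth in the gossip tests.
    bitcoind.generate_block(5)
    l1.daemon.wait_for_log('peer_out WIRE_ANNOUNCEMENT_SIGNATURES')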
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1. We explicitly assert what state we're coming from, to make transitions
clearer.
2. Every transition has a state, even between owners while waiting for HSM.
3. Explicitly step through getting the HSM signature on the funding tx
before starting channeld, rather than doing it in parallel: makes
states clearer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We need to do this on every connection, whether reconnecting or not,
so it makes sense for the handshake daemon to handle it and return
the feature fields.
Longer term I'm considering having the handshake daemon handle the
listening and connecting, and simply hand the fds back once the peers
are ready.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently create a peer struct, then complete the handshake to find out
who it is. This means we have a half-formed peer, and worse: if it's
a reconnect we end up with two peers for the same node.
Add an explicit 'struct connection' for the handshake phase, and
construct a 'struct peer' once that's done.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
eg:
test_routing_gossip (__main__.LightningDTests) ... ERROR
======================================================================
ERROR: test_routing_gossip (__main__.LightningDTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_lightningd.py", line 150, in tearDown
err_count += self.printValgrindErrors(node)
File "tests/test_lightningd.py", line 137, in printValgrindErrors
errors, fname = self.getValgrindErrors(node)
File "tests/test_lightningd.py", line 132, in getValgrindErrors
with open(error_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/lightning-l106st0a/test_routing_gossip/lightning-1/valgrind-errors'
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Under stress, the tests can mine blocks too soon, and the funding never
locks. This gives more of a chance, at least.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I couldn't actually figure out how to just dump them on error, so I
dump all the time. When running 3 lightningd + bitcoind, this separates
the logs nicely.
TODO: We should delete the directories on success!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
But it breaks:
test_forward (__main__.LightningDTests) ... lightningd_channel: Computed MAC does not match expected MAC, the message was modified.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Mainly switching from the old include to the new include and adjusting
the actual size of the onion packet. It also moves `channel.c` to use
`struct hop_data`.
It introduces a dummy next hop in `channel.c` that will be replaced in
the next commit.
This moves all the non-legacy blackbox testing into python.
Before:
real 10m18.385s
After:
real 9m54.877s
Note that this doesn't valgrind the subdaemons: that patch seems to cause
some issues in the python framework which I am still chasing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Running long integration tests could result in `bitcoind` dropping the
connection in between calls, and since python-bitcoinlib does not
reconnect and/or retry, all subsequent tests would fail as well. This
patch switches to throwaway connections, each serving just one
request. It's easier than reaching into bitcoinlib to reconnect and
reauth; it comes with some overhead, but I think that's acceptable for
the few bitcoin calls we actually perform.
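The throwaway-connection approach boils down to a thin wrapper that builds
a fresh python-bitcoinlib proxy per call; roughly (class name illustrative):

    import bitcoin.rpc

    class ThrowawayProxy(object):
        """Open a fresh RPC connection for every call, so a dropped
        connection can only affect the single request it happened on."""
        def __init__(self, btc_conf_file):
            self.btc_conf_file = btc_conf_file

        def __getattr__(self, name):
            def call(*args):
                proxy = bitcoin.rpc.RawProxy(btc_conf_file=self.btc_conf_file)
                return getattr(proxy, name)(*args)
            return call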
By looking for 'Done loading' in the log output we should actually be
called after `SetRPCWarmupFinished` in bitcoind. Only then is it safe
to make RPC calls; not waiting for it made the test suite a bit flaky.
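In the harness that simply means blocking on the daemon's log before
touching RPC; a sketch, assuming the framework's BitcoinD wrapper with
its wait_for_log():

    bitcoind.start()
    # 'Done loading' is printed after SetRPCWarmupFinished, so RPC calls
    # made past this point won't be rejected while bitcoind is warming up.
    bitcoind.wait_for_log('Done loading')
    info = bitcoind.rpc.getblockchaininfo()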