A long time ago (93dcd5fed7), I
simplified the htlc reload code so it adjusted the amounts for HTLCs
in id order. As we presumably allowed them to be added in that order,
this avoided special-casing overflow (which was about to deliberately
be made harder by the new amount_msat code).
Unfortunately, HTLC id order is not canonical, since ids are assigned
consecutively (and independently) in each direction! Concretely, we can have two HTLCs:
HTLC #0 LOCAL->REMOTE: 500,000,000 msat, state RCVD_REMOVE_REVOCATION
HTLC #0 REMOTE->LOCAL: 10,000 msat, state SENT_ADD_COMMIT
On a new remote-funded channel, in which we have 0 balance, these
commits *only* work in this order. Sorting by HTLC ID is not enough!
In fact, we'd have to worry about redemption order as well, as that
matters.
So, regretfully, we offset the balances halfway to UINT64_MAX, then check
they didn't underflow at the end. This loses us this one sanity check,
but that's probably OK.
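To illustrate the trick, here is a minimal Python sketch (names and the
OFFSET constant are invented for illustration; the real code is the C
HTLC-reload path in lightningd):

    OFFSET = 2**63  # halfway to UINT64_MAX

    def replay_htlcs(local_msat, remote_msat, deltas):
        """Apply HTLC adjustments in arbitrary (non-canonical) order.

        Intermediate balances may dip below zero, so bias both sides by
        OFFSET and only sanity-check once everything has been applied.
        """
        local = local_msat + OFFSET
        remote = remote_msat + OFFSET
        for dlocal, dremote in deltas:
            local += dlocal
            remote += dremote
        assert local >= OFFSET and remote >= OFFSET, "balance underflow after replay"
        return local - OFFSET, remote - OFFSET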
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We're about to change it so we always send our local messages, which
breaks this test. Add a new node which doesn't have any local
messages, so the test works correctly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Sometimes the l3 seeker asks for scids, and the reply contains the
channel which is then closed by the time it checks, so it considers
the updates bad gossip.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Feerate changes are asymmetric, as they can only be sent by the funder.
For FUNDER, the remote feerate is set upon send of
commitment_signed, and the local feerate is set on receipt of
revoke_and_ack.
For non-funder, the local feerate is set on receipt of
commitment_signed, and the remote feerate set on send of
revoke_and_ack. In our code, these two happen together.
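A hedged sketch of that asymmetry (function and field names invented for
illustration; this is not lightningd's actual code):

    def apply_feerate(state, we_are_funder, event, feerate):
        """Record when each side's feerate takes effect, per the rules above."""
        if we_are_funder:
            if event == "send_commitment_signed":
                state["remote_feerate"] = feerate
            elif event == "recv_revoke_and_ack":
                state["local_feerate"] = feerate
        else:
            # Non-funder: both transitions happen together in our code,
            # on receipt of commitment_signed / send of revoke_and_ack.
            if event == "recv_commitment_signed":
                state["local_feerate"] = feerate
                state["remote_feerate"] = feerate
        return state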
channeld gets this right, but lightningd ignored the funder/fundee
distinction, and as a result, receipt of a commitment_signed by the
funder altered fees in the database. If there was a reconnection
event or restart, then these (incorrect) values would be used, causing
us to complain about a 'Bad commit_sig signature' and close the
channel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A 'Bad commit_sig signature' was reported by @Javier on Telegram and
@DarthCoin. This was between two c-lightning peers, so definitely our fault.
Analysis of this message revealed the signature was using the wrong
feerate. I finally managed to make a test case which triggered this.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Whenever we have multi-connected nodes, out-of-order gossip is possible.
In particular, if a node_announcement is 1 second fresher than the
channel_announcement, a timestamp_filter might get one and not the
other.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Check behavior for user supplied upfront_shutdown_script via close_to
Header from folded patch 'fix__return__not__iff_well_close_to_the_provided_addr.patch':
fix: return not iff we'll close to the provided addr
*If* we know the key has signed something else (as is the case for
channel_announcement) then we can effectively trust the key derivation.
This matches how LND's VerifyMessage works, though in the next patch
we will document it exactly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Thanks Twitter helpers @duck1123 and @jochemin for tests!
And @bitconner for the initial test vector.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I had a report of a 0.7.2 user whose node hadn't appeared on 1ml. Their
node_announcement wasn't visible to my node, either.
I suspect this is a consequence of recent versions reducing the amount of
gossip they send, as well as large nodes increasingly turning off gossip
altogether from some peers (as we do). We should ignore timestamp filters
for our own channels: the easiest way to do this is to push them out
directly from gossipd (other messages are sent via the store).
We change channeld to wrap the local channel_announcements: previously
we just handed them to gossipd as we would any other gossip message
received from our peer. Now gossipd knows to push them out, as they're local.
This interferes with the logic in tests/test_misc.py::test_htlc_send_timeout
which expects the node_announcement message last, so we generalize
that too.
[ Thanks to @trueptolmy for bugfix! ]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is mainly an internal-only change, especially since we don't
offer any globalfeatures.
However, LND (as of next release) will offer global features, and also
expect option_static_remotekey to be a *global* feature. So we send
our (merged) feature bitset as both global and local in init, and fold
those bitsets together when we get an init msg.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It sometimes fails with a bad_gossip error because the sending node
might not have found out about the channel when it gets a
channel_update. Make sure the whole network knows everything before
we start.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We completely rework test_node_reannounce: it assumes we always ask for
all gossip, and that assumption will be broken by future patches too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Our policy is generally to omit fields which aren't sensible.
Also, @niftynei points out the spacing in for loops.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since elements addresses look quite different from the bitcoin mainnet
addresses I just added a sample to the chainparams fixture. In addition I
extracted some of the fixed strings to reference chainparams instead.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
We were checking against hardcoded hrp and prefixes. Now we parametrize via
the chainparams.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
We are checking against chain-dependent constants, so let's make sure we are
using the ones for the correct chain.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
Since we will soon be writing the `liquid-regtest` section instead of the
`regtest` section we should make that configurable.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
There were a few places we were rebuilding the config path by appending
`bitcoin.conf` to the bitcoin directory. So now we just remember it and
reference it instead.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
It relies on the fact that nodes don't do their own gossip queries.
Use devtools instead.
This revealed that the entire logic was broken! It just happened to work.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Note the use of sqrt, which turns the 180-second Travis timeout into a
13-second one, or 7 seconds normally.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This adapts the test to the new 'plugin' command: no more sleeping,
since we are synchronous!
This tests the timeout by increasing the 'slowinit' plugin sleep
duration at init reception.
This adds a broken plugin to make sure we won't crash because of a
misbehaving plugin (unmet dependency is the most common case).
I was seeing some accidental pruning under load / Travis, and in
particular we stopped accepting channel_updates because they were 103
seconds old. But making it too long makes the prune test untenable,
so restore a separate flag that this test can use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It only had an effect if the peer didn't support option_gossip_queries, but
still, we don't want a gossip blast any more.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The test was implicitly relying on us selecting the larger output and then not
touching the smaller, leaving it there for the final `withdraw` to claim. This
ordering of UTXOs is not guaranteed, and in particular can fail when switching
DB backends. To stabilize we just need to make sure to select the change
output as well.
This replaces the hard-coded path to the `postgres` and `initdb` binaries with
a slightly more flexible search. It'll pick the newest version installed.
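A rough sketch of such a search (the Debian-style
/usr/lib/postgresql/<version>/bin layout and the helper name are
assumptions, not the exact code):

    import glob

    def find_pg_binary(name):
        """Return the newest installed binary with this name, or None."""
        candidates = glob.glob("/usr/lib/postgresql/*/bin/{}".format(name))

        def version_key(path):
            ver = path.split("/")[4]  # ".../postgresql/<ver>/bin/<name>"
            return tuple(int(p) for p in ver.split(".") if p.isdigit())

        return max(candidates, key=version_key) if candidates else None

    postgres_bin = find_pg_binary("postgres")
    initdb_bin = find_pg_binary("initdb")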
Signed-off-by: Christian Decker <decker.christian@gmail.com>
This was weird right from the start, so we just split the table into integers
and blobs, so each column has a well-defined format. It is also required for
postgres not to cry about explicit casts in the `paramTypes` array.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
We will soon have a postgres backend as well, so we need a way to control the
postgres process and to provision DBs to the nodes. The two interfaces are the
dsn that we pass to the node, and the python query interface needed to query
from tests.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
It's generally clearer to have simple hardcoded numbers with an
#if DEVELOPER around them than apparent variables which aren't really variable.
Interestingly, our pruning test was always kinda broken: we have to pass
two cycles, since l2 will refresh the channel once to avoid pruning.
Do the more obvious thing, and cut the network in half and check that
l1 and l3 time out.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If you send a message which simply changes timestamp and signature, we
drop it. You shouldn't be doing that, and the door to ignoring them
was opened by option_gossip_query_ex, which would allow clients to
ignore updates with the same checksum.
This is more aggressive at reducing spam messages, but we allow refreshes
(to be conservative, we allow them even when 1/2 of the way through the
refresh period).
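A hedged sketch of the rule (the constant and field names are assumptions
for illustration):

    REFRESH_PERIOD = 24 * 3600  # assumed pruning interval, for illustration

    def fields_differ(old, new):
        """True if anything other than timestamp/signature changed."""
        ignore = {"timestamp", "signature"}
        return any(old[k] != new[k] for k in old if k not in ignore)

    def should_drop_update(old, new):
        if fields_differ(old, new):
            return False                      # real change: always accept
        age = new["timestamp"] - old["timestamp"]
        return age < REFRESH_PERIOD / 2       # allow refreshes past the halfway mark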
I dropped the now-unnecessary sleep from test_gossip_pruning, too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make update_local_channel use a timer if it's too soon to make another
update.
1. Implement cupdate_different() which compares two updates.
2. Make update_local_channel() take a single arg for timer usage.
3. Set timestamp of non-disable update back 5 minutes, so we can
always generate a disable update if we need to.
4. Make update_local_channel() itself do the "unchanged update" suppression.
gossipd: clean up local channel updates.
5. Keep pointer to the current timer so we override any old updates with
a new one, to avoid a race.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
After switching to a plugin, we verify that we can fund a channel
before we attempt to contact a peer. We'll need to have a funded wallet
to pass the check in this test that verifies that 'fundchannel' cannot
be called for a peer after fundchannel_start is.
Allow a user to select the utxo set that will be added to a
transaction, via the `utxos` parameter. Optional.
Format for utxos should be of the form ["txid:vout","..."]
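For illustration, splitting those strings back into (txid, output index)
pairs looks roughly like this (helper name invented; txids shortened for
readability):

    def parse_utxos(utxos):
        parsed = []
        for entry in utxos:
            txid, vout = entry.rsplit(":", 1)
            parsed.append((txid, int(vout)))
        return parsed

    parse_utxos(["e1f9...3a7b:0", "77ab...9c2d:1"])
    # -> [("e1f9...3a7b", 0), ("77ab...9c2d", 1)]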
For now, we can't fully ensure that the broadcast was caught from a third party. Only when the transaction (broadcast by a third party) is onchain can we catch it.
531c8d7d9b
In this one, we always send my_current_per_commitment_point, though it's
ignored. And we have our official feature numbers.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
As per BOLT02 #message-retransmission:
if `next_commitment_number` is 1 in both the `channel_reestablish` it sent and received:
- MUST retransmit `funding_locked`
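A tiny sketch of that rule (illustrative only, not channeld's actual code):

    def must_retransmit_funding_locked(sent_reestablish, recvd_reestablish):
        """Per the quoted requirement: both sides sent next_commitment_number == 1."""
        return (sent_reestablish["next_commitment_number"] == 1
                and recvd_reestablish["next_commitment_number"] == 1)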
It seems we spend a lot of time waiting for `bitcoind` and `lightningd` to
talk to disks. This adds the `TEST_DIR` environment variable, allowing for
example to use `/dev/shm`, or a faster disk than the disk `/tmp` is on, as the
root directory for all test-related files.
Testing this on one of our builder machines cut the time to run the entire
suite under valgrind roughly in half (180-200 seconds vs 440-490 seconds).
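A small sketch of how the option could be consumed by the fixtures, e.g.
when running with TEST_DIR=/dev/shm (the exact fixture wiring here is an
assumption):

    import os
    import tempfile

    TEST_DIR = os.getenv("TEST_DIR", "/tmp")

    def make_test_directory(prefix="ltests-"):
        # Every node/bitcoind directory lands under TEST_DIR, so pointing it
        # at /dev/shm (or a faster disk) keeps test I/O off the /tmp disk.
        return tempfile.mkdtemp(prefix=prefix, dir=TEST_DIR)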
My machine would accumulate a number of zombie lightningd and bitcoind
processes over time while testing. Investigating this showed that if a fixture
raised an exception during fixture teardown then other fixtures that have not
been torn down would linger around. The issue is that pytest treats exceptions
in fixtures as non-recoverable and therefore will not catch them and call the
remaining ones.
This commit adds a new fixture whose only job is to collect errors
from other fixtures and to ensure that anything that needs cleaning
up, e.g., processes started by a fixture, is cleaned up before we
re-raise any exception. This is achieved by making any fixture that
needs cleaning up depend on the teardown_checks fixture, which also
serves as the central point to collect and print those errors.
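A minimal sketch of the idea (names are illustrative; the real fixtures
do more bookkeeping):

    import pytest

    class TeardownErrors:
        """Collects errors raised during fixture teardown."""
        def __init__(self):
            self.errors = []

        def add_error(self, err):
            self.errors.append(err)

    @pytest.fixture
    def teardown_checks():
        errs = TeardownErrors()
        yield errs
        # Only raise once every dependent fixture has finished cleaning up.
        if errs.errors:
            raise ValueError("{} errors during teardown".format(len(errs.errors)))

    @pytest.fixture
    def node_factory(teardown_checks):
        nodes = []           # processes started by this fixture tracked here
        yield nodes
        for n in nodes:
            try:
                n.stop()     # hypothetical cleanup; must never abort teardown
            except Exception as e:
                teardown_checks.add_error(str(e))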
Signed-off-by: Christian Decker <decker.christian@gmail.com>
This has a slight side-effect of removing the actual begin and commit
statements from the `db_write` hooks, but they are mostly redundant anyway (no
harm in grouping pre-init statements into one transaction, and we know that
each post-init call is supposed to be wrapped anyway).
Signed-off-by: Christian Decker <decker.christian@gmail.com>
These are used to do one-time initializations and wait for pending statements
before closing.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
Log files were being deleted on memleak errors, since
we weren't marking the node as having an error.
There's a helper function designed to handle exactly this, so
we use it and modify it to print any additional
error messages handed back from killall.
Throwing an exception while killing all nodes meant that
we weren't cleaning up all the nodes properly. Instead,
collect the errors, and return them back to the upper level,
where we report them and terminate as expected.
Memleaks appear in the logs as 'broken', so the broken log
check captures them as well. This moves the broken check to after the
memleak check, so we get more informative error messages.
We were checking the test request against the searched-for string. This fixes
it by actually looking at the outcome instead, and should clean up correctly
if tests do not fail.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
This is an issue that was raised in #2665: some of the dependencies were
causing warnings to be added to the logs about deprecated dependencies. Since
I did not get these warnings I just blanket updated all the dependencies in
the hopes of getting the warnings to resolve.
Signed-off-by: Christian Decker <@cdecker>
We seem to be getting intermittent failures, but it's hard
to diagnose. Simplify it by moving all the test logic into
the test itself, and making the plugin dumber. This means we'll
see exactly what the differences are if it fails again.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
VALGRIND=1, SLOW_MACHINE=0:
Before: 197.74 seconds
After: 135.43 seconds
Note that we now spend about 13 seconds in teardown, which could probably
be optimized.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
During sync it is highly likely that we can coalesce multiple calls and share
results among them. We also report back failures for non-existing blocks early
on, so we don't run into issues with blocks that our bitcoind doesn't have
yet.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
This was caused by us not checking against the max_blockheight, but rather the
min_blockheight which can be negative with a newly created node. This is still
safe since we check for duplicates anyway in `wallet_filteredblock_add`.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
I don't remember ever seeing a bug which only showed up in VALGRIND=1 with developer
mode disabled, so don't test that, and spread out the other tests more evenly.
In addition, disable the worst-performing tests in DEVELOPER=0 mode.
Here are timings from my build machine: the worst 6 (- DEVELOPER=0 VALGRIND=0)
with the same tests (+ DEVELOPER=1 VALGRIND=1)
-452.42s call tests/test_pay.py::test_channel_spendable
+87.69s call tests/test_pay.py::test_channel_spendable
-335.66s call tests/test_gossip.py::test_gossip_store_compact_on_load
+47.41s call tests/test_gossip.py::test_gossip_store_compact_on_load
-332.07s call tests/test_connection.py::test_opening_tiny_channel
+89.71s call tests/test_connection.py::test_opening_tiny_channel
-331.97s call tests/test_pay.py::test_channel_spendable_large
+56.23s call tests/test_pay.py::test_channel_spendable_large
-305.28s call tests/test_invoices.py::test_invoice_routeboost
+37.57s call tests/test_invoices.py::test_invoice_routeboost
-284.28s call tests/test_plugin.py::test_htlc_accepted_hook_forward_restart
+49.12s call tests/test_plugin.py::test_htlc_accepted_hook_forward_restart
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is probably worth preventing.
1. Our depth estimate would be inaccurate possibly leading to us
timing out too early.
2. If we're not up-to-date our onchain funds are unknown.
3. We wouldn't be able to send or receive HTLCs until we're synced anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We want to still allow incoming connections, and reestablishment of
channels, but if one tries to give us an HTLC, stall until we're
synced.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we don't know block height, we shouldn't be sending HTLCs. This
stops us forwarding HTLCs as well as new payments.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
You're supposed to be able to hand mock_rpc either a function to call,
or a dict canned response. We never did the latter, and the logic
was broken.
It was testing the key, not the value, for whether it was a dict. And
it could never have given a valid response anyway, since it wouldn't
know the id to use. So assume it's a successful result.
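A hedged sketch of the intended dispatch (invented names; the real helper
lives in the test utilities):

    def mock_rpc_dispatch(mocks, request):
        mock = mocks[request["method"]]
        if callable(mock):
            return mock(request)
        # Canned dict: the test can't know the JSON-RPC id in advance,
        # so assume success and fill the id in here.
        return {"id": request["id"], "result": mock}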
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If l2 doesn't think we're onchain yet, it treats the new connection from l1
as a reconnection, triggering 'ValueError: 1 nodes had unexpected reconnections'
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We're about to change the API, so this makes the tests still work
across the transition (and, as a bonus, tests our backwards compat
shim).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This was the initial issue that was addressed by #2756 and now we just test
that all is working as expected.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
This is just the test that we use to verify block backfilling below the wallet
birth height is working correctly.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
@renepickhardt has a shell script we broke. While we still produce
perfectly valid JSON, we should not gratuitously change tool output.
Plus, I prefer the missing space before the ':'.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This test is spawning 100 nodes concurrently, which is a lot even when not
running with `valgrind`, especially when executing tests in parallel.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
This is a followup to #2892. Since we now attempt to lock the PID file before
starting plugins we need to make sure that we actually use a unique lightning
directory for anything that attempts to call `--help`. If not we may be
conflicting with a `lightningd` that is running against that directory.
Notice that this still means that we will be unable to call `--help` on
`lightningd` if we have a running instance, but isolation in this case is
good, otherwise we'd be reading the default config anyway.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
We recently noticed that the way we unpack the call arguments for hooks and
notifications in pylightning breaks pretty quickly once you start changing the
hook and notification params. If you add params they will not get mapped
correctly, causing the plugin to error out.
This can be fixed by adding a `VAR_KEYWORD` argument to the callbacks, i.e., by
adding a single `**kwargs` argument at the end of the signature. This commit
adds a check that such a catch-all argument exists, and emits a warning if it
doesn't.
It also fixes up the plugins that we ship ourselves.
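For example, a hook callback with the catch-all argument might look like
this (assuming the pylightning Plugin interface of that era):

    from lightning import Plugin

    plugin = Plugin()

    @plugin.hook("htlc_accepted")
    def on_htlc_accepted(onion, htlc, plugin, **kwargs):
        # **kwargs swallows any params added to the hook later, so the
        # plugin keeps working instead of erroring out on the new argument.
        return {"result": "continue"}

    plugin.run()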
Signed-off-by: Christian Decker <decker.christian@gmail.com>
Dumb programs which have a --daemon option call fork() early. This is
terrible UX since startup errors get lost: the program exits with
"success" immediately then you discover via the logs that it didn't
start at all.
However, forking late introduced a heap of problems with changing
pids. Instead, fork early but keep stderr and the parent around: if
we fail early on, the parent fails with us. We release our parent
with an explicit action just before the main loop.
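The pattern, sketched in Python for illustration (lightningd does this in
C; function names are invented):

    import os, sys

    def daemonize_but_hold_parent():
        readfd, writefd = os.pipe()
        if os.fork() != 0:
            # Parent: wait until the child releases us or dies.
            os.close(writefd)
            status = os.read(readfd, 1)
            sys.exit(0 if status == b"0" else 1)
        os.close(readfd)
        return writefd  # child keeps this open; stderr stays attached too

    def release_parent(writefd):
        # Called just before entering the main loop: let the parent exit 0.
        os.write(writefd, b"0")
        os.close(writefd)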
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We create our children, then fork, so we're not their parent. I noticed this
because 'lightning-cli stop' takes a long time: this is because it tries to
wait for them and they don't respond.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Note that we move adding the plugin to the plugins list to the end, otherwise
the hook from logging can examine the (uninitialized) plugin.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Our previous param support was a bit limited in this case.
We create a dev- command multiplexer, so we can exercise it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We were having a few issues with malformed data in the past, so this time we
really check that stuff.
Signed-off-by: Christian Decker <decker.christian@gmail.com>
Instead of taking over the ->cmd pointer, append ourselves to a list
of cancels. This fixes the test_funding_cancel_race.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
And clean up some dev ones which actually happen (mainly by calling
channel_fail_permanent which logs UNUSUAL, rather than
channel_internal_error which logs BROKEN).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rewriting the gossip_store is much simpler when we don't have
any pointers into it, so add some simple offline compaction code
and disable the automatic compaction code.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The crashes in #2750 are mostly caused by us trying to partially truncate
the store. The simplest fix for release is to discard the whole thing if
we detect a problem.
This is a workaround: it'd be far nicer to try to recover.
Fixes: #2750
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It wasn't invalid due to a missing channel_update, but in fact was a
bad checksum due to a cut & paste bug. Fix that, and assert it's not
actually truncating.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If something went wrong and there was an old one, we were
appending to it!
Reported-by: @SimonVrouwe
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We might have channel_announcements which have no channel_update: normally
these don't get written into the store until there is one, but if the
store was truncated it can happen. We then get upset on compaction, since
we don't have an in-memory representation of the channel_announcement.
Similarly, we leave the node_announcement pending until after that
channel_announcement, leading to a similar case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I decided to try a faster implementation, only to find our crc32c was
not correct! Ouch.
I removed the crc32c functions from ccan/crc, and added a new crc32c
module which has the Mark Adler x86-64-optimized variants.
We bump gossip_store version again, since csums have changed.
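For reference, a bitwise CRC-32C (Castagnoli, reflected polynomial
0x82F63B78) that an optimized variant can be checked against, shown here
as a Python sketch:

    def crc32c(data, crc=0):
        crc ^= 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    # Standard check value for the Castagnoli polynomial:
    assert crc32c(b"123456789") == 0xE3069283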
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Remove gratuitous prints, add explanations of what's going on,
and demonstrate that we can add a final trimmed HTLC but not
a non-trimmed one.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Subtracting both arbitrarily reduces our capacity, even for ourselves
since the routing logic uses this maximum.
I also changed 'advertise' to 'advertize', since we use american
spelling.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Turns out we needed more comprehensive testing; we ended up with three
separate tests. To avoid changing test_channel_drainage as we fix
spendable_msat, I substituted raw numbers there.
The first is a variation of the existing tests, testing we can't
exceed spendable_msat, and we can pay it, both ways.
The second is with a larger amount, which triggers a different problem.
The final is with a giant channel, which tests our 2^32-1 msat cap.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is where payment tests should go. Also mark it xfail for the moment,
and remove the developer-only tag (gossip propagation takes only 60 seconds,
which is OK).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>