We do this by keeping a current and an old map, and moving the current to
old every hour or every 10,000 entries.
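A minimal sketch of that two-generation scheme (illustrative only, with
invented names; gossipd uses its own hash tables): lookups consult both
maps, new entries go into the current one, and the current map rotates to
old once it hits 10,000 entries or an hour has passed.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define SEEN_BUCKETS 16384		/* power of two, comfortably > 10,000 */
#define ROTATE_ENTRIES 10000
#define ROTATE_SECONDS 3600

/* One generation: a tiny open-addressed set of (non-zero) 64-bit hashes. */
struct seen_map {
	uint64_t slot[SEEN_BUCKETS];
	size_t count;
};

struct seen_cache {
	struct seen_map cur, old;
	time_t rotated_at;
};

static bool map_has(const struct seen_map *m, uint64_t h)
{
	for (size_t i = h % SEEN_BUCKETS; m->slot[i]; i = (i + 1) % SEEN_BUCKETS)
		if (m->slot[i] == h)
			return true;
	return false;
}

static void map_add(struct seen_map *m, uint64_t h)
{
	size_t i = h % SEEN_BUCKETS;
	while (m->slot[i])
		i = (i + 1) % SEEN_BUCKETS;
	m->slot[i] = h;
	m->count++;
}

/* Returns true if h was seen recently; otherwise records it. */
static bool seen_or_record(struct seen_cache *c, uint64_t h)
{
	if (map_has(&c->cur, h) || map_has(&c->old, h))
		return true;

	/* Rotation: current becomes old, current starts empty again. */
	if (c->cur.count >= ROTATE_ENTRIES
	    || time(NULL) - c->rotated_at >= ROTATE_SECONDS) {
		c->old = c->cur;
		memset(&c->cur, 0, sizeof(c->cur));
		c->rotated_at = time(NULL);
	}
	map_add(&c->cur, h);
	return false;
}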
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This encoding scheme is no longer just used for short_channel_ids, so make
the names more generic.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I was seeing some accidental pruning under load on Travis, and in
particular we stopped accepting channel_updates because they were 103
seconds old. But making the prune time too long makes the prune test
untenable, so restore a separate flag that this test can use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The only real change is that dump_gossip() used to call
maybe_create_next_scid_reply(); I've simply renamed that to
maybe_send_query_responses() and we now call it directly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that we queue them, we should place a limit. It's not the worst thing in
the world if we discard them (we'll catch up eventually), but we should
try not to in case we're just a bit behind.
Our behaviour here is also O(n^2), so we don't want a massive queue
anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The first one means we don't discard channels just because we're not
synced, and the second is implied by the spec: don't accept
channel_announcement if the channel isn't 6 deep. Since LND defers in
such cases, we do too (unless it's newer than the current block, in
which case we simply discard). Otherwise there's a risk that a slow
node might discard valid gossip.
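A sketch of that rule, with invented names (the real check lives in
gossipd's routing code): an announcement whose short_channel_id points
past the current tip is discarded, one that isn't yet 6 confirmations
deep is deferred, and anything else is processed.

#include <stdint.h>

enum cannounce_action {
	CANNOUNCE_ACCEPT,	/* process it now */
	CANNOUNCE_DEFER,	/* keep it pending until deeper */
	CANNOUNCE_DISCARD	/* claims a block we haven't seen yet */
};

static enum cannounce_action
channel_announce_depth_check(uint32_t scid_blockheight,
			     uint32_t current_blockheight)
{
	if (scid_blockheight > current_blockheight)
		return CANNOUNCE_DISCARD;
	/* "6 deep" == the funding block plus at least 5 more on top. */
	if (current_blockheight < scid_blockheight + 5)
		return CANNOUNCE_DEFER;
	return CANNOUNCE_ACCEPT;
}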
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This will let gossipd be more intelligent about gossiping before we're
synced, and it might also know how far behind we are.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Happened under Travis with --dev-fast-gossip (90 second prune time), but can
happen anyway if gossip is almost 2 weeks old when we receive it:
2019-09-20T19:16:51.367Z DEBUG lightning_gossipd(20972): Received node_announcement for node 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59
2019-09-20T19:16:51.376Z DEBUG lightning_gossipd(20972): Ignoring node_announcement timestamp 1569006918 for 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59
2019-09-20T19:16:51.669Z **BROKEN** lightning_gossipd(20972): pending node_announcement 01013094af771d60f4de69bb39ce045e4edf4a06fe6c80078dfa4fab58ab5617d6ad4fa34b6d3437380db0a8293cea348bbc77f714ef71fcd8515bfc82336667441f00005d852546022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59022d2253494c454e544152544953542d633961313734610000000000000000000000000000 malformed? (version c9a174a)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's generally clearer to have simple hardcoded numbers with an
#if DEVELOPER around them than apparent variables which aren't really
variables at all.
Interestingly, our pruning test was always kinda broken: we have to wait
two cycles, since l2 will refresh the channel once to avoid pruning.
Do the more obvious thing: cut the network in half and check that
l1 and l3 time out.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If you send a message which simply changes timestamp and signature, we
drop it. You shouldn't be doing that, and the door to ignoring such
updates was opened by option_gossip_query_ex, which would allow clients
to ignore updates with the same checksum.
This is more aggressive at reducing spam messages, but we allow refreshes
(to be conservative, we allow them even when only 1/2 of the way through
the refresh period).
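As a sketch (the constants and names here are assumptions, not the exact
gossipd values): with a two-week prune interval, a content-identical
update counts as a legitimate refresh once the previous one is at least
half a refresh period old, and is dropped as spam otherwise.

#include <stdbool.h>
#include <stdint.h>

#define PRUNE_SECONDS (14 * 24 * 3600)		/* assumed: two-week prune interval */
#define REFRESH_SECONDS (PRUNE_SECONDS / 2)	/* assumed: nodes refresh at half that */

static bool accept_unchanged_update(uint32_t prev_timestamp,
				    uint32_t new_timestamp)
{
	/* Conservatively accept a refresh once we're at least halfway
	 * through the refresh period; anything earlier is just spam. */
	return new_timestamp >= prev_timestamp + REFRESH_SECONDS / 2;
}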
I dropped the now-unnecessary sleep from test_gossip_pruning, too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make update_local_channel use a timer if it's too soon to make another
update.
1. Implement cupdate_different() which compares two updates.
2. Make update_local_channel() take a single arg for timer usage.
3. Set timestamp of non-disable update back 5 minutes, so we can
always generate a disable update if we need to.
4. Make update_local_channel() itself do the "unchanged update" suppression.
5. Keep pointer to the current timer so we override any old updates with
a new one, to avoid a race.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Normally we'd put a pointer into struct half_chan for local
information, but it would be NULL on 99.99% of nodes. Instead, keep a
separate hash table.
This immediately subsumes the previous "map of local-disabled
channels", and will be enhanced further.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Write helpers to split it into non-timestamp, non-signature parts, and
simply compare those. We extract a helper to do the same for
channel_update, too.
This is more generic than our previous approach, and simpler.
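A rough sketch of the kind of helper this describes, assuming the BOLT 7
channel_update wire layout with its 2-byte type prefix (type |
signature[64] | chain_hash[32] | short_channel_id[8] | timestamp[4] |
remaining fields); the real helpers in gossipd are named differently.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns true if the two updates differ in anything other than
 * signature and timestamp. */
static bool cupdate_content_differs(const uint8_t *a, size_t alen,
				    const uint8_t *b, size_t blen)
{
	const size_t sig_end = 2 + 64;			/* type + signature */
	const size_t ts_off = sig_end + 32 + 8;		/* + chain_hash + scid */
	const size_t ts_end = ts_off + 4;

	if (alen != blen || alen < ts_end)
		return true;
	/* chain_hash and short_channel_id */
	if (memcmp(a + sig_end, b + sig_end, ts_off - sig_end) != 0)
		return true;
	/* flags, cltv_expiry_delta, fees, htlc limits, ... */
	return memcmp(a + ts_end, b + ts_end, alen - ts_end) != 0;
}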
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For memory-usage reasons, struct chan doesn't use a tal destructor, in
favor of us calling free_chan in the right places.
In DEVELOPER mode, we should check that this is the case.
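One way to express that check, sketched with assumed names against
routing's existing struct chan and struct routing_state (the destructor
would be attached at chan creation time, in DEVELOPER builds only):
free_chan sets a flag around the actual tal_free, and the destructor
asserts the flag, so any direct tal_free of a struct chan trips it.

#if DEVELOPER
#include <assert.h>
#include <stdbool.h>

static bool freeing_chan;	/* assumed flag, only touched by free_chan */

/* Attached via tal_add_destructor() when the chan is created. */
static void destroy_chan_check(struct chan *chan)
{
	assert(freeing_chan);
}
#endif

void free_chan(struct routing_state *rstate, struct chan *chan)
{
#if DEVELOPER
	freeing_chan = true;
#endif
	tal_free(chan);
#if DEVELOPER
	freeing_chan = false;
#endif
}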
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We've been slack, but it's going to be important for testing
ratelimiting. And it currently has a minor memory leak.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rather than reaching into data structures, let them register their own
callbacks. This avoids us having to expose "memleak_remove_xxx"
functions, and call them manually.
Under the hood, this is done by having a specially-named tal child of
the thing we want to assist, containing the callback.
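Under the hood it might look roughly like this (a sketch against
ccan/tal; the helper name and the child's tal name are my understanding,
so treat the details as assumptions): the callback lives in a small tal
child whose name the memleak scanner recognises, so each daemon
registers what to exclude instead of exporting memleak_remove_xxx
functions.

#include <ccan/tal/tal.h>

struct htable;	/* the memleak scanner's table of suspect pointers */

struct memleak_helper {
	void (*cb)(struct htable *memtable, const tal_t *parent);
};

void memleak_add_helper_(const tal_t *p,
			 void (*cb)(struct htable *memtable,
				    const tal_t *parent))
{
	struct memleak_helper *mh = tal(p, struct memleak_helper);

	mh->cb = cb;
	/* The scanner walks the tal tree and calls any child with this name. */
	tal_set_name(mh, "memleak_helper");
}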
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
`make update-mocks` is usually run in DEVELOPER mode, but then it includes
definitions for functions which aren't declared in non-DEVELOPER mode.
We hacked around this in a few places, but it's fragile, and worse, now
that we have EXPERIMENTAL_FEATURES as well, it's complex.
Instead, declare developer-only functions (but don't define them).
This is a bit more awkward if you accidentally use one in
non-DEVELOPER code (link error rather than compile error), but makes
autogenerating test mocks much easier.
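The pattern looks like this, using a hypothetical dev_example_tweak()
rather than any real gossipd function (struct daemon and u8 are the
existing project types): the prototype is always visible, so update-mocks
can generate a mock under any configuration, but the body only exists in
DEVELOPER builds.

/* In the header: always declared, never wrapped in #if DEVELOPER. */
void dev_example_tweak(struct daemon *daemon, const u8 *msg);

/* In the .c file: only defined for DEVELOPER builds, so accidentally
 * calling it from non-DEVELOPER code fails at link time. */
#if DEVELOPER
void dev_example_tweak(struct daemon *daemon, const u8 *msg)
{
	/* ...developer-only behaviour... */
}
#endif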
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fortunately, again, this only happens with EXPERIMENTAL_FEATURES.
If the query causes us not to actually send anything, we won't
get called again. This can validly happen if they only asked for
the node_announcements, for example.
(Found by protocol tests).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Our "are we finished?" logic was wrong: it tested if there are no more
node_announcements, but it's possible that there were no node_announcements
for either end of the channel whose information we sent.
This is actually quite unusual on the real network: looking at mainnet
stats from last May, 4301 of 4337 nodes have node_announcements.
However, with query flags it's much more likely, since they might not
ask for node announcements at all.
(Found by gossip protocol tests)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
These both allow us to reproduce the test vectors in the next patch. But
using Z_DEFAULT_COMPRESSION is a reasonable idea anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make the TLV element a simple array. This is a bit neater, in fact, and
makes the test vectors in PR 557 work.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In fact, we always generate them; we only send them if asked. And we set
the flags to 0 if we're not built with --enable-experimental-features, so
we never send them in that case.
Generating checksums involves pulling the channel_update from the
gossip_store, which is suboptimal: there's a FIXME to store the
checksum in memory.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We're about to use this for gossip extended info too, which *doesn't* put
the encoding byte at the beginning of the data stream. So this removes
some "scids" from function names and separates out the "prepend a byte"
case from the "external encoding_type" case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
These indicate what fields we are to return. If there's no TLV, or we
haven't got --enable-experimental-features, it's set to all 1s so behaviour
is unchanged.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We ignored this before, which meant that the DEVELOPER-mode check that we
delete the correct record didn't check that it wasn't already deleted.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We always know the length, so we don't need it. It causes much extra work
when we want to delete a record, which I suspect may cause issues amongst
some users who've been seeing gossip_store corruption.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We added a random channel to the list, but we can just free it immediately
(since traversal of a uintmap isn't altered by deletion).
This was introduced in d1f43d993a where we explicitly call free_chan
rather than relying on destructors.
Fixes: #2837
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
==1503== Use of uninitialised value of size 8
==1503== at 0x566786B: _itoa_word (_itoa.c:179)
==1503== by 0x566AF0D: vfprintf (vfprintf.c:1642)
==1503== by 0x569790F: vsnprintf (vsnprintf.c:114)
==1503== by 0x156CCB: do_vfmt (str.c:66)
==1503== by 0x156DB1: tal_vfmt_ (str.c:92)
==1503== by 0x1289CD: status_vfmt (status.c:141)
==1503== by 0x128AAC: status_fmt (status.c:151)
==1503== by 0x118E05: route_prune (routing.c:2495)
==1503== by 0x11DE2D: gossip_refresh_network (gossipd.c:1997)
==1503== by 0x1292B8: timer_expired (timeout.c:39)
==1503== by 0x12088C: main (gossipd.c:3075)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Updates the bolt version to 6639cef095a2ecc7b8f0c48c6e7f2f906fbfbc58.
This requires us to use the new bolt parser at generate-bolt.py
and updates all of the type specifications (i.e. from u8 -> byte).
Rewriting the gossip_store is much simpler when we don't have
any pointers into it, so add some simple offline compaction code
and disable the automatic compaction code.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The crashes in #2750 are mostly caused by us trying to partially truncate
the store. The simplest fix for release is to discard the whole thing if
we detect a problem.
This is a workaround: it'd be far nicer to try to recover.
Fixes: #2750
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We hit the timestamp assert on #2750; it shouldn't happen, but crashing
doesn't leave much information.
Reported-by: @m-schmook
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If something went wrong and there was an old one, we were
appending to it!
Reported-by: @SimonVrouwe
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We might have channel_announcements which have no channel_update: normally
these don't get written into the store until there is one, but if the
store was truncated it can happen. We then get upset on compaction, since
we don't have an in-memory representation of the channel_announcement.
Similarly, we leave the node_announcement pending until after that
channel_announcement, leading to a similar case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can't continue, since we've moved the indexes. We'll just crash
anyway, as seen from bugs #2742 and #2743.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We catch node_announcements for nodes where we haven't finished
analyzing the channel_announcement yet (either because we're still
checking UTXO, or in this case, because we're waiting for a channel_update).
But we reference count the pending_node_announce, so if we have
multiple channels pending, we might try to insert it twice. Clear it
so this doesn't happen.
There's a second bug where we continue to catch node_announcements
until *all* the channel_announcements are no longer pending; this is fixed
by removing it from the map.
Fixes: #2735
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We seek a certain number of peers at each level of gossip: 3 which "flood"
us if we're missing gossip, 2 starting 24 hours in the past to catch recent
gossip, and 8 with current gossip. The rest are given a filter which causes
them not to gossip to us at all.
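Sketched with invented names (the constants mirror the numbers above; the
actual seeker logic differs in detail), the policy amounts to picking a
gossip_timestamp_filter start per peer:

#include <stdbool.h>
#include <stdint.h>

#define FLOOD_PEERS 3
#define RECENT_PEERS 2
#define LIVE_PEERS 8
#define DAY_SECONDS (24 * 60 * 60)

/* Returns the first_timestamp we'd put in gossip_timestamp_filter for
 * the nth peer: 0 means "send everything", UINT32_MAX effectively
 * silences the peer. */
static uint32_t gossip_level_start(unsigned int peer_index,
				   bool missing_gossip, uint32_t now)
{
	if (missing_gossip && peer_index < FLOOD_PEERS)
		return 0;
	if (peer_index < FLOOD_PEERS + RECENT_PEERS)
		return now - DAY_SECONDS;
	if (peer_index < FLOOD_PEERS + RECENT_PEERS + LIVE_PEERS)
		return now;
	return UINT32_MAX;
}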
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The first sign that we're missing gossip is that we get a channel_update
for an unknown channel. The peer might be wrong (or lying), but if it turns
out to be a real channel, we were definitely missing something.
This patch does two things: queries when we get an unknown channel_update,
and then notes that a channel_announcement was from such an update when
it's finally processed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, we'll need to know the short_channel_id if a
channel_update is unknown (implies we're missing a channel), and whether
processing a pending channel_announcement was successful (implies that
the channel was real).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Up until now we only generated these in dev mode for testing. Hoist
into common code, turn the counter into a flag (we're only allowed one!)
and note whether the query is internal or not.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I decided to try a faster implementation, only to find our crc32c was
not correct! Ouch.
I removed the crc32c functions from ccan/crc, and added a new crc32c
module which has the Mark Adler x86-64-optimized variants.
We bump gossip_store version again, since csums have changed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means there's now a semantic difference between the default `fromid`
and setting `fromid` explicitly to our own node_id. In the default case,
we don't charge ourselves fees on the route, which means we can spend the
full channel balance.
We still want to consider the pricing of local channels, however:
there's a *reason* to discount one over another, and that is to bias
things. So we add the first-hop fee to the *risk* value instead.
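As a sketch (invented names; the real routing code is more involved): the
fee for a hop out of our own node isn't added to the amount that must be
forwarded, only to the risk term, so it still biases channel selection
without reducing spendable balance.

#include <stdbool.h>
#include <stdint.h>

struct hop_cost {
	uint64_t amount_msat;	/* what the channel must actually carry */
	uint64_t risk_msat;	/* cost term used only to rank candidate routes */
};

static uint64_t fee_msat(uint64_t amount_msat, uint32_t base_msat,
			 uint32_t proportional_millionths)
{
	return base_msat + amount_msat * proportional_millionths / 1000000;
}

static struct hop_cost add_hop(uint64_t amount_msat, uint64_t risk_msat,
			       uint32_t base_msat, uint32_t ppm,
			       bool own_channel)
{
	struct hop_cost c = { amount_msat, risk_msat };
	uint64_t fee = fee_msat(amount_msat, base_msat, ppm);

	if (own_channel)
		c.risk_msat += fee;	/* bias only: we never pay ourselves */
	else
		c.amount_msat += fee;	/* downstream nodes really charge this */
	return c;
}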
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(Or, if we crashed before we got to write out the channel_update).
It's a corner case, but one reported by @darosior and reproduced
on my test node (both with bad gossip_store due to previous iterations
of this patchset!).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Triggered by a previous variant of this PR, but it's a good idea to simply
discard the store in general when we get a duplicate entry.
We crash trying to delete old ones, which means writing to the store.
But they should have already been deleted.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This clarifies things a fair bit: we simply add and remove from the
gossip_store directly.
Before this series: (--disable-developer, -Og)
store_load_msec:20669-20902(20822.2+/-82)
vsz_kb:439704-439712(439706+/-3.2)
listnodes_sec:0.890000-1.000000(0.92+/-0.04)
listchannels_sec:11.960000-13.380000(12.576+/-0.49)
routing_sec:3.070000-5.970000(4.814+/-1.2)
peer_write_all_sec:28.490000-30.580000(29.532+/-0.78)
After: (--disable-developer, -Og)
store_load_msec:19722-20124(19921.6+/-1.4e+02)
vsz_kb:288320
listnodes_sec:0.860000-0.980000(0.912+/-0.056)
listchannels_sec:10.790000-12.260000(11.65+/-0.5)
routing_sec:2.540000-4.950000(4.262+/-0.88)
peer_write_all_sec:17.570000-19.500000(18.048+/-0.73)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have a problem: if we get halfway through writing the compacted store
and run out of disk space, we've already changed half the indexes.
This changes it so we do nothing until writing is finished: then we
iterate through and update indexes. It also weans us off broadcast
ordering, which we can now eliminate.
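In sketch form (invented names; the real code works on the gossip_store
file and the in-memory chan/node structures): phase one copies every live
record into the new file while only recording where each one will land;
the in-memory offsets are rewritten in a second pass, after the whole
file has been written successfully.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct store_ref {
	uint64_t off;		/* offset currently used by in-memory structures */
	uint64_t new_off;	/* offset in the not-yet-live compacted file */
};

/* Phase 1: write everything out, touching nothing in memory.  If this
 * fails (e.g. out of disk space), the old store stays authoritative. */
static bool write_compacted_store(FILE *newf, struct store_ref *refs, size_t n,
				  bool (*copy_record)(FILE *newf,
						      uint64_t old_off,
						      uint64_t *new_off))
{
	for (size_t i = 0; i < n; i++)
		if (!copy_record(newf, refs[i].off, &refs[i].new_off))
			return false;
	return fflush(newf) == 0;
}

/* Phase 2: only now do the indexes move to the new file's offsets. */
static void adopt_new_offsets(struct store_ref *refs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		refs[i].off = refs[i].new_off;
}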
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We didn't count some records before, so we could compare the two counters.
This is much simpler, and avoids reliance on bs.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>