We do this by keeping a current and an old map, and moving the current to old
every hour or 10,000 entries.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I was seeing some accidental pruning under load / Travis, and in
particular we stopped accepting channel_updates because they were 103
seconds old. But making it too long makes the prune test untenable,
so restore a separate flag that this test can use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now we queue them, we should place a limit. It's not the worst thing in
the world if we discard them (we'll catch up eventually), but we should
try not to in case we're just a bit behind.
Our behaviour here is also O(n^2) so we don't want a massive queue
anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The first one means we don't discard channels just because we're not
synced, and the second is implied by the spec: don't accept
channel_announcement if the channel isn't 6 deep. Since LND defers in
such cases, we do too (unless it's newer than the current block, in
which case we simply discard). Otherwise there's a risk that a slow
node might discard valid gossip.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Happened under Travis with --dev-fast-gossip (90 second prune time), but can
happen anyway if gossip is almost 2 weeks old when we receive it:
2019-09-20T19:16:51.367Z DEBUG lightning_gossipd(20972): Received node_announcement for node 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59
2019-09-20T19:16:51.376Z DEBUG lightning_gossipd(20972): Ignoring node_announcement timestamp 1569006918 for 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59
2019-09-20T19:16:51.669Z **BROKEN** lightning_gossipd(20972): pending node_announcement 01013094af771d60f4de69bb39ce045e4edf4a06fe6c80078dfa4fab58ab5617d6ad4fa34b6d3437380db0a8293cea348bbc77f714ef71fcd8515bfc82336667441f00005d852546022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59022d2253494c454e544152544953542d633961313734610000000000000000000000000000 malformed? (version c9a174a)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's generally clearer to have simple hardcoded numbers with an
#if DEVELOPER around it, than apparent variables which aren't, really.
Interestingly, our pruning test was always kinda broken: we have to pass
two cycles, since l2 will refresh the channel once to avoid pruning.
Do the more obvious thing, and cut the network in half and check that
l1 and l3 time out.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If you send a message which simply changes timestamp and signature, we
drop it. You shouldn't be doing that, and the door to ignoring them
was opened by by option_gossip_query_ex, which would allow clients to
ignore updates with the same checksum.
This is more aggressive at reducing spam messages, but we allow refreshes
(to be conservative, we allow them even when 1/2 of the way through the
refresh period).
I dropped the now-unnecessary sleep from test_gossip_pruning, too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make update_local_channel use a timer if it's too soon to make another
update.
1. Implement cupdate_different() which compares two updates.
2. make update_local_channel() take a single arg for timer usage.
3. Set timestamp of non-disable update back 5 minutes, so we can
always generate a disable update if we need to.
4. Make update_local_channel() itself do the "unchanged update" suppression.
gossipd: clean up local channel updates.
5. Keep pointer to the current timer so we override any old updates with
a new one, to avoid a race.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Normally we'd put a pointer into struct half_chan for local
information, but it would be NULL on 99.99% of nodes. Instead, keep a
separate hash table.
This immediately subsumes the previous "map of local-disabled
channels", and will be enhanced further.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For memory-usage reasons, struct chan doesn't use a tal destructor, in
favor of us calling free_chan in the right places.
In DEVELOPER mode, we should check that is the case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We've been slack, but it's going to be important for testing
ratelimiting. And it currently has a minor memory leak.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rather than reaching into data structures, let them register their own
callbacks. This avoids us having to expose "memleak_remove_xxx"
functions, and call them manually.
Under the hood, this is done by having a specially-named tal child of
the thing we want to assist, containing the callback.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We added a random channel to the list, but we can just free it immediately
(since traversal of a uintmap isn't altered by deletion).
This was introduced in d1f43d993a where we explicitly call free_chan
rather than relying on destructors.
Fixes: #2837
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
==1503== Use of uninitialised value of size 8
==1503== at 0x566786B: _itoa_word (_itoa.c:179)
==1503== by 0x566AF0D: vfprintf (vfprintf.c:1642)
==1503== by 0x569790F: vsnprintf (vsnprintf.c:114)
==1503== by 0x156CCB: do_vfmt (str.c:66)
==1503== by 0x156DB1: tal_vfmt_ (str.c:92)
==1503== by 0x1289CD: status_vfmt (status.c:141)
==1503== by 0x128AAC: status_fmt (status.c:151)
==1503== by 0x118E05: route_prune (routing.c:2495)
==1503== by 0x11DE2D: gossip_refresh_network (gossipd.c:1997)
==1503== by 0x1292B8: timer_expired (timeout.c:39)
==1503== by 0x12088C: main (gossipd.c:3075)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The crashes in #2750 are mostly caused by us trying to partially truncate
the store. The simplest fix for release is to discard the whole thing if
we detect a problem.
This is a workaround: it'd be far nicer to try to recover.
Fixes: #2750
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We might have channel_announcements which have no channel_update: normally
these don't get written into the store until there is one, but if the
store was truncated it can happen. We then get upset on compaction, since
we don't have an in-memory representation of the channel_announcement.
Similarly, we leave the node_announcement pending until after that
channel_announcement, leading to a similar case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We catch node_announcements for nodes where we haven't finished
analyzing the channel_announcement yet (either because we're still
checking UTXO, or in this case, because we're waiting for a channel_update).
But we reference count the pending_node_announce, so if we have
multiple channels pending, we might try to insert it twice. Clear it
so this doesn't happen.
There's a second bug where we continue to catch node_announcements
until *all* the channel_announcements are no longer pending; this is fixed
by removing it from the map.
Fixes: #2735
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, we'll need to know the short_channel_id if a
channel_update is unknown (implies we're missing a channel), and whether
processing a pending channel_announcement was successful (implies that
the channel was real).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means there's now a semantic difference between the default `fromid`
and setting `fromid` explicitly to our own node_id. In the default case,
it means we don't charge ourselves fees on the route.
This means we can spend the full channel balance.
We still want to consider the pricing of local channels, however:
there's a *reason* to discount one over another, and that is to bias
things. So we add the first-hop fee to the *risk* value instead.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Triggered by a previous variant of this PR, but a goo1d idea to simply
discard the store in general when we get a duplicate entry.
We crash trying to delete old ones, which means writing to the store.
But they should have already been deleted.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This clarifies things a fair bit: we simply add and remove from the
gossip_store directly.
Before this series: (--disable-developer, -Og)
store_load_msec:20669-20902(20822.2+/-82)
vsz_kb:439704-439712(439706+/-3.2)
listnodes_sec:0.890000-1.000000(0.92+/-0.04)
listchannels_sec:11.960000-13.380000(12.576+/-0.49)
routing_sec:3.070000-5.970000(4.814+/-1.2)
peer_write_all_sec:28.490000-30.580000(29.532+/-0.78)
After: (--disable-developer, -Og)
store_load_msec:19722-20124(19921.6+/-1.4e+02)
vsz_kb:288320
listnodes_sec:0.860000-0.980000(0.912+/-0.056)
listchannels_sec:10.790000-12.260000(11.65+/-0.5)
routing_sec:2.540000-4.950000(4.262+/-0.88)
peer_write_all_sec:17.570000-19.500000(18.048+/-0.73)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
(We don't increment the gossip_store version, since there are only a
few commits since the last time we did this).
This lets the reader simply filter messages; this is especially nice since
the channel_announcement timestamp is *derived*, not in the actual message.
This also creates a 'struct gossip_hdr' which makes the code a bit
clearer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use the high bit of the length field: this way we can still check
that the checksums are valid on deleted fields.
Once this is done, serially reading the gossip_store file will result
in a complete, ordered, minimal gossip broadcast. Also, the horrible
corner case where we might try to delete things from the store during
load time is completely gone: we only load non-deleted things.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
They're really gossipd-internal, and we don't want per-peer daemons
to confuse them with normal updates.
I don't bump the gossip_store version; that's coming with another update
anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now we handle node_announcements properly, we have a failure case where we
try to move them when a channel is deleted while loading the store.
We're going to remove this soon, in favor of in-place delete, so
workaround this for now to avoid an assert() when we try to write to
the store while loading.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we first receive a channel_update, we write both the
channel_announcement and that channel_update to the store: we need
that first update so we can set the channel_announcement timestamp.
However, the channel_update can be replaced later. This means we can
have a channel_announcement, a node_update which relies on it, then
the channel_update later.
So move the "this applies to a pending announcement" check lower, where
gossip_store can use it too. Has a nice side-effect of avoiding
one lookup of the node id.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fix a path where tal_free is called on an uninitialized variable
If the first `goto bad_total` executes, then that path has
uninitialized `short_route` but bad_total passes through to `out`
whose first call is tal_free(short_route).
This was noticed by a maybe-uninitialized heuristic on gcc 7.4.0:
gossipd/routing.c: In function ‘find_shorter_route’:
gossipd/routing.c:1096:2: error: ‘short_route’ may be used
uninitialized in this function [-Werror=maybe-uninitialized]
tal_free(short_route);
Reported-by: @ZmnSCPxj <https://github.com/ElementsProject/lightning/pull/2674#issuecomment-495617253>
Signed-off-by: William Casarin <jb55@jb55.com>
Each destructor2 costs 40 bytes, and struct chan is only 120 bytes. So
this drops our memory usage quite a bit:
MCP bench results change:
-vsz_kb:580004-580016(580006+/-4.8)
+vsz_kb:533148
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We now have a test blockchain for MCP which has the correct channels,
so this is not needed.
Also fix a benchmark script bug where 'mv "$DIR"/log
"$DIR"/log.old.$$' would fail if you log didn't exist from a previous run.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We need to store the channel capacity for channel_announcement: hand it
in directly rather than having the gossip_store code do a lookup.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now we can benchmark, and remove 500 bytes per node.
MCP results from 5 runs, min-max(mean +/- stddev):
store_load_msec:35093-37907(36146+/-1.1e+03)
vsz_kb:555168
store_rewrite_sec:12.120000-13.750000(12.7+/-0.6)
listnodes_sec:1.270000-1.370000(1.322+/-0.039)
listchannels_sec:29.770000-31.600000(30.82+/-0.64)
routing_sec:0.00
peer_write_all_sec:63.630000-67.850000(65.432+/-1.7)
MCP notable changes from pre-Dijkstra (>1 stddev):
-vsz_kb:577456
+vsz_kb:555168
-routing_sec:60.70
+routing_sec:12.04
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Do it inside the can_reach() function, which is less optimal for BFG
which does 20 ops on the same channel, but fine for Dijkstra.
This does have a measurable cost, so we might want to use
non-cryptographic fuzz in future:
$ gossipd/test/run-bench-find_route 100000 100:
Before:
100 (100 succeeded) routes in 100000 nodes in 97346 msec (973461784 nanoseconds per route)
After:
100 (100 succeeded) routes in 100000 nodes in 113381 msec (1133813412 nanoseconds per route)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If a route is too long, we try to bias Dijkstra towards choosing a
shorter route by adding a per-hop cost. We do a naive "shortest path"
pass, then using that cost as a ceiling on per-hop cost, we do a
binary search.
There are some subtleties: we use risk rather than total as our
counter field (we normally bias this by 1 anyway, so it's easy to make
that a variable), and we set riskfactor to a mimimal value once we're
iterating. It's good enough to get a solution, we don't need to do a
2-dimensional search on riskfactor and riskbias.
Of course, this is extremely slow if we hit it on our benchmark,
though it doesn't happen in a more realistic network:
$ gossipd/test/run-bench-find_route 100000 100:
Before:
100 (79 succeeded) routes in 100000 nodes in 25341 msec (253412314 nanoseconds per route)
After:
100 (100 succeeded) routes in 100000 nodes in 97346 msec (973461784 nanoseconds per route)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>