Zombie channels in mobile


Roei Erez

Dec 5, 2019, 3:18:02 PM
to lnd
Hi,
The way zombie channels are treated in nodes configured with assumechannelvalid=true (usually mobile) is that once a channel enters zombie mode, it only exits that mode when a channel update is received.
Such updates are broadcast at some interval, but since a mobile node is offline most of the time, there is a high chance the update never lands.
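
For illustration, here is a minimal sketch of that rule; the two-week window matches the expiration period used in the examples below, the names are made up rather than lnd's actual code, and it assumes a channel only counts as a zombie once neither direction has a recent update:

package sketch

import "time"

// chanExpiry is the zombie expiration period discussed in this thread.
const chanExpiry = 14 * 24 * time.Hour

// edgeTimes holds the last channel_update timestamp seen for each
// direction of a channel (illustrative representation only).
type edgeTimes struct {
	node1LastUpdate time.Time
	node2LastUpdate time.Time
}

// isZombie reports whether the channel should be moved to the zombie
// bucket: neither direction has produced an update within chanExpiry.
func isZombie(e edgeTimes, now time.Time) bool {
	latest := e.node1LastUpdate
	if e.node2LastUpdate.After(latest) {
		latest = e.node2LastUpdate
	}
	return now.Sub(latest) > chanExpiry
}

// resurrects reports whether a freshly received channel_update is recent
// enough to pull the channel out of the zombie bucket again.
func resurrects(updateTimestamp, now time.Time) bool {
	return now.Sub(updateTimestamp) <= chanExpiry
}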

So consider the following case:
1. The mobile node wasn't online for two weeks.
2. The mobile node goes online, fetches the new channels, and all of its old channels are marked as zombies.
The mobile node enters a bad state that is hard to recover from.

From this point there is no easy and fast way for the mobile node to catch up with the network other than deleting the graph and syncing from scratch.
Currently, when a mobile node connects to its first peer, it queries only for new channels (those it doesn't know about). I wonder if there is room to also include zombie channels in this query.
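
To make the question concrete, here is a rough sketch of the selection logic being discussed; the helper names are illustrative, not lnd's real API. Today only unknown channels make it into query_short_channel_ids; the idea is to optionally add the IDs sitting in the zombie bucket as well:

package sketch

// graphView is an illustrative view over the local graph database.
type graphView interface {
	HasLiveChannel(chanID uint64) bool
	IsZombie(chanID uint64) bool
}

// channelsToQuery picks the short channel IDs to request from a peer.
// advertised is the ID list learned from the peer's reply_channel_range,
// and includeZombies is the proposed behavior for the historical sync.
func channelsToQuery(g graphView, advertised []uint64, includeZombies bool) []uint64 {
	var ids []uint64
	for _, id := range advertised {
		switch {
		case g.HasLiveChannel(id):
			// Known and live: nothing to fetch.
		case g.IsZombie(id):
			// Today known zombies are skipped so they aren't
			// re-downloaded on every sync; including them would
			// let a live channel that was wrongly marked as a
			// zombie be recovered from the peer's response.
			if includeZombies {
				ids = append(ids, id)
			}
		default:
			// Unknown channel: always request it.
			ids = append(ids, id)
		}
	}
	return ids
}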

Appreciate your thoughts.
Roei.

Wilmer Paulino

Dec 11, 2019, 1:51:39 AM
to Roei Erez, lnd
Hi Roei,

> So consider the following case:
> 1. The mobile node wasn't online for two weeks.
> 2. The mobile node goes online, fetches the new channels, and all of its old channels are marked as zombies.
> The mobile node enters a bad state that is hard to recover from.

There's also the case of a mobile node that is online for a short period every day and still ends up pruning most of its channels, because it never sees their updated timestamps; updates only propagate to nodes that are online at the time. There's a way to query for channel updates in the spec, but we've yet to implement it. Doing so for most of the graph wouldn't be ideal anyway, so we'd only want to query for the portion of the graph we use most.

> Currently, when a mobile node connects to its first peer, it queries only for new channels (those it doesn't know about). I wonder if there is room to also include zombie channels in this query.

A few months ago, when the channel graph blew up in size, we realized there was a lot of zombie churn among nodes: a node would fully sync with one peer, prune any zombie channels, connect to another peer, and end up requesting the same zombie channels it had just pruned. As a mitigation, we decided to no longer request known zombie channels.

Instead, we can delay the initial graph pruning on startup until we've had sufficient uptime receiving channel updates (one hour should be enough), which also solves the issue for non-mobile nodes. Mobile nodes are unlikely to reach this uptime, so they'd mostly rely on pruning double-disabled channels.
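
A minimal sketch of that idea, with made-up names; the one-hour figure is the number above, and everything else is an assumption rather than lnd's actual implementation:

package sketch

import "time"

// initialPruneDelay gates the first zombie sweep after startup; one hour
// of uptime should be enough to have received fresh channel updates.
const initialPruneDelay = time.Hour

// delayInitialPrune runs pruneZombies only once the node has stayed up
// for initialPruneDelay, so channels aren't swept before their updates
// had a chance to arrive. quit lets the caller cancel on shutdown.
func delayInitialPrune(pruneZombies func(), quit <-chan struct{}) {
	select {
	case <-time.After(initialPruneDelay):
		pruneZombies()
	case <-quit:
	}
}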

- Wilmer

Roei Erez

Dec 11, 2019, 9:29:26 AM
to Wilmer Paulino, lnd
Hi Wilmer, and thanks for your detailed response.
I agree that implementing the spec's timestamp filtering would let a node receive only the information it really needs, and it sounds like a very good improvement over the current behavior.
It seems like the short-term solution of not querying for zombie channels removes the redundant queries but introduces other issues that leave many mobile nodes unable to function at all, which is a far worse and unrecoverable problem.
The proposed solution of delaying the initial graph pruning by one hour would probably cause some mobile nodes never to prune at all, which leads to another issue where a large portion of the graph the mobile node sees is stale.
In addition, it is not a very rare case for a mobile node to run for an hour, so some users would still suffer from the current problem, making the situation even worse as we would have two problems to deal with.
I don't really see an alternative other than querying for the zombie channels. We could do that only in the "historical" sync that is done with the first peer, and by that partially mitigate the churn issue you mentioned.
Also, the number of zombie channels today (with a two-week expiration period) is ~1,500, which is a very small portion of the graph and gives me more confidence that the tradeoff is good.
Another alternative is to add a startup parameter (delete-zombies-on-start) that, when passed, triggers deletion of the zombie channels. It is less intrusive, as nodes may decide to do that periodically or for specific cases, and it will make the historical sync ask for those channels again.
Neither solution is ideal, for sure, but to me the tradeoff looks better until the filter queries are implemented.
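
To make the delete-zombies-on-start idea above a bit more concrete, here is a sketch of what such an option might look like; the flag name is the one proposed above, while the types and methods are hypothetical rather than an existing lnd API:

package sketch

// Config is a stand-in for the node's startup configuration (the struct
// tag follows the go-flags style lnd uses for its options).
type Config struct {
	DeleteZombiesOnStart bool `long:"delete-zombies-on-start" description:"Delete all zombie channels on startup."`
}

// zombieStore is a stand-in for the graph database's zombie index.
type zombieStore interface {
	DeleteAllZombies() error
}

// maybeResetZombies applies the flag during startup. Wiping the bucket
// makes the next historical sync ask for those channels again.
func maybeResetZombies(cfg Config, db zombieStore) error {
	if !cfg.DeleteZombiesOnStart {
		return nil
	}
	return db.DeleteAllZombies()
}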

Roei

Roei Erez

Dec 12, 2019, 2:26:08 PM
to lnd, wil...@lightning.engineering
Hi Wilmer,

I played more with graph sync on light nodes connected to hub peers and wanted to share the results here.
I noticed that channels are pruned based on expiry time and, when assumechannelvalid=true, also when both edges are disabled.
One thing I am not sure I understand is why we prune disabled channels rather than just waiting for them to expire; as far as I know they won't be considered in pathfinding even if they stay in the graph.
Observing a well-connected hub, it turns out there are ~1,500 zombies there. I also noticed that when the mobile client asks for these zombies in query_short_channel_ids during the historical sync, the hub doesn't include them in the response even when asked (probably because they don't exist in its live graph view).
So according to this, the client can safely query for the zombie channels, assuming that only those which are not zombies on the hub will be included in the response.
On some mobile devices the zombie bucket grew to ~40,000 channels. Some of these were real closed channels, but at least 15,000 turned out to be regular live channels that had been moved to the zombie bucket and never got out due to the light node's short uptime.

I tested the following changes on top of master on such devices:
1. Don't prune disabled channels (sketched right after this list).
2. Include the zombie channels in the query_short_channel_ids submitted by the client.
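
For clarity, a rough sketch of how the pruning decision changes under the first item; the names are illustrative and, as noted earlier, the disabled-channel rule only applies with assumechannelvalid:

package sketch

import "time"

const chanExpiry = 14 * 24 * time.Hour

// chanState is a minimal stand-in for a channel's two edge policies.
type chanState struct {
	lastUpdate   time.Time // newest channel_update seen on either edge
	bothDisabled bool      // both directions carry the disabled bit
}

// shouldPrune decides whether a channel is moved to the zombie bucket.
// pruneDisabled mirrors today's assumechannelvalid behavior; the change
// tested here simply passes false, so disabled-but-fresh channels stay
// in the live graph until they actually expire.
func shouldPrune(c chanState, now time.Time, pruneDisabled bool) bool {
	if now.Sub(c.lastUpdate) > chanExpiry {
		return true
	}
	return pruneDisabled && c.bothDisabled
}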

The results were:
1. The first sync was a bit long, as it needed to fetch about half of the graph.
2. After the first sync, the ~15,000 channels moved from the zombie storage into the live graph, which made the graph a lot more up to date.
3. From that moment on, there were only ~1,500 channels the client doesn't know about that get included in the query_short_channel_ids of the historical sync; these are the channels that are real zombies on the hub as well.
4. The only additional overhead for the next historical sync was including these ~1,500 IDs in the query_short_channel_ids, as the hub didn't respond with updates/announcements for them due to their zombie status.

So it looks like the lifecycle of a channel in a light node's graph after these changes is:

  1. Start by entering the live graph view.
  2. Get pruned into the zombie bucket on expiration, and then either:
    1. return to the live graph when a newer update arrives, or
    2. stay in the zombie bucket as a real closed channel.

Given that, we can safely delete the zombie channels periodically on the light node, as a maintenance procedure that gets rid of real closed channels.
I think these changes maintain a more consistent view of the graph for light nodes at a relatively small price.
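
A small sketch of such a maintenance procedure, again with illustrative names; the weekly interval is an arbitrary example, not a recommendation:

package sketch

import "time"

// zombieGCInterval is an arbitrary example interval for the sweep.
const zombieGCInterval = 7 * 24 * time.Hour

// zombieIndex is a stand-in for the graph database's zombie bucket.
type zombieIndex interface {
	DeleteAllZombies() error
}

// runZombieGC periodically clears the zombie bucket so real closed
// channels don't pile up forever; any live channel that was wrongly
// parked there is simply re-fetched on the next historical sync.
func runZombieGC(db zombieIndex, quit <-chan struct{}) {
	ticker := time.NewTicker(zombieGCInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Error handling/logging omitted in this sketch.
			_ = db.DeleteAllZombies()
		case <-quit:
			return
		}
	}
}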

Roei