Hi Alex,
Thanks for starting this thread!
Just to provide some wider context to the list:
* Today the cold node connects to the signer (hot) node.
* We're looking to support a model where the signer node connects to the
cold node.
* This is useful, as it then allows things like mobile nodes to act as the
  signer and connect out to the watch-only node. The reverse isn't always
  possible, as the mobile node doesn't have a public listening port.
Re that last point, it's also possible to leverage an LNC-like connection to
still have the cold node connect to the mobile node. However, for this to
work well things need to be super snappy, which argues for the mobile node
connecting out to the cold node, as once it's up, it can immediately
initiate the handshake.
As we pursue this line of thinking, we'll also want to examine to what
extent we can _speed up_ the restart time of a node. Today with bbolt,
things can take some time, but with sqlite things are _much_ faster, and
start-up time is basically instant. With that out of the way, the other
aspect to examine would be how long it takes to establish new persistent
connections to peers.
> For the initial implementation, the watch-only node would wait for the
> signer connection in order to start operation, and shut down/"crash"
> (similar to now), or at least drop all peer connections, when the signer
> is disconnected.
Yeah, this is the model I had in mind. IMO it's much simpler for the daemon
to simply shut down or go into a "safe mode" once the remote signer
connection is dropped. Ideally, in this restricted mode, the main daemon can
also respond to some basic RPCs, like `GetInfo` or `WalletBalance`, so a
monitoring tool (or w/e) with a restricted macaroon can still check in on
the node.
So at a high level, any time the remote signer connection is dropped, the
main daemon needs to _immediately_ drop all active connections and revert to
this same mode.
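As a sketch of how that restricted mode could be enforced at the RPC layer
(the `safeMode` flag, the whitelist, and the interceptor below are all
hypothetical, not existing lnd code):

```go
package main

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// safeMode flips to true the moment the remote signer connection drops.
var safeMode atomic.Bool

// safeModeRPCs is a hypothetical whitelist of read-only RPCs that stay
// usable without a signer, e.g. for a monitoring tool holding a restricted
// macaroon.
var safeModeRPCs = map[string]struct{}{
	"/lnrpc.Lightning/GetInfo":       {},
	"/lnrpc.Lightning/WalletBalance": {},
}

// safeModeInterceptor fails every call outside the whitelist while the
// signer is unreachable.
func safeModeInterceptor(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler) (interface{}, error) {

	if safeMode.Load() {
		if _, ok := safeModeRPCs[info.FullMethod]; !ok {
			return nil, status.Error(codes.Unavailable,
				"remote signer offline: node is in safe mode")
		}
	}

	return handler(ctx, req)
}
```

Wired in via `grpc.UnaryInterceptor(safeModeInterceptor)` (plus a streaming
twin), that gives the "respond to `GetInfo`, reject everything else"
behavior without touching each individual handler.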
> This would let the node run in a "safe mode" to participate in gossip
> without a signer, and allow the signer to connect/disconnect to the node
> for receiving/sending/routing money as needed.
Yep, I had that same thought: it's technically _possible_ for it to retain
the brontide p2p connections, as after the initial handshake (auth), the
connection object has the shared secret, so it can continue to
encrypt/decrypt messages.
Related to the comment above about snappy restarts: if the daemon is able to
hang on to a few "gossip only" connections while the signer is down, then it
can ensure that by the time the signer is back, we've synced all the latest
gossip.
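To make that concrete: brontide (BOLT 8's Noise_XK transport) only needs the
node's long-term key during the handshake; afterwards each side just holds
symmetric ChaCha20-Poly1305 state. A simplified, hypothetical model of that
post-handshake state (not lnd's actual brontide types):

```go
package main

import (
	"encoding/binary"

	"golang.org/x/crypto/chacha20poly1305"
)

// gossipConn models what a connection holds once the Noise handshake
// completes: a symmetric key and its nonce counter. Note there is no
// private key here -- that lives on the signer, and is only needed to
// authenticate brand new handshakes.
type gossipConn struct {
	sendKey   []byte // 32-byte key derived from the handshake
	sendNonce uint64 // per-message counter, mixed into the AEAD nonce
}

// encrypt seals the next outbound message using only the symmetric state,
// mirroring how an authenticated p2p connection can keep flowing while the
// signer is offline.
func (c *gossipConn) encrypt(msg []byte) ([]byte, error) {
	aead, err := chacha20poly1305.New(c.sendKey)
	if err != nil {
		return nil, err
	}

	// 96-bit nonce with a 64-bit little-endian counter, as in the
	// BOLT 8 transport.
	var nonce [chacha20poly1305.NonceSize]byte
	binary.LittleEndian.PutUint64(nonce[4:], c.sendNonce)
	c.sendNonce++

	return aead.Seal(nil, nonce[:], msg, nil), nil
}
```

Since no private key appears in that struct, a "gossip only" connection can
keep encrypting/decrypting messages while the signer is offline.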
In terms of architecture, the way things work today is that:
* The signer node implements the Signer gRPC service.
* On start-up, the cold node connects to the signer node.
* If it can't reach the signer node, then it just crashes.
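(For reference, that outbound connection is what the watch-only node's
`remotesigner` options configure today; roughly, with placeholder
host/paths:)

```
[remotesigner]
remotesigner.enable=true
remotesigner.rpchost=signer.example.com:10009
remotesigner.macaroonpath=/path/to/signer.custom.macaroon
remotesigner.tlscertpath=/path/to/signer/tls.cert
```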
What I think we want is instead something like:
* On start-up, the cold node connects out to what it _thinks_ is the
  signer node.
* This is instead a gRPC proxy that'll wait until the signer node
  makes an inbound connection, and will then proxy the messages back and
  forth (basically an io.Copy, but for gRPC; see the sketch after this
  list).
* The cold node inherits a global context.Context that's linked to the
actual inbound gRPC connection. If this is ever cancelled, then things
go back to that safe mode.
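To make the "io.Copy, but for gRPC" part concrete, here's a minimal sketch
of the forwarding core, using the usual transparent-proxy trick of a
passthrough codec so the proxy shuttles opaque frames and never needs the
Signer protos (all names below are hypothetical):

```go
package main

// rawFrame carries an undecoded gRPC message: with a passthrough codec
// installed, the proxy moves bytes and never parses them.
type rawFrame struct {
	payload []byte
}

// rawCodec satisfies gRPC's encoding.Codec interface, but just moves bytes.
type rawCodec struct{}

func (rawCodec) Marshal(v interface{}) ([]byte, error) {
	return v.(*rawFrame).payload, nil
}

func (rawCodec) Unmarshal(data []byte, v interface{}) error {
	v.(*rawFrame).payload = data
	return nil
}

func (rawCodec) Name() string { return "raw" }

// grpcStream is the subset of grpc.ServerStream/grpc.ClientStream the
// proxy needs.
type grpcStream interface {
	SendMsg(m interface{}) error
	RecvMsg(m interface{}) error
}

// forward pumps frames from src to dst until src is exhausted -- the
// io.Copy analogue. Run it once in each direction to proxy a full
// bi-directional stream between the cold node and the signer.
func forward(src, dst grpcStream) error {
	for {
		frame := &rawFrame{}
		if err := src.RecvMsg(frame); err != nil {
			return err
		}
		if err := dst.SendMsg(frame); err != nil {
			return err
		}
	}
}
```

The remaining piece is pairing the two legs: the cold node's calls can land
in a `grpc.UnknownServiceHandler`, while the signer's inbound leg would
likely be a long-lived bi-directional stream the proxy multiplexes onto;
cancellation of that inbound leg is what would trip the global
`context.Context` above.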
This should allow for minimal-ish changes, as the behavior of the cold node
is more or less the same. It _thinks_ it has an actual connection, but it
should hit some sort of health check endpoint to ensure the signer is
actually there before it tries to do anything that requires it.
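If that probe uses the standard gRPC health checking protocol, the cold
node's side could be as small as this (the `signrpc.Signer` service name is
just an assumed placeholder):

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// signerHealthy probes the (possibly proxied) signer connection before any
// operation that actually needs a signature.
func signerHealthy(ctx context.Context, conn *grpc.ClientConn) bool {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()

	resp, err := healthpb.NewHealthClient(conn).Check(
		ctx, &healthpb.HealthCheckRequest{Service: "signrpc.Signer"},
	)
	if err != nil {
		return false
	}

	return resp.Status == healthpb.HealthCheckResponse_SERVING
}
```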
> we believe the signer->watch-only node connectivity method is more secure
> than the current method.
That's really interesting; can you elaborate on the security model that led
to that conclusion? Is it that the signer node then doesn't actually need a
listening port?
-- Laolu