Gathering statistics about TLS failure (Python)

Russ Allbery

May 24, 2018, 10:32:08 PM
to grp...@googlegroups.com
Hi all,

I want to gather statistics on how many failed TLS handshakes are
happening across a large gRPC deployment. (The motivation is a long
story, but basically I want to be able to trigger alerts if something goes
wrong with the PKI that's generating the client and server certificates.)
Just a count of failed TLS handshakes would be sufficient, although if I
can get at more detailed errors, that might be helpful. I see how to do
this in the Go gRPC code, but also need the same support for Python.

I'd mildly prefer to capture this information on the server, although it
may be adequate to capture it on the client if that's easier to do. (I
only need either the server or the client, not both.)

Could someone point me in the right direction? I can take a cut at
implementing this as a general feature suitable for a pull request, but
I'm not sure of the best approach to use given the code structure.

What I've been able to determine (I think) so far:

* Actual success or failure of the TLS handshakes is determined (for both
client and server) in src/core/tsi/ssl_transport_security.cc.

* For the server, these errors pass up to on_handshake_done in
src/core/ext/transport/chttp2/server/chttp2_server.cc, where they're
logged and discarded and the channel is closed. This seems to be below
the layer where the Python bindings have any visibility. (In other
words, so far as I could determine, no Python code ever sees the error
from an attempted connection that results in a failed TLS handshake.)

* For the client, it looks like (?) these errors trigger the notify
closure from on_handshake_done in
src/core/ext/transport/chttp2/client/chttp2_connector.cc. That closure
seems to be on_subchannel_connected in
core/ext/filters/client_channel/subchannel.cc, but the exact contents of
the error don't appear to go any farther than that function.

Therefore, unless I'm mistaken, these errors are swallowed inside the core
code in places where I can't get visibility into them from Python: on the
server they disappear entirely (except for a log message), and in the
client channels they turn into a generic connection state. So there seems
to be some plumbing or a hook missing that would let these failures
bubble up to a level where I can get at them and send them to monitoring
code.
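
To make the client-side limitation concrete, here is a minimal sketch of
what can be observed from Python today, assuming the connectivity callback
registered through the channel's subscribe() method in grpcio. The names
ConnectivityStats and watch are hypothetical helpers, not part of gRPC.
Because the state is generic, the TRANSIENT_FAILURE count covers every kind
of connection failure (refused connections, DNS errors, and so on), not only
failed TLS handshakes, so it is at best a rough proxy.

```python
import threading
from collections import Counter


class ConnectivityStats:
    """Counts connectivity states reported by a gRPC channel (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def on_state_change(self, state):
        # grpc.ChannelConnectivity values are enum members; reading .name
        # keeps this class free of a direct grpc import.
        name = getattr(state, "name", str(state))
        with self._lock:
            self._counts[name] += 1

    def failures(self):
        # TRANSIENT_FAILURE is generic: it is reported for any failed
        # connection attempt, not specifically for a failed TLS handshake.
        with self._lock:
            return self._counts["TRANSIENT_FAILURE"]


def watch(channel, stats):
    # Assumes a grpc.Channel; try_to_connect=True asks the channel to start
    # connecting instead of waiting for the first RPC.
    channel.subscribe(stats.on_state_change, try_to_connect=True)


# Usage (assumes grpcio is installed; the target below is a placeholder):
#   import grpc
#   channel = grpc.secure_channel("server.example.com:443",
#                                 grpc.ssl_channel_credentials())
#   stats = ConnectivityStats()
#   watch(channel, stats)
#   ... later, export stats.failures() to monitoring ...
```

This only helps on the client; nothing comparable is visible on the server,
which is where the missing hook matters most.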

--
Russ Allbery (ea...@eyrie.org) <http://www.eyrie.org/~eagle/>

Carl Mastrangelo

May 25, 2018, 5:28:58 PM
to grpc.io
At least in the Java world, connections are treated as a queue of handlers, with TLS near the network side. Any TLS failures are not surfaced (or not easily, anyway) since they never make it up to the gRPC layer. Gathering these stats server-side would require some code changes. I know you said you are using Python, but I think every language would benefit from knowing about such failures.

Russ Allbery

May 25, 2018, 9:16:12 PM
to 'Carl Mastrangelo' via grpc.io, Carl Mastrangelo
"'Carl Mastrangelo' via grpc.io" <grp...@googlegroups.com> writes:

> At least in the Java world, connections are treated as a queue of
> handlers, with TLS near the network side. Any TLS failures are not
> surfaced (or not easily, anyway) since they never make it up to the gRPC
> layer. Gathering these stats server-side would require some code changes.
> I know you said you are using Python, but I think every language would
> benefit from knowing about such failures.

Yeah, agreed, it just so happens that my specific problem is with Python
and Go. Go is relatively straightforward to implement without making
changes to gRPC because (for various other reasons) we're already hooking
into the TLS negotiation and that's exposed fairly well with a few
strategic proxy objects. But the C/C++ native code is a bit less amenable
to that.

I assume client-side poses similar challenges? Or is it easier to hook
into?

Carl Mastrangelo

May 25, 2018, 9:19:13 PM
to ea...@eyrie.org, Jiangtao Li, Vijay Pai, grp...@googlegroups.com
+Jiangtao and +Vijay: would it be feasible to expose TLS failures in the gRPC core stack somehow, even if they aren't recoverable?

Russ Allbery

Jun 4, 2018, 4:02:58 PM
to 'Carl Mastrangelo' via grpc.io, Jiangtao Li, Vijay Pai, Carl Mastrangelo
Friendly ping on this?

Ken Payson

Jun 13, 2018, 1:36:40 PM
to grpc.io
It definitely should be possible to bubble handshake failure counts up to the core surface API, but there might be a non-trivial amount of work involved.

Feel free to open a feature request on https://github.com/grpc/grpc.