Troubleshooting SSL connection failures


Aaron Seet

Jul 18, 2013, 11:00:33 AM
to nod...@googlegroups.com
We run Node.js on Windows Azure as a persistent websocket endpoint (with Sock.js) for clients. Client messages from the websocket channel are routed as regular HTTP requests to backend services, with responses and other notifications going back up the same channel. After about a year of running this way, we began switching last month to SSL as the main transport between the tiers.

Only after moving to SSL/WSS did we start to see occasional periods during which Google Chrome could not maintain the websocket connection, complaining "Received a broken close frame containing a reserved status code". The situation would correct itself within a few minutes, without any action on our part. After consulting the Sock.js folks,


When it happened again, I went through the process of decrypting the SSL traffic with Wireshark and found that the websocket server is indeed sending the forbidden status code 1006 back to the client (RFC 6455 states it is meant for local status reporting only, and must not be sent across the wire). However, it is not known which layer of code is responsible for sending such a frame, or how to rectify this behaviour.
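
To make the rule concrete, here is a minimal sketch of a check for whether a close code may legally appear in a Close frame. The helper name `isSendableCloseCode` is hypothetical and not part of Sock.js or Node; the code list follows RFC 6455 section 7.4.1, and 1004 is treated as unsendable on the assumption that a reserved code with no defined meaning should not be emitted either.

```javascript
// Codes that RFC 6455 (section 7.4.1) reserves: 1005, 1006 and 1015 are for
// local reporting only and must not be set in a Close frame by an endpoint.
// 1004 is reserved with no defined meaning (treated as unsendable here).
const RESERVED_CLOSE_CODES = new Set([1004, 1005, 1006, 1015]);

// Hypothetical helper: true if `code` may be put on the wire.
function isSendableCloseCode(code) {
  if (!Number.isInteger(code)) return false;
  if (RESERVED_CLOSE_CODES.has(code)) return false;
  // 1000-2999 is the protocol range, 3000-4999 is for libraries and applications.
  return code >= 1000 && code <= 4999;
}

console.log(isSendableCloseCode(1006)); // the code seen in the capture
```

A guard like this could sit in front of whatever layer writes the Close frame, to log a stack trace when something tries to send 1006.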

The interesting thing was that after I told my colleague to revert to plain http for talking to the websocket server, he could establish the socket connection but got errors reported from the socket layer, because the websocket server itself couldn't communicate with the backend. Our websocket application log did indeed show trouble for our server acting as a client to the backend. So not only is client-to-websocket SSL failing; websocket-to-backend SSL is failing too.

2013-07-11 01:36:19.497 ERROR socket - [Error: 1744:error:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown ca:openssl\ssl\s3_pkt.c:1234:SSL alert number 48
]
Error: 1744:error:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown ca:openssl\ssl\s3_pkt.c:1234:SSL alert number 48
at CleartextStream._pusher (tls.js:656:24)
at SlabBuffer.use (tls.js:199:18)
at CleartextStream.CryptoStream._push (tls.js:483:33)
at SecurePair.cycle (tls.js:880:20)
at EncryptedStream.CryptoStream.write (tls.js:267:13)
at Socket.ondata (stream.js:38:26)
at Socket.EventEmitter.emit (events.js:96:17)
at TCP.onread (net.js:397:14)


There were also well over thirty TCP sockets to port 443 on the backend left pending:

TCP websocket_ip:56168 backend_ip:443 FIN_WAIT_2
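
For tracking how that list grows over time, a rough sketch that tallies connection states from netstat-style text. It assumes the four-column layout shown above (protocol, local address, remote address, state); the function name is made up for illustration.

```javascript
// Hypothetical helper: count TCP states for connections whose remote
// endpoint uses `remotePort`, given text laid out like `netstat -an` output.
function countStatesForRemotePort(netstatText, remotePort) {
  const counts = {};
  for (const line of netstatText.split(/\r?\n/)) {
    const fields = line.trim().split(/\s+/);
    // Expect: protocol, local address, remote address, state.
    if (fields.length < 4 || fields[0] !== 'TCP') continue;
    const remote = fields[2];
    const state = fields[3];
    if (!remote.endsWith(':' + remotePort)) continue;
    counts[state] = (counts[state] || 0) + 1;
  }
  return counts;
}
```

Logging the result periodically would show whether the FIN_WAIT_2 count climbs steadily before the SSL errors start.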

I don't have enough experience with the networking protocols to piece together a picture of the problem, let alone a solution. Has anybody seen a similar situation before?


thanks,
Aaron

Aaron Seet

Aug 5, 2013, 11:12:53 AM
to nod...@googlegroups.com
Has anybody induced such a stack trace before?

What I think could be a contributor to the behaviour is that https appears to have its own agent socket limit, separate from http. We had a similar problem last year with http (although the symptoms were different) that was resolved by raising maxSockets. Since making the same adjustment for https, we have not encountered the problem so far.

http.globalAgent.maxSockets = 20000;
https.globalAgent.maxSockets = 20000;
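
An alternative to raising the limit on the global agents is a dedicated agent for backend calls, so the limit applies only to that traffic. This is a sketch under assumptions: `backendAgent` and `backendRequestOptions` are illustrative names, and the limit simply reuses the 20000 value above.

```javascript
const https = require('https');

// Dedicated agent for backend calls; 20000 mirrors the value set on the
// global agents above and is not a recommendation.
const backendAgent = new https.Agent({ maxSockets: 20000 });

// Hypothetical helper: request options bound to the dedicated agent.
function backendRequestOptions(host, path) {
  return {
    host: host,
    port: 443,
    path: path,
    method: 'GET',
    agent: backendAgent
  };
}
```

The returned object can be passed to `https.request`, leaving `https.globalAgent` untouched for any other outbound requests.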


thanks,
Aaron

Aaron Seet

Aug 30, 2013, 5:06:35 AM
to nod...@googlegroups.com
Unfortunately, that only appeared to have delayed the problem; after prolonged usage, the error has occurred again.

Error: 1240:error:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown ca:openssl\ssl\s3_pkt.c:1234:SSL alert number 48
at CleartextStream._pusher (tls.js:656:24)
at SlabBuffer.use (tls.js:199:18)
at CleartextStream.CryptoStream._push (tls.js:483:33)
at SecurePair.cycle (tls.js:880:20)
at EncryptedStream.CryptoStream.write (tls.js:267:13)
at Socket.ondata (stream.js:38:26)
at Socket.EventEmitter.emit (events.js:96:17)
at TCP.onread (net.js:397:14)


Interestingly, this is only particular to one of the backend servers. There is no https communication error with other backend server endpoints, despite the long list of pending FIN_WAIT_2 sockets.

:-/

Aaron

Ben Noordhuis

Aug 30, 2013, 6:13:46 AM
to nod...@googlegroups.com
On Fri, Aug 30, 2013 at 11:06 AM, Aaron Seet <ice...@gmail.com> wrote:
> Unfortunately, that only appeared to have delayed the problem; after
> prolonged usage, the error has occurred again.
>
> Error: 1240:error:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown
> ca:openssl\ssl\s3_pkt.c:1234:SSL alert number 48
>
> at CleartextStream._pusher (tls.js:656:24)
> at SlabBuffer.use (tls.js:199:18)
> at CleartextStream.CryptoStream._push (tls.js:483:33)
> at SecurePair.cycle (tls.js:880:20)
> at EncryptedStream.CryptoStream.write (tls.js:267:13)
> at Socket.ondata (stream.js:38:26)
> at Socket.EventEmitter.emit (events.js:96:17)
> at TCP.onread (net.js:397:14)
>
>
> Interestingly, this is only particular to one of the backend servers. There
> is no https communication error with other backend server endpoints, despite
> the long list of pending FIN_WAIT_2 sockets.
>
> :-/
>
> Aaron

I can't tell you what exactly the issue is but maybe I can point you
in the right direction. Apologies if I'm not telling you anything you
didn't already know.

That 'SSL alert number 48' error message is sent by the upstream
server. It suggests that you are using client SSL certificates for
authorization. The server is rejecting it because it doesn't know the
CA, the certificate authority that signed the client certificate.
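
As an illustration of reading that number out of the error text, a small sketch that extracts the alert number from an OpenSSL-style message and maps it to its TLS name. The helper `describeTlsAlert` is hypothetical, and the table deliberately lists only a few certificate-related alerts from the TLS specification.

```javascript
// A few certificate-related TLS alert descriptions (RFC 5246, section 7.2).
const TLS_ALERTS = {
  42: 'bad_certificate',
  43: 'unsupported_certificate',
  44: 'certificate_revoked',
  45: 'certificate_expired',
  46: 'certificate_unknown',
  48: 'unknown_ca'
};

// Hypothetical helper: pull "SSL alert number N" out of an error message.
// Returns null when the message carries no alert number.
function describeTlsAlert(message) {
  const match = /SSL alert number (\d+)/.exec(String(message));
  if (!match) return null;
  const number = Number(match[1]);
  return { number: number, name: TLS_ALERTS[number] || 'unlisted' };
}
```

Feeding it the `message` of the error from the log would yield number 48 with the name `unknown_ca`.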

You mention it only happens with one server instance. That suggests
that it has a CA certificate store that is different from the others.
If you are using an in-house CA certificate, it's plausible that you
forgot to add it to that instance's certificate store.

Try connecting with `openssl s_client -cert <filename> -connect
<host>:<port>` and see what happens. Note that s_client only
supports certificates in DER and PEM format. If your certificate is
in PKCS#12 format, you can either export it with `openssl pkcs12` or
use the MS equivalent of `openssl s_client`.

Aaron Seet

Aug 30, 2013, 8:12:08 AM
to nod...@googlegroups.com
Mmm interesting. We do not use client certificates; I'd be puzzled if that was involved. If there were a problem in the setup, I would also expect it to fail all the time, but it only happens after a random extended period of uptime. Also, the last few times it happened it was the other way round: the other backend server was the one choking up.

What we have is a wildcard cert *.cloudapp.net deployed to all our cloud services in Windows Azure (which are named as cloudservicename.cloudapp.net). cloudapp.net itself actually belongs to Microsoft, so we generated that wildcard cert ourselves, signed with our own development CA. The cert is there for server identification and to enable https traffic, so client identification is not involved.
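
For reference, one way a Node client can be told to trust a privately signed server certificate is the `ca` request option. This is only a sketch of that option and not necessarily how our deployment is wired; the helper name and the idea of passing the CA as PEM text are assumptions for illustration.

```javascript
// Hypothetical helper: https request options that trust a private CA.
// `caPem` is assumed to be the PEM text of the development CA certificate.
function optionsTrustingPrivateCa(host, caPem) {
  return {
    host: host,
    port: 443,
    ca: [caPem],              // extra trust anchor for verifying the server
    rejectUnauthorized: true  // fail the request if the chain does not verify
  };
}
```

Options like these only affect how the client verifies the server; they do not make the client present a certificate of its own.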

Everything actually works fine, until the problem occurs after a long while (the stretch this time has been almost a month). Restarting the node.exe process (and all the TCP sockets along with it) will "solve" the problem. But that is of course not a real solution.


thanks,
Aaron