Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

OpenSSL bug?

68 views
Skip to first unread message

Michael Gale

unread,
May 9, 2011, 12:18:31 PM5/9/11
to
Hello,

    I am experiencing an SSL bug however I am not able to determine if the issue is on the Python SSL module side or the OpenSSL side.

I am using Python 2.7.1 and OpenSSL 0.9.8b (CentOS / RedHat) and the Python server is using non-blocking sockets. 

The following traceback is found when certain connections are made, this happens regularly however I am having trouble finding the exact client and cause, the clients are connecting over a satellite link. Here is the output gathered from an strace:

write(2, "ERROR:root:Exception in I/O handler for fd 9533
Traceback (most recent call last):
  File \"/hub/switchboard/lib/ioloop.py\", line 253, in start
    self._handlers[fd](fd, events)
  File \"/hub/switchboard/lib/stack_context.py\", line 127, in wrapped
    callback(*args, **kwargs)
  File \"/hub/switchboard/lib/iostream.py\", line 168, in _handle_events
    self._handle_read()
  File \"/hub/switchboard/lib/iostream.py\", line 360, in _handle_read
    super(SSLIOStream, self)._handle_read()
  File \"/hub/switchboard/lib/iostream.py\", line 210, in _handle_read
    chunk = self.socket.recv(self.read_chunk_size)
  File \"/hub/apps/Python-2.7/lib/python2.7/ssl.py\", line 217, in recv
    return self.read(buflen)
  File \"/hub/apps/Python-2.7/lib/python2.7/ssl.py\", line 138, in read
    return self._sslobj.read(len)
SSLError: [Errno 1] _ssl.c:1347: error:14095044:SSL routines:SSL3_READ_N:internal error\n", 896) = 896

If this is an SSL error can you provide some information as to a possible cause? That way I can try and ensure our app handles this case.

FYI - Sending this again as I do not see my first post.

Thanks
Michael

Michael Gale

unread,
May 9, 2011, 3:15:45 PM5/9/11
to
Hello,

    I believe we are triggering a bug in s3_pkt.c in OpenSSL 0.9.8b:

--snip--
 171         if (n > max) /* does not happen */
 172                 {
 173                 SSLerr(SSL_F_SSL3_READ_N,ERR_R_INTERNAL_ERROR);
 174                 return -1;
 175                 }
--snip--

I checked OpenSSL 1.0.0d release and the same comment is there "does not happen". If it does not happen why is it being checked? I am trying to determine under what conditions this would occur so I can determine why it is occurring with our SSL service.

Any insight would be appreciated.

Thanks
Michael

--snip-- Whole function below from s3_pkt.c
 123 int ssl3_read_n(SSL *s, int n, int max, int extend)
 124         {
 125         /* If extend == 0, obtain new n-byte packet; if extend == 1, increase
 126          * packet by another n bytes.
 127          * The packet will be in the sub-array of s->s3->rbuf.buf specified
 128          * by s->packet and s->packet_length.
 129          * (If s->read_ahead is set, 'max' bytes may be stored in rbuf
 130          * [plus s->packet_length bytes if extend == 1].)
 131          */
 132         int i,off,newb;
 133 
 134         if (!extend)
 135                 {
 136                 /* start with empty packet ... */
 137                 if (s->s3->rbuf.left == 0)
 138                         s->s3->rbuf.offset = 0;
 139                 s->packet = s->s3->rbuf.buf + s->s3->rbuf.offset;
 140                 s->packet_length = 0;
 141                 /* ... now we can act as if 'extend' was set */
 142                 }
 143 
 144         /* extend reads should not span multiple packets for DTLS */
 145         if ( SSL_version(s) == DTLS1_VERSION &&
 146                 extend)
 147                 {
 148                 if ( s->s3->rbuf.left > 0 && n > s->s3->rbuf.left)
 149                         n = s->s3->rbuf.left;
 150                 }
 151 
 152         /* if there is enough in the buffer from a previous read, take some */
 153         if (s->s3->rbuf.left >= (int)n)
 154                 {
 155                 s->packet_length+=n;
 156                 s->s3->rbuf.left-=n;
 157                 s->s3->rbuf.offset+=n;
 158                 return(n);
 159                 }
 160 
 161         /* else we need to read more data */
 162         if (!s->read_ahead)
 163                 max=n;
 164 
 165         {
 166                 /* avoid buffer overflow */
 167                 int max_max = s->s3->rbuf.len - s->packet_length;
 168                 if (max > max_max)
 169                         max = max_max;
 170         }
 171         if (n > max) /* does not happen */
 172                 {
 173                 SSLerr(SSL_F_SSL3_READ_N,ERR_R_INTERNAL_ERROR);
 174                 return -1;
 175                 }
 176 
 177         off = s->packet_length;
 178         newb = s->s3->rbuf.left;
 179         /* Move any available bytes to front of buffer:
 180          * 'off' bytes already pointed to by 'packet',
 181          * 'newb' extra ones at the end */
 182         if (s->packet != s->s3->rbuf.buf)
 183                 {
 184                 /*  off > 0 */
 185                 memmove(s->s3->rbuf.buf, s->packet, off+newb);
 186                 s->packet = s->s3->rbuf.buf;
 187                 }
 188 
 189         while (newb < n)
 190                 {
 191                 /* Now we have off+newb bytes at the front of s->s3->rbuf.buf and need
 192                  * to read in more until we have off+n (up to off+max if possible) */
 193 
 194                 clear_sys_error();
 195                 if (s->rbio != NULL)
 196                         {
 197                         s->rwstate=SSL_READING;
 198                         i=BIO_read(s->rbio,     &(s->s3->rbuf.buf[off+newb]), max-newb);
 199                         }
 200                 else
 201                         {
 202                         SSLerr(SSL_F_SSL3_READ_N,SSL_R_READ_BIO_NOT_SET);
 203                         i = -1;
 204                         }
 205 
 206                 if (i <= 0)
 207                         {
 208                         s->s3->rbuf.left = newb;
 209                         return(i);
 210                         }
 211                 newb+=i;
 212                 }
 213 
 214         /* done reading, now the book-keeping */
 215         s->s3->rbuf.offset = off + n;
 216         s->s3->rbuf.left = newb - n;
 217         s->packet_length += n;
 218         s->rwstate=SSL_NOTHING;
 219         return(n);
 220         }
 221 
--

Ecclesiastes 1:18

 18 For with much wisdom comes much sorrow; 
   the more knowledge, the more grief.


Ger Hobbelt

unread,
May 10, 2011, 7:06:17 AM5/10/11
to
On Mon, May 9, 2011 at 9:15 PM, Michael Gale <gale.m...@gmail.com> wrote:
I checked OpenSSL 1.0.0d release and the same comment is there "does not happen". If it does not happen why is it being checked?

Might be more correctly stated as '/* _should_ not happen */' - see the code: it's a basic sanity check to ensure the BIO_read() doesn't overrun the rx buffer, which would cause arbitrary memory corruption if the BIO_read() were allowed to do that. The error is keeping you from worse (and even harder to debug) things happening to you.

It is unclear (assuming the stack trace you provided only lists the python side of things) who invoked ssl3_read_n() and where the error originates: it may either be an input argument error to ssl3_read_n(): ..., int n, int max, ... where n>max from the start of the function, or due to a 'max' fixup inside ssl3_read_n() combined with yet unknown context conditions.

Incidentally, did you grep whether ssl3_read_n() is invoked directly from the Python glue (just to be sure only the 'usual suspects' inside openSSL itself may invoke ssl3_read_n(); it is marked as an internal-use-only function (prototype sits in ssl_locl.h)

I haven't seen this error before, but what might help you in improving diagnosis, assuming you can build your own openSSL (+ python glue code ?), is adding a bit of info to the error report plus activating the openssl assertions in the build. Then you can add these lines (typed off the top of my head, so reckon with the compiler yakking due to typos by me) to see where the issue 'starts' within the function:


--snip-- Whole function below from s3_pkt.c
 132         int i,off,newb;
 133 

    OPENSSL_assert(n <= max); /* turn on assertions in build to have this one fire on incorrect input */

alternative code, which might be more useful and doesn't require assertions enabled to fire anyway when this 'should not happen' occasion happens anyway:

         if (n > max) /* should not happen */
                {
                char errbuf[80];
                SSLerr(SSL_F_SSL3_READ_N,ERR_R_INTERNAL_ERROR);
                sprintf(errbuf, ' @ start of ssl3_read_n: n = %d, max = %d", n, max);
                ERR_add_error_data(1, errbuf);
                return -1;
                }

The ERR_add_error_data() call accepts a series of strings which are appended as extra data to the error report; very handy to transport debug info out of openSSL. Note that it must be invoked AFTER the SSLerr() pushed the error on the error stack.
 
 134         if (!extend)
[...]
 165         {
 166                 /* avoid buffer overflow */
 167                 int max_max = s->s3->rbuf.len - s->packet_length;
 168                 if (max > max_max)
 169                         max = max_max;
 170         }
 171         if (n > max) /* does not happen */
 172                 {

char errbuf[120];  /* and here, it might be useful to have a look at the rbuf, as by now we 'know' this will only fire when max was reduced above */
 
 173                 SSLerr(SSL_F_SSL3_READ_N,ERR_R_INTERNAL_ERROR);
 
                sprintf(errbuf, ' @ line %d in ssl3_read_n: n = %d, max = %d, rbuf.len = %d, packet_length = %d", n, max, (int)s->s3->rbuf.len, (int)s->packet_length);
                ERR_add_error_data(1, errbuf);
 
 174                 return -1;
 175                 }


--
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

Dave Thompson

unread,
May 10, 2011, 5:39:22 PM5/10/11
to
 


From: owner-op...@openssl.org [mailto:owner-op...@openssl.org] On Behalf Of Ger Hobbelt
Sent: Tuesday, 10 May, 2011 07:06
On Mon, May 9, 2011 at 9:15 PM, Michael Gale <gale.m...@gmail.com> wrote:
I checked OpenSSL 1.0.0d release and the same comment is there "does not happen". If it does not happen why is it being checked?
Same comment, and error code, but slightly different logic leading to it in 1.0.0.

Might be more correctly stated as '/* _should_ not happen */' - see the code: it's a basic sanity check to ensure the BIO_read() doesn't overrun the rx buffer, which would cause arbitrary memory corruption if the BIO_read() were allowed to do that. The error is keeping you from worse (and even harder to debug) things happening to you.
Agree. 
<snip> 

I haven't seen this error before, but what might help you in improving diagnosis, assuming you can build your own openSSL (+ python glue code ?), is adding a bit of info to the error report plus activating the openssl assertions in the build. Then you can add these lines (typed off the top of my head, so reckon with the compiler yakking due to typos by me) to see where the issue 'starts' within the function: 
 
Or if you (OP) can run the affected program (AUIU a server to numerous clients)
using a build with symbols in an interactive debugger (probably gdb)
and set a breakpoint and look interactively. This is typically quicker
to set up, but not as convenient if the bug triggers numerous times.
<snip> 
 171         if (n > max) /* does not happen */
 172                 {

char errbuf[120];  /* and here, it might be useful to have a look at the rbuf, as by now we 'know' this will only fire when max was reduced above */
 
 173                 SSLerr(SSL_F_SSL3_READ_N,ERR_R_INTERNAL_ERROR);
 
                sprintf(errbuf, ' @ line %d in ssl3_read_n: n = %d, max = %d, rbuf.len = %d, packet_length = %d", n, max, (int)s->s3->rbuf.len, (int)s->packet_length);
                ERR_add_error_data(1, errbuf);
   
You don't have a value (like __LINE__)  for the "line %d", and you don't need it
because this error code (pkg/func/num) only occurs in one place.
Not to mention that adding debug code changes the linenumbers.
 
 

Michael Gale

unread,
May 10, 2011, 6:58:20 PM5/10/11
to
On Tue, May 10, 2011 at 5:06 AM, Ger Hobbelt <g...@hobbelt.com> wrote:
On Mon, May 9, 2011 at 9:15 PM, Michael Gale <gale.m...@gmail.com> wrote:
I checked OpenSSL 1.0.0d release and the same comment is there "does not happen". If it does not happen why is it being checked?
Might be more correctly stated as '/* _should_ not happen */' - see the code: it's a basic sanity check to ensure the BIO_read() doesn't overrun the rx buffer, which would cause arbitrary memory corruption if the BIO_read() were allowed to do that. The error is keeping you from worse (and even harder to debug) things happening to you.

It is unclear (assuming the stack trace you provided only lists the python side of things) who invoked ssl3_read_n() and where the error originates: it may either be an input argument error to ssl3_read_n(): ..., int n, int max, ... where n>max from the start of the function, or due to a 'max' fixup inside ssl3_read_n() combined with yet unknown context conditions.

Incidentally, did you grep whether ssl3_read_n() is invoked directly from the Python glue (just to be sure only the 'usual suspects' inside openSSL itself may invoke ssl3_read_n(); it is marked as an internal-use-only function (prototype sits in ssl_locl.h)

I haven't seen this error before, but what might help you in improving diagnosis, assuming you can build your own openSSL (+ python glue code ?), is adding a bit of info to the error report plus activating the openssl assertions in the build. Then you can add these lines (typed off the top of my head, so reckon with the compiler yakking due to typos by me) to see where the issue 'starts' within the function:
--snip-- Whole function below from s3_pkt.c
 132         int i,off,newb;
 133 

    OPENSSL_assert(n <= max); /* turn on assertions in build to have this one fire on incorrect input */

alternative code, which might be more useful and doesn't require assertions enabled to fire anyway when this 'should not happen' occasion happens anyway:

         if (n > max) /* should not happen */
                {
                char errbuf[80];
                SSLerr(SSL_F_SSL3_READ_N,ERR_R_INTERNAL_ERROR);
                sprintf(errbuf, ' @ start of ssl3_read_n: n = %d, max = %d", n, max);
                ERR_add_error_data(1, errbuf);
                return -1;
                }

The ERR_add_error_data() call accepts a series of strings which are appended as extra data to the error report; very handy to transport debug info out of openSSL. Note that it must be invoked AFTER the SSLerr() pushed the error on the error stack.
 
 134         if (!extend)
[...]
 165         {
 166                 /* avoid buffer overflow */
 167                 int max_max = s->s3->rbuf.len - s->packet_length;
 168                 if (max > max_max)
 169                         max = max_max;
 170         }
 171         if (n > max) /* does not happen */
 172                 {

char errbuf[120];  /* and here, it might be useful to have a look at the rbuf, as by now we 'know' this will only fire when max was reduced above */
 
 173                 SSLerr(SSL_F_SSL3_READ_N,ERR_R_INTERNAL_ERROR);
 
                sprintf(errbuf, ' @ line %d in ssl3_read_n: n = %d, max = %d, rbuf.len = %d, packet_length = %d", n, max, (int)s->s3->rbuf.len, (int)s->packet_length);
                ERR_add_error_data(1, errbuf);
 
 174                 return -1;
 175                 }


--
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------



Hey,

    First, thanks for the reply and feedback.

The section of the python code that invokes the function call is below, I highlighted line 138 which is where the exception is occurring according to the stack trace.

--snip-- Python-2.7/lib/python2.7/ssl.py
132     def read(self, len=1024):
133 
134         """Read up to LEN bytes and return them.
135         Return zero-length string on EOF."""
136 
137         try:
138             return self._sslobj.read(len)
139         except SSLError, x:
140             if x.args[0] == SSL_ERROR_EOF and self.suppress_ragged_eofs:
141                 return ''
142             else:
143                 raise
--snip--

I traced this to Python-2.7/Modules/_ssl.c, then to openssl-0.9.8b/ssl/ssl_lib.c where it calls SSL_read. So it does not look like it is calling the internal method.

We build our own version of Python from the python source because we make it into an RPM with a custom installation directory. I can also build a version of openSSL but getting a client to create this condition has proved difficult so far. 

We ran this new service on QA for several months with out an issue, last weekend we placed it into production and this issue occurred after about 5hrs of running. At first we thought how odd and restarted the service and the issue occurred again with in minutes. My belief is that it is caused by a client stuck in a broken state. 

However with so many connections going on at this time I am not in a position to trouble shoot it :(  ... it just forces us to roll back the deployment. 

I was really hoping that someone knows how a connection could get into this state and maybe had a test utility I could run against our service create the issue. So far my attempts are re-creating it on QA have failed :( 

Any ideas on how maybe I could create the issue?

I have added a bunch of source code below to try and layout the flow from Python to OpenSSL.

Michael

--snip-- openssl-0.9.8b/ssl/ssl_lib.c
 858 int SSL_read(SSL *s,void *buf,int num)
 859         {
 860         if (s->handshake_func == 0)
 861                 {
 862                 SSLerr(SSL_F_SSL_READ, SSL_R_UNINITIALIZED);
 863                 return -1;
 864                 }
 865 
 866         if (s->shutdown & SSL_RECEIVED_SHUTDOWN)
 867                 {
 868                 s->rwstate=SSL_NOTHING;
 869                 return(0);
 870                 }
 871         return(s->method->ssl_read(s,buf,num));
 872         }
--snip--

--snip-- Python-2.7/Modules/_ssl.c
1262 static PyObject *PySSL_SSLread(PySSLObject *self, PyObject *args)
1263 {
1264     PyObject *buf;
1265     int count = 0;
1266     int len = 1024;
1267     int sockstate;
1268     int err;
1269     int nonblocking;
1270 
1271     if (!PyArg_ParseTuple(args, "|i:read", &len))
1272         return NULL;
1273 
1274     if (!(buf = PyString_FromStringAndSize((char *) 0, len)))
1275         return NULL;
1276 
1277     /* just in case the blocking state of the socket has been changed */
1278     nonblocking = (self->Socket->sock_timeout >= 0.0);
1279     BIO_set_nbio(SSL_get_rbio(self->ssl), nonblocking);
1280     BIO_set_nbio(SSL_get_wbio(self->ssl), nonblocking);
1281 
1282     /* first check if there are bytes ready to be read */
1283     PySSL_BEGIN_ALLOW_THREADS
1284     count = SSL_pending(self->ssl);
1285     PySSL_END_ALLOW_THREADS
1286 
1287     if (!count) {
1288         sockstate = check_socket_and_wait_for_timeout(self->Socket, 0);
1289         if (sockstate == SOCKET_HAS_TIMED_OUT) {
1290             PyErr_SetString(PySSLErrorObject,
1291                             "The read operation timed out");
1292             Py_DECREF(buf);
1293             return NULL;
1294         } else if (sockstate == SOCKET_TOO_LARGE_FOR_SELECT) {
1295             PyErr_SetString(PySSLErrorObject,
1296                             "Underlying socket too large for select().");
1297             Py_DECREF(buf);
1298             return NULL;
1299         } else if (sockstate == SOCKET_HAS_BEEN_CLOSED) {
1300             if (SSL_get_shutdown(self->ssl) !=
1301                 SSL_RECEIVED_SHUTDOWN)
1302             {
1303                 Py_DECREF(buf);
1304                 PyErr_SetString(PySSLErrorObject,
1305                                 "Socket closed without SSL shutdown handshake");
1306                 return NULL;
1307             } else {
1308                 /* should contain a zero-length string */
1309                 _PyString_Resize(&buf, 0);
1310                 return buf;
1311             }
1312         }
1313     }
1314     do {
1315         PySSL_BEGIN_ALLOW_THREADS
1316         count = SSL_read(self->ssl, PyString_AsString(buf), len);
1317         err = SSL_get_error(self->ssl, count);
1318         PySSL_END_ALLOW_THREADS
1319         if(PyErr_CheckSignals()) {
1320             Py_DECREF(buf);
1321             return NULL;
1322         }
1323         if (err == SSL_ERROR_WANT_READ) {
1324             sockstate = check_socket_and_wait_for_timeout(self->Socket, 0);
1325         } else if (err == SSL_ERROR_WANT_WRITE) {
1326             sockstate = check_socket_and_wait_for_timeout(self->Socket, 1);
1327         } else if ((err == SSL_ERROR_ZERO_RETURN) &&
1328                    (SSL_get_shutdown(self->ssl) ==
1329                     SSL_RECEIVED_SHUTDOWN))
1330         {
1331             _PyString_Resize(&buf, 0);
1332             return buf;
1333         } else {
1334             sockstate = SOCKET_OPERATION_OK;
1335         }
1336         if (sockstate == SOCKET_HAS_TIMED_OUT) {
1337             PyErr_SetString(PySSLErrorObject,
1338                             "The read operation timed out");
1339             Py_DECREF(buf);
1340             return NULL;
1341         } else if (sockstate == SOCKET_IS_NONBLOCKING) {
1342             break;
1343         }
1344     } while (err == SSL_ERROR_WANT_READ || err == SSL_ERROR_WANT_WRITE);
1345     if (count <= 0) {
1346         Py_DECREF(buf);
1347         return PySSL_SetError(self, count, __FILE__, __LINE__);
1348     }
1349     if (count != len)
1350         _PyString_Resize(&buf, count);
1351     return buf;
1352 }
--snip--

Ger Hobbelt

unread,
May 10, 2011, 8:38:32 PM5/10/11
to
On Wed, May 11, 2011 at 12:58 AM, Michael Gale <gale.m...@gmail.com> wrote:

Hey,

    First, thanks for the reply and feedback.

The section of the python code that invokes the function call is below, I highlighted line 138 which is where the exception is occurring according to the stack trace.

--snip-- Python-2.7/lib/python2.7/ssl.py
132     def read(self, len=1024):
133 
134         """Read up to LEN bytes and return them.
135         Return zero-length string on EOF."""
136 
137         try:
138             return self._sslobj.read(len)
139         except SSLError, x:
140             if x.args[0] == SSL_ERROR_EOF and self.suppress_ragged_eofs:
141                 return ''
142             else:
143                 raise
--snip--

I traced this to Python-2.7/Modules/_ssl.c, then to openssl-0.9.8b/ssl/ssl_lib.c where it calls SSL_read. So it does not look like it is calling the internal method.

Nothing glaringly obvious to me in any of the code snippets. :-(   (And, Dave, thanks for catching f.u. where I was missing the __LINE__ in my code!)


It's past 0200 hours here so I'd better get some shut-eye, but here's a few thoughts to ponder:


- you mentioned 'satellite link'. Given the wickedness of the issue, a few baseline checks to get our assumptions straight: does your QA rig simulate the 'long fat pipe' that's usual for satellite comms?  (or does it not?)  (long fat pipe ~ long round trip delay, high RX/TX rate, but I've also seen extremely high RX/TX ratios (way back when: both sides had different uplink bandwidth then; kinda like ADSL but much worse). All this /should/'ve been handled at the TCP level and not affect any upper layers (like SSL/sockets), but better ask now then ride on an incorrect assumption.

- the above implies the assumption that you're using SSL over TCP, not DTLS (which, IIRC, is UDP-based. Correct me if I'm wrong on that one.)

- given a grep through the code, all I can find as 'probable culprits' are internal rbuf (read buffer) related and nothing that can be more or less directly be tickled by SSL_write/read calls. Mix that with my own experience of no trouble whatsoever in the RX/TX buffer dept. for years plus an error which hints at an impending buffer overrun happening at yours (and then, only in production. Nasty bugger to tackle) my next bet it's something related to the 'satellite comms behaviour' as visible at the socket I/F level. (I was also thinking about packet size and thus ever-so-slightly altered recv/send behaviour, but I've been using OpenSSL over PPP, raw serial, Ethernets and a few network oddities and never encountered this. :-S )

- You said you had to 'rollback': is that to a state where you use an older OpenSSL stack, a different SSL stack or no SSL at all? (Just another baseline check here.)


- the only thing that I can see which /might/ have impact (and this is mere conjecture! But we're apparently looking at some sort of obscure edge case, so options are open in my mind) is when both ends differ in their OpenSSL builds, where one has OpenSSL compiled _with_ compression enabled, and the other has not (OPENSSL_NO_COMP).
Line of thought goes like this: the thing that possibly maybe might screw you up is when the other side sends packets which don't somehow fit in the RX buffer, the SSL stack keeps on asking ssl3_read_n() to fetch ever more, and the SSL packet is not caught as overlarge in other parts of the code (which would be odd by itself, but let's continue). So the question is then: how is the RX buffer dimensioned? It's set up in either ssl3_setup_buffers() or ssl3_setup_read_buffer() and space depends on the SSL3_RT_MAX_PACKET_SIZE define. Which, when followed through, is only really dependent on either the OPENSSL_NO_COMP setting or someone seriously screwing around in the OpenSSL code internals. So I opt for the 'no_comp' setting being a 'possibly maybe', however improbable it may be. (because I've used that mix before and no trouble for me; besides, one can expect both flavors to exist in the wild so why is this a first then?)


Another thing that affects packet size in OpenSSL is the SSL_OP_MICROSOFT_BIG_SSLV3_BUFFER option, which you can set, for example, using the SSL_ctrl(s, SSL_CTRL_OPTIONS, SSL_OP_MICROSOFT_BIG_SSLV3_BUFFER, NULL) call (or the eqv. for the SSL_CTX: SSL->options are copied from the SSL_CTX). When one side has the option and the other has not, it just might..... (again, with low probability, but these are the ways I can see that things /could/ go wrong as you described. (Without a system where this can be reproduced, we're down to 'intelligent' guesswork and a couple of straws.)


Cheers and good night!


--

Michael Gale

unread,
May 11, 2011, 12:32:59 AM5/11/11
to


On Tue, May 10, 2011 at 6:38 PM, Ger Hobbelt <g...@hobbelt.com> wrote:
....

Nothing glaringly obvious to me in any of the code snippets. :-(   (And, Dave, thanks for catching f.u. where I was missing the __LINE__ in my code!)


It's past 0200 hours here so I'd better get some shut-eye, but here's a few thoughts to ponder:

I am very thankful for the detail reply as such a late hour. 

- you mentioned 'satellite link'. Given the wickedness of the issue, a few baseline checks to get our assumptions straight: does your QA rig simulate the 'long fat pipe' that's usual for satellite comms?  (or does it not?)  (long fat pipe ~ long round trip delay, high RX/TX rate, but I've also seen extremely high RX/TX ratios (way back when: both sides had different uplink bandwidth then; kinda like ADSL but much worse). All this /should/'ve been handled at the TCP level and not affect any upper layers (like SSL/sockets), but better ask now then ride on an incorrect assumption.

- the above implies the assumption that you're using SSL over TCP, not DTLS (which, IIRC, is UDP-based. Correct me if I'm wrong on that one.)

Our QA setup does not simulate the satellite connection, we did test a single connection over a satellite link but that was a quick sanity test only. We did place a large set of clients behind a Nist Net box (http://www-x.antd.nist.gov/nistnet/index.html) but simulated latency only, so the RTT was increased to 500ms each way. You are correct in that we are using SSL over TCP :)

- given a grep through the code, all I can find as 'probable culprits' are internal rbuf (read buffer) related and nothing that can be more or less directly be tickled by SSL_write/read calls. Mix that with my own experience of no trouble whatsoever in the RX/TX buffer dept. for years plus an error which hints at an impending buffer overrun happening at yours (and then, only in production. Nasty bugger to tackle) my next bet it's something related to the 'satellite comms behaviour' as visible at the socket I/F level. (I was also thinking about packet size and thus ever-so-slightly altered recv/send behaviour, but I've been using OpenSSL over PPP, raw serial, Ethernets and a few network oddities and never encountered this. :-S )

- You said you had to 'rollback': is that to a state where you use an older OpenSSL stack, a different SSL stack or no SSL at all? (Just another baseline check here.)

Currently we have several different flavors and versions of Linux (Redhat 7, Redhat 3, CentOS, etc ...) out there using Java 1.X and some Sun software :(, they are making an SSL over TCP connection to a Java 1.5 Sun server software at our data center.  Because of a compound of issues we decided to place a proxy server written in Python :) between the them. This Python proxy server is using SSL in non-blocking very similar to the Tornado web server released by Facebook. I would say about 2/3 of the connections are from over a satellite network and the rest are from broadband. So we have a great mixture of technology ;)

The "upgrade" was adding the proxy in the middle and the revert was removing it, this issue may be happening now but the Java server logs are useless .... can you tell how I feel about Java :)


- the only thing that I can see which /might/ have impact (and this is mere conjecture! But we're apparently looking at some sort of obscure edge case, so options are open in my mind) is when both ends differ in their OpenSSL builds, where one has OpenSSL compiled _with_ compression enabled, and the other has not (OPENSSL_NO_COMP).
Line of thought goes like this: the thing that possibly maybe might screw you up is when the other side sends packets which don't somehow fit in the RX buffer, the SSL stack keeps on asking ssl3_read_n() to fetch ever more, and the SSL packet is not caught as overlarge in other parts of the code (which would be odd by itself, but let's continue). So the question is then: how is the RX buffer dimensioned? It's set up in either ssl3_setup_buffers() or ssl3_setup_read_buffer() and space depends on the SSL3_RT_MAX_PACKET_SIZE define. Which, when followed through, is only really dependent on either the OPENSSL_NO_COMP setting or someone seriously screwing around in the OpenSSL code internals. So I opt for the 'no_comp' setting being a 'possibly maybe', however improbable it may be. (because I've used that mix before and no trouble for me; besides, one can expect both flavors to exist in the wild so why is this a first then?)


Another thing that affects packet size in OpenSSL is the SSL_OP_MICROSOFT_BIG_SSLV3_BUFFER option, which you can set, for example, using the SSL_ctrl(s, SSL_CTRL_OPTIONS, SSL_OP_MICROSOFT_BIG_SSLV3_BUFFER, NULL) call (or the eqv. for the SSL_CTX: SSL->options are copied from the SSL_CTX). When one side has the option and the other has not, it just might..... (again, with low probability, but these are the ways I can see that things /could/ go wrong as you described. (Without a system where this can be reproduced, we're down to 'intelligent' guesswork and a couple of straws.)


Cheers and good night!

Again thanks for the reply at such a late hour and all the help so far.  

Unless you have any more ideas in the morning I will attempt the following:
* Put our test clients behind a simulator that more accurately reflects a satellite network
* Review our code to ensure we handle this issue correctly, like closing the socket
* Review the Python SSL module more closely to see how it may be causing this issue

Michael



--
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

0 new messages