[AOLSERVER] SSL data truncation


John Caruso

Jul 15, 2009, 5:26:33 PM
to AOLS...@listserv.aol.com
We've run into a bug with AOLserver 4.5.1 / nsopenssl 3.0beta26. The bug is fully documented here:

https://sourceforge.net/tracker/?func=detail&aid=2822117&group_id=3152&atid=103152

But the short version is that when using the nsopenssl client-side routines (e.g. ns_httpsget), the result may be truncated if the client starts reading before all of the data has been received. This bug ONLY occurs with an AOLserver client (any version) running against an AOLserver 4 / nsopenssl 3.0beta26 server. We've reproduced the bug on RHEL4, RHEL5, and Mac OS X.

The bug is easily demonstrated by copying the file I've attached to this message (sslbug.tcl) to the top-level context of a web server running AOLserver 4.x/nsopenssl 3.0beta26 and then navigating to https://<server>/sslbug.tcl. If you comment out the ns_httpsget and use ns_httpget instead, you'll see that the bug disappears.

We've done a lot of instrumenting of nsopenssl/AOLserver, but haven't been able to track down the root cause. It seems likely that it's related to data buffering, which seems like it would be occurring within AOLserver or Tcl...but the issue is definitely specific to SSL, which implies that it's something in nsopenssl 3.0beta26.

Does anyone have any idea what might be causing this problem?

- John


--
AOLserver - http://www.aolserver.com/

To Remove yourself from this list, simply send an email to <list...@listserv.aol.com> with the
body of "SIGNOFF AOLSERVER" in the email message. You can leave the Subject: field of your email blank.

sslbug.tcl

Scott Goodwin

Jul 15, 2009, 6:17:46 PM
to AOLS...@listserv.aol.com
John,

Tell me what version of OpenSSL you're running.

thanks,

/s.


John Caruso

Jul 15, 2009, 6:35:04 PM
to AOLS...@listserv.aol.com
On Wednesday 03:17 PM 7/15/2009, Scott Goodwin wrote:
>Tell me what version of OpenSSL you're running.

OpenSSL 0.9.8k. It's been happening for many years with different OpenSSL
versions as well.

Tom Jackson

Jul 15, 2009, 7:26:22 PM
to AOLS...@listserv.aol.com
Your SF bug report says that you put in a 300 millisecond delay.
Where? Even if you think that such a fix is not good, it would be
helpful to at least know what works. It might help track down the bug,
or help others start looking at something smaller than
ns_httpsget/post.

You also talk about truncation, but then the truncation stops if the
received data goes above 81000.

It might be a good idea to narrow down when the bug appears (what byte
value) and when it goes away again. This might suggest something.

tom jackson


John Caruso

Jul 15, 2009, 8:15:55 PM
to AOLS...@listserv.aol.com
On Wednesday 04:26 PM 7/15/2009, Tom Jackson wrote:
>Your SF bug report says that you put in a 300 millisecond delay.
>Where? Even if you think that such a fix is not good, it would be
>helpful to at least know what works.

There's a massive amount of debugging I've done on this that's not
included in the bug report, actually, for reasons of brevity. But I did
state that the workaround is to "insert a delay before the data starts
being read by ns_https{post,get}"--or in other words, immediately before
the loops commented with "Read the content" in ns_httpspost/ns_httpsget:

----- 8< ----------------------------------------------------------
#
# Read the content.
#

while 1 {
    set buf [_ns_https_read $timeout $rfd $length]
    append page $buf
    [...]
----- 8< ----------------------------------------------------------

The "after X" statement would go immediately before this while loop.
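
A minimal sketch of where that delay sits relative to the read loop, runnable in plain tclsh. The names fake_read and read_content are hypothetical, and fake_read is only a stand-in for _ns_https_read that returns queued chunks; the 300 ms value is the one mentioned from the bug report. This shows placement only, not the actual nsopenssl code:

```tcl
# Hypothetical stand-in for _ns_https_read: hands back queued chunks,
# then "" once the queue is empty. Not part of nsopenssl.
set chunks [list "abc" "def" "gh"]
proc fake_read {length} {
    global chunks
    if {[llength $chunks] == 0} {
        return ""
    }
    set buf [lindex $chunks 0]
    set chunks [lrange $chunks 1 end]
    return $buf
}

# Hypothetical wrapper showing where the delay goes.
proc read_content {delayMs length} {
    # The workaround: pause once, before the first read.
    after $delayMs

    set page ""
    while 1 {
        set buf [fake_read $length]
        append page $buf
        if {[string match "" $buf]} {
            break
        }
        if {$length > 0} {
            incr length -[string length $buf]
            if {$length <= 0} {
                break
            }
        }
    }
    return $page
}

set page [read_content 300 8]
puts "read [string length $page] bytes"
```

Called with a single integer argument, "after" simply blocks the calling thread for that many milliseconds, so the delay is paid once per request rather than once per read.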

>You also talk about truncation, but then the truncation stops if the
>received data goes above 81000.
>
>It might be a good idea to narrow down when the bug appears (what byte
>value) and when it goes away again. This might suggest something.

I tried that, and it was suggestive but ultimately not much help in
debugging the problem. For one thing, the byte values vary by platform,
and aren't even consistent on the same platform (i.e., a given byte size
might work or fail depending on the run). It's a timing issue, as I said
in the bug report. However, if you're curious, this is an analysis of the
errors at various byte values taken from our internal bug report for this
issue:

----- 8< ----------------------------------------------------------
The error shows up consistently (99.9+% of the time) at 74000 through
81000 bytes (counting by 1000), so I've been using the range of
70000-83000 for testing. Also, some specific testing showed that the
errors actually kick in reliably at 73729 bytes; note that 73728=8192*9.
And in all the succeeding sizes until the errors stop again, the socket
returns exactly 73728 bytes of data regardless of the request size. This
particular run of consistent errors stops at 81884 bytes (though there are
a few rare successes in that range), which doesn't have any suggestive
powers of 2.

So it seems clear that the buffer size affects the reliability in at least
two ways: 1) larger sizes are more likely to fail, and 2) certain
multiples of 8192 are particularly significant in that they're the last
working size before a long stretch of failing sizes (all of which return
that last working size). In addition to 73728=8192*9, I verified that this
happens at 90112=8192*11 and 106496=8192*13, and that it does NOT happen
at 81920=8192*10 or 57344=8192*7. So it would appear that odd multiples of
8192 where the multiplier is >= 9 are the ones that typically start
lengthy failure sequences.
----- 8< ----------------------------------------------------------

Note that this analysis only applies to RHEL4 (the byte-size analysis for
Mac OS X is similar, but the multipliers and trigger levels are different,
though I didn't record the actual values). And even on RHEL4 these aren't
the only values that fail--other smaller and larger buffer sizes will fail
too, just not as consistently.
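
The multiples quoted in the analysis can be checked with a few lines of plain Tcl. This is only an arithmetic check of the reported sizes; the proc name multiplier and the two list names are made up for the example:

```tcl
# Sizes reported in the analysis, split by whether a run of
# failures was observed to start right after them.
set starts   {73728 90112 106496}
set noStarts {81920 57344}

proc multiplier {size} {
    # Returns size/8192 when size is an exact multiple of 8192, else "".
    if {$size % 8192 != 0} {
        return ""
    }
    return [expr {$size / 8192}]
}

foreach size [concat $starts $noStarts] {
    puts "$size = 8192 * [multiplier $size]"
}

# 81884 is where the consistent failures stop; print its remainder
# to confirm it is not a multiple of 8192.
puts "81884 mod 8192 = [expr {81884 % 8192}]"
```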

- John

Mark Aufflick

Jul 22, 2009, 12:01:37 AM
to AOLS...@listserv.aol.com
Hi John,

You say that "This bug ONLY occurs with an AOLserver client (any
version) running against an AOLserver 4 / nsopenssl 3.0beta26 server"

- so you're saying this issue doesn't occur when using httpsget
against, say, Apache?

It seems very odd that it would be server specific - that would fall
in that painful bug category of "If I wanted that behaviour I have no
idea how I would code it"!

Mark.

--
Mark Aufflick
contact info at http://mark.aufflick.com/about/contact

John Caruso

Jul 24, 2009, 12:52:14 AM
to AOLS...@listserv.aol.com
On Tuesday 09:01 PM 7/21/2009, Mark Aufflick wrote:
>You say that "This bug ONLY occurs with an AOLserver client (any
>version) running against an AOLserver 4 / nsopenssl 3.0beta26 server"
>- so you're saying this issue doesn't occur when using httpsget
>against, say, Apache?

Yes, that's correct. As I mention in the bug report, we were unable to
reproduce the bug in any of these scenarios:

- AOLserver client talking to an Apache server
- AOLserver client talking to a Java server
- wget client talking to an AOLserver server
- Firefox/IE client talking to an AOLserver server

And, crucially, it also doesn't happen with an AOLserver client (any
version) running against an AOLserver 3/nsopenssl 2.1a server. For the
bug to occur, the server *must* be AOLserver 4 with nsopenssl 3.0beta26.

>It seems very odd that it would be server specific - that would fall
>in that painful bug category of "If I wanted that behaviour I have no
>idea how I would code it"!

Actually, I think you're going on the assumption that it's a client bug,
but it appears to me that it's a server bug (since an AOLserver
4/nsopenssl 3.0beta26 server is the consistent feature of the failing
scenarios). The odd part to me is that only an AOLserver client triggers
the bug.

By the way, this isn't a theoretical problem; we ran into this bug because
Arena's web application comprises multiple services which sometimes make
client calls to one another via SSL. When we tried to migrate from
AOLserver 3/nsopenssl 2.1a to AOLserver 4/nsopenssl 3.0beta26, we saw
occasional and seemingly random failures on various pages--and after a lot
of investigation we managed to narrow it down to this bug. This is
actually just one of several SSL-related issues that have prevented us
from migrating to AOLserver 4 (but we haven't investigated all of them as
deeply as this one, and so we're hoping this is the root cause of all of
them).

Torben Brosten

Aug 2, 2009, 6:51:46 PM
to AOLS...@listserv.aol.com
Looking through modules/https.tcl ..
ns_httpsopen depends on server's content-length header to be somewhat
accurate or greater than 0 if supplied.

iirc, AOLserver has a bug that returns inaccurate content-lengths,
sometimes 0.

Could this be a/the cause?

Torben

Torben Brosten

Aug 2, 2009, 7:05:34 PM
to AOLS...@listserv.aol.com
Torben Brosten wrote:
> Looking through modules/https.tcl ..
> ns_httpsopen

er, I mean.. ns_httpspost

> depends on server's content-length header to be somewhat
> accurate or greater than 0 if supplied.
>

In particular, won't this code break if a server's header returns
Content-length of 0?

set length [ns_set iget $headers content-length]
if [string match "" $length] {
    set length -1
}
set err [catch {

    #
    # Read the content.
    #

    while 1 {
        set buf [_ns_https_read $timeout $rfd $length]
        append page $buf

        if [string match "" $buf] {
            break
        }
        if {$length > 0} {
            incr length -[string length $buf]
            if {$length <= 0} {
                break
            }
        }
    }
} errMsg]
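
One way to explore the question is to run the same control flow in plain tclsh against a stub. This is a sketch only: make_reader, stub_read and read_page are hypothetical names, and stub_read ignores the requested length, which may not match what _ns_https_read does when asked for 0 bytes:

```tcl
# Hypothetical reader standing in for _ns_https_read. It ignores the
# requested length and returns queued chunks, then "" at end of data.
proc make_reader {data} {
    set ::queue $data
}
proc stub_read {length} {
    if {[llength $::queue] == 0} { return "" }
    set buf [lindex $::queue 0]
    set ::queue [lrange $::queue 1 end]
    return $buf
}

# Same control flow as the quoted loop, with the reader swapped out.
# Returns the accumulated page and the number of read calls made.
proc read_page {length} {
    if {[string match "" $length]} {
        set length -1
    }
    set page ""
    set reads 0
    while 1 {
        set buf [stub_read $length]
        incr reads
        append page $buf
        if {[string match "" $buf]} { break }
        if {$length > 0} {
            incr length -[string length $buf]
            if {$length <= 0} { break }
        }
    }
    return [list $page $reads]
}

# Try a zero header, a missing header, and an accurate header.
foreach length {0 "" 6} {
    make_reader {abc def}
    puts "length='$length' -> [read_page $length]"
}
```

Under that assumption a Content-length of 0 takes the same path as a missing header: the countdown branch is skipped and the loop ends only when a read returns the empty string.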



John Caruso

Aug 10, 2009, 4:04:02 PM
to AOLS...@listserv.aol.com
On Sunday 04:05 PM 8/2/2009, Torben Brosten wrote:
>ns_httpspost depends on server's content-length header to be somewhat
>accurate or greater than 0 if supplied.

That's true, though I'm not sure if it's a problem in the general case or
if it's ok based on the HTTP specs. It's definitely not an issue in the
case of the test code I posted, though, because the server is always
returning the correct value for the content-length header.

- John
