Strange behaviour when downloading files through HTTPS

8 views
Skip to first unread message

Jakub Krajewski

unread,
Nov 8, 2012, 3:59:00 AM11/8/12
to pi...@roxen.com
Hello,

I have encountered and reproduced possible issue with Pike's SSLfile and Protocols.HTTP.

To sum up: when certain conditions are met (see below), Pike client is unable to download complete data from Pike
webserver that uses SSL.

Conditions:
1) There is a Pike-based HTTPS webserver (my example attached as sslbug_server.pike)
2) There is a Pike-based client (my example attached as sslbug_client.pike)
3) There is a WAN between client and server (does not reproduce on localhost)
4) Client downloads one file (let's call it example.data) from server using Protocols.HTTP.get_url() call
-> As a result, file will not be transfered completely (some last fragment is missing).

*But*, when client or server is not Pike-based, it does not reproduce. So, when you use wget against example Pike web
server, it will work. Same, when using example Pike client against any webserver (I tried lighttpd and ngnix), it will
work as well. More to add, when testing on localohst, it will work with Pike on both ends (timing issue?).

It reproduces also on latest Pike compiled upstream (7.9.5 from today), but was found with Pike 7.8 (and reproduces
there as well).

I have setup example webserver for testing (example data is 49152 bytes long): https://krajewski.org:34567/example
(running Pike 7.8 release 352)

One can use attached basic client to reproduce, like this:

./sslbug_client.pike https://krajewski.org:34567/example

Unfortunately I am not fluent in Pike internals, would be great if some Pike SSL developer can take a look at it. :)

Best regards,
--
Jakub Krajewski

sslbug_client.pike
sslbug_server.pike
example.data

Chris Angelico

unread,
Nov 28, 2012, 11:47:18 PM11/28/12
to Pike mailinglist
On Thu, Nov 8, 2012 at 7:59 PM, Jakub Krajewski <jkraj...@opera.com> wrote:
> To sum up: when certain conditions are met (see below), Pike client is
> unable to download complete data from Pike webserver that uses SSL.
>
> Conditions:
> 3) There is a WAN between client and server (does not reproduce on
> localhost)
> -> As a result, file will not be transfered completely (some last fragment
> is missing).
>
> More to add, when testing on localohst, it
> will work with Pike on both ends (timing issue?).

Hi there! Looks like this hasn't had any responses from experts, so
I'm picking it up to see what I can find out.

My testing environment: Two separate uplinks with two different IP
addresses via different ISPs, and a LAN with a number of available
boxes, most of which have some version of Pike installed. One Windows
laptop with 7.8.700, one Linux server with 7.8.352, and one Linux
desktop with a fairly recent 7.9.5.

As you noted, going from any computer to localhost does not exhibit
the error. However, going from 7.8.352-server to itself via the other
IP address also doesn't - timing-wise, that's equivalent to it being
on a completely different computer. Also, I can't reproduce the
problem going from Lin-desktop to Lin-server, but it DOES happen back
the other way.

Here's the matrix of my test results (in Google Docs):

https://docs.google.com/spreadsheet/ccc?key=0AuwuhNnYgCgzdHVhLVNQRHlKYUg0RVhNMkRTR1dlZ3c

or http://tinyurl.com/ape7pvd

My first guess was that it was a Pike version problem, but a Pike
v7.8.700 behaved exactly like the 7.9.5 on the same computer.

Here's another theory. My Linux desktop is 64-bit, while the Linux
server and Windows laptop are 32-bit machines. (Yes, I have a 32-bit
server. It was deployed in late 2009, when I was on a tight budget,
and upgrading it now would mean shutting it down, which hasn't
happened since July 2010. Pike process still running since then, too.)

Okay. Pulling up a 32-bit box that has an older build of 7.9.5 on it.
Similar results.

The table makes things look pretty bad for the Windows box, which has
also been the slowest computer among those tested, so it could still
be a timing issue. But there's no clear cause for the problem. It's
quite peculiar.

Every one of the numbers seen there is repeatable. The pairings that
result in 48919 bytes transferred *always* result in 48919 bytes
transferred. I completely do not understand what's going on, but
hopefully more data will help!

ChrisA

Jakub Krajewski

unread,
Nov 29, 2012, 8:34:14 AM11/29/12
to pi...@roxen.com, Pike mailinglist
On 29.11.2012 05:47, Chris Angelico wrote:
[...]
> Hi there! Looks like this hasn't had any responses from experts, so
> I'm picking it up to see what I can find out.

Thank you Chris for checking this issue! The whole case is quite strange,
and your results helps a lot by confirming its Pike version independency
and timing related nature.

[...]
That is very good, thanks! :)

> My first guess was that it was a Pike version problem, but a Pike
> v7.8.700 behaved exactly like the 7.9.5 on the same computer.

I encountered this initially using Pike 7.8 (but older than .700). But this was
also my first pick. So I started to check the issue with Pike's upstream, and as
you mentioned, it is reproduceable even on current upstream. More to say,
that (same as you) did not find any correlation to 32-bit/64-bit architecture.

[...]
> Every one of the numbers seen there is repeatable. The pairings that
> result in 48919 bytes transferred *always* result in 48919 bytes
> transferred. I completely do not understand what's going on, but
> hopefully more data will help!

I debugged Pike internals using debug prints once and found that end of session
may lead to some results - see below, this one is end of session for Pike client:

----- BEGINNING ------
[...]
[thr:139998730520352,fd:10] ssl_read_callback (0, string[3291]): nonblocking mode=0, callback mode=0
[thr:139998730520352,fd:10] queue_write: Got nothing to write (0 strings buffered)
[thr:139998730520352,fd:10] ssl_read_callback: Got 16384 bytes of application data
[thr:139998730520352,fd:10] Running local backend [r:1 w:0], infinite timeout
[thr:139998730520352,fd:10] ssl_close_callback (0): nonblocking mode=0, callback mode=0
[thr:139998730520352,fd:10] ssl_close_callback: Abrupt close - simulating System.EPIPE
[thr:139998730520352,fd:10] Local backend ended with error
[thr:139998730520352,fd:10] update_internal_state: Not in callback mode [rcb:0]
[thr:139998730520352,fd:10] SSL.sslfile->read: Read done, returning 9723904 bytes (0 still in buffer)
[thr:139998730520352,fd:10] SSL.sslfile->read (11179, 0)
[thr:139998730520352,fd:10] SSL.sslfile->read: Propagating old callback error: Broken pipe
[thr:139998730520352,fd:10] SSL.sslfile->set_nonblocking (0, 0, 0, 0, 0, 0)
[thr:139998730520352,fd:10] update_internal_state: Not in callback mode [rcb:-1]
[thr:139998730520352,fd:10] SSL.sslfile->close (0, 0, 0)
[thr:139998730520352,fd:10] queue_write: Got nothing to write (0 strings buffered)
[thr:139998730520352,fd:10] ssl_read_callback (0, 0 (queued extra call)): nonblocking mode=1, callback mode=0, closing
(0, 2, 1)
[thr:139998730520352,fd:10] update_internal_state: In local backend - nothing done [rcb:0]
[thr:139998730520352,fd:10] Running local backend [r:0 w:1], zero timeout
[thr:139998730520352,fd:10] ssl_write_callback (0): nonblocking mode=1, callback mode=0, closing (0, 2, 1)
[thr:139998730520352,fd:10] ssl_write_callback: Queuing close packet
[thr:139998730520352,fd:10] queue_write: Got 29 bytes to write (1 strings buffered)
[thr:139998730520352,fd:10] update_internal_state: In local backend - nothing done [rcb:0]
[thr:139998730520352,fd:10] ssl_write_callback: Wrote 29 bytes (0 strings left)
[thr:139998730520352,fd:10] update_internal_state: In local backend - nothing done [rcb:0]
[thr:139998730520352,fd:10] queue_write: Connection closed normally
[thr:139998730520352,fd:10] ssl_write_callback: Close packet sent
[thr:139998730520352,fd:10] ssl_write_callback: In or after local close - shutting down
[thr:139998730520352,fd:10] SSL.sslfile->shutdown(): Proper close
[thr:139998730520352,fd:10] SSL.sslfile->shutdown(): Nonclean close - closing stream
[thr:139998730520352,ex fd:10] Local backend ended after close
[thr:139998730520352,ex fd:10] direct_write: Ok
[thr:139998730520352,ex fd:10] SSL.sslfile->shutdown(): Already shut down
[thr:139998730520352,ex fd:10] SSL.sslfile->close: Close done
[thr:139998730520352,ex fd:10] SSL.sslfile->destroy()
----- END ------

And for server that was serving data for the client above:

----- BEGIN ------
[thr:139973660772096,fd:12] SSL.sslfile->write (string[11179])
[thr:139973660772096,fd:12] queue_write: Got nothing to write (0 strings buffered)
[thr:139973660772096,fd:12] direct_write: Ok
[thr:139973660772096,fd:12] SSL.sslfile->write: Queued data[0..11178]
[thr:139973660772096,fd:12] queue_write: Got 11205 bytes to write (1 strings buffered)
[thr:139973660772096,fd:12] update_internal_state: Callback mode [r:1 w:1 rcb:0]
[thr:139973660772096,fd:12] Running local backend [r:0 w:1], zero timeout
[thr:139973660772096,fd:12] Nonwaiting local backend ended - nothing to do
[thr:139973660772096,fd:12] update_internal_state: Callback mode [r:1 w:1 rcb:0]
[thr:139973660772096,fd:12] direct_write: Ok
[thr:139973660772096,fd:12] SSL.sslfile->write: Write string done, accepted 11179 bytes
[thr:139973660772096,fd:12] SSL.sslfile->close (0, 0, 0)
[thr:139973660772096,fd:12] queue_write: Got nothing to write (1 strings buffered)
[thr:139973660772096,fd:12] Running local backend [r:0 w:1], zero timeout
[thr:139973660772096,fd:12] Nonwaiting local backend ended - nothing to do
[thr:139973660772096,fd:12] update_internal_state: After close [r:0 w:1 rcb:0]
[thr:139973660772096,fd:12] direct_write: Ok
[thr:139973660772096,fd:12] SSL.sslfile->close: Close underway
[thr:139973660772096,fd:12] update_internal_state: After close [r:0 w:1 rcb:0]
[thr:139973660772096,fd:12] SSL.sslfile->destroy()
[thr:139973660772096,fd:12] SSL.sslfile->shutdown(): Abrupt close
[thr:139973660772096,fd:12] SSL.sslfile->shutdown(): Nonclean close - closing stream
------ END ------

Same time, this one is from Pike server serving wget client (this orks OK all the time):

------ BEGIN ------
[thr:139663018305280,fd:12] SSL.sslfile->write: Write string done, accepted 11184 bytes
[thr:139663018305280,fd:12] SSL.sslfile->set_nonblocking (Protocols.HTTP.Server.Request(0 0)->read_cb, 0,
Protocols.HTTP.Server.Request(0 0)->close_cb, 0, 0, 0)
[thr:139663018305280,fd:12] update_internal_state: Callback mode [r:1 w:1 rcb:0]
[thr:139663018305280,fd:12] ssl_write_callback (1): nonblocking mode=1, callback mode=1
[thr:139663018305280,fd:12] ssl_write_callback: Wrote 11209 bytes (0 strings left)
[thr:139663018305280,fd:12] update_internal_state: Callback mode [r:1 w:0 rcb:0]
[thr:139663018305280,fd:12] queue_write: Got nothing to write (0 strings buffered)
[thr:139663018305280,fd:12] ssl_close_callback (1): nonblocking mode=1, callback mode=1
[thr:139663018305280,fd:12] ssl_close_callback: Abrupt close - simulating System.EPIPE
[thr:139663018305280,fd:12] ssl_close_callback: Calling close callback Protocols.HTTP.Server.Request(0 0)->close_cb
(error Broken pipe)
[thr:139663018305280,fd:12] SSL.sslfile->close (0, 0, 0)
[thr:139663018305280,fd:12] queue_write: Got nothing to write (0 strings buffered)
[thr:139663018305280,fd:12] Running local backend [r:0 w:1], zero timeout
[thr:139663018305280,fd:12] ssl_write_callback (0): nonblocking mode=1, callback mode=0, closing (0, 2, 1)
[thr:139663018305280,fd:12] ssl_write_callback: Queuing close packet
[thr:139663018305280,fd:12] queue_write: Got 27 bytes to write (1 strings buffered)
[thr:139663018305280,fd:12] update_internal_state: In local backend - nothing done [rcb:0]
[thr:139663018305280,fd:12] ssl_write_callback: Wrote 27 bytes (0 strings left)
[thr:139663018305280,fd:12] update_internal_state: In local backend - nothing done [rcb:0]
[thr:139663018305280,fd:12] queue_write: Connection closed normally
[thr:139663018305280,fd:12] ssl_write_callback: Close packet sent
[thr:139663018305280,fd:12] ssl_write_callback: In or after local close - shutting down
[thr:139663018305280,fd:12] SSL.sslfile->shutdown(): Proper close
[thr:139663018305280,fd:12] SSL.sslfile->shutdown(): Nonclean close - closing stream
[thr:139663018305280,ex fd:12] Local backend ended after close
[thr:139663018305280,ex fd:12] direct_write: Ok
[thr:139663018305280,ex fd:12] SSL.sslfile->shutdown(): Already shut down
[thr:139663018305280,ex fd:12] SSL.sslfile->close: Close done
[thr:139663018305280,ex fd:12] SSL.sslfile->destroy()
------ END ------

Still however I was not able to locate failure. I am working on it. It looks like
this bug does not have many friends (only us two), but I am not going to disappoint him
and fix the issue eventually, it will just take more time I suppose. :)

Best regards,
Jakub


Chris Angelico

unread,
Nov 29, 2012, 8:42:18 AM11/29/12
to Pike mailinglist
On Fri, Nov 30, 2012 at 12:34 AM, Jakub Krajewski <jkraj...@opera.com> wrote:
> Still however I was not able to locate failure. I am working on it. It looks
> like
> this bug does not have many friends (only us two), but I am not going to
> disappoint him
> and fix the issue eventually, it will just take more time I suppose. :)

I'm sure that if the bug could be pinned down and a patch produced,
one of the committers would be interested! But until then, the only
people trying to pierce the vagueness will be the one who actually ran
into it (you) and anyone who found themselves on a day off with no
urgent projects (me).

ChrisA

Stephen R. van den Berg

unread,
Nov 29, 2012, 1:31:38 PM11/29/12
to pi...@roxen.com
Jakub Krajewski wrote:
>Still however I was not able to locate failure. I am working on it. It looks like
>this bug does not have many friends (only us two), but I am not going to disappoint him
>and fix the issue eventually, it will just take more time I suppose. :)

This problem sounds like a buffering problem, possibly triggered by timing
and packet boundaries.

To pinpoint the problem, the Pike debug traces you already provided should
be augmented by millisecond timestamps, a corresponding strace, possibly ltrace
with system calls, and a corresponding tcpdump trail (all logs including
millisecond timestamps to correlate).

Create the logs for two situations, one where it doesn't work, and one where
it works. Most likely Pike somewhere "forgets" to process the final read
because it mistakenly thinks the read buffer is already empty.
--
Stephen.

"It has been said that the only standard thing about all UNIX systems is the
message-of-the-day telling users to clean up their files." -- SysV.2 manual

Chris Angelico

unread,
Nov 30, 2012, 11:11:05 PM11/30/12
to pi...@roxen.com
On Fri, Nov 30, 2012 at 5:31 AM, Stephen R. van den Berg <s...@cuci.nl> wrote:
> Jakub Krajewski wrote:
>>Still however I was not able to locate failure. I am working on it. It looks like
>>this bug does not have many friends (only us two), but I am not going to disappoint him
>>and fix the issue eventually, it will just take more time I suppose. :)
>
> This problem sounds like a buffering problem, possibly triggered by timing
> and packet boundaries.
>
> To pinpoint the problem, the Pike debug traces you already provided should
> be augmented by millisecond timestamps, a corresponding strace, possibly ltrace
> with system calls, and a corresponding tcpdump trail (all logs including
> millisecond timestamps to correlate).
>
> Create the logs for two situations, one where it doesn't work, and one where
> it works. Most likely Pike somewhere "forgets" to process the final read
> because it mistakenly thinks the read buffer is already empty.

It's a Heisenbug! As soon as I strace the server, the problem ceases
to occur. Definitely some kind of timing issue. But I've managed to
get an strace of a failed connection by making use of the Windows box
- which means I can't strace the client, only the server.

I'm having trouble getting a usable tcpdump, though. My command line is simply:
$ sudo tcpdump -vv >tcpdump_working.log
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size
65535 bytes
^C10 packets captured
141 packets received by filter
101 packets dropped by kernel

I start that, then start the server in another terminal window:

$ strace -tt -ostrace_working.log pike sslbug_server.pike
Waiting for connection
Handling status request Protocols.HTTP.Server.Request("GET" "/example")
Size of data: 49152

And then, via SSH, start the client:

$ pike sslbug_client.pike https://192.168.0.19:34567/example
Downloading, please wait...
Total bytes: 49152
Downloaded bytes: 49152

The strace log shows plenty of good-looking data, but the tcpdump
shows only background noise (netbios packets going around the network,
a few DNS queries and responses, and UDP packets looking for
Transmission (which I shut down for the test)). I eyeballed the
tcpdump man page, but since I'm already not using -p or a filter,
there was nothing there that suggested that I wouldn't see all
packets. What am I doing wrong? Is it possible that something's
getting confused because this computer has two NICs? (eth0 is the one
I'm using for this test; wlan0 connects to my Walled Garden wifi
network, and this computer is the gateway/firewall between the walled
garden and the rest of the internet.)

Three straces are attached. Two are of connections from Traal, my
Windows box, to Sikorsky, the Linux router - see the GDocs spreadsheet
linked above for more info on that. They're called strace_795.log and
strace_78700.log, according to the version of Pike running the server
on Sikorsky for that test. Finally, a third test, strace_working.log,
shows a connection from Huix's 7.9.5 to Sikorsky's 7.9.5, which worked
correctly.

It's a huge lot of raw data, and I'm not familiar enough with Pike's
internals to be able to interpret it any, sorry! Hopefully it's of
value. I'm available to do further tests if someone points me towards
what'd be useful.

ChrisA
strace_795.log
strace_78700.log
strace_working.log

Jakub Krajewski

unread,
Dec 3, 2012, 5:33:29 AM12/3/12
to pi...@roxen.com
On 29.11.2012 19:31, Stephen R. van den Berg wrote:
[...]
> To pinpoint the problem, the Pike debug traces you already provided should
> be augmented by millisecond timestamps, a corresponding strace, possibly ltrace
> with system calls, and a corresponding tcpdump trail (all logs including
> millisecond timestamps to correlate).

Results from Pike compiled from today's master (7.9.5). Logs gathered at same time. 100% reproduceability.
No ltrace logs however.

1) pike vs pike:
http://people.opera.com/jkrajewski/sslbug/pike_pike/

2) wget vs pike:
http://people.opera.com/jkrajewski/sslbug/wget_pike/

3) pike vs nginx (this however was FreeBSD box, but also 100% reproduceability):
http://people.opera.com/jkrajewski/sslbug/pike_nginx/

Jakub Krajewski

unread,
Dec 3, 2012, 6:55:44 AM12/3/12
to pi...@roxen.com
On 03.12.2012 11:33, Jakub Krajewski wrote:
> On 29.11.2012 19:31, Stephen R. van den Berg wrote:
> [...]
>> To pinpoint the problem, the Pike debug traces you already provided should
>> be augmented by millisecond timestamps, a corresponding strace, possibly ltrace
>> with system calls, and a corresponding tcpdump trail (all logs including
>> millisecond timestamps to correlate).
>
> Results from Pike compiled from today's master (7.9.5). Logs gathered at same time. 100% reproduceability.
> No ltrace logs however.
>
> 1) pike vs pike:
> http://people.opera.com/jkrajewski/sslbug/pike_pike/

By mistake I put wrong logs there (ones with pass from non-pike server), corrected.
Reply all
Reply to author
Forward
0 new messages