
HTTP download fails on specific case


Gerhard Reithofer

Mar 23, 2018, 5:38:16 PM
Hi Tclers,
I wanted to create a simple web downloader. In general a complex task
but I found a file which cannot be downloaded completely.

$ tclsh test_download.tcl
Loading file outfile1.txt from http://someonewhocares.org/hosts/hosts using wget
Loading file outfile2.txt from http://someonewhocares.org/hosts/hosts via httpcopy
.......
Date: Fri, 23 Mar 2018 21:23:21 GMT
Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips
content-disposition: attachment: filename=hosts
cache-control: public, max-age=86400
Last-Modified: Thu, 22 Mar 2018 08:13:42 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain
HTTP file copy size is 342964, wget filesize is 416015

The script is the httpcopy example from the Tcl 8.6 http man page, to
which I added a command-line option and a wget call (wget must be
installed for the test) to compare the wget result with the Tcl download.
So far only the hard-coded URL (used if you don't provide a URL on the
command line) has failed; all other tests worked.

Does anyone have an idea why this case fails?
Is it a bug or am I doing something wrong?

Another file works as expected:
$ tclsh test_download.tcl http://www.tech-edv.co.at/TCMS/downloads/tkfplot-0.35.tgz
Loading file outfile1.txt from http://www.tech-edv.co.at/TCMS/downloads/tkfplot-0.35.tgz using wget
Loading file outfile2.txt from http://www.tech-edv.co.at/TCMS/downloads/tkfplot-0.35.tgz via httpcopy
...
Date: Fri, 23 Mar 2018 21:33:41 GMT
Server: Apache
Last-Modified: Tue, 16 Jan 2018 16:14:19 GMT
ETag: "2638-562e705e1bd39"
Accept-Ranges: bytes
Content-Length: 9784
Content-Type: application/x-gzip
Age: 0
Connection: close
HTTP file copy size is 9784, wget filesize is 9784

Here's the complete script: test_download.tcl

package require http

proc httpcopy { url file {chunk 4096} } {
    set out [open $file w]
    set token [::http::geturl $url -channel $out \
        -progress httpCopyProgress -blocksize $chunk]
    close $out

    # This ends the line started by httpCopyProgress
    puts stderr ""

    upvar #0 $token state
    set max 0
    foreach {name value} $state(meta) {
        if {[string length $name] > $max} {
            set max [string length $name]
        }
        if {[regexp -nocase ^location$ $name]} {
            # Handle URL redirects
            puts stderr "Location:$value"
            return [httpcopy [string trim $value] $file $chunk]
        }
    }
    incr max
    foreach {name value} $state(meta) {
        puts [format "%-*s %s" $max $name: $value]
    }

    return $token
}
proc httpCopyProgress {args} {
    puts -nonewline stderr .
    flush stderr
}

#
# === Here starts my additional testing code ===
#
if {[llength $argv]} {
    set url [lindex $argv 0]
} else {
    set url "http://someonewhocares.org/hosts/hosts"
}
set org "outfile1.txt"
set out "outfile2.txt"

puts "Loading file $org from $url using wget"
catch {exec wget $url -O $org}

puts "Loading file $out from $url via httpcopy"
httpcopy $url $out
puts "HTTP file copy size is [file size $out], wget filesize is [file size $org]"

--
Gerhard Reithofer - Techn. EDV Reithofer - http://www.tech-edv.co.at

Brad Lanam

Mar 23, 2018, 7:04:32 PM
What version of Tcl?

Brad Lanam

Mar 23, 2018, 7:32:22 PM
On Friday, March 23, 2018 at 4:04:32 PM UTC-7, Brad Lanam wrote:
> What version of Tcl?

Confirmed issue with version 8.6.8.
I guess not all of the gzip bugs have been fixed.

I use 'wget' myself for my application, though I am using the http
package successfully as a small web server and for
uploading (with gzip, no less).

You could set the appropriate headers and force gzip/deflate off.

Brad Lanam

Mar 23, 2018, 7:41:39 PM
On Friday, March 23, 2018 at 4:32:22 PM UTC-7, Brad Lanam wrote:
> [...] package successfully as a small web server and for
> uploading (with gzip, no less).

Actually, I'm using zlib to gzip the data and then sending it, so
I am not using the http package's gzip.

Gerhard Reithofer

Mar 23, 2018, 7:57:45 PM
Hi Brad,
thanks for confirming it.
So, what can I do to "download" the file completely?

Is it a bug?
If yes, in which part does the bug appear?

Should I write a bug report?

Bye,
Gerhard

Brad Lanam

Mar 23, 2018, 8:09:17 PM
On Friday, March 23, 2018 at 4:57:45 PM UTC-7, Gerhard Reithofer wrote:
> Hi Brad,
> thanks for confirming it.
>
> On Fri, 23 Mar 2018, Brad Lanam wrote:
>
> > On Friday, March 23, 2018 at 4:32:22 PM UTC-7, Brad Lanam wrote:
> > > [...] package successfully as a small web server and for
> > > uploading (with gzip, no less).
> >
> > Actually, I'm using zlib to gzip the data and then sending it, so
> > I am not using the http package's gzip.
>
> So, what can I do to "download" the file completely?
>
> Is it a bug?
> If yes, in which part does the bug appear?
>
> Should I write a bug report?

If you can get the http package to add the
Accept-Encoding: identity
header to the request, you should be able to download any file.

This just turns off any gzip/deflate/etc. encodings, so the
downloaded file will not be compressed. The compression handling
in the http package has had many bugs. This is my guess.

You will notice from your response headers
Vary: Accept-Encoding
Content-Encoding: gzip
that gzip encoding was used.

The other possibility for the bug is in the chunked transfer handling:
Transfer-Encoding: chunked

Yes, a ticket should be opened, especially as this is very reproducible.
http://core.tcl.tk/tcl/ticket

Gerhard Reithofer

Mar 25, 2018, 9:44:35 AM
Hi Brad,
ticket opened:
Ticket UUID: fb642c54bc58b31daafba9ae495ded4b0417d9bc
Title: Incorrect download of compressed encoded data

On Fri, 23 Mar 2018, Brad Lanam wrote:
> On Friday, March 23, 2018 at 4:57:45 PM UTC-7, Gerhard Reithofer wrote:

...

>
> If you can get the http package to add the
> Accept-Encoding: identity
> header to the request, you should be able to download any file.

...

After adding the option
-headers {Accept-Encoding identity}
to the geturl call, it "almost" works.

$ tclsh test_download.tcl
Loading file outfile1.txt from http://someonewhocares.org/hosts/hosts using wget
Loading file outfile2.txt from http://someonewhocares.org/hosts/hosts via httpcopy
.....
Date: Sun, 25 Mar 2018 13:30:27 GMT
Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips
content-disposition: attachment: filename=hosts
cache-control: public, max-age=86400
Last-Modified: Sat, 24 Mar 2018 18:29:00 GMT
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain
HTTP file copy size is 416046, wget filesize is 416039
...

$ diff outfile1.txt outfile2.txt
7698c7698
< 127.0.0.1 ɢoogle.com
---
> 127.0.0.1 ɢoogle.com
9811c9811
< 127.0.0.1 secret.ɢoogle.com
---
> 127.0.0.1 secret.É¢oogle.com
11728c11728
< 127.0.0.1 www.turkishạirlines.com
---
> 127.0.0.1 www.turkishạirlines.com

The content is encoded, but the two tools interpret it differently.
But that's not a big problem for my use case.

> This just turns off any gzip/deflate/etc. encodings, so the
> downloaded file will not be compressed. The compression handling
> in the http package has had many bugs. This is my guess.

Yes, I think so too.

Thank you very much,

Brad Lanam

Mar 25, 2018, 11:43:55 AM
This helps confirm that the bug is in the gzip processing by the http package.

The new issue looks like a character encoding issue (the G in google is not an ASCII g). I just woke up and my brain is not quite working yet. I'm pretty sure
that is solvable, but I can't remember how.
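
One reading that fits the "É¢" in the diff above is that the UTF-8 bytes of the character were decoded one byte per character and then written out again as UTF-8. The following is only a sketch of that assumption, not a trace of what the http package does internally; the variable names are illustrative.

```tcl
# U+0262 (LATIN LETTER SMALL CAPITAL G), the look-alike "G" from the diff.
set ch "\u0262"

# In UTF-8 this character is the two bytes C9 A2.
set wire [encoding convertto utf-8 $ch]
binary scan $wire H* wireHex

# Assumption: the bytes are decoded as ISO 8859-1, i.e. one character
# per byte, which gives the two characters U+00C9 and U+00A2.
set misread [encoding convertfrom iso8859-1 $wire]

# Writing those two characters to a UTF-8 encoded channel produces
# four bytes instead of the original two.
set saved [encoding convertto utf-8 $misread]
binary scan $saved H* savedHex

puts "original bytes:   $wireHex"    ;# c9a2
puts "re-encoded bytes: $savedHex"   ;# c389c2a2
```

If this is what happens, every non-ASCII character grows by a few bytes on disk, which would account for a small size difference between the two output files.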

serg.b...@googlemail.com

Mar 26, 2018, 8:22:35 AM
I've closed the ticket [https://core.tcl.tk/tcl/info/fb642c54bc58b31d], because IMHO it is not really a Tcl issue.

As already mentioned in the ticket, the charset the server uses is unknown, but the script, unlike wget, saves the file in UTF-8,
so you should just open the file with binary translation instead of the system encoding.

This helps:
```diff
- set out [open $file w]
+ set out [open $file wb]
```

Please reopen if I'm wrong.

Regards,
Sergey.
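
Combining the binary-mode open shown above with the Accept-Encoding: identity header suggested earlier in the thread could look roughly like this. It is a minimal sketch under assumptions: the proc name fetchRaw and the status check are not from the thread, and it requires Tcl 8.6 for try/finally.

```tcl
package require http

# Hypothetical helper (not from the thread): store the response body
# of $url in $file without any encoding or end-of-line translation.
proc fetchRaw {url file {chunk 4096}} {
    # "wb" opens the file for writing in binary mode.
    set out [open $file wb]
    try {
        # Ask the server not to compress the body and tell the http
        # package to treat the body as binary data.
        set token [::http::geturl $url \
            -channel $out \
            -blocksize $chunk \
            -binary true \
            -headers {Accept-Encoding identity}]
    } finally {
        close $out
    }
    set status [::http::status $token]
    set code   [::http::ncode $token]
    ::http::cleanup $token
    if {$status ne "ok" || $code != 200} {
        error "download failed: status=$status, HTTP code=$code"
    }
    return [file size $file]
}
```

A call such as `fetchRaw http://someonewhocares.org/hosts/hosts outfile2.txt` would return the number of bytes written. Redirect handling, which the man page example implements, is left out here to keep the sketch short.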

Gerhard Reithofer

Mar 26, 2018, 9:57:26 AM
Hi Sergey,

On Mon, 26 Mar 2018, serg.b...@googlemail.com wrote:

> I've closed the ticket [https://core.tcl.tk/tcl/info/fb642c54bc58b31d], because IMHO it is not really an tcl-issue.
>
> As already mentioned in ticket, the charset the server acts is unknown, but the script in opposite to wget will save it in UTF-8...
> so you should just open file using binary translation instead of the system encoding...
>
> Here helps:
> ```diff
> - set out [open $file w]
> + set out [open $file wb]
> ```

Confirmed: downloading with -headers {Accept-Encoding identity} together
with "set out [open $file wb]" gives the same result as wget.

But without the geturl header {Accept-Encoding identity} the downloaded
file size differs by about 70 kB.
....
"HTTP file copy size is 342957, wget filesize is 416135"

> Please reopen if I'm wrong.

I don't know if this can be seen as expected behavior.
The man page also explains the -binary option, but even setting it to
true and opening the file in binary mode does not download the complete file.

From my point of view it is really hard to find out why geturl behaves
as it does.
Maybe improving the man page, or adding gzip encoding to the man page
example, could give a clearer picture.

BTW: Nevertheless my problem is solved, many thanks to clt :-)
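
The test script in this thread compares file sizes only. To check that the two downloads really are identical, a byte-for-byte comparison could be used. This is a sketch; readBytes and sameBytes are hypothetical helper names, and both files are read fully into memory, which is fine for files of this size.

```tcl
# Hypothetical helper: return the raw bytes of a file.
proc readBytes {path} {
    set f [open $path rb]
    try {
        return [read $f]
    } finally {
        close $f
    }
}

# Hypothetical helper: 1 if both files hold exactly the same bytes, else 0.
proc sameBytes {pathA pathB} {
    if {[file size $pathA] != [file size $pathB]} {
        return 0
    }
    return [expr {[readBytes $pathA] eq [readBytes $pathB]}]
}
```

With the file names from the test script, `sameBytes outfile1.txt outfile2.txt` would return 1 only when the wget and httpcopy results match exactly.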

serg.b...@googlemail.com

Mar 26, 2018, 12:46:15 PM
> Don't know if this can be seen as expected behavior.
> The man page also explains the option -binary and even setting this to
> true and using filemode binary does not download the complete file.
>
> From my point of view it is really hard to find out why geturl does
> behave as it does.

Thus reopened.
Still not clear why it was correct in my test-cases.