1) First, the call waits/blocks until the timeout expires.
2) Second, the data is incomplete; only part of it arrives. I use a
-blocksize of 4096.
3) Third, during this period it completely hogs the CPU and it becomes
impossible to do anything else.
I just tested the same call against an older version (8.4.17 with http
version 2.5.3) and everything is fine.
DrS
Anyone care to confirm this?
I could revert back to an older version (ok for the time being), try to
find a work-around (not exciting) or just avoid it altogether by not
relying on it.
DrS
If you could report the exact conditions which cause this it would
help, otherwise not much anyone can do. A few weeks ago someone
reported a similar "bug", but not quite the same. (If it is a public
url, please include it.)
From what I have seen of the code, a timeout, if it works, should
result in an incomplete response. I have no idea whether an error is
reported to help you figure this out. But "chunked" transfers put the
code into blocking mode, which will screw up everything until the
transfer is complete. You could try sending an HTTP/1.0 request, which
will give you a regular non-chunked response. I'm not sure off-hand how
you change the HTTP version; geturl's -protocol option may be the way.
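If so, a minimal sketch (untested; it just relies on the documented
-protocol option) would be:
package require http
set tok [http::geturl http://www.yahoo.com/ -protocol 1.0 -timeout 60000]
puts [http::status $tok]
http::cleanup $tok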
If you want to compare with a different http client, you could test my
experimental htclient:
http://www.junom.com/gitweb/gitweb.perl?p=htclient.git
You can't use a channel for posted data. An example POST is here:
http://www.junom.com/gitweb/gitweb.perl?p=htclient.git;a=blob;f=bin/test-post.tcl
Only thing I can guarantee is that the thing will not block your
application (it works best with Tk/wish, just don't use vwait in the
example).
> If you could report the exact conditions which cause this it would
> help, otherwise not much anyone can do. A few weeks ago someone
> reported a similar "bug", but not quite the same. (If it is a public
> url, please include it.)
I am not sure I follow - I don't have a server. I am using the client,
http::geturl, to fetch yahoo's main page. Here is the call that causes
trouble:
set file_id [open myfile.html w]
http::geturl http://www.yahoo.com -channel $file_id \
    -blocksize 4096 -timeout 60000
close $file_id
This call blocks for a minute and then the file contains only partial
data from Yahoo. The same call works fine with the previous version of
the http package.
> If you want to compare with a different http client, you could test my
> experimental htclient:
Thanks, I think I may do that if this does not lead anywhere.
DrS
Does this fail for just this url, or for all of them? You did leave off
the "path".
How about http://www.yahoo.com/ ?
Same thing - no difference. I have not tested with other URLs. It
retrieves only 4165 bytes from Yahoo, which seems to be the same from
run to run.
I also wonder what it is doing that requires the full power of the CPU
during this wait.
DrS
A runaway loop, maybe? I would try some other urls, maybe something
smaller than the buffer, or bump the buffer up to 10k and try Yahoo
again. Also, is the content making it into the file? Maybe try without
the -channel option.
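A quick test without -channel might look like this (a minimal sketch;
the file name is just a placeholder):
package require http
set tok [http::geturl http://www.yahoo.com/ -timeout 60000]
set fid [open myfile.html w]
puts -nonewline $fid [http::data $tok]
close $fid
http::cleanup $tok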
Wow, you're right. Please file a bug so that we can track it properly.
While Tom's code probably shows the way for the future of this
package, for the time being we might simply plug the most blatant
holes, like this one :/
-Alex
The above worked for me with Tcl 8.5.7 and http 2.7.3 on Mac OS X 10.6.1.
So either the bug was introduced in http 2.7.4 or it is platform dependent
-- which OS are you using?
BTW, here is what I see of the file:
$ ls -l myfile.html
-rw-r--r-- 1 gerald gerald 9490 Nov 10 10:15 myfile.html
$ tail myfile.html
<a href="r/cp">Company Info</a> |
<a href="r/1q">Participate in Research</a> |
<a href="r/hr">Jobs</a>
</font></td></tr></table>
</td></tr></table>
</center>
</body>
</html>
<!-- pbt 1257868103 -->
--
+------------------------------------------------------------------------+
| Gerald W. Lester |
|"The man who fights for his ideals is the man who is alive." - Cervantes|
+------------------------------------------------------------------------+
>
> The above worked for me with Tcl 8.5.7 and http 2.7.3 on Mac OS X 10.6.1.
>
> So either the bug was introduced in http 2.7.4 or it is platform
> dependent -- which OS are you using?
>
I am on Windows Vista. I use stock ActiveTcl versions. It could depend
on the platform. I am not really sure.
Can you do a diff to see what has changed?
DrS
I am getting back to tcl in a limited way after a long while. So sorry
if this is too simple: how do I file a bug? Any volunteers?
DrS
Btw, this bug is not the same as the chunk-size issue. Yahoo!'s home
page appears to have been written about ten years ago, is static, and
doesn't use HTTP/1.1 chunked transfer. It is possible that some kind of
while loop is looking for extra bytes while the socket is at eof. I
only have 2.7.2 with Tcl 8.6, on Linux, so I don't see any problem with
the selected options. I also noticed that if you use [chan pending] to
decide how many bytes to read, you must read at least one byte, even
if [chan pending] returns zero; otherwise you get an infinite loop
even without a while loop. One way to check this would be to use a
buffer larger than the request (at least 10k for Yahoo!), so [chan
pending] would never return zero.
I also don't think the chunk-size issue would cause CPU usage; it
should just put the application to sleep if the socket blocks.
The question is whether it works fine without the -channel option; or
try stdout/stderr as the channel and see if that works.
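To illustrate the [chan pending] point, here is a minimal sketch of a
readable handler that applies the read-at-least-one-byte rule ($sock is
assumed to be an already-open socket):
proc OnReadable {sock} {
    set n [chan pending input $sock]
    if {$n == 0} { set n 1 }   ;# read at least one byte even when pending reports 0
    set data [chan read $sock $n]
    # ... handle $data here ...
    if {[chan eof $sock]} {
        chan close $sock
    }
}
chan configure $sock -blocking 0
chan event $sock readable [list OnReadable $sock]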
Not for a couple of hours (need to get some paying work done).
Go there:
http://www.tcl.tk/community/sourceforge.html
and in the "tcl" row, choose "Bug Database".
-Alex
I just tried this on Windows Vista, with ActiveTcl and http version
2.7.3.
I get exactly the same buggy behavior as described by DrS. I get 4165
bytes of the file, also the same.
But I tried upping the buffer to 10k and it works fine.
With the 4k buffer it consumes almost 100% of 1 cpu.
Thanks for confirming this. It is interesting that it would work with
10k as buffer size. By the way, I tried this with ActiveTcl version
8.4.19 and still get the same buggy behavior. Unless they happen to
share the same http code, I suspect the bug may be in the core.
DrS
I'm an admitted non-expert on the http code. But it looks very simple.
My guess is that the two-procedure loop (httpCopyStart <-->
httpCopyDone) is not working correctly. The -blocksize sets the value
of the [fcopy -size] option; I had thought it was the channel buffer
size. But setting the -blocksize larger than the file size is
guaranteed to keep the two-procedure loop from ever being needed.
I'm not sure why the two-procedure loop is needed at all; it is
probably so you can report progress during the copy.
I'm wondering if somehow [fcopy] suffers from the same very strange
issue I discovered with [chan pending]. What happens is that all
buffer bytes are copied out of the buffer. Then the channel becomes or
stays readable, but [chan pending] would return zero bytes. Any code,
even on the C level which relies on this will fail if it does not read
one byte. The buffer remains empty and readable.
[chan pending (input/output)] just reflects the C APIs:
Tcl_InputBuffered and Tcl_OutputBuffered. However, zero is supposed to
indicate that the channel isn't opened for the requested operation.
This obviously isn't true on the Tcl level. So something is screwy
here.
If you need a quick workaround for this, set the -blocksize to -1. If
this gets passed through to [fcopy], it should copy the entire file
(note that internally [fcopy] will yield on every buffer transfer).
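A sketch of that workaround (untested; it assumes the -1 really is
handed straight to [fcopy -size]):
set fid [open myfile.html w]
http::geturl http://www.yahoo.com/ -channel $fid -blocksize -1 -timeout 60000
close $fid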
> If you need a quick workaround for this, set the -blocksize to -1. If
> this gets passed through to [fcopy], it should copy the entire file
> (note that internally [fcopy] will yield on every buffer transfer).
Some additional notes for anyone looking further into this issue. I
still haven't found the root cause, but I have narrowed down what is
happening.
First, the two background procs are CopyStart and CopyDone (in
the ::http namespace). CopyStart runs this [fcopy] code:
fcopy $sock $state(-channel) -size $state(-blocksize) -command \
[list http::CopyDone $token]
Apparently one copy is done, although the number of bytes written is a
little larger than the -size (4096). This is the same as what happens
with a stand-alone [fcopy] (I adapted the copymore example from the
[fcopy] manpage).
But the callback command never runs, not even once, again in contrast
to the copymore example. So... somehow [fcopy] is screwing up, but only
inside the http package.
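For comparison, a minimal stand-alone chunked copy in the spirit of the
manpage's copymore example might look like this (proc and variable
names are made up for the sketch):
proc CopyMore {in out chunk bytes {error {}}} {
    if {$error ne "" || [eof $in]} {
        close $in
        close $out
    } else {
        fcopy $in $out -size $chunk -command [list CopyMore $in $out $chunk]
    }
}
# $sock is an open socket, $fid an open file channel
fcopy $sock $fid -size 4096 -command [list CopyMore $sock $fid 4096]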
The only reason the -blocksize option is needed is that it allows
tracking the progress of the transfer. Assuming this problem ever gets
solved, it might be better to only enable this option when there is a
progress-tracking proc (state(-progress)), and otherwise set the
blocksize to -1, since [fcopy] reschedules itself after every buffer
copy anyway.
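A sketch of that idea, as it might look inside CopyStart (hypothetical,
not the actual package code):
if {[info exists state(-progress)] && $state(-progress) ne {}} {
    set size $state(-blocksize)     ;# chunk the copy so -progress can be reported
} else {
    set size -1                     ;# let fcopy move the whole body at once
}
fcopy $sock $state(-channel) -size $size -command [list http::CopyDone $token]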
Oops, I realize my post about the bugreport I created doesn't show up.
Sorry about that. You can join the investigation there:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=2895565&group_id=10894
-Alex
Thanks for entering it into the tracker database. I tried a couple of
days ago and was presented with an error page. After a few tries, I
decided to wait and try later.
DrS
Thanks for following up on this - I can see it is an area you know
well, given your recent messages and your own client covering this very
topic. Hopefully, we will have an official resolution on this soon.
DrS
Wow, that explains a lot. Thanks for getting me off the hook on
tracking this down any further.
It seems like an internal issue with fcopy/Tcl I/O, but eventually the
http package may need a fix as well: the http body should not go
through eol translations. Of course, if the http package got this
right, we might not have stumbled across this bug.
Well, maybe. Depends on whether you regard the body as binary data or
textual data. Arguably for text/* content types you should perform EOL
translation...
Donal.
If this were true, then server applications would do the translation
to the network eol sequence <CR><LF>. But this doesn't happen; the
document is transferred as binary data, because that is what it is. The
Content-Length header counts bytes, and the body specifically does not
end with an additional <CR><LF>.
There are limitless complications to performing eol translations. For
instance, what if you are downloading source code, which is text/
plain? With windows you also have code pages, wide chars, etc. which
the server may not capture and include in the content-type.
But plain experience with how mainstream browsers work should
illustrate the situation: download a tcl source file and try to edit
it with notepad. For instance, take Tcl's 2007 changelog with your
initials, download it with mozilla, save it as a text file and open it
with notepad. So even saving to disk as a text document does not
perform eol translation (you get the expected garbage-looking text).
However, I'm arguing something less: the data is binary/opaque during
transfer. If the application can then figure out how to do a
transform, great, but it isn't part of http and the http part of the
application should preserve the exact data it received.
You're getting confused here. Servers shovel all sorts of crap down
the pipe. They ship the bytes as they are because that's fastest.
> There are limitless complications to performing eol translations. For
> instance, what if you are downloading source code, which is text/
> plain? With windows you also have code pages, wide chars, etc. which
> the server may not capture and include in the content-type.
Actually servers are supposed to capture all that and describe it in
the content type (through the charset parameter). That some miserably
fail to do this does not change the fact that they *should* do it. End-
of-line translations (and charset translations too) are meaningful for
all the text/* content types, though there may be times when you don't
want to perform them. (They're not meaningful at all for the other
major classes of content type, like image/* or application/*, where
only binary transfer even approaches sanity.)
> But plain experience with how mainstream browsers work should
> illustrate the situation: download a tcl source file and try to edit
> it with notepad. For instance, take Tcl's 2007 changelog with your
> initials, download it with mozilla, save it as a text file and open it
> with notepad. So even saving to disk as a text document does not
> perform eol translation (you get the expected garbage-looking text).
This doesn't change the fact that you're wrong. You're using a browser
in a particular mode ("save a copy", which doesn't correspond to the
one I'm thinking of EOL translation being particularly useful in) and
you're also claiming that the only possible interpretation of a text
file is the one that it was originally created in. That's just not
true. In particular, it would mean that changing the encoding would be
prohibited, and that therefore every application that might want to
process that file must play "guess the encoding and translation" with
it, despite the (hopefully, but not necessarily, correct) metadata
being stripped from it at that point. In fact, it's because of
troubles like these that certain high-quality text processing
algorithms (e.g., XML Signature) use a canonicalization step. (OK, for
XML there are other things that need doing too, but encoding handling
is indeed part of it.)
The fundamental truth is that for text data, equivalence is not really
defined at the byte level. Instead, it's at the character level after
end-of-line handling. A lot of existing code gets this wrong; a lot of
people still don't understand the differences between bytes and
characters.
> However, I'm arguing something less: the data is binary/opaque during
> transfer. If the application can then figure out how to do a
> transform, great, but it isn't part of http and the http part of the
> application should preserve the exact data it received.
I disagree. HTTP is not just a binary download protocol.
Donal.
> > However, I'm arguing something less: the data is binary/opaque during
> > transfer. If the application can then figure out how to do a
> > transform, great, but it isn't part of http and the http part of the
> > application should preserve the exact data it received.
>
> I disagree. HTTP is not just a binary download protocol.
Just point me to the relevant part of the protocol, because I'm not
the only one who has missed this completely.
And please explain why Mozilla failed your interpretation. Why didn't
it convert <LF> to <CR><LF>?
BTW, I don't think there is anything wrong with an application doing
the conversion, but you are claiming that http must handle this and
that the client should do the conversion automatically.
You seem to require that every http client should both understand and
be able to handle every possible encoding and character set. Because
the alternative is to sometimes perform an irreversible
transformation, and sometimes not.
But it isn't even clear when you think such a transformation should
take place. Http is a transport protocol, not a file saving protocol.
My example simply proves that no conversion takes place. If the save
operation had performed eol conversions, I could not so easily
demonstrate that the http protocol did not perform the conversion. But
the saved document did not contain <CR><LF> conversions, proving that
at least mozilla on windows does not follow your interpretation.
> This doesn't change the fact that you're wrong. You're using a browser
> in a particular mode ("save a copy", which doesn't correspond to the
> one I'm thinking of EOL translation being particularly useful in) and
> you're also claiming that the only possible interpretation of a text
> file is the one that it was originally created in. That's just not
> true.
The http stuff is long over before the "save a copy" operation. If http
required or allowed any conversion, it would already have been done.
Binary/opaque/octet download means that "no interpretation" is forced
on the content. Interpretation is up to the application. Conversion
destroys the possibility of user applied interpretation.
There are just endless examples of when your "perfect world" model
fails.
Imagine a tcl source file. You set up a server that allows files
ending in .tcl to be served as text/plain. Problem: tcl files allow
binary data to follow a ^Z or eof. How do you configure a server to
handle such a situation? So you have now vastly complicated the
ability to support source code browsing.
Not sure why the http protocol needs to be burdened with all these
complications. It is already stupidly complex because of the different
platform eol conventions, you want to extend that madness to the body
data.
Since you asked, RFC 1945 Section 3.6.1, to be precise the second and third
paragraphs which read:
Media subtypes of the "text" type use CRLF as the text line break when in
canonical form. However, HTTP allows the transport of text media with plain
CR or LF alone representing a line break when used consistently within the
Entity-Body. HTTP applications must accept CRLF, bare CR, and bare LF as
being representative of a line break in text media received via HTTP.
In addition, if the text media is represented in a character set that does
not use octets 13 and 10 for CR and LF respectively, as is the case for some
multi-byte character sets, HTTP allows the use of whatever octet sequences
are defined by that character set to represent the equivalent of CR and LF
for line breaks. This flexibility regarding line breaks applies only to text
media in the Entity-Body; a bare CR or LF should not be substituted for CRLF
within any of the HTTP control structures (such as header fields and
multipart boundaries).
> And please explain why Mozilla failed your interpretation. Why didn't
> it convert <LF> to <CR><LF>?
It has a bug -- file a bug report with Mozilla.
> ...
> But it isn't even clear when you think such a transformation should
> take place. Http is a transport protocol, not a file saving protocol.
To be exact it is the Hypertext Transfer Protocol (note the second part
of the compound first word in the name).
> My example simply proves that no conversion takes place. If the save
> operation had performed eol conversions, I could not so easily
> demonstrate that the http protocol did not perform the conversion. But
> the saved document did not contain <CR><LF> conversions, proving that
> at least mozilla on windows does not follow your interpretation.
Again, sounds like a mozilla bug -- file a bug report with them.
>...
> The http stuff is long over before the "save a copy" operation. If http
> required or allowed any conversion, it would already have been done.
It does require it -- for text media subtypes, please read the RFC. Pay
particular attention to the sections I quoted above.
> Binary/opaque/octet download means that "no interpretation" is forced
> on the content. Interpretation is up to the application. Conversion
> destroys the possibility of user applied interpretation.
You seem to forget, part of the protocol allows for multiple media types --
the conversion rules *ONLY APPLY TO TEXT MEDIA SUBTYPES*!!! (yes, I meant to
yell!) -- you are using bait-and-switch arguments here!
> There are just endless examples of when your "perfect world" model
> fails.
>
> Imagine a tcl source file. You set up a server that allows files
> ending in .tcl to be served as text/plain. Problem: tcl files allow
> binary data to follow a ^Z or eof. How do you configure a server to
> handle such a situation? So you have now vastly complicated the
> ability to support source code browsing.
Sounds like you have a misconfigured server -- the type should be
application/tcl.
> Not sure why the http protocol needs to be burdened with all these
> complications. It is already stupidly complex because of the different
> platform eol conventions, you want to extend that madness to the body
> data.
Go argue with the W3C and ISO committee -- we are just telling you the way
it *IS*.
BTW, the information took about 2 seconds to access via a Google search with
the following string: "http rfc 1.0".
Of course it took about 60 seconds to scan down to find it. Now it could
have been sped up by doing a *search* for LF (assuming the browser supports
searching on a retrieved page).
Okay, I take it all back. The http package requires a lot of work to
make it comply with your requirements.
Internet Explorer can't even handle files without extensions, if the
content type is text/plain but the file ends in something it doesn't
understand, good luck. Is this another bug I should report?
Still can't figure out when the translation is supposed to take place.
I ran into an issue with http and the -channel option, also with
version 2.7.4, running Windows XP. In my case I wanted to retrieve zip
files and did so with code like:
set outfile [open myzipfile.zip w]
fconfigure $outfile -translation binary
::http::geturl $URL -channel $outfile
close $outfile
This led to corrupted zip files that could not be read by the zip::vfs
package.
Using code like:
set token [::http::geturl $URL]
set outfile [open myzipfile.zip w]
fconfigure $outfile -translation binary
puts -nonewline $outfile [::http::data $token]
close $outfile
gave me the zip files I wanted.
(see my comments at http://wiki.tcl.tk/24061 for some more details).
I will create a bug report for this.
Regards,
Arjen
They are not "my requirements" -- they are the standard.
> Internet Explorer can't even handle files without extensions, if the
> content type is text/plain but the file ends in something it doesn't
> understand, good luck.
I don't use Internet Exploder. Good luck to those that do.
> Is this another bug I should report?
Sure -- or are you being sarcastic and attempting to say that Microsoft
products always meet (or define) the standards?
> Still can't figure out when the translation is supposed to take place.
For text media types the on-the-wire format is supposed to use <CR><LF> as
the EOL marker. In the case of Tcl's http package, it should translate those
to the Tcl-internal standard of <LF>. If the string is then written out to
the file system, it should translate the <LF> to whatever that channel has
been fconfigured to (which by default is the standard for the platform the
program is running on).
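A rough illustration of that layering with plain Tcl channels
(a hand-rolled HTTP/1.0 fetch just to show the translation steps; the
host name is a placeholder):
set sock [socket www.example.com 80]
fconfigure $sock -translation {auto crlf}  ;# in: accept CRLF/CR/LF, out: write CRLF
puts $sock "GET / HTTP/1.0"
puts $sock "Host: www.example.com"
puts $sock ""
flush $sock
set reply [read $sock]    ;# headers and body read together, to keep the sketch short
close $sock
set out [open page.txt w] ;# default output translation writes the platform EOL
puts -nonewline $out $reply
close $out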
While I agree with Pat on this wiki page that this belongs in the SF bug
trackers, note that it specifically is about whether or not to support
chunked transfer encoding.
It appears that older versions of the http package "cheated" a bit by
accepting chunked transfers but not doing the de-chunking, or not in
all cases (see bug 1928131; maybe you can hook onto this one instead of
creating a fresh bug).
-Alex
Thanks, I will do that - I am not an expert in matters of HTTP, so I
won't claim to understand the issue of chunked or non-chunked
transfers.
And the workaround is easy. It was probably my trying to save some
typing that led me to this oddity.
Regards,
Arjen
> Still can't figure out when the translation is supposed to take place.
you don't translate.
you interpret it for display purposes.
( And you store it "as received" on local storage. )
The only exception I ever came across:
gzipped tar archives are marked on occasion such that
they land on the platters as unzipped tar archives.
( never delved into this )
uwe
Done - added this information as a comment to said bug report.
Regards,
Arjen
That's because they're being delivered as application/tar (or
something like that) with the gzipping expressed as a transfer
encoding (IIRC; there are several closely related things, only one of
which is used to express that the data is gzipped).
Donal.
I see that the bug has been fixed and closed. How does this get
reflected in the binary versions? Will the ActiveTcl versions be
available soon?
DrS
You'll be pleased to know that the fix ships in the upcoming 8.5.8
release, and is also (of course) forward-ported to the 8.6 branch, so
it'll be in 8.6b2.
For the delay to equivalent AT versions, ISTR waiting for not much
more than 24 hours ;-)
-Alex
Okay, from my reading of the standard, it only applies to text/plain,
but even that is limited.
In fact, the meaning is almost the opposite of what you claim when you
take into account all media types.
Backing up a little, to include more of the standard:
2.3.1. Canonicalization and Text Defaults

   Internet media types are registered with a canonical form. An
   entity-body transferred via HTTP messages MUST be represented in the
   appropriate canonical form prior to its transmission except for
   "text" types, as defined in the next paragraph.

   When in canonical form, media subtypes of the "text" type use CRLF
   as the text line break. HTTP relaxes this requirement and allows the
   transport of text media with plain CR or LF alone representing a
   line break when it is done consistently for an entire entity-body.
   HTTP applications MUST accept CRLF, bare CR, and bare LF as being
   representative of a line break in text media received via HTTP. In
   addition, if the text is represented in a character set that does
   not use octets 13 and 10 for CR and LF respectively, as is the case
   for some multi-byte character sets, HTTP allows the use of whatever
   octet sequences are defined by that character set to represent the
   equivalent of CR and LF for line breaks. This flexibility regarding
   line breaks applies only to text media in the entity-body; a bare CR
   or LF MUST NOT be substituted for CRLF within any of the HTTP
   control structures (such as header fields and multipart boundaries).

   If an entity-body is encoded with a content-coding, the underlying
   data MUST be in a form defined above prior to being encoded.

...

3.2.1. Type

   When an entity-body is included with a message, the data type of
   that body is determined via the header fields Content-Type and
   Content-Encoding. These define a two-layer, ordered encoding model:

      entity-body := Content-Encoding( Content-Type( data ) )

   Content-Type specifies the media type of the underlying data. Any
   HTTP/1.1 message containing an entity-body SHOULD include a
   Content-Type header field defining the media type of that body,
   unless that information is unknown. If the Content-Type header field
   is not present, it indicates that the sender does not know the media
   type of the data; recipients MAY either assume that it is
   "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the
   content to determine its type.
So it is clear from the first section that CR, LF or CRLF, or any
charset-defined mapping of chars 13 and 10, must be allowed. And the
best way to allow them is to just accept the text "as-is".
The second section says that the server does not have to understand,
and should not guess at, the media types it serves, and the client is
under no obligation to figure out the type either. But if the client
wants to figure out the type, it can look at the content, somewhat
like an xml application would, or the way unix applications work.
But there is still nothing here that talks about transforms or
substitutions of eol chars in text. If the client receives a text
media type which contains a mixture of CR, LF and CRLF as eol, it
seems to me that the client should notify the user and refuse to do
any translation during a save operation. It also seems to me this would
require several passes over the data, so using [fcopy] with
translations on the first pass would cause problems. Maybe it should
save to a temp file on the first pass.
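A sketch of what such a mixed-EOL check might look like, run over a
body that was read in binary mode (the proc name is made up):
proc MixedEol {data} {
    set crlf [regexp -all {\r\n} $data]
    set cr   [regexp -all {\r} $data]
    set lf   [regexp -all {\n} $data]
    set bareCr [expr {$cr - $crlf}]
    set bareLf [expr {$lf - $crlf}]
    # true when more than one line-break convention appears in the body
    expr {(($crlf > 0) + ($bareCr > 0) + ($bareLf > 0)) > 1}
}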
Wow, so now the data accessible to the client/user depends on the
programming environment? Internally, Tcl shouldn't care what the data
represents. This interpretation would lead to endless data corruption
errors.
That is great! Thanks.
DrS