I just ran 2to3 on a py2.5 script that does pattern matching on the text of a web page. The resulting script crashed, because when I did
    f = urllib.request.urlopen(url)
    text = f.read()
then "text" is a bytes object, not a string, and so I can't do a regexp on it.
Of course, this is easy to patch: just do "f.read().decode()". However, it strikes me as an obvious bug, which ought to be fixed. That is, read() should return a string, as it did in py2.5.
But apparently others disagree? This was mentioned in issue 3930 ( http://bugs.python.org/issue3930 ) back in September '08, but that issue is now closed, apparently because consistent behavior was achieved. But I figure consistently bad behavior is still bad.
This change breaks pretty much every Python program that opens a webpage, doesn't it? 2to3 doesn't catch it, and, in any case, why should read() return bytes, not string? Am I missing something?
By the way, I'm running Ubuntu 8.10. Doing "python3 --version" prints "Python 3.0rc1+".
On Dec 22, 3:41 pm, "Glenn G. Chappell" <glenn.chapp...@gmail.com> wrote:
> I just ran 2to3 on a py2.5 script that does pattern matching on the
> text of a web page. The resulting script crashed, because when I did
> f = urllib.request.urlopen(url)
> text = f.read()
> then "text" is a bytes object, not a string, and so I can't do a
> regexp on it.
> Of course, this is easy to patch: just do "f.read().decode()".
> However, it strikes me as an obvious bug, which ought to be fixed.
> That is, read() should return a string, as it did in py2.5.
Well, I can't agree that it's an obvious bug (in Python 3). It might be something worth raising a warning over in 2to3. It would also be a reasonable wishlist item for automatic encoding detection and conversion to a string (see below). But it's not a bug.
> But apparently others disagree? This was mentioned in issue 3930
> (http://bugs.python.org/issue3930) back in September '08, but that
> issue is now closed, apparently because consistent behavior was
> achieved. But I figure consistently bad behavior is still bad.
> This change breaks pretty much every Python program that opens a
> webpage, doesn't it?
No. What if someone is using urllib to retrieve (say) a JPEG image? A bytes object is what they'd want in Python 3. Also, many people were already explicitly dealing with encodings in Python 2.5; the change wouldn't affect them.
> 2to3 doesn't catch it, and, in any case, why
> should read() return bytes, not string? Am I missing something?
It returns bytes because it doesn't know what encoding to use. This is the appropriate behavior.
HOWEVER... a web page request often does know what encoding to use, since it has to parse the headers anyway. IF a URL request's "Content-Type" is text and the header declares a charset, it would be reasonable for urllib to have an option to decode automatically and return a string instead of bytes. (For all I know, it already can do that.)
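A minimal sketch of that idea, for readers following along. The helper name decode_body and the UTF-8 fallback are assumptions made for illustration, not an existing urllib option. It relies on the fact that HTTP headers in Python 3 are exposed as an email.message.Message-style object, whose get_content_charset() method returns the declared charset or None.

```python
from email.message import Message


def decode_body(body, headers, fallback="utf-8"):
    """Decode a response body using the charset declared in Content-Type.

    `headers` is any email.message.Message-like object. The fallback
    encoding is an assumption of this sketch, used when nothing is declared.
    """
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the Content-Type header does not declare one.
    charset = headers.get_content_charset() or fallback
    return body.decode(charset)


# Stand-in for a real response, so the sketch runs without network access.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
body = "café".encode("iso-8859-1")

text = decode_body(body, headers)
```

With a real request, the object returned by urllib.request.urlopen(url) carries such a headers object as its .headers attribute, so the call would look like decode_body(f.read(), f.headers). This still does nothing for pages that declare their encoding only inside the HTML.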
> I just ran 2to3 on a py2.5 script that does pattern matching on the
> text of a web page. The resulting script crashed, because when I did
> f = urllib.request.urlopen(url)
> text = f.read()
> then "text" is a bytes object, not a string, and so I can't do a
> regexp on it.
> Of course, this is easy to patch: just do "f.read().decode()".
> However, it strikes me as an obvious bug, which ought to be fixed.
> That is, read() should return a string, as it did in py2.5.
It's not possible unless you know the encoding of the bytes. Network I/O only returns bytes and you must decode them explicitly. Your "patch" assumes UTF-8 and breaks as soon as a remote site returns the data in a different encoding. It also breaks if the site returns an image/*, application/*, audio/* or any other MIME type than text. There is no generic and simple way to detect the encoding of a remote site. Sometimes the encoding is mentioned in the HTTP header, sometimes it's embedded in the <head> section of the HTML document.
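To make the failure mode concrete, here is a small sketch using a made-up page encoded as ISO-8859-1. The page content and the regular expression are assumptions for illustration only; real pages declare their charset in more ways than this pattern covers. It also shows that a regexp can run on bytes directly, as long as the pattern itself is a bytes pattern.

```python
import re

# Bytes as a server using ISO-8859-1 might send them (hypothetical page).
page = '<html><head><meta charset="iso-8859-1"></head><body>café</body></html>'
raw = page.encode("iso-8859-1")

# A bare decode() assumes UTF-8 and fails on the byte 0xE9.
try:
    raw.decode()
    bare_decode_worked = True
except UnicodeDecodeError:
    bare_decode_worked = False

# A bytes pattern can search the raw data before any decoding happens.
match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw, re.IGNORECASE)
declared = match.group(1).decode("ascii") if match else None

# Decode only once an encoding has actually been found.
text = raw.decode(declared) if declared else None
```

Searching the <head> like this works only because the markup itself is ASCII-compatible in most encodings; it is a heuristic, not a guarantee.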
> This change breaks pretty much every Python program that opens a
> webpage, doesn't it? 2to3 doesn't catch it, and, in any case, why
> should read() return bytes, not string? Am I missing something?
I hope I was able to explain the issue. By the way, Python 2.x and 3.0 both return bytes (str in 2.x, bytes in 3.0).
On Dec 22, 8:25 pm, Christian Heimes <li...@cheimes.de> wrote:
> It's not possible unless you know the encoding of the bytes. Network io
> only returns byte and you must encode it explicitly. [...]
> There is no generic and simple way to detect the encoding of a remote
> site. Sometimes the encoding is mentioned in the HTTP header, sometimes
> it's embedded in the <head> section of the HTML document.
That said, a "decode to declared HTTP header encoding" version of urlopen could be useful to give some users the output they want (text from network io) or to make it clear why bytes is the safe way.
> That said, a "decode to declared HTTP header encoding" version of
> urlopen could be useful to give some users the output they want (text
> from network io) or to make it clear why bytes is the safe way.
Yeah, your idea sounds both useful and feasible. A patch is welcome! :)
Okay, so I guess I didn't really *get* the whole unicode/text/binary thing. Maybe I still don't, but I think I'm getting closer. Thanks to everyone who replied.
On Dec 22, 1:41 pm, ajaksu <aja...@gmail.com> wrote:
> On Dec 22, 8:25 pm, Christian Heimes <li...@cheimes.de> wrote:
> That said, a "decode to declared HTTP header encoding" version of
> urlopen could be useful to give some users the output they want (text
> from network io) or to make it clear why bytes is the safe way.
Sounds like a great idea. More to the point, it sounds like it's pretty much a necessary idea.
Consider: reading a web page is an easy one-liner. Now, no one is going to write that one-liner, and then spend 20 lines trying to get the Content-Type and encoding figured out. Instead we're all going to do it the short, easy, *wrong* way. So every program in the world that uses urlopen gets to have the same bug. Not good. The *right* way needs to be the *easy* way.
> Okay, so I guess I didn't really *get* the whole unicode/text/binary
> thing. Maybe I still don't, but I think I'm getting closer. Thanks to
> everyone who replied.
The basic principle is easy. On the one hand you have text as unicode data; on the other hand you have binary data that may contain text in an arbitrary encoding. To get the text you have to decode the byte data into unicode. The other direction is called encoding.
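A short sketch of the two directions in Python 3; the sample string is arbitrary and chosen only because it contains non-ASCII characters.

```python
text = "naïve café"  # text: a str of Unicode code points

utf8 = text.encode("utf-8")      # encoding: str -> bytes
latin1 = text.encode("latin-1")  # same text, different bytes

# Decoding reverses the step, provided the same encoding is named.
assert utf8.decode("utf-8") == text
assert latin1.decode("latin-1") == text

# Decoding with the wrong encoding gives garbled characters, not the text.
garbled = utf8.decode("latin-1")
```

The same ten characters become twelve bytes in UTF-8 and ten in Latin-1, which is why the bytes alone do not tell you what the text is.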
Everybody in the whole world has to deal with unicode *unless* you live in the USA and all you have is plain and simple ASCII text. Python 2.x makes no distinction between ASCII text and arbitrary bytes. Both are stored in the str type. This makes it easy for ASCII country but the rest of the world suffers the consequences.
Python 3.0 makes a hard break for ASCII people because with 3.0 really everybody has to deal with encodings. There is no more implicit conversion between ASCII text and unicode. http://www.joelonsoftware.com/articles/Unicode.html explains it in great detail.
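As an illustration of the missing implicit conversion, a sketch of what happens in Python 3 when bytes and str are mixed. The variable names are made up for the example.

```python
data = b"abc"  # bytes, e.g. as read from a socket
text = "abc"   # str

# In Python 3 the two types never compare equal...
equal = (data == text)

# ...and cannot be concatenated.
try:
    data + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion has to be spelled out, naming the encoding.
joined = data.decode("ascii") + text
```

In Python 2.x the equivalent str and unicode values would have been combined silently, as long as the data happened to be ASCII.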
> On Dec 22, 1:41 pm, ajaksu <aja...@gmail.com> wrote:
>> On Dec 22, 8:25 pm, Christian Heimes <li...@cheimes.de> wrote:
>> That said, a "decode to declared HTTP header encoding" version of
>> urlopen could be useful to give some users the output they want (text
>> from network io) or to make it clear why bytes is the safe way.
> Sounds like a great idea. More to the point, it sounds like it's
> pretty much a necessary idea.
> Consider: reading a web page is an easy one-liner. Now, no one is
> going to write that one-liner, and then spend 20 lines trying to get
> the Content-Type and encoding figured out. Instead we're all going to
> do it the short, easy, *wrong* way. So every program in the world that
> uses urlopen gets to have the same bug. Not good. The *right* way
> needs to be the *easy* way.
Python 2.x suffers from the same problem. It just doesn't tell you from the beginning that you need to deal with the problem. With 2.x you can read websites fine - until you have to deal with non-English, non-ASCII text. 3.0 forces the developer to think about the issue right from the beginning. No more excuses :)
I suggest somebody makes a feature request for 3.1. A patch with a unit test increases the chances of the patch being accepted by at least an order of magnitude.
On Dec 22, 9:05 pm, Christian Heimes <li...@cheimes.de> wrote:
> ajaksu wrote:
> > That said, a "decode to declared HTTP header encoding" version of
> > urlopen could be useful to give some users the output they want (text
> > from network io) or to make it clear why bytes is the safe way.
> Yeah, your idea sounds both useful and feasible. A patch is welcome! :)
Would monkeypatching what urlopen returns be good enough or should we aim at a cleaner implementation?
Any comments/suggestions are very welcome. I could use some help from people that know urllib on the best way to get the charset. Maybe after some sleep I can code it in a less awful way :)
ajaksu wrote:
> On Dec 22, 9:05 pm, Christian Heimes <li...@cheimes.de> wrote:
>> ajaksu wrote:
>>> That said, a "decode to declared HTTP header encoding" version of
>>> urlopen could be useful to give some users the output they want (text
>>> from network io) or to make it clear why bytes is the safe way.
>> Yeah, your idea sounds both useful and feasible. A patch is welcome! :)
> Would monkeypatching what urlopen returns be good enough or should we
> aim at a cleaner implementation?
If you want to do it right ... It should be a clean patch against the py3k svn branch including documentation and a unit test. Don't worry! It's not as hard as it sounds. Besides Python core development is fun. :)