Python 3.1 focuses on the stabilization and optimization of the features and
changes that Python 3.0 introduced. For example, the new I/O system has been
rewritten in C for speed. File system APIs that use unicode strings now handle
paths with undecodable bytes in them. Other features include an ordered
dictionary implementation, a condensed syntax for nested with statements, and
support for ttk Tile in Tkinter. For a more extensive list of changes in 3.1,
see http://doc.python.org/3.1/whatsnew/3.1.html or Misc/NEWS in the Python
distribution.
To download Python 3.1 visit:
http://www.python.org/download/releases/3.1/
The 3.1 documentation can be found at:
Bugs can always be reported to:
Enjoy!
--
Benjamin Peterson
Release Manager
benjamin at python.org
(on behalf of the entire python-dev team and 3.1's contributors)
> Python 3.1 focuses on the stabilization and optimization of the features and
> changes that Python 3.0 introduced. For example, the new I/O system has been
> rewritten in C for speed. File system APIs that use unicode strings now
> handle paths with undecodable bytes in them.
That's a significant improvement. It still decodes os.environ and sys.argv
before you have a chance to call sys.setfilesystemencoding(), but it
appears to be recoverable (with some effort; I can't find any way to re-do
the encoding without manually replacing the surrogates).
However, sys.std{in,out,err} are still created as text streams, and AFAICT
there's nothing you can do about this from within your code.
All in all, Python 3.x still has a long way to go before it will be
suitable for real-world use.
See PEP 383.
> However, sys.std{in,out,err} are still created as text streams, and AFAICT
> there's nothing you can do about this from within your code.
That's intentional, and not going to change. You can access the
underlying byte streams if you want to, as you could already in 3.0.
Regards,
Martin
P.S. Please identify yourself on this newsgroup.
Such as?
Fortunately, I have assiduously avoided the real world, and am happy to
embrace the world from our 'bot overlords.
Congratulations on another release from the hydra-like world of
multi-head development.
--Scott David Daniels
Scott....@Acm.Org
I had a quick look at the documentation, and couldn't see how to do
this. It's the first time I'd read the new IO module documentation, so
I probably missed something obvious. Could you explain how I get the
byte stream underlying sys.stdin? (That should give me enough to find
what I was misunderstanding in the docs).
Thanks,
Paul.
>PM> 2009/6/28 "Martin v. Löwis" <mar...@v.loewis.de>:
>>>> However, sys.std{in,out,err} are still created as text streams, and AFAICT
>>>> there's nothing you can do about this from within your code.
>>>
>>> That's intentional, and not going to change. You can access the
>>> underlying byte streams if you want to, as you could already in 3.0.
>PM> I had a quick look at the documentation, and couldn't see how to do
>PM> this. It's the first time I'd read the new IO module documentation, so
>PM> I probably missed something obvious. Could you explain how I get the
>PM> byte stream underlying sys.stdin? (That should give me enough to find
>PM> what I was misunderstanding in the docs).
http://docs.python.org/3.1/library/sys.html#sys.stdin
--
Piet van Oostrum <pi...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: pi...@vanoostrum.org
You've missed the most obvious place to look for the feature -- the
documentation of sys.stdin :)
http://docs.python.org/3.0/library/sys.html#sys.stdin
>>> import sys
>>> sys.stdin
<io.TextIOWrapper object at 0x7f65df915050>
>>> sys.stdin.buffer
<io.BufferedReader object at 0x7f65df90bdd0>
>>> sys.stdin.read(1)
'\n'
>>> sys.stdin.buffer.read(1)
b'\n'
Christian
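[Editor's sketch, not part of the original post.] The same "buffer" attribute can be tried without a terminal by wrapping an in-memory byte stream. The names read_raw and fake_stdin are made up for illustration; the only assumption is that sys.stdin behaves like any other io.TextIOWrapper here.

```python
import io

def read_raw(text_stream, n):
    """Read n bytes from the binary buffer underneath a text stream."""
    return text_stream.buffer.read(n)

# Stand-in for sys.stdin: a text wrapper over an in-memory byte stream.
# The payload is deliberately not valid UTF-8; it is never decoded.
fake_stdin = io.TextIOWrapper(io.BytesIO(b"\xe4\xf6\xfc\n"), encoding="utf-8")

raw = read_raw(fake_stdin, 3)
print(raw)
```

With the real stream, read_raw(sys.stdin, 1) should behave like the sys.stdin.buffer.read(1) call shown above.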
Thanks. Like you say, the obvious place I didn't think of... :-) (I'd
have experimented, but this PC doesn't have Python 3 installed at the
moment :-()
The "buffer" attribute doesn't seem to be documented in the docs for
the io module. I'm guessing that the TextIOBase class should have a
note that you get at the buffer through the "buffer" attribute?
Paul.
Such as not trying to shoe-horn every byte string it encounters into
Unicode. Some of them really are *just* byte strings.
You're certainly allowed to convert them back to byte strings if you want.
Let's ignore the disinformation. So false it is hardly worth refuting.
> The "buffer" attribute doesn't seem to be documented in the docs for
> the io module. I'm guessing that the TextIOBase class should have a
> note that you get at the buffer through the "buffer" attribute?
Good point. I've now documented it, and the "raw" attribute of BufferedIOBase.
Yes, but do you get back the original byte strings? Maybe I'm missing
something, but my impression is that this is still an issue for the email
module as well as command-line arguments and environment variables.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/
"as long as we like the same operating system, things are cool." --piranha
The email module is, yes, broken. You can recover the bytestrings of
command-line arguments and environment variables.
1. Does Python offer any assistance in doing so, or do you have to
manually convert the surrogates which are generated for unrecognised bytes?
2. How do you do this for non-invertible encodings (e.g. ISO-2022)?
Most of the issues can be worked around by calling
sys.setfilesystemencoding('iso-8859-1') at the start of the program, but
sys.argv and os.environ have already been converted by this point.
>>> Nobody <nobody <at> nowhere.com> writes:
>>>> All in all, Python 3.x still has a long way to go before it will be
>>>> suitable for real-world use.
>>> Such as?
>>
>> Such as not trying to shoe-horn every byte string it encounters into
>> Unicode. Some of them really are *just* byte strings.
>
> Let's ignore the disinformation.
Translation: let's ignore anything which falsifies the assumptions.
> So false it is hardly worth refuting.
Your copy of Trolling by Numbers must be getting pretty dog-eared by now.
>
> On Sun, 28 Jun 2009 19:21:49 +0000, Benjamin Peterson wrote:
>
> >> Yes, but do you get back the original byte strings? Maybe I'm missing
> >> something, but my impression is that this is still an issue for the email
> >> module as well as command-line arguments and environment variables.
> >
> > The email module is, yes, broken. You can recover the bytestrings of
> > command-line arguments and environment variables.
>
> 1. Does Python offer any assistance in doing so, or do you have to
> manually convert the surrogates which are generated for unrecognised bytes?
fs_encoding = sys.getfilesystemencoding()
bytes_argv = [arg.encode(fs_encoding, "surrogateescape") for arg in sys.argv]
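[Editor's sketch, not part of the original post.] The round trip that the two lines above rely on can be checked in isolation. This assumes UTF-8 as the encoding and an interpreter in which the "surrogateescape" handler works in both directions; raw_arg is an invented sample value.

```python
# Bytes that are not valid UTF-8, as might arrive in argv on a POSIX system.
raw_arg = b"caf\xe9-\xff.txt"

# Decoding with surrogateescape maps each undecodable byte 0xNN to U+DCNN.
as_text = raw_arg.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns those surrogates back into the bytes.
recovered = as_text.encode("utf-8", "surrogateescape")

print(recovered == raw_arg)
```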
>
> 2. How do you do this for non-invertible encodings (e.g. ISO-2022)?
What's a non-invertible encoding? I can't find a reference to the term.
Different ISO-2022 strings can map to the same Unicode string.
Thus you can convert back to _some_ ISO-2022 string, but it won't
necessarily match the original.
--
Hallvard
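[Editor's sketch, not part of the original post.] A small illustration of the point about ISO-2022, using Python's iso2022_jp codec as a stand-in for the ISO-2022 family. The sample byte strings are invented; the second one merely adds a redundant "ESC ( B" sequence that selects ASCII, which is already the initial state.

```python
plain = b"abc"
redundant = b"\x1b(Babc"  # ESC ( B re-selects ASCII before the same text

text_a = plain.decode("iso2022_jp")
text_b = redundant.decode("iso2022_jp")

# Both inputs decode to the same str, so encoding cannot tell them apart.
round_trip = text_b.encode("iso2022_jp")

print(text_a == text_b, round_trip == redundant)
```

Two distinct byte strings collapse to one str, so the encoder can only return one of them.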
ISO-2022 cannot be used as a system encoding.
Please do read the responses I write, and please do identify yourself.
Regards,
Martin
+1 QOTW
-- Gerhard
>> > The email module is, yes, broken. You can recover the bytestrings of
>> > command-line arguments and environment variables.
>>
>> 1. Does Python offer any assistance in doing so, or do you have to
>> manually convert the surrogates which are generated for unrecognised bytes?
>
> fs_encoding = sys.getfilesystemencoding()
> bytes_argv = [arg.encode(fs_encoding, "surrogateescape") for arg in sys.argv]
This results in an internal error:
> "\udce4\udceb\udcef\udcf6\udcfc".encode("iso-8859-1", "surrogateescape")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
SystemError: Objects/bytesobject.c:3182: bad argument to internal function
[FWIW, the error corresponds to _PyBytes_Resize, which has a
cautionary comment almost as large as the code.]
The documentation gives the impression that "surrogateescape" is only
meaningful for decoding.
>> 2. How do you do this for non-invertible encodings (e.g. ISO-2022)?
>
> What's a non-invertible encoding? I can't find a reference to the term.
One where different inputs can produce the same output.
>> That's a significant improvement. It still decodes os.environ and sys.argv
>> before you have a chance to call sys.setfilesystemencoding(), but it
>> appears to be recoverable (with some effort; I can't find any way to re-do
>> the encoding without manually replacing the surrogates).
>
> See PEP 383.
Okay, that's useful, except that it may have some bugs:
> r = "\udce4\udceb\udcef\udcf6\udcfc".encode("iso-8859-1", "surrogateescape")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
SystemError: Objects/bytesobject.c:3182: bad argument to internal function
Trying a few random test cases suggests that the ratio of valid to invalid
bytes has an effect. Strings which consist mostly of invalid bytes trigger
the error, those which are mostly valid don't.
The error corresponds to _PyBytes_Resize(), which has the following
words of caution in a preceding comment:
/* The following function breaks the notion that strings are immutable:
it changes the size of a string. We get away with this only if there
is only one module referencing the object. You can also think of it
as creating a new string object and destroying the old one, only
more efficiently. In any case, don't use this if the string may
already be known to some other part of the code...
Note that if there's not enough memory to resize the string, the original
string object at *pv is deallocated, *pv is set to NULL, an "out of
memory" exception is set, and -1 is returned. Else (on success) 0 is
returned, and the value in *pv may or may not be the same as on input.
As always, an extra byte is allocated for a trailing \0 byte (newsize
does *not* include that), and a trailing \0 byte is stored.
*/
Assuming that this gets fixed, it should make most of the problems with
3.0 solvable. OTOH, it wouldn't have killed them to have added e.g.
sys.argv_bytes and os.environ_bytes.
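[Editor's sketch, not part of the original post.] For readers on later interpreters: Python 3.2 and newer provide os.environb, os.supports_bytes_environ and os.fsencode(), none of which existed in the 3.1 under discussion. A bytes view of the environment might then look like this; environ_bytes is a made-up helper name.

```python
import os

def environ_bytes():
    """Return the environment as a dict of bytes keys and values."""
    if os.supports_bytes_environ:
        # POSIX: the interpreter exposes the raw bytes directly.
        return dict(os.environb)
    # Elsewhere, fall back to re-encoding the str mapping.
    return {os.fsencode(k): os.fsencode(v) for k, v in os.environ.items()}

env = environ_bytes()
print(len(env))
```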
>> However, sys.std{in,out,err} are still created as text streams, and AFAICT
>> there's nothing you can do about this from within your code.
>
> That's intentional, and not going to change. You can access the
> underlying byte streams if you want to, as you could already in 3.0.
Okay, I've since been pointed to the relevant information (I was looking
under "File Objects"; I didn't think to look at "sys").
Please report a bug on http://bugs.python.org
As for a bytes version of sys.argv and os.environ, you're welcome to propose a
patch (this would be a separate issue on the aforementioned issue tracker).
Thanks
Antoine.
That's hopeless to keep track of across modules if something modifies
sys.argv or os.environ.
If the current scheme for recovering the original bytes proves
insufficient, what could work is a string type which can have an
attribute with the original bytes (if the source was bytes). And/or
sys.argv and os.environ maintaining the correspondence when feasible.
Anyway, I haven't looked at whether any of this is a problem, so don't
mind me:-) As long as it's definitely possible to tell python once
and for all not to apply locales and string conversions, instead of
having to keep track of an ever-expanding list of variables to tame
its bytes->character conversions (as happened with Emacs).
--
Hallvard
But please be aware that such a proposal would have to consider:
1. That on Windows, the native form is the character version, and the
bytes version would have to address all the same sorts of encoding
issues that the OP is complaining about in the character versions. [1]
2. That the proposal address the question of how to write portable,
robust, code (given that choosing argv vs argv_bytes based on
sys.platform is unlikely to count as a good option...)
3. Why defining your own argv_bytes as argv_bytes =
[a.encode("iso-8859-1", "surrogateescape") for a in sys.argv] is
insufficient (excluding issues with bugs, which will be fixed
regardless) for the occasional cases where it's needed.
Before writing the proposal, the OP should probably review the
extensive discussions which can be found in the python-dev archives.
It would be wrong for people reading this thread to think that the
implemented approach is in any sense a "quick fix" - it's certainly a
compromise (and no-one likes all aspects of any compromise!) but it's
one made after a lot of input from people with widely differing
requirements.
Paul.
[1] And my understanding, from the PEP, is that even on POSIX, the
argv and environ data is intended to be character data, even though
the native C APIs expose a byte-oriented interface. So conceptually,
character format is "correct" on POSIX as well... (But I don't write
code for POSIX systems, so I'll leave it to the POSIX users to debate
this point further).
>> Okay, that's useful, except that it may have some bugs:
>> (...)
>> Assuming that this gets fixed, it should make most of the problems with
>> 3.0 solvable. OTOH, it wouldn't have killed them to have added e.g.
>> sys.argv_bytes and os.environ_bytes.
>
> That's hopeless to keep track of across modules if something modifies
> sys.argv or os.environ.
Oh, I wasn't suggesting that they should be updated. Just that there
should be some way to get at the original data.
The mechanism used in 3.1 is sufficient. I'm mostly concerned that it's
*possible* to recover the data; convenience is of secondary importance.
Calling sys.setfilesystemencoding('iso-8859-1') right at the start of the
code eliminates most of the issues. It's just the stuff which happens
before the first line of code is executed (sys.argv, os.environ, sys.stdin
etc) which was problematic.
[BTW, it isn't just Python that has problems. The directory where I was
performing tests happened to be an svn checkout. A subsequent "svn update"
promptly crapped out because I'd left behind a file whose name wasn't
valid ASCII.]
> Nobody <nobody <at> nowhere.com> writes:
>>
>> This results in an internal error:
>>
>> > "\udce4\udceb\udcef\udcf6\udcfc".encode("iso-8859-1", "surrogateescape")
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> SystemError: Objects/bytesobject.c:3182: bad argument to internal function
>
> Please report a bug on http://bugs.python.org
Done.
> As for a bytes version of sys.argv and os.environ, you're welcome to propose a
> patch (this would be a separate issue on the aforementioned issue tracker).
Assuming that the above bug gets fixed, it isn't really necessary. In
particular, maintaining bytes/string versions in the presence of updates
is likely to be more trouble than it's worth.
>> As for a bytes version of sys.argv and os.environ, you're welcome to
>> propose a patch (this would be a separate issue on the aforementioned
>> issue tracker).
>
> But please be aware that such a proposal would have to consider:
>
> 1. That on Windows, the native form is the character version, and the
> bytes version would have to address all the same sorts of encoding
> issues that the OP is complaining about in the character versions. [1]
A bytes version doesn't make sense on Windows (at least, not on the
NT-based versions, and the DOS-based branch isn't worth bothering about,
IMHO).
Also, Windows *needs* to deal with characters due to the
fact that filenames, environment variables, etc are case-insensitive.
> 2. That the proposal address the question of how to write portable,
> robust, code (given that choosing argv vs argv_bytes based on
> sys.platform is unlikely to count as a good option...)
There is a tension here between robustness and portability. In my
situation, robustness means getting the "unadulterated" data. I can always
adulterate it myself if I need to.
> 3. Why defining your own argv_bytes as argv_bytes =
> [a.encode("iso-8859-1", "surrogateescape") for a in sys.argv] is
> insufficient (excluding issues with bugs, which will be fixed
> regardless) for the occasional cases where it's needed.
Other than the bug, it appears to be sufficient. I don't need to support
a locale where nl_langinfo(CODESET) is ISO-2022 (I *do* need to support
lossless round-trip of ISO-2022 filenames, possibly stored in argv and
maybe even in environ, but that's a different matter; the code only
really needs to run with LANG=C).
> [1] And my understanding, from the PEP, is that even on POSIX, the
> argv and environ data is intended to be character data, even though
> the native C APIs expose a byte-oriented interface. So conceptually,
> character format is "correct" on POSIX as well... (But I don't write
> code for POSIX systems, so I'll leave it to the POSIX users to debate
> this point further).
Even if it's "intended" to be character data, it isn't *required* to be.
In particular, it's not required to be in the locale's encoding.
A common example of what I need to handle is:
find /www ... -print0 | xargs -0 myscript
where the filenames can be in a wide variety of different encodings
(sometimes even within a single directory).
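[Editor's sketch, not part of the original post.] One way the receiving script could get its filenames back as bytes, assuming a POSIX system and Python 3.2 or newer, where os.fsencode() and os.fsdecode() apply the filesystem encoding with the surrogateescape handler. argv_as_bytes and the sample filename are hypothetical.

```python
import os

def argv_as_bytes(argv):
    """Map str arguments back to the bytes the OS handed to the process."""
    return [os.fsencode(arg) for arg in argv]

# Simulate an argument whose bytes do not match the locale's encoding.
original = b"caf\xe9.txt"
as_str = os.fsdecode(original)   # what sys.argv would hold
recovered = argv_as_bytes([as_str])

print(recovered)
```

In a real script the call would be argv_as_bytes(sys.argv[1:]), and the resulting bytes can be passed straight to open() or os.stat().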