from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:
"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."
i'm on Unix. (linux, ubuntu edgy)
so it seems that it does not always return unicode filenames.
it seems that it tries to interpret the filenames using the filesystem's
encoding, and if that fails, it simply returns the filename as byte-string.
so you get back, let's say, an array of 21 filenames, of which 3 are
byte-strings and the rest unicode strings.
after digging around, i found this in the source code:
> #ifdef Py_USING_UNICODE
>     if (arg_is_unicode) {
>         PyObject *w;
>
>         w = PyUnicode_FromEncodedObject(v,
>                                         Py_FileSystemDefaultEncoding,
>                                         "strict");
>         if (w != NULL) {
>             Py_DECREF(v);
>             v = w;
>         }
>         else {
>             /* fall back to the original byte string, as
>                discussed in patch #683592 */
>             PyErr_Clear();
>         }
>     }
> #endif
so if the to-unicode conversion fails, it falls back to the original
byte-string. i went and read the patch-discussion.
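in pure Python, the logic of that C snippet amounts to something like the
following (my own sketch, not the interpreter source; `listdir_like` and
`fs_encoding` are made-up names):

```python
import sys

def listdir_like(raw_names, fs_encoding=None):
    # Mimic the fallback above: decode each name strictly with the
    # file system encoding, and keep the raw byte string on failure.
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(fs_encoding, 'strict'))
        except UnicodeDecodeError:
            result.append(raw)  # fall back, as in patch #683592
    return result
```

so a single call can hand you a mixed list, which is exactly the surprise
described above.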
and now i'm not sure what to do.
i know that:
1. the documentation is completely wrong. it does not always return
unicode filenames
2. it's true that the documentation does not specify what happens if the
filename is not in the filesystem-encoding, but i simply expected that i
get a Unicode-exception, as everywhere else. you see, exceptions are
ok, i can deal with them. but this is just plain wrong. from now on,
EVERYWHERE i use os.listdir, i will have to go through all the
filenames in it and check whether they are unicode-strings or not.
so basically i'd like to ask here: am i reading something incorrectly?
or am i using os.listdir the "wrong way"? how do other people deal with
this?
p.s: one additional note. if your code expects os.listdir to return
unicode, that usually means that all your code uses unicode strings.
which in turn means, that those filenames will somehow later interact
with unicode strings. which means that that byte-string-filename will
probably get auto-converted to unicode at a later point, and that
auto-conversion will VERY probably fail, because the auto-convert only
happens using 'ascii' as the encoding, and if it was not possible to
decode the filename inside listdir, it's quite probable that it also
will not work using 'ascii' as the charset.
gabor
Unless someone says otherwise, report the discrepancy between doc and code
as a bug on the SF tracker. I have no idea of what the resolution should
be ;-).
tjr
You are reading it correctly. This is how it behaves.
> or am i using os.listdir the "wrong way"? how do other people deal with
> this?
You didn't say why the behavior causes a problem for you - you only
explained what the behavior is.
Most people use os.listdir in a way like this:
for name in os.listdir(path):
    full = os.path.join(path, name)
    attrib = os.stat(full)
    if some_condition:
        f = open(full)
        ...
All this code will typically work just fine with the current behavior,
so people typically don't see any problem.
Regards,
Martin
i am sorry, but it will not work. actually this is exactly what i did,
and it did not work. it dies in the os.path.join call, where file_name
is converted into unicode. and python uses 'ascii' as the charset in
such cases. but, because listdir already failed to decode the file_name
with the filesystem-encoding, it usually also fails when tried with 'ascii'.
example:
>>> dir_name = u'something'
>>> unicode_file_name = u'\u732b.txt' # the japanese cat-symbol
>>> bytestring_file_name = unicode_file_name.encode('utf-8')
>>>
>>>
>>> import os.path
>>>
>>> os.path.join(dir_name,unicode_file_name)
u'something/\u732b.txt'
>>>
>>>
>>> os.path.join(dir_name,bytestring_file_name)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/posixpath.py", line 65, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1:
ordinal not in range(128)
>>>
gabor
Ah, right. So yes, it will typically fail immediately - just as you
wanted it to do, anyway; the advantage with this failure is that you
can also find out what specific file name is causing the problem
(whereas when listdir failed completely, you could not easily find
out the cause of the failure).
How would you propose listdir should behave?
Regards,
Martin
Umm, just a wild guess, but how about raising an exception which includes
the name of the file which could not be decoded?
Jean-Paul
i also recommend this approach.
also, raising an exception goes well with the principle of the least
surprise imho.
gabor
There may be multiple of these, of course, but I assume that you want
it to report the first one it encounters?
Regards,
Martin
Are you saying you wouldn't have been surprised if that had been
the behavior? How would you deal with that exception in your code?
Regards,
Martin
> get a Unicode-exception, as everywhere else. you see, exceptions are
> ok, i can deal with them.
> p.s: one additional note. if your code expects os.listdir to return
> unicode, that usually means that all your code uses unicode strings.
> which in turn means, that those filenames will somehow later interact
> with unicode strings. which means that that byte-string-filename will
> probably get auto-converted to unicode at a later point, and that
> auto-conversion will VERY probably fail
it will raise an exception, most likely. didn't you just say that
exceptions were ok?
</F>
Maybe, for each filename, you can test whether it is a unicode string,
and if not, convert it to unicode using the encoding indicated by
sys.getfilesystemencoding().
Have a try.
A+
Laurent.
>>How would you propose listdir should behave?
>
> Umm, just a wild guess, but how about raising an exception which includes
> the name of the file which could not be decoded?
Suppose you have a directory with just some files having a name that can't
be decoded with the file system encoding. So `listdir()` fails at this
point and raises an exception. How would you get the names then? Even the
ones that *can* be decoded? This doesn't look very nice:
path = u'some path'
try:
    files = os.listdir(path)
except UnicodeError:
    files = os.listdir(path.encode(sys.getfilesystemencoding()))
    # Decode and filter the list "manually" here.
Ciao,
Marc 'BlackJack' Rintsch
i don't think it would work, because os.listdir already tried and
failed (that's why we got a byte-string and not a unicode-string)
gabor
How about returning two lists, first list contains unicode names, the
second list contains undecodable names:
files, troublesome = os.listdir(separate_errors=True)
and make separate_errors=True by default in python 3.0 ?
-- Leo
i agree that it does not look very nice.
but does this look nicer? :)
path = u'some path'
files = os.listdir(path)

def check_and_fix_wrong_filename(file):
    if isinstance(file, unicode):
        return file
    else:
        # somehow convert it to unicode and return it, e.g.:
        return file.decode(sys.getfilesystemencoding(), 'replace')

files = [check_and_fix_wrong_filename(f) for f in files]
in other words, your opinion is that the proposed solution is not
optimal, or that the current behavior is fine?
gabor
yes, but it's raised at the wrong place imho :)
(just to clarify: simply pointing out this behavior in the documentation
is also one of the possible solutions)
for me the current behavior seems as if file-reading would work like this:
a = open('foo.txt')
data = a.read()
a.close()
print data
>>> TheFileFromWhichYouHaveReadDidNotExistException
gabor
yes, i would not have been surprised, because it's kind-of expected when
dealing with input that malformed input raises a unicode-exception.
and i would also expect, that if os.listdir completed without raising an
exception, then the returned data is correct.
> How would you deal with that exception in your code?
depends on the application. in the one where it happened i would just
display an error message, and tell the admins to check the
filesystem-encoding.
(in other ones, where it's not critical to get the correct name, i would
probably just convert the text to unicode using the "replace" behavior)
what about using flags similar to how unicode() works? strict, ignore,
replace and maybe keep-as-bytestring.
like:
os.listdir(dirname,'strict')
i know it's not the most elegant, but it would solve most of the
use-cases imho (at least my use-cases).
gabor
Strange coincidence, as I was wrestling with this problem only yesterday.
I found this most illuminating discussion on the topic with
contributions from Mr Löwis and others:
http://www.thescripts.com/forum/thread41954.html
/johan
The problem is that most programmers just don't want to deal with
filesystem garbage but they won't be happy if the program breaks
either.
> > How would you deal with that exception in your code?
>
> depends on the application. in the one where it happened i would just
> display an error message, and tell the admins to check the
> filesystem-encoding.
>
> (in other ones, where it's not critical to get the correct name, i would
> probably just convert the text to unicode using the "replace" behavior)
>
> what about using flags similar to how unicode() works? strict, ignore,
> replace and maybe keep-as-bytestring.
>
> like:
> os.listdir(dirname,'strict')
That's actually an interesting idea. The error handling modes could be:
'mix' -- current behaviour, 'ignore' -- drop names that cannot be
decoded, 'separate' -- see my other message.
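a rough sketch of how those modes could behave, built on top of a
byte-string listing (the function and its signature are hypothetical,
not an existing API):

```python
import sys

def listdir_with_errors(raw_names, errors='mix', fs_encoding=None):
    # Hypothetical error-handling modes for a directory listing:
    #   'mix'      -- decoded names where possible, byte strings otherwise
    #   'ignore'   -- silently drop undecodable names
    #   'separate' -- return (decoded, undecodable) as two lists
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    decoded, undecodable = [], []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(fs_encoding))
        except UnicodeDecodeError:
            undecodable.append(raw)
    if errors == 'separate':
        return decoded, undecodable
    if errors == 'ignore':
        return decoded
    # 'mix' (roughly the current behaviour, though the original
    # directory order is not preserved in this sketch)
    return decoded + undecodable
```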
-- Leo
I think this is very "special" code as you can't use the fixed names to
open the files anymore unless you guess the encoding correctly. I think
it's a bit fragile. Wouldn't it be a better solution to convert the
`path` to the file system encoding for getting the file names? This way
you can use all the names to process the files.
> in other words, your opinion is that the proposed solution is not
> optimal, or that the current behavior is fine?
I think the current behavior is okay but should be documented.
Maybe I just haven't had enough use cases yet that needed the names as
unicode objects, and from my linux file system experience file names are
just byte strings with two limitations: no slashes and no zero bytes. :-)
Ciao,
Marc 'BlackJack' Rintsch
That would be quite an incompatible change, no?
Regards,
Martin
Of course, it's possible to implement this on top of the existing
listdir operation.
def failing_listdir(dirname, mode):
    result = os.listdir(dirname)
    if mode != 'strict':
        return result
    for r in result:
        if isinstance(r, str):
            raise UnicodeError("undecodable filename: %r" % r)
    return result
Regards,
Martin
>> How about returning two lists, first list contains unicode names, the
>> second list contains undecodable names:
>>
>> files, troublesome = os.listdir(separate_errors=True)
>>
>> and make separate_errors=True by default in python 3.0 ?
>
> That would be quite an incompatible change, no?
it also violates a fundamental design rule for the standard library.
</F>
Yeah, that was an idea-dump. Actually it is possible to make this idea
mostly backward compatible by making os.listdir() return only unicode
names and os.binlistdir() return only binary directory entries.
Unfortunately the same trick will not work for getcwd.
Another idea is to map all 256 bytes to unicode private code points.
When a file name cannot be fully decoded, the undecoded bytes would be
mapped to specially allocated code points. Unfortunately this idea
seems to leak if the program later wants to write such a unicode string
to a file. Python will have to throw an exception, since we don't know
if it is ok to write a broken string to a file. So we are back to square
one: programs need to deal with filesystem garbage :(
-- Leo
yes, sure... but then.. it's possible to implement it also on top of a
raise-when-error version :)
so, what do you think, how should this issue be solved?
currently i see 3 ways:
1. simply fix the documentation, and state that if the file-name cannot
be decoded into unicode, then it's returned as byte-string. but that
also means, that the typical usage of:
[os.path.join(path,n) for n in os.listdir(path)]
will not work.
2. add support for some unicode-decoding flags, like i wrote before
3. some solution.
?
gabor
> yes, sure... but then.. it's possible to implement it also on top of a
> raise-when-error version :)
not necessarily if raise-when-error means raise-error-in-os-listdir.
</F>
could you please clarify?
currently i see 2 approaches how to do it on the raise-when-error version:
1.

dirname = u'something'
try:
    files = os.listdir(dirname)
except UnicodeError:
    byte_files = os.listdir(dirname.encode(sys.getfilesystemencoding()))
    # do something with it

2.

dirname = u'something'
byte_files = os.listdir(dirname.encode(sys.getfilesystemencoding()))
for byte_file in byte_files:
    try:
        file = byte_file.decode(sys.getfilesystemencoding())
    except UnicodeError:
        pass  # do something else
    else:
        pass  # do something with the decoded name
the byte-string version of os.listdir remains. so all the other versions
can be implemented on the top of it. imho the question is:
which should be the 'default' behavior, offered by the python standard
library.
gabor
For 2.5, this should be done. Contributions are welcome.
[...then]
> [os.path.join(path,n) for n in os.listdir(path)]
>
> will not work.
>
> 2. add support for some unicode-decoding flags, like i wrote before
I may have missed something, but did you present a solution that would
make the case above work?
> 3. some solution.
One approach I had been considering is to always make the decoding
succeed, by using the private-use-area of Unicode to represent bytes
that don't decode correctly.
Regards,
Martin
if we use the same decoding flags as binary-string.decode(),
then we could do:
[os.path.join(path,n) for n in os.listdir(path,'ignore')]
or
[os.path.join(path,n) for n in os.listdir(path,'replace')]
it's not an elegant solution, but it would solve i think most of the
problems.
>
>> 3. some solution.
>
> One approach I had been considering is to always make the decoding
> succeed, by using the private-use-area of Unicode to represent bytes
> that don't decode correctly.
>
hmm..an interesting idea..
and what happens with such texts, when they are encoded into let's say
utf-8? are the in-private-use-area characters ignored?
gabor
That wouldn't work. The characters in the file name that didn't
decode would be dropped, so the resulting file names would be
invalid. Trying to do os.stat() on such a file name would raise
an exception that the file doesn't exist.
> [os.path.join(path,n) for n in os.listdir(path,'replace')]
Likewise. The characters would get replaced with REPLACEMENT
CHARACTER; passing that to os.stat would give an encoding
error.
> it's not an elegant solution, but it would solve i think most of the
> problems.
No, it wouldn't. This idea is as bad or worse than just dropping
these file names from the directory listing.
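the lossiness is easy to demonstrate: decode with 'replace', encode
back, and the bytes differ, so the resulting name no longer matches
anything on disk (the file name here is just an example):

```python
raw = b'\xe7\x8c\xab.txt'  # UTF-8 bytes of u'\u732b.txt', the cat-symbol name
mangled = raw.decode('ascii', 'replace')  # undecodable bytes become U+FFFD
# the round-trip is lossy, so os.stat(mangled) would fail with ENOENT
assert mangled.encode('utf-8') != raw
```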
>> One approach I had been considering is to always make the decoding
>> succeed, by using the private-use-area of Unicode to represent bytes
>> that don't decode correctly.
>>
>
> hmm..an interesting idea..
>
> and what happens with such texts, when they are encoded into let's say
> utf-8? are the in-private-use-area characters ignored?
UTF-8 supports encoding of all Unicode characters, including the PUA
blocks.
py> u"\ue020".encode("utf-8")
'\xee\x80\xa0'
Regards,
Martin
i think that depends on the point of view.
if you need to do something later with the content of files, then you're
right.
but if all you need is to display them for example...
>
>>> One approach I had been considering is to always make the decoding
>>> succeed, by using the private-use-area of Unicode to represent bytes
>>> that don't decode correctly.
>>>
>> hmm..an interesting idea..
>>
>> and what happens with such texts, when they are encoded into let's say
>> utf-8? are the in-private-use-area characters ignored?
>
> UTF-8 supports encoding of all Unicode characters, including the PUA
> blocks.
>
> py> u"\ue020".encode("utf-8")
> '\xee\x80\xa0'
so basically you'd like to be able to "round-trip"?
so that:
listdir returns an array of filenames, the un-representable bytes will
be represented in the PUA.
all the other file-handling functions (stat, open, etc..) recognize such
strings, and handle them correctly.
?
gabor
That would conflict with private use characters appearing in file
names.
Personally, I think os.listdir() should return the file names only in
Unicode if they're actually stored that way in the underlying file
system (eg. NTFS), otherwise return them as byte strings. I doubt
anyone in this thread would like that, though.
Ross Ridge
Not necessarily: they could get escaped.
AFAICT, you can have that conflict only if the file system encoding
is UTF-8: otherwise, there is no way to represent them.
> Personally, I think os.listdir() should return the file names only in
> Unicode if they're actually stored that way in the underlying file
> system (eg. NTFS), otherwise return them as byte strings. I doubt
> anyone in this thread would like that, though.
So I assume you would not want to allow to pass Unicode strings
to open(), stat() etc. either, as the _real_ file system API requires
byte strings there, as well?
People would indeed see that as a step backwards. If you don't want
Unicode strings returned from listdir, don't pass Unicode string
as the directory name.
Technically, how do you determine whether the underlying file
system stores file names "in Unicode"? Does OSX use Unicode
(it requires path names to be UTF-8)? After all, each and
every encoding is a Unicode encoding - that was a design
goal of Unicode.
Regards,
Martin
Ross Ridge schrieb:
> That would conflict with private use characters appearing in file
> names.
Martin v. Löwis wrote:
> Not necessarily: they could get escaped.
How?
> AFAICT, you can have that conflict only if the file system encoding
> is UTF-8: otherwise, there is no way to represent them.
They can also appear in UTF-16 filenames (obviously) and in various
Far-East multi-byte encodings.
> > Personally, I think os.listdir() should return the file names only in
> > Unicode if they're actually stored that way in the underlying file
> > system (eg. NTFS), otherwise return them as byte strings. I doubt
> > anyone in this thread would like that, though.
>
> So I assume you would not want to allow to pass Unicode strings
> to open(), stat() etc. either, as the _real_ file system API requires
> byte strings there, as well?
No, I just expect that if the underlying file system API does accept a
given byte or Unicode string that I could pass the same string to
open() and stat(), etc.. and have it work. I have no problem if
additional strings happen to work because Python converts byte strings
to Unicode or vice-versa as the API requires.
Should I assume that since you think that having "os.listdir()" return
Unicode strings when passed a Unicode directory name is a good idea,
that you also think that file object methods (eg. readline) should
return Unicode strings when opened with a Unicode filename?
> Technically, how do you determine whether the underlying file
> system stores file names "in Unicode"?
On Windows you can use GetVolumeInformation(), though it may be more
practical to assume Unicode or byte strings based on the OS. On Unix
you'd assume byte strings.
> Does OSX use Unicode (it requires path names to be UTF-8)?
HFS+ uses Unicode. I have no idea how you'd figure out the properties
of a filesystem under OS/X, but then the Python docs suggests this
os.listdir() Unicode feature doesn't work on Macintosh systems anyways.
> After all, each and every encoding is a Unicode encoding - that was a design
> goal of Unicode.
If it were as simple as that, then yes, there wouldn't be a problem.
Unfortunately, as this thread has revealed, os.listdir() isn't always
able to map byte string filenames into Unicode, either because they
don't use the assumed encoding, don't all use the same encoding or
don't use any standard encoding. That's the problem here: there's no
encoding associated with Unix filenames, they're just byte strings. Since
Python byte strings also have no encoding associated with them, they're
the natural way of representing all valid file names on Unix systems.
On the other hand, under Windows NT/2K/XP and NTFS or VFAT the natural
way to represent all valid file names is Unicode strings.
Ross Ridge
Suppose I use U+E001..U+E0FF as the PUA characters for unencodable
bytes; U+E000 wouldn't be needed since \0 cannot be part of
a file name in POSIX.
Then I would use U+E000 for escaping. Each PUA character in the
listed file name would get escaped with U+E000 in the Python
string; when the file name is converted back to the system, it
gets unescaped.
Notice that I think this is a really unrealistic case - I expect
that all file names containing PUA characters were deliberately
crafted to investigate using PUA characters in file names.
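as a toy sketch of the byte-to-PUA mapping (written with the codecs
error-handler machinery; the U+E000 escaping of genuine PUA characters
is left out for brevity, so this is not a complete implementation, and
the handler name is made up):

```python
import codecs

def pua_fallback(exc):
    # Decode error handler: map each undecodable byte b to U+E000+b.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(0xE000 + b) for b in bad), exc.end
    raise exc

codecs.register_error('pua-fallback', pua_fallback)

def pua_to_bytes(name, encoding):
    # Convert back: U+E001..U+E0FF turn into the original raw bytes.
    out = bytearray()
    for ch in name:
        code = ord(ch)
        if 0xE001 <= code <= 0xE0FF:
            out.append(code - 0xE000)
        else:
            out.extend(ch.encode(encoding))
    return bytes(out)
```

with this, every byte-string name decodes, and the original bytes can
be recovered before calling stat() or open().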
>> AFAICT, you can have that conflict only if the file system encoding
>> is UTF-8: otherwise, there is no way to represent them.
>
> They can also appear in UTF-16 filenames (obviously) and in various
> Far-East multi-byte encodings.
No: UTF-16 file names cannot occur in POSIX, as this is not a null-byte
free encoding. What Far-East multi-byte encoding uses PUA characters,
and for what characters?
> No, I just expect that if the underlying file system API does accept a
> given byte or Unicode string that I could pass the same string to
> open() and stat(), etc.. and have it work.
On no operating system I'm aware of can you pass "Unicode strings" to
open() or stat(). You always have to find some byte encoding as
parameters for open() and stat(), because that's what POSIX specifies.
> Should I assume that since you think that having "os.listdir()" return
> Unicode strings when passed a Unicode directory name is a good idea,
> that you also think that file object methods (eg. readline) should
> return Unicode strings when opened with a Unicode filename?
No, not at all. How file names are interpreted is entirely independent
of how file content is interpreted.
Many people believe file names are character strings, and use them
as such in every day's life. OTOH, many people are aware that the
contents of a file aren't necessarily plain text - most people
are familiar with PDF and executable files.
> On Windows you can use GetVolumeInformation(), though it may be more
> practical to assume Unicode or byte strings based on the OS. On Unix
> you'd assume byte strings.
On Windows, the entire issue doesn't exist: We don't use open() or
stat() on Windows. If we have a Unicode file name on Windows, we
use the system's Unicode API.
>> Does OSX use Unicode (it requires path names to be UTF-8)?
>
> HFS+ uses Unicode. I have no idea how you'd figure out the properties
> of a filesystem under OS/X, but then the Python docs suggests this
> os.listdir() Unicode feature doesn't work on Macintosh systems anyways.
Either the docs are wrong, or you are misinterpreting them. It works
just fine in practice.
> That's the problem here, there's no
> encoding associated with Unix filenames, they're just byte strings.
Can you please quote chapter and verse of the POSIX spec that says
so? I believe POSIX specifies the entire opposite:
http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_114
says
# A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
# characters composing the name may be selected from the set of all
# character values excluding the slash character and the null byte. The
# filenames dot and dot-dot have special meaning; see pathname
# resolution . A filename is sometimes referred to as a pathname
# component.
#
# Filenames should be constructed from the portable filename character
# set because the use of other characters can be confusing or ambiguous
# in certain contexts. (For instance, the use of a colon (:) in a
# pathname could cause ambiguity if that pathname were included in a
# PATH definition.)
So they are not "just byte strings"; they must come from the set of all
characters. "character" is defined as "A sequence of one or more bytes
representing a single graphic symbol or control code."
> Since
> Python byte strings also have no encoding associated with them they're
> the natural way of representing all valid file names on Unix systems.
And still, people want to render file names in a user interface to
the user.
Regards,
Martin
How would you tell an escaped file name containing these private use
characters obtained from os.listdir() from an unescaped file name
containing these characters obtained from some other source?
> Notice that I think this is a really unrealistic case - I expect
> that all file names containing PUA characters were deliberately
> crafted to investigate using PUA characters in file names.
I suspect a more common case is file names containing end-user defined
characters.
> What Far-East multi-byte encoding uses PUA characters,
> and for what characters?
Pretty much all of them, near as I can tell. See the following WWW page
for a discussion of this issue:
http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html
> On no operating system I'm aware of can you pass "Unicode strings" to
> open() or stat().
*sigh* I was referring to the Python functions "open() and stat() etc."
just as you had in the paragraph I copied those exact words from.
> > On Windows you can use GetVolumeInformation()...
>
> On Windows, the entire issue doesn't exist:
On Windows, I think you should use GetVolumeInformation() to decide
whether or not os.listdir() returns Unicode or byte strings, rather
than the type of the argument.
>> ... but then the Python docs suggests this
> > os.listdir() Unicode feature doesn't work on Macintosh systems anyways.
>
> Either the docs are wrong, or you are misinterpreting them. It works
> just fine in practice.
As the original poster in this thread wrote, the docs say:
On Windows NT/2k/XP and Unix, if path is a Unicode
object, the result will be a list of Unicode objects
The implication being that Macintosh systems don't support this
feature.
> > That's the problem here, there's no
> > encoding associated with Unix filenames, they're just byte strings.
>
> Can you please quote chapter and verse of the POSIX spec that says
> so?
I said Unix, not POSIX. In practice, Unix systems don't associate an
encoding with filenames, and any byte value other than '/' or '\0' is
permitted in a filename. Not that it really matters; Python byte
strings are also the natural way to represent file names stored in an
unspecified encoding.
Ross Ridge
How will it interoperate with the non-python world? Will these file
names ever escape the python process?
Unicode consortium thinks "safe" utf-8 is a bad idea:
http://www.mail-archive.com/uni...@unicode.org/msg27241.html
[Lars Kristan]
> Which could be understood as "a proposal to amend UTF-8 to allow invalid
> sequences".
[Kenneth Whistler, Technical Director, The Unicode Consortium]
O.k., and as pointed out already, that simply won't fly. *Nobody*
in the UTC or WG2 is going to go for that. It would destroy
UTF-8, not fix it.
---------------------------------
Kenneth Whistler on invalid file names:
http://www.mail-archive.com/uni...@unicode.org/msg27225.html
And also: http://www.mail-archive.com/uni...@unicode.org/msg27167.html
[Lars Kristan]
> Should all
> filenames that do not conform to UTF-8 be declared invalid?
[Doug Ewell, the guy behind Unicode Technical Note #14]
If you have a UTF-8 file system, yes.
--------------------------------------------------------------
-- Leo
But that doesn't mean you should store pictures in file names. File
names are supposed to be human readable text labels for binary files.
Most UNIX-like distributions and admins that *care* about multilingual
support set up systems properly: *all* file names within the local
network use the same encoding. In situations where systems are
misconfigured, you can ask the user what the encoding is, for example
through the locale setting: I can go to a misconfigured windows share and
type "LANG=ru_RU.cp1251 ls" to correctly display Russian file names.
-- Leo