unicode filenames


Andrew Dalke

Feb 2, 2003, 7:14:43 PM
Okay, I'm confused. I've been working my way through the changes
put into Python 2.3. One of these is PEP 277, "Unicode file name
support for Windows NT" at http://www.python.org/peps/pep-0277.html .

I decided to experiment with how to use unicode filenames. I
thought I understood, until I tried it out.

How do I deal with possibly unicode filenames in a platform
independent manner?

I normally use unix. What's the right way to treat filenames
under that OS? As Latin-1? Or UTF-8? As far as I can tell,
filenames are simply bytes, so I can make whatever interpretation
I want on the characters, and the standard viewpoint is to
interpret those characters as Latin-1.

[dalke@zebulon src]$ ./python
Python 2.3a1 (#7, Feb 2 2003, 15:54:30)
[GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> open("spårvägen", "w").close()
>>> ^D
[dalke@zebulon src]$ ls -l sp*
-rw-r--r-- 1 dalke users 0 Feb 2 16:19 spårvägen
[dalke@zebulon src]$ ls sp* | od -c
0000000 s p å r v ä g e n \n
0000012
[dalke@zebulon src]$ ./python
Python 2.3a1 (#7, Feb 2 2003, 15:54:30)
[GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = unicode("spårvägen", "latin-1")
>>> s
u'sp\xe5rv\xe4gen'
>>> open(s, "w").close()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in
position 2: ordinal not in range(128)
>>> s.encode("utf8")
'sp\xc3\xa5rv\xc3\xa4gen'
>>> open(s.encode("utf8"), "w").close()
>>> ^D
[dalke@zebulon src]$ ls -l sp*
-rw-r--r-- 1 dalke users 0 Feb 2 16:19 spårvägen
-rw-r--r-- 1 dalke users 0 Feb 2 16:22 spårvägen
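(In byte terms, the two entries above look like this -- a sketch, spelled
with current str/bytes syntax rather than 2.3's unicode type:)

```python
# Byte-level view of the two directory entries (sketch).
s = "spårvägen"
latin1 = s.encode("latin-1")
utf8 = s.encode("utf-8")
print(latin1)  # b'sp\xe5rv\xe4gen' -- 9 bytes, the first entry
print(utf8)    # b'sp\xc3\xa5rv\xc3\xa4gen' -- 11 bytes, the second entry
```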


Does that mean Unix filenames can't contain non-Latin-1 characters?
Or does it mean I need to get the info on how to interpret the
filename using something from the current environment?
(sys.getdefaultencoding() doesn't work since that reports 'ascii'
for me.) Could different directories be encoded differently, eg
/home/<encoded in ASCII>/<encoded in Latin-1>/<encoded in big5> ?

And what happens when a remote file is mounted, say, from a MS
Windows OS? Are they represented as UTF-8? Something else?
Is that standardized or is it a property of the mount mechanism
and can change accordingly?

Okay, now let's see what changed in Python 2.3. According to
Andrew Kuchling's "What's new in Python 2.3" at
http://www.python.org/doc/2.3a1/whatsnew/node5.html

On Windows NT, 2000, and XP, the system stores file names
as Unicode strings. Traditionally, Python has represented
file names as byte strings, which is inadequate because it
renders some file names inaccessible.

Python now allows using arbitrary Unicode strings (within
the limitations of the file system) for all functions that
expect file names, most notably the open() built-in function.
If a Unicode string is passed to os.listdir(), Python now
returns a list of Unicode strings. A new function,
os.getcwdu(), returns the current directory as a Unicode string.
...
Other systems also allow Unicode strings as file names but
convert them to byte strings before passing them to the system,
which can cause a UnicodeError to be raised. Applications can
test whether arbitrary Unicode strings are supported as file
names by checking os.path.unicode_file_names, a Boolean value.

Indeed, on my Linux system, "os.path.supports_unicode_filenames" (I've
sent in a bug report on the difference in attribute names) is False.

Still, 'os.getcwdu()' does exist and works for my Linux system. (I
removed the 'spårvägen' files and restarted Python.)

>>> import os
>>> os.path.supports_unicode_filenames
False
>>> os.getcwdu()
u'/home/dalke/cvses/python/dist/src'
>>> os.mkdir("spårvägen")
>>> os.chdir("spårvägen")
>>> os.getcwdu()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 36:
ordinal not in range(128)
>>>

This seems to imply that if I want to know the current working
directory (eg, to display in a GUI widget, which understands how to
display unicode strings) my code needs to work like this:

if os.path.supports_unicode_filenames:
    cwd = os.getcwdu()
else:
    encoding = .. get default filesystem encoding ... or 'latin-1'
    cwd = unicode(os.getcwd(), encoding)

Ugly .. quite ugly. And suggestions on the proper way to
handle this are not documented as far as I can find.
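For what it's worth, a runnable sketch of that wrapper in current syntax;
os.getcwdb() and sys.getfilesystemencoding() stand in for the byte-level
pieces, and the latin-1 fallback is the same guess as above:

```python
import os
import sys

def get_cwd_text():
    # Sketch: decode the byte-level cwd with the filesystem encoding,
    # falling back to latin-1 (an assumption, not a documented rule).
    encoding = sys.getfilesystemencoding() or "latin-1"
    return os.getcwdb().decode(encoding, "replace")

print(get_cwd_text())
```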

Next I want to display the files in that directory. For MS
Windows it looks like I can do that with the unicode string,
as in

os.listdir(cwd)

but with unix I need to do

[unicode(x, encoding) for x in os.listdir(s.encode(encoding))]

so the portable code to list the files in a directory is something
like this

def my_listdir(dirname, filesystem_encoding = "ascii"):
    if os.path.supports_unicode_filenames:
        return os.listdir(dirname)
    enc = filesystem_encoding
    return [unicode(x, enc) for x in os.listdir(dirname.encode(enc))]

Again, that seems rather ugly, since I need to roll my own code
to get what I believe to be platform independence.

Similar problems hold true for mkdir and other functions. Eg,
I get a unicode string from the user which is a directory to
create. To work for both MS Windows and non-MS Windows machines,
I need to do

def my_mkdir(dirname, filesystem_encoding = "ascii"):
    if os.path.supports_unicode_filenames:
        os.mkdir(dirname)
    else:
        os.mkdir(dirname.encode(filesystem_encoding))

(possibly with some error catching if the new filename is in
Thai for a Latin-1 filesystem.)
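(That failure mode is concrete; a sketch in current syntax, using the Thai
word for "Thai":)

```python
# A Thai name simply has no Latin-1 encoding, so the encode step
# would raise UnicodeEncodeError (sketch).
try:
    "\u0e44\u0e17\u0e22".encode("latin-1")
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)  # True
```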

In other words, it seems that I need to write a wrapper for
all functions which might take a unicode string, so that when
supports_unicode_filenames is False I convert it to the appropriate
default filesystem encoding.

(Again, if different directory components can be in different
character sets then this doesn't work. But I don't think anyone
can reasonably expect that.)

It seems, in my naive view of unicode, that there should be a
system-wide function to get/set the default filesystem encoding,
and the Python functions to mkdir, listdir, rmdir, etc. should
use that encoding when a Unicode string is passed in to them,
and that the default encoding be ASCII as it is now.

But as I said, I am naive about unicode, so this post is meant
as a shout for help from those more experienced, to clear up
my own confusion.

Andrew
da...@dalkescientific.com

Erik Max Francis

Feb 2, 2003, 8:45:19 PM
Andrew Dalke wrote:

> I normally use unix. What's the right way to treat filenames
> under that OS? As Latin-1? Or UTF-8? As far as I can tell,
> filenames are simply bytes, so I can make whatever interpretation
> I want on the characters, and the standard viewpoint is to
> interpret those characters as Latin-1.

I believe that's the most common interpretation, but as you say, it
doesn't much matter since filenames in UNIX are just considered streams
of bytes. No reference to an encoding -- as far as I know -- is made in
any UNIX-relevant standard.

> Does that mean Unix filenames can't contain non-Latin-1 characters?
> Or does it mean I need to get the info on how to interpret the
> filename using something from the current environment?

It means that filenames are strings of bytes. What the meaning of those
bytes are is entirely application dependent. They could be raw ASCII
(the most common), Latin-1 (probably the most common with filenames that
contain bytes with the MSB set), or any other encoding whatsoever. It's
applications that make the files, it's applications that decide what
encoding to use.
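The point is easy to see with the bytes from the earlier example (a
sketch in current bytes/str syntax; cp437 is just an arbitrary second
single-byte encoding):

```python
# The same bytes on disk read differently under different encodings.
raw = b"sp\xe5rv\xe4gen"
print(raw.decode("latin-1"))   # spårvägen
print(raw.decode("cp437"))     # same bytes, a different reading
try:
    raw.decode("utf-8")        # not valid UTF-8 at all
except UnicodeDecodeError:
    print("not UTF-8")
```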

--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE
/ \ The quickest way of ending a war is to lose it.
\__/ George Orwell
REALpolitik / http://www.realpolitik.com/
Get your own customized newsfeed online in realtime ... for free!

David Eppstein

Feb 2, 2003, 10:32:14 PM
In article <3E3DC9AF...@alcyone.com>,

Erik Max Francis <m...@alcyone.com> wrote:

> > I normally use unix. What's the right way to treat filenames
> > under that OS? As Latin-1? Or UTF-8? As far as I can tell,
> > filenames are simply bytes, so I can make whatever interpretation
> > I want on the characters, and the standard viewpoint is to
> > interpret those characters as Latin-1.
>
> I believe that's the most common interpretation, but as you say, it
> doesn't much matter since filenames in UNIX are just considered streams
> of bytes. No reference to an encoding -- as far as I know -- is made in
> any UNIX-relevant standard.

Under Mac OS X, the shell displays text (e.g. from cat, or from ls
without the -q option) as utf-8 by default, and the Finder (gui file
browser) uses utf-8 for accented characters in file names. So I infer
that the correct interpretation of filenames under my OS is utf-8.
But other unixes may differ...

--
David Eppstein UC Irvine Dept. of Information & Computer Science
epps...@ics.uci.edu http://www.ics.uci.edu/~eppstein/

Andrew Dalke

Feb 3, 2003, 12:21:36 AM
David Eppstein wrote:
> Under Mac OS X, the shell displays text (e.g. from cat, or from ls
> without the -q option) as utf-8 by default, and the Finder (gui file
> browser) uses utf-8 for accented characters in file names. So I infer
> that the correct interpretation of filenames under my OS is utf-8.

Nautilus and Konquerer both interpret files under Linux as Latin-1.
So does Konsole and xterm. And Mozilla's "File Open" dialog.

I think I would rather they use UTF-8.

Andrew
da...@dalkescientific.com

Alex Martelli

Feb 3, 2003, 2:24:52 AM
Erik Max Francis wrote:
...

> It means that filenames are strings of bytes. What the meaning of those
> bytes are is entirely application dependent. They could be raw ASCII

ALMOST entirely -- for example, none of the bytes is allowed to have
the value 47 (since that is the code for "slash" in ASCII).

> applications that make the files, it's applications that decide what
> encoding to use.

As long as the encoding never needs to use a byte whose value is
47. I think that rules out UTF-8 and most other popular
multi-byte encodings, doesn't it?


Alex

Erik Max Francis

Feb 3, 2003, 2:44:12 AM
Alex Martelli wrote:

> ALMOST entirely -- for example, none of the bytes is allowed to have
> the value 47 (since that is the code for "slash" in ASCII).

I thought we would all be reasonable enough to implicitly understand
that was a condition. I thought of explicitly mentioning it, but
thought it too obvious. Just goes to show.

> As long as the encoding never needs to use a byte whose value is
> 47. I think that rules out UTF-8 and most other popular
> multi-byte encodings, doesn't it?

UTF-8 not including a slash. Or UTF-16 not including a slash. Or
Latin-1 not including a slash. And so on.

The context is Unicode filenames; Unicode filenames on Windows certainly
have similar restrictions; you can't put _any_ character in there and
expect it to work (for precisely these reasons; I suspect Windows would
restrict them more, in fact). Same goes for a UNIX filesystem, so it's
not like in context that limitation wasn't already apparent.

--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE

/ \ ... Not merely peace in our time, but peace for all time.
\__/ John F. Kennedy
Python chess module / http://www.alcyone.com/pyos/chess/
A chess game adjudicator in Python.

Andrew Bennetts

Feb 3, 2003, 2:40:27 AM
On Mon, Feb 03, 2003 at 07:24:52AM +0000, Alex Martelli wrote:
> Erik Max Francis wrote:
> ...
> > It means that filenames are strings of bytes. What the meaning of those
> > bytes are is entirely application dependent. They could be raw ASCII
>
> ALMOST entirely -- for example, none of the bytes is allowed to have
> the value 47 (since that is the code for "slash" in ASCII).

I believe slash and the null byte are the only disallowed characters in unix
path names.

-Andrew.


Alex Martelli

Feb 3, 2003, 3:39:39 AM
Erik Max Francis wrote:

> Alex Martelli wrote:
>
>> ALMOST entirely -- for example, none of the bytes is allowed to have
>> the value 47 (since that is the code for "slash" in ASCII).
>
> I thought we would all be reasonable enough to implicitly understand
> that was a condition. I thought of explicitly mentioning it, but
> thought it too obvious. Just goes to show.

It appears to me that you may not fully understand the
implications of this -- or, am _I_ missing something...?


>> As long as the encoding never needs to use a byte whose value is
>> 47. I think that rules out UTF-8 and most other popular
>> multi-byte encodings, doesn't it?
>
> UTF-8 not including a slash. Or UTF-16 not including a slash. Or
> Latin-1 not including a slash. And so on.

Latin-1 is not a multi-byte encoding, so, "no byte with a value
of 47" does indeed equate to "not including a slash". But it
appears to me that you may be generalizing unduly: for a multi-byte
encoding, "no byte with a value of 47" is a MUCH stricter
condition than "not including a slash".

For example, the Lithuanian character "Latin small letter i with
ogonek", in Unicode, is represented by code 012F. The Livonian
"Latin small letter o with dot above", by code 022F. And so on.

So, for example, in UTF-16, a filename containing the former would
have to include a byte of value 47 (0x2F), either right before
or right after a byte of value 0x01, depending on endianness.

Therefore, on a Unix system that is not specifically and
explicitly Unicode-aware, you could use filenames with UTF-16
encoding only if they didn't include ANY of: slash (obvious),
the said Lithuanian and Livonian characters (you _might_
perhaps use combinations instead -- 0069 0328 as equivalent
to 012F, for example -- but, isn't 00 the OTHER value you
are NOT allowed to use, besides 0x2F...?!-), the Cyrillic
capital letter Ya (042F, no combination equivalents that I
know of), the Arabic letter Dal (062F, no comb.), and so on,
and so forth.
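Those byte values can be verified directly (a sketch in current Python
syntax, with big-endian UTF-16 so the 0x2F byte is easy to spot):

```python
# UTF-16 really does need a 0x2F byte for these characters (sketch):
# U+012F (i with ogonek), U+022F (o with dot above),
# U+042F (Cyrillic capital Ya), U+062F (Arabic Dal).
for ch in ("\u012f", "\u022f", "\u042f", "\u062f"):
    data = ch.encode("utf-16-be")
    print(hex(ord(ch)), data.hex(), 0x2F in data)  # each line ends True
```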

Similar considerations apply for any other multibyte encoding
(such as, UTF-8) that is NOT specifically and carefully
designed to avoid ever needing a byte of value 47 (0x2F) in
order to represent ANY character except a slash. I am not
aware of any such multi-byte encoding -- there may be some,
but, even if one can be found, using it would still fall WELL
short of "any other encoding whatsoever" as you claimed.

(Note in passing that careful avoidance of bytes with value
of ZERO for any character except NUL _is_ typical of the design
of several popular multi-byte encodings -- not UTF-16, which
needs a most significant byte of value 00 to represent any
character in the code range 1-255, i.e. characters that are
also in Ascii and Iso-8859-1 -- but I recall, from back when
I worked with shifted-JIS and JIS-EUC in C, that one COULD
at least rely on a byte of value 0 always meaning end-of-
string, without accidentally hitting any such bytes within
the representation of any other character).


> The context is Unicode filenames; Unicode filenames on Windows certainly
> have similar restrictions; you can't put _any_ character in there and
> expect it to work (for precisely the reasons; I suspect Windows would
> restrict them more, in fact). Same goes for a UNIX filesystem, so it's
> not like in context that limitation wasn't already apparent.

Unicode-supporting Windows filesystems (NTFS, in particular,
under Windows/NT, /2000, /XP) certainly do restrict some of
the punctuation you're allowed to use in filenames -- but,
being specifically Unicode-aware and using whatever encoding
THEY choose, they do NOT arbitrarily forbid you to use letters
such as Arabic Dal, Cyrillic capital Ya, and so on, just because
of the value that a byte happens to have for such a character
in some multi-byte encoding or other.

So, what _am_ I missing? Can you please explain in more detail
your original claim that:


"""
It means that filenames are strings of bytes. What the meaning of those
bytes are is entirely application dependent. They could be raw ASCII

(the most common), Latin-1 (probably the most common with filenames that
contain bytes with the MSB set), or any other encoding whatsoever. It's

applications that make the files, it's applications that decide what
encoding to use.
"""

Ascii, Latin-1, ISO-8859-whatever -- sure. But -- "any other
encoding whatsoever", "it's applications that decide"? And
specifically UTF-8 and UTF-16, just as long as no _slash_ is
there, as you very specifically claim in this post? I guess
I'm thick, since you keep claiming it's all so obvious and
apparent -- but, can you PLEASE patiently explain in words
of one syllable how my application could decide to use e.g
UTF-16 and then name a file "Cyrillic upper Ya"+"Arabic Dal",
on a non-Unicode-aware Unix system? Thanks!


Alex

Alex Martelli

Feb 3, 2003, 3:48:24 AM
Andrew Bennetts wrote:

Bytes, not characters, IF you accept Erik's claim that the
application can freely decide on the encoding -- and there's
the rub, as least as far as I can see -- in the multibyte
encodings I know of, forbidding those two byte values (in
particular 47, i.e. 0x2F) ends up forbidding a LOT of
_characters_ -- because many characters may happen to need
a BYTE of value 0x2F as a part of their representation in
such an encoding. Please see my other response to Erik in
this thread for a detailed explanation of where I see the
problem coming up, with reference to this.

If the system is not aware of any distinction between
bytes and characters in a filename (or other string that
is somehow relevant to the system), and in particular is
unaware of Unicode, then it appears to me that similar
limitations would always emerge regarding arbitrary use
of multi-byte encodings. UTF-16 will in particular be
unusable if bytes with a value of 0 are prohibited. Most
others, I believe, WOULD be usable if that was the only
prohibition (as they're carefully designed around the
"null byte problem", so to speak) -- but the further
prohibition of value 47 (0x2F) seems to be a killer from
my point of view.

If I _am_ missing something "obvious and apparent", as
it would seem from Erik's response, I would definitely
appreciate being helped to understand it. Otherwise, I
will operate on the working hypothesis that some people
do not understand the difference between "character"
and "byte" in the context of multi-byte encoding, and
that their claims that something is "apparent" and/or
"obvious" are therefore of somewhat dubious validity.


Alex

Neil Hodgson

Feb 3, 2003, 4:32:58 AM
Alex Martelli:

> Similar considerations apply for any other multibyte encoding
> (such as, UTF-8) that is NOT specifically and carefully
> designed to avoid ever needing a byte of value 47 (0x2F) in
> order to represent ANY character except a slash. I am not
> aware of any such multi-byte encoding -- there may be some,
> but, even if one can be found, using it would still fall WELL
> short of "any other encoding whatsoever" as you claimed.

UTF-8 is a superset of ASCII. A slash has the same representation in
UTF-8 as ASCII. No multi-byte UTF-8 character may contain a byte < 128.
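That property is easy to spot-check (a sketch in current syntax): every
byte of a multi-byte UTF-8 sequence has the high bit set, so 0x2F (and
0x00) can only ever appear as themselves.

```python
# Spot-check: multi-byte UTF-8 sequences never contain a byte < 0x80,
# so a slash byte can only ever mean a slash (sketch).
for ch in "\u012f\u042f\u062f\u221e":
    data = ch.encode("utf-8")
    assert all(b >= 0x80 for b in data), (ch, data)
assert "/".encode("utf-8") == b"/"
print("ok")
```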

Neil


Alex Martelli

Feb 3, 2003, 5:39:34 AM
Neil Hodgson wrote:

Ah! Wonderful, thanks -- and clearly this was one crucial
point I was missing: UTF-8 *IS* "specifically and carefully
designed to avoid ever needing a byte of value 47 (0x2F) in
order to represent ANY character except a slash" (among
other things;-), and therefore _IS_ usable as the encoding
of Unicode names on a non-Unicode-aware Unix system.

I think it's still true that this doesn't apply to other
multi-byte encodings, and therefore it's misleading to claim
that applications can decide on any such encoding (as Erik
did), but, given the _ability_ to use UTF-8, this only
means each application _should_ use UTF-8 rather than
other encodings (if it needs to be able to represent all
Unicode characters, rather than, say, just the subset of
them that's in Latin-1) in this context.


Thanks!


Alex


Neil Hodgson

Feb 3, 2003, 6:00:19 AM
Andrew Dalke:

> And what happens when a remote file is mounted, say, from a MS
> Windows OS? Are they represented as UTF-8? Something else?
> Is that standardized or is it a property of the mount mechanism
> and can change accordingly?

The default mount options I have seen turn the Unicode file names into
'?'s. However, with the a VFAT file system that has some Unicode file names
on my machine, mounting the partition from Linux with the utf8 option in
fstab:
/dev/hda5 /eff vfat auto,shortname=winnt,utf8,owner 0 0
leads to UTF-8 strings being returned to user programs. Since Red Hat 8.0
defaults to UTF-8 locales, many programs such as Nautilus and the standard
GTK+ file open dialog display these file names correctly although some
characters are still not seen because the default UI fonts do not have all
the required characters. Still, European, Cyrillic, Greek, were OK and Asian
characters often displayed as boxes with codes inside.

> if os.path.supports_unicode_filenames:
>     cwd = os.getcwdu()
> else:
>     encoding = .. get default filesystem encoding ... or 'latin-1'
>     cwd = unicode(os.getcwd(), encoding)
>
> Ugly .. quite ugly. And suggestions on the proper way to
> handle this is not documented as far as I can find.

Yes, it is ugly but I don't know how to handle this well on Unix. In my
above example there is one partition mounted in UTF-8 mode but other
partitions could be using other encodings. I imagine there is some way to
reach the mount options for a given directory...

Neil


Paul Boddie

Feb 3, 2003, 6:35:54 AM
Andrew Dalke <ada...@mindspring.com> wrote in message news:<b1kc9o$vf1$1...@slb9.atl.mindspring.net>...

>
> I normally use unix. What's the right way to treat filenames
> under that OS? As Latin-1? Or UTF-8? As far as I can tell,
> filenames are simply bytes, so I can make whatever interpretation
> I want on the characters, and the standard viewpoint is to
> interpret those characters as Latin-1.

It may be locale-based on Linux, at least, and possibly on other UNIX
platforms, too.

> [dalke@zebulon src]$ ls sp* | od -c
> 0000000 s p å r v ä g e n \n
> 0000012

I hadn't heard of 'od' before, so this is a useful piece of
information. When accessing Red Hat Linux 7.3 on Intel with locale as
en_US.iso885915, I can apparently create filenames with ISO-8859-15
characters, and in the terminal program I'm using, these characters
appear as question marks when switching locale to en_US.utf8. However,
in the former locale, 'od -c' returns the characters as part of the
"dump", whereas in the latter, 'od -c' returns the octal codes for
those characters.

What is interesting is that if I try to remove the file in UTF-8 mode,
it succeeds, even though the byte encoding of the filename should
really be different from what it was before. Moreover, if I create a
file with ISO-8859-15-encodable characters in UTF-8 mode, it seems to
use the ISO-8859-15 byte values.

Perhaps the "UTF-8 and Unicode FAQ" and the manual might be of help:

man unicode

Still, I see your point about it being harder to use non-ASCII
characters in filenames on UNIX with the upcoming Python 2.3. In many
environments, this is a highly unsatisfactory situation.

Paul

Ganesan R

Feb 3, 2003, 6:45:21 AM
to al...@aleax.it
>>>>> "Alex" == Alex Martelli <al...@aleax.it> writes:

> Neil Hodgson wrote:
>> Alex Martelli:
>>
>>> Similar considerations apply for any other multibyte encoding
>>> (such as, UTF-8) that is NOT specifically and carefully
>>> designed to avoid ever needing a byte of value 47 (0x2F) in
>>> order to represent ANY character except a slash. I am not
>>> aware of any such multi-byte encoding -- there may be some,
>>> but, even if one can be found, using it would still fall WELL
>>> short of "any other encoding whatsoever" as you claimed.
>>
>> UTF-8 is a superset of ASCII. A slash has the same representation in
>> UTF-8 as ASCII. No multi-byte UTF-8 character may contain a byte < 128.

> Ah! Wonderful, thanks -- and clearly this was one crucial
> point I was missing: UTF-8 *IS* "specifically and carefully
> designed to avoid ever needing a byte of value 47 (0x2F) in
> order to represent ANY character except a slash" (among
> other things;-), and therefore _IS_ usable as the encoding
> of Unicode names on a non-Unicode-aware Unix system.

Indeed. UTF-8 had its origin in Plan 9 (if I remember correctly) as
a "File System Safe" unicode transformation format. You can find a
document titled FSS-UTF on the net.

Ganesan

--
Ganesan R

Ganesan R

Feb 3, 2003, 7:15:45 AM

Sorry to follow up on my own post, but I was incorrect. It appears that
UTF-FSS first appeared as an X/Open document authored by an IBM
employee. See
http://www.mail-archive.com/linux...@nl.linux.org/msg03609.html

Ganesan

--
Ganesan R

Andrew Dalke

Feb 4, 2003, 1:55:32 AM
Me:

>> if os.path.supports_unicode_filenames:
>>     cwd = os.getcwdu()
>> else:
>>     encoding = .. get default filesystem encoding ... or 'latin-1'
>>     cwd = unicode(os.getcwd(), encoding)
>>
>>Ugly .. quite ugly. And suggestions on the proper way to
>>handle this is not documented as far as I can find.

Neil Hodgson wrote:
> Yes, it is ugly but I don't know how to handle this well on Unix. In my
> above example there is one partition mounted in UTF-8 mode but other
> partitions could be using other encodings. I imagine there is some way to
> reach the mount options for a given directory...

Okay, so it seems like no one knows how to handle unicode filenames
under Unix. Perhaps the following is the proper behaviour?

1) there is a default filesystem encoding, which is initialized
   to None if os.path.supports_unicode_filenames is True, otherwise
   it's set to sys.getdefaultencoding()

2) there is a registration system which is used to define encodings
   used for different mount locations. If a filename/dirname is
   not covered, use the default filesystem encoding

3) a) when the input dirname or filename is a string, use the
      current behaviour
   b) when unicode, use the encoding from 2 (may have to get
      the absolute path name ... don't like this part of it.
      Perhaps the call to #2 should only be done for full paths?)

Here's an example implementation for listdir and getcwdu


_filesystem_encoding = None

def get_default_filesystem_encoding():
    if _filesystem_encoding is None:
        if os.path.supports_unicode_filenames:
            return None
        return sys.getdefaultencoding()
    return _filesystem_encoding

def set_default_filesystem_encoding(encoding):
    global _filesystem_encoding
    _filesystem_encoding = encoding

# This is used if different mount points have different
# encodings. See below for how to use it
class FilesystemEncoding:
    def __init__(self):
        self.data = {}
    def __setitem__(self, dirname, encoding):
        if dirname.endswith(os.sep):
            dirname = dirname[:-len(os.sep)]
        self.data[dirname] = encoding
    def lookup(self, name):
        while 1:
            if not name:
                return get_default_filesystem_encoding()
            if name in self.data:
                return self.data[name]
            new_name = os.path.dirname(name)
            if name == new_name:
                name = None
            else:
                name = new_name

filesystem_encodings = FilesystemEncoding()
....

>>> filesystem_encodings["/home/dalke"] = "utf8"
>>> filesystem_encodings["/"] = "latin-1"
>>> filesystem_encodings.lookup("/home/dalke/test.txt")
'utf8'
>>> filesystem_encodings.lookup("/home/spam")
'ascii'
>>>

....

def listdir(dirname):
    if not isinstance(dirname, unicode):
        return os.listdir(dirname)
    encoding = filesystem_encodings.lookup(os.path.abspath(dirname))
    if encoding is None:
        return os.listdir(dirname)
    raw_dirname = dirname.encode(encoding)
    return [unicode(s, encoding) for s in os.listdir(raw_dirname)]

def getcwdu():
    if os.path.supports_unicode_filenames:
        return os.getcwdu()
    s = os.getcwd()
    encoding = filesystem_encodings.lookup(s)
    return unicode(s, encoding)

...

>>> os.path.abspath(".")
'/home/dalke/tmp'
>>> os.path.abspath(u".")
u'/home/dalke/tmp'
>>>
>>> s = u"1 to \N{INFINITY}"
>>> s.encode("utf8")
'1 to \xe2\x88\x9e'
>>> t = s.encode("utf8")
>>> os.mkdir(t)
>>> os.listdir(".")
['1 to \xe2\x88\x9e']
>>> os.listdir(u".")
['1 to \xe2\x88\x9e']
>>>
>>> listdir(".")
['1 to \xe2\x88\x9e']
>>> listdir(u".")
['1 to \u221e']
>>>

>>> os.chdir(t)


>>> os.getcwdu()
Traceback (most recent call last):
File "<stdin>", line 1, in ?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39:
ordinal not in range(128)
>>> getcwdu()
u'/home/dalke/tmp/1 to \u221e'
>>>

If this makes sense, should it be added to Python's core?

Andrew
da...@dalkescientific.com

Andrew Dalke

Feb 4, 2003, 2:15:42 AM
Paul Boddie wrote:
> I hadn't heard of 'od' before, so this is a useful piece of
> information. When accessing Red Hat Linux 7.3 on Intel with locale as
> en_US.iso885915, I can apparently create filenames with ISO-8859-15
> characters, and in the terminal program I'm using, these characters
> appear as question marks when switching locale to en_US.utf8. However,
> in the former locale, 'od -c' returns the characters as part of the
> "dump", whereas in the latter, 'od -c' returns the octal codes for
> those characters.

It doesn't look like I can handle LANG=en_US.utf8 very well

>>> s = u"1 to \N{INFINITY}"
>>> s.encode("utf8")
'1 to \xe2\x88\x9e'
>>> t = s.encode("utf8")
>>> os.mkdir(t)

>>> ^D
[dalke@zebulon src]$ ls -ld 1*
drwxr-xr-x 2 dalke users 4096 Feb 3 23:33 1 to ā

(note that the end of the filename shows two empty boxes on my
screen, which is what my terminal uses when it can't show the
right character.)

[dalke@zebulon src]$ echo 1* | od -cd
0000000 2031 6f74 e220 9e88 000a
1 t o 342 210 236 \n \0
0000011
[dalke@zebulon src]$

(Bleh. this is a little-endian machine, so the hex
characters should be "31 20 74 6f 20 e2 88 9e 0a 00" when
interpreted as characters. So you can see the characters
are exactly as was in the original string, which was the
UTF-8 encoding of the filename. If I use the unicode string
directly I get a 'cannot encode as ASCII' error.)
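(The expected byte sequence checks out -- a sketch in current syntax:)

```python
# The UTF-8 bytes of the directory name, matching the od output above
# once the little-endian 16-bit word swap is undone (sketch).
s = "1 to \N{INFINITY}"
print(s.encode("utf-8").hex(" "))  # 31 20 74 6f 20 e2 88 9e
```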

When I start nautilus with LANG=en_US.utf8 I get

[dalke@zebulon src]$ nautilus .

Gdk-WARNING **: locale not supported by Xlib, locale set to C

...

When I start Konqueror

[dalke@zebulon src]$ konqueror
Qt: Locales not supported on X server
qstring_to_xtp result code -2


I'm on RedHat 7.2, so it may be that 7.3 improves unicode support.

This is harder than I want it to be. Python! Make it just
work for me! :)

Andrew
da...@dalkescientific.com

Neil Hodgson

Feb 4, 2003, 7:02:31 AM
Andrew Dalke:

> Okay, so it seems like no one knows how to handle unicode filenames
> under Unix. Perhaps the following is the proper behaviour?
>

> ...


> 2) there is a registration system which is used to define encodings
> used for different mount locations. If a filename/dirname is
> not covered, use the default filesystem encoding

The encoding registry uses byte strings.

I'd hope there would be an attempt to discover file systems encodings
automatically such as reading /etc/fstab to find the utf8 flag mentioned.
Some Unix distributions (MacOS X, Red Hat 8.0) seem to be moving towards
making UTF-8 be the only exposed file system encoding.
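Such discovery could look something like this (a hypothetical sketch, not
an existing API; field order per fstab(5): device, mount point, type,
options, dump, pass):

```python
def utf8_mount_points(fstab_text):
    # Hypothetical sketch: report the mount points whose mount options
    # include the 'utf8' flag, scanning fstab-formatted text.
    points = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 4 and "utf8" in fields[3].split(","):
            points.append(fields[1])
    return points

# Using the fstab line quoted earlier in the thread:
print(utf8_mount_points(
    "/dev/hda5 /eff vfat auto,shortname=winnt,utf8,owner 0 0"))
# ['/eff']
```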

> def listdir(dirname):
>     if not isinstance(dirname, unicode):
>         return os.listdir(dirname)
>     encoding = filesystem_encodings.lookup(os.path.abspath(dirname))

How does os.path.abspath deal with a Unicode string?

> If this makes sense, should it be added to Python's core?

There are quite a few calls that need to change - from the file
constructor to stat ...

To be robust it needs to deal with multiple encodings in a path.

Neil


Andrew Dalke

Feb 4, 2003, 11:39:57 AM
Neil Hodgson wrote:
> The encoding registry uses byte strings.

True. I meant it mostly as a sketch of a solution.

> How does os.path.abspath deal with a Unicode string?

Err, ummm, I didn't yet include that wrapper function? Don't
believe me? How about that I forgot? :)

> > If this makes sense, should it be added to Python's core?
>
> There are quite a few calls that need to change - from the file
> constructor to stat ...
>
> To be robust it needs to deal with multiple encodings in a path.

Yep, and yep.

I think I have just shown that I'm not the perfect candidate to
do so ;)

BTW, how do I test your assertion the RedHat uses UTF-8 for filename
encoding? I can't figure that out. I did figure out one problem
is that I need to say "en_US.UTF-8" instead of "en_US.utf-8".

Andrew
da...@dalkescientific.com

Neil Hodgson

Feb 5, 2003, 7:28:53 AM
Andrew Dalke:

> BTW, how do I test your assertion the RedHat uses UTF-8 for filename
> encoding? I can't figure that out. I did figure out one problem
> is that I need to say "en_US.UTF-8" instead of "en_US.utf-8".

Red Hat defaults the locale to <something>.UTF-8 which on my machine is
en_US.UTF-8. Then they ship most of the utilities and bundled applications
compiled or configured for UTF-8. Mostly this depends on using GNOME 2
applications which use UTF-8 as their normal string type.

For PEP 277, there was a test script which produces files with Unicode
names on Windows. I have changed it a bit to run on Linux producing UTF-8
names.
The Linux version of the script:
http://scintilla.sourceforge.net/unilin.py
A screenshot from Windows with what the files look like when the original
script is run along with the Linux script displayed in SciTE:
http://scintilla.sourceforge.net/winss.png
A screenshot from Red Hat Linux 8.0 with, on the left, Nautilus showing a
directory on VFAT where the Windows script was run (displaying the ASCII,
European, and Cyrillic well, the Greek with one problem on an accented
character, the Hebrew invisibly, and the Japanese and Chinese as code
blocks), Nautilus showing a directory on ext3 where the Linux script was run
(similar to VFAT case), an ls in a console (Cyrillic and European are
displayed well). On the right hand side are two editors, gedit is a GNOME 2
application so works similarly to Nautilus; SciTE is a GTK+ 1.x application
with some Unicode fontset support.
http://scintilla.sourceforge.net/linuxss.png
Linux would look a lot better if I had some Asian Unicode fonts
installed.

Neil


Andrew Dalke

Feb 6, 2003, 3:30:01 AM
Neil Hodgson wrote:
> Red Hat defaults the locale to <something>.UTF-8 which on my machine is
> en_US.UTF-8.

My default LANG is "en_US" which is, I believe, a Latin-1 encoding.
I'm running RH 7.2. You say you use 8.0....

Yep, looks like
http://www.redhat.com/docs/manuals/linux/RHL-8.0-Manual/release-notes/x86/
concurs that that was an 8.0 change

] Red Hat Linux now installs using UTF-8 (Unicode) locales by default in
] languages other than Chinese, Japanese, or Korean.
]
] This has been known to cause various issues:
]
] · Line drawing characters in applications such as make menuconfig
] do not always appear correctly in certain locales.
]
] · On the console, the latarcyrheb-sun16 font is used for best Unicode
] coverage. Due to the use of this font, bold colors are not available.
]
] · Certain third party applications, such as the Adobe® Acrobat
] Reader®, may not function correctly (or crash upon startup) because
] they lack support for Unicode locales. Until third party developers
] provide such support in their products, you may work around this
] issue by setting the LANG environment variable at the shell prompt to
] C prior to typing the application name. For example:
]
] env LANG=C acroread

Hence my difficulties stem partially from using a too-old RH install.


> A screenshot from Red Hat Linux 8.0 with, on the left, Nautilus showing a
> directory on VFAT where the Windows script was run (displaying the ASCII,
> European, and Cyrillic well, the Greek with one problem on an accented
> character, the Hebrew invisibly, and the Japanese and Chinese as code
> blocks), Nautilus showing a directory on ext3 where the Linux script was run
> (similar to VFAT case), an ls in a console (Cyrillic and European are
> displayed well). On the right hand side are two editors, gedit is a GNOME 2
> application so works similarly to Nautilus; SciTE is a GTK+ 1.x application
> with some Unicode fontset support.
> http://scintilla.sourceforge.net/linuxss.png

What does 'os.listdir()' do for that directory? I assume it's the byte
strings, which means I need to do the UTF-8 conversion myself, which
means dealing with unicode filenames on non-MS Windows machines is still
complicated for Python.

At the very least, it's more confusing than I prefer dealing with.

Andrew
da...@dalkescientific.com

Beni Cherniavsky

Feb 6, 2003, 6:16:25 AM
On 2003-02-03, Andrew Dalke wrote:

> Okay, so it seems like no one knows how to handle unicode filenames
> under Unix.

Since unix can't afford to change all APIs and programs like windows did
(the mess that resulted explains why <wink>), unix must stay with the
byte-oriented filenames at the low level. This ensures that all programs
that store file names in files, etc., continue to work. UTF-8 is the only
encoding that can represent all of unicode while satisfying these needs,
so everybody should migrate to UTF-8 filenames (CJK users might have
reservations about this; I'd be happy to learn their opinion).

In the transition period, many people still use other encodings,
sometimes different on different mounts. Since filenames are
frequently stored in files, programs will break if the filename
encoding is different on different mountpoints. If you suggest
supporting that in programs, you effectively require that all utilities
like ls, find, xargs, etc. learn to convert filenames. For if they
don't, things will break: e.g. find will produce output in a mix of
encodings, which can't be fixed. But that's too much work! The only
chance to do that is in glibc - but it will subtly upset a lot of
programs in any case. This also implies that the filename encoding
must be the same as the standard I/O encoding (that's why there is
LC_FILENAME_CTYPE).

So *please*, expect the user to configure all mounts to use the same
encoding, the one he is using in his locale. It's not hard.
Otherwise he will not be able to work with other programs anyway...
And that encoding best be UTF-8, of course.

> Perhaps the following is the proper behaviour?
>
> 1) there is a default filesystem encoding, which is initialized
> to None if os.path.supports_unicode_file is True, otherwise
> it's set to sys.getdefaultencoding()
>

Yep. For corner cases, it should be settable. And I don't like the
name, it should say "filename" instead of "file" (that prompts for
shortening some other part, like "supports").

One important point: files with names illegal in this encoding must
not become inaccessible. Instead of raising exceptions, Python's
library should just fall back and return the byte string.
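That fall-back rule can be written as a tiny helper (a sketch; `decode_filename` is an invented name, and the mixed unicode/byte-string return type is exactly the behaviour being proposed):

```python
def decode_filename(raw, encoding='utf-8'):
    # Return a unicode name when the bytes are legal in the assumed
    # filename encoding; otherwise fall back to the raw byte string,
    # so no existing file becomes inaccessible.
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw

# A valid UTF-8 name decodes; an illegal one survives as bytes.
assert decode_filename(b'sp\xc3\xa5r') == u'sp\xe5r'
assert decode_filename(b'sp\xe5r') == b'sp\xe5r'
```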

> 2) there is a registration system which is used to define encodings
> used for different mount locations. If a filename/dirname is
> not covered, sue the default filesystem encoding
>

No way! See above. Instead of fixing a couple of places (fstab,
nfs&samba conf) you are trying to fix this in every single application
running in the system.

> 3) a) when the input dirname or filename is a string, use the
> current behaviour
> b) when unicode, use the encoding from 2 (may have to get
> the absolute path name ... don't like this part of it.
> Perhaps the call to #2 should only be done for full paths?)
>

No #2, no problem <wink>.

> If this makes sense, should it be added to Python's core?
>

+1.

--
Beni Cherniavsky <cb...@tx.technion.ac.il>

Do not feed the Bugzillas.

Carlos Ribeiro

Feb 6, 2003, 8:44:52 AM
On Thursday 06 February 2003 11:16 am, Beni Cherniavsky wrote:
> Since unix can't afford to change all APIs and programs like windows did
> (the mess that resulted explains why <wink>), unix must stay with the
> byte-oriented filenames at the low level. This ensures that all programs
> that store file names in files, etc., continue to work. UTF-8 is the only
> encoding that can represent all of unicode while satisfying these needs,
> so everybody should migrate to UTF-8 filenames (CJK users might have
> reservations about this; I'd be happy to learn their opinion).

Sorry. It would be a big mess. Here in Brazil, I can safely assume that it is
nearly impossible to find a computer *without* filenames with latin-1
accented characters. Not to mention the problems that we have when mounting
FAT partitions under Linux - many Unix users still need to use dual boot
machines in order to use a few Windows apps.

In my opinion, this is the type of problem that has to be solved at its root,
by slowly migrating the filesystem itself to accept only UTF-8 filenames. All
conversions during the migration phase have to be done by the operating
system itself; when moving files from one FS to the other, it would do the
necessary conversions. It's not going to be easy, though.


Carlos Ribeiro
crib...@mail.inet.com.br

Beni Cherniavsky

Feb 6, 2003, 9:29:32 AM
On 2003-02-06, Carlos Ribeiro wrote:

> On Thursday 06 February 2003 11:16 am, Beni Cherniavsky wrote:

> > Since unix can't afford to change all APIs and programs like windows did
> > (the mess that resulted explains why <wink>), unix must stay with the
> > byte-oriented filenames at the low level. This ensures that all programs
> > that store file names in files, etc., continue to work. UTF-8 is the only
> > encoding that can represent all of unicode while satisfying these needs,
> > so everybody should migrate to UTF-8 filenames (CJK users might have
> > reservations about this; I'd be happy to learn their opinion).
>

> Sorry. It would be a big mess. Here in Brazil, I can safely assume that it is
> nearly impossible to find a computer *without* filenames with latin-1
> accented characters. Not to mention the problems that we have when mounting
> FAT partitions under Linux - many Unix users still need to use dual boot
> machines in order to use a few Windows apps.
>

If you use latin1 everywhere on the computer, you are OK too. Just don't
have one directory in latin1, another in latin8 and another in UTF-8.

If and when you decide to convert to UTF-8, you can run one script to
convert the whole filesystem. The problem will be with remaining
filenames lurking in files (e.g. playlists). That most probably requires
a period of manual fix-as-it-breaks after the conversion...

> In my opinion, this is the type of problem that has to be solved at its root,
> by slowly migrating the filesystem itself to accept only UTF-8 filenames. All
> conversions during the migration phase have to be done by the operating
> system itself; when moving files from one FS to the other, it would do the
> necessary conversions. It's not going to be easy, though.
>

No encoding conversion is easy :-(.

Andrew Dalke

Feb 6, 2003, 3:24:00 PM
Beni Cherniavsky <cb...@techunix.technion.ac.il>:

> unix must stay with the
> byte-oriented filenames at the low level. This ensures that all programs
> that store file names in files, etc., continue to work.

> In the transition period, many people still use other encodings,


> sometimes different on different mounts. Since filenames are
> frequently stored in files, programs will break if the filename
> encoding is different on different mountpoints. If you suggest
> supporting that in programs, you effectively require that all utilities
> like ls, find, xargs, etc. learn to convert filenames.

I don't know if you saw the example code I posted earlier. My
suggestions were only meant for Python, hence the comments about
ls, find, xargs, etc. are of no concern.

I was also only concerned with low-level functions which deal with
the filesystem. Right now for unix these take byte strings, not
unicode strings. The behaviour I requested would only be triggered if
a unicode string was passed in, as in 'os.listdir(u".")', or if
os.getcwdu() was called. Hence it should break no existing applications
because they don't pass in unicode filenames, except those which are
identically represented in 7-bit ASCII.

> So *please*, expect the user to configure all mounts to use the same
> encoding, the one he is using in his locale. It's not hard.
> Otherwise he will not be able to work with other programs anyway...
> And that encoding best be UTF-8, of course.

In that I differed. In my naive view, I had a registration system
for directory locations, so different mount points could have different
encodings. Eg, I don't know if NFS mounts support unicode recoding.

>> 1) there is a default filesystem encoding, which is initialized
>> to None if os.path.supports_unicode_file is True, otherwise
>> it's set to sys.getdefaultencoding()
>>
> Yep. For corner cases, it should be settable. And I don't like the
> name, it should say "filename" instead of "file" (that prompts for
> shortening some other part, like "supports").

I have no ability to change that name -- it's already in Python 2.3.

I don't like 'filename' as it can also refer to a directory name. OTOH,
few should care about this name so I don't think it's that important.

> One important point: files with names illegal in this encoding must
> not become inaccessible. Instead of raising exceptions, Python's
> library should just fall back and return the byte string.

So have mixed unicode and byte strings returned from a function? Yech.
Python's unicode functions have an error mode which can describe how
to handle this case. I would rather the conversion functions allow
passing in that parameter as well.

If you want access to the raw filesystem,

os.listdir(unicode_string.encode("utf-8"))

should still return a list of raw byte strings. My proposal is to
make it easy to handle unicode filenames for those who don't expect
to handle all the corner cases, and let fans of the details still
have access to those details, with a bit more work.
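Concretely, using a scratch directory (a sketch; today's os.listdir keeps exactly this split, returning byte-string names for a byte-string argument and unicode names for a unicode one):

```python
import os
import tempfile

scratch = tempfile.mkdtemp()
open(os.path.join(scratch, 'abc'), 'w').close()

# Encoding the path yourself keeps everything at the byte level:
raw_names = os.listdir(scratch.encode('utf-8'))
assert all(isinstance(name, bytes) for name in raw_names)

# A unicode argument asks for the decoded, convenient view instead.
assert 'abc' in os.listdir(scratch)
```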

>> 2) there is a registration system which is used to define encodings
>> used for different mount locations. If a filename/dirname is
>> not covered, sue the default filesystem encoding
>>
> No way! See above. Instead of fixing a couple of places (fstab,
> nfs&samba conf) you are trying to fix this in every single application
> running in the system.

No, I am not. This is a per-Python registry only. It would only be used
by very few people, e.g. when building apps which try to handle the
filesystem in the best way they can.

I am not happy with it because I don't see a good way for it to
work. I include it because it's the most general solution I could
consider, and I didn't want it to be ignored by accident.

>> If this makes sense, should it be added to Python's core?
>>
> +1.

Still waiting for Martin "Herr Unicode" van Löwis to comment. I
hear he's on vacation ...

Andrew
da...@dalkescientific.com

Neil Hodgson

Feb 6, 2003, 4:50:45 PM
Andrew Dalke:

> What does 'os.listdir()' do for that directory? I assume it's the byte
> strings, which means I need to do the UTF-8 conversion myself, which
> means dealing with unicode filenames on non-MS Windows machines is still
> complicated for Python.

The byte strings:

['abc', 'ascii', 'Gr\xc3\xbc\xc3\x9f-Gott', 'unilin.py',
'\xce\x93\xce\xb5\xce\xb9\xce\xac-\xcf\x83\xce\xb1\xcf\x82',
'\xd0\x97\xd0\xb4\xd1\x80\xd0\xb0\xd0\xb2\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd0\xb9\xd1\x82\xd0\xb5',
'\xe3\x81\xab\xe3\x81\xbd\xe3\x82\x93',
'\xd7\x94\xd7\xa9\xd7\xa7\xd7\xa6\xd7\xa5\xd7\xa1',
'\xe6\x9b\xa8\xe6\x9b\xa9\xe6\x9b\xab',
'\xe6\x9b\xa8\xd7\xa9\xe3\x82\x93\xd0\xb4\xce\x93\xc3\x9f']

>
> At the very least, it's more confusing than I prefer dealing with.

Yes, same here. I don't think Python is the right level to solve it.

Neil


Piet van Oostrum

Feb 16, 2003, 1:24:54 PM
>>>>> David Eppstein <epps...@ics.uci.edu> (DE) wrote:

DE> Under Mac OS X, the shell displays text (e.g. from cat, or from ls
DE> without the -q option) as utf-8 by default, and the Finder (gui file
DE> browser) uses utf-8 for accented characters in file names. So I infer
DE> that the correct interpretation of filenames under my OS is utf-8.
DE> But other unixes may differ...

On Mac OS X, it is a bit more complicated. First cat will indeed show the
unicode (utf-8) contents of a file, but ls won't display filenames with
non-ASCII characters right. At least not in 10.1.5. Maybe 10.2 does it better.
Like if my filename is "€200", ls will display "???200".

Secondly, the filesystem requires the unicode characters to be normalized,
which means that accented characters like "é" will be broken up into "e"
followed by "´". So if the finder has a file with name "é200", the bytes
used in the filename will be 0x65 followed by 0xCC 0x81 (unicode character
0x301). ls will print this as "e??200".
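That decomposition can be checked with the standard unicodedata module (a small sketch; NFD is the decomposed form the HFS+ filesystem stores, NFC the precomposed one):

```python
import unicodedata

precomposed = u'\xe9200'   # u'é200'
decomposed = unicodedata.normalize('NFD', precomposed)

# 'é' splits into 'e' followed by the combining acute, U+0301 ...
assert decomposed == u'e\u0301200'
# ... whose UTF-8 bytes are exactly the 0x65 0xCC 0x81 quoted above.
assert decomposed.encode('utf-8') == b'e\xcc\x81200'
# NFC recombines the pair into the single accented character.
assert unicodedata.normalize('NFC', decomposed) == precomposed
```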

And in the shell I can't even type a € sign or é. That, however, is a
problem of the Terminal application, as I can do it in emacs.

Although ... after I tried it out, and wanted to send this article out, my
emacs crashed (fortunately after saving it).
--
Piet van Oostrum <pi...@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van....@hccnet.nl

Just

Feb 16, 2003, 4:09:41 PM
In article <wzptps1...@nono.cs.uu.nl>,

Piet van Oostrum <pi...@cs.uu.nl> wrote:

> >>>>> David Eppstein <epps...@ics.uci.edu> (DE) wrote:
>
> DE> Under Mac OS X, the shell displays text (e.g. from cat, or from ls
> DE> without the -q option) as utf-8 by default, and the Finder (gui file
> DE> browser) uses utf-8 for accented characters in file names. So I infer
> DE> that the correct interpretation of filenames under my OS is utf-8.
> DE> But other unixes may differ...
>
> On Mac OS X, it is a bit more complicated. First cat will indeed show the
> unicode (utf-8) contents of a file, but ls won't display filenames with
> non-ASCII characters right. At least not in 10.1.5. Maybe 10.2 does it better.
> Like if my filename is "€200", ls will display "???200".

Although Terminal.app supports utf-8 in 10.2, what you describe is
still true.

> Secondly, the filesystem requires the unicode characters to be normalized,
> which means that accented characters like "é" will be broken up into "e"
> followed by "´". So if the finder has a file with name "é200", the bytes
> used in the filename will be 0x65 followed by 0xCC 0x81 (unicode character
> 0x301). ls will print this as "e??200".

You don't have to worry about that: the file system will _give_ you
normalized unicode, but it does the right thing if you feed it
non-normalized unicode.

Btw. in 2.3 (current CVS, not a1), the file system calls fully support
unicode strings on OSX. I've also got a patch pending that makes
os.listdir() return unicode strings when appropriate:
http://python.org/sf/683592. I think this has a fair chance to make it
in.

Just

Erik Max Francis

Feb 16, 2003, 4:46:51 PM
Piet van Oostrum wrote:

> On Mac OS X, it is a bit more complicated. First cat will indeed show
> the
> unicode (utf-8) contents of a file, but ls won't display filenames
> with
> non-ASCII characters right. At least not in 10.1.5. Maybe 10.2 does it
> better.

> Like if my filename is "€200", ls will display "???200".

This is ls-specific behavior, and has nothing to do with what the
filename actually is.

--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE

/ \ If a thing is worth doing, then it is worth doing badly.
\__/ G.K. Chesterton
Church / http://www.alcyone.com/pyos/church/
A lambda calculus explorer in Python.

David Eppstein

Feb 17, 2003, 1:28:05 AM
In article <3E5006CB...@alcyone.com>,

Erik Max Francis <m...@alcyone.com> wrote:

> Piet van Oostrum wrote:
> > On Mac OS X, it is a bit more complicated. First cat will indeed
> > show the unicode (utf-8) contents of a file, but ls won't display
> > filenames with non-ASCII characters right. At least not in 10.1.5.
> > Maybe 10.2 does it better. Like if my filename is "€200", ls will
> > display "???200".
>
> This is ls-specific behavior, and has nothing to do with what the
> filename actually is.

Specifically, ls will do this by default when output is to a terminal.
Probably on OS X this behavior should be removed. Anyway, I have in my
.cshrc the line

alias ls 'ls -v'

which prevents the question mark replacement from happening.

--
David Eppstein UC Irvine Dept. of Information & Computer Science
epps...@ics.uci.edu http://www.ics.uci.edu/~eppstein/

Michael Hudson

Feb 17, 2003, 7:47:10 AM
Just <ju...@xs4all.nl> writes:

> Although Terminal.app supports utf-8 in 10.2, what you describe
> is still true.

Terminal.app has always supported utf8, I think. Certainly it does in
10.1.x.

Cheers,
M.

--
The only problem with Microsoft is they just have no taste.
-- Steve Jobs, (From _Triumph of the Nerds_ PBS special)
and quoted by Aahz Maruch on comp.lang.python

Just

Feb 17, 2003, 12:55:42 PM
In article <7h3ptpr...@pc150.maths.bris.ac.uk>,
Michael Hudson <m...@python.net> wrote:

> Just <ju...@xs4all.nl> writes:
>
> > Although Terminal.app supports utf-8 in 10.2, what you describe
> > is still true.
>
> Terminal.app has always supported utf8, I think. Certainly it does in
> 10.1.x.

Perhaps it was just the default encoding that changed. I think the
default encoding was latin-1 in 10.1.x, and utf-8 in 10.2.
Could be wrong, though.

Just

Mike Meyer

Feb 17, 2003, 5:05:50 PM
David Eppstein <epps...@ics.uci.edu> writes:

> Specifically, ls will do this by default when output is to a terminal.
> Probably on OS X this behavior should be removed. Anyway, I have in my
> .cshrc the line
>
> alias ls 'ls -v'
> which prevents the question mark replacement from happening.

I just set the LANG environment variable to en_US.ISO8859-1, which
causes the characters to be listed using the correct encoding for my
system. It may be different - and even a different environment
variable - in OSX.

<mike
--
Mike Meyer <m...@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Piet van Oostrum

Feb 18, 2003, 10:26:18 AM
>>>>> Just <ju...@xs4all.nl> (J) wrote:

J> Perhaps it was just the default encoding that changed. I think the
J> default encoding was latin-1 by default in 10.1.x, and utf-8 in 10.2.
J> Could be wrong, though.

I had it set to utf-8, but it appeared to be an ls problem: ls -v gives
utf-8 filenames. Now if only emacs' dired-mode displayed accented chars
correctly (it doesn't combine the letter with the accent). But this is
getting far from python.

Martin v. Löwis

Mar 2, 2003, 6:58:36 AM
Andrew Dalke <ada...@mindspring.com> writes:

> Okay, so it seems like no one knows how to handle unicode filenames
> under Unix. Perhaps the following is the proper behaviour?

"Unix" is too wide a term here. Different *installations* of the very
same software product may use different means to represent non-ASCII
characters in file names (even different directories in the same
installation); how to interpret them is all convention. Python is
somewhat at a loss when guessing the "right" thing.

The emerging convention is that the locale's codeset determines the
encoding of file names. This convention is used in a number of Linux
distributions, and other Unices.

> 1) there is a default filesystem encoding, which is initialized
> to None if os.path.supports_unicode_file is True, otherwise
> it's set to sys.getdefaultencoding()

Since Python 2.2 (I believe), invoking locale.setlocale will set the
file system default encoding to what the system's nl_langinfo(CODESET)
returns - provided the system has both nl_langinfo and CODESET.
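That mechanism can be observed from a script (a sketch for a Unix box; locale.nl_langinfo and CODESET are not available on every platform, as Martin notes):

```python
import locale
import sys

# Adopt the user's locale settings; until this call the default
# C locale (plain ASCII) is in effect.
locale.setlocale(locale.LC_ALL, '')

# The locale's codeset, e.g. 'UTF-8' in an en_US.UTF-8 locale.
codeset = locale.nl_langinfo(locale.CODESET)

# The encoding Python uses to translate unicode filenames for the OS.
print(codeset, sys.getfilesystemencoding())
```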

> 2) there is a registration system which is used to define encodings
> used for different mount locations. If a filename/dirname is
> not covered, sue the default filesystem encoding

Ok, I'll sue :-)

Such a scenario should not be supported. The encoding should be
uniform in all components of a path, and it is the system
administrator's task to make sure this is the case.

> If this makes sense, should it be added to Python's core?

Not in the way you have described it. Because Unix is tricky (and NT+
is much more advanced) in this respect, the existing PEP deliberately
targets NT+ only, leaving Unix for further study.

Regards,
Martin
