Any portable way to get a filename in UTF-8 or to get the FS encoding?

Timothy Madden

Oct 7, 2007, 3:22:12 PM
Hello,

I am trying to devise a simple tool in which I read many directory and
file names (to compare two directories).

I have never written code meant to be ported to different systems, but I
would not mind doing so now.

So I downloaded and read SUSv2 and SUSv3 to look up the
opendir/readdir/closedir functions, but they only return char[] strings
for file names and say nothing about the encoding of those names.

A computer system may mount and/or access many kinds of file systems.
NTFS, as far as I know, is a Unicode file system (sorry, I do not know how
ufs or extfs handle this). When mounting FAT file systems one can explicitly
specify a charset for all the file names.

I have seen a _wreaddir function in some implementations, but is there a
portable way to get a file's name in UTF-8, or to get a file name in the
underlying encoding of its file system and to get that encoding?

Are POSIX implementations required to convert the file name returned by
readdir to the application's execution character set?

Thank you,
Timothy Madden,
Romania

Robert Harris

Oct 7, 2007, 4:09:26 PM

A filename is just a NUL-terminated byte string, which is completely
compatible with UTF-8 (and with most other character encodings).

So if files are created in a UTF-8 locale, the filenames will be encoded
already in UTF-8. If not, then use iconv (or something like it) to convert.
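A minimal sketch of the iconv suggestion, assuming the source encoding is already known by some other means (nothing in the file system supplies it). The function name `name_to_utf8` and its parameters are made up for illustration; only `iconv_open`, `iconv` and `iconv_close` are real library calls.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated file name from the encoding `from` to UTF-8.
 * Returns 0 on success, -1 on failure (unknown encoding, bytes that are
 * invalid in `from`, or output buffer too small).
 * On success `out` is NUL-terminated. */
int name_to_utf8(const char *from, const char *name, char *out, size_t outsize)
{
    iconv_t cd;
    char *in = (char *)name;        /* iconv() takes char **, not const char ** */
    size_t inleft = strlen(name);
    char *dst = out;
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return -1;
    outleft = outsize - 1;          /* reserve room for the terminator */

    cd = iconv_open("UTF-8", from); /* argument order: to, from */
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &in, &inleft, &dst, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *dst = '\0';
    return 0;
}
```

For example, `name_to_utf8("ISO-8859-1", name, buf, sizeof buf)` re-encodes a Latin-1 name. On some systems iconv lives in a separate library and the program has to be linked with `-liconv`.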

Robert

Alan Curry

Oct 7, 2007, 4:30:35 PM
In article <470931e0$0$90263$1472...@news.sunsite.dk>,
Timothy Madden <termin...@gmail.com> wrote:
>Hello,

>
>So I download and read sus v2 and sus v3 to see the
>openddir/readdir/closedir functions, but they only return char[] strings
>for file names and they say nothing about the encoding of the file names.

And they shouldn't! Filenames are made of bytes, not characters. 0x2f (the
directory separator) and 0x00 (the string terminator) are special. The rest
is just bytes.

Think about what it would mean for filenames to be bound to a specific
character set. open(), instead of being a plain syscall, would have to do
character set translation. Ouch! Or you'd have to do translation in the
kernel. Double ouch!

readdir() returns the same bytes that were passed to creat(). You wanna know
what the bytes mean? Ask the guy who named the file.

>
>A computer system may mount and/or access many kinds of file systems.
>NTFS as I know is an UNICODE file system (Sorry I do not know how ufs or
>extfs are). When mounting FAT systems one can explicitly specify a
>charset for all the file names.

When mounting non-unix filesystems, sometimes we emulate the brokenness of
the creating OS, using ugly hacks like bloating the kernel with character
translation tables and making open() reject perfectly legitimate filenames
that contain bytes which would upset the poor, easily confused, non-unix OS.

When unix is being itself, on its own well-designed filesystems, there's no
need for such behavior. Latin-1 filenames can sit right next to UTF-8
filenames and they don't bother each other, because the kernel doesn't care.

If you perceive some benefit in knowing that all filenames in a directory are
in a common character set, that can be achieved by agreement between you and
the other users who put files in that directory. Much better than inserting a
complicated translation mechanism into the various syscalls that deal with
filenames.

--
Alan Curry
pac...@world.std.com

Timothy Madden

Oct 7, 2007, 4:32:00 PM
to Robert Harris
Robert Harris wrote:
> A filename is just a NUL terminated string which is completely
> compatible with UTF-8 (and with most other character encodings).
>
> So if files are created in a UTF-8 locale, the filenames will be encoded
> already in UTF-8. If not, then use iconv (or something like it) to convert.
>

How would I know if files are created in UTF-8 locale ?

How would I know if readdir has converted the filename from its encoding
in the filesystem to the application execution character set or if it
has converted the file name to UTF-8 or if it has returned the filename
in its native encoding ?

Timothy Madden

Oct 7, 2007, 4:48:58 PM
to Alan Curry
Alan Curry wrote:
> In article <470931e0$0$90263$1472...@news.sunsite.dk>,
> Timothy Madden <termin...@gmail.com> wrote:
>> Hello,
>>
>> So I download and read sus v2 and sus v3 to see the
>> openddir/readdir/closedir functions, but they only return char[] strings
>> for file names and they say nothing about the encoding of the file names.
>
> And they shouldn't! Filenames are made of bytes, not characters. 0x2f (the
> directory separator) and 0x00 (the string terminator) are special. The rest
> is just bytes.
>

Are you saying that filenames are binary data ?
Are you sure about that ?
Can I read that somewhere in SUS or in a man page or anything ?

And how does the OS convert that data to strings? Every application I
have ever used, including the OS shell, displays filenames as text. How
do all applications convert that binary data to text when they display
file names? Do they just leave it to printf to use the encoding of the
current locale on output?

Alan Curry

Oct 7, 2007, 5:48:14 PM
In article <4709463A...@gmail.com>,

Timothy Madden <termin...@gmail.com> wrote:
>Alan Curry wrote:
>> In article <470931e0$0$90263$1472...@news.sunsite.dk>,
>>
>> And they shouldn't! Filenames are made of bytes, not characters. 0x2f (the
>> directory separator) and 0x00 (the string terminator) are special. The rest
>> is just bytes.
>>
>
>Are you saying that filenames are binary data ?

Everything in computers is binary data.

>Are you sure about that ?
>Can I read that somewhere in SUS or in a man page or anything ?
>
>And how does the OS convert that data to strings ? I mean any

Strings are binary data too. 0x2f is the slash character in ASCII, in case
you didn't realize that the first time I mentioned it. The reason I called it
0x2f instead of slash was to help make the point: the kernel understands the
'/' character to be the directory separator, not because it looks like a
diagonal line from top-right to bottom-left when printed on your terminal,
but because it's 0x2f. If you wanted to use an exotic character set that was
not a superset of ASCII, you could. But character 0x2f would still be the
directory separator, so you couldn't use it in a filename.

>application that I ever used, including the OS shell, displays filenames
>as text. How do all applications convert that binary data to text when

When the byte sequence 0x66 0x6f 0x6f 0x2f 0x62 0x61 0x72 is sent to your
terminal, it looks like "foo/bar". There's no converting at all!

When the byte sequence 0xc4 0xbf is sent to my terminal, it looks like a
capital A with 2 dots on it followed by an upside-down question mark. That's
how those bytes are rendered in Latin-1. If I was using a UTF-8 terminal, it
would look like something else ("LATIN CAPITAL LETTER L WITH MIDDLE DOT" if
I'm interpreting my Unicode correctly).

If I now run this little program:

#include <fcntl.h>
int main(void) { creat("\xc4\xbf", 0666); return 0; }

I'll have a file whose name is composed of those 2 bytes. If I run "ls" I'll
see the A with 2 dots and upside down question mark. If you come along with a
UTF-8 terminal and run "ls" in the same directory, you'll see that funky
L-dot thing. Which one is correct? Both!
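The two readings of those bytes can be worked out by hand. The sketch below does the arithmetic for this one case; both function names are made up, and the UTF-8 helper handles only the two-byte form.

```c
/* Latin-1: every byte is its own code point. */
unsigned long latin1_code_point(unsigned char byte)
{
    return byte;
}

/* UTF-8, two-byte form only: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
 * Returns the code point, or 0 if the pair is not a well-formed
 * two-byte sequence. */
unsigned long utf8_two_byte_code_point(unsigned char b0, unsigned char b1)
{
    unsigned long cp;

    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return 0;
    cp = ((unsigned long)(b0 & 0x1F) << 6) | (unsigned long)(b1 & 0x3F);
    return cp >= 0x80 ? cp : 0;     /* reject overlong encodings */
}
```

Read as Latin-1, 0xc4 and 0xbf are the two code points U+00C4 and U+00BF. Read as UTF-8, the same pair is the single code point (0x04 << 6) | 0x3f = U+013F. The bytes on disk are identical in both cases.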

The bytes being shown are the same. You can look at the contents of the file
by typing "cat <funky-L-dot-thing>" on your terminal. (It'll probably be
easier to cut and paste the weird character than figure out how to type it).
I can likewise "cat" the file by pasting the 2 characters that were displayed
by my "ls". Does it matter that we're not seeing the same graphical
representation of the filename?

If that does matter, the only way to fix it is for us to have an agreement on
what character set is used for filenames. That agreement could be made by the
person with the UTF-8 terminal to find the other person and yell "Upgrade
your terminal and stop making those ugly non-UTF-8 filenames, you jerk!"
while beating him on the head with a rolled-up newspaper. In the future, he
can expect increased probability that readdir() will return UTF-8 names. This
is something the OS does not need to know about.

>they display file names? They just leave printf to use the encoding from
>the current locale when output ?

I don't think printf does any conversions either. It's just a matter of the
terminal (or graphical text widget) converting a sequence of bytes into a
sequence of glyphs based on its configured character set. The filesystem
doesn't know what character set that is. If it's not the same character set
that was used by the person who named the file in the first place, it won't
look the same.

(If you're experimenting, note that "ls" may actually show question marks if
it thinks your terminal won't recognize a filename as a printable character
sequence. That's not because of any translation that the OS is doing. It's
just "ls" trying to be friendly and not mess up your terminal with control
codes.)

--
Alan Curry
pac...@world.std.com

Timothy Madden

Oct 7, 2007, 9:04:14 PM
to Alan Curry
Alan Curry wrote:
> In article <4709463A...@gmail.com>,
> Timothy Madden <termin...@gmail.com> wrote:
>> Alan Curry wrote:
>>> In article <470931e0$0$90263$1472...@news.sunsite.dk>,
>>>
>>> And they shouldn't! Filenames are made of bytes, not characters. 0x2f (the
>>> directory separator) and 0x00 (the string terminator) are special. The rest
>>> is just bytes.
>>>
>> Are you saying that filenames are binary data ?
>
> Everything in computers is binary data.
>
>> Are you sure about that ?
>> Can I read that somewhere in SUS or in a man page or anything ?
>>
>> And how does the OS convert that data to strings ? I mean any
[...]

>> application that I ever used, including the OS shell, displays filenames
>> as text. How do all applications convert that binary data to text when
>
> When the byte sequence 0x66 0x6f 0x6f 0x2f 0x62 0x61 0x72 is sent to your
> terminal, it looks like "foo/bar". There's no converting at all!
>
> When the byte sequence 0xc4 0xbf is sent to my terminal, it looks like a
> capital A with 2 dots on it followed by an upside-down question mark. That's
> how those bytes are rendered in Latin-1. If I was using a UTF-8 terminal, it
> would look like something else ("LATIN CAPITAL LETTER L WITH MIDDLE DOT" if
> I'm interpreting my Unicode correctly).
>
> If I now run this little program:
>
> main(){ creat("\xc4\xbf", 0666); }
>
> I'll have a file whose name is composed of those 2 bytes. If I run "ls" I'll
> see the A with 2 dots and upside down question mark. If you come along with a
> UTF-8 terminal and run "ls" in the same directory, you'll see that funky
> L-dot thing. Which one is correct? Both!

I think this is a problem with the POSIX/SUS standard, if this
behavior is required by the standard.

I find it normal to see the same file name no matter what terminal I
have (as long as it has the glyphs), no matter what computer I use to
access the file system, as long as it has the proper software, no matter
what my current character encoding is on my system. Would you not like
that ?

The current encoding on my computer is Latin-2 (for Romanian language),
still my computer can display text encoded in Latin-1, UTF-8, UTF-16 and
other encodings. So if I read messages in a newsgroup and I see a
message written by a person from Japan, encoded in UTF-8, I can still
see the same text the person wrote, even if UTF-8 is different from
Latin-2. I just need to know the message is encoded in UTF-8, no matter
what my current encoding is.

The same thing should happen with filenames. I think filenames are text
just as much as a word from that message is text. It is just the POSIX
standard that thinks otherwise, and that should be fixed.

Only some byte sequences can encode characters in UTF-8. Others are for
example reserved for future code points in Unicode. This shows I could
have POSIX filenames that cannot even be sent to that "UTF-8 terminal"
you were talking about. Would you not like POSIX to fix the situation?

Not to mention that various implementations already offer non-standard
functions for reading and writing file names in UTF-16 (like _wopen and
_wreaddir), which convert the names from some encoding to UTF-16 (if
only to extend each char to wchar_t by prepending 0 bits, but I think
they use something like mbtowc). Anyway, these non-standard functions show
that various implementations treat filenames as text, unless you think
_wreaddir for a file named AB returns 66*256 + 65 (multi-byte character
'AB'), instead of L"AB".

Alan Curry

Oct 7, 2007, 10:22:28 PM
In article <4709820...@gmail.com>,

Timothy Madden <termin...@gmail.com> wrote:
>Alan Curry wrote:
>>
>> I'll have a file whose name is composed of those 2 bytes. If I run "ls" I'll
>> see the A with 2 dots and upside down question mark. If you come along with a
>> UTF-8 terminal and run "ls" in the same directory, you'll see that funky
>> L-dot thing. Which one is correct? Both!
>
>I think this is a problem with the POSIX/SUS standard, as long as this
>behavior required is by the standard.

I can't see where it's required. It's reality though.

>
>I find it normal to see the same file name no matter what terminal I
>have (as long as it has the glyphs), no matter what computer I use to
>access the file system, as long as it has the proper software, no matter
>what my current character encoding is on my system. Would you not like
>that ?

No, I actually prefer not to use a computer that displays characters I can't
read.

>
>The current encoding on my computer is Latin-2 (for Romanian language),

To pick an example from Latin-2, you may be happy to be able to create
filenames containing an "OGONEK" (whatever that is) and see it displayed
correctly - and there's no reason you shouldn't - but if I come across that
file it'll be more useful to me to have that character displayed as \262.
Unicode is full of characters I can't identify, can't reproduce from the
keyboard, and in some cases can't even distinguish from each other.
Displaying them to me would only be irritating (or dangerous if a pair of
identical-looking characters confuse me into rm'ing the wrong file).

>still my computer can display text encoded in Latin-1, UTF-8, UTF-16 and
> other encodings. So if I read messages in a newsgroup and I see a
>message written by a person from japan, encoded in UTF-8, I can still
>see the same text the person wrote, even if UTF-8 is different from
>Latin-2. I just need to know the message is encoded in UTF-8, no matter
>what my current encoding is.

Yes, newsgroup messages have headers, in which you can find the character
encoding. Filesystems don't.

>
>The same thing should happen with filenames. I think filenames are text
>just as much as a word from that message is text. It is just the POSIX
>standard that thinks otherwise, and that should be fixed.
>
>Only some byte sequences can encode characters in UTF-8. Others are for
>example reserved for future code points in UNICODE. This shows I could
>have POSIX filenames that can not even be sent to that "UTF-8 terminal"
>you were talking about. Would you not like POSIX to fix the situation ?

Absolutely not. Fixing that situation would mean that non-UTF-8-legal byte
sequences become banned from filenames. open() acquires a new mode of failure
that it didn't have before. The simple rule of "All these bytes are yours
except 0x2f. Attempt no landing there" gets replaced with a complicated
system in which the validity of a byte depends on what came before it.
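To illustrate how context-dependent that check is, here is a sketch of a UTF-8 well-formedness test. It shows what an enforcing implementation would have to run on every name; it is not something POSIX or any kernel discussed here is known to do. The function name `is_valid_utf8` is made up.

```c
/* Return 1 if the NUL-terminated byte string `s` is well-formed UTF-8,
 * 0 otherwise. Rejects overlong forms, surrogates (U+D800..U+DFFF) and
 * code points above U+10FFFF. */
int is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        unsigned long cp, min;
        int extra, i;

        if (*p < 0x80) {                    /* 0xxxxxxx: plain ASCII */
            p++;
            continue;
        } else if ((*p & 0xE0) == 0xC0) {   /* 110xxxxx: 1 continuation byte */
            cp = *p & 0x1F; extra = 1; min = 0x80;
        } else if ((*p & 0xF0) == 0xE0) {   /* 1110xxxx: 2 continuation bytes */
            cp = *p & 0x0F; extra = 2; min = 0x800;
        } else if ((*p & 0xF8) == 0xF0) {   /* 11110xxx: 3 continuation bytes */
            cp = *p & 0x07; extra = 3; min = 0x10000;
        } else {
            return 0;                       /* stray continuation byte or 0xF8..0xFF */
        }
        p++;

        for (i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)        /* also catches the terminating NUL */
                return 0;
            cp = (cp << 6) | (unsigned long)(*p & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

The same byte 0xbf is accepted after 0xc4 and rejected on its own, which is exactly the "depends on what came before it" rule. An application that wants UTF-8 names can run such a check itself after readdir() and decide what to do with names that fail.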

The way things are now, I can use whatever character set I like and you can
use whatever character set you like. You want to impose a single character
set on everybody. That's not nice.

You're free to assume that all filenames on unix are encoded in UTF-8. That
seems to be the consensus that's being built by all the Unicode advocates
with their rolled-up newspapers. In fact I'm pretty sure that UTF-8 has been
made the official character set of the Linux ext2fs filesystem.

I can't find the announcement right now, but the cool thing about that change
(and it was a change, because Latin-1 was far more likely to be used in the
early days) is that it involved 0 bytes of new code. It is purely a social
guideline with no software enforcement. Existing filesystems populated with
Latin-2 and KOI-8 and SHIFT-JIS filenames didn't suddenly stop working. They
became "incorrect" in some unimportant theoretical sense, but they still work
fine because the kernel and libc, and pretty much everything that isn't in
charge of displaying characters on screen or converting keypresses to
characters, treat a filename as an opaque sequence of bytes.

--
Alan Curry
pac...@world.std.com

Timothy Madden

Oct 8, 2007, 4:58:53 AM
to Alan Curry
Alan Curry wrote:
> In article <4709820...@gmail.com>,
> Timothy Madden <termin...@gmail.com> wrote:
[...]

>> Only some byte sequences can encode characters in UTF-8. Others are for
>> example reserved for future code points in UNICODE. This shows I could
>> have POSIX filenames that can not even be sent to that "UTF-8 terminal"
>> you were talking about. Would you not like POSIX to fix the situation ?
>
> Absolutely not. Fixing that situation would mean that non-UTF-8-legal byte
> sequences become banned from filenames. open() acquires a new mode of failure
> that it didn't have before. The simple rule of "All these bytes are yours
> except 0x2f. Attempt no landing there" gets replaced with a complicated
> system in which the validity of a byte depends on what came before it.
>
> The way things are now, I can use whatever character set I like and you can
> use whatever character set you like. You want to impose a single character
> set on everybody. That's not nice.

Everyone is free to use their character set. I just want a way to know
that character set, so I can see the names the same way as you.

What is bad about filesystems or filenames having a charset property ?

Old apps would then be free to ignore it, but new apps would know better.

New apps might even choose to let the system transcode filenames on the
fly if they do not want to take on all the charset hassle.

You said yourself that you are pretty sure ext2fs adopted UTF-8 as the
filename charset, and that it did so without breaking compatibility for
existing apps. I just want the same thing in the POSIX standard, for
whatever charset is appropriate on a given implementation.

Timothy Madden,
Romania

Fredrik Roubert

Oct 8, 2007, 5:14:12 AM
On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:

> I have seen _wreaddir function in some implementations, but is there a
> portable way to get a file's name in UTF-8 or to get a file name in the
> underlaying encoding of its file system and to get the encoding ?
>
> Are POSIX implementations required to convert the file name return by
> readdir to the application's execution character set ?

The encoding used for file names on any given file system is never
specified in a POSIX system, and a user is free to create file names
using several different encodings even on the same file system. (I
actually have such a file system myself, where most file names are
encoded in UTF-8 but the file names in one directory are encoded in
ISO-8859-1.)

A process that wants to interpret the bytes that make up a file name
must look at its environment for hints about which encoding the user
wants those file names to be interpreted as (e.g. the LC_* environment
variables). Once the program has called setlocale() to adopt that
environment, you can use the mbstowcs() library function to convert a
string into a wide character string according to the encoding of the
current locale.
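A minimal sketch of that approach. The helper names are hypothetical; `setlocale` and `mbstowcs` are standard C. Note that mbstowcs() follows the program's current locale, which reflects the LC_* variables only after setlocale(LC_ALL, "") has been called.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Adopt the locale named by the LC_* / LANG environment variables.
 * Returns 1 on success, 0 if that locale is not available (the program
 * then stays in its previous locale, "C" by default). */
int adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Convert a file name to wide characters using the encoding of the
 * current locale. Returns the number of wide characters written
 * (excluding the terminator), or (size_t)-1 if the bytes are not valid
 * in that encoding or `outlen` is 0. A name longer than the buffer is
 * silently truncated to outlen - 1 wide characters. */
size_t name_to_wide(const char *name, wchar_t *out, size_t outlen)
{
    size_t n;

    if (outlen == 0)
        return (size_t)-1;
    n = mbstowcs(out, name, outlen - 1);
    if (n != (size_t)-1)
        out[n] = L'\0';             /* mbstowcs does not terminate a full buffer */
    return n;
}
```

A (size_t)-1 result is the interesting case: it means the name was not created in the encoding the current user's locale assumes, and the program has to fall back to showing raw bytes or escapes.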

Cheers // Fredrik Roubert

--
Dyre Halses gate 10 | +47 73568556 / +47 41266295
NO-7042 Trondheim | http://www.df.lth.se/~roubert/

Timothy Madden

Oct 8, 2007, 7:15:22 AM
to Fredrik Roubert
Fredrik Roubert wrote:
> On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:
[...]

> A process that wants to interpret the bytes that makes up a file name
> must look at its environment for hints about which encoding the user
> wants those file names to be interpreted as (eg. the LC_* environment
> variables). You can use the mbstowcs() library function to automatically
> convert a string into a wide character string according to the encoding
> specified by the current environment.

How about files from a remote file system? Then I am out of luck!

I usually connect through VPN, at work, to my client's LAN. They use
Latin-1, I use Latin-2.

How can I tell that programmatically and portably ? My app has to work
with files from both machines.

I would like a standard way to get that encoding, and the file system
should be the first to know about it.

I guess I will just have to rely on the user passing the encoding for
files whose names I process on the command line, or else assume the LC_*
default.
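That fallback can be sketched as follows. The function name `assumed_name_encoding` is hypothetical; `nl_langinfo(CODESET)` is the POSIX call that reports the codeset of the current locale, and it only reflects the LC_* variables if the program has called setlocale(LC_ALL, "") beforehand.

```c
#include <langinfo.h>
#include <stddef.h>

/* Pick the encoding to assume for file names: an explicit user choice
 * (for example from a command-line option) wins; otherwise fall back to
 * the codeset of the current locale. */
const char *assumed_name_encoding(const char *user_choice)
{
    if (user_choice != NULL && user_choice[0] != '\0')
        return user_choice;
    return nl_langinfo(CODESET);
}
```

The result is a label such as "UTF-8" or "ISO-8859-2" that can be handed to iconv_open() as the source encoding. It is still only a guess about the names on disk, since nothing records how they were actually created.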

This is not always possible. For example, when simply browsing the FS
(like the GUI shell does), you cannot ask the user for the encoding of
files before browsing ...

I would like POSIX to fix this problem.

P.S. Fortunately my client and I have only used 7-bit ASCII characters
in file names until now.

Robert Harris

Oct 8, 2007, 9:05:13 AM

The only portable solution is to use UNICODE everywhere!

Robert

Timothy Madden

Oct 8, 2007, 10:36:45 AM
to Robert Harris
Robert Harris wrote:
> Timothy Madden wrote:
>> Fredrik Roubert wrote:
>>> On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:
>> [...]
>>> A process that wants to interpret the bytes that makes up a file name
>>> must look at its environment for hints about which encoding the user
>>> wants those file names to be interpreted as (eg. the LC_* environment
>>> variables). You can use the mbstowcs() library function to automatically
>>> convert a string into a wide character string according to the encoding
>>> specified by the current environment.
>>
>> How about files from a remote file system ? Than I am out of luck !
>>
[...]

>
> The only portable solution is to use UNICODE everywhere!
>
> Robert

Yes, well, my app only reads directories (to compare them), so even if I
use UNICODE, I still need the encoding the others have used when they
created the directories and files. I want my tool to work with all the
files, everywhere. That is what portability is about. Unfortunately
POSIX only gives me a binary char[] array for the file name.

Timothy Madden,
Romania.

Alan Curry

Oct 8, 2007, 4:10:44 PM
In article <4709F14D...@gmail.com>,

Timothy Madden <termin...@gmail.com> wrote:
>
>Everyone is free to use their character set. I just want a way to know
>that character set, so I can see the names the same way as you.
>
>What is bad about filesystems or filenames having a charset property ?

Don't underestimate the inertia effect. All of the system interfaces that use
filenames (open/creat, readdir, link, unlink, symlink, mkdir, mknod, etc.)
have been around for a long time. They were "opaque char *" back when it was
obvious that all strings were ASCII, and they haven't changed much since.

Replacing them all with a new set of syscalls that associates a charset tag
with each name would be a major effort. And who's going to bother, now that
we're approaching a time when it will be obvious that all strings are UTF-8?

You keep saying POSIX should "fix" this. Well, that's not how it works.
Successful standards are the ones that codify existing practice. You need to
show at least one working implementation of whatever interface you'd like to
standardize. Otherwise you're just using the standard as a club to beat
people with, and that doesn't make them eager to implement your idea for you.

One implementation that could be easily done would be to add a "charset"
mount option, and make the mount syscall ignore it. Anyone who's interested
could look at the mount options with getmntent(). Add your own opendir()
wrapper and you've got something. Of course it's only a per-mount tag, not
per-directory or per-file, and there's nothing to prevent users from creating
files with the "wrong" kind of names. But at least the implementation
overhead is fairly low.

--
Alan Curry
pac...@world.std.com

Timothy Madden

Oct 8, 2007, 5:51:44 PM
to Alan Curry

I know standards are meant for everyone, big or small. But standards
should also offer directions for future development.

I was thinking about a 'charset' option for mkfs, with mkfs taking the
default from LC_* variables, or some hard-coded value. I only want to
allow per-directory charsets in the interface just for future
enhancements, but I would like the charset implemented for the entire fs
only. And mount would look for the FS-specific charset first, and if not
present would take it from the mount options. Then all open/creat/..
functions would ignore it for compatibility with the old apps, until the
application makes some special syscall or uses some new flag for open,
and then the new feature gets activated, and the system transcodes
filenames from LC_* encoding to FS encoding on the fly. Also wopen and
wreaddir would know this charset and convert from it to UTF-16.

Anyway, I give up. It is all a mess and it is not in my power to fix it.
Not because of compatibility or technical reasons, but because people do
not care. If I get negative feedback on the NG, I think I will get even
worse feedback from the POSIX group. Even if anyone can devise such a
new feature and still keep compatibility with POSIX.3

Thank you for bearing with me up until now anyway.
Timothy Madden,
Romania

William Ahern

Oct 8, 2007, 6:43:06 PM
Timothy Madden <termin...@gmail.com> wrote:
> Alan Curry wrote:
> > In article <4709820...@gmail.com>,
> > Timothy Madden <termin...@gmail.com> wrote:
> [...]
> >> Only some byte sequences can encode characters in UTF-8. Others are for
> >> example reserved for future code points in UNICODE. This shows I could
> >> have POSIX filenames that can not even be sent to that "UTF-8 terminal"
> >> you were talking about. Would you not like POSIX to fix the situation ?
> >
> > Absolutely not. Fixing that situation would mean that non-UTF-8-legal byte
> > sequences become banned from filenames. open() acquires a new mode of failure
> > that it didn't have before. The simple rule of "All these bytes are yours
> > except 0x2f. Attempt no landing there" gets replaced with a complicated
> > system in which the validity of a byte depends on what came before it.
> >
> > The way things are now, I can use whatever character set I like and you can
> > use whatever character set you like. You want to impose a single character
> > set on everybody. That's not nice.

> Everyone is free to use their character set. I just want a way to know
> that character set, so I can see the names the same way as you.

What if the "character set" is actually a special binary 3D object
description for use in some new visualization application? And your terminal
or file manager doesn't have a prayer of supporting it in your life time?
(Also, who says file names are meant to be read by humans? A filesystem is a
database like any other, unless you cripple it with provincial features.)

The fact that NTFS, Win32, and Java adopted UTF-16 as their "character set"
encoding, and the subsequent issues that arose when Unicode evolved and
trashed almost all the presumed benefits of co-mingling the concepts of
textual data (such as many filesystem object identifiers, i.e. file names)
with textual representation, shows how misguided this notion is.

Just basic engineering and historical sense suggests that the kernel and
low-level system libraries should keep arms-length from these things.

Notwithstanding the unfortunate limbo that various locales have found
themselves in because of half-measures like ISO-8859 and ISO-2022, the
entire software industry is converging on UTF-8 (not the least because of
its disposition in regards to ASCII and 8-bit opaque data). Unsurprisingly,
Unix dodged another bullet by keeping its nose out of application
developers' faces, and letting them fight it out in due course.

Disregarding some minor anachronisms, Unix has treated file names and file
content as opaque bytes. As well it should continue to do so. It might not
be the best way, but all the other bright ideas have inevitably crashed and
burned. These arguments you make can find no currency with people who have
watched the industry evolve. They're short-sighted, and don't even
satisfactorily solve the problems at hand.

William Ahern

Oct 8, 2007, 7:02:36 PM
Timothy Madden <termin...@gmail.com> wrote:
<snip>

> I know standards are meant for everyone, big or small. But standards
> should also offer directions for future development.

Exactly. And there's no better standard for future development than treating
the file names as opaque bytes. That is exactly what people have been
trying to explain.

Adding an external character set identifier--let's call it "meta data"--does
not fit the sensibilities of many or even most developers. Not that
they think such meta data isn't useful, but they'd prefer the
meta data to be in the actual file data (or in some file data, not
necessarily in the same file, or executed as a matter of policy). Why?
Because the needs and modes of these things are constantly evolving (the
concept of character set is not immune to this process). The filesystem
provides a very primitive interface. And time has taught that it's more
flexible and economical to keep the interface primitive and allow
developers more freedom to build on top of it, rather than forcing them
to deal with excess baggage which they may or may not make use of.

It makes your life harder, for sure, but it makes life easier for many more,
now and in the future (perhaps yourself). This is an area of software
architecture where people are justifiably conservative. Maybe because they're
all stupid, or maybe because it's not a terribly bad idea and nothing better
has come along.

Logan Shaw

Oct 9, 2007, 12:26:12 AM
Timothy Madden wrote:
> How can I tell that programmatically and portably ? My app has to work
> with files from both machines.
>
> I would like a standard way to get that encoding, and the file system
> should be the first to know about it.

If the filesystem has an encoding set for it, how do you expect multiuser
systems to work? Or do you want to make it impossible to support different
encodings for different users? If user 'bob' wants to use US-ASCII in
/home/bob and user 'andre' wants to use Italian in /home/andre, and both
/home/bob and /home/andre are on the same filesystem, your system where the
filesystem knows (and enforces) the proper encoding for everybody would
make this impossible.

Essentially, putting the encoding into the filesystem makes it have a
global scope. It makes encoding into a global variable that only root
can set. Is that what you really want? Is it really cleaner? I would
argue that it's much less flexible for the user this way.

Note that this is certainly not a hypothetical situation. I have myself
been system administrator at a site where many users on the same system
had different native languages and preferred to use a different encoding
from each other. They should be allowed to choose the encoding for their
filenames as well.

- Logan

fjb...@yahoo.com

Oct 9, 2007, 1:39:29 AM
On Oct 8, 4:15 am, Timothy Madden <terminato...@gmail.com> wrote:
> Fredrik Roubert wrote:
> > On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:
> [...]
> > A process that wants to interpret the bytes that makes up a file name
> > must look at its environment for hints about which encoding the user
> > wants those file names to be interpreted as (eg. the LC_* environment
> > variables). You can use the mbstowcs() library function to automatically
> > convert a string into a wide character string according to the encoding
> > specified by the current environment.
>
> How about files from a remote file system ? Than I am out of luck !
>
> I use to connect through VPN, at work, to my client's LAN. They use
> Latin-1, I use Latin-2.
>
> How can I tell that programmatically and portably ? My app has to work
> with files from both machines.
>
> I would like a standard way to get that encoding, and the file system
> should be the first to know about it.
>
> I guess I will just have to rely on the user passing the encoding for
> files whose names I process on the command line, or else assume the LC_*
> default.

You could adopt a convention where the encoding is contained in the
filename itself. There's a scheme like this for email subject lines.
For example I have a piece of spam in my inbox with a subject of
=?ISO-2022-JP?B?GyRCMnEwd0ApNVUxZyU1JSQbKEI=?= which I presume a smart
enough mail client would display as Japanese text. (Mine doesn't, but
I don't care cause it's spam and I can't read Japanese anyway.)
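The email scheme referred to here is the RFC 2047 "encoded-word" syntax, which
carries the charset name alongside the encoded bytes. As a rough sketch only
(using Python's standard email.header module is an assumption of this example,
and the helper name decode_encoded_word is hypothetical), such a string can be
decoded like this:

```python
from email.header import decode_header

# The encoded word from the message above, written on one line.
subject = "=?ISO-2022-JP?B?GyRCMnEwd0ApNVUxZyU1JSQbKEI=?="

def decode_encoded_word(word):
    """Decode a single RFC 2047 encoded word into text (hypothetical helper)."""
    raw, charset = decode_header(word)[0]
    if charset is None:
        # Not an encoded word: decode_header hands back the text unchanged.
        return raw if isinstance(raw, str) else raw.decode("ascii")
    # For an encoded word we get the raw bytes plus the declared charset.
    return raw.decode(charset)

text = decode_encoded_word(subject)
print(len(text), "characters decoded")
```

Note that the base64 alphabet includes "/", which cannot appear inside a file
name component, so a file name convention along these lines would presumably
need a different alphabet.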

J de Boyne Pollard

Oct 9, 2007, 6:53:45 AM
to
TM> So if I read messages in a newsgroup and I see a message
TM> written by a person from japan, encoded in UTF-8, I can still
TM> see the same text the person wrote, even if UTF-8 is different
TM> from Latin-2. I just need to know the message is encoded in
TM> UTF-8, no matter what my current encoding is. The same
TM> thing should happen with filenames.

AC> Yes, newsgroup messages have headers, in which you can
AC> find the character encoding. Filesystems don't.

Wrong. The filesystem formats that two messages ago you characterized
as "well-designed" may not support such metadata. But the filesystem
formats that you associated with "brokenness", ugliness, and easy
confusion, most certainly do. HPFS, for example, has a code page
(index) field in its data structures for directory entries,
immediately preceding the name field.

AC> open() acquires a new mode of failure that it didn't have before.
AC> The simple rule of "All these bytes are yours except 0x2f.
AC> Attempt no landing there" gets replaced with a complicated
AC> system in which the validity of a byte depends on what
AC> came before it.

... or it gets replaced with the equally simple rule of "All these
codepoints are yours except for U+0000 and U+002F." and a syscall
interface that uses UTF16.

Of course, it is false that the rule actually _is_ as simple in
practice as "All these bytes are yours except 0x2f." in the first
place. The "complicated system in which the validity of a byte
depends on what came before it" already exists and is what is actually
enforced right now, because the on-disc data structures for many
filesystem formats don't use octets for storing filenames. NTFS,
HFS+, and FAT all use UTF16, for example. Thus an operating system
kernel that uses octet strings in its system call interface _already_
has to impose multi-byte encoding rules on those strings, because they
have to convert cleanly to UTF16 in order to be valid filenames.

These rules have nothing to do with "brokenness", "ugliness", or
"confusion". Those filesystem formats pretty much (glossing over
issues such as decomposition) have the UTF16 equivalent of the simple
rule mentioned above when it comes to the on-disc data structures, and
as a result when employed by operating systems that have a UTF16
native system API have the very same elegance that you are discussing
for the 8-bit world. Blaming this on the "poor non-Unix operating
systems" is to not understand the actual issue at all. The issue that
mandates these rules has nothing whatsoever to do with operating
systems not being Unix, and everything to do with the mechanics of
converting between 8-bit character strings and 16-bit character
strings. One faces the stark choice between having 16-bit character
strings that cannot be represented as 8-bit character strings, i.e. an
8-bit system where some of the on-disc filenames created by 16-bit
systems are inaccessible; and having 8-bit character strings that have
no mapping to 16-bit character strings, i.e. an 8-bit system where
some 8-bit filenames are invalid because the multi-byte encoding is
incorrect.
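Both halves of that choice can be shown in a few lines. This is only a sketch,
and it assumes the 8-bit side is UTF-8 (the post does not name a particular
multi-byte encoding); the function names are hypothetical:

```python
def bytes_to_utf16(name):
    """Convert an 8-bit name to UTF-16, or return None if it is not valid UTF-8."""
    try:
        return name.decode("utf-8").encode("utf-16-le")
    except UnicodeDecodeError:
        # An 8-bit name with no mapping to a 16-bit string.
        return None

def utf16_to_bytes(units):
    """Convert a UTF-16 name to UTF-8, or return None if it has no 8-bit form."""
    try:
        return units.decode("utf-16-le").encode("utf-8")
    except UnicodeError:
        # For example an unpaired surrogate, which strict UTF-8 cannot carry.
        return None

print(bytes_to_utf16(b"caf\xe9"))         # Latin-1 bytes, not valid UTF-8
print(utf16_to_bytes(b"\x00\xd8a\x00"))   # lone surrogate followed by "a"
```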

Timothy Madden

Oct 9, 2007, 6:36:21 PM
to William Ahern
[...]

>
> Disregarding some minor anachronisms, Unix has treated file names and file
> content as opaque bytes. As well it should continue to do so. It might not
> be the best way, but all the other bright ideas have inevitably crashed and
> burned. These arguments you make can find no currency with people who have
> watched the industry evolve. They're short-sighted, and don't even
> satisfactorily solve the problems at hand.

What do you mean "all the other bright ideas have crashed and burned" ?
They are still living, _wreaddir functions are perfectly working, and
WinNT+ is all UNICODE (ANSI functions are wrappers around the UNICODE ones).

It is POSIX who keeps doing things the old way. But even standards
evolve, so all they need now is finding a standard, interoperable way to
somehow include the charset in the filesystem interfaces.

And since cd, ls and cat are user commands, file names are clearly meant
to be read and written by humans. Even though the file system is a
database like any other.

It is OK if file names have been created on systems with special
encodings and can only be displayed there; this happens all the
time. But at least I would then have a way to know about it, instead of
seeing a different file name and taking it at face value, when it most
likely looks like "garbage", as some put it.

Timothy Madden,
Romania

Timothy Madden

Oct 9, 2007, 6:44:33 PM
to William Ahern

I only want an optional feature. I am also conservative and I value
compatibility before new features.

Any applications, including the existing ones, can ignore any charset
values and work as before. But I want the option of letting the system
transcode filenames for me, or just let me know the charset and then I
will deal with it.

I know the best solutions are often the simple ones. But sometimes you
have to work if you want to make things right.

Gianni Mariani

Oct 9, 2007, 6:43:25 PM
to
Timothy Madden wrote:
...

> Are POSIX implementations required to convert the file name return by
> readdir to the application's execution character set ?

A reasonable convention to use (hard to enforce) is that all file names
be stored in a normalized utf-8. This is similar to the Windows
solution of storing all file names in utf-16. The question is what to
do when a process's character set cannot represent a name converted from
utf-8. There are two solutions - keep file names in utf-8 and display them
in utf-8, or convert the entire application to use utf-8.

A third solution is to only use a subset of utf-8 - ascii, for file names.
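Why "normalized" matters can be seen with a name that has two valid spellings
in Unicode. The sketch below assumes NFC as the normalization form, which the
post does not specify, and uses a hypothetical helper name:

```python
import unicodedata

def normalized_utf8(name):
    """Return the UTF-8 bytes of a name after NFC normalization (assumed form)."""
    return unicodedata.normalize("NFC", name).encode("utf-8")

composed = "caf\u00e9"      # e-acute as one code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# The raw byte strings differ, so a byte-wise comparison of two
# directories would treat these as two different file names.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
# After normalization both spellings yield the same bytes.
print(normalized_utf8(composed) == normalized_utf8(decomposed))
```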

Timothy Madden

Oct 9, 2007, 7:04:23 PM
to Logan Shaw
Logan Shaw wrote:
> Timothy Madden wrote:
>> How can I tell that programmatically and portably ? My app has to work
>> with files from both machines.
>>
>> I would like a standard way to get that encoding, and the file system
>> should be the first to know about it.
>
> If the filesystem has an encoding set for it, how do you expect multiuser
> systems to work?

It is easy. Use UTF-8 or UTF-16 as the file system encoding and let the
system re-encode names between UTF and the user's current LC_* encoding
on the fly.

So every user effectively sees the file system in its own LC_* encoding.

Even more, if the user has two apps, one that only knows SHIFT_JIS and
one that only knows ANSI, the user just needs to arrange that current
locale for the first app is SHIFT_JIS, and the current locale for the
second app is ANSI. And suddenly the same filesystem appears all in
SHIFT_JIS to app1, and all in ANSI to app2. Even at the same time :).

Since not all SHIFT_JIS names can be re-encoded to ANSI without question
mark characters it is still better for the user to use unicode
applications. Which is true anyway, with or without filesystem/filename
charsets.
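The on-the-fly re-encoding described above could look roughly like this. It is
a sketch under assumptions: names are stored as UTF-8, "ANSI" is taken to mean
Windows code page 1252, and name_for_locale is a hypothetical helper, not an
existing system interface:

```python
def name_for_locale(stored, locale_encoding):
    """Re-encode a UTF-8 stored name into the application's encoding.

    Characters the target encoding cannot represent become "?".
    """
    return stored.decode("utf-8").encode(locale_encoding, errors="replace")

# A Japanese file name as it would be stored on disk (UTF-8 assumed).
stored = "\u65e5\u672c\u8a9e.txt".encode("utf-8")

app1_view = name_for_locale(stored, "shift_jis")  # fully representable
app2_view = name_for_locale(stored, "cp1252")     # kanji turn into "?"
print(app1_view, app2_view)
```

The second result also shows the cost: once the kanji have been replaced, the
name no longer identifies the file, so passing it back to open() could not work
without some further convention.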

Timothy Madden,
Romania

Timothy Madden

Oct 9, 2007, 7:12:11 PM
to fjb...@yahoo.com
fjb...@yahoo.com wrote:
> On Oct 8, 4:15 am, Timothy Madden <terminato...@gmail.com> wrote:
>> Fredrik Roubert wrote:
>>> On Sun, 07 Oct 2007 22:22:12 +0300, Timothy Madden wrote:
>> [...]
>>> A process that wants to interpret the bytes that makes up a file name
>>> must look at its environment for hints about which encoding the user
>>> wants those file names to be interpreted as (eg. the LC_* environment
>>> variables). You can use the mbstowcs() library function to automatically
>>> convert a string into a wide character string according to the encoding
>>> specified by the current environment.
>> How about files from a remote file system ? Than I am out of luck !
>>
[...]

>
> You could adopt a convention where the encoding is contained in the
> filename itself. There's a scheme like this for email subject lines.
> For example I have a piece of spam in my inbox with a subject of =?
> ISO-2022-JP?B?GyRCMnEwd0ApNVUxZyU1JSQbKEI=?= which I presume a smart
> enough mail client would display as Japanese text. (Mine doesn't, but
> I don't care cause it's spam and I can't read Japanese anyway.)
>

This problem is about the POSIX standard or interoperability or the
entire world if you want.

How the encoding is stored in the file system is the decision of the
FS implementation, and I am sure there are many possibilities to choose from.

Timothy Madden,
Romania.

Timothy Madden

Oct 9, 2007, 7:30:40 PM
to Gianni Mariani

Everyone is free to use their character set. POSIX is a standard, and
it is meant for everyone. If everything your application knows is
EBCDIC, just keep using EBCDIC, UTF-8 should be just one of the options.

If you meant internal storage, that is the decision of the file system
implementation only.

The question of names using characters outside the application's
character set is still a difficult one for me. I guess the system could
use some escape mechanism in the file names for such characters, like
uri-encoding them, and at the same time set some variable to let the
requesting application know about what happened.
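As an illustration of that escape idea, here is a sketch only. The function
to_app_charset is hypothetical, it assumes a stateless single-byte target
charset, and it percent-encodes the UTF-8 form of each character the charset
lacks:

```python
from urllib.parse import quote

def to_app_charset(name, charset):
    """Encode a name for an application, escaping what the charset lacks.

    Returns (encoded_bytes, escaped), where escaped tells the caller
    whether any character had to be percent-encoded.
    """
    out = bytearray()
    escaped = False
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # quote() percent-encodes the character's UTF-8 bytes.
            out += quote(ch).encode("ascii")
            escaped = True
    return bytes(out), escaped

print(to_app_charset("\u021b.txt", "latin-1"))  # Romanian t-comma, not in Latin-1
```

A literal "%" already present in a name would make the result ambiguous, so a
real scheme would presumably have to escape that character as well.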

Timothy Madden,
Romania

Gianni Mariani

Oct 10, 2007, 5:07:58 AM
to

If you want interoperability then a very good solution is to use a
common base. Unicode is designed to accommodate that common base for
languages. You don't *have* to use it and you can go and whack your
head against a brick wall if you really want to.

It gets to the point that once you have decided you need to have
multiple processes with different locale encodings to talk to each other
(which is the inevitable problem with file names), then using a common
encoding like utf-8 and deprecating all other encodings becomes an
interesting solution.

It will take a while still before it is ubiquitous; however, many web
based documents are utf-8 and many applications communicate in utf-8 or
utf-16. Most of the recent web browsers work very well multilingually,
the tools are there, the problems are solved. There are a plethora of
multilingual documents on the web today.

See if this works below :-

س اスセソタチツテ لاБ Г Д من 1441

Fredrik Roubert

Oct 10, 2007, 6:05:24 AM
to
On Mon, 08 Oct 2007 14:15:22 +0300, Timothy Madden wrote:

> How about files from a remote file system ? Than I am out of luck !
>
> I use to connect through VPN, at work, to my client's LAN. They use
> Latin-1, I use Latin-2.
>
> How can I tell that programmatically and portably ? My app has to work
> with files from both machines.

When mounting your remote file system you should specify the character
set conversion you would like to get done in order to get the file names
in the encoding that you want your process to receive.

If you have file systems with latin1 file names and file systems with
latin2 file names that you want to access from the same process, then I
suggest that you mount them with character set conversion to UTF-8 and
then run your process in a UTF-8 locale.
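Where the mount cannot do the conversion, the same effect can be approximated
in the application, provided the source encoding of each tree is supplied from
outside (for instance on the command line). A minimal sketch, with to_utf8 as
a hypothetical helper:

```python
def to_utf8(raw_name, source_encoding):
    """Convert a raw file name to UTF-8, given the encoding it was written in."""
    return raw_name.decode(source_encoding).encode("utf-8")

# The same byte means different letters in the two character sets:
# 0xE3 is "a with tilde" in Latin-1 and "a with breve" in Latin-2.
raw = b"\xe3.txt"
from_client = to_utf8(raw, "iso-8859-1")
from_local = to_utf8(raw, "iso-8859-2")
print(from_client, from_local)
```

The bytes alone do not reveal which interpretation is right, which is why the
encoding has to come from the mount options, the locale, or the user.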

Fredrik Roubert

Oct 10, 2007, 6:11:25 AM
to
On Mon, 08 Oct 2007 11:58:53 +0300, Timothy Madden wrote:

> What is bad about filesystems or filenames having a charset property ?

That would make it necessary for all files on any given filesystem to
have their names encoded in the same character set. This would prevent,
say, one user from encoding his file names in ISO-8859-1 and another
user to encode his file names in GB2312.

Many other systems work this way, but from the Unix point of view, every
single process should be able to run in its own locale. In many kinds of
larger and distributed systems, this is a really good idea.

J de Boyne Pollard

Oct 10, 2007, 8:05:10 AM
to
TM> What is bad about filesystems or filenames having a charset
TM> property ?

FR> That would make it necessary for all files on any given filesystem
FR> to have their names encoded in the same character set.

_This is actually already the case_ for the FAT, NTFS, and HFS+
filesystem formats. It's a requirement of the filesystem formats.

FR> This would prevent, say, one user from encoding his file names in
FR> ISO-8859-1 and another user to encode his file names in GB2312.

Wrong. That is _not_ a consequence of filenames having a character
set property. If filenames had a character set property -- as _they
have_ on HPFS -- then one user could use one character set for xyr
file names and another user could use another character set for xyr
file names. And, indeed, on those operating systems that support this
facility of HPFS, that is exactly what they do.

FR> Many other systems work this way, but from the Unix point of
FR> view, every single process should be able to run in its own
FR> locale.

This is irrelevant to the issue. If the system API were UTF16, for
example, translation between UTF16 and an 8-bit character set would be
done in application-mode code, and would use the process' current
locale. Thus the 8-bit character set would be locale-dependent and
per-process, as desired. This is exactly how those "other systems"
actually work.

<URL:http://reactos.org./generated/doxygen/d4/d47/dll_2win32_2kernel32_2file_2find_8c.html#a10>

Rainer Weikusat

Oct 10, 2007, 8:34:00 AM
to
J de Boyne Pollard <j.deboyn...@tesco.net> writes:
> TM> What is bad about filesystems or filenames having a charset
> property ?
>
> FR> That would make it necessary for all files on any given filesystem
> to
> FR> have their names encoded in the same character set.
>
> _This is actually already the case_ for the FAT, NTFS, and HFS+
> filesystem formats. It's a requirement of the filesystem formats.

In other words, DOS/Windows and Mac OS behave differently in this
respect.

> FR> This would prevent, say, one user from encoding his file names in
> FR> ISO-8859-1 and another user to encode his file names in GB2312.
>
> Wrong. That is _not_ a consequence of filenames having a character
> set property.

If a filename has a 'character set property', there is obviously a
character set attached to the filename. The same goes for a filesystem
with a character set property: It would have one.

> If filenames had a character set property -- as _they have_ on HPFS
> -- then one user could use one character set for xyr file names and
> another user could use another character set for xyr file names.

That's a non-sequitur.

> FR> Many other systems work this way, but from the Unix point of
> FR> view, every single process should be able to run in its own
> locale.
>
> This is irrelevant to the issue. If the system API were UTF16, for
> example, translation between UTF16 and an 8-bit character set would be
> done in application-mode code, and would use the process' current
> locale.

If the filesystem actually used an 'encoding' internally, instead
of just using the supplied bytestring, applications would be required
to translate to and from that encoding in case the application would
want to use a different encoding. More generally put, if the kernel
had a 'default policy', applications would need to work around that if
the user (for some reason) would like to use a different policy.

I do not quite understand why this would be an argument for or against
anything.

Fredrik Roubert

Oct 12, 2007, 4:44:24 AM
to
On Wed, 10 Oct 2007 05:05:10 -0700, J de Boyne Pollard wrote:

> > This would prevent, say, one user from encoding his file names in
> > ISO-8859-1 and another user to encode his file names in GB2312.
>
> Wrong. That is _not_ a consequence of filenames having a character
> set property.

Of course that's not a consequence of filenames having a character set
property. If you read a bit more carefully, you'll realize that I was
referring to filesystems having a character set property.

Roger Leigh

Oct 14, 2007, 6:24:13 AM
to
William Ahern <wil...@wilbur.25thandClement.com> writes:

> Timothy Madden <termin...@gmail.com> wrote:
> <snip>
>> I know standards are meant for everyone, big or small. But standards
>> should also offer directions for future development.
>
> Exactly. And there's no better standard for future development than treating
> the file names as opaque bytes. That is exactly what people having been
> trying to explain.

While opaque bytes have been flexible enough to serve for the past
30-odd years, it doesn't mean that there isn't a better way to be
found. There are obvious disadvantages, or else this discussion
wouldn't be taking place.

As an example, if filesystems were mounted with an "encoding" mount
option, all system calls taking a path or filename as an argument
could iconv() it to the encoding used by the filesystem in question.
This could be all done in userspace, with no kernel changes needed.
The C system call interface would be unchanged, so old code would
continue to work. The only change would be in the C wrappers to the
kernel system calls. It could even traverse the path from the root to
recode different parts of the path which reside on differently-encoded
filesystems. Now, there are obvious performance problems in this
situation, so it would probably not be a practical solution, but there
are likely other approaches which could be considered.
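The path traversal described in this paragraph might be sketched as follows.
Everything here is hypothetical: the mount table, its encodings, and the
function names are made up for illustration, and no real system call is
wrapped:

```python
import posixpath

# Hypothetical mount table: mount point -> encoding of names stored there.
MOUNT_ENCODINGS = {"/": "utf-8", "/mnt/client": "iso-8859-1"}

def encoding_for(directory):
    """Return the encoding of the longest mount point containing directory."""
    best = "/"
    for mount in MOUNT_ENCODINGS:
        inside = directory == mount or directory.startswith(mount.rstrip("/") + "/")
        if inside and len(mount) > len(best):
            best = mount
    return MOUNT_ENCODINGS[best]

def recode_path(path):
    """Encode each component using the filesystem that holds its directory."""
    encoded = []
    current = "/"
    for part in (p for p in path.split("/") if p):
        # A name is stored in its parent directory, so the filesystem
        # containing `current` decides how this component is encoded.
        encoded.append(part.encode(encoding_for(current)))
        current = posixpath.join(current, part)
    return b"/" + b"/".join(encoded)

print(recode_path("/mnt/client/\u00e9.txt"))
print(recode_path("/home/\u00e9.txt"))
```

Doing this lookup for every component of every path is where the performance
concern mentioned above would come from.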

I would have to agree that being encoding-agnostic does have
advantages. While some systems have tied themselves to specific 8-bit
and even 16-bit encodings, Linux is free to change. While several
people advocate mandating Unicode as UTF-8, we should realise that
even Unicode might be supplanted in a decade or so, and not being tied
to it gives us the ability to replace it quite quickly.


Regards,
Roger
--
.''`. Roger Leigh
: :' : Debian GNU/Linux http://people.debian.org/~rleigh/
`. `' Printing on GNU/Linux? http://gutenprint.sourceforge.net/
`- GPG Public Key: 0x25BFB848 Please GPG sign your mail.

J de Boyne Pollard

Oct 15, 2007, 7:59:40 AM
to
TM> What is bad about filesystems or filenames having a
TM> charset property ?

FR> This would prevent, say, one user from encoding his file names in
FR> ISO-8859-1 and another user to encode his file names in GB2312.

JdeBP> Wrong. That is _not_ a consequence of filenames having
JdeBP> a character set property.

FR> Of course that's not a consequence of filenames having a
FR> character set property. If you read a bit more carefully,
FR> you'll realize that I was referring to filesystems having a
FR> character set property.

False. You were answering the question that is quoted above, which
talks about filenames having a character set property. And nowhere
did you write anything to indicate that this was not what you were
talking about.
