Any portable way to get a filename in UTF-8 or to get the FS encoding?

J de Boyne Pollard

Oct 9, 2007, 6:53:45 AM
TM> So if I read messages in a newsgroup and I see a message
TM> written by a person from japan, encoded in UTF-8, I can still
TM> see the same text the person wrote, even if UTF-8 is different
TM> from Latin-2. I just need to know the message is encoded in
TM> UTF-8, no matter what my current encoding is. The same
TM> thing should happen with filenames.

AC> Yes, newsgroup messages have headers, in which you can
AC> find the character encoding. Filesystems don't.

Wrong. The filesystem formats that two messages ago you characterized
as "well-designed" may not support such metadata. But the filesystem
formats that you associated with "brokenness", ugliness, and easy
confusion, most certainly do. HPFS, for example, has a code page
(index) field in its data structures for directory entries,
immediately preceding the name field.

AC> open() acquires a new mode of failure that it didn't have before.
AC> The simple rule of "All these bytes are yours except 0x2f.
AC> Attempt no landing there" gets replaced with a complicated
AC> system in which the validity of a byte depends on what
AC> came before it.

... or it gets replaced with the equally simple rule of "All these
code points are yours except for U+0000 and U+002F." and a syscall
interface that uses UTF-16.
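
To make that rule concrete, here is a minimal sketch in Python. The
function name is_valid_component is hypothetical, not any real kernel or
library API, and it checks nothing beyond the two excluded code points.

```python
# Hypothetical illustration of the code point rule quoted above.
# Any code point is allowed except U+0000 (NUL) and U+002F (solidus).
FORBIDDEN = {"\u0000", "\u002f"}


def is_valid_component(name: str) -> bool:
    """Return True if no character of the name is a forbidden code point."""
    return all(ch not in FORBIDDEN for ch in name)


if __name__ == "__main__":
    for candidate in ["report.txt", "日本語.txt", "a/b", "a\u0000b"]:
        print(repr(candidate), is_valid_component(candidate))
```

The check never has to look at what came before a character, which is
the whole point: the rule is stated per code point, exactly as the
octet rule is stated per byte.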

Of course, it is false that the rule actually _is_ as simple in
practice as "All these bytes are yours except 0x2f." in the first
place. The "complicated system in which the validity of a byte
depends on what came before it" already exists and is what is actually
enforced right now, because the on-disc data structures for many
filesystem formats don't use octets for storing filenames. NTFS,
HFS+, and FAT (in its long-filename entries) all use UTF-16, for
example. Thus an operating system kernel that uses octet strings in
its system call interface _already_ has to impose multi-byte encoding
rules on those strings, because they have to convert cleanly to UTF-16
in order to be valid filenames.
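
A minimal sketch of what "convert cleanly" means, assuming (my
assumption, purely for illustration) that the octet strings are
interpreted as UTF-8. The helper names are hypothetical.

```python
def to_utf16(raw: bytes) -> bytes:
    """Convert an octet-string filename to UTF-16LE.

    Assumes the octets are UTF-8. Strict decoding raises
    UnicodeDecodeError when the octets are not well-formed.
    """
    return raw.decode("utf-8").encode("utf-16-le")


def converts_cleanly(raw: bytes) -> bool:
    """Return True if the octet string has a UTF-16 equivalent."""
    try:
        to_utf16(raw)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    samples = [
        b"caf\xc3\xa9",  # 0xA9 follows the lead byte 0xC3: well-formed
        b"caf\xa9",      # the same 0xA9 with no lead byte before it
        b"caf\xe9",      # Latin-1 e-acute, a truncated sequence in UTF-8
    ]
    for raw in samples:
        print(raw, converts_cleanly(raw))
```

The first two samples differ only in whether 0xC3 precedes 0xA9, so the
validity of that byte really does depend on what came before it.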

These rules have nothing to do with "brokenness", "ugliness", or
"confusion". Those filesystem formats pretty much (glossing over
issues such as decomposition) have the UTF-16 equivalent of the simple
rule mentioned above when it comes to the on-disc data structures, and
as a result, when employed by operating systems whose native system
API is UTF-16, they have the very same elegance that you are describing
for the 8-bit world. To blame this on the "poor non-Unix operating
systems" is to misunderstand the actual issue entirely. The issue that
mandates these rules has nothing whatsoever to do with operating
systems not being Unix, and everything to do with the mechanics of
converting between 8-bit character strings and 16-bit character
strings. One faces the stark choice between having 16-bit character
strings that cannot be represented as 8-bit character strings, i.e. an
8-bit system where some of the on-disc filenames created by 16-bit
systems are inaccessible; and having 8-bit character strings that have
no mapping to 16-bit character strings, i.e. an 8-bit system where
some 8-bit filenames are invalid because the multi-byte encoding is
