
wide characters and i18n


Sad Clouds

Jul 10, 2010, 5:32:39 AM
Hi, I'm trying to understand how to write portable C code that supports
international character sets. As I understand it so far, this has a lot
to do with the C library and the current locale setting.

1. What is the recommended way for user-level applications to deal with
different character encodings? Treat all external character data as
multi-byte and use the C library's wchar_t and wide stdio functions to
convert multi-byte to wchar_t for internal character processing?

2. How extensive is NetBSD's support for i18n and wide characters? Are
there any missing bits or things I need to look out for?

3. Any functions that are not thread-safe that I need to look out for?

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-...@muc.de

Matthew Mondor

Jul 10, 2010, 3:33:34 PM
On Sat, 10 Jul 2010 10:32:39 +0100
Sad Clouds <cryintot...@googlemail.com> wrote:

> Hi, I'm trying to understand how to write portable C code that supports
> international character sets. As I understand so far, it has a lot to
> do with C library and current locale setting.
>
> 1. What is the recommended way for user-level applications to deal with
> different character encodings? Treat all external character data as
> multi-byte and use the C library's wchar_t and wide stdio functions to
> convert multi-byte to wchar_t for internal character processing?

Others might also have good suggestions; I only have some experience
with UTF-8 and UTF-32/UCS-4 here, and that was with custom code rather
than the C99 wchar-related functions. I can, however, share some of
the "issues" I encountered.

When using them in the way above, the input is basically considered
UTF-8 and decoded to an internal 32-bit (host-endian) representation.
There can be problems here if the input isn't valid UTF-8, in which
case various implementations (and applications) differ. Some will
treat invalid sequences as ISO-8859-15 and convert them to their
UTF-32 equivalents. Others will simply refuse to parse the string (if
I understand correctly, the C99 functions stop at invalid sequences
with a restartable error, allowing the application to decide what to
do).
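As a minimal sketch of that restartable behaviour: the hypothetical
decode_mb() helper below is my own illustration, not from the thread,
and assumes the caller has selected a suitable locale with setlocale(3)
(the locale name varies by system).

```c
/* Sketch: decoding multi-byte input to wchar_t with the restartable
 * mbrtowc(3), stopping at the first invalid sequence so the caller
 * can decide what to do next.  Assumes the caller has done e.g.
 * setlocale(LC_CTYPE, "en_US.UTF-8") beforehand. */
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Decodes at most n wide characters from the NUL-terminated src;
 * returns the number decoded before the end of the string or before
 * the first invalid/incomplete sequence. */
size_t decode_mb(wchar_t *dst, const char *src, size_t n)
{
    mbstate_t st;
    size_t len = strlen(src), out = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (len > 0 && out < n) {
        size_t r = mbrtowc(&dst[out], src, len, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            break;                      /* invalid or incomplete */
        src += r;
        len -= r;
        out++;
    }
    return out;
}
```

In a UTF-8 locale, `decode_mb(buf, "caf\xc3\xa9", 16)` decodes 4 wide
characters, while an invalid byte such as 0xFF stops the conversion
with a restartable error.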

When dealing with invalid UTF-8 or UTF-16 input sequences, a possible
solution is to store those invalid sequence bytes or words as-is by
mapping them into a range of special/invalid Unicode characters. This
preserves the original input non-destructively. Some software will
also simply replace any invalid sequence with the special Unicode
replacement character (U+FFFD), but that is generally considered bad
practice, being destructive.

For output, the 32-bit representation is then encoded back to UTF-8 (or
the external encoding of course). If a special range of characters was
used to preserve invalid sequences, those bytes/words are restored as
they were.

I didn't look much at the wchar_t support implementation, but it seems
that wchar_t maps to a 32-bit int, so it shouldn't be much different
from what I described.

> 2. How extensive is NetBSD's support for i18n and wide characters? Are
> there any missing bits or things I need to look out for?

I'm not sure how well normalization is done, but obviously in UCS-4
the rules differ for converting between lowercase and uppercase,
comparing for sorting, and converting accented characters to
unaccented ones (useful for matching strings against user-supplied
keywords, for instance). So it's important to use the proper functions
for those operations, e.g. wcsncmp(3) instead of strncmp(3),
towlower(3) instead of tolower(3), etc.
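As a sketch of the idea, here is a hypothetical wcs_casecmp() helper
built on towlower(3) (wcscasecmp(3) exists on some systems, but the
portable building block is the towlower/towupper pair; for
locale-aware rules, setlocale(3) must have been called):

```c
/* Sketch: case-insensitive comparison of wide strings using the
 * wide-character classification functions instead of their
 * byte-oriented cousins tolower(3)/strncmp(3). */
#include <wchar.h>
#include <wctype.h>

int wcs_casecmp(const wchar_t *a, const wchar_t *b)
{
    /* Walk both strings, comparing lowercase-folded characters. */
    while (*a && towlower((wint_t)*a) == towlower((wint_t)*b)) {
        a++;
        b++;
    }
    return (int)towlower((wint_t)*a) - (int)towlower((wint_t)*b);
}
```

For ASCII data this behaves like strcasecmp(3); for anything beyond
ASCII the result depends on the current LC_CTYPE locale.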

> 3. Any functions that are not thread-safe that I need to look out for?

The strerror_r(3) function should be used instead of strerror(3) in
threads, because with the advent of NLS and locales strerror(3) can no
longer simply return a pointer to a static const string from a const
array.
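A hedged sketch of the pattern (format_error() is my own hypothetical
helper; note that glibc also ships an older GNU strerror_r variant
returning char *, selected by feature-test macros, so portable code
should be careful which one it gets):

```c
/* Sketch: thread-safe error messages via strerror_r(3), which fills
 * a caller-supplied buffer instead of returning a pointer to shared
 * static storage. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Formats "error N: message" into out; safe to call concurrently
 * from multiple threads since no shared buffer is involved. */
void format_error(int err, char *out, size_t outlen)
{
    char msg[128];

    if (strerror_r(err, msg, sizeof msg) != 0)
        snprintf(msg, sizeof msg, "unknown error");
    snprintf(out, outlen, "error %d: %s", err, msg);
}
```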

As for the input/output wchar_t related functions I'm unsure if their
state is thread-safe or requires explicit locking for concurrency.

Handling locales correctly is more complex too, as a locale might use
different decimal formatting and date formats, and its typographical
conventions might favour a particular quoting style. Some functions
support locale-specific output options, such as strftime(3). I'm not
sure whether printf(3) is supposed to support this automatically for
decimals or not. nl_langinfo(3) can be used by libraries to conform to
the locale in use (NLS(7) has more information). I personally have no
experience with it here.
--
Matt

Sad Clouds

Jul 10, 2010, 6:15:10 PM
On Sat, 10 Jul 2010 15:33:34 -0400
Matthew Mondor <mm_l...@pulsar-zone.net> wrote:

> On Sat, 10 Jul 2010 10:32:39 +0100
> Sad Clouds <cryintot...@googlemail.com> wrote:
>

> > Hi, I'm trying to understand how to write portable C code that
> > supports international character sets. As I understand so far, it
> > has a lot to do with C library and current locale setting.
> >
> > 1. What is the recommended way for user-level applications to deal
> > with different character encodings? Treat all external character
> > data as multi-byte and use the C library's wchar_t and wide stdio
> > functions to convert multi-byte to wchar_t for internal character
> > processing?
>

> Others might also have good suggestions, I only have some experience
> with UTF-8 and UTF-32/UCS-4 here, and they were using custom code
> rather than the wchar C99 related functions. I can however share some
> of the "issues" I encountered.

OK thanks. I've spent hours searching the Internet for documentation
and howtos, and I think I'm beginning to understand how it fits
together on a Unix system.

I'm not sure how portable it is to assume that input character data is
in UTF-8 format. Some articles suggest letting the user set locale
environment variables and letting the C library routines perform the
correct conversion from multi-byte to wchar_t characters. This should
be MT-safe with the restartable multi-byte functions, as long as
setlocale() is not called. This basically binds you to one locale at
run time.

If you need to convert character encodings which are different from the
current locale, then I guess the only option is to use something like
iconv or custom conversion functions...

Joerg Sonnenberger

Jul 10, 2010, 6:42:28 PM
On Sat, Jul 10, 2010 at 11:15:10PM +0100, Sad Clouds wrote:
> I'm not sure how portable it is to assume that input character data is
> in UTF-8 format. Some articles suggest to let the user set locale
> environment variables and let C library routines perform the correct
> conversion from multi-byte to wchar_t characters. This should be
> MT-safe with restartable multi-byte functions, as long as setlocale()
> is not called. This basically binds you to one locale at run time.

Depending on your environment, the UTF-8 assumption is questionable.
In many European countries, either one of the ISO-8859 charsets or
Unicode (UTF-8 or UTF-16) is used. IIRC China tends to use its own
character set a lot too.

You are correct about the setlocale() issue. There have been discussions
about supporting multiple locales at the same time, but nothing
implemented (yet).

> If you need to convert character encodings which are different from the
> current locale, then I guess the only option is to use something like
> iconv or custom conversion functions...

Use iconv. It is part of SUS and has a portable implementation with
libiconv for systems that (still) don't provide it natively.
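For illustration, a minimal iconv(3) conversion from ISO-8859-1 to
UTF-8 might look like the sketch below. The latin1_to_utf8() helper
is hypothetical; production code should also loop when iconv reports
E2BIG and decide a policy for EILSEQ (invalid input), both of which
are glossed over here.

```c
/* Sketch: one-shot charset conversion with iconv(3).  The encoding
 * names "UTF-8" and "ISO-8859-1" are the common spellings, but they
 * vary slightly between iconv implementations. */
#include <iconv.h>
#include <string.h>

/* Converts the NUL-terminated Latin-1 string in into out (capacity
 * outlen); returns the number of bytes written, or 0 on failure. */
size_t latin1_to_utf8(const char *in, char *out, size_t outlen)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return 0;

    char *inp = (char *)in;             /* iconv wants char ** */
    size_t inleft = strlen(in), outleft = outlen;
    char *outp = out;

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return 0;                       /* EILSEQ, E2BIG, ... */
    return outlen - outleft;
}
```

For example, the Latin-1 string "caf\xe9" converts to the 5-byte
UTF-8 sequence "caf\xc3\xa9".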

Joerg

Joerg Sonnenberger

Jul 10, 2010, 6:50:46 PM
On Sat, Jul 10, 2010 at 10:32:39AM +0100, Sad Clouds wrote:
> 2. How extensive is NetBSD's support for i18n and wide characters? Are
> there any missing bits or things I need to look out for?

The biggest missing item I know is the collation support. Basically,
NetBSD currently doesn't allow locale-sensitive sorting and comparing.

Joerg

Matthew Mondor

Jul 10, 2010, 7:08:32 PM
On Sat, 10 Jul 2010 23:15:10 +0100
Sad Clouds <cryintot...@googlemail.com> wrote:

> I'm not sure how portable it is to assume that input character data is
> in UTF-8 format. Some articles suggest to let the user set locale
> environment variables and let C library routines perform the correct
> conversion from multi-byte to wchar_t characters. This should be
> MT-safe with restartable multi-byte functions, as long as setlocale()
> is not called. This basically binds you to one locale at run time.

Indeed, assuming a UTF-8 external format is only valid for protocols
where UTF-8 is the norm (which was my use case, although it was also
fine otherwise, as I'm using a UTF-8 locale, UTF-8-aware tools and
terminals).

But "locale -a" lists various encodings, and most probably the C99
wide-character conversion functions take those into consideration
(after checking I now see that, for instance, wcsrtombs(3) is
implemented in src/libc/locale/ from the Citrus project and seems to
have locale-specific handling), so I think you're right.

> If you need to convert character encodings which are different from the
> current locale, then I guess the only option is to use something like
> iconv or custom conversion functions...

I've had to use iconv(3) (and iconv(1)) at times and noticed that
it could be destructive depending on the conversion, but it seemed fine
otherwise.
--
Matt

der Mouse

Jul 10, 2010, 11:01:10 PM
> Hi, I'm trying to understand how to write portable C code that
> supports international character sets.

The very first thing you need to do is determine just what "supports"
means for you here.

Personally, the biggest problems I've run into have been due to the
mismatch between octet strings and character strings. There are a lot
of places where I as a coder get octet strings but humans think of them
as character strings, and the mismatch can be problematic. File names
are an example: most Unixish filesystems actually name files with octet
strings, not character strings; for example, a file name consisting of
a single lowercase beta generated by a user using 8859-7 is
indistinguishable from a file name consisting of a single lowercase
a-circumflex generated by a user using 8859-1, and from a filename
consisting of a 0xe2 octet generated by an application that uses
filenames to store binary data that does not represent characters at
all: the file name is not a character sequence but an octet sequence
(which may or may not be an encoded character sequence).

As an example of the sort of problem this confusion between octets
strings and character strings engenders, the ssh spec is, strictly,
unimplementable on NetBSD (and probably other Unix variants), because
things like user names and passwords in the OS are octet strings,
whereas the protocol specifies that they are character strings encoded
in UTF-8. This means that, for example, it is impossible for the
implementation to tell whether a given username on the wire should
match a username in the user database, because there is no way to tell
what encoding the stored octet string was generated using.

> As I understand so far, it has a lot to do with C library and current
> locale setting.

Depending on what you want to do, it might.

> 1. What is the recommended way for user level applications to deal
> with different character encodings?

I doubt there is a "_the_ recommended way" - or, at least, if there is
the recommendation should be ignored because it came from someone who
either has an axe to grind or hasn't thought about the issues. What
the most sensible way to deal with different encodings is depends on
what you need to do. For some purposes, for example, all input data is
tagged with a character set (perhaps implicitly) and it's enough to
just make sure you preserve that marking through whatever processing
you do. For other purposes, it is necessary to recode, but nothing
more. For yet others, what you outline (convert everything to some
wider-than-8-bit type for internal use) is a right answer.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

Matthew Mondor

Jul 10, 2010, 11:28:38 PM
On Sat, 10 Jul 2010 23:01:10 -0400 (EDT)
der Mouse <mo...@Rodents-Montreal.ORG> wrote:

> what you need to do. For some purposes, for example, all input data is
> tagged with a character set (perhaps implicitly) and it's enough to
> just make sure you preserve that marking through whatever processing
> you do. For other purposes, it is necessary to recode, but nothing
> more.

You probably refer to the ASCII tags which can enclose arbitrary
encodings such as those from RFC 2047 here?

Thanks,
--
Matt

Sad Clouds

Jul 11, 2010, 5:38:53 AM
On Sat, 10 Jul 2010 23:01:10 -0400 (EDT)
der Mouse <mo...@Rodents-Montreal.ORG> wrote:

> > Hi, I'm trying to understand how to write portable C code that
> > supports international character sets.
>

> The very first thing you need to do is determine just what "supports"
> means for you here.
>
> Personally, the biggest problems I've run into have been due to the
> mismatch between octet strings and character strings. There are a lot
> of places where I as a coder get octet strings but humans think of
> them as character strings, and the mismatch can be problematic. File
> names are an example: most Unixish filesystems actually name files
> with octet strings, not character strings; for example, a file name
> consisting of a single lowercase beta generated by a user using
> 8859-7 is indistinguishable from a file name consisting of a single
> lowercase a-circumflex generated by a user using 8859-1, and from a
> filename consisting of a 0xe2 octet generated by an application that
> uses filenames to store binary data that does not represent
> characters at all: the file name is not a character sequence but an
> octet sequence (which may or may not be an encoded character
> sequence).

I guess this can be a problem if the user has one locale setting,
UTF-8 for example, but different filenames are encoded in different
encodings. If you want to do something like regular expression string
matching, you would call mbsrtowcs() to convert the multi-byte
filename string to a fixed-width wide character string.

What I'm trying to figure out is this: if the filename encoding does
not match the user's locale setting, mbsrtowcs() can stop on a
character sequence it does not consider legal; how do you skip it? It
could be a 2- or 4-byte character, but how do you know for sure? Do
you just keep calling mbsrtowcs() with 1-byte increments until it
manages to decode the next character sequence?
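The skip-a-byte approach described above can be sketched roughly as
follows. decode_lossy() is a hypothetical helper of mine, and
substituting a replacement character per skipped byte is only one
policy; the earlier discussion also mentioned mapping invalid bytes
into a private code-point range instead, which is non-destructive.

```c
/* Sketch: decode a length-bounded buffer with mbrtowc(3); on an
 * invalid (or, simplistically, incomplete) sequence, substitute
 * repl, skip one byte, reset the conversion state, and resume. */
#include <locale.h>
#include <string.h>
#include <wchar.h>

size_t decode_lossy(wchar_t *dst, size_t n, const char *src, size_t len,
                    wchar_t repl)
{
    mbstate_t st;
    size_t out = 0;

    memset(&st, 0, sizeof st);
    while (len > 0 && out < n) {
        size_t r = mbrtowc(&dst[out], src, len, &st);
        if (r == (size_t)-1 || r == (size_t)-2) {
            dst[out++] = repl;              /* substitute and resync */
            memset(&st, 0, sizeof st);      /* state is undefined now */
            src++;
            len--;
        } else if (r == 0) {                /* embedded NUL byte */
            dst[out++] = L'\0';
            src++;
            len--;
        } else {
            src += r;
            len -= r;
            out++;
        }
    }
    return out;
}
```

In a UTF-8 locale, decoding the 3 bytes "a\xffb" this way yields
three wide characters with the middle one replaced.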

der Mouse

Jul 11, 2010, 7:03:43 AM
>> For some purposes, for example, all input data is tagged with a
>> character set (perhaps implicitly) and it's enough to just make
>> sure you preserve that marking through whatever processing you do.
> You probably refer to the ASCII tags which can enclose arbitrary
> encodings such as those from RFC 2047 here?

That was not what I had in mind, but it actually is a perfectly good
example of the paradigm I was talking about.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

der Mouse

Jul 11, 2010, 7:19:12 AM
>> [...]: the file name is not a character sequence but an octet

>> sequence (which may or may not be an encoded character sequence).
> I guess this can be a problem if [...] different filenames are
> encoded in different encodings.

Yes, if you have to treat them as character sequences rather than octet
sequences. (If treating them as opaque octet sequences is good enough
for your purposes, then there's no problem.)

> If you want to do something like regular expression string matching,
> you would call mbsrtowcs() to convert multi-byte filename string to a
> fixed wide character string.

Maybe. If you want to do regular expression matching against
_character_ strings, yes. If _octet_ strings, no.

> What I'm trying to figure out is this: if filename encoding does not
> match user's locale setting, mbsrtowcs() can stop on a character
> sequence it does not think is legal, how do you skip it?

That's exactly the kind of problem I was talking about: you are given
some data (a file name) which is an octet sequence, which may or may
not be an encoded character sequence, and if it is it may or may not be
in your, or the user's, preferred encoding, and you want to turn it
into a character sequence.

What the right way to handle that is is application-specific.
Sometimes something like what you sketch is a right answer.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

Sad Clouds

Jul 11, 2010, 8:40:16 AM
On Sun, 11 Jul 2010 07:19:12 -0400 (EDT)
der Mouse <mo...@Rodents-Montreal.ORG> wrote:

> > If you want to do something like regular expression string matching,
> > you would call mbsrtowcs() to convert multi-byte filename string to
> > a fixed wide character string.
>

> Maybe. If you want to do regular expression matching against
> _character_ strings, yes. If _octet_ strings, no.

I'm not sure if simply comparing 8-bit integer units is going to work.
Some encodings (e.g. JIS) may use escape sequences to indicate
shifting to a two-byte encoding.

If the escape sequence to shift to Kanji is '<ESC>$B' and you're
looking for the ASCII '$' character, then part of the escape sequence
will match.

It seems to defeat the whole point of doing character comparison,
because you end up matching control data, which is not part of a
logical character sequence that represents the string.
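The false match is easy to demonstrate with a plain byte search. In
the sketch below, the ISO-2022-JP byte string (a single kanji between
shift sequences) is my own assumed example; the decoded text contains
no '$' at all, yet a byte-level search finds one inside the escape
sequence.

```c
/* Sketch: naive byte-oriented searching over stateful-encoded data.
 * The '$' found here is part of the ESC $ B shift sequence, not a
 * character of the text. */
#include <stddef.h>
#include <string.h>

int contains_byte(const char *buf, size_t len, char c)
{
    return memchr(buf, c, len) != NULL;
}
```

With `contains_byte("\x1b$B\x46\x7c\x1b(B", 8, '$')` the result is
true (the 8 bytes are: shift to JIS X 0208, one two-byte kanji code,
shift back to ASCII), even though no '$' character is present in the
decoded string.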

Sad Clouds

Jul 13, 2010, 7:48:06 AM
OK, I wrote a few wrapper functions around the C library mbsrtowcs(),
mbrtowc(), wcsrtombs() and wcrtomb() functions. This allows me to
convert segments of non-NUL-terminated strings from multi-byte to
wide strings in the current locale and vice versa.

The wrapper functions use the string conversion functions when strings
are long enough not to cause a buffer overrun, and then fall back to
the character conversion functions to convert the remaining data. The
extra steps are needed because the C library functions expect source
strings to be NUL-terminated, which may not be the case if you have a
string fragment in the buffer.
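For what it's worth, POSIX.1-2008 later standardized mbsnrtowcs(3),
which takes an explicit source byte count and covers part of what such
wrappers do, where it is available (it was not yet universal at the
time of this thread). A sketch with a hypothetical convert_fragment()
helper, assuming the caller has set the locale:

```c
/* Sketch: converting a non-NUL-terminated fragment with
 * mbsnrtowcs(3) (POSIX.1-2008).  Stops early at an invalid
 * sequence or when the destination fills up. */
#include <locale.h>   /* caller is expected to setlocale(3) */
#include <string.h>
#include <wchar.h>

/* Converts at most srclen bytes of src into dst (capacity dstlen
 * wide characters); returns the number of wide characters written,
 * or (size_t)-1 on an invalid sequence. */
size_t convert_fragment(wchar_t *dst, size_t dstlen,
                        const char *src, size_t srclen)
{
    mbstate_t st;
    const char *p = src;

    memset(&st, 0, sizeof st);
    return mbsnrtowcs(dst, &p, srclen, dstlen, &st);
}
```

For example, converting exactly the first 5 bytes of a longer UTF-8
buffer beginning "caf\xc3\xa9..." yields 4 wide characters without the
source having to be NUL-terminated at that point.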

I did some quick benchmarks:

My wrapper functions are about 30% slower than simple function calls
to mbsrtowcs()/wcsrtombs().


Time for converting 1032 bytes of utf-8 strings (mixed 1 and 2-byte
multi-byte characters) to utf-32 strings with iconv() in a loop 100000
times is: 19.04 seconds

Time for converting 1032 bytes of utf-8 strings (mixed 1 and 2-byte
multi-byte characters) to wchar_t strings with my wrapper functions in a
loop 100000 times is: 5.42 seconds

Using iconv() is about 3.5 times slower, which is a bit surprising.

Erik Fair

Jul 14, 2010, 10:38:42 PM

On Jul 11, 2010, at 05:40, Sad Clouds wrote:

> On Sun, 11 Jul 2010 07:19:12 -0400 (EDT)
> der Mouse <mo...@Rodents-Montreal.ORG> wrote:
>
>>> If you want to do something like regular expression string matching,
>>> you would call mbsrtowcs() to convert multi-byte filename string to
>>> a fixed wide character string.
>>

>> Maybe. If you want to do regular expression matching against
>> _character_ strings, yes. If _octet_ strings, no.
>

> I'm not sure if simply comparing 8-bit integer units is going to work.
> Some encodings (e.g. JIS) may use escape sequences to indicate
> shifting to a two-byte encoding.
>
> If the escape sequence to shift to Kanji is '<ESC>$B' and you're
> looking for ASCII '$' character, then part of the escape sequence will
> match.
>
> It seems to defeat the whole point of doing character comparison,
> because you end up matching control data, which is not part of a
> logical character sequence that represents the string.

two comments:

1. there's a fundamentally nasty problem that UNIX itself has never dealt with in a general way: what's in that file? Text? MPEG streams? JPEG?

The answer has typically been, "well, most of our software handles text (meaning 7-bit ASCII), and if you need something different, you have software to obtain or write ..." Which is to say, UNIX mostly avoided the question, other than by implication of the formats that the installed base of software was prepared to handle.

Extending that to ISO-8859-1 ("ISO Latin 1") was easy ("make everything 8-bit clean! No, that's not a parity bit any more!") and that handled what was then the western "free" world that traded in computers and software. Those of us involved in the IETF MIME effort did our best to think beyond that limited view of the world and try to make it possible (if a bit messy) for everyone to exchange information in character sets which express their native languages.

However, the IETF is explicitly (with one glaring, embarrassing exception) agnostic about software - they care about "bits on the wire" (protocols) not whatever your OS may be doing (e.g. APIs). It's similar to declaring a language for a technical conference - so long as you can express your thoughts in that language to the other attendees, who cares what your native language is? Keep your notes in whatever script you like.

Theoretically, the POSIX locale stuff is supposed to handle things beyond that, but it's a more complicated and subtle problem than those POSIX committees really thought about. Just setting the LANG environment variable (and its associates) to wherever you are or whatever you speak/read really tells the system and software exactly nothing about the content of the files you are manipulating - LANG speaks more of I/O to you, i.e. what you're prepared to read on a display, and what sorts of characters you'll be inputting from your ... input devices.

I commend this well written paper to your attention:

http://plan9.bell-labs.com/sys/doc/utf.html

which discusses what the Plan 9 people (Rob Pike, Ken Thompson, et al.) did about the software problem, and explicitly what they decided to punt on. A precis: "we replaced the ASCII assumption with Unicode/UTF-8 because UTF-8 is a proper superset of ASCII (i.e. backward compatible) and also subsumes pretty much all other interesting character sets (with some warts), so we can translate into it without (much) semantic information loss."

A little history of how UTF-8 actually came about is here:

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt


2. With regard to file contents, there are three approaches: guessing, assuming a default, or explicit meta-data (magic bytes/cookies, filename extensions, or ... a field in the inode or whatever filesystem meta-data bundle you have).

Guessing has obvious disadvantages and probable violations of der Mouse's cherished "principle of least astonishment" (I cherish that principle, too). See file(1) for a rather heroic guessing program.

Assuming a default ... well, that depends on the default, and, as you point out, what happens if the file contents and the default don't match? That could lead to an astonishing result, just like guessing wrong. Not good to sort(1) a shift-JIS file if sort(1) is only expecting ASCII. We might do better if the assumption is UTF-8 and we both modify base system software to deal with that, and provide tools e.g. iconv(1), to convert into and out of UTF-8.

Which leaves explicit meta-data. Apple went this route in HFS from day one of MacOS with Type and Creator right in the "Finder info" (their version of an inode), though it took them a very long time to deal with both the interchange issue and the notion that there could be file format standards that multiple programs can view/manipulate (e.g. RTF, PDF, MPEG, JPEG).

UNIX and Microsoft DOS (and its successors) have been using both in-file magic cookies and filename "extensions", though in UNIX filename extensions were always a convention rather than anything required by the OS or the filesystem; the period is just another valid filename character. Apple has been going in this direction with MacOS X, with a stated intent to abandon the explicit meta-data they already have in their filesystem; given their UI, I think that's a mistake. UNIX seems to work pretty reasonably with its mishmash of conventions ... but then I don't use X11 (too many years of using the much more thoroughly integrated MacOS environment has spoiled me - every time I try to use X, I have to work hard to suppress the strong desire to do violence to the people responsible for it).

Even explicit meta-data leaves us with a nasty "M by N" problem: M programs/libraries to modify for N different code sets ...

A whole lot of software has already been written to deal with this problem (but not necessarily completely or well), and you would do well to research what's available before attempting to reinvent your own rounder wheel - someone might have already solved your particular problem ... just not in the base NetBSD distribution.

Erik <fa...@netbsd.org>

Sad Clouds

Jul 15, 2010, 6:26:51 AM
On Wed, 14 Jul 2010 19:38:42 -0700
Erik Fair <fa...@netbsd.org> wrote:

> A whole lot of software has already been written to deal with this
> problem (but not necessarily completely or well), and you would do
> well to research what's available before attempting to reinvent your
> own rounder wheel - someone might have already solved your particular
> problem ... just not in the base NetBSD distribution.
>
> Erik <fa...@netbsd.org>

Well yes, I did some research, and in regard to the C language and
internationalization, it's quite difficult to find documentation that
provides developers with sound advice about how to handle different
character encodings.

For example, the code I'm writing at the moment parses program
configuration files, which are simple text files. However, assuming
that text file == ASCII file is a bit restrictive. For example:

log: "/path/to/file_name";

The path and filename strings can be encoded in many different ways -
UTF-8, UTF-16, UTF-32, JIS, KOI-8 and so on. I don't think Unix
filesystems care what encoding it is; they simply treat it as a
sequence of octets, as long as it doesn't contain NUL and '/' bytes.

As a software developer you need to figure out the following two things:

1. What different encodings can your program accept and how to
determine/auto-detect them?

2. How do you represent this data internally in your program?

I think the answer to question 1 depends on the context. If it's your
local data, e.g. system configuration files, filename encodings, etc.
then the Unix locale is the most reliable way to tell the encoding.

If it's the data you get over the network, e.g. email, web pages, etc.
then the encoding is explicitly specified either in protocol headers,
or in the file.

The answer to question 2 is a bit more complex. Some environments use
UTF-8 or UTF-16, but since these are variable-length encodings, you
can't have simple pointers to strings and you can't
increment/decrement pointers by N characters forward/backward.

I settled on wchar_t and the C library wide character functions. I
think I can use them with minimal fuss, and there is always iconv() or
similar if I need to convert from/to some weird encoding.

der Mouse

Jul 15, 2010, 9:28:50 AM
> 1. there's a fundamentally nasty problem that UNIX itself has never
> dealt with in a general way: what's in that file? Text? MPEG
> streams? JPEG?

I would disagree that UNIX has never dealt with that problem. I would
say that UNIX _has_ dealt with it, by explicitly pushing it off to the
application layer, and that, indeed, that is where much of its power
and flexibility comes from. I regularly do useful things by treating
data as if it were of a type it originally wasn't intended to be.

This approach has problems, of course - perhaps most notably at the
moment, the conflict between octet strings and character strings - but,
well, try to find an approach to anything that doesn't have problems.

> The answer has typically been, "well, most of our software handles

> text (meaning 7-bit ASCII), [...]" [...]

> Extending that to ISO-8859-1 ("ISO Latin 1") was easy ("make
> everything 8-bit clean! No, that's not a parity bit any more!")

There's a critical point here: doing that did _not_ just extend that to
Latin-1: it extended it to whatever charset and encoding your input and
display devices felt like using. I can work (and have worked) with
8859-7 text simply by starting a new terminal emulator with an 8859-7
font. (Input is a little more awkward than 8859-1 input, but that's
because I've put more effort into 8859-1 input than 8859-7 input, not
because there's anything fundamentally more difficult about -7.) There
are lots of Linux users that use UTF-8 regularly, because they have
input and output setups that make UTF-8 easy compared to other
encodings and charsets. (Not that there's anything fundamentally
different about Linux in this regard; they've just put the time into
making their software support UTF-8. There may be others; Linux is
just the one I'm aware of because I've run into it personally.)

> Theoretically, the POSIX locale stuff is supposed to handle things
> beyond that, but it's a more complicated and subtle problem than
> those POSIX committees really thought about. Just setting LANG
> environment variable (and its associates) to where ever you are or
> whatever you speak/read really tells the system and software exactly
> nothing about the content of the files you are manipulating - LANG
> speaks more of I/O to you, i.e. what you're prepared to read on a
> display, and what sorts of characters you'll be inputting from your
> ... input devices.

Right. It does its job: it replaces the old "text is ASCII" assumption
with a variable "text is $LANG" assumption (I'm deliberately glossing
over many details here, but that's what it amounts to from the point of
view of this discussion).

> I commend this well written paper to your attention:

> http://plan9.bell-labs.com/sys/doc/utf.html

Well-written, perhaps. But I'm not convinced their choices are
particularly good ones.

In particular, if you want to use Unicode, I think you should stop
trying to use octets for character strings in any form: I think char
should be a 16-bit type, basically. (It's not quite that simple,
mostly because of all the octet streams that you'll want to handle, and
by definition char is the smallest integral type.) A bit like Plan9's
Rune, but without the UTF-8 form.

> 2. With regard to file contents, there are three approaches:

> guessing, assuming a default, or explicit meta-data [...].

Actually, assuming a default is just a special case of guessing.

For that matter, so is explicit meta-data; it amounts to guessing that
the labeling is accurate. I regularly see mislabeled data, perhaps
most commonly email labeled as ISO-8859-1 but containing octets in the
0x80-0x9f range, which are not 8859-1 text. This substantially impairs
my confidence that a metadata-based scheme will have accurate metadata.

I went through my larval phase under VMS, which uses a fairly elaborate
metadata scheme. For all the benefits it had, I still found myself
regularly using CONVERT/FDL to rewrite the metadata attached to file
contents so I could use tools that didn't understand what the original
metadata specified.

> [...] ... but then I don't use X11 (too many years of using the much
> more thoroughly integrated MacOS environment has spoiled me - every
> time I try to use X, I have to work hard to suppress the strong
> desire to do violence to the people responsible for it).

How odd. Every time I have occasion to subject myself to a Mac UI, I
find myself with related feelings. I don't know whether it's just a
question of what we're used to or whether there's something different
between us that makes us better matches to different UI styles.

Also, you may be confusing X with some common window system built on X.
It would be entirely possible to build a UI as thoroughly integrated as
the Mac one is atop X. (I don't know why it hasn't been done, or why
it hasn't gained wide popularity if it has.) X is not a window system,
despite being named as one; it's really a framework for building window
systems.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--

Joerg Sonnenberger

unread,
Jul 15, 2010, 9:31:15 AM7/15/10
to
On Wed, Jul 14, 2010 at 07:38:42PM -0700, Erik Fair wrote:
> I commend this well written paper to your attention:
>
> http://plan9.bell-labs.com/sys/doc/utf.html

...which is also simplistic in the assumptions made and problems faced.
If you want to know about the issues with I18N and Unicode in specific,
don't ask Americans. Don't ask Europeans either, they only have slightly
more exposure to the problems.

Itojun mentioned some of the issues in
ftp://ftp.itojun.org/pub/paper/itojun-freenix2001-presen.ps.gz

Joerg

der Mouse

unread,
Jul 15, 2010, 10:19:08 AM7/15/10
to
>> http://plan9.bell-labs.com/sys/doc/utf.html
> ...which is also simplistic in the assumption and problems faced. If
> you want to know about the issues with I18N and Unicode in specific,
> don't ask Americans. Don't ask Europeans either, they only have
> slightly more exposure to the problems.

That actually was one thing I very much liked about that Plan 9
document. They recognized and specifically called out the places where
they knew they were making limiting assumptions - such as the
assumption of left-to-right top-to-bottom text.

I don't agree with everything they did, but that's hardly surprising,
and it doesn't make what they did any worse. (Indeed, I suspect there
are a few who would call that a recommendation.)

Trying to solve all the problems usually ends up failing and solving
none. Plan 9 doesn't solve all the world's UI issues. But it
addressed a specific, well-chosen subset of them and did a pretty good
job of solving them. That they didn't solve others doesn't invalidate
what they did do.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--

Matthew Mondor

unread,
Jul 15, 2010, 4:37:06 PM7/15/10
to
On Thu, 15 Jul 2010 09:28:50 -0400 (EDT)
der Mouse <mo...@Rodents-Montreal.ORG> wrote:

> > 1. there's a fundamentally nasty problem that UNIX itself has never
> > dealt with in a general way: what's in that file? Text? MPEG
> > streams? JPEG?
>

> I would disagree that UNIX has never dealt with that problem. I would
> say that UNIX _has_ dealt with it, by explicitly pushing it off to the
> application layer, and that, indeed, that is where much of its power
> and flexibility comes from. I regularly do useful things by treating
> data as if it were of a type it originally wasn't intended to be.

I guess that with the advent of FUSE/PUFFS it would not be too hard
to have an FS-level type layer equipped with a range of libraries for
various formats, such as was used on Amiga for a while (where the OS
itself didn't care, but application level could plug in custom
filesystem and device handlers which the microkernel model made easy).
I've never needed that kind of abstraction personally, but it's not
something that couldn't be done on some unix systems today, IMO.

In a way, I think that huge environments like Gnome have a virtual VFS
library layer to deal with such integration issues (where the file
manager can allow the user inside archives, for instance, and an
integrated viewer can deal with a number of "codec" modules). Of
course that's not identical to a system where a type metadata tag is
necessary for every file. But it can auto-detect a-la-file(1),
optionally then store a tag/property for later use as an optimization,
and deal with many file formats at the application/library/environment
level...

And indeed when you do have to deal yourself with the data, the system
won't be an obstacle, because it's not something dealt with by the OS
itself.

> > [...] ... but then I don't use X11 (too many years of using the much
> > more thoroughly integrated MacOS environment has spoiled me - every
> > time I try to use X, I have to work hard to suppress the strong
> > desire to do violence to the people responsible for it).

> [...]


> Also, you may be confusing X with some common window system built on X.
> It would be entirely possible to build a UI as thoroughly integrated as
> the Mac one is atop X. (I don't know why it hasn't been done, or why
> it hasn't gained wide popularity if it has.) X is not a window system,
> despite being named as one; it's really a framework for building window
> systems.

KDE and Gnome are getting pretty close, as long as you restrict
yourself to the set of official and recent enough applications for
them. If that matters to you, of course (I think it's no priority for
you or me). Similarly, on OSX you have to restrict yourself to Cocoa
applications if integration is important...

An area where I think there is lacking integration between the OS and
open source desktop environments is when a removable device is removed
before the buffer cache could be flushed or volume still mounted.

We don't currently ask the user to insert the volume back easily and
resume gracefully. This is something which AmigaOS could do. What
helped there was that the hardware (including floppy drives) could
detect insertion/removal, and every file system was tagged with a
volume name and ID. Since USB/Firewire can detect such events, an
implementation could be possible in the future, though.

Another aspect to this was that on AmigaOS an application could attempt
to open a file and the environment would ask the user to insert the
needed volume. This also needs volume name support. Since open
source user GUI environments also support a degree of OS portability,
this is an additional obstacle to such highly OS/GUI coupled
integration (and I'm unsure a HAL-type layer can deal with all of those
differences for so many features, some of which are totally lacking on
one system or another).

I honestly don't know if OSX does this, but it certainly is easier to
implement on an operating system which includes its own desktop
environment natively...
--
Matt

David Holland

unread,
Jul 15, 2010, 4:42:43 PM7/15/10
to
On Wed, Jul 14, 2010 at 07:38:42PM -0700, Erik Fair wrote:
> Theoretically, the POSIX locale stuff is supposed to handle things
> beyond that, but it's a more complicated and subtle problem than those
> POSIX committees really thought about.

Indeed.

> I commend this well written paper to your attention:
>
> http://plan9.bell-labs.com/sys/doc/utf.html
>
> which discusses what the Plan 9 people (Rob Pike, Ken Thompson,
> et. al) did about the software problem (and what they did about
> it), and explicitly what they decided to punt on. A precis: "we
> replaced the ASCII assumption with Unicode/UTF-8 because UTF-8 is a
> proper superset of ASCII (i.e. backward compatible) and also
> subsumes pretty much all other interesting character sets (with
> some warts) so we can translate into it without (much) semantic
> information loss."

The problem with UTF-8 in Unix is that it doesn't actually solve the
labeling problem: given comprehensive adoption you no longer really
need to know what kind of text any given file or string is, but you
still need to know if the file contains text (UTF-8 encoded symbols)
or binary (octets), because not all octet sequences are valid UTF-8.

I don't see a viable way forward that doesn't involve labeling
everything.

--
David A. Holland
dhol...@netbsd.org

Erik Fair

unread,
Jul 15, 2010, 8:17:49 PM7/15/10
to

On Jul 15, 2010, at 13:42, David Holland wrote:

> The problem with UTF-8 in Unix is that it doesn't actually solve the
> labeling problem: given comprehensive adotpion you no longer really
> need to know what kind of text any given file or string is, but you
> still need to know if the file contains text (UTF-8 encoded symbols)
> or binary (octets), because not all octet sequences are valid UTF-8.
>
> I don't see a viable way forward that doesn't involve labeling
> everything.

If your goal is to be in deterministic file content nirvana, yes, that's the way to get there, but I'd argue it's an awful lot of work to deal with the M x N software problem I mentioned (and we'll have to add a type field to inodes which will trigger a very old debate about whether UNIX files should be just bags of bytes; the required changes for the full M x N are pretty pervasive and invasive), and the easy counter argument in an open source OS community is: "OK, who's going to write and test all that code?"

The Plan 9 people didn't shoot for a utopia - as is often their wont, they improved the situation a whole lot (Unicode/UTF-8 is a lot more expressive and encompassing of the possible space of human communications than ASCII or ISO-8859-1) with a relatively modest effort, and it's "good enough" for a much wider range of applications than the previous default of ASCII or ISO-8859-1 (does sort(1) even work right with ISO-8859-1? The man page in NetBSD 5.0 is silent on that question, but given where the diacritical characters are in the ISO-8859-1 codeset space, I bet it doesn't collate properly with a straight byte-numerical sort).

The more I ponder this, the more I think that:

1. the ASCII default status quo isn't good enough any more (and I'm sure our users in south & east Asia, not to mention eastern Europe, would agree),

2. Unicode/UTF-8 as a new default offers backward compatibility while expanding the character space quite broadly, and without anywhere near as much work (or as much paradigm shift, i.e. breaking "Unix files are a bag of bytes") on our software,

3. the "change the base software default" approach can allow us to examine and call out our software's implicit assumptions (e.g. "I'm operating on ASCII" or "I need to parse these bytes semantically") so that if/when we decide to make a run at the bigger "let's handle all character sets" M x N problem, we'll know much better what needs to be done.

4. we even have "later mover" advantage - the Plan 9 paper describes what they did, and there's standards work (hopefully sane) that we can use if we deem it correct.

Think of it as a stepwise refinement in the direction of character set processing nirvana. My concern is that if we scope the problem too large by trying to do everything, we'll never get it done, with lots of sturm und drang in the process.

Erik <fa...@netbsd.org>

Giles Lean

unread,
Jul 16, 2010, 4:33:42 AM7/16/10
to

Joerg Sonnenberger <jo...@britannica.bec.de> wrote:

> On Wed, Jul 14, 2010 at 07:38:42PM -0700, Erik Fair wrote:

> > I commend this well written paper to your attention:
> >
> > http://plan9.bell-labs.com/sys/doc/utf.html
>

> ...which is also simplistic in the assumption and problems faced. If you
> want to know about the issues with I18N and Unicode in specific, don't
> ask Americans. Don't ask Europeans either, they only have slightly more
> exposure to the problems.

I suppose you shouldn't ask Australians either, although I'm
Australian and have been mixed up with I18N issues on and off
including for Asian languages over the last 20 years or so,
and have got to see some of the problems first hand.

Since the Plan9 URL has been mentioned, I hope it's not too
off topic to say that I concur that that paper is too
simplistic about the advantages of Unicode and UTF-8, and that
the very same problems are present in Google's new Go
language, several of whose designers participated in the Plan9
work.

For anyone who's not interested in the gory details of this
sort of stuff, please stop reading now. It only gets uglier;
the world is a complex place, my Japanese friends have even
more objections to Unicode as "one size fits all" than I do
which I won't attempt to explain here, even if I were sure I
remembered them all.

For anyone who is interested in why s/ASCII/Unicode/ isn't
quite enough to write applications for worldwide use (even
worldwide use only in a single language, or even only for
worldwide use only in English!) here are a few points I find
left out of most discussions of Unicode.

The first two are points on which I disagree specifically with
the Plan 9 paper:

1. the decision not to address Unicode combining characters
2. the idea that the use of Unicode is sufficient excuse to
provide any of the functionality of locales

#1 means applications dealing with arbitrary Unicode data
(whether UTF-8 or not) must handle normalisation before even
being able to compare two strings for equality. (This is
progress?)

Even English has _some_ characters with accents, although they
are rare and English speakers have seemingly become very
tolerant of their loss in the computer age, so this isn't
"just" a problem for European languages. (Never mind the
rudeness of arbitrarily dropping accents from characters in
peoples' names.)

For #2, the glaring breakages in almost any application are
threefold:

a) how do you sort anything?

Even presuming English-only I'd like dictionary order
sometimes, and other times ASCII for consistency with
other applications or printed material, if it has used
ASCII order.

Non-English languages of course have their own rules
which should be respected, and given the number of
languages in the world and variations in local
preferences it is only practical to allow _users_ to
define collation order if no pre-existing order matches
their preference or has been created for their
language.

b) how can you (ever) localise error messages?

It would be a reasonable argument to say that an error
message catalogue can be implemented independently of
POSIX style locales, but localisation of an application
certainly requires translation of error messages and
indeed most of a typical application's user interface.

c) how do you handle varying date formats?

If I had a dollar (anyone's dollar -- Australian,
Canadian, Singaporean, USD, whatever) for each time
I've seen a date and had to stop and evaluate whether
it was more likely MM/DD/YY or DD/MM/YY I imagine I
could have retired long since.

3. An issue of current day importance (although not relevant
to Plan9, as it was an operating system) is how file
systems handle Unicode.

For #3 Unix -- in theory -- isn't too bad: most of its file
systems will take a series of bytes, disallowing only '/'
(which is represented as itself in UTF-8, so not typically a
problem) and '\0' (which UTF-8 avoids, so not a problem
either).

Where problems arise is where file systems (such as the
default file system on OS X) transform file names: the file
name you passed as valid UTF-8 to open() or creat() may not be
the same series of bytes you get back when you use readdir()
to examine the files in the directory. This makes for
"interesting times" for any software which wants to store a
list of file names and then access them.

> Itojun mentioned some of the issues in
> ftp://ftp.itojun.org/pub/paper/itojun-freenix2001-presen.ps.gz

Recommended.

My personal expectation is that -- like it or not -- Unicode
in the form of UTF-8 will be (if it isn't already) "the new
ASCII", but I _do_ wish that language (and operating system)
designers and vendors would:

i. specify the normal form of "their" UTF-8 strings
(and perhaps allow programmers to override the default)

ii. provide support for conversion to and from "foreign"
UTF-8 normalisation forms

iii. handle -- as gracefully as possible -- the existing file
system file name issues, and vendors should be encouraged
(severely, if that's what it takes) to allow file names
in _any_ Unicode encoding, and provide means to read
those file names "as written" (presumably: "as bytes,
trust me, I know what I'm doing") as well as "in my
preferred encoding" and with a choice of errors or "best
effort" conversion where file names are unrepresentable
(e.g. invalid UTF-8 sequence, code point doesn't fit into
UTF-16, etc).

Which still leaves open the problem of locales and issues of
multi-lingual documents and applications where a single
Unicode glyph really should be represented differently
depending upon what language it is being used for, but I did
say at the start of this too-lengthy message that the issues
get ugly.

The problems are hard; naïve (that's "naive" with a diaeresis
above the 'i', in case it was garbled en-route to you)
solutions will always be incomplete. Sweeping the
incompleteness under the carpet with the words "Well, it works
for me" is ... unimpressive.

Giles

Ken Hornstein

unread,
Jul 16, 2010, 8:17:32 AM7/16/10
to
>For anyone who's not interested in the gory details of this
>sort of stuff, please stop reading now. It only gets uglier;
>the world is a complex place, my Japanese friends have even
>more objections to Unicode as "one size fits all" than I do
>which I won't attempt to explain here, even if I were sure I
>remembered them all.
>[...]

You know, this sort of illustrates the problem I've always had with
I18N, which is: what the hell are you talking about?

I try to understand, I really do ... I've been trying to understand for
approximately 10 years now. But every time I try to read something written
by someone who understands what is going on, I get lost, and I have never
really seen anyone explain the answers to some basic questions:

- How, exactly, are UTF-8 and Unicode related?
- What exactly is a "code point"?
- What, exactly, do people mean by "normalization" in this context?
- How do locales interoperate with UTF-8/Unicode?
- And, most importantly: what do I, as a programmer, need to do to make
my application work with all of the above? I read the posted Plan 9 link,
and I guess that in some cases I need to deal with "Runes" (if I was
programming on Plan 9), but it's still not exactly clear.

I'm not saying anyone should feel obligated to answer these questions (but,
hey, if you have a good reference, I'd be glad to read it), but I'm trying
to illustrate the information gap that prevents some people from participating
in these discussions in a meaningful way.

I try to be a good international citizen, I really do ... but in a practical
sense it seems to be _so_ complicated that I basically just punt and end
up doing what I always do ... and it seems that as long as I'm 8-bit clean,
that makes me and most of the Europeans happy enough (although it tends
to piss off Japanese and Chinese users, and I'm sorry about that).

--Ken

der Mouse

unread,
Jul 16, 2010, 9:24:16 AM7/16/10
to
> [...] I have never really seen anyone explain the answers to some
> basic questions:

> - How, exactly, are UTF-8 and Unicode related?
> - What exactly is a "code point"?

Unicode is a character set: a mapping between small(ish) integers and
"character"s - which here means some kind of abstractions of the
marks-on-paper that non-computer writing uses so heavily. (I say
"abstractions" because there is a sense in which, for example, all
lowercase "j" characters are the same regardless of which font, size,
etc is used; it is that abstract common entity that I mean by
"character" here.)

The characters are things like "Latin lowercase j" or "Devanagri ta" or
"Greek uppercase xi". The integers are in the range 0 to 65535 (or at
least that's a workable approximation for purposes of this discussion;
the Unicode documentation I have does describe stuff above 65535, and
it might be better to go with 24 or 32 bits instead of 16, but most of
what I have to write here is independent of the exact range).

A code point is one of those integers.

UTF-8 is a way of encoding a character stream - well, really, a code
point stream, but the distinction between characters and the code
points representing them is often blurred - into an octet stream. A
stream of Unicode codepoints is, conceptually, a stream of these
small(ish) integers. Since there are more than 256 of them, they can't
be mapped to an octet stream as trivially as 8-bit sets like ISO-8859-1
or KOI-8 can (or <8-bit sets like ASCII). UTF-8 has a variety of
interesting and important properties, some nice and some less nice, but
those aren't terribly relevant to your question, so I'll leave them for
another time.

> - What, exactly, do people mean by "normalization" in this context?

Unicode has something called combining characters. These are a little
like dead accent keys on keyboards - the idea being that (to pick a
possibly fictitious example) you could represent an e-acute as the
two-character sequence <lowercase e> <combining acute accent>. (You
wouldn't usually do so in this particular case, because there is an
e-acute character already, but may have to if you want something like a
circumflex over a dollar sign.)

Normalization is the process of finding such cases where a character
may be represented more than one way and converting them to some
uniform representation (so that all e-acutes, for example, are
represented the same way). Which representation you pick is not
important - well, it's plenty important from many points of view, but
it's not important from the point of view of explaining the concept of
normalization.

> - How do locales interoperate with UTF-8/Unicode?

They are mostly orthogonal. Locales are things like "does money get
printed with $ or £ or ¥ or Rs or what" and "does the number 1234567
get textified as 1,234,567 or 1 234 567 or 123 4567 or what" (to pick
two of the simplest examples). Unicode and UTF-8 are relevant when you
try to represent those alternatives, but they aren't relevant to
picking which alternative to use (well, except that you presumably
aren't interested in supporting alternatives that call for characters
you don't have, an issue which mostly goes away with Unicode).

> - And, most importantly: what do I, as a programmer, need to do to
> make my application work with all of the above?

That varies drastically depending on what your application does and
what its target audience is. I can't outline it all here; much of this
thread has been about various aspects of that very question.

> I'm not saying anyone should feel obligated to answer these questions
> (but, hey, if you have a good reference, I'd be glad to read it), but
> I'm trying to illustrate the information gap that prevents some
> people from participating in these discussions in a meaningful way.

I don't have a good reference to point you at. Most of the above is
not stuff I got out of a reference; it's stuff picked up from many
assorted places over the years. It's possibly relevant that one of my
better friends actually cares about Unicode, has put a significant
amount of his own time into working with the various bodies involved,
and such. He taught me a nontrivial amount of the above.

> I try to be a good international citizen, I really do ... but in a
> practical sense it seems to be _so_ complicated

It is. The world is a complicated place. Trying to build software
that can deal with even a significant fraction of that complexity is
not a simple task.

> that I basically just punt and end up doing what I always do ... and
> it seems that as long as I'm 8-bit clean, that makes me and most of
> the Europeans happy enough (although it tends to piss off Japanese
> and Chinese users, and I'm sorry about that).

The "encoded characters are 1-to-1 with octets" assumption is a common
one, and, yes, it does tend to tick off those who use larger character
sets. I have a mailing-list acquaintance living in Japan who routinely
omits the f when writing about shift-JIS. :) This is part of the
reason that I wrote, upthread, that I think that if you want to use
Unicode more than trivially you should just bite the bullet and stop
working with octets except as an unfortunate I/O evil. UTF-8 is a
valiant attempt to deal with the impedance mismatch between Unicode
character strings and octet strings, but it can't cure it.

This is not to say that I am not guilty too. I write a lot of code
with English strings and 8-bit chars and the assumption that a char
represents exactly one character. I'm not happy about it either; when
I write my own OS (hah, right) I intend to do it righter.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--

Valeriy E. Ushakov

unread,
Jul 16, 2010, 9:34:12 AM7/16/10
to
Ken Hornstein <ke...@pobox.com> wrote:

> I try to understand, I really do ... I've been trying to understand for
> approximately 10 years now. But every time I try to read something written
> by someone who understands what is going on, I get lost, and I have never
> really seen anyone explain the answers to some basic questions:
>
> - How, exactly, are UTF-8 and Unicode related?
> - What exactly is a "code point"?

http://www.unicode.org/reports/tr17/ addresses both.


> - What, exactly, do people mean by "normalization" in this context?

E.g. things like equivalence between single character "a with
diaeresis" vs. two characters "a" and "combining diaeresis".

http://unicode.org/reports/tr15/ has all the details.


> - How do locales interoperate with UTF-8/Unicode?

They are orthogonal.

Your word for Friday or the format used to print a date or your
culturally expected collation order exist independently of any coded
character set. So you have ru_RU.KOI8-R locale, and ru_RU.ISO8859-5
locale and ru_RU.UTF-8 locale - each with its coded charset specific
encoding for the word for Friday and appropriate numeric tables to
make two strings (as encoded in the locale's charset) collate
according to the expected order.

If (abstract) character set of your locale is covered by Unicode,
which is true for many locales, you have an *internal implementation
option* to write your locale definitions using unicode to referer to
your characters and then you can mass-produce all other locales by
converting from unicode to the locale's coded charset.

E.g. you can write in the template locale definition (expressed in
unicode) that the abbreviated name for Friday is \u043f\u0442 and then
derive actual values for koi8-r, iso8859-5 and utf-8 locales by doing
the equivalent of iconv -f utf-16 -t $(locale_charset). I think this
is what glibc does.

Alternatively you can just write all separate locale definitions in
their native charset.

-uwe

Alan Barrett

unread,
Jul 16, 2010, 9:51:17 AM7/16/10
to
On Fri, 16 Jul 2010, Ken Hornstein wrote:
> But every time I try to read something written
> by someone who understands what is going on, I get lost, and I have never
> really seen anyone explain the answers to some basic questions:

The "Terminology" section of the wikipedia article on "Character encoding"
is not great, but it may help.
<http://en.wikipedia.org/wiki/Character_encoding#Terminology>.

> - How, exactly, are UTF-8 and Unicode related?

Unicode is a lot of things, but for the purposes of contrasting Unicode
with UTF-8, think of Unicode as a mapping from 21-bit integers to
characters; UTF-8 is then a set of rules for representing those
21-bit integers using sequences of 8-bit bytes or octets.

> - What exactly is a "code point"?

A code point is an integer, which maps to a character in a coded
character set. For example, the code point for the letter "A" in the
ASCII coded character set is 65 or 0x41. For all characters that appear
in the ASCII repertoire, their code points in ASCII and in Unicode are
identical (modulo quibbles about <hyphen> versus <minus sign> versus
<hyphen-minus>, and <apostrophe> versus <left single quote>).

> - What, exactly, do people mean by "normalization" in this context?

Do you represent <capital letter FOO with accent BAR> as a single
character, or as the two-character sequence <capital letter
FOO><combining accent BAR>? What about <capital letter FOO with
accent BAR and accent BAZ>? Is <ligature "ffi"> equivalent to <letter
"f"><letter "f"><letter "i">? There are various types of normalisation
rules giving different answers to these and other questions.

--apb (Alan Barrett)

Sad Clouds

unread,
Jul 16, 2010, 9:53:44 AM7/16/10
to
On Fri, 16 Jul 2010 08:17:32 -0400
Ken Hornstein <ke...@pobox.com> wrote:


> - How, exactly, are UTF-8 and Unicode related?
> - What exactly is a "code point"?
> - What, exactly, do people mean by "normalization" in this context?
> - How do locales interoperate with UTF-8/Unicode?
> - And, most importantly: what do I, as a programmer, need to do to
> make my application work with all of the above? I read the posted
> Plan 9 link, and I guess that in some cases I need to deal with
> "Runes" (if I was programming on Plan 9), but it's still not exactly
> clear.

Have a look at the O'reilly book "Unicode explained" if you want to
know what Unicode is. You may need to read it a few times in order to
fully understand. I'm still in the process of reading it.

From what I understand:

utf-8 is an encoding, i.e. it's a particular way to represent unicode
character as a sequence of octets (bytes). You also have other
encodings, like utf-16 and utf-32, they all represent the same unicode
characters, but encode them in different ways, i.e. 16-bit integer and
32-bit integer.

Utf-8 is a variable length encoding, meaning that some characters are
represented as 1-byte, some as 2-bytes, and so on. The reason why many
people like utf-8 is because ascii characters are encoded in the same
way, which does not break older software and because utf-8 encoding is
independent of byte-order, i.e. it's just a sequence of bytes.

With utf-16 and utf-32 you need to know if data was encoded in big or
little-endian byte order, with utf-8 you don't.

I think a code point is a unique code assigned to each character (or
location) in unicode. In ascii you have code points from 0 to 127, in
unicode you have many more.

I think locales are independent of unicode, i.e. locales can support
other systems for representing characters, not just unicode. I think
locale stuff was developed before unicode became widespread.

If you program in C on Unix, then I think using wchar_t is the most
sensible way. Wide character routines in C library take care of
string/character comparison and multi-byte to wchar_t conversion.

Some people use utf-8 internally in their programs, but I'm not sure
how easy it is to handle utf-8 strings, because each character can be
1 to 4 bytes in length. You cannot simply extract the Nth character
with 'character = utf8_string[N]' because of the variable length
encoding.

Ken Hornstein

unread,
Jul 16, 2010, 10:56:36 AM7/16/10
to
Thanks to everyone for the answers to my dumb questions; it does help fill
in the gaps a lot.

I have a few followup questions that if anyone knows the answer to, then
that would be appreciated.

Okay, I now know what Unicode is. A followup question ... it seems
that Unicode is developed in tandem as ISO-10646; do people mostly
consider those the same, or are there differences that affect
implementation details?

As people have explained, UTF-8 is a Unicode encoding that is a) a sequence
of bytes, b) doesn't use NUL ('\0'), c) is a superset of ASCII so ASCII
continues to work, and d) characters may take more than one byte.

Obviously the last one is the one that presents a number of challenges.
Okay, fine. For what I would say are my "normal" applications I don't
really do that much on a character-by-character basis; I generally deal
with whole strings. From what people are saying, if I treat strings
as arrays of char and as opaque objects, then I can simply say everything
is UTF-8 and most stuff should work fine, right? Obviously I'll have to
know somehow if I get something from a file or the network is UTF-8 and
do the right thing.

But this brings up some possibly dumb questions: say I have a UTF8 byte
sequence I want to display on standard out; do I simply use printf("%s")
like I have always been? Do I have to do something different? If so,
what?

Sad Clouds suggested using wchar_t (and I am assuming functions like
wprintf()) everywhere. I see the functions to translate character strings
into wchar_t ... but what do I use if I know that I have UTF-8? And
the reason I asked earlier about locale is that the locale affects the
way the multibyte character routines behave, which makes me think that
the locale setting affects the encoding all of those routines are using.

--Ken

Matthew Mondor

unread,
Jul 16, 2010, 11:30:04 AM7/16/10
to
On Fri, 16 Jul 2010 14:53:44 +0100
Sad Clouds <cryintot...@googlemail.com> wrote:

> Utf-8 is a variable length encoding, meaning that some characters are
> represented as 1-byte, some as 2-bytes, and so on. The reason why many
> people like utf-8 is because ascii characters are encoded in the same
> way, which does not break older software and because utf-8 encoding is
> independent of byte-order, i.e. it's just a sequence of bytes.

Other reasons why UTF-8 is convenient are that in C strings may still
be easily NUL ('\0', 0) terminated, and that the representation is
compact for languages which use no or few non-ASCII characters (albeit
the representation can also be considered bloated in some other
languages, unfortunately).

> Some people use utf-8 internally in their programs, but I'm not sure
> how easy it is to handle utf-8 strings, because each character could be
> 1 to 4 bytes in length. You cannot simply extract the Nth character
> with 'character = utf8_string[N]' because of the variable length
> encoding.

I've seen in some code functions such as utf8_strlen() and the like;
but I also prefer working with a UCS-4/UTF-32 host-endian
representation internally, and to only use UTF-8 as a convenient
external representation.
--
Matt

der Mouse

unread,
Jul 16, 2010, 11:45:50 AM7/16/10
to
> Okay, I now know what Unicode is. A followup question ... it seems
> that Unicode is developed in tandem with ISO-10646; do people mostly
> consider those the same, or are there differences that affect
> implementation details?

Personally? I consider them the same.

But I know that is a simplification. For my purposes, so far, it's
been an ignorable simplification. If I were doing something
sufficiently serious, I would make sure I took the time to look into
whether it remained ignorable.

> As people have explained, UTF-8 is a Unicode encoding that is a) a
> sequence of bytes,

Right.

> b) doesn't use NUL ('\0'),

Wrong. It uses a 0x00 octet (which is what I assume you're talking
about) to represent U+0000. It does not use a 0x00 octet under any
other conditions, though.

> c) is a superset of ASCII so ASCII continues to work,

Mostly. It's more like "a sequence of Unicode characters all in the
ASCII range is represented in UTF-8 as the same string of octets as the
same string represented in the usual `just store ASCII in octets'
convention".

That is, given any string in ASCII stored one character per octet with
the high bits set to 0 (the usual convention for storing ASCII strings
in octet strings), the same sequence of octets is valid UTF-8 for the
Unicode codepoint string for the same characters.

> and d) characters may take more than one byte.

Right.

> Obviously the last one is the one that presents a number of
> challenges.

Well, it's one of them. "Characters do not all take the same number of
octets" is another property UTF-8 has which can cause trouble (though,
like your (d), it's implied by other properties put together).

> From what people are saying, if I treat characters as arrays of char
> and as opaque objects, then I can simply say everything is UTF-8 and
> most stuff should work fine, right?

If everything actually _is_ UTF-8, and you don't need to do any
particular processing, then yes, you can just treat strings as
content-opaque octet sequences. But that's true of pretty much any
encoding.

> But this brings up some possibly dumb questions: say I have a UTF8
> byte sequence I want to display on standard out; do I simply use
> printf("%s") like I have always been? Do I have to do something
> different? If so, what?

"That depends". It depends on whether printf tries to be smart (most
printfs I'm familiar with treat strings as opaque octet sequences for
things like %s, but I'd be surprised if there weren't some that went to
the trouble to process characters rather than octets). It depends on
how the octet sequence produced by your program is interpreted
(terminal or terminal emulator handling UTF-8 or 8859-1 or what). It
depends on what exactly you mean by "display on standard out", too.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--

Sad Clouds

unread,
Jul 16, 2010, 11:50:12 AM7/16/10
to
On Fri, 16 Jul 2010 10:56:36 -0400
Ken Hornstein <ke...@pobox.com> wrote:

> But this brings up some possibly dumb questions: say I have a UTF8
> byte sequence I want to display on standard out; do I simply use
> printf("%s") like I have always been? Do I have to do something
> different? If so, what?
>

That's the good thing about utf-8: you can treat it as a sequence of
normal char objects. If your terminal supports utf-8, then any sequence
of non-ascii chars should be displayed correctly.

> Sad Clouds suggested using wchar_t (and I am assuming functions like
> wprintf()) everywhere. I see the functions to translate character
> strings into wchar_t ... but what do I use if I know that I have
> UTF-8? And the reason I asked earlier about locale is that the
> locale affects the way the multibyte character routines behave, which
> makes me think that the locale setting affects the encoding all of
> those routines are using.

I use wchar_t when I need to know that each character is represented by
a fixed size object. This way you can have a pointer to a string and
look at every character individually just by incrementing the pointer.
Sometimes I do it from left to right, but occasionally I may need to do
it from right to left. For example if you have a filename:

some_long_file_name.txt

To quickly extract the suffix '.txt' you just scan the string from
right to left until you hit the '.' char. I think with utf-8 this type
of string manipulation would be quite messy and you would have to use a
special library that understands utf-8 encodings, etc.

The multi-byte conversion functions are affected by the current locale.
Normally you would call

setlocale(LC_CTYPE, "");

at the start of your program and not change the locale while the
program runs. Setting an empty locale makes the multi-byte conversion
functions query the user's locale environment variables and perform
conversion based on those. So different users can use different locales,
which may result in different character encoding schemes; however, the C
library wide character functions should transparently handle that.

There are two problems with C wide characters:

1. Switching to different locales while the program is running is not
thread-safe and may result in weird errors. This means you can only use
one locale during program run time.

2. The interfaces for the C library multi-byte to wide, and wide to
multi-byte, conversion functions are so badly designed, it's not even
funny. The biggest problem with those functions is the fact they expect
NUL-terminated strings. If you have a partial (not NUL-terminated)
string in the buffer, you can't call a string conversion function on it,
because it won't stop until it finds a NUL and you end up with a buffer
overrun. You cannot "artificially" NUL-terminate the string, because
after reading the NUL char, the function will reset the mbstate_t object
to its initial state. This will mess up the next sequence of multi-byte
characters if the encoding had state.

I spent two days, jumping through the hoops and trying to figure out
how to convert partial strings. I think I nailed it in the end with 30%
performance penalty, but still 3.5 times faster than iconv().

If anyone is interested, I can post the code for the wrapper
functions...

Peter Bex

unread,
Jul 16, 2010, 10:51:46 AM7/16/10
to
On Fri, Jul 16, 2010 at 08:17:32AM -0400, Ken Hornstein wrote:
> You know, this sort of illustrates the problem I've always had with
> I18N, which is: what the hell are you talking about?
>
> I try to understand, I really do ... I've been trying to understand for
> approximately 10 years now. But every time I try to read something written
> by someone who understands what is going on, I get lost, and I have never
> really seen anyone explain the answers to some basic questions:
>
[questions snipped]

>
> I'm not saying anyone should feel obligated to answer these questions (but,
> hey, if you have a good reference, I'd be glad to read it), but I'm trying
> to illustrate the information gap that prevents some people from participating
> in these discussions in a meaningful way.

I found this one day and it was very useful in explaining both to me and
my colleagues what this stuff is all about in a down-to earth manner:
http://www.joelonsoftware.com/articles/Unicode.html

You'll have to excuse the condescending tone this article sometimes assumes,
I guess the author was as frustrated as you are right now :)

After reading this you can decide whether you want to dive into the gory
details and read Unicode specifications :)

Cheers,
Peter
--
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
is especially attractive, not only because it can be economically
and scientifically rewarding, but also because it can be an aesthetic
experience much like composing poetry or music."
-- Donald Knuth

der Mouse

unread,
Jul 16, 2010, 12:06:59 PM7/16/10
to
> I use wchar_t when I need to know that each character is represented
> by a fixed size object. [...] For example if you have a filename:

> some_long_file_name.txt

> To quickly extract the suffix '.txt' you just scan the string from
> right to left, until you hit '.' char. I think with utf-8 this type
> of string manipulation would be quite messy and you would have to use
> a special library that understands utf-8 encodings, etc.

In the case of scanning for a '.', it's totally trivial, because that's
in the ASCII range and thus (a) it's represented as a single 0x2e octet
and (b) a 0x2e octet does not occur under other circumstances. But in
the more general case, where you are or might be scanning for a
non-ASCII character, it's not that easy. Even then, though, it's not
bad; one of UTF-8's nice properties is that it is trivial to identify
whether a given octet is the first octet of a character or not, thus
making it fairly easy to scan a string from right to left. With a
little extra work you can even accumulate the codepoint as you scan
backwards through the octets, so you don't have to scan backwards for
character beginnings and then forward to get the codepoints.

> There are two problems with C wide characters:

There are a lot more than two. :-)

> 1. Switching to different locales while the program is running is not
> thread-safe

This is an implementation issue.

> and may result in weird errors. This means you can only use one
> locale during program run time.

Thread (un)safety doesn't mean you can't switch locales; for example,
if your program is not threaded, thread safety is totally irrelevant.
If you can't switch locales at run time, that's a separate bug.

> 2. [C library wide-character support is badly designed]

Agreed. Entirely.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--

Aleksej Saushev

unread,
Jul 16, 2010, 12:17:25 PM7/16/10
to
Erik Fair <fa...@netbsd.org> writes:

> 2. Unicode/UTF-8 as a new default offers backward compatibility while
> expanding the character space quite broadly, and without anywhere near
> as much work (or as much paradigm shift, i.e. breaking "Unix files are
> a bag of bytes") on our software,

Sorry, this is wrong. This assumes that you don't use anything ASCII
compatible (more or less). I do, and "UTF-8 by default" will cause major
pain to me and to many users here.

The main reason for it is that UTF-8 wastes half of bandwidth on wire,
and some of NetBSD's tools don't tolerate long file names, e.g. pax.
I meet border cases already, and UTF-8 by default will double the
on-wire length of the file names in question.


--
HE CE3OH...

Matthew Mondor

unread,
Jul 16, 2010, 12:37:34 PM7/16/10
to
On Fri, 16 Jul 2010 16:50:12 +0100
Sad Clouds <cryintot...@googlemail.com> wrote:

> 2. The interfaces for C library multi-byte to wide, and wide to
> multi-byte conversion functions are so badly designed, it's not even
> funny. The biggest problem with those functions is the fact they expect
> NULL terminated strings. If you have a partial (not NULL terminated)
> string in the buffer, you cant call string conversion function on it,
> because it won't stop until it finds a NULL and you end up with buffer
> overrun. You cannot "artificially" NULL terminate the string, because
> after reading NULL char, the function will reset mbstate_t object to the
> initial state. This will mess up the next sequence of multi-byte
> characters if the encoding had state.
>
> I spent two days, jumping through the hoops and trying to figure out
> how to convert partial strings. I think I nailed it in the end with 30%
> performance penalty, but still 3.5 times faster than iconv().
>
> If anyone is interested, I can post the code for the wrapper
> functions...

In case it can serve, I also wrote an implementation of UTF-8 <->
UTF-32 and put it under BSD-like license:

http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/~checkout~/mmondor/mmsoftware/mmlib/utf8.c?rev=1.2;content-type=text%2Fplain
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/~checkout~/mmondor/mmsoftware/mmlib/utf8.h?rev=1.1;content-type=text%2Fplain

I however have no benchmark comparing it against an other implementation.
--
Matt

Ty Sarna

unread,
Jul 16, 2010, 12:34:31 PM7/16/10
to
On Jul 16, 2010, at 11:50 AM, Sad Clouds wrote:

> Sometimes I do it from left to right, but occasionally I may need to do
> it from right to left. For example if you have a filename:
>
> some_long_file_name.txt
>
> To quickly extract the suffix '.txt' you just scan the string from
> right to left, until you hit '.' char. I think with utf-8 this type of
> string manipulation would be quite messy and you would have to use a
> special library that understands utf-8 encodings, etc.

Nope, because:

- ASCII characters are expressed in utf-8 identically ('.' is '.')
- No non-ASCII utf-8 character includes in its multibyte representation any byte which is also an ASCII character (all bytes of multibyte utf-8 characters have the high bit set). Thus, you can't accidentally mistake part of some other character as '.'

Thus, any kind or processing dealing with searching for ascii characters ('.', '/', newline, spaces, etc) can safely be ignorant of utf-8.

Of course some things need to care, like counting characters (vs bytes), truncation (to make sure it's not in the middle of a multibyte character), etc, etc, but there are many cases where it just doesn't matter. Good old strrchr() would do just fine for your example here.

Even if you do need to scan utf-8 backwards, it's not SO hard, because it's easy to tell when you got to the beginning of the character (high two bits of the first byte are 11, vs 10 for additional bytes)

Ken Hornstein

unread,
Jul 16, 2010, 12:52:15 PM7/16/10
to
See, I think I am understanding things, but then ....

>> b) doesn't use NUL ('\0'),
>
>Wrong. It uses a 0x00 octet (which is what I assume you're talking
>about) to represent U+0000. It does not use a 0x00 octet under any
>other conditions, though.

Okay ... I'm not up on nomenclature. U+0000 means .... a particular
Unicode codepoint? I guess that's a Unicode NULL, according to what
I've seen online.

I guess the real question is ... I'm used to C-style strings, where I
don't have to care about the length, but 0x00 is the terminator. Can
I still do that with Unicode? I mean, I see that U+0000 is a valid
Unicode code point, but it's not actually anything PRINTABLE, right?
Sure, I should be passing around lengths to everything, but I'm just
thinking of the amount of code that would need to be changed.

>> But this brings up some possibly dumb questions: say I have a UTF8
>> byte sequence I want to display on standard out; do I simply use
>> printf("%s") like I have always been? Do I have to do something
>> different? If so, what?
>
>"That depends". It depends on whether printf tries to be smart (most
>printfs I'm familiar with treat strings as opaque octet sequences for
>things like %s, but I'd be surprised if there weren't some that went to
>the trouble to process characters rather than octets). It depends on
>how the octet sequence produced by your program is interpreted
>(terminal or terminal emulator handling UTF-8 or 8859-1 or what). It
>depends on what exactly you mean by "display on standard out", too.

I'm just thinking of the basic example of, "I want my command-line program
to print out something to the defined Unix standard output", which is what
most of them do. From what people are saying ... there's not really a way
of telling, today, if your terminal supports UTF-8, or 8859-1, or anything
else (unless it's embedded in locale information, somehow).

Also, Aleksej says:

>Sorry, this is wrong. This assumes that you don't use anything ASCII
>compatible (more or less). I do, and "UTF-8 by default" will cause major
>pain to me and to many users here.
>
>The main reason for it is that UTF-8 wastes half of bandwidth on wire,
>and some of NetBSD's tools don't tolerate long file names, e.g. pax.
>I meet border cases already, and UTF-8 by default will double the
>on-wire length of the file names in question.

This brings up a couple of questions:

- Isn't UTF-8 already ASCII compatible?
- How does UTF-8 waste half of the bandwidth?
- What would you prefer we do instead?

--Ken

der Mouse

unread,
Jul 16, 2010, 2:55:27 PM7/16/10
to
>>> b) [UTF-8] doesn't use NUL ('\0'),

>> Wrong. It uses a 0x00 octet (which is what I assume you're talking
>> about) to represent U+0000. It does not use a 0x00 octet under any
>> other conditions, though.
> Okay ... I'm not up on nomenclature. U+0000 means .... a particular
> Unicode codepoint?

Right. U+ followed by four (or more) hex digits refers to Unicode
codepoints (or, if you're in a context where blurring the distinction
between codepoints and their associated characters is appropriate,
sometimes for the character that goes with that codepoint).

> I guess that's a Unicode NULL, according to what I've seen online.

Right.

> I guess the real question is ... I'm used to C-style strings, where I
> don't have to care about the length, but 0x00 is the terminator. Can
> I still do that with Unicode?

You can with UTF-8; not with some other octetizations (to coin a word)
of Unicode character strings. In fact, you can to approximately the
same degree you can with ASCII: occasionally people want to process
ASCII text that can include NULs and have to avoid C library routines
as a result, and much the same is true here.

> I mean, I see that U+0000 is a valid Unicode code point, but it's not
> actually anything PRINTABLE, right?

Right.

> Sure, I should be passing around lengths to everything, but I'm just
> thinking of the amount of code that would need to be changed.

Well..."should"? Only in the sense that you "should" be doing the same
for ASCII text.

>>> [to print a UTF-8-encoded Unicode string] do I simply use
>>> printf("%s") like I have always been?

>> "That depends".

> I'm just thinking of the basic example of, "I want my command-line
> program to print out something to the defined Unix standard output",
> which is what most of them do.

The really really short answer is "yes, do that". It's probably the
closest available approximation to what you want, and will work in most
cases where anything will.

> From what people are saying ... there's not really a way of telling,
> today, if your terminal supports UTF-8, or 8859-1, or anything else
> (unless it's embedded in locale information, somehow).

Right. :( Some locale systems actually include charset info as well.
However, even there, making sure that the setting and the reality match
is usually pushed off to the human layer; I could, for example, set
environment variables to claim UTF-8 support in a terminal emulator
doing 8859-7, and software would be almost entirely unable to tell that
it's being lied to - but what I-the-human is seeing would not be what
the software expects based on what it's been told.

>> The main reason for it is that UTF-8 wastes half of bandwidth on
>> wire,

This is true only if you normally use a non-ASCII set of characters
that have an 8-bit character set, and you're comparing to such a set.
(Examples might be 8859-7 and KOI-8. Not 8859-1, because most 8859-1
users draw on the ASCII low half heavily.)

> This brings up a couple of questions:

> - Isn't UTF-8 already ASCII compatible?

For suitable values of "compatible". It is ASCII compatible in that
taking a string encoded in ASCII using the usual "zero-pad each
character to one octet" convention, converting it (conceptually) to
ASCII characters, mapping them to their Unicode equivalents, and then
encoding the resulting Unicode codepoint string in UTF-8 results in the
same octet sequence you started with. (Most of those conversion steps
do not actually involve any data massaging, but rather just conceptual
reframes.)

> - How does UTF-8 waste half of the bandwidth?

See above. If you use, say, KOI-8 (Cyrillic, which strikes me as
likely what Aleksej is using), or 8859-7 (Greek), or 8859-8 (Hebrew)
and are using mostly the non-ASCII half, then UTF-8 encoding results in
two octets per character on the wire for most characters, as opposed to
using KOI-8 (or whatever), which uses one octet per character.

Sometimes this is important; sometimes it's not. Aleksej has a good
point in that FFS (which is probably what most NetBSD systems use) has
a limit of 255 on directory entry name length - but that's 255 octets,
not 255 characters. If you have a tendency to use file names in the
100-200 character range, this may well matter to you. There doubtless
are plenty of other relatively small limits which look smaller when
viewed through UTF-8 glasses....

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--

Sad Clouds

unread,
Jul 16, 2010, 4:19:13 PM7/16/10
to
On Fri, 16 Jul 2010 12:34:31 -0400
Ty Sarna <t...@sarna.org> wrote:

> On Jul 16, 2010, at 11:50 AM, Sad Clouds wrote:
>

> > Sometimes I do it from left to right, but occasionally I may need
> > to do it from right to left. For example if you have a filename:
> >
> > some_long_file_name.txt
> >
> > To quickly extract the suffix '.txt' you just scan the string from
> > right to left, until you hit '.' char. I think with utf-8 this type
> > of string manipulation would be quite messy and you would have to
> > use a special library that understands utf-8 encodings, etc.
>

> Nope, because:
>
> - ASCII characters are expressed in utf-8 identically ('.' is '.')
> - No non-ASCII utf-8 character includes in its multibyte
> representation any byte which is also an ASCII character (all bytes
> of multibyte utf-8 characters have the high bit set). Thus, you can't
> accidentally mistake part of some other character as '.'

Yeah you're right.

Aleksej Saushev

unread,
Jul 16, 2010, 6:39:12 PM7/16/10
to
der Mouse <mo...@Rodents-Montreal.ORG> writes:

>>> The main reason for it is that UTF-8 wastes half of bandwidth on
>>> wire,
>

> This is true only if you normally use a non-ASCII set of characters
> that have an 8-bit character set, and you're comparing to such a set.
> (Examples might be 8859-7 and KOI-8. Not 8859-1, because most 8859-1
> users draw on the ASCII low half heavily.)

Maybe it isn't obvious, but there's quite a large part of the world that
writes non-Latin script and as such uses mostly non-Latin characters. ;)

>> - How does UTF-8 waste half of the bandwidth?
>

> See above. If you use, say, KOI-8 (Cyrillic, which strikes me as
> likely what Aleksej is using), or 8859-7 (Greek), or 8859-8 (Hebrew)
> and are using mostly the non-ASCII half, then UTF-8 encoding results in
> two octets per character on the wire for most characters, as opposed to
> using KOI-8 (or whatever), which uses one octet per character.
>
> Sometimes this is important; sometimes it's not. Aleksej has a good
> point in that FFS (which is probably what most NetBSD systems use) has
> a limit of 255 on directory entry name length - but that's 255 octets,
> not 255 characters. If you have a tendency to use file names in the
> 100-200 character range, this may well matter to you. There doubtless
> are plenty of other relatively small limits which look smaller when
> viewed through UTF-8 glasses....

You can use whatever short file names, it's at your wish... but only
before you have to communicate with the outer world, and out there it is
quite usual to have file names longer than 100 and even 200 characters.
And it's not you who can change this.


--
HE CE3OH...

der Mouse

unread,
Jul 16, 2010, 7:43:57 PM7/16/10
to
> Maybe it isn't obvious, but there's quite large part of the world
> that writes non-Latin script and as such uses mostly non-Latin
> characters. ;)

Certainly. But much of that large part of the world uses
non-alphabetic writing systems and your "wastes half the bandwidth"
does not really apply.

> You can use whatever short file names, it's at your wish... but only
> before you have to communicate with the outer world, and out there it
> is quite usual to have file names longer than 100 and even 200
> characters.

Sure. But nothing compels you to pick the same names for your files
as "the outer world" uses for theirs, not even if the file content is
the same. It's the same issue the 8.3 DOS world faced a long time ago,
just not quite as severe.

Not that that makes it a non-problem....

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--

Dave Huang

unread,
Jul 16, 2010, 8:51:20 PM7/16/10
to
On 7/16/2010 6:43 PM, der Mouse wrote:
> Certainly. But much of that large part of the world uses
> non-alphabetic writing systems and your "wastes half the bandwidth"
> does not really apply.
Although for the part of the world that uses non-alphabetic writing
systems, it still "wastes" a third of your bandwidth
(Chinese/Japanese/Korean characters take 3 octets in UTF-8, whereas they
take 2 in the various pre-Unicode encodings).

And then there are the unfortunate group of people where their
single-octet characters turn into three-octet characters (e.g., Thai and
the various Indian scripts, such as Devanagari).

mar...@gmail.com

unread,
Jul 16, 2010, 7:30:02 AM7/16/10
to
> If your goal is to be in deterministic file content nirvana, yes, that's
> the way to get there, but I'd argue it's an awful lot of work to deal with
> the M x N software problem I mentioned (and we'll have to add a type field
> to inodes which will trigger a very old debate about whether UNIX files
> should be just bags of bytes; the required changes for the full M x N is
> pretty pervasive and invasive)
How about (user settable) extended attributes? It's a nice way to handle
meta-data and is useful if you want ACLs. Besides, extended attributes are
natively supported by UFS2.

Regards

Giles Lean

unread,
Jul 18, 2010, 8:34:07 PM7/18/10
to

mar...@gmail.com wrote:

> How about (user settable) extended attributes? It's a nice way to handle
> meta-data and is useful if you want ACLs. Besides, extended attributes are
> natively supported by UFS2.

Introduces fascinating (er, frustrating?) portability issues, and the
old debate about whether meta data (and resource forks and the like)
should merely be replaced by the idea of using a directory and putting
all the extra stuff in it as extra files.

Note that it took Apple quite a long time to get all (assuming it's now
"all") of the file handling/copying/archiving utilities to support
HFS+'s extended attributes and ACLs, and they can't make up their
minds whether they really want to deprecate resource forks or not.

Then after adding support you have to get it right: I have a HFS+
file system which Apple's rsync will not ever say is synchronised,
despite all files having the same size, length, checksum(!), ACLs,
extended attributes, and ownership. Maybe that bug's fixed in
Snow Leopard (the current OS X release at the time of writing) and
maybe it's not too.

It's a hard problem. For binary format files I personally think
magic numbers remain a pretty good (workable, and robust if not
perfect) solution.

For "text" files (where this discussion began) where you want to know
the character set extended attributes _do_ start to make a claim: while
it might be arguable (as I did) that directories and multiple files are
theoretically equivalently powerful, we just are not accustomed to
thinking of a "directory" as a file, and similar software changes are
needed in either case.

Giles

mar...@gmail.com

unread,
Jul 19, 2010, 5:34:26 AM7/19/10
to
> Introduces fascinating (er, frustrating?) portability issues, and the
> old debate about whether meta data (and resource forks and the like)
> should merely be replaced but the idea of using a directory and putting
> all the extra stuff in it as extra files.
Using $FOO.app directories has always struck me as odd but I guess it has
some advantages, like shipping all needed libraries and not depending on
a particular version being installed systemwide.

> Note that it took Apple quite a long time to get all (assuming it's now
> "all") of the file handling/copying/archiving utilities to support
> HFS+'s extended attributes and ACLs, and they can't make up their
> minds whether they really want to deprecate resource forks or not.

Having used MacOS only occasionally, I remain blissfully ignorant about its
quirks and resource forks are still a mystery to me.

> It's a hard problem. For binary format files I personally think
> magic numbers remain a pretty good (workable, and robust if not
> perfect) solution.
>
> For "text" files (where this discussion began) where you want to know
> the character set extended attributes _do_ start to make a claim: while
> it might be arguable (as I did) that directories and multiple files are
> theoretically equivalently powerful, we just are not accustomed to
> thinking of a "directory" as a file, and similar softare changes are
> needed in either case.

Sure, it would be great if there was something like magic numbers for text
files. I guess BOM was meant to serve as such but there're too many
UTF-16/-32 files without it. Piping poorly generated UTF-16 text to iconv on
x86 without explicitly specifying byte order is a great way to learn Chinese.
Ever since I first saw BeOS I'm in love with extended attributes (I'm not
quite sure if BeOS used xattr's or something else), especially because you
can Just Tag Your Files™ and thus avoid having to use ugly looking filenames,
lots of symlinks, strangely organized directories or any combination of the
above.

Kind regards

Giles Lean

unread,
Jul 19, 2010, 9:00:40 PM7/19/10
to

[ I changed the subject title; drifting, but in an interesting
(to me, anyway) direction. --giles ]

mar...@gmail.com wrote:

> Using $FOO.app directories has always struck me as odd but I guess it has
> some advantages, like shipping all needed libraries and not depending on
> a particular version being installed systemwide.

It's a practical solution to shared library versioning hell,
of which Unix still has some (e.g. an application that depends
on a bug in a particular version of a library).

Of course, any library thus shipped with the application then
loses the benefit of being "shared", so one wonders why _that_
library shouldn't be linked statically, removing one
justification for the whole idea.

On the plus side, keeping all the image files, help files,
libexec type binaries etc with the primary executable helps
with installation and uninstallation. (Except when you have
several applications which all use the same helper, and that
helper needs fixing ... been there, done that too, but I'd
put up with that case.)

On the gripping hand, every darn application seems to want to
maintain a cache and some state, and while there are
"standard" places to store those too, most ("most" by my
personal observation, not any scientific measurement) OS X
applications fail to include an uninstall script/program so
these are easily "left behind" when the application's .app
directory is deleted.

I still do OS upgrades via "Backup, check the backup, clean
install, reinstall applications" process. :-) :-(

> Having used MacOS only occasionally, I remain blissfully
> ignorant about its quirks and resource forks are still a
> mystery to me.

They're "alternate" data streams. Originally for program icons
and such I believe. Officially they're deprecated.
Unofficially OS X applications and programmers can't quite
give them up, seemingly.

Windows (starting with NT?) has something similar, but that
I'm not familiar with.

For the record (and IMHO blah blah) resource forks are an
outright bad idea: directories are _designed_ for keeping
related data together.

Extended attributes (or tags of some sort) make a lot more
sense (or would, IMHO, yada yada) if they provided some
sort of MIME type file metadata about the file, and were
flexible enough to handle additional concepts like *BSD
flags (refer chflags(2), chflags(1)) etc.

ACLs (IMHO -- this whole post is IMHO :-) should not be
an add-on to traditional Unix groups: pick one or the
other, and stick to it.

In practice I have never had a situation where Unix group
membership (if necessary, with a setgid and/or setuid
accessor program) did not suffice. About twice ACLs would
truly have been more convenient, but the lack (at least at
the time, but probably still now) of consistent semantics
(nevermind user unfamiliarity) was a showstopper.

> Sure, it would be great if there was something like magic numbers for text
> files. I guess BOM was meant to serve as such but there're too many
> UTF-16/-32 files without it.

Plus, every program has to know to expect a BOM, and as a form
of in-band signalling it has all the problems such schemes
always have.

A magic number in a binary file is different: only programs
already familiar with the file format care about the magic
number, and so can be taught to check for it at the same time
as they're taught the entire file format.

> Piping poorly generated UTF-16 text to iconv on x86 without
> explicitly specifying byte order is a great way to learn
> Chinese.

I'll take your word for it. I once had to teach a week long
class (in English) where the only PC that could drive the
projector was configured for (I think) simplified Chinese. I
needed help from my students to manage things like opening
files. :-(

> Ever since I first saw BeOS I'm in love with extended
> attributes (I'm not quite sure if BeOS used xattr's or
> something else), especially because you can Just Tag Your
> Files™ and thus avoid having to use ugly looking filenames,
> lots of symlinks, strangely organized directories or any
> combination of the above.

But it's close to all or nothing: open(), read(), write(),
close() isn't enough to replicate the file anymore. (I
suppose, to be fair, it isn't on Unix either: you need stat(),
chown(), chmod(), etc ... but still, _every_ application that
copies files has to know about any new functionality.)

Trouble is here with NetBSD, we don't have a clean slate. We
have years of history and lots of valuable code: any
enhancements along these lines are going to be tradeoffs
between maintaining compatibility with the past (and thus
existing code) and providing new capabilities.

Change isn't impossible: we have 64bit off_t and the
commercial vendors even managed to work together for that
migration. (Their solution wasn't ideal, perhaps, but it
_was_ usable across all the vendors' platforms who used it,
and didn't require the "flag day" that 4.4BSD had the luxury
of opting for.)

Giles

der Mouse

Jul 19, 2010, 9:59:24 PM
>> Having used MacOS only occasionally, I remain blissfully ignorant
>> about its quirks and resource forks are still a mystery to me.
> They're "alternate" data streams.

Well...the things in the resource fork are alternative data streams.
The resource fork as a whole is more like an alternative data stream
directory.

AIUI, that is; I have never had MacOS (of any vintage) inflicted on me,
but I've picked up bits here and there.

> For the record (and IMHO blah blah) resource forks are an outright
> bad idea: directories are _designed_ for keeping related data
> together.

> Extended attributes (or tags of some sort) make a lot more sense (or
> would, IMHO, yada yada) if they provided some sort of MIME type file
> metadata about the file, and were flexible enough to handle
> additional concepts like *BSD flags (refer chflags(2), chflags(1))
> etc.

So, tell me: besides the name, what is the difference between an
extended attribute and an entry in the resource fork? In particular,
what makes the former good and the latter bad?

Because I'm having trouble seeing the difference.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B


Matthew Mondor

Jul 20, 2010, 1:12:45 AM
On 20 Jul 2010 11:00:40 +1000
Giles Lean <giles...@pobox.com> wrote:

> They're "alternate" data streams. Originally for program icons
> and such I believe. Officially they're deprecated.
> Unofficially OS X applications and programmers can't quite
> give them up, seemingly.
>
> Windows (starting with NT?) has something similar, but that
> I'm not familiar with.
>
> For the record (and IMHO blah blah) resource forks are an
> outright bad idea: directories are _designed_ for keeping
> related data together.

Absolutely, especially for the security conscious when the default OS
distribution tools won't easily show the total size, all streams and
their respective size, with access to their contents (as is (or at
least was) the case on Windows). I remember having to look at the
win32 API to write a tool for doing that, when I had the misfortune
of administering Windows systems...
--
Matt

Matthew Mondor

Jul 20, 2010, 1:20:26 AM
On Mon, 19 Jul 2010 21:59:24 -0400 (EDT)
der Mouse <mo...@Rodents-Montreal.ORG> wrote:

> So, tell me: besides the name, what is the difference between an
> extended attribute and an entry in the resource fork? In particular,
> what makes the former good and the latter bad?
>
> Because I'm having trouble seeing the difference.

I could be wrong, but my impression is that extended attributes should
limit functionality to what is strictly necessary to tag custom
metadata to files, and possibly even enforcing sanity on the data (i.e.
a typed property system), as opposed to forks which may allow
arbitrary, structured or unstructured data (that is, hidden files
within a file)...
--
Matt

mar...@gmail.com

Jul 20, 2010, 9:51:41 AM
> I still do OS upgrades via "Backup, check the backup, clean
> install, reinstall applications" process. :-) :-(
Good and tried method, regularly used here too.

>> Having used MacOS only occasionally, I remain blissfully
>> ignorant about its quirks and resource forks are still a
>> mystery to me.

> They're "alternate" data streams. Originally for program icons
> and such I believe. Officially they're deprecated.
> Unofficially OS X applications and programmers can't quite
> give them up, seemingly.
>
> Windows (starting with NT?) has something similar, but that
> I'm not familiar with.

I don't understand what "alternate" means in this case. Is this "data stream"
kept in the executable itself (that's what Windows does with icons, IMO since
3.x) or in a separate file but still being treated as a part of the
executable (like DOS overlay files)?

> ACLs (IMHO -- this whole post is IMHO :-) should not be
> an add-on to traditional Unix groups: pick one or the
> other, and stick to it.

IMO you're right :-) but some people want to have ACLs, MAC and whatever else
it takes to become B1 compliant "Trusted $OS" and hopefully make big buck$.

> In practice I have never had a situation where Unix group
> membership (if necessary, with a setgid and/or setuid
> accessor program) did not suffice.

I don't particularly like set[ug]id wrappers and creating new groups every so
often for usually trivial/mundane tasks. Still, Unix groups are easier to
manage.

> A magic number in a binary file is different: only programs
> already familiar with the file format care about the magic
> number, and so can be taught to check for it at the same time
> as they're taught the entire file format.

This again brings up the question of whether files are just bags of bytes. I'd
say text files are just another file format. There're just too many programs
which deal with it. Here Plan 9 got it right - assume all text is UTF-8 (I
leave aside considerations about file size or bandwidth wasted) and add some
functions to libc if needed. "Assume all text is ASCII" is just as valid
though.

> But it's close to all or nothing: open(), read(), write(),
> close() isn't enough to replicate the file anymore. (I
> suppose, to be fair, it isn't on Unix either: you need stat(),
> chown(), chmod(), etc ... but still, _every_ application that
> copies files has to know about any new functionality.)

I believe XFS has xattrs by design so I wonder if all Irix programs were
written to be xattr-aware or if it was pushed down to the filesystem level.
Maybe that's the right approach?

Best regards

Giles Lean

Jul 20, 2010, 8:54:35 PM

mar...@gmail.com wrote:

> I don't understand what "alternate" means in this case. Is this "data stream"
> kept in the executable itself (that's what Windows does with icons, IMO since
> 3.x) or in a separate file but still being treated as a part of the
> executable (like DOS overlay files)?

(With the proviso that like much else in this thread, this is
nearly all "IMHO" -- disagreement is both expected and welcome
if it adds clarity or better ideas than I have.)

When you call open() you get the main data stream; there are other
calls to find out what resource forks exist and to open one of them
instead. Sounds more like the Windows example you gave.

> IMO you're right :-) but some people want to have ACLs, MAC and
> whatever else it takes to become B1 compliant "Trusted $OS" and
> hopefully make big buck$.

True. But that market is so small (in number) that doing anything
to inconvenience the non-trusted users is inappropriate, I think.

Were I (perish the thought) trying to turn a Unix system into
something that would pass B1 certification (if that still
means anything) I would be deciding things like "if I have
ACLs, I'm not going to have Unix groups".

The implementations I have seen which have tried to maintain
"optional" ACLs on top of Unix groups always seemed awkward:
naive programs and users couldn't figure out what was going
on as soon as an ACL appeared.

> This again brings up the question if files are just bags of bytes or
> not. I'd say text files are just another file format. There're just
> too many programs which deal with it. Here Plan 9 got it right -
> assume all text is UTF-8 (I leave aside considerations about file size
> or bandwidth wasted) and add some functions to libc if needed. "Assume
> all text is ASCII" is just as valid though.

Historically, the "bag of bytes" model split into:

o text, which lots of things understood, and
o binary, which very likely had a specific format and which only
the appropriate program could actually work with

_Plus_, by and large, there was no rule that you couldn't
apply a text handling utility to a file that happened to have
binary contents etc. A pretty revolutionary idea when it was
new.

That goal was not and is not always achieved as well as it
might be: early versions of sed truncated(!) data lines at 512
bytes; a few utilities do object to "binary files" and modify
their behaviour when they see one:

$ diff /mach_kernel /dev/null
Binary files /mach_kernel and /dev/null differ

Here true, and I probably don't want to see the output of that
diff. But it's annoying if there are only a few binary
characters and they're not on the lines required for the diff.
(Oddly, I lack a running NetBSD system right now; that example
is from OS X diff, which is GNU diff. NetBSD's may be
different.)

Our present day problem is all the world's _not_ ASCII, so our
idea of what a "text file" is is under pressure: I (English
speaker in a country where English is the major language
spoken) have a mixture of ASCII and UTF-8 files.

A Japanese user would be likely to have some Shift-JIS files
as well, I suspect.

Who uses UTF-16 these days I'm not sure (Windows for at least
some things?), but there are definitely files (and file
systems) which use them and byte order raises its ugly head,
so a mixture of ASCII, UTF-8, and UTF-16 (maybe little endian,
maybe big-endian, maybe both) is also reasonable to find on
an individual system.

Thus, it looks more reasonable than it did in 1970 or so to
place type (meta data) on text files; personally I'd make this
advisory: only applications that care (e.g. text editors) need
enquire, and applications that don't (e.g. cp(1)) would not
enquire except to propagate the metadata along with the file.

Plan 9 had two luxuries the world in general doesn't have:

a) they were a new OS, largely unconstrained by backward
compatibility

b) they punted on some serious issues, such as ignoring
combining characters in UTF-8, thus "UTF-8" in Plan 9
was effectively enforced (so far as I understand from
the paper quoted earlier in the thread) to be UTF-8 in
NFC (Normalization Form C, Canonical Composition).

A translation utility was provided, but what input
character sets it supported I don't know.

OS X increasingly places extended attributes on files;
browsers tag downloaded files and such. I'm not aware
of charset tags (or conventions for them) -- but I think they
would be useful.

As people probably get tired of me saying, UTF-8 can be
unnormalised (a mixture of composed and decomposed characters)
or normalised to one of several forms. Simply saying "UTF-8"
doesn't mean I can diff two files to see if the list of names
in each is the same: I have to normalise each line (name)
first unless I _know_ the file contents are normalised.

Sorting those names gets into issues of locales, and _that_
is a whole 'nuther problem (and one both Plan 9 and Google's
Go language have so far punted on, too).

I'm not a huge fan of POSIX locales -- they have several
issues -- but at the moment they're the only game in town
I've even heard about. (More information welcome.)

> I believe XFS has xattrs by design so I wonder if all Irix
> programs were written to be xattr-aware or if it was pushed
> down to the filesystem level. Maybe that's the right
> approach?

If you can figure out how to push it down to the file system
level, I think you'll be doing well; I don't think you can,
realistically.

Here's a case where copying extended attributes (and even
ACLs) is pretty clearly "The Right Thing" to do:

$ cp -p original copy

On the other hand, what about when I fire up vi on a file I
don't have write permission to, and then save a copy after
I've edited it? At minimum, I must maintain the right to edit
it (so ACLs become troublesome) and how does the file system
possibly work out what I want for extended attributes when I
use _multiple_ source files to create an entirely new file?

# vi original1
...
:r original2
...
:w copy

If both files were UTF-8 normalised to NFD, then it might be
reasonable for vi to note that in metadata, but I don't think
the file system can do it for me automatically unless it
examines every byte in the file as they're written. (And
then, how do I create an intentionally malformed file for
testing?)

Cheers,

Giles

Giles Lean

Jul 20, 2010, 9:48:39 PM

der Mouse <mo...@Rodents-Montreal.ORG> wrote:

> So, tell me: besides the name, what is the difference between an
> extended attribute and an entry in the resource fork? In particular,
> what makes the former good and the latter bad?

Bloody little. (And bloody history. :-)

Resource forks have been decreed to be a Bad Thing: they
stored data, and developers are now strongly encouraged by
Apple to stop using them mainly due to difficulties
transferring files with foreign systems.

The preferred solution now is to put the "resource" data
(whatever it is -- icons, fonts, ...) into their own files.

Extended attributes (aka xattrs) are mainly intended for
metadata (although I can't find a clear statement about this
from Apple) but as the OS X manual page for setxattr(2) makes
clear extended attributes *subsume* resource forks, so perhaps
that "bloody little" difference is really no difference at
all, technically, but merely one of focus and intended use:

Extract of setxattr(2) manual page from OS X 10.6.4 (Snow Leopard):

"Extended attributes extend the basic attributes associated
with files and directories in the file system. They are
stored as name:data pairs associated with file system objects
(files, directories, symlinks, etc).

setxattr() associates name and data together as an attribute of path.

An extended attribute's name is a simple NULL-terminated
UTF-8 string. Value is a pointer to a data buffer of size
bytes containing textual or binary data to be associated with
the extended attribute. Position specifies the offset within
the extended attribute. In the current implementation, only
the resource fork extended attribute makes use of this
argument. For all others, position is reserved and should be
set to zero."

And no, I can't find advice from Apple on when to use them or
what for. :-( I notice browsers record which files have been
downloaded from the Internet and pop up a warning when I open
them the first time; that's certainly the most common usage I
see.

All that said, I do see an increasing need to add metadata to
files, however unfortunate the damage to the ideal of files as
"a bag of bytes" is:

o text files: character set encoding

o non-text files: preferred application (given a point-'n'-click
GUI that tries to do the Right Thing when a data file is double
clicked): without meta-data we're back to file extensions or
magic numbers and a system wide (or at least user wide)
default.

As always seems to happen, there are odd files which will
open in one application but not another even when they
_should_ work in both; being able to tag "open this one in
application XYZ instead of my normal ABC" is useful. But
again, only when double clicking on a data file: from within
an application, File->Open overrides, naturally.

As a Unix user from long enough ago that my beard (if I had
one) would be gray (not that I have the moral authority of an
old "graybeard" ;-), I'd prefer to see imaginative use made of
existing APIs rather than new ones introduced.

If you have to change applications for them to understand meta
data anyway, is it any harder to teach an editor when it's
offered a directory to see if that directory contains one text
file and another file giving the text file's encoding (assuming
reasonable conventions about naming) than it is to make it use
a totally new API to recover the meta data?

I think xattrs on OS X are here to stay (whether I like 'em or
not). I can /hope/ that some consistency will develop across
platforms about what meta data means (particularly with respect
to character sets) as that would help translation from one
file system to another.

As to what NetBSD should do, having put in my 2c, I offer no
advice. (Frankly, it's not my place: my contributions to
NetBSD have been very minor, in part because I worked for many
years with a Unix vendor on their source code while the silly
SCO litigation raged; even the appearance of conflict of
interest was well worth avoiding in both my employer's and
NetBSD's interests. Oh, and mine!)

Giles

mar...@gmail.com

Jul 21, 2010, 4:19:41 AM
> Thus, it looks more reasonable than it did in 1970 or so to
> place type (meta data) on text files; personally I'd make this
> advisory: only applications that care (e.g. text editors) need
> enquire, and applications that don't (e.g. cp(1)) would not
> enquire except to propagate the metadata along with the file.
Well, what *I* use xattr's for is:

o providing MIME type and, in the case of plain text files (whatever "text
file" means today), charset information;
o user tags to make indexing and searching easier.

Maybe some applications should place MIME type information in newly created
files. I imagine it's easy to make GIMP add "mime-type:image/png" to PNG files
or Open Office put "mime-type:application/vnd.oasis.opendocument.text" (I
think charset info is kept in the file itself) on ODT files.

> If you can figure out how to push it down to the file system
> level, I think you'll be doing well; I don't think you can,
> realistically.

That doesn't mean I can't dream about it happening automagically the way I
want it, right?

> On the other hand, what about when I fire up vi on a file I
> don't have write permission to, and then save a copy after
> I've edited it? At minimum, I must maintain the right to edit
> it (so ACLs become troublesome) and how does the file system
> possibly work out what I want for extended attributes when I
> use _multiple_ source files to create an entirely new file?

I don't really need (or use, for that matter) ACLs. I believe the concept
itself is fine, but implementing it correctly is clearly non-obvious. VMS
uses some weird combination of file ownership and ACLs, so it may be well
worth borrowing the good ideas and improving the rest where it's needed. I
intend on investigating how VMS and various Unices handle ACLs. You have here
a good argument in favour of making apps xattr-savvy.

Kind regards

Giles Lean

Jul 21, 2010, 10:23:57 PM

mar...@gmail.com wrote:

> Well, what *I* use xattr's for is:
>
> o providing MIME type and, in the case of plain text files (whatever "text
> file" means today), charset information;
> o user tags to make indexing and searching easier.

Makes sense to me.

> > If you can figure out how to push it down to the file system
> > level, I think you'll be doing well; I don't think you can,
> > realistically.
> That doesn't mean I can't dream about it happening automagically the way I
> want it, right?

On the contrary, if you can figure it out, I'm all for it! I
just don't think it's doable, but I've been wrong before, and
I'd like to be this time.

> You have here a good argument in favour of making apps xattr-savvy.

Well, character sets make for a reasonable argument for
associating meta-data with files. But files + directories have
the same theoretical power as files + extended attributes, so
there's some unnecessary complexity introduced.

If I were to bet, I'd bet that extended attributes are here to
stay. But without standardisation (de-facto or de-jure)
across file systems and OSes, their use is going to remain
limited, often application specific, and sometimes OS
specific. (Work creation schemes for programmers?)

The freedesktop.org folk are creating recommendations for the
use of extended attributes:

http://www.freedesktop.org/wiki/CommonExtendedAttributes

Maybe a miracle will happen and Microsoft and Apple will join
them. Or maybe history will repeat itself and Apple and MS
will produce conflicting, multiply versioned recommendations
that are semantically different enough that they can't be
mapped 100% back and forth.

I think I've exhausted everything I've got to say on this
topic now, other than mentioning again that anyone thinking
about MIME types and UTF-8 might want to consider including
normalisation information if their file is in a normalised
form. :-)

Cheerio,

Giles

Greg A. Woods

Jul 9, 2011, 10:23:48 PM
At 16 Jul 2010 18:33:42 +1000, Giles Lean <giles...@pobox.com> wrote:
Subject: Re: wide characters and i18n
>
> 2. the idea that the use of Unicode is sufficient excuse to
> provide any of the functionality of locales

This is why the problem is usually broken down into two different
sections, with only occasional overlap in ideal scenarios! :-)

Internationalisation (I18N) and Localisation (L10N)

The first time I learned these two were better separate than together I
learned a whole lot of new things and many light bulbs came on bright
and bells rang clear for me.

There's also multilingualisation (M17N), which in a sense is ideally a
better term than localisation, since it implies implicitly performing
localisation for every target locale all at once, but it seems that term
only gets used in some domains, so perhaps it's best to stick to L10N.

Indeed Plan 9 did not address localisation at all (and sadly the paper
doesn't use that more formal term either) -- it was, after all,
initially built in America for Americans, by Americans. ;-) Indeed the
paper actually states in many places that Plan 9 (at the time) did not
even begin to address the issue of localisation.

One might say they even punted on I18N, but as others have pointed out
the paper already mentions these caveats.

As the paper concludes, it "at least [has] the capacity to be
international."

BTW, I think Plan 9's insistence that everything "textual" inside the
system always be in Unicode in UTF-8 all the time is one of its key
features. That means _everything_ coming into the system has to be
converted before it can be used usefully by any application, or indeed
to have any meaning whatsoever. This solves some of the niggles you
worried about.

The combination of Plan 9's universal use of Unicode in UTF-8, and its
policy of requiring everything to be converted to Unicode in UTF-8
either on input, import, or at least before it can be used, makes for
the firm foundations of a system upon which one can _begin_ the next
task of localisation.

This is where IEEE POSIX / UNIX(tm) _should_ go, IMNSHO. Get rid of all
the old non-UTF-8 crap for different character sets. Ideally get rid of
ANSI/ISO "wide char" crap too -- for the reasons given in the Plan 9
paper (though maybe choose 32-bits for Runes?). Then, and only then,
begin thinking about how to do locales better.

(Yes, I know where to find Plan 9 and how to run it! :-))

(BTW, it would be good to have a recording or transcript of Pike's talk
when he presented the "Hello World" paper at Usenix '93. It really helped
set the context and I think give more advice than the paper alone,
though the paper really stands up well, and indeed tries to teach us
many lessons which we still have not even come close to learning yet.)


> Which still leaves open the problem of locales and issues of
> multi-lingual documents and applications where a single
> Unicode glyph really should be represented differently
> depending upon what language it is being used for, but I did
> say at the start of this too-lengthy message that the issues
> get ugly.

:-)

--
Greg A. Woods
Planix, Inc.

<wo...@planix.com> +1 250 762-7675 http://www.planix.com/

Mouse

Jul 10, 2011, 2:48:43 AM
> BTW, I think Plan 9's insistence that everything "textual" inside the
> system always be in Unicode in UTF-8 all the time is one of its key
> features.

Maybe, but I think it's a horrible, horrible mistake. The only excuse
for UTF-8 is to shoehorn Unicode into a system with the "char = 8 bits"
assumption already wired deeply into it. A system built from the start
to use Unicode really should be using 16-bit chars - or 24-bit (or
maybe 32-bit, depending) if you want to support more than the BMP.

>> Which still leaves open the problem of locales and issues of
>> multi-lingual documents and applications where a single Unicode
>> glyph really should be represented differently depending upon what
>> language it is being used for,

Or cases where what, in isolation, could reasonably be said to be the
same character needs to be encoded differently depending on what
language it's for. For example, Latin, Cyrillic, and Greek each have a
character formed by two parallel vertical strokes joined by a
horizontal stroke approximately midway up, but it means something
fairly drastically different to each of them. Yet, written on a piece
of paper in isolation, there is no difference; any of them could
equally well be taken to be any of the others.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mo...@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B


Aleksej Saushev

Jul 11, 2011, 5:39:37 PM
Hello!

"Greg A. Woods" <wo...@planix.ca> writes:

> This is where IEEE POSIX / UNIX(tm) _should_ go, IMNSHO. Get rid of all
> the old non-UTF-8 crap for different character sets.

Not before Latin becomes no less than 4 octets per letter.

> Ideally get rid of
> ANSI/ISO "wide char" crap too -- for the reasons given in the Plan 9
> paper (though maybe choose 32-bits for Runes?). Then, and only then,
> begin thinking about how to do locales better.

It ought to be exactly the other way around. Text encoding is nothing
but an internal affair of the program; it doesn't matter at all how you
toggle bits inside your black box. It is converting weird inscriptions
like "7/12" into a date that causes the headache. (Is it July 1, 2012?)


--
HE CE3OH...
