
multibyte in C locale


Szabolcs Nagy

Mar 26, 2013, 8:57:19 PM
there is an ongoing discussion on austingroup mailing list
if the "C"/"POSIX" locale should be restricted to allow
only single byte character encodings in POSIX

to me that sounds rather restrictive (in particular it
forbids an implementation to assume UTF-8 in mbstowcs
in the default locale)

so i wonder about the intentions of the ISO C committee

the C99 rationale mentions at some point that the C locale
is the minimum required for translation of C code

but in the multibyte support extension part (MSE.1 p20)
it says that both the source and execution character sets
may contain multibyte characters even in the "C" locale

which seems to imply that UTF-8 execution character set
encoding is fine

so is the "C" locale intended to possibly support
non-english languages or other locales should be
used for that?
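
A minimal sketch of how one might probe the question on a given implementation. It asks whether the "C" locale decodes the UTF-8 bytes for U+00E9 as one two-byte character. The function name `c_locale_decodes_utf8` is made up for illustration, and the answer is implementation-specific: neither ISO C nor this thread guarantees either result.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the "C" locale decodes the bytes
   C3 A9 as a single two-byte character, 0 otherwise (decoding error,
   byte-by-byte decoding, or an incomplete sequence). */
int c_locale_decodes_utf8(void)
{
    static const char e_acute[] = "\xC3\xA9"; /* UTF-8 for U+00E9 */
    mbstate_t st;
    wchar_t wc = 0;
    size_t r;

    if (setlocale(LC_ALL, "C") == NULL)
        return 0;

    /* Sanity check: a basic character is always a single byte. */
    memset(&st, 0, sizeof st);
    r = mbrtowc(&wc, "A", 1, &st);
    assert(r == 1);

    /* The interesting case: what does the "C" locale do with C3 A9? */
    memset(&st, 0, sizeof st);
    r = mbrtowc(&wc, e_acute, 2, &st);
    return r == 2;
}
```

Calling `c_locale_decodes_utf8()` from `main` and printing the result shows which choice the implementation at hand has made.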

Tim Rentsch

Mar 29, 2013, 3:01:00 PM
It seems clear that ISO C allows (or at least is meant to allow)
implementations that use multi-byte encodings in the "C" locale.
Personally I think it would be a bad choice for an implementation
to make. Among other things, the "C" locale carries implications
regarding the behavior of some <ctype.h> functions, eg, isspace,
islower, isupper. Trying to support a non-English language in
the "C" locale is probably more trouble than it's worth. Do any
existing implementations actually use multi-byte encodings in
the "C" locale? That would be worth knowing.
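
To illustrate the <ctype.h> point: ISO C says that in the "C" locale isalpha is true only for characters for which isupper or islower is true, and those are true only for the 26 uppercase and 26 lowercase letters. A small sketch that counts them (the function name is hypothetical):

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <locale.h>

/* Counts the unsigned char values for which isalpha() is true
   after selecting the "C" locale. */
int count_alpha_in_c_locale(void)
{
    int c, n = 0;
    char *ok = setlocale(LC_ALL, "C");

    assert(ok != NULL); /* the "C" locale must always be available */
    (void)ok;

    for (c = 0; c <= UCHAR_MAX; c++)
        if (isalpha(c))
            n++;
    return n;
}
```

On a conforming implementation this should return 52 whatever the encoding is, so no byte value outside the basic letters counts as alphabetic in the "C" locale.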

As far as POSIX is concerned, the question is really whether
they want to exclude such implementations from being POSIX
compliant. Three obvious possibilities suggest themselves:

A. Have POSIX stipulate that the "C" locale not use
multi-byte encodings, which would exclude any
implementations that do (assuming any exist or
that some might exist in the future);

B. Have POSIX accommodate implementations with "C"
locales that use multi-byte encodings;

C. Like B, except also require that the implementation
define a "POSIX" locale, that locale being limited
to single-byte encodings (and otherwise the same as
the "C" locale).

If there are no existing implementations that use multi-byte
encodings in the "C" locale I would vote for A (and allowing the
possibility that a future POSIX standard might change that).
Otherwise I think there are both plusses and minuses for each of
these alternatives, and the decision would hinge on which factors
deserve more weight.

Szabolcs Nagy

Apr 6, 2013, 7:07:50 AM
Tim Rentsch <t...@alumni.caltech.edu> wrote:
> Szabolcs Nagy <n...@port70.net> writes:
>> so is the "C" locale intended to possibly support
>> non-english languages or other locales should be
>> used for that?
>
> It seems clear that ISO C allows (or at least is meant to allow)
> implementations that use multi-byte encodings in the "C" locale.
> Personally I think it would be a bad choice for an implementation
> to make. Among other things, the "C" locale carries implications
> regarding the behavior of some <ctype.h> functions, eg, isspace,
> islower, isupper. Trying to support a non-English language in
> the "C" locale is probably more trouble than it's worth. Do any
> existing implementations actually use multi-byte encodings in
> the "C" locale? That would be worth knowing.

there was at least one trying for a short while: cygwin, but
they had all sorts of windows specific and legacy code issues

plan9 (and the c compilers in the reference go implementation)
are utf8 only, but those are non-conformant implementations

otherwise i think this is a modern requirement, historical
implementations do not support multi-byte encoding properly

unfortunately there are various implementation issues if a system
has to support multiple character encodings (eg if encoding is
locale-specific then one cannot use non-portable characters in
string literals in c code: the runtime setlocale may set a
different character set than the execution character set and the
representation can be different than at translation time as well,
c11 has utf-8 string literals but even that has limited usefulness
if they cannot be processed with the available mb* functions)
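
A short sketch of the c11 utf-8 literal point, assuming a C11 (or later) compiler. The bytes of a `u8` literal are fixed at translation time, independent of the execution character set or of any later setlocale call. Whether the mb* functions can then decode those bytes is a separate, locale-dependent matter that this sketch does not test. The names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* U+00E9 written as a universal character name in a UTF-8 literal. */
static const char e_acute[] = u8"\u00e9";

/* Returns the number of bytes in the literal, excluding the terminator. */
size_t u8_literal_length(void)
{
    /* A u8 literal is always encoded as UTF-8: C3 A9 for U+00E9. */
    assert((unsigned char)e_acute[0] == 0xC3);
    assert((unsigned char)e_acute[1] == 0xA9);
    return strlen(e_acute);
}
```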

so it would be useful to allow supporting a single encoding
in all locales including the c one, then on such platforms
one can use extended characters in c string literals safely

(and on a modern system if a single encoding is supported then
that is better to be a multi-byte encoding for the universal
character set ie. utf-8 encoding for iso 10646)

there are performance issues as well: on posix, with uselocale
etc apis, locale is thread specific which means multi-byte apis
should branch on (or dereference) thread local storage to dispatch
between the different decoders and that is very slow on some
architectures (so if the c locale is required to be single byte,
but the implementation wants to support non-english languages as
well then it has to support at least two encodings which means
mbrtowc etc will be slow and thus anything that processes the
input character-by-character)

the main counter argument i see is historical practices and
expectations in various cornercases of string handling in the
c locale (strcoll etc)

Tim Rentsch

Apr 11, 2013, 4:12:16 PM
Szabolcs Nagy <n...@port70.net> writes:

> Tim Rentsch <t...@alumni.caltech.edu> wrote:
>> Szabolcs Nagy <n...@port70.net> writes:
>>> so is the "C" locale intended to possibly support
>>> non-english languages or other locales should be
>>> used for that?
>>
>> It seems clear that ISO C allows (or at least is meant to allow)
>> implementations that use multi-byte encodings in the "C" locale.
>> Personally I think it would be a bad choice for an implementation
>> to make. Among other things, the "C" locale carries implications
>> regarding the behavior of some <ctype.h> functions, eg, isspace,
>> islower, isupper. Trying to support a non-English language in
>> the "C" locale is probably more trouble than it's worth. Do any
>> existing implementations actually use multi-byte encodings in
>> the "C" locale? That would be worth knowing.
>
> there was at least one trying for a short while: cygwin, but
> they had all sorts of windows specific and legacy code issues
>
> plan9 (and the c compilers in the reference go implementation)
> are utf8 only, but those are non-conformant implementations
>
> otherwise i think this is a modern requirement, historical
> implementations do not support multi-byte encoding properly

Multi-byte characters have been included in C since (at least)
the first ISO standard in 1990. Are you suggesting that C90
implementations deserve no consideration in discussing this
topic? If so I disagree.

> unfortunately there are various implementation issues if a system
> has to support multiple character encodings (eg if encoding is
> locale-specific then one cannot use non-portable characters in
> string literals in c code: the runtime setlocale may set a
> different character set than the execution character set and the
> representation can be different than at translation time as well,
> c11 has utf-8 string literals but even that has limited usefulness
> if they cannot be processed with the available mb* functions)

This argument seems backwards. One should never use non-basic
characters in an ordinary string literal, because they aren't
portable. That's why there are wide-character constants and
string literals, which have been in C since C90.

> so it would be useful to allow supporting a single encoding
> in all locales including the c one, then on such platforms
> one can use extended characters in c string literals safely

Exactly the opposite: having a single, multi-byte, encoding in
all locales encourages a bad programming practice. We don't want
to make it easier to employ bad programming practices, we want to
make it harder. This argues in favor of discouraging multi-byte
encodings in the C locale, not in favor of allowing them.

> (and on a modern system if a single encoding is supported then
> that is better to be a multi-byte encoding for the universal
> character set ie. utf-8 encoding for iso 10646)

This comment is a circular restatement of the previous paragraph.
If you are going to have multi-byte extended characters in a
single encoding (and we may assume they are multi-byte, since
otherwise the whole question is moot), then obviously the single
encoding needs to be a multi-byte encoding. Using a multi-byte
encoding for the "C" locale is still a bad idea; choosing the
(multi-byte) "C" locale encoding as the single encoding for all
the implementation's locales only compounds the problem.

> there are performance issues as well: on posix, with uselocale
> etc apis, locale is thread specific which means multi-byte apis
> should branch on (or dereference) thread local storage to dispatch
> between the different decoders and that is very slow on some
> architectures (so if the c locale is required to be single byte,
> but the implementation wants to support non-english languages as
> well then it has to support at least two encodings which means
> mbrtowc etc will be slow and thus anything that processes the
> input character-by-character)

This argument is not compelling, for a couple of reasons.

One is that many of the key functions (eg, isalpha()) will need
to be different in different locales even if all the encodings
are the same. So having a single encoding doesn't buy you
anything in those cases, and they are common cases.

Two, a little thought will show that per-thread vectoring can be
avoided in many or most cases, with just one or two conditional
branches, using only shared (ie, non-thread-specific) memory
access. The actual performance cost won't be anywhere near as
bad as a naive implementation might suggest.
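
One possible reading of this, as a rough sketch only and not any real libc's design: keep a shared flag that stays false until some thread installs its own locale, and consult thread-local state only once that flag is set. Every name below (`decode_char`, `set_thread_decoder_utf8`, and so on) is hypothetical, the UTF-8 decoder is simplified (no overlong or surrogate checks), and a real implementation would need proper synchronization when the global decoder changes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical decoder: decode one character from s[0..n), store its
   code in *out, return the bytes consumed (0 on error). */
typedef size_t (*decoder_fn)(unsigned long *out,
                             const unsigned char *s, size_t n);

static size_t decode_single_byte(unsigned long *out,
                                 const unsigned char *s, size_t n)
{
    if (n == 0) return 0;
    *out = s[0];
    return 1;
}

/* Simplified UTF-8 decoder for the 1- to 4-byte forms. */
static size_t decode_utf8(unsigned long *out,
                          const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long c;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;
    if (n < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *out = c;
    return len;
}

static decoder_fn global_decoder = decode_single_byte; /* shared */
static atomic_bool any_thread_override;                /* shared, starts false */
static _Thread_local decoder_fn thread_decoder;        /* NULL: use global */

/* Stand-in for what uselocale() might do internally in this sketch. */
void set_thread_decoder_utf8(void)
{
    thread_decoder = decode_utf8;
    atomic_store(&any_thread_override, 1);
}

size_t decode_char(unsigned long *out, const unsigned char *s, size_t n)
{
    decoder_fn f = global_decoder;

    assert(out != NULL && s != NULL);
    /* Fast path: one branch on shared memory. Thread-local storage is
       only read once some thread has installed a per-thread decoder. */
    if (atomic_load_explicit(&any_thread_override, memory_order_relaxed)
        && thread_decoder != NULL)
        f = thread_decoder;
    return f(out, s, n);
}
```

Programs that never install a per-thread locale pay for one load and one branch per call in this sketch. Whether that matches what was meant here is an assumption.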

> the main counter argument i see is historical practices and
> expectations in various cornercases of string handling in the
> c locale (strcoll etc)

Really? Then may I suggest you look again at the comments above,
and look up those passages in the Standard regarding such cases?

Also, to add to that, it is ALWAYS useful to have available a
locale that uses straight 8-bit encoding, and no multi-byte
character processing. That by itself is a strong argument
against an implementation using a single encoding for all
locales. And, unless the C Standard is going to be modified to
include another required locale, the "C" locale is the most
natural place for POSIX to put it.

William Ahern

Apr 11, 2013, 11:40:15 PM
Tim Rentsch <t...@alumni.caltech.edu> wrote:
> Szabolcs Nagy <n...@port70.net> writes:
<snip>
> > there are performance issues as well: on posix, with uselocale
> > etc apis, locale is thread specific which means multi-byte apis
> > should branch on (or dereference) thread local storage to dispatch
> > between the different decoders and that is very slow on some
> > architectures (so if the c locale is required to be single byte,
> > but the implementation wants to support non-english languages as
> > well then it has to support at least two encodings which means
> > mbrtowc etc will be slow and thus anything that processes the
> > input character-by-character)

> This argument is not compelling, for a couple of reasons.

> One is that many of the key functions (eg, isalpha()) will need
> to be different in different locales even if all the encodings
> are the same. So having a single encoding doesn't buy you
> anything in those cases, and they are common cases.

It's also worth mentioning that isalpha() is broken anyhow for UTF. There
are multibyte "alpha" glyphs which still won't fit a single data type of any
size. (You would need to use dynamic compositioning a la Perl 6's
normalization form to make it work within the constraints of the current
APIs.)

C's multibyte support is more about backward compatibility, IMNSHO, for
older national encodings. It's fundamentally broken for Unicode (or at least
UTF). I don't understand why people invest so much time beating the dead
horse of current standardized locale interfaces.

Szabolcs Nagy

Apr 15, 2013, 10:41:59 AM
Tim Rentsch <t...@alumni.caltech.edu> wrote:
> Szabolcs Nagy <n...@port70.net> writes:
>>
>> otherwise i think this is a modern requirement, historical
>> implementations do not support multi-byte encoding properly
>
> Multi-byte characters have been included in C since (at least)
> the first ISO standard in 1990. Are you suggesting that C90
> implementations deserve no consideration in discussing this
> topic? If so I disagree.

i am aware of that..

my wording was wrong, what i meant was that historically there
was no need for a system to use multi-byte in the c locale,
because they did not support a universal character set, but
a mix of character sets with various encodings

> This argument seems backwards. One should never use non-basic
> characters in an ordinary string literal, because they aren't
> portable. That's why there are wide-character constants and
> string literals, which have been in C since C90.

wide character string literals make things worse (they are
non-portable and hard to use correctly and not what is needed
usually)

i agree if portability matters then only basic characters should
be used in string literals

an implementation may still want to support multi-byte in string
literals correctly and since the c locale gives "a minimal environment
for C translation" it requires multi-byte then on such a platform

>> there are performance issues as well: on posix, with uselocale
>> etc apis, locale is thread specific which means multi-byte apis
>> should branch on (or dereference) thread local storage to dispatch
>> between the different decoders and that is very slow on some
>> architectures (so if the c locale is required to be single byte,
>> but the implementation wants to support non-english languages as
>> well then it has to support at least two encodings which means
>> mbrtowc etc will be slow and thus anything that processes the
>> input character-by-character)
>
> This argument is not compelling, for a couple of reasons.
>
> One is that many of the key functions (eg, isalpha()) will need
> to be different in different locales even if all the encodings
> are the same. So having a single encoding doesn't buy you
> anything in those cases, and they are common cases.

you assume that the implementation provides different locales

currently it is only required to provide the c locale

a minimal implementation may want to support utf8 text but no
other locales (this is currently a valid implementation as far
as i can see and does not have any performance problems)

> Also, to add to that, it is ALWAYS useful to have available a
> locale that uses straight 8-bit encoding, and no multi-byte
> character processing. That by itself is a strong argument
> against an implementation using a single encoding for all
> locales. And, unless the C Standard is going to be modified to
> include another required locale, the "C" locale is the most
> natural place for POSIX to put it.

allowing a single multi-byte encoding is useful as well

the usefulness of an 8-bit encoding is a different matter

(and i'm not yet convinced about that:

usually when the encoding of some input is not known or it is
binary data then appropriate data processing tools should be
used and not text processing tools in an 8-bit locale, when
the encoding is known then there are conversion tools..

plan9 is a nice demonstration how multi-byte text can work
without an 8-bit locale)

Szabolcs Nagy

Apr 15, 2013, 10:44:01 AM
William Ahern <wil...@wilbur.25thandclement.com> wrote:
> It's also worth mentioning that isalpha() is broken anyhow for UTF. There
> are multibyte "alpha" glyphs which still won't fit a single data type of any
..

isalpha is well defined in the c locale

encoding and character set does not matter

William Ahern

Apr 15, 2013, 4:12:47 PM
Szabolcs Nagy <n...@port70.net> wrote:
> William Ahern <wil...@wilbur.25thandclement.com> wrote:
> > It's also worth mentioning that isalpha() is broken anyhow for UTF.
> > There are multibyte "alpha" glyphs which still won't fit a single data
> > type of any
> ..

> isalpha is well defined in the c locale

It's well defined but useless for UTF, because UTF violates the model
implicit in the interface. The C standard uses a concept of "character"
which is incompatible with both code points and characters as described by
Unicode.

The standard seems to equate "character" with "code point", but that's
insufficient.

> encoding and character set does not matter

They matter if you're actually trying to accomplish something useful, such
as parsing text in a language-agnostic manner.

But most people's experience with UTF isn't with parsing text qua text.
They're usually parsing some other kind of structured data with semantics
similar to ASCII, and where interstitial text can be safely ignored.
Everything _seems_ to work, and people go on their merry way using crippled
interfaces, content to have put lipstick on a pig. This is how the C
standard approaches things; from the perspective of defining the semantics
sufficiently for a compiler writer implementing the standard.

Notice that C11 Annex D disallows combining marks. I don't see that
restriction stated wrt iswalpha....

Tim Rentsch

Apr 20, 2013, 3:47:30 PM
Szabolcs Nagy <n...@port70.net> writes:

> Tim Rentsch <t...@alumni.caltech.edu> wrote:
>> Szabolcs Nagy <n...@port70.net> writes:
>>>
>>> otherwise i think this is a modern requirement, historical
>>> implementations do not support multi-byte encoding properly
>>
>> Multi-byte characters have been included in C since (at least)
>> the first ISO standard in 1990. Are you suggesting that C90
>> implementations deserve no consideration in discussing this
>> topic? If so I disagree.
>
> i am aware of that..
>
> my wording was wrong, what i meant was that historically there
> was no need for a system to use multi-byte in the c locale,
> because they did not support a universal character set, but
> a mix of character sets with various encodings

Check your facts. ISO C has had universal character names for
fourteen years. It wasn't a compelling argument to change the
"C" locale in the past, and it isn't now.

>> This argument seems backwards. One should never use non-basic
>> characters in an ordinary string literal, because they aren't
>> portable. That's why there are wide-character constants and
>> string literals, which have been in C since C90.
>
> wide character string literals make things worse (they are
> non-portable and hard to use correctly and not what is needed
> usually)

But wide-character string constants using non-basic characters
are more portable than normal string constants, because wide
characters are required to be large enough to represent all
characters in the extended character set. If wide-character
string constants shouldn't be used because they aren't portable,
that applies in spades to regular string constants.
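
A small sketch of the difference, assuming the implementation's extended character set contains U+00E9 (true of common implementations, but an assumption). The wide literal holds one wchar_t element for the character, while the length of the ordinary literal depends on the execution character set chosen at translation time. The names are made up for illustration.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

static const wchar_t wide_e[] = L"\u00e9"; /* one wide character */
static const char narrow_e[] = "\u00e9";   /* encoding-dependent bytes */

/* Number of wchar_t elements before the terminator. */
size_t wide_length(void)
{
    assert(wide_e[0] != L'\0');
    return wcslen(wide_e);
}

/* Number of bytes before the terminator: 2 with a UTF-8 execution
   character set, 1 with Latin-1, and so on. */
size_t narrow_length(void)
{
    return strlen(narrow_e);
}
```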

> i agree if portability matters then only basic characters should
> be used in string literals
>
> an implementation may still want to support multi-byte in string
> literals correctly and since the c locale gives "a minimal environment
> for C translation" it requires multi-byte then on such a platform

This is a circular argument. An implementation that chooses to
have multi-byte characters in the "C" locale obviously needs to
have a "C" locale that uses multi-byte characters.

>>> there are performance issues as well: on posix, with uselocale
>>> etc apis, locale is thread specific which means multi-byte apis
>>> should branch on (or dereference) thread local storage to dispatch
>>> between the different decoders and that is very slow on some
>>> architectures (so if the c locale is required to be single byte,
>>> but the implementation wants to support non-english languages as
>>> well then it has to support at least two encodings which means
>>> mbrtowc etc will be slow and thus anything that processes the
>>> input character-by-character)
>>
>> This argument is not compelling, for a couple of reasons.
>>
>> One is that many of the key functions (eg, isalpha()) will need
>> to be different in different locales even if all the encodings
>> are the same. So having a single encoding doesn't buy you
>> anything in those cases, and they are common cases.
>
> you assume that the implementation provides different locales
>
> currently it is only required to provide the c locale

Check your facts. ISO C has required a minimum of two supported
locales since 1990.

> a minimal implementation may want to support utf8 text but no
> other locales (this is currently a valid implementation as far
> as i can see and does not have any performance problems)

You are confusing encodings and locales. An implementation could
choose to use the same encoding in all locales, but that doesn't
make them the same locale. The requirement that an implementation
provide at least two locales is an indication that these locales
are likely to use different encodings. And sensibly so, because
they are used for different purposes.

>> Also, to add to that, it is ALWAYS useful to have available a
>> locale that uses straight 8-bit encoding, and no multi-byte
>> character processing. That by itself is a strong argument
>> against an implementation using a single encoding for all
>> locales. And, unless the C Standard is going to be modified to
>> include another required locale, the "C" locale is the most
>> natural place for POSIX to put it.
>
> allowing a single multi-byte encoding is useful as well

There are some benefits to using a single multi-byte encoding
in all locales. My point all along has been that the costs
outweigh the benefits.

> the usefulness of an 8-bit encoding is a different matter
>
> (and i'm not yet convinced about that:
>
> usually when the encoding of some input is not known or it is
> binary data then appropriate data processing tools should be
> used and not text processing tools in an 8-bit locale, when
> the encoding is known then there are conversion tools..

When what is being processed is text, it should be processed in a
textual mode, not a binary mode. What you want is to emasculate
the C language so that conversion tools are necessary. Surely
it is better for C implementations to provide enough flexibility
so such conversion tools are not needed.

> plan9 is a nice demonstration how multi-byte text can work
> without an 8-bit locale)

Obviously an implementation can be written that uses a single
multi-byte encoding in all locales. But it takes away a
flexibility that certainly is useful at times. Furthermore any
code written with that specific implementation in mind is likely
to be less portable to other implementations, because most of them
don't use multi-byte encodings in the "C" locale, let alone all
locales. What you're suggesting, in effect, is to "standardize"
something that essentially no implementations do. Saying "this
would be nice" or "this can be made to work" does not provide any
kind of compelling argument. On the contrary, noting that most
implementations use a single-byte encoding in the "C" locale
provides a strong argument that, if anything, a proposed standard
mandate this common choice (of a single-byte "C" locale), not some
other unusual one.

Antoine Leca

Apr 26, 2013, 7:01:28 AM
William Ahern wrote:
> Notice that C11 Annex D disallows combining marks.

Can you elaborate?
It seems to me it disallows combining marks to start an identifier,
which seems to me a rather smaller cornercase (and does not seem
unreasonable.)


> I don't see that restriction stated wrt iswalpha....

And contrary to what some people might expect, isalpha/isalnum cannot be
used to parse identifiers in C since C99, because of those UCNs.


Antoine

William Ahern

May 1, 2013, 12:21:10 AM
Antoine Leca <ro...@localhost.invalid> wrote:
> William Ahern wrote:
> > Notice that C11 Annex D disallows combining marks.

> Can you elaborate?
> It seems to me it disallows combining marks to start an identifier,
> which seems to me a rather smaller cornercase (and does not seem
> unreasonable.)

I was just looking at the allowable codepoints in C99. I see now that in C11
combining codepoints are only disallowed as the first codepoint.

In any event, one still cannot use a simple loop with iswalnum to consume an
identifier. Unicode identifiers must necessarily be treated as a special
case in order to consume multibyte graphemes that don't compose to a single
codepoint. C's interface is insufficiently abstract--or at least
underspecified--to encompass Unicode.
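
A sketch of the kind of simple loop meant here, with made-up names. The string "e" followed by U+0301 (combining acute accent) is one user-perceived character but two wchar_t elements, so the loop classifies each code point on its own. Whether iswalnum accepts U+0301 at all depends on the implementation and the locale, so the sketch claims no particular result for it.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Naive scanner: counts the leading wchar_t elements that iswalnum()
   accepts. It sees code points, not graphemes. */
size_t naive_ident_length(const wchar_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != L'\0' && iswalnum((wint_t)s[n]))
        n++;
    return n;
}

/* "e", then U+0301 COMBINING ACUTE ACCENT, then "x". */
static const wchar_t decomposed[] = L"e\u0301x";

/* Number of wchar_t elements in the decomposed string. */
size_t decomposed_units(void)
{
    return wcslen(decomposed);
}
```

If `naive_ident_length(decomposed)` returns 1 on a given system, the scanner has split the accented letter in the middle, which is the failure mode described above.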

Though, IMO it shouldn't bother trying. Inevitably a more sophisticated API
would become in part obsolete or irrelevant as best practice evolves. It's
too early.

Szabolcs Nagy

May 2, 2013, 6:55:59 PM
Tim Rentsch <t...@alumni.caltech.edu> wrote:
> Check your facts. ISO C has had universal character names for
> fourteen years. It wasn't a compelling argument to change the
> "C" locale in the past, and it isn't now.

the *implementations* lacked ucs support (not the language)
so obviously there was no requirement on the c locale..

> But wide-character string constants using non-basic characters
> are more portable than normal string constants, because wide
> characters are required to be large enough to represent all
> characters in the extended character set. If wide-character

"extended character set" is not a portable term
"wide character" ditto

> This is a circular argument. An implementation that chooses to
> have multi-byte characters in the "C" locale obviously needs to
> have a "C" locale that uses multi-byte characters.

why would you want to disallow such an implementation?

> Check your facts. ISO C has required a minimum of two supported
> locales since 1990.

ok i didn't know this, can you show supporting text?

i see the "C" locale and the locale of the native environment
specified, but i dont see why those cannot be the same

>> a minimal implementation may want to support utf8 text but no
>> other locales (this is currently a valid implementation as far
>> as i can see and does not have any performance problems)
>
> You are confusing encodings and locales. An implementation could
> choose to use the same encoding in all locales, but that doesn't

the constraints are:
1) supporting non-english text
2) minimal implementation

and this can be done with a single locale and single
(multi-byte) encoding

> When what is being processed is text, it should be processed in a
> textual mode, not a binary mode. What you want is to emasculate
> the C language so that conversion tools are necessary. Surely

you can use text processing tools once you converted
your data into text, before that you cant

> Obviously an implementation can be written that uses a single
> multi-byte encoding in all locales. But it takes away a
> flexibility that certainly is useful at times. Furthermore any

what are the use-cases for a many encoding system that cannot be
easily solved by other means in a single encoding one

> code written with that specific implementation in mind is likely
> to be less portable to other implementations, because most of them
> don't use multi-byte encodings in the "C" locale, let alone all
> locales. What you're suggesting, in effect, is to "standardize"
> something that essentially no implementations do. Saying "this

i'm suggesting *not* to standardize something that might prevent
useful implementations

note that the encoding of the c locale is historically non-portable,
sorting strings in ebcdic gives different result than in ascii

Tim Rentsch

May 6, 2013, 10:22:50 PM
Szabolcs Nagy <n...@port70.net> writes:

> [snip]
> i'm suggesting *not* to standardize something that might prevent
> useful implementations

Apparently you don't understand that this statement by itself
is just not a compelling argument.