Szabolcs Nagy <n...@port70.net> writes:
> Tim Rentsch <t...@alumni.caltech.edu> wrote:
>> Szabolcs Nagy <n...@port70.net> writes:
>>>
>>> otherwise i think this is a modern requirement, historical
>>> implementations do not support multi-byte encoding properly
>>
>> Multi-byte characters have been included in C since (at least)
>> the first ISO standard in 1990. Are you suggesting that C90
>> implementations deserve no consideration in discussing this
>> topic? If so I disagree.
>
> i am aware of that..
>
> my wording was wrong, what i meant was that historically there
> was no need for a system to use multi-byte in the c locale,
> because they did not support a universal character set, but
> a mix of character sets with various encodings

Check your facts. ISO C has had universal character names for
fourteen years (since C99). It wasn't a compelling argument to
change the "C" locale in the past, and it isn't now.

>> This argument seems backwards. One should never use non-basic
>> characters in an ordinary string literal, because they aren't
>> portable. That's why there are wide-character constants and
>> string literals, which have been in C since C90.
>
> wide character string literals make things worse (they are
> non-portable and hard to use correctly and not what is needed
> usually)

But wide-character string constants using non-basic characters
are more portable than normal string constants, because wide
characters are required to be large enough to represent all
characters in the extended character set. If wide-character
string constants shouldn't be used because they aren't portable,
that applies in spades to regular string constants.
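
To illustrate the difference, here is a minimal sketch. It assumes
a C99 compiler and that U+00E9 is a member of the implementation's
extended character set; the names narrow_e, wide_e, narrow_units
and wide_units are made up for the example. The wide literal holds
the character in a single wchar_t element, whereas the number of
bytes in the ordinary literal depends on the execution character
set.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* One non-basic character, U+00E9, written as a universal character
   name in an ordinary and in a wide string literal. */
static const char    narrow_e[] = "\u00e9";
static const wchar_t wide_e[]   = L"\u00e9";

/* Number of bytes (excluding the terminator) in the ordinary literal.
   The result depends on the execution character set. */
size_t narrow_units(void)
{
    return strlen(narrow_e);
}

/* Number of wchar_t elements (excluding the terminator) in the wide
   literal. */
size_t wide_units(void)
{
    return wcslen(wide_e);
}
```

With a UTF-8 execution character set narrow_units() would be
expected to return 2, and with ISO 8859-1 it would return 1;
wide_units() returns 1 in either case.
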
> i agree if portability matters then only basic characters should
> be used in string literals
>
> an implementation may still want to support multi-byte in string
> literals correctly and since the c locale gives "a minimal environment
> for C translation" it requires multi-byte then on such a platform

This is a circular argument. An implementation that chooses to
have multi-byte characters in the "C" locale obviously needs to
have a "C" locale that uses multi-byte characters.

>>> there are performance issues as well: on posix, with uselocale
>>> etc apis, locale is thread specific which means multi-byte apis
>>> should branch on (or dereference) thread local storage to dispatch
>>> between the different decoders and that is very slow on some
>>> architectures (so if the c locale is required to be single byte,
>>> but the implementation wants to support non-english languages as
>>> well then it has to support at least two encodings which means
>>> mbrtowc etc will be slow and thus anything that processes the
>>> input character-by-character)
>>
>> This argument is not compelling, for a couple of reasons.
>>
>> One is that many of the key functions (eg, isalpha()) will need
>> to be different in different locales even if all the encodings
>> are the same. So having a single encoding doesn't buy you
>> anything in those cases, and they are common cases.
>
> you assume that the implementation provides different locales
>
> currently it is only required to provide the c locale

Check your facts. ISO C has required a minimum of two supported
locales since 1990.
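
The two locale names that ISO C itself specifies for setlocale()
are "C" and "" (the native environment). A minimal sketch for
probing them follows; the helper name locale_available is
hypothetical, and note that the call changes the program's current
locale as a side effect.

```c
#include <locale.h>
#include <stddef.h>

/* Returns 1 if setlocale() accepts the given name for LC_ALL,
   0 otherwise. Side effect: on success the program's locale is
   changed to the named locale. */
int locale_available(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}
```

locale_available("C") has to succeed. What locale_available("")
selects, and whether it differs from "C", is up to the
implementation and its environment.
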
> a minimal implementation may want to support utf8 text but no
> other locales (this is currently a valid implementation as far
> as i can see and does not have any performance problems)

You are confusing encodings and locales. An implementation could
choose to use the same encoding in all locales, but that doesn't
make them the same locale. The requirement that an implementation
provide at least two locales is an indication that these locales
are likely to use different encodings. And sensibly so, because
they are used for different purposes.
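
One visible property of a locale's encoding is MB_CUR_MAX, the
maximum number of bytes in a multi-byte character for the current
LC_CTYPE setting. A minimal sketch for comparing locales on that
one property (the helper name max_bytes_per_char is made up for
illustration):

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns MB_CUR_MAX for the named locale, or 0 if the locale cannot
   be selected. LC_CTYPE is set back to "C" before returning. */
size_t max_bytes_per_char(const char *name)
{
    size_t n = 0;

    if (setlocale(LC_CTYPE, name) != NULL)
        n = MB_CUR_MAX;
    setlocale(LC_CTYPE, "C");
    return n;
}
```

Comparing max_bytes_per_char("C") with max_bytes_per_char("") shows
whether the two encodings differ in this respect. Two locales can
report the same value and still differ in character classification,
collation, or formatting, which is the distinction between an
encoding and a locale.
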
>> Also, to add to that, it is ALWAYS useful to have available a
>> locale that uses straight 8-bit encoding, and no multi-byte
>> character processing. That by itself is a strong argument
>> against an implementation using a single encoding for all
>> locales. And, unless the C Standard is going to be modified to
>> include another required locale, the "C" locale is the most
>> natural place for POSIX to put it.
>
> allowing a single multi-byte encoding is useful as well

There are some benefits to using a single multi-byte encoding
in all locales. My point all along has been that the costs
outweigh the benefits.

> the usefulness of an 8-bit encoding is a different matter
>
> (and i'm not yet convinced about that:
>
> usually when the encoding of some input is not known or it is
> binary data then appropriate data processing tools should be
> used and not text processing tools in an 8-bit locale, when
> the encoding is known then there are conversion tools..

When what is being processed is text, it should be processed in a
textual mode, not a binary mode. What you want is to emasculate
the C language so that conversion tools are necessary. Surely
it is better for C implementations to provide enough flexibility
so that such conversion tools are not needed.
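
As a hedged example of the kind of processing meant here: text in
an unknown 8-bit encoding can be scanned byte by byte with the
<ctype.h> functions in the "C" locale, with no decoding step that
could fail. In the "C" locale isalpha() is true only for the 52
letters of the basic character set, so bytes above 0x7F are simply
not counted. The function name count_c_letters is made up for this
sketch, and it switches LC_CTYPE to "C" as a side effect.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Counts the bytes in buf that are letters according to the "C"
   locale. Side effect: LC_CTYPE is set to "C". */
size_t count_c_letters(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    setlocale(LC_CTYPE, "C");
    for (size_t i = 0; i < len; i++) {
        /* buf[i] is an unsigned char, so the value is always a valid
           argument for isalpha(). */
        if (isalpha(buf[i]))
            n++;
    }
    return n;
}
```

Given the ten bytes 'c' 'a' 'f' 0xE9 ' ' 'n' 'a' 0xEF 'v' 'e', the
function counts the seven ASCII letters and skips the two high
bytes and the space.
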
> plan9 is a nice demonstration how multi-byte text can work
> without an 8-bit locale)

Obviously an implementation can be written that uses a single
multi-byte encoding in all locales. But it takes away a
flexibility that certainly is useful at times. Furthermore, any
code written with that specific implementation in mind is likely
to be less portable to other implementations, because most of them
don't use multi-byte encodings in the "C" locale, let alone all
locales. What you're suggesting, in effect, is to "standardize"
something that essentially no implementations do. Saying "this
would be nice" or "this can be made to work" does not provide any
kind of compelling argument. On the contrary, noting that most
implementations use a single-byte encoding in the "C" locale
provides a strong argument that, if anything, a proposed standard
should mandate this common choice (of a single-byte "C" locale),
not some other unusual one.