Multibyte characters questions

Joseph S. Myers

Jun 6, 2001, 6:30:09 PM

Some questions on multibyte characters (relating to the execution
character set only):

(a) The response to DR#091 states that the execution multibyte
character set need not be prefix-free, and in particular that the
characters of the basic execution character set may be prefixes of
other characters. Does this response also apply to C99?

(b) If so, may the byte with all bits zero be a prefix of a multibyte
character (as well as being a null character in its own right)? I'd
suppose not, but the difference between "shall be interpreted as a
null character" (of that byte) and "retain their usual interpretation"
(of the basic characters, in the initial shift state) (5.2.1.2#1)
seems rather obscure.

(c) If the response still applies to C99, can a s.c. program do

/* ... */
char s[1] = "a";
wprintf(L"%.1s\n", s);

? Here, the array s contains a valid multibyte character - but
wprintf can't determine whether it is complete or a prefix without
going beyond the end of the array. (The specification of the return
values of mbrtowc doesn't really seem to envisage non-prefix-free
multibyte character sets, either.)

(d) May the extended execution character set define multibyte
characters that are alternative representations for characters in the
basic character set (i.e., that convert to the same wide character)?
If so, may they include alternative representations for the null
character? Must such alternative representations be accepted as
equivalent by the C library; for example, must printf accept
alternative multibyte representations of % and other characters used
in conversion specifiers, and must alternative null characters (if
the standard permits them) be accepted? (DR#090 addresses the
converse situation: where % in another shift state represents
something other than the % of the basic character set, printf does not
recognise it.)

--
Joseph S. Myers
js...@cam.ac.uk

Douglas A. Gwyn

Jun 8, 2001, 1:09:40 PM

"Joseph S. Myers" wrote:
> ... may the byte with all bits zero be a prefix of a multibyte

> character (as well as being a null character in its own right)?

No byte of a multibyte character encoding is allowed to have a 0 value;
0 is reserved for use as a string terminator, so that the str*()
functions can be safely used to operate with multibyte arrays.
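
A minimal sketch of that property (the two-byte pair 0x41,0x9A and the test string are invented for illustration): because no byte of a multibyte character is zero, strlen() and friends can walk the array byte by byte, relying only on the terminating zero byte.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical two-byte character 0x41,0x9A followed by the basic
       character 'B'.  No byte of the encoding is zero, so the final
       '\0' is unambiguous; the str*() functions operate on bytes. */
    char s[] = { 0x41, (char) 0x9A, 'B', '\0' };
    char copy[sizeof s];

    printf("%zu\n", strlen(s));   /* prints 3: a byte count, not a
                                     character count */
    strcpy(copy, s);              /* copies through the terminating zero */
    return 0;
}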

> ... can a s.c. program do


> char s[1] = "a";
> wprintf(L"%.1s\n", s);
> ? Here, the array s contains a valid multibyte character - but
> wprintf can't determine whether it is complete or a prefix without
> going beyond the end of the array.

That would be an invalid encoding, which produces undefined behavior.

> (d) May the extended execution character set define multibyte
> characters that are alternative representations for characters in the
> basic character set (i.e., that convert to the same wide character)?

I don't see why not. Of course, converting the multibyte encoding
back to wide character won't necessarily match the original.

> If so, may they include alternative representations for the null
> character?

"Null character" is a C construct only; it always has value 0.

Clive D. W. Feather

Jun 10, 2001, 9:05:53 AM

In article <3B2106D4...@null.net>, Douglas A. Gwyn
<DAG...@null.net> writes

>No byte of a multibyte character encoding is allowed to have a 0 value;
>0 is reserved for use as a string terminator, so that the str*()
>functions can be safely used to operate with multibyte arrays.

Um, that's not quite what the Standard says, even if it's what we meant.
A single-byte character 0 is the terminator, yes. A multibyte character
cannot have a zero byte as the second or subsequent byte. But, given
that single-byte characters can be prefixes of multibyte ones, where
does it say that the first byte of a multibyte character can't be 0 ?

>> ... can a s.c. program do
>> char s[1] = "a";
>> wprintf(L"%.1s\n", s);
>> ? Here, the array s contains a valid multibyte character - but
>> wprintf can't determine whether it is complete or a prefix without
>> going beyond the end of the array.
>
>That would be an invalid encoding, which produces undefined behavior.

What makes it invalid ? We thought it was invalid when we answered DR
091.

====

Does a locale with the following encoding of multibyte characters
conform to the C Standard?

* The 99 characters of the basic execution character set have codes 1 to
99, in the order mentioned in subclause 5.2.1.1 (so 'A' == 1, 'a' == 27,
'0' == 53, '!' == 63, '\n' == 99).
* The extended execution character set consists of 16,256 (127 x 128)
two-byte characters. For each two-byte character, the first byte is
between 1 and 127 inclusive, and the second byte is between 128 and 255
inclusive.

Note that any sequence of bytes can unambiguously be broken into
multibyte characters, but the basic characters are prefixes of other
characters.

Response

The hypothetical locale described does conform to the C Standard
because the specified encoding does not violate the requirements
imposed on multibyte characters by subclause 5.2.1.2. No additional
requirements are needed.

====
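
As an illustration of why the quoted encoding is unambiguous, here is a decoding sketch for that hypothetical locale (the function and its names are invented for this example): bytes 1 to 99 are the basic characters, and a byte in 1-127 followed by a byte in 128-255 forms a two-byte extended character.

#include <stddef.h>

/* Returns the length in bytes of the next character under the DR 091
   locale, 0 at the null character or end of input, or (size_t)-1 for a
   stray trail byte.  Peeking at the following byte resolves the prefix
   ambiguity -- provided that byte is available. */
static size_t decode_one(const unsigned char *p, const unsigned char *end)
{
    if (p == end || *p == 0)
        return 0;                  /* null character / end of input */
    if (*p >= 128)
        return (size_t) -1;        /* a trail byte on its own is invalid */
    if (p + 1 < end && p[1] >= 128)
        return 2;                  /* lead byte + trail byte */
    return 1;                      /* single-byte (basic) character */
}

Note that when the buffer ends immediately after a lead byte, this decoder cannot tell a complete basic character from the prefix of a two-byte one, which is exactly the wprintf("%.1s") problem raised at the start of the thread.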

--
Clive D.W. Feather, writing for himself | Home: <cl...@davros.org>
Tel: +44 20 8371 1138 (work) | Web: <http://www.davros.org>
Fax: +44 20 8371 4037 (D-fax) | Work: <cl...@demon.net>
Written on my laptop; please observe the Reply-To address

Clive D. W. Feather

Jun 10, 2001, 8:58:46 AM

In article <9fmath$5c8$1...@pegasus.csx.cam.ac.uk>, Joseph S. Myers
<js...@cam.ac.uk> writes

>Some questions on multibyte characters (relating to the execution
>character set only):
>
>(a) The response to DR#091 states that the execution multibyte
>character set need not be prefix-free, and in particular that the
>characters of the basic execution character set may be prefixes of
>other characters. Does this response also apply to C99?

I don't see why not. Note that Unicode is arguably not prefix-free (you
write accents after the base letter to produce an accented letter).

Documenting the application of C89 DRs to C99 is something that's being
done in my Copious Free Time.

>(b) If so, may the byte with all bits zero be a prefix of a multibyte
>character (as well as being a null character in its own right)? I'd
>suppose not, but the difference between "shall be interpreted as a
>null character" (of that byte) and "retain their usual interpretation"
>(of the basic characters, in the initial shift state) (5.2.1.2#1)
>seems rather obscure.

Indeed.

>(c) If the response still applies to C99, can a s.c. program do
>
> /* ... */
> char s[1] = "a";
> wprintf(L"%.1s\n", s);
>
>? Here, the array s contains a valid multibyte character - but
>wprintf can't determine whether it is complete or a prefix without
>going beyond the end of the array. (The specification of the return
>values of mbrtowc doesn't really seem to envisage non-prefix-free
>multibyte character sets, either.)

Cringe. I'm not sure I have a particularly good answer off the top of my
head.

>(d) May the extended execution character set define multibyte
>characters that are alternative representations for characters in the
>basic character set (i.e., that convert to the same wide character)?

Yes.

>If so, may they include alternative representations for the null
>character?

I think so; at least, I can't see why not.

>Must such alternative representations be accepted as
>equivalent by the C library; for example, must printf accept
>alternative multibyte representations of % and other characters used
>in conversion specifiers, and must alternative null characters (if
>the standard permits them) be accepted?

Yes. That is, except where the wording says otherwise, all
representations of % are equally valid. Note, however, that %[
introduces an interesting booby trap.

Joseph S. Myers

Jun 10, 2001, 12:39:00 PM

In article <liabrdzG...@romana.davros.org>,

Clive D. W. Feather <cl...@davros.org> wrote:
>I don't see why not. Note that Unicode is arguably not prefix-free (you
>write accents after the base letter to produce an accented letter).

I rather doubt that Unicode is likely to be implemented in C that
way, since you'd need to invent a new encoding for wchar_t that stores
both combining and base characters in one wchar_t.

>Yes. That is, except where the wording says otherwise, all
>representations of % are equally valid. Note, however, that %[
>introduces an interesting booby trap.

%[ is my next question. As I (and the Austin Group draft) interpret
it, %[ contains a sequence of bytes, not multibyte characters, even
though the format string is a multibyte character sequence. So:

(a) Is the initial [ matched as a byte, or as any multibyte sequence
representing [?

(b) Is a ^, ] or ^] following the [ matched as a byte or pair of
bytes, or as any multibyte sequence representing those characters?

(c) Are subsequent characters of the string read as bytes or as
multibyte characters?

(d) Is the implementation-defined interpretation of - in the string
associated with a - byte, or a - multibyte character?

(e) Is the end determined as a ] byte or as a multibyte character
representing ]?

(f) If at any point bytes rather than multibyte characters are read,
and so after the ] is not the end of a multibyte character in the
original source string, must the implementation then continue to read
multibyte characters from there, out of sync with the normal multibyte
character sequence of the string? Is the behavior defined as long as
the string so interpreted is a valid format string (albeit not
following the normal multibyte character sequence)? What about if the
string so interpreted does not end in the initial shift state?

Clive D. W. Feather

Jun 10, 2001, 3:16:35 PM

In article <9g07r4$id7$1...@pegasus.csx.cam.ac.uk>, Joseph S. Myers
<js...@cam.ac.uk> writes

>%[ is my next question. As I (and the Austin Group draft) interpret
>it, %[ contains a sequence of bytes, not multibyte characters, even
>though the format string is a multibyte character sequence.

That's right.

>(a) Is the initial [ matched as a byte, or as any multibyte sequence
>representing [?

Multibyte.

>(b) Is a ^, ] or ^] following the [ matched as a byte or pair of
>bytes, or as any multibyte sequence representing those characters?

Bytes.

>(c) Are subsequent characters of the string read as bytes or as
>multibyte characters?

Bytes up to the ]. After the ], you go back to multibyte.

>(d) Is the implementation-defined interpretation of - in the string
>associated with a - byte, or a - multibyte character?

Byte.

>(e) Is the end determined as a ] byte or as a multibyte character
>representing ]?

Byte.

>(f) If at any point bytes rather than multibyte characters are read,
>and so after the ] is not the end of a multibyte character in the
>original source string, must the implementation then continue to read
>multibyte characters from there, out of sync with the normal multibyte
>character sequence of the string?

Yes.

>Is the behavior defined as long as
>the string so interpreted is a valid format string (albeit not
>following the normal multibyte character sequence)?

Yes.

>What about if the
>string so interpreted does not end in the initial shift state?

Undefined behaviour. [I don't know why, but it is.]

All this can be deduced quite easily once you remember that "character"
means "byte", not "multibyte character".

Douglas A. Gwyn

Jun 11, 2001, 11:28:18 AM

"Clive D. W. Feather" wrote:
> In article <3B2106D4...@null.net>, Douglas A. Gwyn
> <DAG...@null.net> writes
> >No byte of a multibyte character encoding is allowed to have a 0 value;
> >0 is reserved for use as a string terminator, so that the str*()
> >functions can be safely used to operate with multibyte arrays.
> Um, that's not quite what the Standard says, even if it's what we meant.
> A single-byte character 0 is the terminator, yes. A multibyte character
> cannot have a zero byte as the second or subsequent byte. But, given
> that single-byte characters can be prefixes of multibyte ones, where
> does it say that the first byte of a multibyte character can't be 0 ?

5.2.1.2: A byte with all bits zero shall be interpreted as a null
character independent of shift state. (This is not a sub-bullet of
the item about shift state, but applies to all multibyte character
sets, in both the source and execution environments.)

> >> ... can a s.c. program do
> >> char s[1] = "a";
> >> wprintf(L"%.1s\n", s);
> >> ? Here, the array s contains a valid multibyte character - but
> >> wprintf can't determine whether it is complete or a prefix without
> >> going beyond the end of the array.
> >That would be an invalid encoding, which produces undefined behavior.
> What makes it invalid ? We thought it was invalid when we answered DR
> 091.

I was not talking about the encoding *scheme*, but rather the
multibyte encoding stored in the array s.

In an encoding scheme that requires presence of, say, 2 well-defined
bytes to determine the meaning (character assignment), as is
hypothesized here, providing only a single byte is not meeting a
basic requirement of the encoding scheme, which makes this example
an invalid encoding if ever I saw one.

Another way of looking at it is, do you *really* want this to be
a valid encoding? We *know* it can cause trouble due to the
nonexistence of data that wprintf is going to have to access.

> Does a locale with the following encoding of multibyte characters
> conform to the C Standard?

The DR response doesn't contradict anything I said, and conversely.
*Some* prefix schemes are possible, but they must be used with care,
as the above example shows.

Bjorn Reese

Jun 11, 2001, 2:56:56 PM

"Clive D. W. Feather" wrote:
>
> In article <9g07r4$id7$1...@pegasus.csx.cam.ac.uk>, Joseph S. Myers
> <js...@cam.ac.uk> writes
> >%[ is my next question. As I (and the Austin Group draft) interpret
> >it, %[ contains a sequence of bytes, not multibyte characters, even
> >though the format string is a multibyte character sequence.
>
> That's right.

How do I go about matching alternative multibyte characters then?

Could a solution be to implement collating symbols as in POSIX
regular expression (disregarding the fact that this will be a
non-standard extension). For example

%[[.put_the_multibyte_character_here.]]

Clive D. W. Feather

Jun 12, 2001, 9:22:27 AM

In article <3B24E392...@null.net>, Douglas A. Gwyn
<DAG...@null.net> writes

>5.2.1.2: A byte with all bits zero shall be interpreted as a null
>character independent of shift state. (This is not a sub-bullet of
>the item about shift state, but applies to all multibyte character
>sets, in both the source and execution environments.)

Yes, but that doesn't forbid multibyte characters with the first byte
zero. Just as the fact that 0x41 is a character doesn't forbid 0x41,0x9A
from being a different character.

>> >> ... can a s.c. program do
>> >> char s[1] = "a";
>> >> wprintf(L"%.1s\n", s);
>> >> ? Here, the array s contains a valid multibyte character - but
>> >> wprintf can't determine whether it is complete or a prefix without
>> >> going beyond the end of the array.
>> >That would be an invalid encoding, which produces undefined behavior.
>> What makes it invalid ? We thought it was invalid when we answered DR
>> 091.

[That should have said "We thought it was valid"]

>I was not talking about the encoding *scheme*, but rather the
>multibyte encoding stored in the array s.
>In an encoding scheme that requires presence of, say, 2 well-defined
>bytes to determine the meaning (character assignment), as is
>hypothesized here, providing only a single byte is not meeting a
>basic requirement of the encoding scheme, which makes this example
>an invalid encoding if ever I saw one.

Okay, that's reasonable - it's undefined because the precision (1) is
greater than the converted sequence (length 0, because there isn't
enough data to deduce the first character).

>Another way of looking at it is, do you *really* want this to be
>a valid encoding?

I'm not sure.

I've just spotted a problem with such encodings and mbrtowc().

char s [] = "A@BC"; // "A@" is one character, "B" is another
wchar_t wc;
char *ss = s;
size_t r;

do
{
r = mbrtowc (&wc, ss, 1, NULL);
switch (r)
{
case (size_t) -1: printf ("%2x Error\n", *ss); break;
case (size_t) -2: printf ("%2x Incomplete\n", *ss); break;
case (size_t) 0: printf ("%2x End\n", *ss); break;
default: printf ("%2x => %lc (%zu)\n", *ss, wc, r); break;
}
s ++;
}
while (r != 0 && r != (size_t) -1);

This should, I think, print:

41 Incomplete
9A => (~) 1
42 Incomplete

but then what ?

Clive D. W. Feather

Jun 12, 2001, 9:29:31 AM

In article <3B251478...@mail1.stofanet.dk>, Bjorn Reese
<bre...@mail1.stofanet.dk> writes

>How do I go about matching alternative multibyte characters then?

wscanf

> %[[.put_the_multibyte_character_here.]]

No: [ is not treated specially within the scanset.

Douglas A. Gwyn

Jun 12, 2001, 12:35:11 PM

"Clive D. W. Feather" wrote:
> In article <3B24E392...@null.net>, Douglas A. Gwyn
> <DAG...@null.net> writes
> >5.2.1.2: A byte with all bits zero shall be interpreted as a null
> >character independent of shift state. (This is not a sub-bullet of
> >the item about shift state, but applies to all multibyte character
> >sets, in both the source and execution environments.)
> Yes, but that doesn't forbid multibyte characters with the first byte
> zero.

Sure it does. The (entire null) character is the 0-valued byte.
Unless you think that an encoding scheme could be ambiguous,
which I am certain is not our intent.

Note that the example in the DR was careful *not* to have a 0
value as a prefix, presumably in order to avoid this constraint.

> Just as the fact that 0x41 is a character doesn't forbid 0x41,0x9A
> from being a different character.

Not relevant, because the byte value 0x41 is not singled out as
having a special universal meaning.

Now, if we were trying to design an altogether new character
facility, perhaps for a new programming language, I would argue
against the use of "in-band" special-function (delimiter) values.
However, the C embedded-0 byte issue was thoroughly debated, and
what is specified for this was the outcome. No distinction was
made based on where within a multibyte encoding the 0 byte might
occur; we definitely realized that this ruled out some encodings,
and when we were asked about 16-bit Unicode (UCS-2) we replied
that that was not suitable for a Standard C multibyte encoding,
due to running afoul of the embedded-0 constraint.

> I've just spotted a problem with such encodings and mbrtowc().

I would say that when a positive return is possible, it is
required. The real error in the example is in not *allowing*
mbrtowc to look far enough ahead to properly decode. That's
a programming error, due perhaps to not thinking about prefix-
encodings, not an interface error.
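
A sketch of what "allowing mbrtowc to look far enough ahead" might look like (the helper and its names are mine): pass the whole remaining tail of the string so the function can consume a complete character, with (size_t)-2 reported only when the string genuinely ends mid-character.

#include <stdio.h>
#include <string.h>
#include <wchar.h>

static void dump(const char *s)
{
    mbstate_t st = {0};               /* initial conversion state */
    size_t left = strlen(s) + 1;      /* include the terminating null */
    wchar_t wc;
    size_t r;

    while ((r = mbrtowc(&wc, s, left, &st)) != 0) {
        if (r == (size_t) -1) { printf("invalid sequence\n"); return; }
        if (r == (size_t) -2) { printf("truncated character\n"); return; }
        printf("%lc: %zu byte(s)\n", (wint_t) wc, r);
        s += r;
        left -= r;
    }
}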

Bjorn Reese

Jun 13, 2001, 12:19:15 PM

"Clive D. W. Feather" wrote:

> >How do I go about matching alternative multibyte characters then?
>
> wscanf

Well, I was thinking of scanf explicitly.

> > %[[.put_the_multibyte_character_here.]]
>
> No: [ is not treated specially within the scanset.

Yes, I am aware of this, and maybe my suggestion is inappropriate
for this newsgroup. I was merely trying to think of alternative
solutions, as it apparently isn't possible to scan for multibyte
characters with scanf.

My idea was that if one were to extend the [ specifier of scanf
with collating symbols, as illustrated above, this could be used
to scan for multibyte characters.

Actually, it would also make sense to extend scanf with the
equivalence class and character class expressions from the POSIX
regular expressions. For example

/* Collating symbols */
scanf("%[[.ch.]]", ch_letter);

/* Character class expression */
scanf("%[[:alpha:]]", letters_only);

/* Equivalence class expression */
scanf("%[[=e=]]", accentuated_e);

Dave Prosser

Jun 13, 2001, 2:55:21 PM

Bjorn Reese wrote:
> My idea was that if one were to extend the [ specifier of scanf
> with collating symbols, as illustrated above, this could be used
> to scan for multibyte characters.
>
> Actually, it would also make sense to extend scanf with the
> equivalence class and character class expressions from the POSIX
> regular expressions. For example
>
> /* Collating symbols */
> scanf("%[[.ch.]]", ch_letter);
>
> /* Character class expression */
> scanf("%[[:alpha:]]", letters_only);
>
> /* Equivalence class expression */
> scanf("%[[=e=]]", accentuated_e);

Why? Is there some gain to using the "ugly" *scanf() pushback
behavior on top of the existing POSIX regular expression APIs?

And back to the original issue, why can't one use %l[...]? Is
the problem that the result is a wide string instead of multibyte,
so that you'd have to convert back to a multibyte string?
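
A sketch of that workaround (buffer sizes and the scanset are arbitrary): read the field with %l[ so each element is a whole converted multibyte character, then convert back with wcstombs() if a multibyte string is what's wanted.

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    wchar_t wide[64];
    char mb[256];

    /* %l[ stores wide characters; the scanset still excludes newline. */
    if (scanf("%63l[^\n]", wide) == 1 &&
        wcstombs(mb, wide, sizeof mb) != (size_t) -1)
        printf("read: %s\n", mb);
    return 0;
}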

Most C programmers that I know tend to shy away from using any
of the *scanf() family anyway, except possibly for the s*scanf()s.
So, why put even more burden on these functions, unless we want
to encourage programmers to use them?

Personally, I didn't even like the %l[...] addition to C99, and
if I'd still been active in the technical committee when
it was doing C99, I'd have argued against it. (Not that I'd
have expected to win the argument. :-) This was one of the few
wide/multibyte features added beyond the 1st Amendment's.

--
Dave Prosser dfp at sco dot com Caldera, Murray Hill, NJ

Clive D. W. Feather

Jun 13, 2001, 11:36:26 AM

In article <3B2644BF...@null.net>, Douglas A. Gwyn
<DAG...@null.net> writes

>> Yes, but that doesn't forbid multibyte characters with the first byte
>> zero.
>Sure it does. The (entire null) character is the 0-valued byte.

But that doesn't stop it being the prefix of another character.

>Unless you think that an encoding scheme could be ambiguous,
>which I am certain is not our intent.

It's not ambiguous, and it's a dangerous scheme, and we don't want to
support it, but I'm not convinced that the wording forbids it.

>Note that the example in the DR was careful *not* to have a 0
>value as a prefix, presumably in order to avoid this constraint.

Um, no, I didn't want to go that way because it would have confused the
matter.

>> Just as the fact that 0x41 is a character doesn't forbid 0x41,0x9A
>> from being a different character.
>Not relevant, because the byte value 0x41 is not singled out as
>having a special universal meaning.

[...]

All we're saying is that we're not convinced that the wording supports
you.

>> I've just spotted a problem with such encodings and mbrtowc().
>I would say that when a positive return is possible, it is
>required.

I'm not sure what you mean by that.

>The real error in the example is in not *allowing*
>mbrtowc to look far enough ahead to properly decode. That's
>a programming error, due perhaps to not thinking about prefix-
>encodings, not an interface error.

I'm not convinced. The design of mbrtowc, and in particular the -2
return value, was explicitly aimed at allowing it to be fed partial
characters as they become available, rather than having to read the
whole character in advance (as mbtowc does).
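
A sketch of that incremental use (the loop is mine): feed mbrtowc one byte at a time with an explicit mbstate_t, treating (size_t)-2 as "no character yet, keep feeding". With a prefix encoding such as the one in the DR, of course, this runs straight into the problem above: a complete basic character is indistinguishable from a prefix until the next byte arrives.

#include <stdio.h>
#include <wchar.h>

static void feed(const char *bytes, size_t n)
{
    mbstate_t st = {0};
    wchar_t wc;
    size_t i, r;

    for (i = 0; i < n; i++) {
        r = mbrtowc(&wc, &bytes[i], 1, &st);
        if (r == (size_t) -2)
            continue;                  /* partial character: need more bytes */
        if (r == (size_t) -1) {
            printf("invalid sequence\n");
            return;
        }
        if (r == 0)
            return;                    /* converted the terminating null */
        printf("got %lc\n", (wint_t) wc);
    }
}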

Bjorn Reese

Jun 15, 2001, 4:55:26 AM

Dave Prosser wrote:

> Why? Is there some gain to using the "ugly" *scanf() pushback
> behavior on top of the existing POSIX regular expression APIs?

How about internationalization? The very same reason why those
expressions were added to the POSIX regular expressions.

Btw, I make no assumptions about the relation to the POSIX
regular expressions -- I just borrowed its syntax, which makes
the subject easier to discuss.

> And back to the original issue, why can't one use %l[...]? Is
> the problem that the result is a wide string instead of multibyte,
> so that you'd have to convert back to a multibyte string?

That is a suitable workaround for multibyte characters. I have
no problem with that.

However, my point was not that collating symbols should be
added to group scanning to handle multibyte characters, but
rather that if collating symbols were added for other reasons
then it could handle multibyte characters as well.

So the question is, why add collating symbols? Neither multibyte
nor wide characters handle the Spanish 'ch' letter (or the
German 'ss' letter or the Danish 'ae', 'oe', and 'aa' letters
or the vast number of similar letters in other languages). For
example, how do I use scanf to scan for the letters 'a', 'b', or
'ch'?

> Most C programmers that I know tend to shy away from using any
> of the *scanf() family anyway, except possibly for the s*scanf()s.

This is also my experience, but their reason is nearly always
that scanf is not capable of handling their needs. Using a
full-fledged regular expression is often overkill, so they end
up with their own hardcoded scanners (which usually makes
maintenance more difficult).

> So, why put even more burden on these functions, unless we want
> to encourage programmers to use them?

Don't we want to encourage their use?

Dave Prosser

Jun 15, 2001, 2:37:46 PM

Bjorn Reese wrote:
> Dave Prosser wrote:
> > Why? Is there some gain to using the "ugly" *scanf() pushback
> > behavior on top of the existing POSIX regular expression APIs?
>
> How about internationalization? The very same reason why those
> expressions were added to the POSIX regular expressions.
>
> Btw, I make no assumptions about the relation to the POSIX
> regular expressions -- I just borrowed its syntax, which makes
> the subject easier to discuss.

In my opinion, since this capability is covered by the existing
POSIX APIs, it isn't reasonable to add this (or something similar)
to the *scanf() functions. I see no real benefit in practice,
and adding this *large* extra overhead to the %[...] handling in
*scanf() -- which was already thrown for a loop with the addition
of %l[...] -- cannot possibly be worth it.

> So the question is, why add collating symbols? Neither multibyte
> nor wide characters handle the Spanish 'ch' letter (or the
> German 'ss' letter or the Danish 'ae', 'oe', and 'aa' letters
> or the vast number of similar letters in other languages). For
> example, how do I use scanf to scan for the letters 'a', 'b', or
> 'ch'?

You're right, you don't. So read in a line and apply some POSIX
Regular Expression APIs to it as you need. A good implementation
will not have very much overhead and doesn't have all the required
limitations of the *scanf() functions.
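
A sketch of that approach using the POSIX (not ISO C) <regex.h> interface; the pattern and buffer size are placeholders. Collating symbols such as [[.ch.]] can be used in the bracket expression where the locale defines them.

#include <regex.h>     /* POSIX, not ISO C */
#include <stdio.h>

int main(void)
{
    char line[256];
    regex_t re;
    regmatch_t m;

    /* One or more alphabetic characters, per the current locale. */
    if (regcomp(&re, "[[:alpha:]]+", REG_EXTENDED) != 0)
        return 1;

    if (fgets(line, sizeof line, stdin) != NULL &&
        regexec(&re, line, 1, &m, 0) == 0)
        printf("match at byte offsets %ld..%ld\n",
               (long) m.rm_so, (long) m.rm_eo);

    regfree(&re);
    return 0;
}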

> > So, why put even more burden on these functions, unless we want
> > to encourage programmers to use them?
>
> Don't we want to encourage their use?

Again in my opinion, no. Do we want to encourage the use of gets()?
It's in the old and new C standards, too. The *scanf() functions are
not a favorite of programmers or implementors.
