meaning of "character" in translation limits

Jun Woong

Nov 13, 2011, 6:24:03 AM

When the standard says "character" in the translation limits, what
does that term refer to? Is it a byte or a "real" character?

C99 5.2.4.1:
- 4095 characters in a logical source line
- 4095 characters in a character string literal or wide string
literal (after concatenation)

I'm modifying my compiler to issue warnings for code that contains a
string or a line whose length exceeds those limits, but the term
"character" bothers me.

My compiler works as follows:

- any source code whose character encoding is not ASCII is converted
to UTF-8 in TP1; and
- characters in character constants or (wide) string literals are
converted in TP6 to the encoding the user specified.

Consider a program that contains a non-wide string literal that is
M bytes long and encoded in EUC-KR, where M is the limit specified by
the standard. Note that a Korean character is encoded by two bytes in
EUC-KR and by three in UTF-8, thus there are M/2 "characters" in the
literal.

Now, what should my compiler do for it?

a) a warning should not be emitted because the string literal has
less than M+1 bytes in EUC-KR;
b) a warning should be emitted because the string literal has more
than M bytes in UTF-8;
c) a warning should not be emitted because the string literal has
less than M+1 Korean "characters" (it has M/2); or
d) any of the above; it depends on how an implementer maps
characters in TP1 and defines the "character" after TP1.
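
To make the three counting rules concrete, here is a minimal sketch in C. It assumes a literal made up only of Korean characters, each taking two bytes in EUC-KR and three in UTF-8 as stated above; the names korean_literal_size and exceeds_limit are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Limit from C99 5.2.4.1 for characters in a string literal. */
#define STRING_LITERAL_LIMIT 4095u

/* Sizes of a literal that consists only of n Korean characters
   (assumed: 2 bytes each in EUC-KR, 3 bytes each in UTF-8). */
struct literal_size {
    size_t euc_kr_bytes;   /* what option a) counts */
    size_t utf8_bytes;     /* what option b) counts */
    size_t characters;     /* what option c) counts */
};

struct literal_size korean_literal_size(size_t n)
{
    struct literal_size s;
    assert(n <= (size_t)-1 / 3);   /* guard against overflow of 3 * n */
    s.euc_kr_bytes = 2 * n;
    s.utf8_bytes = 3 * n;
    s.characters = n;
    return s;
}

/* Nonzero if count exceeds the limit. */
int exceeds_limit(size_t count)
{
    return count > STRING_LITERAL_LIMIT;
}
```

For n = 2047 the literal has 4094 bytes in EUC-KR, 6141 bytes in UTF-8 and 2047 characters, so only the UTF-8 byte count of option b) exceeds the limit.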

And what about wide strings? For a wide string literal whose
multibyte encoding is EUC-KR (that is converted to UTF-8 in TP1) and
wide character encoding is UCS-2, what should my compiler do?

a) emit a warning when the number of bytes in the EUC-KR-encoded
string exceeds the limit;
b) emit a warning when the number of bytes in the UTF-8-encoded
string exceeds the limit;
c) emit a warning when the number of "elements" in the wchar_t
array exceeds the limit; or
d) any of the above.
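
As an illustration of how the three counts can differ for one and the same text, the sketch below writes two Korean characters (U+D55C and U+AE00) as EUC-KR bytes, as UTF-8 bytes and as a wide string literal. It assumes a compiler that accepts universal character names in wide literals (C99 or later); ELEMS and wide_exceeds are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* The same two characters in three forms. The bytes are written as
   escapes so the example does not depend on this file's encoding. */
static const char euc_kr[]  = "\xC7\xD1\xB1\xDB";          /* EUC-KR bytes */
static const char utf8[]    = "\xED\x95\x9C\xEA\xB8\x80";  /* UTF-8 bytes */
static const wchar_t wide[] = L"\uD55C\uAE00";             /* wchar_t elements */

/* Elements in an array, not counting the terminating null. */
#define ELEMS(a) (sizeof (a) / sizeof (a)[0] - 1)

/* Option c): nonzero if the number of wchar_t elements exceeds limit. */
int wide_exceeds(const wchar_t *ws, size_t limit)
{
    assert(ws != NULL);
    return wcslen(ws) > limit;
}
```

Here ELEMS(euc_kr) is 4, ELEMS(utf8) is 6 and ELEMS(wide) is 2, which correspond to what options a), b) and c) would count.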

FYI, gcc chose b) and c) respectively for non-wide and wide string
literals and my implementation follows it for now.

Does the same interpretation apply to the limit on the number of
characters in a logical line?

Thanks in advance.

James Kuyper

Nov 13, 2011, 7:02:28 AM

On 11/13/2011 06:24 AM, Jun Woong wrote:
> When the standard says "character" in the translation limits, what
> does that term refer to? Is it a byte or a "real" character?
>
> C99 5.2.4.1:
> - 4095 characters in a logical source line

It says "characters"; given that the standard explicitly allows for
multi-byte characters, I don't see any way to justify counting bytes,
rather than characters, when there's a difference.

> - 4095 characters in a character string literal or wide string
> literal (after concatenation)

The same comment applies here.

> I'm modifying my compiler to issue warnings for code that contains a
> string or a line whose length exceeds those limits, but the term
> "character" bothers me.
>
> My compiler works as follows:
>
> - every code (whose character encoding is not ASCII) is converted
> to be encoded in UTF-8 in TP1; and
> - characters in character constants or (wide) string literals are
> converted in TP6 to the encoding the user specified.
>
> Consider a program that contains a non-wide string literal that is of
> M bytes long and encoded in EUC-KR, where M is the limit specified by
> the standard. Note that a Korean character is encoded by two bytes in
> EUC-KR and by three in UTF-8, thus there are M/2 "characters" in the
> literal.
>
> Now, what should my compiler do for it?
>
> a) a warning should not be emitted because the string literal has
> less than M+1 bytes in EUC-KR;
> b) a warning should be emitted because the string literal has more
> than M bytes in UTF-8;
> c) a warning should not be emitted because the string literal has
> less than M+1 Korean "characters" (it has M/2); or
> d) any of the above ones; it depends on how an implementer maps
> characters in TP1 and defines the "character" after TP1.

The correct answer is c. The implementor's freedom to define things does
not extend to taking a 3-byte representation of a single character, and
counting it as three characters.

> And what about wide strings? For a wide string literal whose
> multibyte encoding is EUC-KR (that is converted to UTF-8 in TP1) and
> wide character encoding is UCS-2, what should my compiler do?
>
> a) emit a warning when the number of bytes in the EUC-KR-encoded
> string exceeds the limit;
> b) emit a warning when the number of bytes in the UTF-8-encoded
> string exceeds the limit;
> c) emit a warning when the number of "elements" in the wchar_t
> array exceeds the limit; or
> d) any of the above ones.

Again, c seems the correct answer.

> FYI, gcc chose b) and c) respectively for non-wide and wide string
> literals and my implementation follows it for now.
>
> Does the same interpretation go for the limit on the number of
> characters in a logical line?

If they had wanted you to count bytes rather than characters, the limit
should have been expressed in bytes. Keep in mind that the number of
characters can be determined by examination of the source code; knowing
the number of bytes requires knowledge of the implementation-specific
multi-byte representation.
--
James Kuyper

Keith Thompson

Nov 13, 2011, 5:03:02 PM

Jun Woong <wo...@icu.ac.kr> writes:
> When the standard says "character" in the translation limits, what
> does that term refer to? Is it a byte or a "real" character?
>
> C99 5.2.4.1:
> - 4095 characters in a logical source line
> - 4095 characters in a character string literal or wide string
> literal (after concatenation)
>
> I'm modifying my compiler to issue warnings for code that contains a
> string or a line whose length exceeds those limits, but the term
> "character" bothers me.

Translation Phase 1 maps "[p]hysical source file multibyte characters"
to "the source character set". (This includes replacing trigraphs.)

TP2 splices physical lines ending in \ to "logical source lines".

So "characters in a logical source line" must refer to characters in the
source character set. Any multibyte character in the physical source
file counts as one character towards the 4095-character limit.
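
A minimal sketch of that counting rule, assuming the output of TP1 is valid, NUL-terminated UTF-8 with '\n' line endings (as in the compiler described in the original post) and ignoring trigraphs; logical_line_chars is a hypothetical name, not part of any existing implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in the logical source line that starts at src.
   A backslash immediately followed by a newline is spliced out (TP2)
   and contributes nothing to the count. If end is not NULL, *end is
   set to the position just past the logical line. */
size_t logical_line_chars(const char *src, const char **end)
{
    size_t n = 0;
    assert(src != NULL);
    while (*src != '\0' && *src != '\n') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;               /* line splice */
            continue;
        }
        /* Count only bytes that start a character; UTF-8 continuation
           bytes have the bit pattern 10xxxxxx. */
        if (((unsigned char)*src & 0xC0) != 0x80)
            n++;
        src++;
    }
    if (*src == '\n')
        src++;                      /* consume the terminating newline */
    if (end != NULL)
        *end = src;
    return n;
}
```

With this rule, two physical lines joined by a backslash count as one logical line, and a three-byte Korean character adds one to the count rather than three.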

[...]

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Woong Jun

Nov 14, 2011, 9:33:53 PM

James Kuyper <jameskuy...@verizon.net> wrote:
> On 11/13/2011 06:24 AM, Jun Woong wrote:

[...]

>
> If they had wanted you to count bytes rather than characters, the limit
> should have been expressed in bytes.

Seeing "byte" used in other items of TLs, I agree with you.

> Keep in mind that the number of
> characters can be determined by examination of the source code; knowing
> the number of bytes requires knowledge of the implementation-specific
> multi-byte representation.
>

The situation differs if iconv is used to convert character
encodings. Counting the number of bytes in a UTF-8 string converted
from a physical source line is easier than counting the number of
characters in the same string because I already have pointers to the
start and the end of the UTF-8 string and iconv() gives no
information on how many characters it converted. I suspect that gcc
uses the number of bytes for the same reason. As things stand, my
only option is to count the first byte of each character in the
UTF-8 string, which degrades performance.
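
For reference, that count can be sketched as a single pass over the bytes between the two pointers. The sketch assumes the string is valid UTF-8, which should hold if it came out of a successful iconv() conversion; utf8_char_count is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in the UTF-8 string [begin, end) by counting
   the bytes that are not continuation bytes (10xxxxxx). Every
   character has exactly one such byte, namely its first. */
size_t utf8_char_count(const char *begin, const char *end)
{
    size_t n = 0;
    assert(begin != NULL && end != NULL && begin <= end);
    for (const char *p = begin; p < end; p++)
        if (((unsigned char)*p & 0xC0) != 0x80)
            n++;
    return n;
}
```

The loop does one mask and one compare per byte, so its cost is linear in the length of the line; whether that is noticeable next to the conversion itself would have to be measured.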

--
Jun, Woong (woong.jun at gmail.com)

James Kuyper

Nov 14, 2011, 10:01:35 PM

On 11/14/2011 09:33 PM, Woong Jun wrote:
> James Kuyper <jameskuy...@verizon.net> wrote:
...

>> Keep in mind that the number of
>> characters can be determined by examination of the source code; knowing
>> the number of bytes requires knowledge of the implementation-specific
>> multi-byte representation.
>>
>
> The situation differs if iconv is hired to convert character
> encodings. Counting the number of bytes in a UTF-8 string converted
> from a physical source line is easier than counting the number of
> characters in the same string because I already have pointers to the
> start and the end of the UTF-8 string and iconv() gives no
> information on how many characters it converted. I suspect that gcc
> uses the number of bytes for the same reason. In the current state,
> I have no way other than counting the number of the first byte of
> each character in the UTF-8 string, which degrades performance.

This is a limit your implementation is supposed to satisfy, not one you
have an obligation to diagnose. If it is bytes, not characters, that
determine when your implementation actually starts having problems (and
I would expect that's normally the case), it is your obligation to
make sure it has sufficient capacity to store the bytes needed to
represent that number of characters.
While it is nice of you to diagnose code that exceeds these limits,
making it easy to do so wasn't one of the goals the committee had in
mind when writing that section.
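
As a rough sketch of what "sufficient capacity" means when the internal representation is UTF-8: a character takes at most four bytes there, so a buffer for a 4095-character line needs 4095 * 4 bytes plus a terminator. The names below are hypothetical, and a stateful encoding with shift sequences has no such fixed per-character bound.

```c
#include <assert.h>
#include <stddef.h>

#define LOGICAL_LINE_LIMIT 4095u  /* characters, C99 5.2.4.1 */
#define UTF8_MAX_BYTES 4u         /* longest UTF-8 encoding of one character */

/* Bytes needed to hold max_chars characters plus a terminating NUL,
   when each character takes at most bytes_per_char bytes. */
size_t line_buffer_bytes(size_t max_chars, size_t bytes_per_char)
{
    assert(bytes_per_char != 0);
    assert(max_chars <= ((size_t)-1 - 1) / bytes_per_char);  /* no overflow */
    return max_chars * bytes_per_char + 1;
}
```

For the logical source line limit this gives line_buffer_bytes(LOGICAL_LINE_LIMIT, UTF8_MAX_BYTES) == 16381.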
--
James Kuyper

Jun Woong

Nov 16, 2011, 12:12:57 AM

James Kuyper <jameskuy...@verizon.net> wrote:
> On 11/14/2011 09:33 PM, Woong Jun wrote:
> > James Kuyper <jameskuy...@verizon.net> wrote:
> ...
> >> Keep in mind that the number of
> >> characters can be determined by examination of the source code; knowing
> >> the number of bytes requires knowledge of the implementation-specific
> >> multi-byte representation.
>
> > The situation differs if iconv is hired to convert character
> > encodings. Counting the number of bytes in a UTF-8 string converted
> > from a physical source line is easier than counting the number of
> > characters in the same string because I already have pointers to the
> > start and the end of the UTF-8 string and iconv() gives no
> > information on how many characters it converted. I suspect that gcc
> > uses the number of bytes for the same reason. In the current state,
> > I have no way other than counting the number of the first byte of
> > each character in the UTF-8 string, which degrades performance.
>
> This is a limit your implementation is supposed to satisfy, not one you
> have an obligation to diagnose. If it is bytes, not characters, that
> determine when your implementation actually starts having problems (and
> I would expect that's normally the case),

You're right.

> it is your obligation to
> make sure it has sufficient capacity to store the bytes needed to
> represent that number of characters.

My implementation has two modes for handling input lines: in one it
dynamically enlarges the storage to accommodate input lines, while in
the other it uses a fixed-size buffer that acts like a sliding window
over the input stream. When the switch for encoding conversion is on,
it always runs in the former mode, so there is no problem unless the
enlargement fails. BTW, (even if the TLs are said to be rubber teeth)
it would not be easy for an implementation using a fixed-size buffer
to always satisfy the limits because a program may contain an
arbitrarily long shift sequence, say, in ISO-2022-JP.
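
The dynamically enlarged mode could look roughly like the following sketch. It is not the actual code of the implementation; struct line_buf and line_buf_append are hypothetical names, and doubling the capacity is just one common growth policy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Growable line buffer; initialize with {0}. */
struct line_buf {
    char *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
};

/* Append n bytes from src, enlarging the storage as needed.
   Returns 0 on success, or -1 if the enlargement fails, in which
   case the buffer is left unchanged. */
int line_buf_append(struct line_buf *b, const char *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));
    if (n > (size_t)-1 - b->len)
        return -1;                      /* len + n would overflow */
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < b->len + n) {
            if (cap > (size_t)-1 / 2) { /* cannot double any further */
                cap = b->len + n;
                break;
            }
            cap *= 2;
        }
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    if (n != 0)
        memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

A buffer like this can hold a line of any length as long as realloc() succeeds, which is the "unless the enlargement fails" case mentioned above.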

> While it is nice of you to diagnose code that exceeds these limits,
> making it easy to do so wasn't one of the goals the committee had in
> mind when writing that section.

Agreed.