Meaning of being representable in the execution character set

Daniel Fishman

Jun 8, 2017, 5:16:14 PM
to std-dis...@isocpp.org
The C++14 Standard says in [lex.ccon]/1:

"An ordinary character literal that contains a single c-char representable in the
execution character set has type char, with value equal to the numerical value
of the encoding of the c-char in the execution character set"

There seems to be a bit of a problem here: as far as I can see, the term
"representable in the execution character set" is not explicitly defined anywhere
in the Standard. If the implementation uses UTF-8 as its execution character set,
then it seems clear that CYRILLIC CAPITAL LETTER A, for example, is representable
in the execution character set, since the letter can be encoded in UTF-8. But since
the numerical value of the literal's encoding is usually larger than the maximum
value of a char, its type cannot be char.

Wouldn't it be more correct to say something like: "...representable in the
execution character set and having numerical value of the encoding of the c-char
in the execution character set representable in a char, has type char..."?
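
A minimal sketch of the size mismatch in question, assuming a C++11 or later compiler and 8-bit bytes. The helper names are hypothetical and do not come from the Standard. The letter is written as a universal-character-name inside a u8 string literal so that the result does not depend on how the source file itself is encoded:

```cpp
#include <cstddef>
#include <type_traits>

// An ordinary character literal holding one basic-character-set member
// has type char and therefore occupies exactly one byte.
static_assert(std::is_same<decltype('A'), char>::value, "'A' has type char");
static_assert(sizeof('A') == 1, "sizeof(char) is 1 by definition");

// Hypothetical helper: number of UTF-8 code units needed for
// U+0410 CYRILLIC CAPITAL LETTER A. sizeof counts the terminating
// null of the string literal, hence the subtraction.
constexpr std::size_t cyrillic_a_utf8_length() {
    return sizeof(u8"\u0410") - 1;
}

// Hypothetical helper: the i-th UTF-8 code unit of U+0410 as an
// unsigned value (works whether the element type is char or char8_t).
constexpr unsigned cyrillic_a_utf8_unit(std::size_t i) {
    return static_cast<unsigned char>(u8"\u0410"[i]);
}

// The letter needs two code units, so it cannot be stored in one char.
static_assert(cyrillic_a_utf8_length() == 2, "U+0410 is two UTF-8 code units");
static_assert(cyrillic_a_utf8_unit(0) == 0xD0, "lead byte");
static_assert(cyrillic_a_utf8_unit(1) == 0x90, "continuation byte");
```

The sketch deliberately avoids writing the letter as an ordinary character literal, because how a compiler treats that literal is exactly what is unclear here.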

Thiago Macieira

Jun 8, 2017, 7:04:58 PM
to std-dis...@isocpp.org
On Thursday, 8 June 2017 14:16:08 PDT Daniel Fishman wrote:
> The C++14 Standard says in [lex.ccon]/1:
>
> "An ordinary character literal that contains a single c-char representable
> in the execution character set has type char, with value equal to the
> numerical value of the encoding of the c-char in the execution character
> set"
>
> There seems to be a bit of a problem here: as far as I can see, the term
> "representable in the execution character set" is not explicitly defined
> anywhere in the Standard. If the implementation uses UTF-8 as its execution
> character set, then it seems clear that CYRILLIC CAPITAL LETTER A, for
> example, is representable in the execution character set, since the letter
> can be encoded in UTF-8. But since the numerical value of the literal's
> encoding is usually larger than the maximum value of a char, its type
> cannot be char.

This is mostly because of old, non-ASCII encodings. You could write the source
code in ASCII and have EBCDIC (for example) as the execution character set.
That meant you could have characters in the source that are not representable
in the execution one.

The opposite is impossible: if you can't write it in the source, then there is
no source code that has that construct. And note that multibyte sequences are
not taken into account: you can't have multibyte character literals in the
source encoding nor in the execution charset. You need strings for that.

> Wouldn't it be more correct to say something like: "...representable in the
> execution character set and having numerical value of the encoding of the
> c-char in the execution character set representable in a char, has type
> char..."?

I don't think so, because I think it's redundant. I disagree with you that any
Cyrillic letter is a valid char in UTF-8 because it requires multibyte
sequences. Therefore, the extra qualification you added is unnecessary.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Daniel Fishman

Jun 9, 2017, 2:19:31 AM
to std-dis...@isocpp.org

> ...no source code that has that construct. And note that multibyte sequences are
> not taken into account: you can't have multibyte character literals in the
> source encoding nor in the execution charset. You need strings for that.

Where does the Standard say that? I would say that the reverse is true:
a character literal is a c-char, which is any member of the source character set,
and a Cyrillic letter is definitely a member of the source character set, in spite
of using more than one byte for its representation. You need strings for
multicharacter literals, not for multibyte characters.


> I don't think so, because I think it's redundant. I disagree with you that any
> Cyrillic letter is a valid char in UTF-8 because it requires multibyte
> sequences. Therefore, the extra qualification you added is unnecessary.
>

According to what I wrote above, while it is not a valid char, it is a valid
character literal representable in UTF-8.

Thiago Macieira

Jun 9, 2017, 2:36:22 AM
to std-dis...@isocpp.org
Because it's not a character literal if it requires more than one char.

Daniel Fishman

Jun 9, 2017, 3:45:02 AM
to std-dis...@isocpp.org

> Because it's not a character literal if it requires more than one char.
>

Where does the Standard say so?

Tom Honermann

Jun 9, 2017, 11:26:09 AM
to std-dis...@isocpp.org
On 06/09/2017 03:44 AM, Daniel Fishman wrote:
>
>> Because it's not a character literal if it requires more than one char.
>>
>
> Where does the Standard say so?
>
I think the relevant text is [lex.ccon] 2.13.3p8:

"... The value of a character literal is implementation-defined if it
falls outside of the implementation-defined range defined for char (for
literals with no prefix) or wchar_t (for literals prefixed by L). [
Note: If the value of a character literal prefixed by u, u8, or U is
outside the range defined for its type, the
program is ill-formed. — end note ]"


Tom.
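
The range check that the quoted paragraph turns on can be sketched roughly as follows. This assumes CHAR_BIT == 8, and fits_in_char is a hypothetical helper, not something the Standard defines. It only tests whether an integer value lies within the range of char; what an implementation does with a literal whose value falls outside that range is implementation-defined and is not modeled here:

```cpp
#include <limits>

// Hypothetical helper: true if `value` lies within the range of char
// on this implementation (char may be signed or unsigned).
constexpr bool fits_in_char(long value) {
    return value >= std::numeric_limits<char>::min()
        && value <= std::numeric_limits<char>::max();
}

// 0x41 is 'A' in ASCII and fits whether char is signed or unsigned.
static_assert(fits_in_char(0x41), "0x41 fits in char");

// Neither the code point U+0410 nor its two UTF-8 code units packed
// into one integer (0xD090) fits in an 8-bit char.
static_assert(!fits_in_char(0x0410), "code point value is out of range");
static_assert(!fits_in_char(0xD090), "packed UTF-8 bytes are out of range");
```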
