Minimum size of a char for implementations that use utf32 as execution character set


Daniel Fishman

Jun 8, 2017, 5:33:04 PM
to std-dis...@isocpp.org
The C++14 Standard says in [basic.fundamental]/1:

"Objects declared as characters (char) shall be large enough to store any member
of the implementation’s basic character set"

Does it mean that the implementation cannot use encodings in which members
of the basic character set cannot be stored in a char? In other words, does it
mean that if the implementation uses UTF-32, then the size of a char must be at
least 4 bytes?

Nicol Bolas

Jun 8, 2017, 5:55:42 PM
to ISO C++ Standard - Discussion

There is a difference between "basic character set" and "execution character set". In the former, the characters in that set are defined by the standard in [lex.charset]/1. The latter is an implementation-defined set which must include all of the characters in the basic execution character set, but may include others.

The requirement is that `char` must be large enough to store the characters in the "basic character set". Of course, if the implementation so desired, it could store all characters as Unicode UTF-32 code units. In which case, `char` would be 4 bytes. But this would be due to the implementation's decision to store its strings that way, not because of the above restriction.

Daniel Fishman

Jun 8, 2017, 6:29:51 PM
to std-dis...@isocpp.org

> There is a difference between "basic character set" and "execution character
> set". In the former, the characters in that set are defined by the standard in
> [lex.charset]/1. The latter is an implementation-defined set which must include
> all of the characters in the basic execution character set, but may include others.
>
> The requirement is that `char` must be large enough to store the characters in
> the "basic character set". Of course, if the implementation so desired, it could
> store all characters as Unicode UTF-32 code units. In which case, `char` would
> be 4 bytes. But this would be due to the implementation's decision to store its
> strings that way, not because of the above restriction.

The implementation's decision to use UTF-32 does not automatically mean that 'char'
would be 4 bytes - for example, if you ask g++ to use UTF-32 as its execution
character set (using the option '-fexec-charset=UTF-32' during compilation),
the size of a char is still 1 byte. It is only because of [basic.fundamental]/1
that the char would have to be 4 bytes.

What I wanted to understand is whether g++'s behaviour (using a 1-byte char when
UTF-32 is used) is compliant. For the time being it seems that its behaviour
is not compatible with the Standard.


Thiago Macieira

Jun 8, 2017, 7:06:03 PM
to std-dis...@isocpp.org
The size of char is always 1 byte - by definition, because a byte is defined as
the size of a char.

For a UTF-32 execution charset, you need a 32-bit byte. That's all.
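
A quick way to see both halves of that on a given implementation (a sketch; the printed value is 8 on mainstream platforms, and would have to be at least 32 on an implementation whose execution character set really used UTF-32 code units in a plain char):

#include <climits>
#include <cstdio>

int main() {
    static_assert(sizeof(char) == 1, "a char is one byte by definition");
    // The byte itself may be wider than 8 bits; CHAR_BIT says how wide.
    std::printf("CHAR_BIT = %d bits per byte\n", CHAR_BIT);
}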

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Nicol Bolas

Jun 9, 2017, 10:39:46 AM
to ISO C++ Standard - Discussion


On Thursday, June 8, 2017 at 6:29:51 PM UTC-4, Daniel Fishman wrote:

> > There is a difference between "basic character set" and "execution character
> > set". In the former, the characters in that set are defined by the standard in
> > [lex.charset]/1. The latter is an implementation-defined set which must include
> > all of the characters in the basic execution character set, but may include others.
> >
> > The requirement is that `char` must be large enough to store the characters in
> > the "basic character set". Of course, if the implementation so desired, it could
> > store all characters as Unicode UTF-32 code units. In which case, `char` would
> > be 4 bytes. But this would be due to the implementation's decision to store its
> > strings that way, not because of the above restriction.
>
> The implementation's decision to use UTF-32 does not automatically mean that 'char'
> would be 4 bytes - for example, if you ask g++ to use UTF-32 as its execution
> character set (using the option '-fexec-charset=UTF-32' during compilation),
> the size of a char is still 1 byte. It is only because of [basic.fundamental]/1
> that the char would have to be 4 bytes.

`sizeof(char)` will always be 1, no matter what. C++ defines the sizes and alignments of objects relative to the size of `char`.

The question you're asking is whether a `char` will take up multiple machine bytes. Or, more specifically, whether `sizeof(char) == sizeof(int)` or some such. That's up to the compiler.

Basically, the meaning of `-fexec-charset=UTF-32` is whatever the compiler would like it to mean.

Then, there's the question of the distinction between "character set" and "encoding". UTF-32 is an encoding; it is not a character set. Unicode is a character set (technically, the Unicode character set is a character set, but never mind that now).

So it's not really clear what "-fexec-charset=UTF-32" means, with respect to the standard.
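
One way to probe what the flag actually does is to dump what a narrow string literal turns into (a sketch; the values in the comments are what one would expect from g++ on an ordinary 8-bit-byte platform, so treat them as illustrative rather than authoritative):

#include <cstdio>

int main() {
    const char s[] = "A";
    // With the default execution charset, sizeof s is 2: 'A' plus the terminator.
    // If -fexec-charset really re-encoded literals as 16- or 32-bit code units
    // split across 8-bit chars, sizeof s would grow (and embedded zero bytes
    // would appear) rather than char itself getting wider.
    std::printf("sizeof s = %zu, first byte = 0x%02x\n",
                sizeof s, (unsigned)(unsigned char)s[0]);
}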

Thiago Macieira

Jun 9, 2017, 12:07:38 PM
to std-dis...@isocpp.org
On Friday, 9 June 2017 07:39:46 PDT Nicol Bolas wrote:
> The question you're asking is whether a `char` will take up multiple
> machine bytes. Or more specifically, that `sizeof(char) == sizeof(int)` or
> somesuch. That's up to the compiler.
>
> Basically, the meaning of `-fexec-charset=UTF-32` is whatever the compiler
> would like it to mean.
>
[cut]
> So it's not really clear what "-fexec-charset=UTF-32" means, with respect
> to the standard.

That's exactly it: this is not a Standard problem, but first a compiler
problem. You have to ask GCC devs what using
-fexec-charset=something that requires more than 8 bits

will do on a regular, modern processor. It's very likely that it's simply not
supported, should be rejected and that the only reason it works is that the
flag accepts any value that iconv/gconv does, including nonsensical ones.

Other compilers may support this but cause CHAR_BIT to increase by using a
byte that is larger than the machine's byte, like Nicol suggested.

Jens Maurer

Jun 9, 2017, 12:25:18 PM
to std-dis...@isocpp.org
On 06/08/2017 11:33 PM, Daniel Fishman wrote:
> The C++14 Standard says in [basic.fundamental]/1:
>
> "Objects declared as characters (char) shall be large enough to store any member
> of the implementation’s basic character set"

This sentence is slightly unclear in whether it means
"basic execution character set" or "basic source character set".
Since the latter does not depend on the implementation
(cf. [lex.charset] p1), only the former makes sense.
But we should make the text explicit.

> Does it means that the implementation cannot use encodings in which members
> of the basic character set cannot be stored in a char? In other words, does it
> means that if the implementation uses utf32, then the size of a char must be at
> least 4 bytes?

According to [lex.charset] p3, there's a basic execution character set
and an execution character set.

I think an implementation could use ASCII for the basic execution character
set (whose encoding fits into a 1-octet char) and Unicode (note: UTF32 is an
encoding, not a set) for the execution character set.

The net result from all the text is, in my view, that a "char" needs
to be able to store the value of a member of the basic source character
set, which is explicitly defined in [lex.charset] p1.

Jens


Daniel Fishman

Jun 9, 2017, 12:32:55 PM
to std-dis...@isocpp.org

> That's exactly it: this is not a Standard problem, but first a compiler
> problem. You have to ask GCC devs what using
> -fexec-charset=something that requires more than 8 bits
>
> will do on a regular, modern processor. It's very likely that it's simply not
> supported, should be rejected and that the only reason it works is that the
> flag accepts any value that iconv/gconv does, including nonsensical ones.
>
> Other compilers may support this but cause CHAR_BIT to increase by using a
> byte that is larger than the machine's byte, like Nicol suggested.

It indeed seems that the option works only by chance and doesn't do things
properly - when used, a char still occupies only one machine byte and
overflows when assigned anything above 255. And sizeof('a') is 4.
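
For reference, a test along these lines shows it (a sketch; the first conversion is implementation-defined for out-of-range values, and the commented results are what g++ with -fexec-charset=UTF-32 appears to produce on an 8-bit-byte platform):

#include <cstdio>

int main() {
    char c = 0x10000;            // does not fit in an 8-bit char; g++ warns and truncates
    std::printf("c = %d\n", c);
    // Under -fexec-charset=UTF-32 the literal 'a' no longer fits in a char,
    // so it is treated like an overlong character constant and gets type int.
    std::printf("sizeof('a') = %zu\n", sizeof('a'));
}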


Thiago Macieira

Jun 9, 2017, 3:43:41 PM
to std-dis...@isocpp.org
Yup:

long x = sizeof('a');

gcc (regular):
x is 1
note: C++; in C, that would be 4 since 'a' is int, not char.

gcc -fexec-charset=utf-16
warning: multi-character character constant [-Wmultichar]
x is 4

gcc -fexec-charset=utf-32
warning: character constant too long for its type
x is 4
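
Presumably from a reproducer along these lines (a sketch; compile it as C++ once without the flag and once with each -fexec-charset value shown above):

#include <cstdio>

int main() {
    long x = sizeof('a');
    std::printf("x is %ld\n", x);
}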

Yubin Ruan

Jun 10, 2017, 8:04:19 AM
to std-dis...@isocpp.org
> Basically, the meaning of `-fexec-charset=UTF-32` is whatever the compiler
> would like it to mean.

So are there any GCC docs for this? The current GCC docs[1] seem to
say nothing about it...

[1]: https://gcc.gnu.org/onlinedocs/cpp/Invocation.html

--
Yubin

Thiago Macieira

Jun 10, 2017, 1:13:45 PM
to std-dis...@isocpp.org
The lack of documentation is probably all the information you need to conclude
that the option is nonsense and you shouldn't use it.

Aarón Bueno Villares

Sep 17, 2017, 12:01:46 AM
to ISO C++ Standard - Discussion
[intro.memory]1 The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. [...]

[lex.charset]1 The basic source character set consists of 96 characters [...]

[lex.charset]3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. [...]

[basic.fundamental]1 Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set. [...]

[expr.sizeof]1 [...]. sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1. [...]

[lex.phases]1.1 Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. [...]

[lex.phases]1.5 Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set (2.14.3, 2.14.5); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
 
The idea behind all of these quotes is:

The compiler must recognize each character written in the source file. For that, the compiler of course needs to know the encoding of the source file, but how the compiler represents each character internally is implementation-defined. The compiler's internal encoding has, in theory, nothing to do with the input or execution encoding. Each appearance of a non-basic character must be treated as if it were written using the universal-character-name syntax (\u + Unicode code point in hexadecimal). But the compiler may internally treat the characters found in source files however it wishes (for example, translating from the input encoding to UTF-8 without the universal-character-name syntax, or translating them directly to the execution character set). What matters is that the compiler recognizes the input characters well enough to parse the syntax and to translate character and string literals properly into the output encoding.

Non-basic characters cannot appear outside identifiers or char/string literals. Support for non-basic characters inside identifiers is implementation-defined, and they usually aren't supported, because they can cause problems when linking (different linkers can treat the encoding of non-ASCII symbols differently), and because the execution encoding likewise has nothing to do with the encoding of identifiers in the executable.

So, what matters is the characters written inside char and string literals. Objects of type `char` must support at least 256 different values in execution, because an eight-bit UTF-8 code unit can take 256 different values (and the basic execution character set, which includes the 96 characters of the basic source character set, must also fit). But a char can hold more values if the compiler wants to support them. That requires more physical storage for a bigger range of values than the minimum required. Regardless of the physical size, `sizeof(char) == 1`.

The physical storage width of a char is what physically defines the memory unit of C++ (the C++ byte), because `sizeof(unsigned char) == sizeof(signed char) == sizeof(char) == 1`, and the object representation of an object is the sequence of `unsigned char`s required to store it. So `sizeof(T)` is the number of `unsigned char`s that an object of type T requires to be stored, which, of course, can differ from the number of physical machine bytes needed to store that object if the physical size of a char is bigger than one machine byte.
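
A small example of that relationship (a sketch assuming a typical implementation with CHAR_BIT == 8 and a 4-byte int; the order of the printed bytes depends on endianness):

#include <climits>
#include <cstddef>
#include <cstdio>

int main() {
    int n = 0x01020304;
    // The object representation of n is a sequence of sizeof(int) unsigned chars.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&n);
    std::printf("CHAR_BIT = %d, sizeof(int) = %zu\n", CHAR_BIT, sizeof(int));
    for (std::size_t i = 0; i < sizeof n; ++i)
        std::printf("byte %zu: 0x%02x\n", i, (unsigned)p[i]);
}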

So, the size of a char is not only about characters, but also about how memory is composed. And then comes the execution character set (the execution encoding).

If the user specifies an execution encoding whose code units need 32 bits (UTF-32, for instance), each char will have to be able to store at least 2^32 different values, but `sizeof(char)` will still be 1, and `sizeof(int)` would be 1 as well if `int` is physically stored in 4 machine bytes. And remember that objects of type char must merely be "large enough", so the physical size of a char could be even bigger (which, of course, it never is in practice).

Is it a good idea to specify UTF-32 as the execution encoding? Never, because the storage of an int is recommended to have the natural (fastest) size for the host machine. If the natural size of the host machine is 2 bytes (unlikely today in the vast majority of situations), you are still forcing the compiler to work with ints of 4 physical bytes, because each int is composed of a whole number of chars (in this case, 1) and a single char already occupies 4 physical bytes.