std::byte thoughts


pec...@gmail.com

Oct 31, 2016, 8:28:20 AM
to ISO C++ Standard - Future Proposals
I noticed the "byte type definition" proposal (P0298R1) in the latest WG21 mailings and want to share some concerns. While I agree that a fundamental byte type is indeed missing from C++ and should be added, I don't think the presented approach is the right one. The proposal suggests adding it as a library type in namespace std, arguing that C++17 is expressive enough for a simple library definition, as opposed to a keyword. I am not saying that's impossible, but the question is: if the new type is added this way, will it fit with the other fundamental types, which are defined as keywords? Will it confuse users? That, IMHO, is the weak point of this proposal.

The fundamental type set can already be seen as confusing. First we have shortcuts: short int becomes short, signed int becomes int, and so on. This is obviously shared with C and cannot be changed; we have all learned to live with it. But then char is a distinct type, not a shortcut for signed char. OK, confusion number one, but it happened a long time ago, so there is nothing to do here either. Then the user has to learn that some newer types carry a _t suffix, like wchar_t, char16_t and char32_t. Inexperienced programmers tend to think _t means a typedef'd type, because that is a convention used by some libraries, and wchar_t really was implemented as a typedef in earlier MS compilers. But that's no longer true: all of these are distinct opaque types, so a typedef cannot be used for them. I assume _t was used as an uglifier that helps preserve source compatibility. So, confusion number two. And now we want to add std::byte. Writing using namespace std everywhere is not good practice, and doing it in headers is widely considered wrong. So users will have to learn another rule: the new generation of types has to be prefixed with a namespace qualifier. I cannot imagine a better way to raise the confusion to an even higher level.

We are still talking about the fundamental type set, not anything complicated. Simplicity and clarity matter. So why so many rules, exceptions and experiments? Why don't we use a byte_t keyword, so it would at least fit with the second-generation types? If that is not possible, I consider the new proposal so confusing to users that it shouldn't be accepted; we have lived without it since the beginning anyway. What do others here think?

mihailn...@gmail.com

Oct 31, 2016, 9:58:14 AM
to ISO C++ Standard - Future Proposals, pec...@gmail.com
Off the top of my head:
 - A native byte keyword is out of the question, because it would break existing code that already defines something else with that name.
Also, it would be very confusing to have two basic types with exactly the same properties that are not "shortcuts" for one another.
 - A byte_t keyword would break code as well.

Also, you can always write `using std::byte;` in your code and you are all set.

That said, it would probably be more consistent to call it std::byte_t, like std::size_t, std::nullptr_t, std::nullopt_t and std::align_val_t.

The last one is particularly relevant, because it is already just enum class align_val_t : size_t {}, so overloading can pick it up.

byte is much more akin to these basic types than to string, vector or map, which don't have a _t suffix.
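For reference, the enum-class technique mentioned above can be sketched like this (local illustrative names, not the normative definitions):

```cpp
#include <cstddef>  // std::size_t

// Sketch of the strong-alias-via-scoped-enum technique: C++17 specifies
// align_val_t as "enum class align_val_t : size_t {}", and P0298 proposes
// the same shape for byte over unsigned char. The names below are local
// illustrations only.
namespace sketch {
enum class align_val_t : std::size_t {};
enum class byte : unsigned char {};
}

// The payoff: overload resolution can tell the strong alias apart from
// its underlying type, which is how the operator new overloads taking an
// alignment get picked up.
inline int which(std::size_t)         { return 0; }  // plain size_t
inline int which(sketch::align_val_t) { return 1; }  // the strong alias
```

The scoped enumeration gets a distinct identity for overloading while costing nothing at runtime.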




Dejan Milosavljevic

Nov 1, 2016, 4:28:23 PM
to std-pr...@isocpp.org
In <cstdint> there is uint8_t.
So far it is optional, but in GCC and MSVC the type exists.



D. B.

Nov 1, 2016, 5:23:49 PM
to std-pr...@isocpp.org
It must always be optional, because C and C++ avoid setting rules (even ones with 99% coverage) for how an implementation must represent numbers.

But anyway, the linked proposal explicitly states that "std::byte is not an integer and not a character". std::uint8_t is always the former, and on, I'd bet, all of our machines here it's also a typedef for the latter. So how is it relevant?

Mind you, std::byte will ultimately contain an unsigned char, just with some opaque wrapping around it to save the programmer from themselves in some ill-defined way. I don't really see much need for it, but as another tool in the stdlib, sure, the more the merrier.

I will say that I think the proposal oversteps its domain when it expects a library wrapper type to participate in the special cases of object lifetime, aliasing, etc. Come on! The authors need to decide whether they're proposing a convenience or a new fundamental type. I was iffy enough about std::initializer_list straddling that boundary, and I'm really not sure this justifies another case of it, on a far flimsier pretext, AFAICT.

Thiago Macieira

Nov 1, 2016, 9:21:27 PM
to std-pr...@isocpp.org
On Tuesday, 1 November 2016, at 21:28:21 PDT, Dejan Milosavljevic wrote:
> In <cstdint> there is uint8_t.
> So far it is optional.
> In GCC and MSVS this type exists.

Bytes don't have to be 8 bits.

Note that we don't strictly need the typedef. It's just a convenience so that
you don't see "char" and get confused.
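The "just a convenience" point can be checked directly; on the mainstream implementations discussed in this thread, uint8_t is a typedef for unsigned char. The standard does not guarantee this, since uint8_t is optional and could in principle be a distinct extended integer type:

```cpp
#include <cstdint>
#include <type_traits>

// Holds on GCC, Clang and MSVC, but is not mandated by the standard:
// uint8_t, where it exists, is in practice spelled unsigned char.
static_assert(std::is_same<std::uint8_t, unsigned char>::value,
              "uint8_t is a typedef for unsigned char here");
```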

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Nicol Bolas

Nov 1, 2016, 11:10:28 PM
to ISO C++ Standard - Future Proposals


On Tuesday, November 1, 2016 at 9:21:27 PM UTC-4, Thiago Macieira wrote:
On Tuesday, 1 November 2016, at 21:28:21 PDT, Dejan Milosavljevic wrote:
> In <cstdint> there is uint8_t.
> So far it is optional.
> In GCC and MSVS this type exists.

Bytes don't have to be 8 bits.

Note that we don't strictly need the typedef. It's just a convenience so that
you don't see "char" and get confused.

Actually, no. `std::byte` as proposed by P0298 is explicitly not a typedef. It is a scoped enumeration whose underlying type is `unsigned char`, but it is not a typedef.

Granted, I can't agree with that design decision. I would much rather it be a genuine type, rather than using C++ enum chicanery to get strong aliases.
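A minimal sketch of the design being described, using local names rather than the actual <cstddef> definition:

```cpp
// Sketch of P0298's approach: byte is a scoped enumeration over
// unsigned char, so there are no implicit conversions to or from
// integers; bitwise operators and to_integer are supplied explicitly.
// Local namespace for illustration only.
namespace p0298_sketch {

enum class byte : unsigned char {};

constexpr byte operator|(byte l, byte r) noexcept {
    return byte(static_cast<unsigned char>(l) | static_cast<unsigned char>(r));
}
constexpr byte operator&(byte l, byte r) noexcept {
    return byte(static_cast<unsigned char>(l) & static_cast<unsigned char>(r));
}

// The only sanctioned escape hatch back to arithmetic:
template <class IntegerType>
constexpr IntegerType to_integer(byte b) noexcept {
    return static_cast<IntegerType>(b);
}

}  // namespace p0298_sketch
```

Because the enumeration has a fixed underlying type, `byte{0x40}` list-initialization is valid in C++17, which is what makes the enum usable at all without enumerators.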

Thiago Macieira

Nov 2, 2016, 12:55:46 AM
to std-pr...@isocpp.org
On Tuesday, 1 November 2016, at 20:10:28 PDT, Nicol Bolas wrote:
> Actually, no. `std::byte` as proposed by P0298 is explicitly *not* a
> typedef. It is a scoped enumeration whose underlying type is `unsigned
> char`, but it is not a typedef.
>
> Granted, I can't agree with that design decision. I would much rather it be
> a genuine type, rather than using C++ enum chicanery to get strong aliases.

That would also not be a good idea, because unsigned char is, by definition,
a byte. Why should we have more types that mean exactly the same thing, and
this time on all platforms, by definition?

Nicol Bolas

Nov 2, 2016, 1:16:32 AM
to ISO C++ Standard - Future Proposals
On Wednesday, November 2, 2016 at 12:55:46 AM UTC-4, Thiago Macieira wrote:
On Tuesday, 1 November 2016, at 20:10:28 PDT, Nicol Bolas wrote:
> Actually, no. `std::byte` as proposed by P0298 is explicitly *not* a
> typedef. It is a scoped enumeration whose underlying type is `unsigned
> char`, but it is not a typedef.
>
> Granted, I can't agree with that design decision. I would much rather it be
> a genuine type, rather than using C++ enum chicanery to get strong aliases.

That would also not be a good idea, because unsigned char is, by definition,
a byte. Why should we have more types that mean exactly the same thing, and
this time on all platforms, by definition?

Because "unsigned char" also means "unsigned character". With just `unsigned char`, there is no way to distinguish between manipulating bytes and manipulating unsigned characters.

That's what `byte` is for, as a type: a way to semantically differentiate between operations on bytes and operations on characters. The types can be inter-convertible, numerically speaking, but they don't mean the same thing.

My problem with using C++ enum chicanery is that, if you use the type traits mechanisms to ask what `std::byte` is, it will say that it's an enum, not that it's an integral type. There's no reason why `byte` should not be an integral type.

Think about it. The C++ standard allows a break in strict aliasing rules for `unsigned char` and `char`. Why? Not because the standard thinks it's reasonable for people to alias UTF-8 strings. But because that's the only way to pass/manipulate a byte array. And byte arrays need to be able to alias.

It's the same reason why we have `char16_t` as a distinct type from `uint_least16_t`: there is a fundamental semantic difference between an array of unsigned integers that are at least 16 bits in size and an array of UTF-16 code units. One of these is a string; the other is not.

It's time we had such a distinction for bytes. And UTF-8 code units, for that matter.
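The distinction argued for here is exactly what overloading can exploit; a small sketch using C++17's std::byte (the type P0298 eventually became) and a hypothetical describe function:

```cpp
#include <cstddef>  // std::byte (C++17)
#include <string>

// With distinct types, the same call site dispatches to "treat as text"
// vs "treat as raw storage". `describe` is a hypothetical function name
// used only to illustrate the overload-based distinction.
std::string describe(const char*)      { return "text"; }
std::string describe(const std::byte*) { return "raw storage"; }
```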

Jared Grubb

Nov 2, 2016, 2:40:42 AM
to ISO C++ Standard - Future Proposals

100% agree. I know I've had a few cases where I was annoyed by the lack of an 8-bit number type. Although this proposal does not fix the inconsistency in the following example, it at least provides an 8-bit option.

#include <cstdint>
#include <iostream>

int main()
{
    std::cout << (std::uint32_t)65 << '\n'; // prints 65
    std::cout << (std::uint16_t)65 << '\n'; // prints 65
    std::cout << (std::uint8_t)65 << '\n';  // Surprise! prints 'A', because
                                            // uint8_t is a character type
}
 

Thiago Macieira

Nov 2, 2016, 2:43:35 AM
to std-pr...@isocpp.org
On Tuesday, 1 November 2016, at 22:16:32 PDT, Nicol Bolas wrote:
> On Wednesday, November 2, 2016 at 12:55:46 AM UTC-4, Thiago Macieira wrote:
> > That would also not be a good idea because, unsigned char is, by
> > defintion, a
> > byte. Why should we have (more) types that mean exactly the same thing,
> > and
> > this time in all platforms, by definition?
>
> Because "unsigned char" *also* means "unsigned character". With just
> `unsigned char`, there is no way to distinguish between manipulating bytes
> and manipulating unsigned characters.
>
> That's what `byte` is for, as a type: a way to semantically differentiate
> between operations on bytes and operations on characters. The types can be
> inter-convertible, numerically speaking, but they don't mean the same thing.

I'm sorry, I don't agree that there's a distinction in the first place. Bytes
are used for more than just copying around: if you add, subtract, shift left
or right, perform bitwise operations, etc., you need the value. And if I need
the value, then a zero is a zero is a zero, and a 0x40 is still a 0x40.

Also, I can assign 'a' to any integer type. Maybe this was the main issue:
that single-quote character literals automatically convert to integral,
instead of staying a character. We're 40 years too late to change this, though
(since B).

> My problem with using C++ enum chicanery is that, if you use the type
> traits mechanisms to ask what `std::byte` is, it will say that it's an
> enum, not that it's an integral type. There's no reason why `byte` should
> not be an integral type.

Agreed, but I'm going to go further and say that it's pretty useless for a lot
of use cases. I need the value in a lot of operations, and an enum won't give
it to me unless I first cast it to a suitable integral type -- unsigned char
(that is, a *real* byte).

This week I've been spending time working on hashing algorithms, notably
SipHash (btw, implementations should reconsider their std::hash algorithms).
In order to implement it, I needed to access the byte array and do byte-level
operations like rotating left, XOR, and additions. Not only would std::byte
not work for me, I fail to see how the operations I'm doing are any different
from the operations on an unsigned char.
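The point can be made concrete: a hashing-style rotate is natural on an unsigned integral type, while an enum-based byte would force a cast to an integer and back at every step (rotl8 is a hypothetical helper, not a standard function):

```cpp
#include <cstdint>

// Rotate-left on a byte-sized value, as used in hashing kernels. This is
// trivial on uint8_t/unsigned char; with an enum-based std::byte every
// arithmetic step below would additionally need a to_integer cast and a
// re-wrapping cast back to byte.
constexpr std::uint8_t rotl8(std::uint8_t v, unsigned n) {
    n %= 8;                       // normalize the shift count
    if (n == 0) return v;
    return static_cast<std::uint8_t>((v << n) | (v >> (8 - n)));
}
```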

> It's the same reason why we have `char16_t` as a distinct type from
> `uint_least16_t`. Because there is a fundamental semantic difference
> between an array of unsigned integers that are at least 16-bits in size and
> an array of UTF-16 code units. One of this is a string; the other is not.

The only benefit I see there is allowing overloading.

But while that may be true, what's the point of an *unsigned* char? If you
want to do character operations, you use char. If you're using unsigned char,
that's because you want a byte, plain and simple. By this argument, we already
have the distinction between character operations and byte operations.

Andrey Semashev

Nov 2, 2016, 5:02:08 AM
to std-pr...@isocpp.org
On 11/02/16 08:16, Nicol Bolas wrote:
> On Wednesday, November 2, 2016 at 12:55:46 AM UTC-4, Thiago Macieira wrote:
>
> On Tuesday, 1 November 2016, at 20:10:28 PDT, Nicol Bolas wrote:
> > Actually, no. `std::byte` as proposed by P0298 is explicitly *not* a
> > typedef. It is a scoped enumeration whose underlying type is
> `unsigned
> > char`, but it is not a typedef.
> >
> > Granted, I can't agree with that design decision. I would much
> rather it be
> > a genuine type, rather than using C++ enum chicanery to get strong
> aliases.
>
> That would also not be a good idea because, unsigned char is, by
> defintion, a
> byte. Why should we have (more) types that mean exactly the same
> thing, and
> this time in all platforms, by definition?
>
>
> Because "unsigned char" /also/ means "unsigned character". With just
> `unsigned char`, there is no way to distinguish between manipulating
> bytes and manipulating unsigned characters.

And what is an "unsigned character", exactly? The standard defines a
character set, but leaves the character encoding implementation-defined,
except that it says code units are representable by char. Assuming that
code units are always positive (which I don't think is mandated anywhere,
but let's keep things sane), you could also store code units as unsigned
chars. But neither char nor unsigned char represents a character, unless
a code point is equivalent to a code unit.

I think when you say "unsigned character" you should actually be saying
"code units", and at that point it's not much different from "bytes". I
think unsigned char should be considered a byte in every respect; if such
a type is added to C++, it should be either a regular typedef (std::byte_t)
or an intrinsic integral type that is equivalent to unsigned char.

> My problem with using C++ enum chicanery is that, if you use the type
> traits mechanisms to ask what `std::byte` is, it will say that it's an
> enum, not that it's an integral type. There's no reason why `byte`
> should not be an integral type.

Agreed.

> Think about it. The C++ standard allows a break in strict aliasing rules
> for `unsigned char` and `char`. Why? Not because the standard thinks
> it's reasonable for people to alias UTF-8 strings. But because that's
> the only way to pass/manipulate a byte array. And byte arrays need to be
> able to alias.
>
> It's the same reason why we have `char16_t` as a distinct type from
> `uint_least16_t`. Because there is a fundamental semantic difference
> between an array of unsigned integers that are at least 16-bits in size
> and an array of UTF-16 code units. One of this is a string; the other is
> not.
>
> It's time we had such a distinction for bytes. And UTF-8 code units, for
> that matter.

Thing is, unsigned char is already allowed to alias, and if we add an
intrinsic byte type that is also allowed to alias, we don't create the
distinction you talk about. And if we prohibit unsigned char from aliasing
other types, we render lots of existing code invalid. We could instead add
an intrinsic char8_t and say that it only represents narrow-character code
units and does not alias other types. This way unsigned char is left as
the "byte" type and we have the other type for string processing.
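The existing carve-out referred to here, in code: writing through unsigned char* is well-defined for any object precisely because unsigned char is allowed to alias anything:

```cpp
#include <cstddef>

// Zero an object through its byte representation. This is legal for any
// trivially-copyable object only because unsigned char may alias
// anything; the same loop written through, say, short* would be UB.
void wipe(void* p, std::size_t n) {
    unsigned char* b = static_cast<unsigned char*>(p);
    for (std::size_t i = 0; i != n; ++i)
        b[i] = 0;
}
```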

mihailn...@gmail.com

Nov 2, 2016, 5:27:17 AM
to ISO C++ Standard - Future Proposals


On Wednesday, November 2, 2016 at 8:43:35 AM UTC+2, Thiago Macieira wrote:
On Tuesday, 1 November 2016, at 22:16:32 PDT, Nicol Bolas wrote:
> On Wednesday, November 2, 2016 at 12:55:46 AM UTC-4, Thiago Macieira wrote:
> > That would also not be a good idea because, unsigned char is, by
> > defintion, a
> > byte. Why should we have (more) types that mean exactly the same thing,
> > and
> > this time in all platforms, by definition?
>
> Because "unsigned char" *also* means "unsigned character". With just
> `unsigned char`, there is no way to distinguish between manipulating bytes
> and manipulating unsigned characters.
>
> That's what `byte` is for, as a type: a way to semantically differentiate
> between operations on bytes and operations on characters. The types can be
> inter-convertible, numerically speaking, but they don't mean the same thing.

I'm sorry, I don't agree that there's a distinction in the first place. Bytes
are used more often than just copying around. If you add, subtract, shift left
or right, perform bitwise operations, etc, you need the value. If I need the
value, then a zero is a zero is a zero, a 0x40 is still a 0x40.

...

But while that may be true, what's the point of an *unsigned* char? If you
want to do character operations, you use char. If you're using unsigned char,
that's because you want a byte, plain and simple. By this argument, we already
have the distinction between character operations and byte operations.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center

All byte is trying to do is make "unsigned char" officially different from char: a better name, no implicit conversions, and a limited set of operations.
It does define the bitwise ops.
As for arithmetic, it seems a bit odd indeed, but let's not forget that char is too small on the one hand and, on the other, unsigned is not considered a good type for math (as per a CppCon 2016 talk).

The idea is to be a pure storage format, not representing values (strings or math) without a cast.

Miro Knejp

Nov 2, 2016, 8:14:54 AM
to std-pr...@isocpp.org
I think this is the real issue here: a type that is dedicated to representing only characters and is not allowed to alias anything else. Having a char* can severely limit the compiler's ability to optimize your function because of all the aliasing implications, even if you as the author know it never aliases anything, because it actually *is* a string. Having a type that clearly conveys this semantic to the compiler would be useful.
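A sketch of the pessimization being described: because char may alias anything, the compiler must assume the store through dst can modify *len, and so must reload *len on every iteration; a non-aliasing character type would let it load the bound once:

```cpp
#include <cstddef>

// The store to dst[i] might alias *len (a char* can legally point at
// anything, including the size_t), so the loop bound cannot be hoisted
// out of the loop.
void fill(char* dst, const std::size_t* len) {
    for (std::size_t i = 0; i < *len; ++i)
        dst[i] = 'x';
}
```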

Tom Honermann

Nov 2, 2016, 9:46:18 AM
to std-pr...@isocpp.org
On 11/02/2016 01:16 AM, Nicol Bolas wrote:
> It's time we had such a distinction for bytes. And UTF-8 code units,
> for that matter.
>
In case you haven't seen the latest proposal yet:
- P0482R0: char8_t: A type for UTF-8 characters and strings
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html

Any feedback you might have would be appreciated.

Tom.

Nicol Bolas

Nov 2, 2016, 9:55:30 AM
to ISO C++ Standard - Future Proposals

P0298's std::byte permits all of those operations via operator overloading.

> It's the same reason why we have `char16_t` as a distinct type from
> `uint_least16_t`. Because there is a fundamental semantic difference
> between an array of unsigned integers that are at least 16-bits in size and
> an array of UTF-16 code units. One of this is a string; the other is not.

The only benefit I see there is allowing overloading.

But while that may be true, what's the point of an *unsigned* char? If you
want to do character operations, you use char. If you're using unsigned char,
that's because you want a byte, plain and simple. By this argument, we already
have the distinction between character operations and byte operations.

You must not do much UTF-8 work. Because `char` can be signed or unsigned, the only reasonable way to manipulate UTF-8 code units is to use `unsigned char`. While a signed `char` is required to be able to store UTF-8 code units, you don't want to invoke implementation-defined behavior when bitshifting signed types. So you have to use `unsigned char`.

So it's hardly unreasonable to pass UTF-8 strings around via `unsigned char*`, rather than `char*`.
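The bit-shifting issue in code: classifying a UTF-8 lead byte involves shifts and comparisons on values at or above 0x80, which is only clean when the unit is treated as unsigned char (utf8_sequence_length is a hypothetical helper for illustration):

```cpp
// Determine how many code units a UTF-8 sequence occupies from its lead
// byte. The shifts below are well-defined on unsigned char; on a signed
// char holding a value >= 0x80 they would involve a negative operand.
int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead >> 5) == 0x06) return 2;   // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;   // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;   // 11110xxx
    return -1;                           // continuation or invalid byte
}
```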

Tom Honermann

Nov 2, 2016, 10:02:52 AM
to std-pr...@isocpp.org
On 11/02/2016 05:02 AM, Andrey Semashev wrote:
Assuming that code units are always positive (which I don't think is mandated anywhere, but let's keep things sane),

I don't know of any encodings that specify negative values for code units or code points.  However, the mapping of UTF-8 code units to char results in negative code unit values in UTF-8 strings for implementations that use a signed 8-bit char.  So, in practice, sanity does not always prevail.  This is one of the motivations for P0482R0.

Tom.

Nicol Bolas

Nov 2, 2016, 10:10:11 AM
to ISO C++ Standard - Future Proposals

A proposal which was apparently soundly rejected by EWG at the previous meeting.

Thiago Macieira

Nov 2, 2016, 10:10:55 AM
to std-pr...@isocpp.org
On Wednesday, 2 November 2016, at 06:55:29 PDT, Nicol Bolas wrote:
> You must not do much UTF-8 work. Because `char` can be signed or unsigned,
> the only reasonable way to manipulate UTF-8 code units is to use `unsigned
> char`. While a signed `char` is required to be able to store UTF-8 code
> units, you don't want to invoke implementation-defined behavior when
> bitshifting signed types. So you have to use `unsigned char`.
>
> So it's hardly unreasonable to pass UTF-8 strings around via `unsigned
> char*`, rather than `char*`.

I didn't respond to that part of the email, about UTF-8, because I thought we
were getting char8_t (see Tom Honermann's email).

Why do we need:
- char
- unsigned char
- char8_t
- byte

It seems to me we have one too many. Why do we need four? What are the four
distinct types for?

Tom Honermann

Nov 2, 2016, 10:14:36 AM
to std-pr...@isocpp.org
On 11/02/2016 08:14 AM, Miro Knejp wrote:
I think this is the real issue here: a type that is dedicated to represent only characters and not allowed to alias anything else. Having a char* can severely limit the compiler’s ability to optimize your function because of all the aliasing implications, even if you as the author know it never aliases anything because it actually *is* a string. Having a type that clearly conveys this semantic to the compiler would be useful.

I recall this being discussed during the presentation of P0257R0 in Jacksonville; thoughts were expressed that, if an alternative type to accommodate aliasing needs can be popularized, then perhaps the aliasing behaviors of char can be removed in some future standard.  I think that is a more realistic perspective than popularizing a new type to replace char.  I'll note that the char8_t type proposed in P0482R0 for UTF-8 characters and strings would not include char's aliasing behavior, thus at least allowing for more optimization possibilities for UTF-8 characters/strings.

Tom.

Tom Honermann

Nov 2, 2016, 10:18:49 AM
to std-pr...@isocpp.org
On 11/02/2016 10:10 AM, Thiago Macieira wrote:
On Wednesday, 2 November 2016, at 06:55:29 PDT, Nicol Bolas wrote:
You must not do much UTF-8 work. Because `char` can be signed or unsigned,
the only reasonable way to manipulate UTF-8 code units is to use `unsigned
char`. While a signed `char` is required to be able to store UTF-8 code
units, you don't want to invoke implementation-defined behavior when
bitshifting signed types. So you have to use `unsigned char`.

So it's hardly unreasonable to pass UTF-8 strings around via `unsigned
char*`, rather than `char*`.
I didn't respond to that part of the email, about UTF-8, because I thought we 
were getting char8_t (see Tom Honermann's email).
I'd like for that to happen, but P0482R0 has not yet been presented nor received any discussion.  I hope to present/discuss in Issaquah, time permitting.

Tom.

Tom Honermann

Nov 2, 2016, 10:26:00 AM
to std-pr...@isocpp.org
That isn't my understanding.  P0372R0 was presented to EWG in Oulu.  The wiki notes state that further discussion has been delegated to LEWG.  The link that you provided states similarly and does not state that it was rejected.  I've had some correspondence with the authors and with a LEWG member that raised concerns.  Some of those concerns are addressed in P0482R0, but I ran out of time before the Issaquah mailing deadline to address them all.  The paper notes some incomplete items at the end of the introduction.

Tom.

Andrey Semashev

Nov 2, 2016, 10:51:01 AM
to std-pr...@isocpp.org
> <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html>.

AFAIR, Unicode does not define negative code units. The fact that code
units greater than 0x7f cannot be represented in 8-bit signed `char`
simply means overflow and implementation-defined behavior. In other
words, UTF-8 code units cannot be represented by `char` on such platforms.

Tom Honermann

Nov 2, 2016, 11:14:29 AM
to std-pr...@isocpp.org
The standard specifies behavior a little stronger than that:

C++14 [basic.fundamental] 3.9.1p1:

"For each value i of type unsigned char in the range 0 to 255 inclusive,
there exists a value j of type char such that the result of an integral
conversion (4.7) from i to char is j, and the result of an integral
conversion from j to unsigned char is i."

So, arguably, UTF-8 code units with values greater than 0x7f must be
representable in char, though their values are mapped in an
implementation-defined way. Obtaining the UTF-8 code unit value requires
conversion to unsigned char. This means it is possible to fully work with
UTF-8 data, including UTF-8 character and string literals, though it is
quite frustrating and error-prone.

Tom.