Some bytes, perchance to view

Daniel
Oct 4, 2017, 11:34:34 PM

I'd like to define

using bytes_view = std::basic_string_view<uint8_t>;

and have it work cross platform.

It compiles with vs2015. But do I need to worry whether the specialization
std::char_traits<uint8_t> always exists? Or would it be safer to
define my own character traits?
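
(If I did roll my own, I imagine something like the sketch below; the name
byte_traits is made up, only the members that basic_string_view typically
needs are shown, and I'm assuming a C++17 <string_view> header.)

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// A minimal, incomplete traits class for uint8_t (a sketch).
struct byte_traits
{
    using char_type = std::uint8_t;
    using int_type = int;

    static void assign(char_type& r, const char_type& a) noexcept { r = a; }
    static bool eq(char_type a, char_type b) noexcept { return a == b; }
    static bool lt(char_type a, char_type b) noexcept { return a < b; }

    static int compare(const char_type* s1, const char_type* s2, std::size_t n)
    {
        return std::memcmp(s1, s2, n);
    }
    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& a)
    {
        return static_cast<const char_type*>(std::memchr(s, a, n));
    }
    // copy, move, eof, to_int_type, etc. omitted for brevity
};

using bytes_view = std::basic_string_view<std::uint8_t, byte_traits>;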

Thanks,
Daniel

Öö Tiib
Oct 5, 2017, 1:59:18 AM

Theoretically there can be a platform where uint8_t does not exist, but
in practice I don't think any such platform has a C++ compiler.
When uint8_t exists, it is either unsigned char or some "extended
unsigned integer type".

IIRC the standard requires char_traits only for char, wchar_t, char16_t
and char32_t. So classes that need char_traits
(like std::basic_fstream<uint8_t> or std::basic_string_view<uint8_t>
or std::basic_string<uint8_t>) are not required to work.

Why do you want to use uint8_t for text?

Daniel
Oct 5, 2017, 9:35:32 AM

On Thursday, October 5, 2017 at 1:59:18 AM UTC-4, Öö Tiib wrote:
>
> Theoretically there can be a platform where uint8_t does not exist, but
> in practice I don't think any such platform has a C++ compiler.
> When uint8_t exists, it is either unsigned char or some "extended
> unsigned integer type".
>
> IIRC the standard requires char_traits only for char, wchar_t, char16_t
> and char32_t.

I guess that means it's not required for "unsigned char" or "signed char"
either.

> So classes that need char_traits
> (like std::basic_fstream<uint8_t> or std::basic_string_view<uint8_t>
> or std::basic_string<uint8_t>) are not required to work.
>
> Why do you want to use uint8_t for text?

What is text :-)

The use is for binary strings, which would be written as base64 for
JSON, or the bytes themselves for CBOR.

Daniel


Alf P. Steinbach
Oct 5, 2017, 10:25:03 AM

On 10/5/2017 7:59 AM, Öö Tiib wrote:
> On Thursday, 5 October 2017 06:34:34 UTC+3, Daniel wrote:
>> I'd like to define
>>
>> using bytes_view = std::basic_string_view<uint8_t>;
>>
>> and have it work cross platform.
>>
>> It compiles with vs2015. But do I need to worry whether the specialization
>> std::char_traits<uint8_t> always exists? Or would it be safer to
>> define my own character traits?
>
> Theoretically there can be a platform where uint8_t does not exist, but
> in practice I don't think any such platform has a C++ compiler.

The usual example of CHAR_BIT > 8 has been Texas Instruments digital
signal processors, with CHAR_BIT = 16, and C++ compilers.

I guess one could check for existence via the UINT8_MAX macro.
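
Something like this, say:

#include <cstdint>

// UINT8_MAX is defined by <cstdint> only when an exact 8-bit type exists.
#if !defined(UINT8_MAX)
#error "no exact-width 8-bit unsigned type on this platform"
#endif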


Cheers!,

- Alf

James R. Kuyper
Oct 5, 2017, 12:11:31 PM

On 2017-10-05 09:35, Daniel wrote:
> On Thursday, October 5, 2017 at 1:59:18 AM UTC-4, Öö Tiib wrote:
...
>> IIRC the standard requires char_traits only for char, wchar_t, char16_t
>> and char32_t.
>
> I guess that means it's not required for "unsigned char" or "signed char"
> either.

Correct. In particular, std::char_traits<uint8_t> will exist only if
uint8_t is a typedef for char; if it's a typedef for unsigned char, that
specialization will not exist. That's true even if char is an unsigned
type: char, unsigned char, and signed char are always three distinct
types, even though char is required to represent the same range of
values as one of the other two types.
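
That distinctness is directly checkable; a small illustration:

#include <type_traits>

// char never names the same type as signed char or unsigned char, even
// though its representation matches one of them.
static_assert(!std::is_same<char, signed char>::value, "distinct types");
static_assert(!std::is_same<char, unsigned char>::value, "distinct types");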

>> So classes that need char_traits
>> (like std::basic_fstream<uint8_t> or std::basic_string_view<uint8_t>
>> or std::basic_string<uint8_t>) are not required to work.
>>
>> Why do you want to use uint8_t for text?
>
> What is text :-)

Text is the purpose for which std::basic_string<> was created. If you're
just looking for an array of uint8_t, then use one of the other standard
containers, such as std::vector<uint8_t>. If there's any feature that
std::basic_string<> has, which isn't shared by any of the other
standard containers, and you need to make use of that feature, then what
you're working with probably is text, in some sense.

Daniel
Oct 5, 2017, 1:23:21 PM

The question was about the need for a bytes_view, for which there's nothing
in the standard library, and whether it was sensible or stupid to base it on
std::basic_string_view<uint8_t,?>. I'm leaning towards stupid :-)

Regarding text, as far as I can tell, std::basic_string<> doesn't really
offer much more than a sequence container of 8, 16 or 32 bit items, with the
additional feature of appending a zero with c_str(), with no text semantics
except when they coincide with the usual operations on fixed sized items in
a sequence container, at least for the default definitions of
std::char_traits<>. In practice people seem to either (1) use std::string to
hold utf-8 octets, using the member functions when they make sense and extra
functions for the rest (determining length in characters (codepoints),
iterating over characters (codepoints), etc.), or (2), as you sometimes see
on Windows platforms, use std::wstring to hold utf-16 units, again with
extra functions.

Daniel

James R. Kuyper
Oct 5, 2017, 2:18:34 PM

On 2017-10-05 13:23, Daniel wrote:
> On Thursday, October 5, 2017 at 12:11:31 PM UTC-4, James R. Kuyper wrote:
>> On 2017-10-05 09:35, Daniel wrote:
>>>
>>> What is text :-)
>>
>> Text is the purpose for which std::basic_string<> was created. If you're
>> just looking for an array of uint8_t, then use one of the other standard
>> containers, such as std::vector<uint8_t>. If there's any feature that
>> std::basic_string<> has, which isn't shared by any of the other
>> standard containers, and you need to make use of that feature, then what
>> you're working with probably is text, in some sense.
>
> The question was about the need for a bytes_view, for which there's nothing
> in the standard library, and whether it was sensible or stupid to base it on
> std::basic_string_view<uint8_t,?>. I'm leaning towards stupid :-)
>
> Regarding text, as far as I can tell, std::basic_string<> doesn't really
> offer much more than a sequence container of 8, 16 or 32 bit items, with the

basic_string has no such restriction on the sizes of the things it can
contain. It has implementation-provided specializations for char,
wchar_t, char16_t and char32_t, but there's no requirement that either
of those first two types have a size that matches any of the three sizes
you've listed. And you can instantiate it for any user-defined non-array
POD type, as long as you provide a specialization of char_traits<> for that
same type which meets the requirements specified in 21.2.1.

> additional feature of appending a zero with c_str(),

In the general case, that's charT() (21.4.5p2), which is not necessarily
zero.

> ... with no text semantics
> except when they coincide with the usual operations on fixed sized items in
> a sequence container, at least for the default definitions of
> std::char_traits<>.

Most of the features that distinguish basic_string from other container
types are those listed in 21.4.7, which is, unsurprisingly, titled
"String operations". If you don't intend to use any of those operations,
you probably should use an ordinary container type.

James R. Kuyper
Oct 5, 2017, 2:32:03 PM

On 2017-10-05 14:18, James R. Kuyper wrote:
> On 2017-10-05 13:23, Daniel wrote:
...
>> Regarding text, as far as I can tell, std::basic_string<> doesn't really
>> offer much more than a sequence container of 8, 16 or 32 bit items, with the
>
> basic_string has no such restriction on the sizes of the things it can
> contain. It has implementation-provided specializations for char,
> wchar_t, char16_t and char32_t, but there's no requirement that either
> of those first two types have a size that matches any of the three sizes
> you've listed.

Actually, there's no such requirement for any of those types: char16_t
and char32_t are required to be typedefs for uint_least16_t and
uint_least32_t, respectively, which need not have a size of exactly 16
or 32 bits, respectively. It's extremely likely that each of those four
types will have one of those three sizes, but it's not a requirement.

Daniel
Oct 5, 2017, 3:22:55 PM

On Thursday, October 5, 2017 at 2:18:34 PM UTC-4, James R. Kuyper wrote:
>
> Most of the features that distinguish basic_string from other container
> types are those listed in 21.4.7, which is, unsurprisingly, titled
> "String operations".

Except for c_str(), and variants of the "string operations" that apply
to "null terminated" strings, it seems to me that all of those operations
would apply equally to CBOR binary strings. There are no text semantics.
std::string, for example, doesn't know about utf8 or continuation bytes,
even though that's often what it holds these days.

> If you don't intend to use any of those operations,
> you probably should use an ordinary container type.

Rather, my question was about the need for a bytes_view, for which there's
nothing in the standard library, and about the advisability or lack thereof
of basing one on std::basic_string_view<uint8_t,?>. I'm leaning towards no.

Daniel

James R. Kuyper
Oct 5, 2017, 3:56:56 PM

On 2017-10-05 15:22, Daniel wrote:
> On Thursday, October 5, 2017 at 2:18:34 PM UTC-4, James R. Kuyper wrote:
>>
>> Most of the features that distinguish basic_string from other container
>> types are those listed in 21.4.7, which is, unsurprisingly, titled
>> "String operations".
>
> Except for c_str(), and variants of the "string operations" that apply
> to "null terminated" strings, it seems to me that all of those operations
> would apply equally to CBOR binary strings.

Everything I know about CBOR is from what I just read at
<https://en.wikipedia.org/wiki/CBOR>. Is that what you're referring to?

How and why would you want to apply any of the basic_string<>::find*()
member functions to CBOR binary strings? It would look at bytes that
contain the header, the payload, or the data, without discriminating
between them. I can't imagine why you'd want to use any facility on a
CBOR string that wasn't aware of the distinction between those parts of
the data format. Similarly, how and why would you want to use substr()?

I can imagine a use for a container that was aware of the CBOR format,
and which parsed the items in a CBOR string into actual data items. I
imagine that this container might internally use an array or a standard
container of uint8_t to work on the string. But why would any of
basic_string<>'s special capabilities be of any particular use for that
purpose? As I said before, one of the non-string oriented standard
containers would seem to be a better choice.

>> If you don't intend to use any of those operations,
>> you probably should use an ordinary container type.
>
> Rather, my question was about the need for a bytes_view, for which there's
> nothing in the standard library, and about the advisability or lack thereof
> of basing one on std::basic_string_view<uint8_t,?>. I'm leaning towards no.

You haven't really explained anything about what bytes_view is supposed
to do, which makes it hard to answer that question. You indicated that
it has something to do with JSON and CBOR, which would incline me to
agree with your "no".

Öö Tiib
Oct 5, 2017, 4:36:17 PM

On Thursday, 5 October 2017 22:22:55 UTC+3, Daniel wrote:
> On Thursday, October 5, 2017 at 2:18:34 PM UTC-4, James R. Kuyper wrote:
> >
> > Most of the features that distinguish basic_string from other container
> > types are those listed in 21.4.7, which is, unsurprisingly, titled
> > "String operations".
>
> Except for c_str(), and variants of the "string operations" that apply
> to "null terminated" strings, it seems to me that all of those operations
> would apply equally to CBOR binary strings. There are no text semantics.
> std::string, for example, doesn't know about utf8 or continuation bytes,
> even though that's often what it holds these days.

Currently one likely uses std::string to represent UTF8 in C++. The
literal u8"text" is of type const char[] and so there are no additional
conversions needed.

>
> > If you don't intend to use any of those operations,
> > you probably should use an ordinary container type.
>
> Rather, my question was about the need for a bytes_view, for which there's
> nothing in the standard library, and about the advisability or lack thereof
> of basing one on std::basic_string_view<uint8_t,?>. I'm leaning towards no.

If there is a need for a class referring to a contiguous sequence of values
of type T (that are not characters) somewhere in memory, then maybe use
some non-standard library class like gsl::span<T>?
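
For example (a sketch assuming Microsoft's GSL implementation and its
<gsl/span> header; other GSL implementations lay out headers differently):

#include <cstddef>
#include <cstdint>
#include <vector>
#include <gsl/span>

// A non-owning, read-only view over contiguous bytes; no char_traits involved.
std::size_t count_zero_bytes(gsl::span<const std::uint8_t> bytes)
{
    std::size_t n = 0;
    for (std::uint8_t b : bytes)
        if (b == 0) ++n;
    return n;
}

int main()
{
    std::vector<std::uint8_t> buf = {0x00, 0x01, 0x00, 0x02};
    // spans convert from contiguous containers like vector
    return static_cast<int>(count_zero_bytes(buf));
}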

Specializing 'std::char_traits' for uint8_t values that are not really
meant to be characters, just to get 'std::basic_string' to work, just to
get 'std::basic_string_view' to work, can (I feel) confuse more. Where do
you get these "byte strings" whose "byte views" you need? Don't you also
need 'std::codecvt<uint8_t>' for that?

On the other hand, Microsoft apparently did it. On the third hand,
Microsoft has questionable practices. And on the fourth hand, I don't
really know your use cases or the rest of your software and plans. ;)

Daniel
Oct 5, 2017, 4:46:17 PM

On Thursday, October 5, 2017 at 3:56:56 PM UTC-4, James R. Kuyper wrote:

> Everything I know about CBOR is from what I just read at
> <https://en.wikipedia.org/wiki/CBOR>. Is that what you're referring to?

https://tools.ietf.org/html/rfc7049 is a better reference. CBOR supports
two types of strings: utf8 encoded, and binary. A binary string is just a
contiguous sequence of arbitrary bytes. If formatted to text, it would
typically be output as base64.

> How and why would you want to apply any of the basic_string<>::find*()
> member functions to CBOR binary strings? It would look at bytes that
> contain the header, the payload, or the data, without discriminating
> between them. I can't imagine why you'd want to use any facility on a
> CBOR string that wasn't aware of the distinction between those parts of
> the data format. Similarly, how and why would you want to use substr()?
>
Point taken :-) On the other hand, you can't sensibly use substr on a utf8
encoded string either, at least for arbitrary indices. find can work, but
only because of UTF-8's self-synchronizing features.

> I can imagine a use for a container that was aware of the CBOR format,
> and which parsed the items in a CBOR string into actual data items. I
> imagine that this container might internally use an array or a standard
> container of uint8_t to work on the string.

Yes, I have one, to encode/decode between CBOR and an unpacked
JSON variant. https://github.com/danielaparker/jsoncons/blob/master/doc/ref/cbor/encode_cbor.md

>
> You haven't really explained anything about what bytes_view is supposed
> to do

Analogous to string_view: a non-mutable, non-owning holder of a contiguous
sequence of bytes, supporting member functions const uint8_t* data() const,
length(), operator==, operator[], begin(), end(), and perhaps a couple of
others.
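
If I wrote it myself, it might look something like this (a sketch; the
member set is just the one described above):

#include <algorithm>
#include <cstddef>
#include <cstdint>

// A minimal hand-rolled bytes_view: non-mutable, non-owning (a sketch).
class bytes_view
{
public:
    bytes_view() noexcept : data_(nullptr), size_(0) {}
    bytes_view(const std::uint8_t* data, std::size_t size) noexcept
        : data_(data), size_(size) {}

    const std::uint8_t* data() const noexcept { return data_; }
    std::size_t length() const noexcept { return size_; }
    std::uint8_t operator[](std::size_t i) const { return data_[i]; }
    const std::uint8_t* begin() const noexcept { return data_; }
    const std::uint8_t* end() const noexcept { return data_ + size_; }

    friend bool operator==(const bytes_view& a, const bytes_view& b)
    {
        return a.size_ == b.size_ && std::equal(a.begin(), a.end(), b.begin());
    }

private:
    const std::uint8_t* data_;
    std::size_t size_;
};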

I was going to write one, but I noticed that somebody else's project in this
space was using

using bytes_view = std::experimental::basic_string_view<char>;

so I thought I'd run it by here, to see what people thought. All other
things being equal, I'd prefer to leverage existing things rather than
introduce new things. That's all.

Daniel

Daniel
Oct 5, 2017, 5:43:20 PM

On Thursday, October 5, 2017 at 2:32:03 PM UTC-4, James R. Kuyper wrote:
>
> Actually, there's no such requirement for any of those types: char16_t
> and char32_t are required to be typedefs for uint_least16_t and
> uint_least32_t, respectively, which need not have a size of exactly 16
> or 32 bits, respectively. It's extremely likely that each of those four
> types will have one of those three sizes, but it's not a requirement.

Thanks for remarking on that, I'd overlooked that. I find it lacking that
there's nothing in basic_string that tags the encoding, and have been using
sizeof(CharT) as an indicator of that, e.g. assuming wchar_t holds utf16 if
sizeof(wchar_t) == 16, or utf32 if sizeof(wchar_t) == 32. I realize this
isn't technically correct. Is there at least a presumption that char32_t
holds utf32, given that there's nothing to prevent you from stuffing utf8
or utf16 into it?

Daniel

James R. Kuyper
Oct 5, 2017, 6:18:19 PM

On 2017-10-05 17:43, Daniel wrote:
> On Thursday, October 5, 2017 at 2:32:03 PM UTC-4, James R. Kuyper wrote:
>>
>> Actually, there's no such requirement for any of those types: char16_t
>> and char32_t are required to be typedefs for uint_least16_t and
>> uint_least32_t, respectively, which need not have a size of exactly 16

That's not quite right - I was thinking of C, where that statement was
perfectly correct. In C++, char16_t and char32_t are their own distinct
types. But it's still correct to say that 16 and 32 bits, respectively,
are only minimum values for the widths of those types. There's no
requirement that they be exactly that size.

>> or 32 bits, respectively. It's extremely likely that each of those four
>> types will have one of those three sizes, but it's not a requirement.
>
> Thanks for remarking on that, I'd overlooked that. I find it lacking that
> there's nothing in basic_string that tags the encoding, and have
> been using sizeof(CharT) as an indicator of that, e.g. assuming
> wchar_t holds utf16 if sizeof(wchar_t) == 16, or utf32 if sizeof(wchar_t)
> == 32. ...

I presume you mean sizeof(...)*CHAR_BIT?
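
That is, something along these lines:

#include <climits>
#include <cstddef>

// sizeof() yields a size in bytes, not bits; the width in bits is
// sizeof(T) * CHAR_BIT (e.g. 16 for wchar_t on Windows, 32 on most
// Unix-likes).
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;
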
The encoding used for narrow (char), and wide (wchar_t) strings and
characters is completely implementation-defined. There's no guarantee
that it has anything to do with either ASCII or Unicode. I gather that,
particularly in Japan, it is (or at least, used to be) commonplace for
neither of them to have either encoding.

> ... I realize this isn't technically correct. Is there at least a
> presumption that char32_t holds utf32, given that there's nothing to
> prevent you from stuffing utf8 or utf16 into it?

You're right - there's nothing to prevent you from stuffing an arbitrary
numeric value that's within range into any object of either type.
However, there are facilities for creating and interpreting utf-8, utf-16
and utf-32 strings, and those facilities use char, char16_t, and
char32_t, respectively.

"A string literal that begins with u8, such as u8"asdf", is a UTF-8
string literal and is initialized with the given characters as encoded
in UTF-8.
Ordinary string literals and UTF-8 string literals are also referred to
as narrow string literals. A narrow string literal has type “array of n
const char”, where n is the size of the string as defined below, and has
static storage duration (3.7).
A string literal that begins with u, such as u"asdf", is a char16_t
string literal. A char16_t string literal has type “array of n const
char16_t”, where n is the size of the string as defined below; it has
static storage duration and is initialized with the given characters. A
single c-char may produce more than one char16_t character in the form
of surrogate pairs.
A string literal that begins with U, such as U"asdf", is a char32_t
string literal. A char32_t string literal has type “array of n const
char32_t”, where n is the size of the string as defined below; it has
static storage duration and is initialized with the given characters."
(2.14.5p7-10)
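
In code, those literal types look like this:

const char*     s8  = u8"asdf";  // UTF-8 narrow string literal
const char16_t* s16 = u"asdf";   // char16_t string literal
const char32_t* s32 = U"asdf";   // char32_t string literal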

"... The specialization codecvt<char16_t, char, mbstate_t> converts
between the UTF-16 and UTF-8 encoding forms, and the specialization
codecvt <char32_t, char, mbstate_t> converts between the UTF-32 and
UTF-8 encoding forms." (22.4.1.4p3).

"For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte sequences and UCS2 or
UCS4 (depending on the size of Elem) within the program.
...
For the facet codecvt_utf16:
— The facet shall convert between UTF-16 multibyte sequences and UCS2 or
UCS4 (depending on the size of Elem) within the program.
...
For the facet codecvt_utf8_utf16:
— The facet shall convert between UTF-8 multibyte sequences and UTF-16
(one or two 16-bit codes) within the program." (22.5p4-6)
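
For instance, a conversion through those facets might look like this (a
sketch using std::wstring_convert, which is deprecated in C++17 but still
available):

#include <codecvt>
#include <locale>
#include <string>

// UTF-8 -> UTF-16, via the codecvt_utf8_utf16 facet quoted above.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}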

Daniel
Oct 5, 2017, 10:11:37 PM

On Thursday, October 5, 2017 at 6:18:19 PM UTC-4, James R. Kuyper wrote:

Could I ask a somewhat tangential question, since you seem to be well
versed in this stuff? Is there a logical explanation why we have three
char types: char, signed char, and unsigned char?

Thanks!
Daniel

Paavo Helde
Oct 6, 2017, 5:45:52 AM

On 5.10.2017 23:46, Daniel wrote:

> Point taken :-) On the other hand, you can't sensibly use substr on a utf8
> encoded string either, at least for arbitrary indices. find can work, but
> only because of UTF-8's self-synchronizing features.

Why should I use substr() on UTF-8 with an arbitrary index? Substr() is
used for extracting a specific piece of the string; if I have no idea
what I am extracting, then there is no point in extracting it.

Typically substr() uses an index from a previous find(), which works
fine on UTF-8, as you mentioned. Example: extract the part of the string
after the last slash. Nobody cares if there are any UTF-8 characters
before or after the slash; everything works just fine.
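
A sketch of that idiom (the function name is mine):

#include <string>

// Extract the part after the last '/'. Safe on UTF-8 because an ASCII
// byte like '/' can never occur inside a multibyte sequence.
std::string after_last_slash(const std::string& s)
{
    std::string::size_type pos = s.rfind('/');
    return pos == std::string::npos ? s : s.substr(pos + 1);
}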

Cheers
Paavo (using substr() on UTF-8 for 15 years already).

Bo Persson
Oct 6, 2017, 5:54:27 AM

On 2017-10-06 04:11, Daniel wrote:
> On Thursday, October 5, 2017 at 6:18:19 PM UTC-4, James R. Kuyper wrote:
>
> Could I ask a somewhat tangential question, since you seem to be well
> versed in this stuff? Is there a logical explanation why we have three
> char types: char, signed char, and unsigned char?
>


It goes all the way back to the 1970s, when the C language was first
ported to a couple of different machines.

It turned out that on some instruction sets it was more efficient to
make char an unsigned type. So they did. And as long as you use 7-bit
ASCII it doesn't matter.

But it was also noticed that some people had already been using char for
calculations of small values (where we would now use int8_t). So it was
decided to add signed char to allow that on hardware where plain char was
unsigned.

And, of course, you might then need unsigned char when moving code in
the other direction.

And there we are.


Bo Persson

Bo Persson
Oct 6, 2017, 6:03:02 AM

There is a proposal for an array_view

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0122r0.pdf

but there has not been enough agreement on the details (or possibly its
potential overlap with the Ranges proposal) for it to enter the standard.

The current string_view seems to be a subset that has "escaped" from
those discussions and made it into the standard on its own.



Bo Persson


Jorgen Grahn
Oct 6, 2017, 9:07:39 AM

On Fri, 2017-10-06, Paavo Helde wrote:
> On 5.10.2017 23:46, Daniel wrote:
>
>> Point taken :-) On the other hand, you can't sensibly use substr on
>> a utf8 encoded string either, at least for arbitrary indices. find
>> can work, but only because of UTF-8's self-synchronizing features.
>
> Why should I use substr() on UTF-8 with an arbitrary index? Substr() is
> used for extracting a specific piece of the string; if I have no idea
> what I am extracting, then there is no point in extracting it.
>
> Typically substr() uses an index from a previous find(), which works
> fine on UTF-8, as you mentioned. Example: extract the part of the string
> after the last slash.

Or any kind of parsing of text where the important syntactic tokens
are ASCII.

> Nobody cares if there are any UTF-8 characters
> before or after the slash; everything works just fine.

Which is of course why UTF-8 was designed as it was.

Often people seem to forget that it's not just /any/ soup of chars;
those in the ASCII range map 1:1 to actual ASCII characters.

ISTR that the original specification for UTF-8 summarizes pretty well
what you can do with these strings, and what you cannot do.

> Cheers
> Paavo (using substr() on UTF-8 for 15 years already).

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

James R. Kuyper
Oct 6, 2017, 10:47:17 AM

On 2017-10-05 22:11, Daniel wrote:
> On Thursday, October 5, 2017 at 6:18:19 PM UTC-4, James R. Kuyper wrote:
>
> Could I ask a somewhat tangential question, since you seem to be well
> versed in this stuff? Is there a logical explanation why we have three
> char types: char, signed char, and unsigned char?

No, but there is a historical one. In the early days of C, there was
only char, which was signed on some platforms and unsigned on others.
This was allowed for the same reason that the representations of "int"
and "float" were allowed to be different on different implementations of
C: it allowed C to be implemented in the most efficient way for each
machine. All members of the basic execution character set are required
to be non-negative, which was not a problem for ASCII and its many
variants. However, in EBCDIC, for example, many of those members have
the high bit set, so the only way to use EBCDIC as the character
encoding was to make char unsigned.

By the time the C language was standardized, there was a large body of
code that depended upon char being signed, and another large body of
code that depended upon char being unsigned, so that changing char to
mandate that it be exclusively one or the other would break far too much
code, either way. Instead, the committee decided to invent two new
types, unsigned char and signed char, for which the signedness was
guaranteed.

When C++ was created, it retained these features.

In practice, char should be used for text. unsigned char should be used
when you wish to examine the object representation of other types.
unsigned char and signed char should be used when you need to store small
numbers, but only when you store so many of them that it's important to
minimize storage space.
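
For instance, examining an object representation is conventionally done
through unsigned char; a small illustration:

#include <cstddef>
#include <cstdio>

// Print the bytes of any object via unsigned char, which may alias any type.
void dump_bytes(const void* p, std::size_t n)
{
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    for (std::size_t i = 0; i < n; ++i)
        std::printf("%02x ", bytes[i]);
    std::printf("\n");
}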