
std::byte


jacobnavia

Mar 28, 2017, 5:42:07 PM
Reading
http://open-std.org/JTC1/SC22/WG21/docs/papers/2017/p0298r3.pdf
they say:
<quote>
Motivation and Scope
Many programs require byte-oriented access to memory.

Yes, byte access to memory has been standard in hardware for ages. It
predates C++; the term byte was coined by Werner Buchholz in July 1956.
OK? Since then, a certain sequence of bits makes up the unit of
addressing in byte-oriented machines, long since standardized as a
sequence of 8 bits. Nowhere is it specified what these people mean by
"many programs". Not ALL of them?

I mean, all programs in C and C++ use that kind of access every day on
almost all computers running today. The sizeof() unit is a byte. C and
C++ are byte-oriented languages.

<quote>
Today, such programs must use either the char, signed char, or unsigned
char types for this purpose.
<end quote>

Yes, I always try to use unsigned char for text, and signed char when I
store (occasionally) small numbers in a byte. A signed byte can also
hold quantities like an age, which with some effort could approach 128,
but....

I digress.

What is important is that the existing language already gives you all
possible tools for doing all kinds of operations on unsigned chars.

They go on:

<quote>
However, these types perform a “triple duty”. Not only are they used for
byte addressing,
<end quote>

??? What does addressing have to do with it?

We are speaking of bits, i.e. values.

<quote>
but also as arithmetic types,
<end quote>

Yes, you can do unsigned integer operations with unsigned chars. So what?

<quote>
and as character types.
<end quote>

Yes. You can even store letters in an unsigned char. All this is
already possible with known syntax and known rules.

<quote>
This multiplicity of roles opens the door for programmer error – such as
accidentally performing arithmetic on memory that should be treated as a
byte value – and confusion for both programmers and tools.
<end quote>

No data is provided in the document to support this assertion. No
research is cited about which tools are confused, or why we should bear
yet another syntax rule.

The new proposed syntax needs a template definition:
namespace std {
    template <class IntegerType> // constrained appropriately
    IntegerType to_integer(byte b);
}

This does what today is done with... nothing at all, if you program
according to the existing rules on the usual arithmetic conversions.

unsigned char a, b;
int c;

c = a + b;   // a and b are promoted to int before the addition

The operands are promoted and the operations are done as integer
operations. That is the normal way of doing this. Now, what new features
does this template bring?

They say:

<quote>
As its underlying type is unsigned char, to facilitate bit twiddling
operations, convenience conversion operations are provided for mapping a
byte to an unsigned integer type value. They are provided through the
function template.
<end quote>

But we already HAVE rules to do that. Why keep it simple when we can
complicate things?
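
For comparison, here is a minimal sketch of the two spellings (assuming
C++17 and <cstddef>):

#include <cstddef>   // std::byte, std::to_integer (C++17)

int sum_uc(unsigned char a, unsigned char b)
{
    return a + b;   // usual arithmetic conversions: both operands are promoted to int
}

int sum_byte(std::byte a, std::byte b)
{
    // std::byte has no operator+; the conversion has to be spelled out
    return std::to_integer<int>(a) + std::to_integer<int>(b);
}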

Now, suppose that I write the following C program (which also compiles
as a C++ program):

#include <stdio.h>

int add(unsigned char a, unsigned char b)
{
    /* -~x is x+1 (on two's-complement int), so this increments a
       exactly b times using only ~ and unary minus */
    for (int i = 0; i != b; i = -~i) {
        a = -~a;
    }
    return a;
}

int main(void)
{
    unsigned char a = 3;
    unsigned char b = 2;

    printf("%d %d %d %d %d\n", a, ~a, -~a, ~-~a, -~-~a);
    int c = add(a, b);
    printf("2+3=%d\n", c);
    printf("12+55=%d\n", add(12, 55));
}

I am doing an addition using only logical operators, because binary
addition is built from boolean operations in the actual gates of the
hardware. Of course this is not a hardware binary adder, but it shows
that there is no real distinction between arithmetic operations and
logical operations in real computers.

Is this complexity warranted?

What does this bring?

namespace std {
    // IntType would be constrained to be true for is_integral_v<IntType>
    template <class IntType>
    constexpr byte& operator<<=(byte& b, IntType shift) noexcept;
    template <class IntType>
    constexpr byte operator<<(byte b, IntType shift) noexcept;
    template <class IntType>
    constexpr byte& operator>>=(byte& b, IntType shift) noexcept;
    template <class IntType>
    constexpr byte operator>>(byte b, IntType shift) noexcept;
    constexpr byte operator|(byte l, byte r) noexcept;
    constexpr byte& operator|=(byte& l, byte r) noexcept;
    constexpr byte operator&(byte l, byte r) noexcept;
    constexpr byte& operator&=(byte& l, byte r) noexcept;
    constexpr byte operator~(byte b) noexcept;
    constexpr byte operator^(byte l, byte r) noexcept;
    constexpr byte& operator^=(byte& l, byte r) noexcept;
}

All this stuff does exactly what unsigned char does...

And I could substitute "byte" for unsigned char in the program above and
it should still work, shouldn't it?

Hergen Lehmann

Mar 28, 2017, 6:15:10 PM
On 28.03.2017 at 23:41, jacobnavia wrote:

> [...]
> Is this complexity warranted?
>
> What does this bring?

std::byte does not define arithmetic operators (+, -, *, /, %), so you
cannot accidentally do arithmetic on it.
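
For example (a quick sketch, C++17):

std::byte a{3}, b{2};
// auto c = a + b;               // error: no operator+ for std::byte
auto d = a | b;                  // OK: the bitwise operators are provided
int n = std::to_integer<int>(a); // explicit conversion where arithmetic is really wanted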

But I do not understand the need for that either. I can imagine very few
cases where arithmetic is actually harmful, and in most of those cases a
std::bitset or a struct containing bit fields would be a better choice
than a quasi-numeric type like std::byte.
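
Something like this sketch (needs <bitset>), for a hypothetical set of
switches:

std::bitset<8> reg;
reg.set(0);               // light on
reg.set(2);               // music on
bool sound = reg.test(1); // read a single switch
// reg + reg;             // error: std::bitset does not define arithmetic either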

I do however understand the need to differentiate between a byte and a
character. But there's already uint8_t for that...

Daniel

Mar 28, 2017, 6:25:38 PM
On Tuesday, March 28, 2017 at 5:42:07 PM UTC-4, jacobnavia wrote:

>
> All this stuff does exactly what unsigned char does...
>

It could be argued that C++ is more in need of a character type than a byte
type.

Daniel

jacobnavia

Mar 28, 2017, 6:27:28 PM
On 29/03/2017 at 00:04, Hergen Lehmann wrote:
> I do however understand the need to differentiate between a byte and a
> character. But there's already uint8_t for that...

Well, yes!

What is the semantic difference between byte and uint8_t?
They say:

A byte is a collection of bits.

But *everything* in a computer is that: a sequence of bits!
There isn't anything else in RAM...

RAM (and disks, and tapes, etc.) is just that: a sequence of bits.


Mr Flibble

Mar 28, 2017, 7:11:51 PM
char.

/Flibble

jacobnavia

Mar 28, 2017, 7:23:57 PM
On 29/03/2017 at 01:09, Stefan Ram wrote:
> A byte abstraction without arithmetics can apply to a
> register where arithmetics would absolutely make no sense.
> For example, when writing a 1 to bit 0 turns on the
> light, bit 1 the dish washer, and bit 2 the music.
> But then, one possibly might use bitfields for such a case.

Yes, there are bitfields. And if you define:

struct switches {
    int light:1;
    int padding:7;
    int sound:1;
    int padding1:7;
};

there are no math operations defined for that type, so all the
constraints supposedly gained by "byte" are ALREADY THERE.

struct switches a, b, c;

c = a + b; // Error.

Unless you overload addition, of course, but why would you do that?


jacobnavia

Mar 28, 2017, 7:27:43 PM
On 29/03/2017 at 00:04, Hergen Lehmann wrote:
>
> std::byte does not define arithmetic operators (+, -, *, /, %), so you
> cannot accidentally do arithmetic on it.

The above program proves that it is indeed possible.

Anyway, if you define

struct switches {
    int lightIsOn:1;
    int soundIsOn:1;
};

you can't do any addition or other arithmetic operations on this type.

struct switches a, b, c;

c = a + b; // error

so you ALREADY have the advantages of "byte".

Daniel

Mar 28, 2017, 8:14:41 PM
Unfortunately, char cannot represent a Unicode character. A sequence of char can represent a byte encoding of a Unicode character.

Mr Flibble

Mar 28, 2017, 8:50:18 PM
char and std::string are perfectly fine for Unicode in the form of UTF-8;
nothing else is needed at the model level.

/Flibble

Daniel

Mar 28, 2017, 11:35:24 PM
On Tuesday, March 28, 2017 at 8:50:18 PM UTC-4, Mr Flibble wrote:
>
> char and std::string are perfectly fine for Unicode in the form of UTF-8;
> nothing else is needed at the model level.
>
Um, no :-)

UTF-8 is a Unicode byte encoding. It's something you send over a wire or
serialize to a stream. It's not something the application programmer should
have to be aware of.

good::string s;

for (good::character c : s)
{
}

should loop over unicode codepoints, irrespective of the encoding of s
(likely UTF-8).

s.length() should return the length of the string in codepoints. The find
members on s should work.

An application programmer shouldn't have to write something
like

std::string source = "Hi \xf0\x9f\x99\x82";

auto g = codepoint_generator(source.begin(),source.end());
while (!g.done())
{
char32_t codepoint = g.codepoint();
g.next();
}

But I know, I know, you don't see it.

Best regards,
Daniel

Hergen Lehmann

Mar 29, 2017, 12:00:12 AM
On 29.03.2017 at 01:23, jacobnavia wrote:

> struct switches {
> int light:1;
> int padding:7;
> int sound:1;
> int padding1:7;
> };
>
> there are no math operations defined for that type, so all the
> constraints supposedly gained by "byte" are ALREADY THERE.

In fact, the struct with bitfields actually does provide the
sought-after constraints, while std::byte does not!

Accidental shift operations (which are allowed on std::byte!) would
break the boundaries between the bit fields in such a register just as
an arithmetic operation would. Nothing is gained by forbidding
arithmetic only...

Alf P. Steinbach

Mar 29, 2017, 6:10:57 AM
On 28-Mar-17 11:41 PM, jacobnavia wrote:
> Reading
> http://open-std.org/JTC1/SC22/WG21/docs/papers/2017/p0298r3.pdf
> they say:
> <quote>
> Motivation and Scope
> Many programs require byte-oriented access to memory.
>
> Yes, byte access to memory has been standard in hardware for ages.
> It predates C++; the term byte was coined by Werner Buchholz in July
> 1956. OK? Since then, a certain sequence of bits makes up the unit of
> addressing in byte-oriented machines, long since standardized as a
> sequence of 8 bits. Nowhere is it specified what these people mean by
> "many programs". Not ALL of them?

I think the author means access to memory viewed as raw bytes.

Thanks for the Werner Buchholz reference. I didn't know. Learned
something. :)


> I mean, all programs in C and C++ use that kind of access every day on
> almost all computers running today. The sizeof() unit is a byte. C and
> C++ are byte-oriented languages.
>
> <quote>
> Today, such programs must use either the char, signed char, or unsigned
> char types for this purpose.
> <end quote>
>
> Yes, I always try to use unsigned char for text, and signed char when I
> store (occasionally) small numbers in a byte. A signed byte can also
> hold quantities like an age, which with some effort could approach 128,
> but....

AFAIK your choice is exactly the opposite of most programmers' choice.

Still it would work on most architectures.

I gather it would work on /all/ extant architectures.

`signed char` has the formal problem that it can (read: once could) be
sign-and-magnitude or one's complement, which has one bit pattern (just
one, because all bits are required to be value representation bits) that
either denotes the same value as some other bit pattern, or is reserved
for some other purpose such as a trap representation.

Also, and probably for precisely that reason, the standard's special
support for bytes is for `unsigned char`, not `signed char`.


> I digress.

Yes. :)


> What is important is that the existing language already gives you all
> possible tools for doing all kinds of operations on unsigned chars.
>
> They go on:
>
> <quote>
> However, these types perform a “triple duty”. Not only are they used for
> byte addressing,
> <end quote>
>
> ??? What does addressing have to do with it?
>
> We are speaking of bits, i.e. values.
>
> <quote>
> but also as arithmetic types,
> <end quote>
>
> yes, you can do unsigned integer operations with unsigned chars. So what?
>
> <quote>
> and as character types.
> <end quote>
>
> Yes. You can even store letters in an unsigned char. All this is
> already possible with known syntax and known rules.

I agree. `std::byte` is just silly. And sad.


> <quote>
> This multiplicity of roles opens the door for programmer error – such as
> accidentally performing arithmetic on memory that should be treated as a
> byte value – and confusion for both programmers and tools.
> <end quote>
>
> No data is provided in the document to support this assertion. No
> research is cited about which tools are confused, or why we should bear
> yet another syntax rule.

I don't think the author's assertion holds at all.

Like you, I have some decades of programming behind me.

If the problem existed, I would certainly have encountered it.


> [snip]
> But we already HAVE rules to do that. Why keep it simple when we can
> complicate things?

Politics, I think.

Everybody gets to contribute, and can then support other things.

Still, that's very sad; it's IMO not how things should work. The C++
standardization committee shouldn't have to garner the support of
newbies. Listen to the learners and their problems and wishes, yes,
yes! Letting them design the language, absolutely not...


Cheers!,

- Alf

Chris Vine

Mar 29, 2017, 6:45:52 AM
An 8-bit integer could never represent the full set of Unicode code
points, if that is what you mean, except by a multi-byte encoding. In
a single-byte encoding it can only hold the ASCII subset of unicode.

However, there is a standard way of representing such code points in a
single character, namely the char32_t type, and the associated
std::u32string type.

What do you think that C++ should provide that these do not?

Chris

David Brown

Mar 29, 2017, 7:00:35 AM
On 29/03/17 12:10, Alf P. Steinbach wrote:
> On 28-Mar-17 11:41 PM, jacobnavia wrote:
>> Reading
>> http://open-std.org/JTC1/SC22/WG21/docs/papers/2017/p0298r3.pdf
>> they say:
>> <quote>
>> Motivation and Scope
>> Many programs require byte-oriented access to memory.
>>
>> Yes, byte access to memory has been standard in hardware for ages.
>> It predates C++; the term byte was coined by Werner Buchholz in July
>> 1956. OK? Since then, a certain sequence of bits makes up the unit of
>> addressing in byte-oriented machines, long since standardized as a
>> sequence of 8 bits. Nowhere is it specified what these people mean by
>> "many programs". Not ALL of them?
>

There is still significant use for C and C++ on architectures that do
not support 8-bit types. The main class here is DSP's - many have
16-bit or 32-bit char, and a few have weirder sizes (though those
usually do not have C++ compilers). It would be reasonable to say that
future C or C++ standards would no longer need to fit with older
mainframe systems with 36-bit char and the like - but DSPs are current
architectures.

> I think the author means access to memory view as raw bytes.
>
> Thanks for the Werner Buchholz reference. I didn't know. Learned
> something. :)
>
>
>> I mean, all programs in C and C++ use that kind of access every day on
>> almost all computers running today. The sizeof() unit is a byte. C and
>> C++ are byte-oriented languages.

The sizeof() unit is /always/ a "byte" in C and C++ - that is by
definition of the language. But "byte" does not always mean 8 bits in C
and C++. It is the most common choice, especially for C++, but it is
/not/ universal.

It would, of course, be much simpler if "byte" were fixed at 8 bits.
But the C and C++ standards committees (especially the C one) are very
reluctant to add new limitations to the types of systems that can use
the languages.

>>
>> <quote>
>> Today, such programs must use either the char, signed char, or unsigned
>> char types for this purpose.
>> <end quote>
>>
>> Yes, I always try to use unsigned char for text, and signed char when I
>> store (occasionally) small numbers in a byte. A signed byte can also
>> hold quantities like an age, which with some effort could approach 128,
>> but....
>
> AFAIK your choice is exactly opposite of most programmers' choice.

Most people, I think, use plain "char" for text - at least for ASCII text.
The "char" types /do/ perform three separate jobs.

"char" is a sensible type for a "character" - a single element of the
basic character set. It won't hold a UTF character, but for a lot of
programs, that is not necessary - and there is char32_t for UTF-32
characters.

"char", "signed char" and "unsigned char" have never been good names for
small numerical types - but until C99 and C++11 gave us uint8_t, int8_t, and
related types, "signed char" and "unsigned char" were the only choices.
A particular problem with these types is that many people used "char"
and made assumptions about whether it was signed or unsigned.

And "char" or "unsigned char" is used as a "minimal item of memory" - a
"byte". Sometimes people use "uint8_t" here instead.


I think it is not unreasonable to give people an alternative type here,
to make it clearer what they mean. If you use "byte" in your code, it
is obvious that it is for raw memory storage, not for text characters or
small numbers. It is too late to /force/ people to use it (just as it
is too late to stop people using "char" for numbers). But it does mean
there is a standard name that can be used for this purpose.

The "sad" and "silly" thing about "std::byte", IMHO, is that we are
getting it /now/ - instead of getting it 30 or 40 years ago in C.

>
>
>> <quote>
>> This multiplicity of roles opens the door for programmer error – such as
>> accidentally performing arithmetic on memory that should be treated as a
>> byte value – and confusion for both programmers and tools.
>> <end quote>
>>
>> No data is provided in the document to support this assertion. No
>> research is cited about which tools are confused, or why we should bear
>> yet another syntax rule.
>
> I don't think the author's assertion holds at all.

I don't remember seeing such mistakes either. But I have certainly seen
people use "char" when they should have specifically been using "signed
char", "unsigned char", "int8_t" or "uint8_t".

One thing I think is odd about std::byte is that they have defined
bitwise operations on it. I don't see the point of that at all. I
see a use for an opaque "small item of memory" type, but I can't
comprehend why one would want to be able to apply shifts and masking
while specifically blocking arithmetic.
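
Presumably the intended use is masking and shifting flags out of raw
memory, something like the following sketch - but I still don't see why
that should require blocking arithmetic:

std::byte flags{0xA2};
bool bit1 = (flags & std::byte{0x02}) != std::byte{0}; // test one flag
std::byte high = flags >> 4;                           // shifts are allowed...
// flags + std::byte{1};                               // ...but arithmetic is not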

Daniel

Mar 29, 2017, 7:20:57 AM
On Wednesday, March 29, 2017 at 6:45:52 AM UTC-4, Chris Vine wrote:
>
> However, there is a standard way of representing such code points in a
> single character, namely the char32_t type, and the associated
> std::u32string type.
>
> What do you think that C++ should provide that these do not?
>

First of all, I don't see anybody using std::u32string. The evolving
best practice in C++ seems to be to keep Unicode in its UTF-8 byte
encoding, store it in std::string (Mr Flibble gets that part right), and
use third-party libraries for performing Unicode-aware operations on it
(Mr Flibble overlooks that). At this point in the evolution of C++, any
path forward is going to have to keep std::string for UTF-8 byte
encodings.

A proper string class provides an abstract interface, and not merely an array
of codepoints.

Daniel

Scott Lurndal

Mar 29, 2017, 8:54:51 AM
David Brown <david...@hesbynett.no> writes:

>
>There is still significant use for C and C++ on architectures that do
>not support 8-bit types. The main class here is DSP's - many have
>16-bit or 32-bit char, and a few have weirder sizes (though those
>usually do not have C++ compilers). It would be reasonable to say that
>future C or C++ standards would no longer need to fit with older
>mainframe systems with 36-bit char and the like - but DSPs are current
>architectures.

Minor correction - on 36-bit systems[*], a "byte"[***] was 9-bits (such that
four would be packed into a 36-bit word). On 48-bit systems[**], a word
can be decomposed into 6 bytes using various instructions, but individual
byte access to memory isn't supported.

[*] e.g. Clearpath Dorado (formerly Univac 1100/2200), still in production
[**] e.g. Clearpath Libra (formerly Burroughs Large Systems), still in production
[***] A.K.A. Quarter-word.

I know that Libra has a C compiler. Not sure about Dorado. Both
architectures are over 50 years old and still plugging along
(albeit in emulation for the next generations).

Alf P. Steinbach

Mar 29, 2017, 9:23:24 AM
On 29-Mar-17 12:45 PM, Chris Vine wrote:
> On Tue, 28 Mar 2017 17:14:33 -0700 (PDT)
> Daniel <daniel...@gmail.com> wrote:
>> On Tuesday, March 28, 2017 at 7:11:51 PM UTC-4, Mr Flibble wrote:
>>> On 28/03/2017 23:25, Daniel wrote:
>>>>
>>>> It could be argued that C++ is more in need of a character type
>>>> than a byte type.
>>>
>>> char.
>>>
>> Unfortunately, char cannot represent a Unicode character. A sequence
>> of char can represent a byte encoding of a Unicode character.
>
> An 8-bit integer could never represent the full set of Unicode code
> points, if that is what you mean, except by a multi-byte encoding. In
> a single-byte encoding it can only hold the ASCII subset of unicode.

Oh, if it's about technical details, then a C or C++ byte is always
sufficient to hold one of the first 256 code points of Unicode, which
are ISO Latin-1.

I don't go so deep into details that I care about the hyphen there. ;)


> However, there is a standard way of representing such code points in a
> single character, namely the char32_t type, and the associated
> std::u32string type.
>
> What do you think that C++ should provide that these do not?

I can't answer for Chris Vine but IMHO a strongly typed character
encoding unit that is natural for the current system, like `int` is the
natural integer type for the platform (disregarding cross-compilation).

Such a type can be defined easily, as a `wchar_t` based enum in Windows,
for UTF-16 encoding, and as `char` based enum in Unix-land, for UTF-8
encoding. But there's no way to define a natural way to write string
literals for it. User defined literals just don't cut it, as I once
believed they would. Then one must use macros, which is ugly.

I've played around with the concept for some years, and wrote an article
in ACCU Overload journal. And my experience is that core language
support is definitely needed. But then one is up against the
single-universal-encoding dream of a certain subset of Unix-land folks,
which is baffling to me since C++ is a multi-paradigm language
originating in the most multi-paradigm OS ever devised, the ironically
named Unix (okay, I know the history, Multics and all that, but I
absolutely won't let inconvenient facts stand in the way of good arg).


Cheers!.

- Alf

Bo Persson

Mar 29, 2017, 9:25:52 AM
The idea is to have a limited abstraction that is used ONLY for
accessing raw memory.

A std::byte has the advantage that it is NOT implicitly convertible to
integer or floating point types. It can be used as a function parameter
without getting ambiguous overloads with functions taking a std::uint8_t.

And the standard library will not try to convert it to a char if you
accidentally write std::cout << b;. You get an error instead.
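
A quick sketch of both points (C++17; <cstddef>, <cstdint>, <iostream>):

void put(std::uint8_t n);  // numeric overload
void put(std::byte b);     // raw-memory overload - a distinct type, so no ambiguity

std::byte b{0x41};
// std::cout << b;                     // error: no operator<< for std::byte
std::cout << std::to_integer<int>(b);  // prints 65; the conversion is explicit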


The thing is that sometimes it is easier to write correct code if you
limit yourself, so you don't have to use the biggest sledgehammer on
the smallest nail.



Bo Persson

David Brown

Mar 29, 2017, 10:01:30 AM
On 29/03/17 14:54, Scott Lurndal wrote:
> David Brown <david...@hesbynett.no> writes:
>
>>
>> There is still significant use for C and C++ on architectures that do
>> not support 8-bit types. The main class here is DSP's - many have
>> 16-bit or 32-bit char, and a few have weirder sizes (though those
>> usually do not have C++ compilers). It would be reasonable to say that
>> future C or C++ standards would no longer need to fit with older
>> mainframe systems with 36-bit char and the like - but DSPs are current
>> architectures.
>
> Minor correction - on 36-bit systems[*], a "byte"[***] was 9-bits (such that
> four would be packed into a 36-bit word).

Have there been no systems with 36-bit bytes? I know of systems with
12-bit bytes, 16-bit bytes, 18-bit bytes, 24-bit bytes and 32-bit bytes.
But those were not mainframes (they are DSPs or other specialised
processors).

> On 48-bit systems[**], a word
> can be decomposed into 6 bytes using various instructions, but individual
> byte access to memory isn't supported.

I think in C (and C++), a "byte" must be at least 8 bits. The hardware
can, of course, provide smaller divisions - but they are not "bytes" in
C parlance.

David Brown

Mar 29, 2017, 10:05:07 AM
Overloads might well be a key reason for std::byte. The fact that the
small integer types are the same as the character types makes it
difficult to write overloaded output functions that work nicely with
those types.
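
For example (assuming a typical implementation where uint8_t is an alias
for unsigned char):

std::uint8_t n = 65;
std::cout << n;                   // prints 'A', not 65 - uint8_t is a character type here
std::cout << static_cast<int>(n); // prints 65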

fir

Mar 29, 2017, 10:39:41 AM
Note that when we talk about basic hardware types today, we should probably speak not of two sizes (byte size, word size) but of three (byte size, word size, integer size) - today the word size (understood as something capable of indexing all of RAM) and the basic integer size do not necessarily go together.

Scott Lurndal

Mar 29, 2017, 11:08:16 AM
David Brown <david...@hesbynett.no> writes:
>On 29/03/17 14:54, Scott Lurndal wrote:
>> David Brown <david...@hesbynett.no> writes:
>>
>>>
>>> There is still significant use for C and C++ on architectures that do
>>> not support 8-bit types. The main class here is DSP's - many have
>>> 16-bit or 32-bit char, and a few have weirder sizes (though those
>>> usually do not have C++ compilers). It would be reasonable to say that
>>> future C or C++ standards would no longer need to fit with older
>>> mainframe systems with 36-bit char and the like - but DSPs are current
>>> architectures.
>>
>> Minor correction - on 36-bit systems[*], a "byte"[***] was 9-bits (such that
>> four would be packed into a 36-bit word).
>
>Have there been no systems with 36-bit bytes? I know of systems with
>12-bit bytes, 16-bit bytes, 18-bit bytes, 24-bit bytes and 32-bit bytes.
> But those were not mainframes (they are DSPs or other specialised
>processors).

Well, the PDP-6 and PDP-10 used 36-bit words (double the 18 bits of the
PDP-1/4), where character data was stored in 6-bit fields within the
36-bit word, but the instruction set (on the PDP-10) included
instructions that could extract arbitrary n-bit-wide fields from the word.

Likewise on the 12-bit PDP-5/PDP-8, the word size was 12 bits and
generally two 6-bit characters were encoded into each word.

>
>> On 48-bit systems[**], a word
>> can be decomposed into 6 bytes using various instructions, but individual
>> byte access to memory isn't supported.
>
>I think in C (and C++), a "byte" must be at least 8 bits. The hardware
>can, of course, provide smaller divisions - but they are not "bytes" in
>C parlance.

Yes, the 48-bit systems have six 8-bit bytes.

Bo Persson

Mar 29, 2017, 12:19:20 PM
On 2017-03-29 17:08, Scott Lurndal wrote:
>
>>
>>>
>>> [*] e.g. Clearpath Dorado (formerly Univac 1100/2200), still in production
>>> [**] e.g. Clearpath Libra (formerly Burroughs Large Systems), still in production
>>> [***] A.K.A. Quarter-word.
>>>
>>> I know that Libra has a C compiler. Not sure about Dorado. Both
>>> architectures are over 50 years old and still plugging along
>>> (albeit in emulation for the next generations).
>>>
>>

According to the comments here

http://stackoverflow.com/a/6972551/597607

the 2200 series even had a C++ compiler. Or at least an eye-witness
claims to have once seen the manual on-line. :-)


Bo Persson

Mr Flibble

Mar 29, 2017, 5:00:48 PM
Sorry but you are talking nonsense. Individual Unicode codepoints
represented by a single scalar type are just as useless as variable
length UTF-8 encoded codepoints as far as wanting an atomic "character"
is concerned; you would know this if you did any serious i18n and/or
Unicode work.

I strongly disagree with your assertion that UTF-8 should only be used
during "serialization" and that an application should be unaware of it:
you are obviously a newbie or suffering from Dunning-Kruger effect if
you hold such views. Did you know Linux uses UTF-8 for filenames?

The text edit widget in my GUI library accepts UTF-8 as input; caches it
as both UTF-32 and font glyphs (after any glyph shaping performed for
e.g. Arabic script) and it works a treat so I do have practical
knowledge in this area.

/Flibble

Alf P. Steinbach

Mar 29, 2017, 8:58:29 PM
On 29-Mar-17 11:00 PM, Mr Flibble wrote:
> Linux uses UTF-8 for filenames

If so then Linux has diverged in an incompatible way from Unix mainstream.

Of old a Unix filename was just a sequence of bytes with no interpretation.

In particular, uppercasing or lowercasing a name was meaningless.


Cheers!,

- Alf

Alf P. Steinbach

Mar 29, 2017, 9:03:04 PM
And, sorry, I was half asleep: a Unix filename can be invalid as UTF-8.


Cheers!,

- Alf


Daniel

Mar 29, 2017, 10:24:42 PM
On Wednesday, March 29, 2017 at 8:58:29 PM UTC-4, Alf P. Steinbach wrote:
> On 29-Mar-17 11:00 PM, Mr Flibble wrote:
> > Linux uses UTF-8 for filenames
>
> If so then Linux has diverged in an incompatible way from Unix mainstream.
>
> Of old a Unix filename was just a sequence of bytes with no interpretation.
>
Any sequence of bytes except for the forward slash / or the NUL byte. That
applies to Linux too. UTF-8 encoded filenames are fine, but so are filenames
in any other encoding. As a practical matter, though, some applications
may have difficulty with some encodings. I think it's fair to say that it's
considered best practice in Linux to use UTF-8 encoded filenames.

Daniel

Daniel

Mar 29, 2017, 11:46:49 PM
On Wednesday, March 29, 2017 at 5:00:48 PM UTC-4, Mr Flibble wrote:
>
> Individual Unicode codepoints
> represented by a single scalar type are just as useless as variable
> length UTF-8 encoded codepoints as far as wanting an atomic "character"
> is concerned

I don't understand your point. Given a std::string that contains UTF-8
encoded bytes, some operations can be performed on that std::string without
iterating over the codepoints (or equivalently the UTF-8 character
sequences), for instance, compare equal with another UTF-8 encoded string.
But others cannot, for example, none of the std::string find operations will
work. I suppose you could specialize char_traits in basic_string with a
unicode version, but that leads to other issues.

>
> I strongly disagree with your assertion that UTF-8 should only be used
> during "serialization"

I never asserted that. My own view is that UTF-8 is probably the preferred
string _buffer_ encoding. wstring has been 32 bit on UNIX for a very long
time, and never gained any significant adoption in applications that have had
to support Unicode. I can't see u32string doing any better.

> and that an application should be unaware of it

I do believe that :-)

Most modern languages have one string class with a standard string
interface, not string, wstring, u16string, u32string, and a whole bunch
of basic_string possibilities with different character types,
char_traits and allocators. Instead of embedding encoding information in
the type, they have an internal type, and functions getBytes(encoding),
toBytes(encoding) to convert the internal representation from/to one of
many possible external encoded types. There's enough prior experience
now to suggest that that's the right approach.

Daniel

Robert Wessel

Mar 30, 2017, 2:13:20 AM

David Brown

Mar 30, 2017, 4:00:33 AM
On 30/03/17 02:58, Alf P. Steinbach wrote:
> On 29-Mar-17 11:00 PM, Mr Flibble wrote:
>> Linux uses UTF-8 for filenames
>
> If so then Linux has diverged in an incompatible way from Unix mainstream.
>

It has not - Mr. Flibble is wrong. Like all *nix systems, Linux
filenames are a sequence of 8-bit characters terminated in \0. The only
characters disallowed in the names are / (reserved for directories) and
\0 (the terminator).

It is up to the shell, file manager, desktop, etc., to interpret the
filenames in any way they want. In the early days, ASCII was the common
interpretation. Then people started using 8-bit code pages for
different locales, before UTF-8 became dominant.

In modern systems, UTF-8 is by far the most common encoding (with the
huge majority of file names matching the ASCII encoding), but it is not
required by the native filesystems or the OS.

Some non-native filesystems, such as NTFS and VFAT, /do/ have
requirements on character encoding (such as UCS-2 / UTF-16 / whatever
half-made jumble MS picked for that particular version of the
filesystem), and Linux will of course follow the rules there.

David Brown

Mar 30, 2017, 4:05:22 AM
On 30/03/17 05:46, Daniel wrote:
> On Wednesday, March 29, 2017 at 5:00:48 PM UTC-4, Mr Flibble wrote:
>>
>> Individual Unicode codepoints
>> represented by a single scalar type are just as useless as variable
>> length UTF-8 encoded codepoints as far as wanting an atomic "character"
>> is concerned
>
> I don't understand your point. Given a std::string that contains UTF-8
> encoded bytes, some operations can be performed on that std::string without
> iterating over the codepoints (or equivalently the UTF-8 character
> sequences), for instance, compare equal with another UTF-8 encoded string.
> But others cannot, for example, none of the std::string find operations will
> work. I suppose you could specialize char_traits in basic_string with a
> unicode version, but that leads to other issues.
>

Why won't "find" work with UTF-8 strings? UTF-8 is self-synchronising -
if you search for one UTF-8 string inside another, matches done as
Unicode code points will be the same as matches done as raw 8-bit data.
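
For instance (a small sketch):

std::string s = "caf\xc3\xa9 au lait"; // "café au lait" in UTF-8
std::string accent = "\xc3\xa9";       // "é" as its two-byte UTF-8 sequence
auto pos = s.find(accent);             // 3 - a byte offset, and it lands on a code point boundary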


David Brown

Mar 30, 2017, 4:07:05 AM
Sorry - I misread your post as saying that 48-bit words were decomposed
into "6-bit bytes", not "6 bytes".

Bo Persson

Mar 30, 2017, 4:58:04 AM
Oh, thanks! :-)



Bo Persson

Daniel

Mar 30, 2017, 8:30:44 AM
On Thursday, March 30, 2017 at 4:05:22 AM UTC-4, David Brown wrote:
>
> Why won't "find" work with UTF-8 strings? UTF-8 is self-synchronising -
> if you search for one UTF-8 string inside another, matches done as
> Unicode code points will be the same as matches done as raw 8-bit data.

Good point. I hadn't thought of that, thanks.

Daniel

David Brown

Mar 30, 2017, 10:10:29 AM
And I thought it was I who had missed something here.

The self-synchronising aspect of UTF-8 was a key design point, and I
think the ability to find substrings was a major reason for having it.


Vir Campestris

Mar 30, 2017, 4:49:59 PM
On 29/03/2017 13:54, Scott Lurndal wrote:
> Minor correction - on 36-bit systems[*], a "byte"[***] was 9-bits (such that
> four would be packed into a 36-bit word). On 48-bit systems[**], a word
> can be decomposed into 6 bytes using various instructions, but individual
> byte access to memory isn't supported.

_Was_ is the operative word. I first learned assembly on a DECSystem10.
The 36 bit words normally contained 5 7-bit characters, with a spare
bit. There was no byte access. I've also used ICL1900s (24 bit word, 4x6
bit characters) and a weird TI graphics processor where you could have a
packed array of items of any (small) number of bits. Handy when you want
to address a display with 8 greys per pixel. And yes, we did program
that one in C!

But this was all a long time ago...

Andy

red floyd

Mar 30, 2017, 6:19:37 PM
On 3/30/2017 1:49 PM, Vir Campestris wrote:

> _Was_ is the operative word. I first learned assembly on a DECSystem10.
> The 36 bit words normally contained 5 7-bit characters, with a spare
> bit. There was no byte access. I've also used ICL1900s (24 bit word, 4x6
> bit characters) and a weird TI graphics processor where you could have a
> packed array of items of any (small) number of bits. Handy when you want
> to address a display with 8 greys per pixel. And yes, we did program
> that one in C!

The TI processor was the TMS340x0 series. I had the "pleasure" of
coding for both the 34010 and 34020.

TI's graphics library had a bug when compiled at full optimization. A
memory-mapped register was not declared "volatile", so the compiler
would optimize out the read in a spin loop, and hang if the condition
was not met on the initial pass through the loop.
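
The classic shape of the bug, sketched with a made-up register address:

#define STATUS_REG (*(volatile unsigned int *)0x40001000UL)

void wait_ready(void)
{
    while ((STATUS_REG & 0x1u) == 0) {
        /* spin; without the volatile, the compiler may hoist the single
           load out of the loop and spin forever on a stale value */
    }
}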



Jorgen Grahn

Apr 1, 2017, 1:49:55 AM
On Wed, 2017-03-29, Daniel wrote:
> On Tuesday, March 28, 2017 at 8:50:18 PM UTC-4, Mr Flibble wrote:
>>
>> char and std::string are perfectly fine for Unicode in the form of UTF-8;
>> nothing else is needed at the model level.
>>
> Um, no :-)
>
> UTF-8 is a Unicode byte encoding. It's something you send over a wire or
> serialize to a stream. It's not something the application programmer should
> have to be aware of.

Well, it's also something which looks like ASCII to any program which
doesn't look too closely. It was one of the selling points.

Surprisingly few of my programs look that closely. Parsing text files
works. Case-insensitive operations, sorting and alignment into
columns do not.

> good::string s;
>
> for (good::character c : s)
> {
> }
>
> should loop over unicode codepoints, irrespective of the encoding of s
> (likely UTF-8).

Programs which *do* have to care would want that kind of support, yes.

(Note that I'm not claiming to be an expert on the subject. I'm still
puzzled by the implications, especially in mixed environments -- I have
25 years' worth of data encoded as ISO 8859-1.)

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Manfred

Apr 1, 2017, 11:05:56 AM
I don't claim to be an expert in this area either, but FWIW in my
experience this is handled reasonably well by using UTF-8 for I/O and
external data encoding, and converting to wchar_t (i.e. std::wstring)
for text manipulation and the user interface.
Note that wchar_t is required by the standard to "represent distinct
codes for all members of the largest extended character set specified
among the supported locales" (3.9.1-5) so not necessarily limited to UCS-2.
IME the extra memory cost is acceptable for the applications that
require this functionality - these being typically GUI applications where
the resource cost of the UI itself is much higher than that of the data
content.

Mr Flibble

Apr 1, 2017, 3:51:28 PM
wchar_t is only 16 bits on Windows and UTF-16 is the WORST option out of
UTF-8, UTF-16 and UTF-32.

/Flibble


Manfred

Apr 1, 2017, 5:36:13 PM
On 4/1/2017 9:51 PM, Mr Flibble wrote:
> On 01/04/2017 16:05, Manfred wrote:
>> Note that wchar_t is required by the standard to "represent distinct
>> codes for all members of the largest extended character set specified
>> among the supported locales" (3.9.1-5) so not necessarily limited to
>> UCS-2.
>
> wchar_t is only 16 bits on Windows and UTF-16 is the WORST option out of
> UTF-8, UTF-16 and UTF-32.

Possibly, but still it is the encoding recommended by Microsoft.
In Linux/gcc wchar_t is 32 bits.

Paavo Helde

Apr 15, 2017, 3:40:15 AM
On 29.03.2017 6:34, Daniel wrote:
> On Tuesday, March 28, 2017 at 8:50:18 PM UTC-4, Mr Flibble wrote:
>>
>> char and std::string are perfectly fine for Unicode in the form of UTF-8;
>> nothing else is needed at the model level.
>>
> Um, no :-)
>
> UTF-8 is a Unicode byte encoding. It's something you send over a wire or
> serialize to a stream. It's not something the application programmer should
> have to be aware of.
> good::string s;
>
> for (good::character c : s)
> {
> }
>
> should loop over unicode codepoints, irrespective of the encoding of s
> (likely UTF-8).

Why should the application programmer be aware of Unicode codepoints? A
Unicode codepoint is an immensely more complex and harder-to-process
thingy than a byte.

>
> s.length() should return the length of the string in codepoints. The find
> members on s should work.

std::string::find() works perfectly well with UTF-8 strings, because of
the carefully designed properties of UTF-8.

I agree that in some situations it would be preferable to represent a
string as a sequence of code points, but I have not encountered any such
situation yet. The Unicode code point is a very technical and specific
term related to the Unicode standard, and is thus relevant only for very
low-level functions dealing directly with the Unicode representation,
like normalizing combining diacriticals or such. These functions should
reside in a low-level library like iconv. Looping over Unicode
codepoints is something which should never be needed at the application
level.

For example, properly aligning text output in a fixed-width font requires
knowing the length of the string in printable characters. Alas, this is
not at all the same as the number of codepoints, and representing the
string as a sequence of codepoints does not help a bit here.



