
Config of a library with modules?


Alf P. Steinbach
Feb 7, 2019, 3:06:16 PM
Specifically, I wonder how one will achieve something like this with
modules?


-------------------------------------------------------------------------------
#pragma once  // Source encoding: UTF-8 with BOM (π is a lowercase Greek "pi").

namespace cppx
{
namespace best_effort
{
#ifdef CPPX_ASCII_PLEASE
constexpr auto& left_quote_str = "\"";
constexpr auto& right_quote_str = "\"";
constexpr auto& bullet_str = "*";
constexpr auto& left_arrow_str = "<-";
constexpr auto& right_arrow_str = "->";
#else
constexpr auto& left_quote_str = "“";
constexpr auto& right_quote_str = "”";
constexpr auto& bullet_str = "•";
constexpr auto& left_arrow_str = "←";
constexpr auto& right_arrow_str = "→";
#endif
} // namespace best_effort
} // namespace cppx
-------------------------------------------------------------------------------


I guess one could provide two variants of the library, each of which with
ordinary headers would just incorporate the above header, but this approach
leads to a combinatorial explosion. Imagine 16 yes/no config options like
`CPPX_ASCII_PLEASE` above: one would not want 2^16 = 65 536 library variants.

So, what's the new way to do things?


Cheers!,

- Alf

Daniel
Feb 7, 2019, 4:08:48 PM
On Thursday, February 7, 2019 at 3:06:16 PM UTC-5, Alf P. Steinbach wrote:
> Specifically, I wonder how will one achieve something like this with
> modules?
>
No idea, but
>
> -------------------------------------------------------------------------------
> // Source encoding: UTF-8 with BOM

UTF-8 with BOM? In [rfc8259](https://tools.ietf.org/html/rfc8259), there are
only two occurrences of the key words "MUST NOT": one to define the term, and
one to state "Implementations MUST NOT add a byte order mark (U+FEFF) to the
beginning of a networked-transmitted JSON text."

Daniel

David Brown
Feb 7, 2019, 4:56:27 PM
Can it be done with "if constexpr (CPPX_ASCII_PLEASE) " ?

Can you have template constexpr variables?

Put them in a templated struct?

Have them both as two namespaces, and let the module user pick the
namespace they want?

Scrap all encodings except proper UTF-8 (which has no BOM)?


I don't know if any of these solutions are possible - I haven't looked
at modules in detail. But perhaps they will put you onto a solution
that would work.

Alf P. Steinbach
Feb 8, 2019, 1:51:26 AM
When in Rome, do as the Romans.

Still, the RFC that allows JSON implementations to be
non-standard-conforming in their treatment of UTF-8, allowing them to
treat a BOM as an error¹, is necessarily a statement of politics, the
aspect of belonging to a social group that is at least partially bound
together by the adoption of a collection of zealot ideas, and not sound
engineering.

I'd let Donald Trump whip those idiot 1990s-Linux fanbois, if I could.


Cheers!,

- Alf

¹ Quote: “implementations that parse JSON texts MAY ignore the presence
of a byte order mark rather than treating it as an error.”

Alf P. Steinbach
Feb 8, 2019, 2:01:48 AM
> Can it be done with "if constexpr (CPPX_ASCII_PLEASE) " ?

I don't think so.


> Can you have template constexpr variables?

Yes, but the point is that the same client code may need to be compiled
for an environment where the UTF-8 encoded symbols won't work.


> Put them in a templated struct?

That's a good idea, thanks. One might conceivably provide a compile-time
choice constant as a custom configuration module. It would be a hack-like
thing, much like current code leaving the definition of a function to
client code (or providing a weakly linked default), but it could indicate
a direction for a language-supported standard solution, maybe?

Anyway it's a Better Way™. :)
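A minimal sketch of the templated-struct idea (plain C++17, not Alf's actual library; the names `Symbols`, `ascii_please` and `Sym` are made up here): the `CPPX_ASCII_PLEASE` macro becomes a template parameter, so one library build serves every configuration and client code makes the choice at compile time with a single constant.

```cpp
#include <string_view>

// Hypothetical sketch: one specialization per configuration, selected
// by a bool template parameter instead of a preprocessor macro.
template<bool ascii_only>
struct Symbols;

template<>
struct Symbols<true>        // the CPPX_ASCII_PLEASE variant
{
    static constexpr std::string_view bullet      = "*";
    static constexpr std::string_view right_arrow = "->";
};

template<>
struct Symbols<false>       // the full Unicode variant
{
    static constexpr std::string_view bullet      = "\u2022";  // •
    static constexpr std::string_view right_arrow = "\u2192";  // →
};

// The client-side "configuration module" boils down to one constant:
inline constexpr bool ascii_please = false;
using Sym = Symbols<ascii_please>;
```

With 16 such flags one would pass 16 template parameters (or a config struct), still with only one compiled library rather than 65 536 variants.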


> Have them both as two namespaces, and let the module user pick the
> namespace they want?

Nope.


> Scrap all encodings except proper UTF-8 (which has no BOM)?

Proper UTF-8 supports a BOM, and it's still necessary for portable code.
When or if Microsoft sees fit to equip Visual Studio with a GUI way to
select the `/utf-8` option for the source code, one may start thinking
about using BOM-less UTF-8 also for portable code. Until then such source
code, if it contains text literals, is likely to be misinterpreted in a
Visual Studio project.



> I don't know if any of these solutions are possible - I haven't looked
> at modules in detail.  But perhaps they will put you onto a solution
> that would work.

The templated struct thing sounds good. Or, well, not ideal, but much
gooder. :)


Cheers!,

- Alf

Daniel
Feb 8, 2019, 4:25:22 AM
On Friday, February 8, 2019 at 1:51:26 AM UTC-5, Alf P. Steinbach wrote:
>
> Still, the RFC that allows JSON implementations to be
> non-standard-conforming in their treatment of UTF-8, allowing them to
> treat a BOM as an error, is necessarily a statement of politics, the
> aspect of belonging to a social group that at least partially is bound
> together by the adoption of a collection of zealot ideas, and not sound
> engineering.
>
From an engineering point of view, BOMs serve no purpose. But as a
practical matter, they exist, so it behooves parsers to ignore them.

Daniel


Alf P. Steinbach
Feb 8, 2019, 5:01:11 AM
On 08.02.2019 10:25, Daniel wrote:
> On Friday, February 8, 2019 at 1:51:26 AM UTC-5, Alf P. Steinbach wrote:
>>
>> Still, the RFC that allows JSON implementations to be
>> non-standard-conforming in their treatment of UTF-8, allowing them to
>> treat a BOM as an error, is necessarily a statement of politics, the
>> aspect of belonging to a social group that at least partially is bound
>> together by the adoption of a collection of zealot ideas, and not sound
>> engineering.
>>
> From an engineering point of view, BOM's serve no purpose.

You state, in the form of an incorrect assertion that presumably I
should rush to correct (hey, someone's wrong on the internet!), that you
don't know any purpose of a BOM.

OK then.

A BOM serves two main purposes:

* It identifies the general encoding scheme (UTF-8, UTF-16 or UTF-32),
with high probability.
* It identifies the byte order for the multibyte unit encodings.

Since its original definition was a zero-width space it can be treated
as removable whitespace, and AFAIK that was the original intent.
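The two purposes can be sketched as a small sniffing routine (a hypothetical helper, not from the thread): the leading bytes identify both the encoding scheme and, for the multibyte-unit encodings, the byte order. The longer signatures must be checked first so the UTF-32(LE) BOM is not mistaken for UTF-16(LE), whose BOM is its prefix.

```cpp
#include <cstddef>

enum class Bom { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, None };

// Inspect up to the first four bytes of a buffer for a BOM signature.
Bom sniff(const unsigned char* p, std::size_t n)
{
    // Four-byte signatures first: FF FE 00 00 is UTF-32(LE), and its
    // two-byte prefix FF FE would otherwise match UTF-16(LE).
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Bom::Utf32BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Bom::Utf32LE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Bom::Utf8;                       // U+FEFF encoded as UTF-8
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Bom::Utf16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Bom::Utf16LE;
    return Bom::None;
}
```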

As a practical matter, for JSON data it's probably best omitted in data
one produces and accepted in data one receives. Be strict in what one
produces, lenient in what one accepts. Where the strictness is relative
to established standards or conventions for the relevant usage.

For C++ source code, if one wants GUI-oriented Visual Studio users to be
able to use the source code with a correct interpretation of the source
code bytes, e.g. of a • bullet symbol in a literal, then a BOM is
required. Otherwise the Visual C++ compiler will assume Windows
ANSI-encoding. Here the usage has the opposite convention of that for
JSON data.

Users like me can of course configure the compiler via textual command
line options, but most users, in my experience, don't delve into that.

And the idea of creating something that doesn't work by default, when it
could easily work by default, is IMO brain-dead, from incompetents. It
gets worse when those folks /insist/ that /others/ should create things
that don't work by default. That's why IMO they should be subjected to
Donald Trump's whips, if practically possible, to teach them a little.


> But as a
> practical matter, they exist, so it behooves parsers to ignore them.

Yes, agreed. :)


Cheers!,

- Alf

Öö Tiib
Feb 8, 2019, 5:19:57 AM
A DirtyUTF8-to-UTF8 converter/validator should erase those. One needs a
validator in external interfaces anyway, since new naïve programmers will
forever keep sending wrong stuff in a supposedly UTF-8 stream (like that
double-dot "ï" of their "naïvety" in Windows-1250). Such messages have to
be rejected and the defective products blamed.

Ralf Goertz
Feb 8, 2019, 6:27:33 AM
Am Fri, 8 Feb 2019 11:00:59 +0100
schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:

> You state, in the form of an incorrect assertion that presumably I
> should rush to correct (hey, someone's wrong on the internet!), that
> you don't know any purpose of a BOM.
>
> OK then.
>
> A BOM serves two main purposes:
>
> * It identifies the general encoding scheme (UTF-8, UTF-16 or UTF-32),
> with high probability.

BOM for UTF-N with N>8 is fine IMHO. But as I understand it UTF-8 is
meant to be as compatible as possible with ASCII. So if you happen to
have a »UTF-8« file that doesn't contain any non-ASCII characters then
why should it have a BOM? This can easily happen, e.g. when you decide
to erase those fancy quotation marks I just used and replace them with
ordinary ones like in "UTF-8". Suddenly, the file is pure ASCII but has
an unnecessary BOM. If the file contains non-ASCII characters you'll
notice that soon enough. My favourite editor (vim) is very good at
detecting that without the aid of BOMs and I guess others are, too.

And BOMs can be a burden, for instance when you want to quickly
concatenate two files with "cat file1 file2 >outfile". Then you end up
with a BOM in the middle of a file, which doesn't conform to the
standard AFAIK.

> * It identifies the byte order for the multibyte unit encodings.

As I said, for those BOMs are fine.

> Since its original definition was a zero-width space it can be
> treated as removable whitespace, and AFAIK that was the original
> intent.

But they increase the file size, which can cause problems (in the above
mentioned case of an ASCII-only UTF-8 file). I really don't understand
why UTF-8 has not become standard on Windows even after so many years of
its existence.

Alf P. Steinbach
Feb 8, 2019, 7:45:53 AM
On 08.02.2019 12:27, Ralf Goertz wrote:
> Am Fri, 8 Feb 2019 11:00:59 +0100
> schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:
>
>> You state, in the form of an incorrect assertion that presumably I
>> should rush to correct (hey, someone's wrong on the internet!), that
>> you don't know any purpose of a BOM.
>>
>> OK then.
>>
>> A BOM serves two main purposes:
>>
>> * It identifies the general encoding scheme (UTF-8, UTF-16 or UTF-32),
>> with high probability.
>
> BOM for UTF-N with N>8 is fine IMHO. But as I understand it UTF-8 is
> meant to be as compatible as possible with ASCII. So if you happen to
> have a »UTF-8« file that doesn't contain any non-ASCII characters then
> why should it have a BOM? This can easily happen, e.g. when you decide
> to erase those fancy quotation marks I just used and replace them with
> ordinary ones like in "UTF-8". Suddenly, the file is pure ASCII but has
> an unnecessary BOM.

It's not unnecessary if the intent is to further edit the file, because
then it says what encoding should better be used with this file.

Otherwise I'd just save as pure ASCII.

Done.


> If the file contains non-ASCII characters you'll
> notice that soon enough. My favourite editor (vim) is very good at
> detecting that without the aid of BOMs and I guess others are, too.

Evidently vim doesn't have to relate to many Windows ANSI encoded files,
where all byte sequences are valid.

It's possible to apply statistical measures over large stretches of
text, but these are necessarily grossly inefficient compared to just
checking three bytes, and that efficiency versus inefficiency counts for
tools such as compilers.

For an editor that loads the whole file anyway, and also has an
interactive user in front that can guide it, maybe it doesn't matter.


> And BOMs can be a burden, for instance when you want to quickly
> concatenate two files with "cat file1 file2 >ouftile". Then you end up
> with a BOM in the middle of a file which doesn't conform to the
> standard AFAIK.

Binary `cat` is a nice tool when it's not misapplied.

I guess the argument, that you've picked up from somebody else, is that
it's plain impossible to make a corresponding text concatenation tool.


>> * It identifies the byte order for the multibyte unit encodings.
>
> As I said, for those BOMs are fine.
>
>> Since its original definition was a zero-width space it can be
>> treated as removable whitespace, and AFAIK that was the original
>> intent.
>
> But they increase the file size which can cause problems (in the above
> mentioned case of an ASCII only UTF-8 file).

Not having the BOMs for files intended to be used with Windows tools,
causes problems of correctness.

In the above mentioned case the "problem" of /not forgetting the
encoding/ sounds to me like turning black to white and vice versa.

I'd rather /not/ throw away the encoding information, and would see the
throwing-away, if that were enforced, as a serious problem.


> I really don't understand
> why UTF-8 has not become standard on Windows even after so many years of
> it's existence.

As I see it, a war between Microsoft and other platforms, where they try
their best to subtly and not-so-subtly sabotage each other.

Microsoft does things like not supporting UTF-8 in Windows consoles
(input doesn't work at all for non-ASCII characters), and not supporting
UTF-8 locales in Windows, hiding the UTF-8 sans BOM encoding far down in
a very long list of useless encodings in the VS editor's GUI for
encoding choice, letting it save with system-dependent Windows ANSI
encoding by default, and even (Odin save us!) using that as the default
basic execution character set in Visual C++ -- a /system dependent/
encoding as basic execution character set.

*nix-world folks do things such as restricting the JSON format, in newer
versions of its RFC, to UTF without BOM, permitting a BOM to be treated
as an error.

Very political, as I see it.

Not engineering.


Cheers!,

- Alf

Manfred
Feb 8, 2019, 8:19:00 AM
On 2/8/2019 11:00 AM, Alf P. Steinbach wrote:
> On 08.02.2019 10:25, Daniel wrote:
>> On Friday, February 8, 2019 at 1:51:26 AM UTC-5, Alf P. Steinbach wrote:
>>>
>>> Still, the RFC that allows JSON implementations to be
>>> non-standard-conforming in their treatment of UTF-8, allowing them to
>>> treat a BOM as an error, is necessarily a statement of politics, the
>>> aspect of belonging to a social group that at least partially is bound
>>> together by the adoption of a collection of zealot ideas, and not sound
>>> engineering.

Not really - from RFC 3629 (the one that directly matters to UTF-8,
unlike 8259):
> A protocol SHOULD forbid use of U+FEFF as a signature for those
> textual protocol elements that the protocol mandates to be always
> UTF-8, the signature function being totally useless in those
> cases.
You have a good point, but it depends on what you mean by "work by
default" - it depends on the environment.
If you are in a Microsoft context, since they use BOM everywhere (even
where they shouldn't), then you are right.

I ran into this with XML. For this, RFC 3629 is very clear:
> A protocol SHOULD also forbid use of U+FEFF as a signature for
> those textual protocol elements for which the protocol provides
> character encoding identification mechanisms, ...

Still, if you want to embed an XML resource with Visual Studio, it
forces the BOM into it.
In this context, since the standard is so clear, I would say that "work
by default" would mean without inserting a BOM.

That said, I find your choice, to make the source code encoding clear
with a comment in the very first line, to be clear and appropriate.

If I would consider a theoretical alternative, it would be pure ASCII
source code with UTF-8 characters explicitly escaped in text strings:
since I don't like Unicode characters in identifiers (so no need for
UTF-8 for pure code), this would make it clear which text 'resources'
require UTF-8, and avoid the confusion of similar-looking characters in
favor of numeric code.
But, since this is quite impractical with tooling currently available,
it's just a theory.

Manfred
Feb 8, 2019, 8:26:29 AM
On 2/8/2019 12:27 PM, Ralf Goertz wrote:
> I really don't understand
> why UTF-8 has not become standard on Windows even after so many years of
> it's existence.

From one side, recalling what Alf said, Microsoft has a history of
doing its best to be compatible with itself only.
From what I read around, there is even a culture traditionally
established within Microsoft of avoiding anything that is not "invented
by Microsoft".

More practically, they also have to keep backwards compatibility with
the huge legacy they have, both with existing code base all around the
world, and existing executables that are tied to their existing
standard, so they're stuck with UTF-16 (or UCS-2, whatever it is).

james...@alumni.caltech.edu
Feb 8, 2019, 9:26:52 AM
On Friday, February 8, 2019 at 2:01:48 AM UTC-5, Alf P. Steinbach wrote:
...
> Proper UTF-8 supports BOM, and it's still necessary for portable code.

Could you explain that, in light of Manfred's citation of

> Not really - from RFC 3629 (the one that directly matters to UTF-8,
> unlike 8259):
> > A protocol SHOULD forbid use of U+FEFF as a signature for those
> > textual protocol elements that the protocol mandates to be always
> > UTF-8, the signature function being totally useless in those
> > cases.

If a protocol mandates that an element must always be UTF-8, What
benefit would be gained by using U+FEFF as a signature?
I can understand the use of U+FEFF in UTF-16, because that determines
how subsequent bytes of the encoded text are to be interpreted, but
that's not the case for UTF-8.

Öö Tiib
Feb 8, 2019, 9:51:07 AM
It can be that you are talking about different things. Alf seems to talk about
"text files" where BOM indeed might help to figure out what kind of "text file"
it is. You seem to talk about UTF8 field in transmission protocol.
There BOM is useless not-an-error.

Daniel
Feb 8, 2019, 10:07:04 AM
On Friday, February 8, 2019 at 5:01:11 AM UTC-5, Alf P. Steinbach wrote:
>
>
> A BOM serves two main purposes:
>
> * It identifies the general encoding scheme (UTF-8, UTF-16 or UTF-32),
> with high probability.

> * It identifies the byte order for the multibyte unit encodings.
>

On the contrary, it's utterly redundant. Given the first four bytes of
UTF-8, UTF-16(LE), UTF-16(BE), UTF-32(LE), or UTF-32(BE) encoded text, the
encoding can be detected with equal reliability.

There was a very long discussion on the mailing list leading up RFC 8259
whether the json data interchange specification should provide a statement
of that algorithm (there are some subtleties), but it was dropped when it
was decided to restrict data interchange to UTF8 only.

Daniel

james...@alumni.caltech.edu
Feb 8, 2019, 10:32:07 AM
I deliberately worded my message to match the wording of RFC 3629. It
seems to me that it's perfectly feasible for a protocol, in the most
general sense of the term (I don't know whether RFC 3629 uses it in that
sense), to include a textual element stored in a file. If that protocol
mandates that the file always be encoded in UTF-8, then the clause from
RFC 3629 quoted above is applicable.

Manfred
Feb 8, 2019, 10:44:58 AM
There has been some mix-up between generic files and specific protocols
in this thread:
The piece of mine you quoted was in reply to a comment by Alf about
JSON. JSON is a protocol (either as network transfer or file format),
and since the JSON spec dictates UTF-8, then a BOM should not be there.

Back to the original post from Alf, this is about source code files, so
there is no specific protocol defined.

Finally, in the context of how Visual Studio handles project items, I
mentioned XML files, for which an established protocol/format does
exist, and should have no BOM as well, but VS keeps adding one.

Daniel
Feb 8, 2019, 10:55:02 AM
On Friday, February 8, 2019 at 9:51:07 AM UTC-5, Öö Tiib wrote:
>
> It can be that you are talking about different things. Alf seems to talk about
> "text files" where BOM indeed might help to figure out what kind of "text file"
> it is.

BOM adds nothing over a detection mechanism, it's redundant. Unicode
encodings UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32LE, and UTF-32BE
can be detected by inspecting the first octets for zeros. Just two bytes
will do in some cases; at most four are needed.
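A sketch of that zero-octet heuristic (a hypothetical helper, not from the RFC mailing list; it assumes the first two characters are US-ASCII, which holds for JSON): the positions of the zero bytes among the first four octets distinguish the encodings, since an ASCII code unit in UTF-16/32 pads with zeros on one side or the other.

```cpp
#include <cstddef>

enum class Enc { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, Unknown };

// Detect the Unicode encoding of BOM-less text whose first two
// characters are known to be US-ASCII, by the zero-byte pattern.
Enc detect(const unsigned char* p, std::size_t n)
{
    if (n < 4) return Enc::Unknown;
    const bool z0 = p[0] == 0, z1 = p[1] == 0, z2 = p[2] == 0, z3 = p[3] == 0;
    if ( z0 &&  z1 &&  z2 && !z3) return Enc::Utf32BE; // 00 00 00 xx
    if (!z0 &&  z1 &&  z2 &&  z3) return Enc::Utf32LE; // xx 00 00 00
    if ( z0 && !z1 &&  z2 && !z3) return Enc::Utf16BE; // 00 xx 00 xx
    if (!z0 &&  z1 && !z2 &&  z3) return Enc::Utf16LE; // xx 00 xx 00
    if (!z0 && !z1 && !z2 && !z3) return Enc::Utf8;    // no zeros at all
    return Enc::Unknown;
}
```

This is exactly why the assumption matters: without a guaranteed ASCII prefix (as in arbitrary text files) the zero pattern proves nothing.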

Daniel

Daniel
Feb 8, 2019, 11:04:27 AM
Apologies, my mind is on JSON files, where the first character is always
US-ASCII. My comments above aren't applicable to other kinds of text files.

Daniel

Daniel
Feb 8, 2019, 11:07:36 AM
On Friday, February 8, 2019 at 10:07:04 AM UTC-5, Daniel wrote:
> On Friday, February 8, 2019 at 5:01:11 AM UTC-5, Alf P. Steinbach wrote:
> >
> >
> > A BOM serves two main purposes:
> >
> > * It identifies the general encoding scheme (UTF-8, UTF-16 or UTF-32),
> > with high probability.
>
> > * It identifies the byte order for the multibyte unit encodings.
> >
>
> On the contrary, it's utterly redundant. Given the first four bytes of
> UTF-8, UTF-16(LE), UTF-16(BE), UTF-32(LE), or UTF-32(BE) encoded text, the
> encoding can be detected with equal reliability.
>
Apologies, my mind is on JSON files, where the first character is always
US-ASCII, and hence a detection mechanism is possible. That observation
doesn't apply to other kinds of text files.

Daniel

Ralf Goertz
Feb 8, 2019, 11:28:35 AM
Am Fri, 8 Feb 2019 13:45:39 +0100
schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:

> On 08.02.2019 12:27, Ralf Goertz wrote:
> > Am Fri, 8 Feb 2019 11:00:59 +0100
> > schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:
> >
> >> You state, in the form of an incorrect assertion that presumably I
> >> should rush to correct (hey, someone's wrong on the internet!),
> >> that you don't know any purpose of a BOM.
> >>
> >> OK then.
> >>
> >> A BOM serves two main purposes:
> >>
> >> * It identifies the general encoding scheme (UTF-8, UTF-16 or
> >> UTF-32), with high probability.
> >
> > BOM for UTF-N with N>8 is fine IMHO. But as I understand it UTF-8 is
> > meant to be as compatible as possible with ASCII. So if you happen
> > to have a »UTF-8« file that doesn't contain any non-ASCII
> > characters then why should it have a BOM? This can easily happen,
> > e.g. when you decide to erase those fancy quotation marks I just
> > used and replace them with ordinary ones like in "UTF-8". Suddenly,
> > the file is pure ASCII but has an unnecessary BOM.
>
> It's not unnecessary if the intent is to further edit the file,
> because then it says what encoding should better be used with this
> file.

But there is still no need for that BOM if you save it as UTF-8. That's
the whole point.

> Otherwise I'd just save as pure ASCII.
>
> Done.

Which is a UTF-8 file that doesn't contain non-ASCII characters.

>
> > If the file contains non-ASCII characters you'll
> > notice that soon enough. My favourite editor (vim) is very good at
> > detecting that without the aid of BOMs and I guess others are,
> > too.
>
> Evidently vim doesn't have to relate to many Windows ANSI encoded
> files, where all byte sequences are valid.

And somehow it still manages to correctly detect both types of encoding
in most cases. If I save a file containing the line "My name is not
spelled Görtz but Goertz" after having ":set fileencoding=latin1" and
reopen it, vim tells me '"file" [converted]' because it detected the
encoding and converted the text to its native encoding (but it saves the
file using the original encoding if I don't interfere, emitting a warning
if that is impossible).

> It's possible to apply statistical measures over large stretches of
> text, but these are necessarily grossly inefficient compared to just
> checking three bytes, and that efficiency versus inefficiency counts
> for tools such as compilers.
>
> For an editor that loads the whole file anyway, and also has an
> interactive user in front that can guide it, maybe it doesn't matter.

As I said I usually don't have to specify anything, vim does it
automagically.

> > And BOMs can be a burden, for instance when you want to quickly
> > concatenate two files with "cat file1 file2 >ouftile". Then you end
> > up with a BOM in the middle of a file which doesn't conform to the
> > standard AFAIK.
>
> Binary `cat` is a nice tool when it's not misapplied.
>
> I guess the argument, that you've picked up from somebody else, is
> that it's plain impossible to make a corresponding text concatenation
> tool.

I think I made that argument before in a discussion with you. But still
I never said it was originally mine. So even if I picked it up from
somebody else that doesn't make it invalid. And by the way "binary"
`cat` /is/ `cat`. One of the other nice things you don't have to care
about under *nix.

> >> * It identifies the byte order for the multibyte unit encodings.
> >
> > As I said, for those BOMs are fine.
> >
> >> Since its original definition was a zero-width space it can be
> >> treated as removable whitespace, and AFAIK that was the original
> >> intent.
> >
> > But they increase the file size which can cause problems (in the
> > above mentioned case of an ASCII only UTF-8 file).
>
> Not having the BOMs for files intended to be used with Windows tools,
> causes problems of correctness.

Yeah but that's the fault of (the) Windows (tools) IMHO.


Robert Wessel
Feb 8, 2019, 11:31:41 AM
In MS's defense, they decided on UCS-2 for Windows (shipped 1993) at a
time when Unicode was explicitly* defined as a 16-bit code.

At the time the variable length encodings weren't really a thing yet
(although Plan 9's development of UTF-8 would have overlapped at least
the end of the initial NT development). And arguably a "simple" 16-bit
code was a reasonable choice; sure, it carried a bit of a size penalty,
but only for text.

Unicode was extended in 1996 to be a (sort-of) 32-bit code, but that
didn't get all that much traction until about 2006, when the Chinese
government started requiring some levels of character set support for
software sold there.

At that point MS switched from UCS-2 to UTF-16, as the least
disruptive change.

MS was already supporting two versions of (mostly) all APIs (8-bit
"ANSI" and 16-bit UCS-2), adding a third (for UTF-8 - the ANSI
encoding did not have room for the surrogate encodings) would have
been a considerable undertaking, although arguably one they should
have taken.

And this was at a time when it's likely that the majority of machines
in the world supporting Unicode were ones running Windows.


*The standard made that statement explicitly

Alf P. Steinbach
Feb 8, 2019, 11:41:33 AM
On 08.02.2019 16:06, Daniel wrote:
> On Friday, February 8, 2019 at 5:01:11 AM UTC-5, Alf P. Steinbach wrote:
>>
>>
>> A BOM serves two main purposes:
>>
>> * It identifies the general encoding scheme (UTF-8, UTF-16 or UTF-32),
>> with high probability.
>
>> * It identifies the byte order for the multibyte unit encodings.
>>
>
> On the contrary, it's utterly redundant. Given the first four bytes of
> UTF-8, UTF-16(LE), UTF-16(BE), UTF-32(LE), or UTF-32(BE) encoded text, the
> encoding can be detected with equal reliability.

Apparently you remember that with a binary inversion, the exact opposite
of what someone wrote.

That happens to me sometimes, but mostly about things I can't reason
about, things that are arbitrary facts. I get them mixed up & inverted.

Anyway, with a BOM the first four bytes give a good, generally reliable
indication. But without a BOM... Well, consider the aforementioned (in
my code) bullet point, “•”.

A consultant accustomed to Powerpoint presentations, might well start a
text file with a bullet point. Or two. Or more.

It's Unicode code point U+2022. And as UTF-16(BE) it's 0x20 followed by
0x22. Interpreted as ASCII, that's a space followed by a double quote.

Now you're looking at the first four bytes of the file. They're 0x20,
0x22, 0x20, 0x22.

Is it a space, quote, space, quote, in ASCII or some ASCII extension
such as UTF-8, Latin-1 or the original IBM PC encoding (Windows cp 437)?
Or is it perhaps two bullet points in UTF-16(BE)? Or, just maybe it's
twice the mathematical left angle symbol “∠” expressed in UTF-16(LE)?

As you can see, these byte values leave the spectrum of possible
encodings wide open, except that UTF-32 would have guaranteed a null byte.


> There was a very long discussion on the mailing list leading up RFC 8259
> whether the json data interchange specification should provide a statement
> of that algorithm (there are some subtleties), but it was dropped when it
> was decided to restrict data interchange to UTF8 only.

Good idea. ;-)


Cheers!

- Alf

Alf P. Steinbach
Feb 8, 2019, 11:51:42 AM
On 08.02.2019 17:28, Ralf Goertz wrote:
> Am Fri, 8 Feb 2019 13:45:39 +0100
> schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:
>
>> I guess the argument, that you've picked up from somebody else, is
>> that it's plain impossible to make a corresponding text concatenation
>> tool.
>
> I think I made that argument before in a discussion with you. But still
> I never said it was originally mine. So even if I picked it up from
> somebody else that doesn't make it invalid. And by the way "binary"
> `cat` /is/ `cat`. One of the other nice things you don't have to care
> about under *nix.

My point was not that it was invalid because you picked it up from
somewhere, but that it's an invalid argument that you didn't originate,
i.e. no fault of yours except not thinking deeply about it.

Text concatenation and binary concatenation was never the same even for
ASCII.

Consider a C++ source code text, and C++11 §2.2/1 requoted from an SO
posting I hastily googled up:

“A source file that is not empty and that does not end in a new-line
character, or that ends in a new-line character immediately preceded by
a backslash character before any such splicing takes place, shall be
processed as if an additional new-line character were appended to the file.”

As far as a C++11 or later compiler is concerned, a file that ends in a
line without a final newline acts as if there were a newline. Now you use
binary `cat` to concatenate this text and some more source code. Oh dang,
that placed a preprocessor directive in the middle of a line (namely the
last line from the first file).

A nice textual concatenation tool would ensure by default that every
non-empty file's text was terminated with a newline, or the appropriate
system specific end of line specification, so that you could pass the
result to a C++ compiler... And by default it would also strip out those
pesky zero width spaces. Even in the middle of text.
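The core of such a tool could be sketched like this (a hypothetical `text_concat`, not an existing utility): guarantee that the first text ends with a newline, and strip a leading UTF-8 BOM from the second text so it can't end up in the middle of the result.

```cpp
#include <string>

// Concatenate two texts as *text*, not as raw bytes:
//  - the first text is guaranteed to end in a newline, so a trailing
//    partial line can't swallow a following preprocessor directive;
//  - a leading UTF-8 BOM (EF BB BF) on the second text is removed,
//    since after concatenation it would sit mid-file.
std::string text_concat(std::string a, std::string b)
{
    if (!a.empty() && a.back() != '\n')
        a += '\n';
    if (b.rfind("\xEF\xBB\xBF", 0) == 0)   // b starts with a UTF-8 BOM?
        b.erase(0, 3);
    return a + b;
}
```

A production version would also handle UTF-16/32 and system-specific line endings; this only illustrates why "text cat" and binary `cat` are different operations.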


>
>>>> * It identifies the byte order for the multibyte unit encodings.
>>>
>>> As I said, for those BOMs are fine.
>>>
>>>> Since its original definition was a zero-width space it can be
>>>> treated as removable whitespace, and AFAIK that was the original
>>>> intent.
>>>
>>> But they increase the file size which can cause problems (in the
>>> above mentioned case of an ASCII only UTF-8 file).
>>
>> Not having the BOMs for files intended to be used with Windows tools,
>> causes problems of correctness.
>
> Yeah but that's the fault of (the) Windows (tools) IMHO.

:)


Cheers!

- Alf

Manfred
Feb 8, 2019, 11:51:44 AM
On 2/8/2019 5:31 PM, Robert Wessel wrote:
> On Fri, 8 Feb 2019 14:26:17 +0100, Manfred <non...@add.invalid> wrote:
>
>> On 2/8/2019 12:27 PM, Ralf Goertz wrote:
>>> I really don't understand
>>> why UTF-8 has not become standard on Windows even after so many years of
>>> it's existence.
>>
>> From one side, recalling what Alf said, Microsoft has an history of
>> doing its best to be compatible with itself only.
>> From what I read around, there is even a culture traditionally
>> established within Microsoft of avoiding anything that is not "invented
>> by Microsoft".
>>
>> More practically, they also have to keep backwards compatibility with
>> the huge legacy they have, both with existing code base all around the
>> world, and existing executables that are tied to their existing
>> standard, so they're stuck with UTF-16 (or UCS-2, whatever it is).
>
>
> In MS's defense, they decided on UCS-2 for Windows (shipped 1993) at a
> time when Unicode was explicitly* defined as a 16-bit code.
>
[snip further valid points]
>
> And this was at a time when it's likely that the majority of machines
> in the world supporting Unicode were ones running Windows.
>

True, and in fact UTF-8 /is/ at least supported by recent versions of
MultiByteToWideChar() and WideCharToMultiByte(), so it is usable in
Windows. The inconvenience is having to code the conversion when
interfacing with the rest of the world nowadays.

james...@alumni.caltech.edu

unread,
Feb 8, 2019, 12:16:31 PM2/8/19
to
On Friday, February 8, 2019 at 11:51:42 AM UTC-5, Alf P. Steinbach wrote:
...
> Consider a C++ source code text, and C++11 §2.2/1 requoted from an SO
> posting I hastily googled up:
>
> “A source file that is not empty and that does not end in a new-line
> character, or that ends in a new-line character immediately preceded by
> a backslash character before any such splicing takes place, shall be
> processed as if an additional new-line character were appended to the file.”

That's 5.2p2. It comes as rather a shock to me, since I'm used to the C
standard's specification: "A source file that is not empty shall end in
a new-line character, which shall not be immediately preceded by a
backslash character before any such splicing takes place." (5.1.1.2p2).
Do you know in what version of C++ this processing was first specified?

Robert Wessel

unread,
Feb 8, 2019, 12:29:31 PM2/8/19
to
It was "If a source file that is not empty does not end in a newline
character, or ends in a newline character immediately preceded by a
backslash character, the behavior is undefined." in the original
version of C++98. So either C++11, or (unlikely, IMO) in one of the
TCs to C++98.

Daniel

unread,
Feb 8, 2019, 12:59:44 PM2/8/19
to
On Friday, February 8, 2019 at 11:41:33 AM UTC-5, Alf P. Steinbach wrote:
> On 08.02.2019 16:06, Daniel wrote:

> > On the contrary, it's utterly redundant. Given the first four bytes of
> > UTF-8, UTF-16(LE), UTF-16(BE), UTF-32(LE), or UTF-32(BE) encoded text, the
> > encoding can be detected with equal reliability.
>
> Apparently you remember that with a binary inversion, the exact opposite
> of what someone wrote.
>
> That happens to me sometimes, but mostly about things I can't reason
> about, things that are arbitrary facts. I get them mixed up & inverted.
>
Alf, you're remarkably restrained :-) I was thinking only of JSON,
where a detection mechanism is possible because the first character is
always US-ASCII, and then it's just a matter of detecting zeros in the
first octets to disambiguate. But you are of course correct in the general
case.

Daniel

Alf P. Steinbach

unread,
Feb 8, 2019, 3:38:27 PM2/8/19
to
On 08.02.2019 08:01, Alf P. Steinbach wrote:
> On 07.02.2019 22:56, David Brown wrote:
> [snip]
>
>> Put them in a templated struct?
>
> That's a good idea, thanks. One might conceivably provide a compile-time
> choice constant as a custom configuration module. It would be a hack-like
> thing, much like current code that leaves the definition of a function to
> client code (or provides a weakly linked default), but it could indicate
> a direction for a language-supported standard solution, maybe?
>
> Anyway it's a Better Way™. :)

I landed on a kind of compromise.

Compared to the original simple conditional code inclusion this feels
like very much overkill.

But now it's prepared for modules, like at one time receivers were
prepared for stereo broadcasts: only a teeny tiny little adjustment
would be necessary when the time came to do it for real. Yes, yes.


---------------------------------------------------------------------
#pragma once    // Source encoding: UTF-8 with BOM (π is a lowercase Greek "pi").

#include <cppx-core/config.hpp>                   // cppx::use_ascii_substitutes
#include <cppx-core/meta-type/Type_choice_.hpp>   // cppx::Type_choice_

namespace cppx
{
    struct Symbol_strings_utf8
    {
        static constexpr auto& left_quote_str   = "“";
        static constexpr auto& right_quote_str  = "”";
        static constexpr auto& bullet_str       = "•";
        static constexpr auto& left_arrow_str   = "←";
        static constexpr auto& right_arrow_str  = "→";
    };

    struct Symbol_strings_ascii
    {
        static constexpr auto& left_quote_str   = "\"";
        static constexpr auto& right_quote_str  = "\"";
        static constexpr auto& bullet_str       = "*";
        static constexpr auto& left_arrow_str   = "<-";
        static constexpr auto& right_arrow_str  = "->";
    };

    namespace best_effort
    {
        using Symbol_strings = Type_choice_<
            use_ascii_substitutes, Symbol_strings_ascii, Symbol_strings_utf8
            >;

        constexpr auto& left_quote_str  = Symbol_strings::left_quote_str;
        constexpr auto& right_quote_str = Symbol_strings::right_quote_str;
        constexpr auto& bullet_str      = Symbol_strings::bullet_str;
        constexpr auto& left_arrow_str  = Symbol_strings::left_arrow_str;
        constexpr auto& right_arrow_str = Symbol_strings::right_arrow_str;
    }  // namespace best_effort
}  // namespace cppx


---------------------------------------------------------------------


Cheers!,

- Alf

Ralf Goertz

unread,
Feb 9, 2019, 12:44:07 PM2/9/19
to
On Fri, 8 Feb 2019 17:51:30 +0100,
"Alf P. Steinbach" <alf.p.stein...@gmail.com> wrote:

> Text concatenation and binary concatenation was never the same even
> for ASCII.
>
> Consider a C++ source code text, and C++11 §2.2/1 requoted from an SO
> posting I hastily googled up:
>
> “A source file that is not empty and that does not end in a new-line
> character, or that ends in a new-line character immediately preceded
> by a backslash character before any such splicing takes place, shall
> be processed as if an additional new-line character were appended to
> the file.”
>
> As far as the C++11 or later compiler is concerned that file that
> ends in in a line without a final newline, acts as if there was a
> newline. Now you use binary `cat` to concatenate this text and some
> more source code. Oh dang, that placed a preprocessor directive in
> the middle of a line (namely the last line from the first file).
>
> A nice textual concatenation tool would ensure by default that every
> non-empty file's text was terminated with a newline, or the
> appropriate system specific end of line specification, so that you
> could pass the result to a C++ compiler... And by default it would
> also strip out those pesky zero width spaces. Even in the middle of
> text.

Hm, even if C++11 compilers need to treat a source file not ending in a
newline as if it did end in one, I would still call such a file
ill-formed. By the way, vim adds a newline when saving even if no
newline was present on the last line. It also tells you, when it opens
the file, that the newline is missing on the last line:

"file" [noeol]

What a great editor ;-)
