Draft Proposal of File String Literals

Andrew Tomazos

unread,
Apr 23, 2016, 6:16:26 AM4/23/16
to std-pr...@isocpp.org
Please find attached a draft 3-page proposal entitled "Proposal of File String Literals".

Feedback appreciated.

Thanks,
Andrew.

ProposalofFileStringLiterals.pdf

Moritz Klammler

unread,
Apr 23, 2016, 7:27:18 PM4/23/16
to Andrew Tomazos, std-pr...@isocpp.org
I can see how the proposed feature would be useful in many contexts and
provide a clean solution to what is often handled in a messy way today.

I'm wondering why you have decided to handle a pre-processing action
with a syntax that doesn't look like one. I remember having seen
somebody discuss a similar feature some time ago that looked something
like this.

#inplace char data "datafile.txt"

The effect of the above snippet would have been that after
pre-processing, a variable

char data[<file-size>] = {<file-data>};

would have been defined. Note that `data` is not `const` in this
example which might be useful for some applications. If I wanted the
data to be read-only, I could have written

#inplace const char data "datafile.txt"

explicitly. This would naturally also allow

#inplace const char data <datafile.txt>

to be consistent with how the pre-processor finds other files. You
couldn't have anonymous variables with the contents of a file in this
case but I'm wondering how useful these would be anyway.

I'm also not sure whether it is necessary to require the replacement
text to be a raw string literal. Especially in environments where
compatibility with C is a concern, this could be an unnecessary road
block. Couldn't the data equally well be inlined as an array of
integer literals or any offending characters be escaped via the good old
`\0dd` syntax? Granted, that might not be very readable for humans but
pre-processed files are not very pretty in general. It should be
allowed, though, to break the replacement over multiple lines using
either line-breaks in array syntax or else concatenation of string
literals. The reason I think this is important is that many text
editors perform very poorly or even crash when faced with extremely long
lines.

I'm assuming that you're assuming that -- no matter what the syntax
looks like -- pre-processors would handle these file string literals the
same way they handle other `#include`d files, so dependency computation
by build systems would continue to work by only running the pre-processor.
(Not that the standard would have to specify this, but it is good to be
aware of.)

Another question that should be discussed is whether and how to support
non-text data. For example, if we have such a mechanism, I could
imagine it would be useful to embed small graphics or other binary data
into the program image as well. Would encoding get in the way here?
Would NUL-termination mean that I have to subtract one from the size of
the generated array? Not a big deal but something to be aware of.
Maybe there could be an additional "encoding" prefix for binary data
that would guarantee a verbatim copy and also suppress NUL-termination.

Finally, another option used today for including arbitrary data in
program images is to defer the combining until link time. At least on
GNU/Linux systems, this can be done with the `objcopy` program [1].

extern "C" char * data;
extern "C" const std::size_t size;

I don't think that this approach can do anything that replacement at
pre-processing time could not accomplish. If you want to reduce
compile-time dependencies, you can always have a dedicated translation
unit that merely contains the lines

const char data[] = F"datafile.txt";
const std::size_t size = sizeof(data);

to simulate the behavior of the `objcopy` solution but going the other
way round is not possible. I'm just bringing it up because I thought
you might be interested to mention this in the discussion of your
proposal.

[1] https://sourceware.org/binutils/docs-2.26/binutils/objcopy.html

Thiago Macieira

unread,
Apr 23, 2016, 9:39:58 PM4/23/16
to std-pr...@isocpp.org
On Saturday, 23 April 2016 12:16:23 PDT Andrew Tomazos wrote:
> Please find attached a draft 3-page proposal entitled "Proposal of File
> String Literals".

Any thought about what to do if the input file isn't the exact binary data you
want? For example, suppose you need to encode or decode base64.

Can you show this can be done with constexpr expressions?

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Andrew Tomazos

unread,
Apr 24, 2016, 5:38:14 AM4/24/16
to std-pr...@isocpp.org
On Sun, Apr 24, 2016 at 3:39 AM, Thiago Macieira <thi...@macieira.org> wrote:
On Saturday, 23 April 2016 12:16:23 PDT Andrew Tomazos wrote:
> Please find attached a draft 3-page proposal entitled "Proposal of File
> String Literals".

Any thought about what to do if the input file isn't the exact binary data you
want? For example, suppose you need to encode or decode base64.

Can you show this can be done with constexpr expressions?

As with a normal string literal, the text of the dedicated source file of a file string literal will be decoded from the source encoding in the usual manner and then encoded in the execution encoding (as determined by implementation settings or the encoding prefix). From there it will behave like a normal string literal.  It is possible with constexpr programming to encode or decode base64.  String literals are "constexpr-compatible".
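
For example, here is a rough sketch of decoding base64 at compile time
(not from the paper; it assumes C++17 for mutating a std::array inside a
constexpr function, and it elides error handling). The ordinary literal
below could be replaced by a file string literal such as F"data.b64":

#include <array>
#include <cstddef>

constexpr int b64_value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return 0;  // '=' padding; invalid input is not diagnosed here
}

template <std::size_t N>
constexpr auto b64_decode(const char (&in)[N]) {
    // N - 1 input characters (excluding the NUL) decode to at most this many bytes.
    std::array<unsigned char, (N - 1) / 4 * 3> out{};
    std::size_t o = 0;
    for (std::size_t i = 0; i + 3 < N - 1; i += 4) {
        int v = (b64_value(in[i]) << 18) | (b64_value(in[i + 1]) << 12)
              | (b64_value(in[i + 2]) << 6) | b64_value(in[i + 3]);
        out[o++] = (v >> 16) & 0xFF;
        if (in[i + 2] != '=') out[o++] = (v >> 8) & 0xFF;
        if (in[i + 3] != '=') out[o++] = v & 0xFF;
    }
    return out;
}

constexpr auto decoded = b64_decode("aGVsbG8=");  // decodes to "hello"
static_assert(decoded[0] == 'h' && decoded[4] == 'o');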

Moritz Klammler

unread,
Apr 24, 2016, 7:16:28 AM4/24/16
to Andrew Tomazos, std-pr...@isocpp.org
Andrew Tomazos <andrew...@gmail.com> writes:

>> I'm also not sure whether it is necessary to require the replacement
>> text to be a raw string literal. Especially in environments where
>> compatibility with C is a concern, this could be an unnecessary road
>> block.
>
>
> Well, I guess it didn't block raw string literals, which are a feature
> of C++ and not of C (AFAIK).

What I wanted to say is that it would be nice if I could compile the
pre-processed file as C source code anyway. Given that the
pre-processor used to be the same for both C and C++, this seemed a
natural thing for me to think about.

>> Couldn't the data equally well be inlined as an array of integer
>> literals or any offending characters be escaped via the good old
>> `\0dd` syntax? Granted, that might not be very readable for humans
>> but pre-processed files are not very pretty in general. It should be
>> allowed, though, to break the replacement over multiple lines using
>> either line-breaks in array syntax or else concatenation of string
>> literals. The reason I think this is important is that many text
>> editors perform very poorly or even crash when faced with extremely
>> long lines.
>>
>
> I don't follow this sorry.

I mean, instead of outputting

R"D(some "nasty"
stuff)D"

couldn't the pre-processor output

{115, 111, 109, 101, 32, 34, 110, 97, 115, 116, 121, 34, 10, 115,
116, 117, 102, 102, 0}

or

"some \x22nasty\x22\0astuff"

to achieve the same effect without having to depend on raw string
literals? These ways to encode string literals are inconvenient for
humans to type and read, but the pre-processor shouldn't mind generating
them. It would be even simpler than having to figure out a valid
delimiter sequence for the raw string literal. Not to mention the havoc
that could be wreaked by UTF-8 strings that switch to RTL scripts...

In the second half of my earlier message, I wanted to allow the
pre-processor to generate replacement text broken up like this:

"........... some ................... very ................. long"
"......................... text ................................."
"..... broken ......... up ................... on ..............."
"............................ multiple .........................."
"................................................................"
"......... lines ......... using .................... implicit .."
"..... concatenation ............... of ....... string .........."
"............. literals ........................................."
"..................................... to ......................."
"........... keep .......... line ... lengths ..................."
"................................................................"
".......................................... reasonable .........."

Thiago Macieira

unread,
Apr 24, 2016, 4:03:38 PM4/24/16
to std-pr...@isocpp.org
You're missing my point.

I want to be sure that I could pre-process the contents of the file into the
data I actually want. That data and only that data should be stored in the
read-only sections of my binary's image.

If you can't prove that, it means I will need to have an extra tool to do the
pre-processing and a build system that runs it before the source gets compiled.
That negates the benefit of having the feature in the first place, since the
extra tool could just as well generate a .h file with the contents I want.

Andrew Tomazos

unread,
Apr 25, 2016, 10:37:10 AM4/25/16
to std-pr...@isocpp.org
You can "pre-process" the contents of a string literal using constexpr programming:

  constexpr auto original_version = "original string";

  constexpr fixed_string process_string(fixed_string input) { ... }

  constexpr auto processed_version = process_string(original_version);

  int main() { cout << processed_version; }

In practice processed_version will be in the program image, and original_version won't.

The above ordinary string literal can be replaced with a file string literal, and the same logic applies.

Isn't that what you mean?
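
For example, a self-contained version of that sketch could look like
this, with std::array standing in for the hypothetical fixed_string and
a trivial uppercasing transform standing in for process_string (assumes
C++17):

  #include <array>
  #include <cstddef>
  #include <iostream>

  template <std::size_t N>
  constexpr std::array<char, N> process_string(const char (&input)[N]) {
      std::array<char, N> result{};
      for (std::size_t i = 0; i < N; ++i) {
          char c = input[i];
          // toy transformation: uppercase ASCII letters
          result[i] = (c >= 'a' && c <= 'z') ? char(c - 'a' + 'A') : c;
      }
      return result;
  }

  constexpr auto processed_version = process_string("original string");

  int main() { std::cout << processed_version.data() << '\n'; }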

Matthew Woehlke

unread,
Apr 25, 2016, 11:24:38 AM4/25/16
to std-pr...@isocpp.org
On 2016-04-23 19:27, Moritz Klammler wrote:
> I'm wondering why you have decided to handle a pre-processing action
> with a syntax that doesn't look like one. I remember having seen
> somebody discuss a similar feature some time ago that looked something
> like this.
>
> #inplace char data "datafile.txt"

I was also wondering about that. In particular, I'll note that tools
that need to do non-compiler-assisted dependency scanning may have an
easier time with this format.

FWIW, I wouldn't ignore the point about being able to do path searches
for this feature. In fact, that may be a critical feature for some folks
(because the path cannot be determined when writing the source, but can
be specified by the build system). In particular, if no path search
occurs, it is impossible to include content from an external source
(i.e. a file not part of the source tree which uses it).

> Another question that should be discussed is whether and how to support
> non-text data.

No, that's not a question. That's a hard requirement ;-). I'm not sure
the proposal doesn't cover this, though?

auto& binary_data = F"image.png"; // const char (&)[N]
auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
auto image = parse_png(binary_data, binary_size);

Any form of this feature that cannot replace qrc is a failure in my book.

That said... I also wonder if being able to select between text vs.
binary mode is important, especially if importing large text blocks is
really a desired feature? (Most of the use cases I've seen have been for
binary files such as images. Andrew's proposal seems to imply uses for
text.)

Note that by "text mode" I mean native-to-C line ending translation.

> Maybe there could be an additional "encoding" prefix for binary data
> that would guarantee a verbatim copy and also suppress NUL-termination.

That's probably a good idea. (That said, I suspect many binary formats
can cope with a superfluous trailing NUL already. Anyway, this might be
useful, but probably isn't critical.)

--
Matthew

Nicol Bolas

unread,
Apr 25, 2016, 12:21:48 PM4/25/16
to ISO C++ Standard - Future Proposals, andrew...@gmail.com, mor...@klammler.eu
On Saturday, April 23, 2016 at 7:27:18 PM UTC-4, Moritz Klammler wrote:
I can see how the proposed feature would be useful in many contexts and
provide a clean solution to what is often handled in a messy way today.

I'm wondering why you have decided to handle a pre-processing action
with a syntax that doesn't look like one.  I remember having seen
somebody discuss a similar feature some time ago that looked something
like this.

    #inplace char data "datafile.txt"

I despise the idea of making a new preprocessor directive. Even more so, I despise the idea of having a preprocessor directive declaring a variable.

I say we just reuse `#include`. At present, `#include` requires the next token to be a ", a <, or a preprocessor token. We could just expand that to a couple of extra options:

#include <options> <file-specifier/pp-token>

The <options> can be a command which alters the nature of the include. With no options, it includes the file's tokens into the token stream, just as today.

If <options> is `text`, for example, it would include the file as a string literal, with text translation (for newlines and such). If <options> is `bin`, then it would include the file as a string literal with no translation. Note that embedded \0 characters would be allowed, and the string literal would work just as any C++ string literal does with them. So to use this, you would do this:

auto the_string = #include text "somefile.txt";

Of course, now we get into the question of what kind of string literal. That is, narrow string, UTF-8 string, UTF-16 string (why not allow inclusion of them?), and so forth. Obviously for `bin`, it would be a narrow string literal. Perhaps there would be different forms of text: `text` (narrow), `utf8`, `wide`, `utf16`, etc.

Note that this form of #include should not be converting the included file for these different formats. It is up to the user to make sure that the file actually stores data in the format the compiler was told to expect. The only conversion that might be allowed would be the removal of an initial BOM for the UTF formats.
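
Putting those pieces together, usage might look like this (all of it
hypothetical syntax, of course):

auto narrow_text = #include text "strings.txt"; // narrow literal, newline translation
auto utf8_text = #include utf8 "strings.txt";   // UTF-8 literal, newline translation
auto raw_bytes = #include bin "image.png";      // narrow literal, no translation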

Thiago Macieira

unread,
Apr 25, 2016, 12:23:01 PM4/25/16
to std-pr...@isocpp.org
On Monday, 25 April 2016 16:37:08 PDT Andrew Tomazos wrote:
> > If you can't prove that, it means I will need to have an extra tool to
> > do the pre-processing and a build system that runs it before the source
> > gets compiled. That negates the benefit of having the feature in the
> > first place, since the extra tool could just as well generate a .h file
> > with the contents I want.
>
> You can "pre-process" the contents of a string literal using constexpr
> programming:
>
> constexpr auto original_version = "original string";
>
> constexpr fixed_string process_string(fixed_string input) { ... }
>
> constexpr auto processed_version = process_string(original_version);
>
> int main() { cout << processed_version; }
>
> In practice processed_version will be in the program image, and
> original_version won't.
>
> The above ordinary string literal can be replaced with a file string
> literal, and the same logic applies.
>
> Isn't that what you mean?

Yes.

I'd like to see a concrete example in the paper, doing some transformation of
the data. As a strawman, it would be nice to know if you could write a
constexpr function that would generate a perfect hashing table given the
complete population of source strings (newline separated).

Andrew Tomazos

unread,
Apr 25, 2016, 1:17:29 PM4/25/16
to std-pr...@isocpp.org
On Mon, Apr 25, 2016 at 5:24 PM, Matthew Woehlke <mwoehlk...@gmail.com> wrote:
On 2016-04-23 19:27, Moritz Klammler wrote:
> I'm wondering why you have decided to handle a pre-processing action
> with a syntax that doesn't look like one.  I remember having seen
> somebody discuss a similar feature some time ago that looked something
> like this.
>
>     #inplace char data "datafile.txt"

I was also wondering about that. In particular, I'll note that tools
that need to do non-compiler-assisted dependency scanning may have an
easier time with this format.

I don't think that is true.  For a tool to accurately do dependency scanning, it already has to do near-full tokenization and preprocessing:

#include MACRO_REPLACE_ME

auto x = R"(
#include "skip_me"
)";

#if some_expression_that_is_false
#include "skip_me_too"
#endif

For approximate dependency scanning, which is of dubious benefit anyway, it is easy to add the proposed file string literal token pattern to the scanner.
 
> Another question that should be discussed is whether and how to support
> non-text data.

No, that's not a question. That's a hard requirement ;-). I'm not sure
the proposal doesn't cover this, though?

  auto& binary_data = F"image.png"; // const char (&)[N]
  auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
  auto image = parse_png(binary_data, binary_size);

I don't think that is portable, or advisable.  In particular, source files are decoded in an implementation-defined manner during translation.  Even if your source encoding was the same as the execution encoding, the implementation may still reject arbitrary binary sequences which are not valid for that encoding.

While we might be able to extend the proposal to make this work, I think the better way to include an image would be to use a link-time or run-time strategy.

As designed, a use of a file string literal should be interchangeable with a use of a raw string literal, and vice versa, by simply copy-and-pasting the body.  You would never put an image in a raw string literal (and I don't think it would work, for the reason given above).
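
For instance, if foo.txt contains just the five characters "Hello" (no
trailing newline), the intent is that these two declarations be
interchangeable:

const char a[] = F"foo.txt";
const char b[] = R"(Hello)";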

That said... I also wonder if being able to select between text vs.
binary mode is important, especially if importing large text blocks is
really a desired feature? (Most of the use cases I've seen have been for
binary files such as images. Andrew's proposal seems to imply uses for
text.)
 
The use cases for file string literals are largely the same as for raw string literals.  File string literals simply allow you to factor them out into a dedicated source file, rather than having them in-line.

Matthew Woehlke

unread,
Apr 25, 2016, 1:38:05 PM4/25/16
to std-pr...@isocpp.org
On 2016-04-25 13:17, Andrew Tomazos wrote:
> On Mon, Apr 25, 2016 at 5:24 PM, Matthew Woehlke wrote:
>> On 2016-04-23 19:27, Moritz Klammler wrote:
>>> Another question that should be discussed is whether and how to support
>>> non-text data.
>>
>> No, that's not a question. That's a hard requirement ;-). I'm not sure
>> the proposal doesn't cover this, though?
>>
>> auto& binary_data = F"image.png"; // const char (&)[N]
>> auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
>> auto image = parse_png(binary_data, binary_size);
>
> I don't think that is portable, or advisable. In particular source
> files are decoded in an implementation-defined manner during
> translation. Even if your source encoding was the same as the
> execution encoding, the implementation may still reject arbitrary
> binary sequences which are not valid for that encoding.
>
> While we might be able to extend the proposal to make this work, I
> think the better way [...]

In that case, I am Strongly Against your proposal. I probably, on some
occasions, want this feature for text. I *definitely* want it for binary
resources, and much more often than I might want it for text.

I think you are missing a significant and important use case, and, if
you don't account for that case, the feature is just begging to be
misused and abused and subject to confusion and surprise breakage.

> I think the better way to include an image would be to use a
> link-time or run-time strategy.

You're welcome to your opinion, but it does not match existing and
widespread practice.

--
Matthew

Nicol Bolas

unread,
Apr 25, 2016, 1:45:14 PM4/25/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Monday, April 25, 2016 at 1:38:05 PM UTC-4, Matthew Woehlke wrote:
On 2016-04-25 13:17, Andrew Tomazos wrote:
> On Mon, Apr 25, 2016 at 5:24 PM, Matthew Woehlke wrote:
>> On 2016-04-23 19:27, Moritz Klammler wrote:
>>> Another question that should be discussed is whether and how to support
>>> non-text data.
>>
>> No, that's not a question. That's a hard requirement ;-). I'm not sure
>> the proposal doesn't cover this, though?
>>
>>   auto& binary_data = F"image.png"; // const char (&)[N]
>>   auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
>>   auto image = parse_png(binary_data, binary_size);
>
> I don't think that is portable, or advisable.  In particular source
> files are decoded in an implementation-defined manner during
> translation.  Even if your source encoding was the same as the
> execution encoding, the implementation may still reject arbitrary
> binary sequences which are not valid for that encoding.
>
> While we might be able to extend the proposal to make this work, I
> think the better way [...]

In that case, I am Strongly Against your proposal. I probably, on some
occasions, want this feature for text. I *definitely* want it for binary
resources, and much more often than I might want it for text.

I think you are missing a significant and important use case, and, if
you don't account for that case, the feature is just begging to be
misused and abused and subject to confusion and surprise breakage.

I think your second point is the best reason to make sure that binary inclusions are well-supported. If you give people the ability to include files as strings, people are going to use it for including binary files. That is guaranteed. So the only options are to have it cause subtle breakage or to properly support it.

It's better not to do it at all and force us to use our current measures than to do it halfway.

Andrew Tomazos

unread,
Apr 25, 2016, 2:01:27 PM4/25/16
to std-pr...@isocpp.org, mwoehlk...@gmail.com
Ok, you've convinced me.

We can add a new encoding-prefix "b" for binary.  So it would be:

  auto binary_data = bF"image.png";

I'd need to think about the wording; it will probably be implementation-defined, with a note saying roughly that the data should undergo no decoding or encoding from source to execution.

Matthew Woehlke

unread,
Apr 25, 2016, 2:20:49 PM4/25/16
to std-pr...@isocpp.org
On 2016-04-25 14:01, Andrew Tomazos wrote:
> Ok, you've convinced me.

Thanks. I do think the original idea is also good, I just really want it
for binary data, and... well, Nicol eloquently reiterated my concern
:-). I do think there are enough folks that want something like this for
binary data that explicitly supporting binary data will definitely
*help* the proposal. (And to be fair, when I say "binary data", I really
mean "resources". Some of which... will be text :-). For example, SVG,
GLSL... and I'm sure you could add to that list.)

> We can add a new encoding-prefix "b" for binary. So it would be:
>
> auto binary_data = bF"image.png";
>
> I'd need to think about the wording; it will probably be
> implementation-defined, with a note saying roughly that the data should
> undergo no decoding or encoding from source to execution.

Right. (I was going to say that feels backwards, but missed that your
end point is "execution". Presumably escaping or something would happen
in the PP stage.)

BTW, what about path search? IIUC, the intent is that the file name is
looked up as if `#include "name"`, correct? It might not be terrible to
state this more explicitly, though...

--
Matthew

Andrew Tomazos

unread,
Apr 26, 2016, 11:08:24 AM4/26/16
to std-pr...@isocpp.org
The wording says that F"foo" is equivalent to #include "foo" (with some modifications, none of which affect the mapping of paths to source files).  In practice you just put the files in one of your include paths as usual.

Arthur O'Dwyer

unread,
Apr 28, 2016, 5:39:09 PM4/28/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Monday, April 25, 2016 at 11:01:27 AM UTC-7, Andrew Tomazos wrote:
On Mon, Apr 25, 2016 at 7:45 PM, Nicol Bolas <jmck...@gmail.com> wrote:
On Monday, April 25, 2016 at 1:38:05 PM UTC-4, Matthew Woehlke wrote:
On 2016-04-25 13:17, Andrew Tomazos wrote:
> On Mon, Apr 25, 2016 at 5:24 PM, Matthew Woehlke wrote:
>> On 2016-04-23 19:27, Moritz Klammler wrote:
>>> Another question that should be discussed is whether and how to support
>>> non-text data.
>>
>> No, that's not a question. That's a hard requirement ;-). I'm not sure
>> the proposal doesn't cover this, though?
>>
>>   auto& binary_data = F"image.png"; // const char (&)[N]
>>   auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
>>   auto image = parse_png(binary_data, binary_size);
>
> I don't think that is portable, or advisable.  In particular source
> files are decoded in an implementation-defined manner during
> translation.  Even if your source encoding was the same as the
> execution encoding, the implementation may still reject arbitrary
> binary sequences which are not valid for that encoding. [...]
 
If you give people the ability to include files as strings, people are going to use it for including binary files. That is guaranteed. So the only options are to have it cause subtle breakage or to properly support it.
 
Ok, you've convinced me.

We can add a new encoding-prefix "b" for binary.  So it would be:

  auto binary_data = bF"image.png";

I'd need to think about the wording; it will probably be implementation-defined, with a note saying roughly that the data should undergo no decoding or encoding from source to execution.

I was just thinking before your post that this has shades of the icky "mode" behavior of fopen(); i.e., it's up to the programmer (and therefore often buggy) whether the file is opened in "b"inary or "t"ext mode. What makes for the "often buggy" part is that the path-of-least-resistance happens to work perfectly on Unix/Linux/OSX, and therefore the vast majority of working programmers never need to learn the icky parts.

What happens on a Windows platform when I write

    const char data[] = R"(
    )";

? Does data come out equivalent to "\n" or to "\r\n"? Does it depend on the compiler (MSVC versus Clang) or not? I don't have a Windows machine to find out for myself, sorry.
I would expect that

    const char data[] = F"just-a-blank-line.h";

would have the same behavior on Windows as the above program, whatever that behavior is.

I would offer that perhaps the syntax

    const char data[] = RF"image.png";

is available for "raw binary" (i.e., "I want exactly these bytes at runtime") file input. However, this may be overcomplicating the token grammar; it would be the one(?) case where R"..." did not introduce a raw string literal token.

Re source and runtime encodings: I have experience with Green Hills compilers in "euc2jis" mode, where the source encoding is EUC and the runtime encoding is Shift-JIS: string literal tokens are assumed to be encoded in EUC, and have to be transcoded to JIS before their bytes are written to the data section. It would indeed be horrible if the programmer wrote

    const char data[] = F"image.png";

and the bytes that got written to the .rodata section were the "EUC2JIS'ed" version of "image.png". However, it would be almost as horrible if the programmer wrote

    const char license_message[] = F"license.txt";

and the bytes that got written to the .rodata section were NOT the "EUC2JIS'ed" version of "license.txt". And we definitely can't rely on heuristics like "the extension of the include file" to determine whether we should be assuming "source encoding" or "raw encoding". So I would agree with your (Andrew's) idea that we need a way for the programmer to specify whether the file input should be treated as "source" or "raw".  I merely offer that prefix-"b" seems awkward to me, and I think prefix-"R" is available.

Either way, your proposal should include an example along the lines of

    const char example[] = RF"foo(bar.h)foo";

Does this mean "include bar.h", or "include foo(bar.h)foo" — and why?

Re phases of translation: I think your proposal should include an example along the lines of

    #define X(x) F ## #x
    #include X(foo) ".h"
    const char example[] = X(foo) ".h";

I think the behavior of the above #include is implementation-defined or possibly undefined; I haven't checked.
I'm curious whether you'd make the behavior of the above "example" equivalent to

    const char example[] = F"foo" /*followed by the two characters*/ ".h";

or

    const char example[] = F"foo.h";

or ill-formed / implementation-defined / undefined. However, if I'm right about the behavior of the above #include, I would accept any of these results, or even "I don't care — let the committee sort it out", because my impression is that the preprocessor has lots of these little corner cases that end up getting sorted out over decades by the vendors, rather than by the standard.

Re preprocessing (e.g. base64-decoding), I've often wished that the major vendors would provide some Perl-style
"input from shell-pipe" syntax; for example,

    #include "image.png.64 | base64 -d"
    #include "<(base64 -d image.png.txt)"
    #include "<(wget -O - http://example.com/static/image.png)"

This strikes me as not a matter for top-down standardization but rather for some vendor to take the plunge (and expose themselves to all the bad publicity that would come with enabling arbitrary code execution in a C++ compiler).

my $.02,
–Arthur

Nicol Bolas

unread,
Apr 29, 2016, 12:01:52 PM4/29/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com

According to the standard:

>  A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal.

So it would be a `\n`, not a `\r\n`.

Granted, the above quote is not in normative text (presumably because section 2.14.3 makes it more clear). But clearly that is the intent of the specification. So if VS doesn't do that correctly, then it's broken.

And since VS has had raw string literals for a while now, odds are good someone would have noticed it if they did it wrong.

Andrew Tomazos

unread,
Apr 29, 2016, 1:16:01 PM4/29/16
to std-pr...@isocpp.org
Not so fast:

"Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined."

This mapping is commonly known as the "source encoding".  As part of the source file, the contents of raw string literals are input, likewise, in the source encoding.

Nicol Bolas

unread,
Apr 29, 2016, 2:01:08 PM4/29/16
to ISO C++ Standard - Future Proposals

Considering that the specification has non-normative examples of raw string literals and their non-raw equivalents, and those examples explicitly show that a "source encoding" newline should be equivalent to "\n", clearly the writers of the specification believe that the conversion is not implementation-dependent. So either your interpretation or their interpretation of the spec is wrong.

In any case, it seems to me that "introducing new-line characters for end-of-line indicators" means that the implementation is not allowed to use a platform-specific end-of-line indicator: it must use the source character set value of "\n". That is, what "\n" maps to is implementation-defined. That "end-of-line indicators" must become "\n" is not.

Andrew Tomazos

unread,
Apr 30, 2016, 6:16:20 PM4/30/16
to std-pr...@isocpp.org
The examples you are referring to do not show how the new-line is encoded in the original physical source file.  They cannot, unless they show a hex dump of the bytes of the physical source file.  The examples just show that the "text" new line in the raw string literal (after decoding) can be equivalent to an escaped new line '\n' in an ordinary string literal.  I think the motivation of the example was just to show that raw string literals can contain embedded new lines (unlike ordinary string literals) - among other things.

In Table 7 it says '\n' maps to NL(LF) and that '\r' maps to CR, and offers no further definition of what NL(LF) and CR are.  I assume NL(LF) is the new-line character that is a member of the basic source character set, and that CR is the carriage return that is a member of the basic execution character set.  I think these basic source character set new lines are the same "new-line characters" referred to in the "introducing new-line characters for end-of-line indicators" during phase 1 source decoding.

Nicol Bolas

unread,
Apr 30, 2016, 9:47:47 PM4/30/16
to ISO C++ Standard - Future Proposals
On Saturday, April 30, 2016 at 6:16:20 PM UTC-4, Andrew Tomazos wrote:
On Fri, Apr 29, 2016 at 8:01 PM, Nicol Bolas <jmck...@gmail.com> wrote:
On Friday, April 29, 2016 at 1:16:01 PM UTC-4, Andrew Tomazos wrote:
On Fri, Apr 29, 2016 at 6:01 PM, Nicol Bolas <jmck...@gmail.com> wrote:
On Thursday, April 28, 2016 at 5:39:09 PM UTC-4, Arthur O'Dwyer wrote:
I was just thinking before your post that this has shades of the icky "mode" behavior of fopen(); i.e., it's up to the programmer (and therefore often buggy) whether the file is opened in "b"inary or "t"ext mode. What makes for the "often buggy" part is that the path-of-least-resistance happens to work perfectly on Unix/Linux/OSX, and therefore the vast majority of working programmers never need to learn the icky parts.

What happens on a Windows platform when I write

    const char data[] = R"(
    )";

? Does data come out equivalent to "\n" or to "\r\n"? Does it depend on the compiler (MSVC versus Clang) or not? I don't have a Windows machine to find out for myself, sorry.

According to the standard:

>  A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal.

So it would be a `\n`, not a `\r\n`.

Granted, the above quote is not in normative text (presumably because section 2.14.3 makes it more clear). But clearly that is the intent of the specification. So if VS doesn't do that correctly, then it's broken.

And since VS has had raw string literals for a while now, odds are good someone would have noticed it if they did it wrong.

Not so fast:

"Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined."

This mapping is commonly known as the "source encoding".  As part of the source file, the contents of raw string literals are input, likewise, in the source encoding.

Considering that the specification has non-normative examples of raw string literals and their non-raw equivalents, and those examples explicitly show that a "source encoding" newline should be equivalent to "\n", then clearly the writers of the specification believe that the conversion is not implementation dependent. So either your interpretation or their interpretation of the spec is wrong.

The examples you are referring to do not show how the new-line is encoded in the original physical source file.

It doesn't matter how it's encoded in the original physical source file. So long as the line endings in the physical source file are converted to proper C++ newline characters, everything works fine.

Consider what would happen if you perform #include on a file that consists solely of a raw string literal. Textual file inclusion should be the equivalent of that: it does whatever #include would do, if the bytes of text file being included had been wrapped in a raw string (of the appropriate encoding).
 
They cannot, unless they show a hex dump of the bytes of the physical source file.  The examples just show that the "text" new line in the raw string literal (after decoding) can be equivalent to an escaped new line '\n' in an ordinary string literal.  I think the motivation of the example was just to show that raw string literals can contain embedded new lines (unlike ordinary string literals) - among other things.

The example in the standard does not use the term "can be equivalent". Indeed, the example's wording leaves no room for doubt. Observe:

Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
 
const char* p = R"(a\
b
c)"
;
assert(std::strcmp(p, "a\\\nb\nc") == 0);


"the assert will succeed" does not leave room for the ambiguity you seem to hold to. All of the other examples use the term "is equivalent", not "can be equivalent".

So either the standard has a defect that should be corrected or your view of how raw string literals are allowed to translate is wrong.

In Table 7 it says '\n' maps to NL(LF) and that '\r' maps to CR, and offers no further definition of what NL(LF) and CR are.

Sure it does. It defines '\n' to be "NL(LF)", but it also declares it to be the grammatical construct "new-line", which is used in several places in the C++ grammar.

How they're encoded is essentially irrelevant. What matters is that the character set must provide a value that represents '\n', and when a file is loaded, platform-specific new-lines must be converted into 'new-line'.

I assume NL(LF) is the new-line character that is a member of the basic source character set, and that CR is the carriage return that is a member of the basic execution character set.  I think these basic source character set new lines are the same "new-line characters" referred to in the "introducing new-line characters for end-of-line indicators" during phase 1 source decoding.

You don't have to think; it says so. "new-line" is used quite frequently in the grammar.

Arthur O'Dwyer

unread,
May 1, 2016, 1:31:11 AM5/1/16
to ISO C++ Standard - Future Proposals
On Sat, Apr 30, 2016 at 6:47 PM, Nicol Bolas <jmck...@gmail.com> wrote:
> On Saturday, April 30, 2016 at 6:16:20 PM UTC-4, Andrew Tomazos wrote:
>> On Fri, Apr 29, 2016 at 8:01 PM, Nicol Bolas <jmck...@gmail.com> wrote:
>>> On Friday, April 29, 2016 at 1:16:01 PM UTC-4, Andrew Tomazos wrote:
>>>> On Fri, Apr 29, 2016 at 6:01 PM, Nicol Bolas <jmck...@gmail.com> wrote:
>>>>> On Thursday, April 28, 2016 at 5:39:09 PM UTC-4, Arthur O'Dwyer wrote:
>>>>>>
>>>>>> What happens on a Windows platform when I write
>>>>>>
>>>>>>     const char data[] = R"(
>>>>>>     )";
>>>>>>
>>>>>> ? Does data come out equivalent to "\n" or to "\r\n"? Does it depend
>>>>>> on the compiler (MSVC versus Clang) or not? I don't have a Windows machine
>>>>>> to find out for myself, sorry.
>>>>>
>>>>> According to the standard:
>>>>>
>>>>> >  A source-file new-line in a raw string literal results in a new-line
>>>>> > in the resulting execution string-literal.
>>>>>
>>>>> So it would be a `\n`, not a `\r\n`.
>>>>>
>>>>> Granted, the above quote is not in normative text (presumably because
>>>>> section 2.14.3 makes it more clear). But clearly that is the intent of the
>>>>> specification. [...]

>>>>
>>>> Not so fast:
>>>>
>>>> "Physical source file characters are mapped, in an
>>>> implementation-defined manner, to the basic source character set
>>>> (introducing new-line characters for end-of-line indicators) if necessary.
>>>> The set of physical source file characters accepted is
>>>> implementation-defined."
>>>>
>>>> This mapping is commonly known as the "source encoding".  As part of
>>>> the source file, the contents of raw string literals are input,
>>>> likewise, in the source encoding.

That doesn't contradict what Nicol said. In fact, I'm pretty sure that you two (Andrew, Nicol) are in violent agreement on the points that matter. But regardless, I'm pretty sure that Nicol is right. :)


> It doesn't matter how it's encoded in the original physical source file. So
> long as the line endings in the physical source file are converted to proper
> C++ newline characters, everything works fine.

Right.


> Consider what would happen if you perform #include on a file that consists
> solely of a raw string literal. Textual file inclusion should be the
> equivalent of that: it does whatever #include would do, if the bytes of text
> file being included had been wrapped in a raw string (of the appropriate
> encoding).

This is the part that requires a separate "raw/binary" mode for the inclusion of data files.
Suppose I'm on a Windows platform.
If I use the proposed new "file string literal" construct to include a file "foo.bin" whose physical contents are (hex)

    89 50 4E 47 0D 0A 1A 0A ...

[ http://www.libpng.org/pub/png/spec/1.2/PNG-Structure.html ]
then I definitely do not want that "end-of-line indicator" (0D 0A) converted into a new-line character (0A) by any phase of translation. It is absolutely critical that

    const char mypng[] = RF"foo.bin";

produce exactly the same program as

    const char mypng[] = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, ... };

. Whereas, if the file contains

    68 69 0D 0A 62 79 65 0D 0A ...

then it is absolutely critical that

    const char mymsg[] = F"foo.bin";

produce exactly the same program as

    const char mymsg[] = { 0x68, 0x69, '\n', 0x62, 0x79, 0x65, '\n', ... };

Notice that I used my suggestion of "R" in the first case to mean "take the physical bytes", and left off the "R" in the second case to mean "take the file as if it were textually included as source code going through the various phases of translation".

Notice that in that last sentence I used the cumbersome phrase "as source code..." instead of just saying "encoded in the source character set." That's because the source character encoding isn't the only thing that's relevant here. The format of the end-of-line indicator is not determined by the source character encoding.


>> In Table 7 it says '\n' maps to NL(LF) and that '\r' maps to CR [...]

>> I assume NL(LF) is the new line character that is a member of the basic
>> source character set, and that CR is the carriage return that is a member of
>> the basic execution character set.

No; the four characters '\n' in the source correspond at runtime to the character value of NL in the execution character set.
You're correct that the four characters '\r' in the source correspond at runtime to the character value of CR in the execution character set.
By runtime, no vestiges of the source character encoding remain. By definition, the character encoding that matters at execution time is the execution character encoding. (The standard often abuses the term "character set" to mean "character encoding"; this is unimportant.)

Generally speaking, barring weirdnesses like Green Hills' "euc2jis" mode, the source and execution character sets are identical, and the source and execution character encodings are also identical. However, at the source level we also have the concept of "end-of-line indicator", which doesn't correspond to anything at execution time. On Windows the end-of-line indicator is the pair of characters 0D 0A (that is to say, CR NL). The standard is clear (per Nicol's explanation) that if the end-of-line indicator 0D 0A occurs inside a raw string literal in the source code, it must be translated to a new-line character (that is to say, NL; that is to say, 0A).

If you were on Windows, cross-compiling for an old Mac where the new-line character (in the execution character encoding) was 0D, then the standard would require that

assert(strcmp("\n", "\x0d") == 0);  // axiom: NL in the execution character set is 0D
const char data[] = R"(
)";
assert(strcmp(data, "\n") == 0);  // because the end-of-line indicator on line 2 was converted to NL
assert(strcmp(data, "\x0d") == 0);  // Q.E.D.

If you were on Windows, compiling for Windows where the new-line character (in the execution character encoding) was 0A, then the standard would require that

assert(strcmp("\n", "\x0a") == 0);  // axiom: NL in the execution character set is 0A
const char data[] = R"(
)";
assert(strcmp(data, "\n") == 0);  // because the end-of-line indicator on line 2 was converted to NL
assert(strcmp(data, "\x0a") == 0);  // Q.E.D.


I'm pretty sure I've got all that right. Anyone disagree?

–Arthur

Matthew Woehlke

unread,
May 3, 2016, 11:49:31 AM5/3/16
to std-pr...@isocpp.org
On 2016-04-28 17:39, Arthur O'Dwyer wrote:
> Either way, your proposal should include an example along the lines of
>
> const char example[] = RF"foo(bar.h)foo";
>
> Does this mean "include bar.h", or "include foo(bar.h)foo" — and why?

Certainly the latter; anything else is just overly complicating things
to no benefit.

If you really need a string like "foo" + contents of 'bar.h' + "foo",
use concatenation:

const char example[] = R"(foo)" F"bar.h" R"(foo)";

> I think your proposal should include an example
> along the lines of
>
> #define X(x) F ## #x
> #include X(foo) ".h"
> const char example[] = X(foo) ".h";
>
> I think the behavior of the above #include is implementation-defined or
> possibly undefined; I haven't checked.
> I'm curious whether you'd make the behavior of the above "example"
> equivalent to
>
> const char example[] = F"foo" /*followed by the two characters*/ ".h";
>
> or
>
> const char example[] = F"foo.h";
>
> or ill-formed / implementation-defined / undefined [...] or even "I
> don't care — let the committee sort it out".

Does string literal concatenation even occur during the PP phase? Some
crude experiments with GCC¹ suggest otherwise, which makes this a moot
point (read: the first result is obviously and necessarily "correct").

To get the second, you would probably have to write something more like
(disclaimer: not tested):

#define S(s) #s
#define X(x) F ## S(x.h)
const char example[] = X(foo); // F"foo.h"

(¹ echo 'char const* c = "a" "b" "c";' | gcc -E -)

--
Matthew

Greg Marr

unread,
May 3, 2016, 4:50:41 PM5/3/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Tuesday, May 3, 2016 at 11:49:31 AM UTC-4, Matthew Woehlke wrote:
On 2016-04-28 17:39, Arthur O'Dwyer wrote:
> Either way, your proposal should include an example along the lines of
>
>     const char example[] = RF"foo(bar.h)foo";
>
> Does this mean "include bar.h", or "include foo(bar.h)foo" — and why?

Certainly the latter; anything else is just overly complicating things
to no benefit.

If you really need a string like "foo" + contents of 'bar.h' + "foo",
use concatenation:

I believe that Arthur was comparing to current raw string literals:

const char example[] = R"foo(bar.h)foo";

results in example containing "bar.h".

Does adding the F after the R change the delimiter semantics?

Nicol Bolas

unread,
May 3, 2016, 8:48:50 PM5/3/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com

I thought the point was that `R` already had a meaning and therefore shouldn't be used to mean "binary". There should be a syntax specifically for reading files that should be interpreted as binary data.

Arthur O'Dwyer

unread,
May 3, 2016, 10:04:18 PM5/3/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
Correct, that's what I was getting at.
And the fact that Matthew apparently interpreted the (non-)meaning of the string as something like

    const char example[] = "foo" F"bar.h" "foo";

— i.e. yet a third interpretation — shows that there's room for programmer confusion here. This is why I think an example or two is important, and why I think the exact syntax needs discussion and polishing, so that the final result avoids as much programmer confusion as possible.

–Arthur

Matthew Woehlke

unread,
May 4, 2016, 10:23:17 AM5/4/16
to std-pr...@isocpp.org
Ah... yes, I interpret the original example as being the same as
`F"foo(bar.h)foo"`, with the `R` serving only to specify binary vs. text
include mode. (As Nicol notes, this may be a good reason to use
something other than `R` for that purpose. Maybe we should use `Ft` and
`Fb` instead, with `F` by itself being a synonym for `Ft`?)

I doubt esoteric file names are going to be so common as to justify a
mechanism for naming them other than whatever can be used in e.g.
`#include <foo.h>`. Let's not overengineer the feature :-). If we
*really* need it, we could just specify that escapes are parsed within
the name; that's inconvenient but covers anything, and realistically I
doubt it will be needed much if at all.

(Can you #include a file name with e.g. a newline in its name? I've
never actually had occasion to need to find out...)

--
Matthew

Nicol Bolas

unread,
May 4, 2016, 11:08:53 AM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Wednesday, May 4, 2016 at 10:23:17 AM UTC-4, Matthew Woehlke wrote:
Ah... yes, I interpret the original example as being the same as
`F"foo(bar.h)foo"`, with the `R` serving only to specify binary vs. text
include mode. (As Nicol notes, this may be a good reason to use
something other than `R` for that purpose. Maybe we should use `Ft` and
`Fb` instead, with `F` by itself being a synonym for `Ft`?)

We need more than just `t` and `b` here. We need to be able to use the full range of encodings that C++ provides for string literals: narrow, wide, UTF-8, UTF-16, and UTF-32.

So `u8F` would mean that the file is encoded in UTF-8, so the generated literal should match. I would prefer to avoid `Fu8`, because that makes `Fu8"filename.txt"` seem like the `u8` applies to the filename literal rather than the generated one.

We need to have: `F` (narrow), `LF` (wide), `u8F` (UTF-8), `uF` (UTF-16), `UF` (UTF-32), and `bF` (no translation).

Nicol Bolas

unread,
May 4, 2016, 11:22:40 AM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com

Actually, something just occurred to me about `bF`. Namely, NUL termination.

All of the genuine string literals should be NUL terminated, since that's how we expect literals to behave. But `bF` shouldn't be NUL terminated. So... what do we do?

A string literal is always considered an array in C++, so sizing information is there. But people are very used to discarding sizing information. How many times have you seen the equivalent of this:

const char *str = "SomeLiteral";

We do this all the time. Now, to be fair to us, this is in part because of a long-time lack of `string_view` and an appropriate literal: `auto str = "SomeLiteral"sv;`.

But my point is that people frequently discard sizing information of string literals. Because of NUL termination, that's generally OK, if slow. NUL characters in strings are quite rare, thanks to long-standing practice of using NUL characters as terminators.

NUL characters in binary files however are very common. So using `strlen` to recompute the lost length is not workable.

Given that we're adding something new to the language anyway (the encoding formats I used above could simply be a new case for `string-literal`, using `encoding-prefix<opt>`, with the exception of `b`), maybe we could also add some language that will prevent this. Perhaps a `bF` string literal is a special "binary literal" that can have some different rules. It could have the type `const unsigned char[X]`, but we could add language so that the literal itself would not decay into a pointer. You could still do this:

const unsigned char var[] = bF"Filename.bin";
const unsigned char* pVar = var;

But you couldn't do this directly:

const unsigned char* pVar = bF"Filename.bin";

I do feel that we should make `bF` return an array of `unsigned char`'s though.

Matthew Woehlke

unread,
May 4, 2016, 12:24:17 PM5/4/16
to std-pr...@isocpp.org
On 2016-05-04 11:22, Nicol Bolas wrote:
> Actually, something just occurred to me about `bF`. Namely, NUL termination.
>
> All of the genuine string literals should be NUL terminated, since that's
> how we expect literals to behave. But `bF` shouldn't be NUL terminated.
> So... what do we do?

I'm not sure that's genuinely a problem¹. Many file formats, even
binary, are likely tolerant of a "stray" NUL at the end, and even if
not, I can't think how you would use such a string without specifying
the length, in which case it would be trivial to subtract 1.

(¹ Didn't we have this conversation already? It seems familiar...)

> A string literal is always considered an array in C++, so sizing
> information is there. But people are very used to *discarding* sizing
> information. How many times have you seen the equivalent of this:
>
> const char *str = "SomeLiteral";

Well, it's fine (if inefficient) for text.

> Given that we're adding something new to the language anyway (the encoding
> formats I used above could simply be a new case for `string-literal`, using
> `encoding-prefix<opt>`, with the exception of `b`), maybe we could also add
> some language that will prevent this. Perhaps a `bF` string literal is a
> special "binary literal" that can have some different rules. It could have
> the type `const unsigned char[X]`, but we could add language so that the
> literal itself would not decay into a pointer. You could still do this:
>
> const unsigned char var[] = bF"Filename.bin";
> const unsigned char* pVar = var;
>
> But you couldn't do this directly:
>
> const unsigned char* pVar = bF"Filename.bin";

Compilers should almost definitely *warn* about that. I'm less convinced
it needs to be forbidden in the standard.

--
Matthew

Arthur O'Dwyer

unread,
May 4, 2016, 4:09:34 PM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
Incorrect (for once ;)).
The prefixes u8, u, U, and L don't apply to the encoding of the *source code*. They apply to the encoding of the *runtime data*.
For example:

    const char d1[] = "abc";
    const char d2[] = u8"abc";
    assert(d1[0] == 'a');  // but not necessarily 0x61
    assert(d2[0] == 0x61);  // but not necessarily == 'a'

This source file can be encoded in ASCII or EBCDIC or otherwise, and compiled for a target platform that uses ASCII or EBCDIC or otherwise; no matter which of those 3x3 = 9 possibilities you pick, both asserts are guaranteed by the Standard to pass.

The source construct that looks like "a" (no matter what source-encoded bytes represent those three glyphs) is guaranteed to correspond at runtime to the bytes of data that represent the letter "a" in the runtime character encoding.
The source construct that looks like u8"a" (no matter what source-encoded bytes represent those five glyphs) is guaranteed to correspond at runtime to the bytes of data that represent the letter "a" in UTF-8, regardless of the runtime character encoding.

Therefore, leaving aside the issue of binary (raw) data and null-termination for a moment, the following constructs are unambiguous, if let's say "foo.txt" contains the text "foo" in source encoding:

    const char a[] = F"foo.txt";  assert(strcmp(a, "foo") == 0);
    const wchar_t b[] = LF"foo.txt";  assert(wcscmp(b, L"foo") == 0);
    const char c[] = u8F"foo.txt";  assert(memcmp(c, u8"foo", 4) == 0);
    const char16_t d[] = uF"foo.txt";  assert(memcmp(d, u"foo", 8) == 0);
    const char32_t e[] = UF"foo.txt";  assert(memcmp(e, U"foo", 16) == 0);

Andrew's proposal handles all of these cases perfectly.

The problems arise only when you don't want F"foo.txt" to behave like any variety of string literal — i.e., you don't want the compiler to run the data through the "decode source encoding into glyphs" pass.
Suppose I want to achieve the effect of

    const char single_byte_weights[] = { '\x86', '\x96', '\x96' };  // toy example; this array might be hundreds of elements long

So I write

    const char single_byte_weights[] = F"weights.bin";

where weights.bin is a 3-byte file containing the bytes 134, 150, 150 (that is, 0x86 0x96 0x96).
Now, I happen to be on an EBCDIC system, where

    const char f[] = "foo";
    assert(memcmp(f, "\x86\x96\x96", 4) == 0);

I compile my code on my EBCDIC platform, and it works fine.
Then I port my code to an ASCII platform. Obviously I don't do any kind of translation on my weights.bin file; that's raw data and I don't want those byte values to change. I compile my code on the ASCII platform:

    const char single_byte_weights[] = F"weights.bin";

and the compiler blows up! It's complaining that the byte 0x86 in "weights.bin" doesn't correspond to any ASCII character, so it's not legal to appear in (what's tantamount to) a string literal in my C++ source code.


Or suppose I'm on a Unix platform and I want the effect of

    const char single_byte_weights[] = { 13, 10 };

I compile

    const char single_byte_weights[] = F"weights.bin";

where weights.bin is a 2-byte file containing the bytes 13, 10.
It works fine on Unix.
Then I port my code to Windows. Obviously I don't do any kind of translation on my weights.bin file; that's raw data and I don't want those byte values to change. I compile my code on the Windows platform...
and the compiler happily accepts it...
but at runtime I notice that my weights are all wrong! The debugger tells me that my code has mysteriously changed to the equivalent of

    const char single_byte_weights[] = { 10 };

These are the kinds of problems that a "raw" or "binary" file-input mode would solve — they're problems related to source encoding. Problems related to the runtime encoding of glyphs into bytes are already thoroughly solved by the existing prefix system, which composes fine with the F-prefix.

HTH,
–Arthur

Arthur O'Dwyer

unread,
May 4, 2016, 4:18:16 PM5/4/16
to ISO C++ Standard - Future Proposals
On Wed, May 4, 2016 at 9:24 AM, Matthew Woehlke <mwoehlk...@gmail.com> wrote:
On 2016-05-04 11:22, Nicol Bolas wrote:
> Actually, something just occurred to me about `bF`. Namely, NUL termination.
>
> All of the genuine string literals should be NUL terminated, since that's
> how we expect literals to behave. But `bF` shouldn't be NUL terminated.
> So... what do we do?

I'm not sure that's genuinely a problem¹. Many file formats, even
binary, are likely tolerant of a "stray" NUL at the end, and even if
not, I can't think how you would use such a string without specifying
the length, in which case it would be trivial to subtract 1.

IMO this is a genuine problem.
The most obvious use-case for raw binary file input would be as a portable replacement for "xxd -i".
If you're trying to replace

    const char data[] = { 0x01, 0x02, 0x03 };

with a file-input construct, well, you just can't, unless the file-input construct allows you to specify an array that doesn't end with 0x00. If it shoves a 0x00 at the end of every array it creates, then not only will your executable wind up "too big" (and all your subsequent data wind up shoved down by one byte, which can indeed cause problems for embedded systems) but also sizeof(data) will be 4 instead of 3, so any code that uses sizeof(data) will be wrong... It's just a mess. Let's not do that.

I believe that prefix-b or prefix-R or whatever ends up getting proposed might as well do both things at once:
- don't apply source-encoding-to-character decoding
- don't add null terminator
that is, it should behave just like "xxd -i" (AFAIK).
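
(For reference, this is roughly what "xxd -i" emits for a 3-byte file; the identifiers are derived from the file name:)

    // $ xxd -i data.bin
    unsigned char data_bin[] = {
      0x01, 0x02, 0x03
    };
    unsigned int data_bin_len = 3;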

–Arthur

Nicol Bolas

unread,
May 4, 2016, 7:05:20 PM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Wednesday, May 4, 2016 at 4:09:34 PM UTC-4, Arthur O'Dwyer wrote:
On Wed, May 4, 2016 at 8:08 AM, Nicol Bolas <jmck...@gmail.com> wrote:
On Wednesday, May 4, 2016 at 10:23:17 AM UTC-4, Matthew Woehlke wrote:
Ah... yes, I interpret the original example as being the same as
`F"foo(bar.h)foo"`, with the `R` serving only to specify binary vs. text
include mode. (As Nicol notes, this may be a good reason to use
something other than `R` for that purpose. Maybe we should use `Ft` and
`Fb` instead, with `F` by itself being a synonym for `Ft`?)

We need more than just `t` and `b` here. We need to be able to use the full range of encodings that C++ provides for string literals: narrow, wide, UTF-8, UTF-16, and UTF-32.

So `u8F` would mean that the file is encoded in UTF-8, so the generated literal should match. I would prefer to avoid `Fu8`, because that makes `Fu8"filename.txt"` seem like the `u8` applies to the filename literal rather than the generated one.

We need to have: `F` (narrow), `LF` (wide), `u8F` (UTF-8), `uF` (UTF-16), `UF` (UTF-32), and `bF` (no translation).

Incorrect (for once ;)).
The prefixes u8, u, U, and L don't apply to the encoding of the *source code*. They apply to the encoding of the *runtime data*.

So you're saying that you will always be limited to the source character set. So... what if the source character set doesn't include all of the characters you want to use? What if you have a Unicode-encoded file you want to include?

In a regular C++ file, you can work around this by escaping characters in string literals. C++11 allows you to do `u8"\u4321"`, and it will convert that Unicode code unit into a UTF-8 sequence.
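
For concreteness, U+4321 encodes to the three UTF-8 code units 0xE4 0x8C 0xA1, plus the terminator:

    static_assert(sizeof(u8"\u4321") == 4, "3 UTF-8 code units + NUL");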

How do I do the same with file input? There seem to be 3 alternatives:

1: Escape sequences in the included file are processed as though they were in a non-raw string literal. That's... bad. I'm pretty sure most people don't want to have escape sequences work that way, especially if they're including text for other languages like scripting languages.

2: No escape sequences are allowed, which means that the character data for inclusions is limited to only the implementation-defined source character set. No Unicode characters, nada.

3: The user has the ability to tell the compiler what character set is being read at the inclusion point.

#3 is what I am referring to with those prefixes.

I recognize that, with binary file loading, you can get the effective equivalent of Unicode encoding by loading the file as binary data. But it would be very strange indeed if you couldn't include a UTF-16 file as a genuine `const char16_t[]` literal, if you had to do a cast from the binary literal to a string literal.

Especially given endian conversion problems. It'd be great if we could load a UTF-16 file into our executable and have it use the correct endian for the platform the executable is running.

So we seem to have two semi-orthogonal dimensions of options: the format of the source file and the desired format of the converted string literal. Source files can be:

- Source character set
- Unicode, UTF-8
- Unicode, UTF-16
- Unicode, UTF-32
- Binary, no translation.

String literal formats include, along with their associated source restrictions:

- Narrow characters: restricted to "source character set" inputs.
- Wide characters: restricted to "source character set" inputs.
- UTF-8: restricted to non-"Binary" inputs.
- UTF-16: restricted to non-"Binary" inputs.
- UTF-32: restricted to non-"Binary" inputs.
- Binary: restricted to "Binary" only.

Given the complexities of the needs here, I have no idea how to specify this. I really don't want to see: `u8Fu8`, but that's pretty much the only way I can imagine it working.

Alternatively, we can make the Unicode reading a specialized form of binary loading. So if you say `u8bF`, what you're saying is that no text translation will be done, but the file's data will be assumed to be a UTF-8 string. Similarly, with `ubF`, the file will be read as a UTF-16 string with no text translation. `bF` would be pure binary as `const unsigned char[]`. The Unicode versions would be NUL terminated, but they would have no text translation done beyond that.

The only use cases with this that would be missed are:

- Unicode transcoding. Reading a UTF-16 file and re-encoding it as UTF-8. Probably not a compelling use case.

- Text translation with Unicode files. That is, if you have a UTF-8 file that contains platform-specific newlines and you want it translated to platform-neutral ones.

That's a compromise I can live with.

Arthur O'Dwyer

unread,
May 4, 2016, 9:15:22 PM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Wed, May 4, 2016 at 4:05 PM, Nicol Bolas <jmck...@gmail.com> wrote:
[...]

#3 is what I am referring to with those prefixes.

Okay, I now see the problem you're trying to solve, at least. (But maybe I'm still missing something; see below.)
You're saying: Suppose Alice is on an EBCDIC system (source encoding), compiling for a system whose runtime encoding is UTF-8, and Alice wants the file-input equivalent of

    const char32_t data[] = U"\u20AC";

Alice can't represent U+20AC EURO SIGN natively in her EBCDIC source encoding, so she can't use the "text"/"source-encoded" form of file-input (i.e. the one in Andrew's proposal).
Alice could create a four-byte file with the contents 0x00 0x00 0x20 0xAC and include it using the "raw"/"un-encoded" form of file-input, but that just gets her an array[4] of bytes (say, char). You can't convert those four bytes to a single char32_t unless you know the platform's endianness, which is a whole *other* can of worms.
So therefore, Alice has a problem that C++11 string literals can solve but file-input syntax cannot. This is unfortunate.
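
(Concretely, here is a minimal sketch of the reassembly Alice would have to write by hand over those four raw bytes, with big-endian order baked in as an assumption:)

    // Rebuild one char32_t from four raw bytes, *assuming* they are big-endian.
    constexpr char32_t from_be(const unsigned char (&b)[4]) {
        return (char32_t(b[0]) << 24) | (char32_t(b[1]) << 16)
             | (char32_t(b[2]) <<  8) |  char32_t(b[3]);
    }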


So we seem to have two semi-orthogonal dimensions of options: the format of the source file and the desired format of the converted string literal. Source files can be:

- Source character set
- Unicode, UTF-8
- Unicode, UTF-16
- Unicode, UTF-32
- Binary, no translation.

My contentions here include:
- Saying a source file is "UTF-16" or "UTF-32" doesn't make sense, because source files (and files in general, IMHO) are sequences of bytes. You can't have a "file" of 16-bit or 32-bit units unless you start talking about endianness, which is a whole can of worms.
- In current C++, u8"foo" and "foo" have the same type, so "file of UTF-8" and "file of untranslated bytes" are pretty much the same thing in my mind. I would agree that this state of affairs kind of sucks, but it's also kind of nice for the people lucky enough to live in a Unicode world already (which most of the time includes me).
- Whether the file is encoded in the source character set is technically orthogonal to whether it undergoes newline translation (CRLF to LF, on Windows), but newline translation is very important and we must specify it somehow.
- Whether the file is encoded in the source character set is orthogonal to whether the programmer wants a terminating '\0'; this is also important and we must specify it somehow.

[...]
Alternatively, we can make the Unicode reading a specialized form of binary loading. So if you say `u8bF`, what you're saying is that no text translation will be done, but the file's data will assumed to be a UTF-8 string. Similarly, with `ubF`, the file will be read as a UTF-16 string with no text translation.

I contend that "the file will be read as a UTF-16 string" is underspecified; you have to say either "the file will be read as a little-endian UTF-16 string" or "the file will be read as a big-endian UTF-8 string" or "the file will be read as a UTF-16 string respecting (and discarding) the leading Byte Order Mark; if no BOM is present the file will be treated as (big-endian, little-endian, ill-formed)".
 
[...]
The only use cases with this that would be missed are:

- Unicode transcoding. Reading a UTF-16 file and re-encoding it as UTF-8. Probably not a compelling use case.

- Text translation with Unicode files. That is, if you have a UTF-8 file that contains platform-specific newlines and you want it translated to platform-neutral ones.

I'd agree that neither of these cases are useful enough to need solving. However, isn't the "Unicode transcoding" case exactly the Alice case I described above as "the problem you're trying to solve"? If you don't care about that scenario... can you give me a concrete example of a Bob case that you do care about, with the same level of detail?

Thanks,
Arthur

Andrew Tomazos

unread,
May 4, 2016, 10:35:55 PM5/4/16
to std-pr...@isocpp.org
"Big-endian UTF-8" is nonsensical.  UTF-8 has no byte-wise endianness.

Big-endian UTF-16 is spelled UTF-16BE.

Little-endian UTF-16 is spelled UTF-16LE.

A file encoded in UTF-16 means it is either encoded in UTF-16LE or UTF-16BE.  The platform endianness has no impact on that.  They are two different encodings.  You have to know which of the two encodings the file is in to decode it, which can be deduced from a BOM if present.

We can talk about a sequence of 16-bit integer code units being encoded in UTF-16 without talking about endianness.  This is because the code units have already been decoded from bytes to integers.

My current design has two input formats.  Binary and text.  The text encoding format is the usual source encoding for your source files (the overwhelming default these days is UTF-8).  If you have a text file in a different encoding, then you can transcode it into source encoding within your source files.  It will then undergo the usual source-to-execution transcoding that the bodies of raw string literals undergo. Alternatively, you can read it in as a binary file and then transcode it yourself into execution encoding with constexpr programming in whatever fashion you want.

Beyond these two options, I do not think further built-in support for a specific set of different source text encoding formats is necessary.

I am struggling with the binary file literal size issue.  I do not think that null-termination should be present on binary file literals, to discourage treating them as C strings.  Binary file literals in the general case do contain embedded nulls.  I would like them to be arrays of char, but they would then easily coerce to const char*, which looks like a C string.  I would like to be able to write a constexpr function f such that both f(F"foo") and f(bF"foo") will work and properly get the range of bytes that the two literals present.  It is conceivable that using unsigned char instead of char for the binary file literals would help here:

   void f(const char* cstr);  // use cstr[0] through cstr[strlen(cstr)-1]

   template<size_t N>
   void f(const unsigned char (&arr)[N]); // use arr[0] through arr[N-1]
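
If the two literal kinds have the types sketched above, overload resolution sorts the calls out cleanly (F and bF being the proposed syntax):

    f(F"foo.txt");   // const char[N] can't deduce against the unsigned char
                     // overload, so it decays to const char*; strlen gives the length
    f(bF"foo.bin");  // const unsigned char[N] binds to the array reference,
                     // so N carries the exact size, embedded nulls and all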

Ultimately I would like this to work:

   constexpr string_view sv = bF"foo";

It is unclear to me whether the seemingly necessary reinterpret_cast from const unsigned char* to const char* is allowed in constexpr programming.  This is where I am stuck currently.
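
For what it's worth, the current rules ([expr.const]) exclude reinterpret_cast from constant expressions entirely, so the naive spelling is rejected today:

    constexpr const char* p =
        reinterpret_cast<const char*>(bF"foo");  // ill-formed: reinterpret_cast
                                                 // in a constant expression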



Matthew Woehlke

unread,
May 5, 2016, 10:53:02 AM5/5/16
to std-pr...@isocpp.org
On 2016-05-04 19:05, Nicol Bolas wrote:
> Alternatively, we can make the Unicode reading a specialized form of
> binary loading. [...]
>
> That's a compromise I can live with.

Personally, I'd prefer if the source (if text) is required to be in
Unicode with a BOM, *but* newline translation is performed. If we
require a BOM (except in UTF-8; lack of a BOM would therefore imply
UTF-8), we can trivially identify which flavor of Unicode is used by the
input file.
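
(A sketch of that trivial identification, for concreteness; the names are
made up, and std::size_t needs <cstddef>:)

    enum class Flavor { Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

    Flavor sniff(const unsigned char* b, std::size_t n) {
        // Check the longer BOMs first: FF FE 00 00 would otherwise be
        // misread as the UTF-16LE BOM FF FE.
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 &&
                      b[2] == 0xFE && b[3] == 0xFF) return Flavor::Utf32BE;
        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE &&
                      b[2] == 0x00 && b[3] == 0x00) return Flavor::Utf32LE;
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB &&
                      b[2] == 0xBF)                 return Flavor::Utf8;
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Flavor::Utf16BE;
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Flavor::Utf16LE;
        return Flavor::Utf8;  // no BOM implies UTF-8
    }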

Losing newline translation seems... unfortunate.

I suppose the major problem would be if a file needs to be used both via
this C++ include mechanism and also some other context (e.g. a source
file of some other language that is both embedded and also used directly
by that language's compiler/interpreter).

That said... keep in mind that compile-time string processing may
provide a better (more flexible, at least, if almost certainly far more
verbose) mechanism for coping with these issues. For example, per
Andrew's comment, if you *must* read a file in other than the source
encoding, you could read it in binary and run it through such mechanism
to produce whatever run-time representation is required.

I have to wonder, again, if we're trying to overengineer the feature...

--
Matthew

Tom Honermann

unread,
May 5, 2016, 11:01:37 AM5/5/16
to std-pr...@isocpp.org
On 5/5/2016 10:52 AM, Matthew Woehlke wrote:
> Personally, I'd prefer if the source (if text) is required to be in
> Unicode with a BOM...

Please, no. On non-ASCII based systems, forcing Unicode would be a
significant burden to users.

Tom.

Matthew Woehlke

unread,
May 5, 2016, 11:15:41 AM5/5/16
to std-pr...@isocpp.org
Uh, you *do* realize that ASCII is a subset of UTF-8, yes? Requiring
input files to be Unicode with BOM (no BOM → UTF-8) would be fully
compatible with ASCII-conforming input files.

--
Matthew

Tom Honermann

unread,
May 5, 2016, 11:18:26 AM5/5/16
to std-pr...@isocpp.org
I said on *non*-ASCII based systems.

Also, UTF-8 files may have a BOM.

Tom.

Nicol Bolas

unread,
May 5, 2016, 11:26:36 AM5/5/16
to ISO C++ Standard - Future Proposals
On Wednesday, May 4, 2016 at 10:35:55 PM UTC-4, Andrew Tomazos wrote:
My current design has two input formats.  Binary and text.  The text encoding format is the usual source encoding for your source files (the overwhelming default these days is UTF-8).  If you have a text file in a different encoding, then you can transcode it into source encoding within your source files. It will then undergo the usual source-to-execution transcoding that the bodies of raw string literals undergo.

Source encodings are not required to be able to support all of Unicode. And without escape characters and `\U`, I have no way to "transcode it into source encoding" because the source encoding cannot support it.  If I have a Unicode-encoded file under your rules, the only thing I can do is load it as binary. And thanks to cross-compiling, the executable environment may have a different endianness than the source environment.

Beyond these two options, I do not think futher built-in support for a specific set of different source text encoding formats is necessary.

Internationalization is not optional. Nor is cross-compilation support.
 
  I would like to be able to write a constexpr function f such that both f(F"foo") and f(bF"foo") will work and properly get the range of bytes that the two literals present.

... why? Generally speaking, functions that process text and functions that process binary data are different functions. Unless your binary data actually is text, but that doesn't really make sense.

Ultimately I would like this to work:

   constexpr string_view sv = bF"foo";

Why? `string_view`, as the name suggests, is for strings. The class you want is called `span<unsigned char>`. That's how we spell "array of arbitrary bytes" in C++.

Matthew Woehlke

unread,
May 5, 2016, 11:29:49 AM5/5/16
to std-pr...@isocpp.org
Ah, sorry, misread :-). But in that case, Andrew's suggestion (just read
it as source encoding, and if you need something else, "too bad"¹)
works. I'm more inclined to that anyway.

(¹ Use compile-time text processing for this case if you simply *must*
have it.)

--
Matthew

Nicol Bolas

unread,
May 5, 2016, 11:49:44 AM5/5/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Thursday, May 5, 2016 at 11:29:49 AM UTC-4, Matthew Woehlke wrote:
[...]
But in that case, Andrew's suggestion (just read
it as source encoding, and if you need something else, "too bad"¹)
works. I'm more inclined to that anyway.

It's silly to give up now that we're so close to a functional design. You pointed out that you can identify Unicode-encoded files by their BOMs. So instead of the giant number of sources, we actually only have:

- Source character set
- Unicode, in a format as identified by BOMs
- Binary

`F` would mean source character set. `Fb` would mean binary. And `Fu` would mean Unicode, as identified by BOMs. You can still apply the encoding prefixes to the non-binary forms, and indeed you must provide one for `Fu`. For example:

- u8F: Read source character set, convert to UTF-8.
- uFu: Read Unicode text, convert to UTF-16.

And so forth.

`Fu` would specifically mean:

- Platform-specific text translation (new-lines and so forth).
- BOM to identify the source Unicode encoding and endian. Lack of BOM automatically means UTF-8, but the UTF-8 BOM will also mean UTF-8. BOM is stripped out.
- NUL-terminated.

The encoding prefix allows cross-Unicode conversion. So if you have a UTF-16 text file and you want to store it as UTF-8, you use `u8Fu`.

It's a simple and elegant solution to the source encoding problem.
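
Spelled out as declarations, the scheme would look something like this (hypothetical syntax, obviously):

    const char          a[] = F"notes.txt";    // source charset in, narrow out
    const char          b[] = u8F"notes.txt";  // source charset in, UTF-8 out
    const char16_t      c[] = uFu"notes.txt";  // BOM-sniffed Unicode in, UTF-16 out
    const unsigned char d[] = Fb"blob.bin";    // raw bytes, no translation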


(¹ Use compile-time text processing for this case if you simply *must*
have it.)

The problem with that is that you are assuming that Unicode-encoded files represent a minor province of what people do. I disagree. "Must have it" is something that a lot of people will need.

There's a reason why C++11 added explicit support for Unicode-encoded literals. Even if they are written in the source character set, the fact that you get Unicode-defined results for them is important.

People will want to use textual inclusion for many kinds of data tables. Some of these tables will include Internationalization strings. And that requires Unicode support.

Furthermore, it is not clear to me at all that one actually can implement compile-time conversion support. It seems to me that lengthy compile-time conversions could run afoul of compiler limitations, depending on the size of the input data. Whereas built-in support would not only be faster but far less likely to encounter such limits.

Andrew Tomazos

unread,
May 5, 2016, 11:49:57 AM5/5/16
to std-pr...@isocpp.org
On Thu, May 5, 2016 at 5:26 PM, Nicol Bolas <jmck...@gmail.com> wrote:
[...]

Source encodings are not required to be able to support all of Unicode. And without escape characters and `\U`, I have no way to "transcode it into source encoding" because the source encoding cannot support it.  If I have a Unicode-encoded file under your rules, the only thing I can do is load it as binary. And thanks to cross-compiling, the executable environment may have a different endianness than the source environment.

It's very rare to be cross-compiling from a build system that is less powerful than the target system in this fashion.

Nevertheless, in such very rare cases the binary option I outlined works fine.  You can compile-time configure the constexpr-programmed transcoding to produce whatever execution encoding or endianness you want.
 

Beyond these two options, I do not think futher built-in support for a specific set of different source text encoding formats is necessary.

Internationalization is not optional. Nor is cross-compilation support.

The second option works for both internationalization and cross-compilation.
 
 
  I would like to be able to write a constexpr function f such that both f(F"foo") and f(bF"foo") will work and properly get the range of bytes that the two literals present.

... why? Generally speaking, functions that process text and functions that process binary data are different functions. Unless your binary data actually is text, but that doesn't really make sense.

Ultimately I would like this to work:

   constexpr string_view sv = bF"foo";

Why? `string_view`, as the name suggests, is for strings. The class you want is called `span<unsigned char>`. That's how we spell "array of arbitrary bytes" in C++.

People use string_view to address ranges of "raw memory", and not just for text.

span hasn't been standardized yet, and may never be.

Andrew Tomazos

unread,
May 5, 2016, 12:10:25 PM5/5/16
to std-pr...@isocpp.org
That's not the claim.  The claim is that the number of people using both an unusual source encoding and a Unicode execution encoding is very small.  I think that is correct.

That given, supporting that small (or perhaps even nonexistent) group with the DIY constexpr/binary option, seems more reasonable than introducing the complexity of multi-source-encoding into the feature.

Nicol Bolas

unread,
May 5, 2016, 12:17:09 PM5/5/16
to ISO C++ Standard - Future Proposals


On Thursday, May 5, 2016 at 11:49:57 AM UTC-4, Andrew Tomazos wrote:
[...]

It's very rare to be cross-compiling from a build system that is less powerful than the target system in this fashion.

Endian-ness has nothing to do with "power". If I'm on a desktop machine (little-endian), and I'm compiling for Android (usually big-endian), that's a problem.

And such cross-compilation is not exactly "rare" these days either.

Ultimately I would like this to work:

   constexpr string_view sv = bF"foo";

Why? `string_view`, as the name suggests, is for strings. The class you want is called `span<unsigned char>`. That's how we spell "array of arbitrary bytes" in C++.

People use string_view to address ranges of "raw memory", and not just for text.

No, people use string_view for that because `span<unsigned char>` is not available to them.

We should not encourage the continued abuse of such types.

Nicol Bolas

unread,
May 5, 2016, 12:22:18 PM5/5/16
to ISO C++ Standard - Future Proposals

What is "an unusual source encoding?" Unicode is not "unusual".

That given, supporting that small (or perhaps even nonexistant) group with the DIY constexpr/binary option, seems more reasonable than introducing the complexity of multi-source-encoding into the feature.

Given that we have no evidence that the "DIY constexpr/binary option" is even possible (for example, how do you detect what endianness your execution environment uses?), I see no reason to ignore such use cases.

Also, what "complexity" are we talking about? I added a single suffix to `F`. That's not exactly complex.

Thiago Macieira

unread,
May 5, 2016, 12:38:28 PM5/5/16
to std-pr...@isocpp.org
On quinta-feira, 5 de maio de 2016 08:26:35 PDT Nicol Bolas wrote:
> If I
> have a Unicode-encoded file under your rules, the only thing I can do is
> load it as binary. And thanks to cross-compiling, the executable
> environment may have a different endian than the source environment.

If we add this feature to the language, then you must be able to do a
constexpr byte-swap of the entire contents and reverse the endianness to
the native order of your target platform.

Let me emphasise: you MUST be able to. This kind of constexpr transformation
is a requirement of the feature. If it is not possible to do it, if instead we
have to rely on other tools to do the transformation before the compiler is
run, then we don't need this feature at all.
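
(Nothing exotic is needed; a minimal constexpr byte-swap over UTF-16
code units is along these lines:)

    constexpr char16_t byteswap16(char16_t v) {
        return char16_t((v << 8) | (v >> 8));
    }
    static_assert(byteswap16(0x20AC) == 0xAC20, "swaps the two bytes");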

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Andrew Tomazos

unread,
May 5, 2016, 12:41:52 PM5/5/16
to std-pr...@isocpp.org
On Thu, May 5, 2016 at 6:22 PM, Nicol Bolas <jmck...@gmail.com> wrote:
On Thursday, May 5, 2016 at 12:10:25 PM UTC-4, Andrew Tomazos wrote:
On Thu, May 5, 2016 at 5:49 PM, Nicol Bolas <jmck...@gmail.com> wrote:
[...]

The problem with that is that you are assuming that Unicode-encoded files represent a minor province of what people do.

That's not the claim.  The claim is that the number of people using both an unusual source encoding and a Unicode execution encoding is very small.  I think that is correct.

What is "an unusual source encoding?" Unicode is not "unusual".

One that "cannot support all of Unicode".  Yes, of course Unicode is not unusual - it is the norm these days.

In order to not be able to use my first solution (that is, manually transcoding the text file into the source encoding before checking it in with your source files), the source encoding must be unusual.

That given, supporting that small (or perhaps even nonexistant) group with the DIY constexpr/binary option, seems more reasonable than introducing the complexity of multi-source-encoding into the feature.

Given that we have no evidence that "DIY constexpr/binary option" is even possible (for example, how do you detect what endian your execution environment is?) I see no reason to ignore such use cases.

It works fine.  I've done much more complicated things with constexpr programming than transcoding some text.

As for "detecting the endianness" of your execution environment, you can either get it from your compiler predefined macros / intrinsics (if available) or specify it explicitly as part of your build configuration (setting for example a macro or constexpr variable).  It's not a big deal.
 
Also, what "complexity" are we talking about? I added a single suffix to `F`. That's not exactly complex.

It increases the number of encoding prefixes quadratically for everyone - to solve what seems to be a small problem for which there are adequate simpler solutions.

Matthew Woehlke

unread,
May 5, 2016, 1:44:03 PM5/5/16
to std-pr...@isocpp.org
On 2016-05-05 11:49, Nicol Bolas wrote:
> On Thursday, May 5, 2016 at 11:29:49 AM UTC-4, Matthew Woehlke wrote:
>> (¹ Use compile-time text processing for this case if you simply *must*
>> have it.)
>
> The problem with that is that you are assuming that Unicode-encoded files
> represent a minor province of what people do. I disagree. "Must have it" is
> something that a *lot* of people will need.

Rather, I expect that most people will only need to read source
character set. (Bearing in mind that for many people, "source character
set" is already UTF-8...)

Also, I'm thinking more of esoteric cases where the input and/or output
are not Unicode.

Having three input modes (source character set, auto-detected flavor of
unicode, binary / no translation) seems sane, and allows explicit
support for e.g. reading UTF-16 with arbitrary endianness into the
correct run-time representation. This is not much of an addition from
only two modes (SCS, binary).

--
Matthew

Nicol Bolas

unread,
May 5, 2016, 1:48:50 PM5/5/16
to ISO C++ Standard - Future Proposals

Wait a second, I just realized something.

The whole point of file inclusions is to avoid "manually transcoding your source file", right? After all, if you're going to do that, if you have integrated such transcoding into the build process... why not just go all the way to an actual C++ source file? If your source character set can actually handle this, just turn the file into an actual raw string literal or whatever and #include it.

So you seem to have undermined the very purpose of your own proposal. Why should some users have to "manually transcode" their text files while other users don't?
 
...the source encoding must be unusual.

ASCII is not "unusual" for a source character set. The default source character set for quite a few compilers does not natively handle embedded Unicode characters.

That given, supporting that small (or perhaps even nonexistant) group with the DIY constexpr/binary option, seems more reasonable than introducing the complexity of multi-source-encoding into the feature.

Given that we have no evidence that "DIY constexpr/binary option" is even possible (for example, how do you detect what endian your execution environment is?) I see no reason to ignore such use cases.

It works fine.  I've done much more complicated things with constexpr programming than transcoding some text.
 
As for "detecting the endianness" of your execution environment, you can either get it from your compiler predefined macros / intrinsics (if available) or specify it explicitly as part of your build configuration (setting for example a macro or constexpr variable).  It's not a big deal.

So it's not something you can put in the standard library. It's a thing everyone has to write individually.

I fail to see why adding a simple source encoding suffix is a worse alternative.
 
Also, what "complexity" are we talking about? I added a single suffix to `F`. That's not exactly complex.

It increases the number of encoding prefixes quadratically for everyone - to solve what seems to be a small problem for which there are adequate simpler solutions.

Please look at the actual design I posted. It doesn't increase the number of encoding prefixes at all. It adds a different loading prefix. `F` for source character set, `Fb` for binary, and `Fu` for Unicode. None of those are encoding prefixes.

There is no quadratic increase of anything.

Matthew Woehlke

unread,
May 5, 2016, 2:11:55 PM5/5/16
to std-pr...@isocpp.org
On 2016-05-05 12:41, Andrew Tomazos wrote:
> On Thu, May 5, 2016 at 6:22 PM, Nicol Bolas wrote:
>> Also, what "complexity" are we talking about? I added a single suffix to
>> `F`. That's not exactly complex.
>
> It increases the number of encoding prefixes quadratically for everyone -
> to solve what seems to be a small problem for which there are adequate
> simpler solutions.

I'm not convinced that's true. Yes, the *possible combinations* increase
geometrically, but the actual complexity increase can be linear¹, as the
input and output / run-time encoding are orthogonal.

First, let's set aside binary mode. I don't think it makes sense to even
permit transcoding of binary, e.g. `uFb` would be illegal. So we have a
prefix of the form `<out>F<in>`, where `<out>` is any of the usual `u`,
`u8`, ``, etc. and `<in>` is `u` or ``.

Now, except for the case of source encoding == run-time encoding, the
compiler *already* needs to be able to transcode², and can do so by
converting the input text into One True Format (probably some form of
Unicode), which is then converted into the requested output format. The
point is, these steps are *separable*, so adding additional input
formats is O(n), not O(n²).

For `<out>Fu`, the compiler reads the file, transcodes the text into an
internal representation (also normalizing line endings), then transcodes
the intermediary result into the requested run-time format.

(¹ Compilers *may*, but are not *required*, to implement direct
transcoding paths without using an intermediary "universal" encoding.
But see also next note.)

(² ...and if the compiler doesn't do its own transcoding, almost
certainly it is using a library that already handles arbitrary in and
out encodings.)
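
(To make the separability concrete, here is a toy decoder/encoder pair.
Each new input format adds one decoder; every output format reuses the
same encoders. Latin-1 decodes trivially because it maps 1:1 onto
U+0000..U+00FF; the encoder omits surrogate/range checking for brevity:)

    #include <string>
    #include <vector>

    // One decoder per input format: bytes -> code points.
    std::u32string decode_latin1(const std::vector<unsigned char>& in) {
        return std::u32string(in.begin(), in.end());
    }

    // One encoder per output format: code points -> code units.
    std::string encode_utf8(const std::u32string& s) {
        std::string out;
        for (char32_t c : s) {
            if (c < 0x80) {
                out += char(c);
            } else if (c < 0x800) {
                out += char(0xC0 | (c >> 6));
                out += char(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += char(0xE0 | (c >> 12));
                out += char(0x80 | ((c >> 6) & 0x3F));
                out += char(0x80 | (c & 0x3F));
            } else {
                out += char(0xF0 | (c >> 18));
                out += char(0x80 | ((c >> 12) & 0x3F));
                out += char(0x80 | ((c >> 6) & 0x3F));
                out += char(0x80 | (c & 0x3F));
            }
        }
        return out;
    }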

Do we need this? I'm inclined to think "yes". Given a modern compiler
with UTF-8 source encoding, 1) transcoding is already implemented for
wide Unicode strings and 2) it's not *that* unreasonable to want this
feature for UTF-16 input files.

--
Matthew

Andrew Tomazos

unread,
May 5, 2016, 2:19:37 PM5/5/16
to std-pr...@isocpp.org
You're forgetting this is a corner case.  In almost all cases the source encoding is UTF-8, and the text files for input to text file literals will already be in that format.
 
...the source encoding must be unusual.

ASCII is not "unusual" for a source character set. The default source character set for quite a few compilers does not natively handle embedded Unicode characters.

I don't think that is true.  Please list a few of those quite a few compilers.

Tom Honermann

unread,
May 5, 2016, 3:26:55 PM5/5/16
to std-pr...@isocpp.org
On 5/5/2016 2:19 PM, Andrew Tomazos wrote:
In almost all cases the source encoding is UTF-8, and the text files for input to text file literals will already be in that format.

I keep hearing this echoed on various C++ standard mailing lists, but in my experience, this just isn't true.  I think Clang assumes that the source encoding is UTF-8, but gcc, the Microsoft compiler, IBM's compilers, etc... use the current locale at the time of compilation to determine the source encoding in the absence of a UTF-8 BOM (rare), #pragma (rare, somewhat common for IBM compilers, at least in header files) or explicit compiler option (also rare, and only very recently an option for the Microsoft compiler [1]).  It is certainly common for source files to be limited to ASCII and therefore UTF-8 compatible, but I don't think it is fair to state this is true in "almost all cases".

 
...the source encoding must be unusual.

ASCII is not "unusual" for a source character set. The default source character set for quite a few compilers does not natively handle embedded Unicode characters.

I don't think that is true.  Please list a few of those quite a few compilers.

The most obvious example is IBM's z/OS C++ compiler.  But again, by default, gcc and Microsoft use the current locale at the time of compilation to determine the source character set.  The current locale may specify a character set that is not ASCII compatible; Shift JIS, for example, encodes a yen symbol (¥) at the code point (0x5c) that is used for backslash (\) in ASCII.

Tom.

[1]: https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/

Edward Catmur

unread,
May 5, 2016, 6:49:29 PM5/5/16
to ISO C++ Standard - Future Proposals
On Thursday, 5 May 2016 03:35:55 UTC+1, Andrew Tomazos wrote:
> I am struggling with the binary file literal size issue.  I do not think that null-termination should be present on binary file literals, to discourage use.  Binary file literals in the general case do contain embedded nulls.  I would like them to be arrays of char, but they would then easily coerce to const char* which looks like a C string.  I would like to be able to write a constexpr function f such that both f(F"foo") and f(bF"foo") will work and properly get the range of bytes that the two literals present. 

Have you considered expanding binary file literals to a braced-init-list rather than a string literal? A braced-init-list will not bind to a pointer, but it can bind to an array reference and will initialize an array of unspecified length with the correct size.
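
In today's syntax, the binding behaviour that suggestion relies on looks like this:

    const unsigned char data[] = { 0x01, 0x02, 0x03 };     // exactly 3 elements, no NUL
    const unsigned char (&ref)[3] = { 0x01, 0x02, 0x03 };  // binds to an array reference
    // const unsigned char* p = { 0x01, 0x02, 0x03 };      // ill-formed: a braced-init-list
                                                           // will not bind to a pointer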

Andrew Tomazos

unread,
May 6, 2016, 8:25:30 AM5/6/16
to std-pr...@isocpp.org
On Thu, May 5, 2016 at 9:26 PM, Tom Honermann <t...@honermann.net> wrote:
On 5/5/2016 2:19 PM, Andrew Tomazos wrote:
In almost all cases the source encoding is UTF-8, and the text files for input to text file literals will already be in that format.

I keep hearing this echoed on various C++ standard mailing lists, but in my experience, this just isn't true.  I think Clang assumes that the source encoding is UTF-8, but gcc,

From the GCC manual:

"The files input to CPP might be in any character set at all. CPP's very first action, before it even looks for line boundaries, is to convert the file into the character set it uses for internal processing. That set is what the C standard calls the source character set. It must be isomorphic with ISO 10646, also known as Unicode. CPP uses the UTF-8 encoding of Unicode."

Can you please tell me where you get the idea that gcc uses the current locale to select the source encoding at the time of compilation?  I believe the default of -finput-charset is UTF-8.

the Microsoft compiler, IBM's compilers, etc... use the current locale at the time of compilation to determine the source encoding in the absence of a UTF-8 BOM (rare), #pragma (rare, somewhat common for IBM compilers, at least in header files) or explicit compiler option (also rare, and only very recently an option for the Microsoft compiler [1]).  It is certainly common for source files to be limited to ASCII and therefore UTF-8 compatible, but I don't think it is fair to state this is true in "almost all cases".
 
, the source encoding must be unusual.

ASCII is not "unusual" for a source character set. The default source character set for quite a few compilers does not natively handle embedded Unicode characters.

I don't think that is true.  Please list a few of those quite a few compilers.

The most obvious example is IBM's z/OS C++ compiler.  But again, by default, gcc and Microsoft use the current locale at the time of compilation to determine the source character set.  The current locale may specify a character set that is not ASCII compatible; Shift JIS, for example, encodes a yen symbol (¥) at the code point (0x5c) that is used for backslash (\) in ASCII.

Tom.

[1]: https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/


Matthew Woehlke

unread,
May 6, 2016, 11:03:45 AM5/6/16
to std-pr...@isocpp.org
On 2016-05-06 08:25, Andrew Tomazos wrote:
> From the GCC manual:
>
> "The files input to CPP might be in any character set at all. CPP's very
> first action, before it even looks for line boundaries, is to convert the
> file into the character set it uses for internal processing. That set is
> what the C standard calls the source character set. It must be isomorphic
> with ISO 10646, also known as Unicode. CPP uses the UTF-8 encoding of
> Unicode."
>
> Can you please tell me where you get the idea that gcc uses the current
> locale to select the source encoding at the time of compilation? I believe
> the default of -finput-charset is UTF-8.

(From the GCC 4.9 documentation; emphasis added:)

-finput-charset=CHARSET
Set the input character set, used for translation from the
character set of the input file to the source character set used by
GCC. *If the locale does not specify*, or GCC cannot get this
information from the locale, the default is UTF-8.

The *fallback* is UTF-8. The "default" is 'as specified by the current
locale'. (At least, that's what the documentation claims; I haven't
actually attempted to test it.)

--
Matthew

Tom Honermann

unread,
May 6, 2016, 11:46:30 AM5/6/16
to std-pr...@isocpp.org
On 5/6/2016 11:03 AM, Matthew Woehlke wrote:
> On 2016-05-06 08:25, Andrew Tomazos wrote:
>> From the GCC manual:
>>
>> "The files input to CPP might be in any character set at all. CPP's very
>> first action, before it even looks for line boundaries, is to convert the
>> file into the character set it uses for internal processing. That set is
>> what the C standard calls the source character set. It must be isomorphic
>> with ISO 10646, also known as Unicode. CPP uses the UTF-8 encoding of
>> Unicode."
I believe the above indicates that gcc's internal character set is
UTF-8; this doesn't indicate a default input file character set.
>>
>> Can you please tell me where you get the idea that gcc uses the current
>> locale to select the source encoding at the time of compilation? I believe
>> the default of -finput-charset is UTF-8.
> (From the GCC 4.9 documentation; emphasis added:)
>
> -finput-charset=CHARSET
> Set the input character set, used for translation from the
> character set of the input file to the source character set used by
> GCC. *If the locale does not specify*, or GCC cannot get this
> information from the locale, the default is UTF-8.
>
> The *fallback* is UTF-8. The "default" is 'as specified by the current
> locale'. (At least, that's what the documentation claims; I haven't
> actually attempted to test it.)
>
Despite the documentation, I can't seem to get my gcc builds to
differentiate behavior based on locale settings.

Regardless, the default I see looks like ISO8859-1, not UTF-8.  My gcc
builds accept the following input, which is ill-formed UTF-8.

$ cat t.cpp
#include <stdio.h>
int main() {
    unsigned char c = '£'; // 0xA3
    printf("0x%X\n", (unsigned int)c);
}

# Note 'a3' at offset 0x67.
$ od -t x1 t.cpp
0000000 23 69 6e 63 6c 75 64 65 20 3c 73 74 64 69 6f 2e
0000020 68 3e 0a 69 6e 74 20 6d 61 69 6e 28 29 20 7b 0a
0000040 20 20 20 20 75 6e 73 69 67 6e 65 64 20 63 68 61
0000060 72 20 63 20 3d 20 27 a3 27 3b 20 2f 2f 20 30 78
0000100 41 33 0a 20 20 20 20 70 72 69 6e 74 66 28 22 30
0000120 78 25 58 5c 6e 22 2c 20 28 75 6e 73 69 67 6e 65
0000140 64 20 69 6e 74 29 63 29 3b 0a 7d 0a
0000154

$ iconv -f utf-8 -t utf-8 t.cpp
#include <stdio.h>
int main() {
unsigned char c = 'iconv: illegal input sequence at position 55

$ g++ t.cpp -o t
$ ./t
0xA3

Tom.

Matthew Woehlke

unread,
May 6, 2016, 1:16:14 PM5/6/16
to std-pr...@isocpp.org
On 2016-05-06 11:46, Tom Honermann wrote:
> On 5/6/2016 11:03 AM, Matthew Woehlke wrote:
>> (From the GCC 4.9 documentation; emphasis added:)
>>
>> -finput-charset=CHARSET
>> Set the input character set, used for translation from the
>> character set of the input file to the source character set used by
>> GCC. *If the locale does not specify*, or GCC cannot get this
>> information from the locale, the default is UTF-8.
>>
>> The *fallback* is UTF-8. The "default" is 'as specified by the current
>> locale'. (At least, that's what the documentation claims; I haven't
>> actually attempted to test it.)
>
> Despite the documentation, I can't seem to get my gcc builds to
> differentiate behavior based on locale settings.
>
> Regardless, the default I see looks like ISO8859-1, not UTF-8. My gcc
> builds accepts the following input that is ill-formed UTF-8.
> [snipped]

Interesting; *mine* clearly expects UTF-8¹. See the attached source file
in latin1 (a.k.a. ISO 8859-1) encoding. I get:

$ g++ -std=c++11 latin1.cpp && ./a.out
�5, please
$ g++ -std=c++11 -finput-charset=latin1 latin1.cpp && ./a.out
£5, please

If I convert the source file to UTF-8:

$ g++ -std=c++11 -finput-charset=latin1 utf8.cpp && ./a.out
£5, please!

(Interestingly, I can't seem to get -finput-charset=utf16 to work:
`error: failure to convert UTF-16 to UTF-8`...)

(¹ ...which is consistent with my locale, en_US.UTF-8. What is your locale?)

--
Matthew
latin1.cpp

Thiago Macieira

unread,
May 6, 2016, 5:37:35 PM5/6/16
to std-pr...@isocpp.org
On sexta-feira, 6 de maio de 2016 11:03:34 PDT Matthew Woehlke wrote:
> -finput-charset=CHARSET
> Set the input character set, used for translation from the
> character set of the input file to the source character set used by
> GCC. *If the locale does not specify*, or GCC cannot get this
> information from the locale, the default is UTF-8.
>
> The *fallback* is UTF-8. The "default" is 'as specified by the current
> locale'. (At least, that's what the documentation claims; I haven't
> actually attempted to test it.)

The documentation is wrong. GCC does not attempt to identify the current
locale's charset and directly falls back to UTF-8.

Tom Honermann

unread,
May 7, 2016, 12:16:19 PM5/7/16
to std-pr...@isocpp.org
I get the same behavior with your test case (I'm using a gcc 6.0.0 build).
> (Interestingly, I can't seem to get -finput-charset=utf16 to work:
> `error: failure to convert UTF-16 to UTF-8`...)
>
> (¹ ...which is consistent with my locale, en_US.UTF-8. What is your locale?)

My locale is also en_US.UTF-8.

I played around a bit more. It looks like gcc accepts wtutf8 by
default; ill-formed UTF-8 code unit sequences are copied verbatim
without being transcoded (not even to a replacement character), the same
as for hex escape sequences. Using the attached test:

$ g++ -std=c++11 t.cpp -o t
$ ./t
narrow string:
0xA3
0x0
narrow string (hex escape):
0xA3
0x0
UTF-8 string:
0xA3
0x0
UTF-8 string (hex escape):
0xA3
0x0

$ g++ -std=c++11 -finput-charset=iso8859-1 t.cpp -o t
$ ./t
narrow string:
0xC2
0xA3
0x0
narrow string (hex escape):
0xA3
0x0
UTF-8 string:
0xC2
0xA3
0x0
UTF-8 string (hex escape):
0xA3
0x0

Tom.
t.cpp

Tom Honermann

unread,
May 7, 2016, 12:23:24 PM5/7/16
to std-pr...@isocpp.org
On 05/06/2016 05:37 PM, Thiago Macieira wrote:
> On sexta-feira, 6 de maio de 2016 11:03:34 PDT Matthew Woehlke wrote:
>> -finput-charset=CHARSET
>> Set the input character set, used for translation from the
>> character set of the input file to the source character set used by
>> GCC. *If the locale does not specify*, or GCC cannot get this
>> information from the locale, the default is UTF-8.
>>
>> The *fallback* is UTF-8. The "default" is 'as specified by the current
>> locale'. (At least, that's what the documentation claims; I haven't
>> actually attempted to test it.)
> The documentation is wrong. GCC does not attempt to identify the current
> locale's charset and directly falls back to UTF-8.
>
If gcc doesn't consult the locale, then there isn't anything to fall
back from (unrecognized -finput-charset operands are rejected). The
test I supplied demonstrated that gcc doesn't reject ill-formed UTF-8,
so I think it is imprecise to state it uses UTF-8 as the default input
encoding. Per my response to Matthew, it looks like it uses wtutf8.

Tom.

Tom Honermann

unread,
May 7, 2016, 12:28:11 PM5/7/16
to std-pr...@isocpp.org
Bah, I had modified the test case to add testing of well-formed UTF-8
code unit sequences, but forgot to update the output above. The
differences above are the only part actually interesting for this
discussion, but for completeness, here is the correct output:

$ g++ -std=c++11 t.cpp -o t
$ ./t
narrow string: (well-formed UTF-8)
0xC2
0xA3
0x0
narrow string: (ill-formed UTF-8)
0xA3
0x0
narrow string (hex escape):
0xA3
0x0
UTF-8 string: (well-formed UTF-8)
0xC2
0xA3
0x0
UTF-8 string: (ill-formed UTF-8)
0xA3
0x0
UTF-8 string (hex escape):
0xA3
0x0

$ g++ -std=c++11 -finput-charset=iso8859-1 t.cpp -o t
$ ./t
narrow string: (well-formed UTF-8)
0xC3
0x82
0xC2
0xA3
0x0
narrow string: (ill-formed UTF-8)
0xC2
0xA3
0x0
narrow string (hex escape):
0xA3
0x0
UTF-8 string: (well-formed UTF-8)
0xC3
0x82
0xC2
0xA3
0x0
UTF-8 string: (ill-formed UTF-8)

Nicol Bolas

unread,
May 7, 2016, 5:57:18 PM5/7/16
to ISO C++ Standard - Future Proposals
On Saturday, May 7, 2016 at 12:23:24 PM UTC-4, Tom Honermann wrote:

If gcc doesn't consult the locale, then there isn't anything to fall
back from (unrecognized -finput-charset operands are rejected).  The
test I supplied demonstrated that gcc doesn't reject ill-formed UTF-8,
so I think it is imprecise to state it uses UTF-8 as the default input
encoding.  Per my response to Matthew, it looks like it uses wtutf8.

What is wtutf8? I've never heard of this term before.

Thiago Macieira

unread,
May 7, 2016, 7:18:33 PM5/7/16
to std-pr...@isocpp.org
On sábado, 7 de maio de 2016 12:23:21 PDT Tom Honermann wrote:
> If gcc doesn't consult the locale, then there isn't anything to fall
> back from (unrecognized -finput-charset operands are rejected). The
> test I supplied demonstrated that gcc doesn't reject ill-formed UTF-8,
> so I think it is imprecise to state it uses UTF-8 as the default input
> encoding. Per my response to Matthew, it looks like it uses wtutf8.

Your test was invalid because it contained invalid UTF-8 sequences and my
editor destroyed them.

This discussion is also going off-topic. GCC behaviour should be discussed in
a GCC mailing list. Sending files over the network and sharing with other
people in other operating systems, with possibly different locale encodings,
is not part of the C++ standard.

I'm not joking. The standard doesn't take that into account. Sharing files with
other people and getting the same compilation requires stepping outside of the
standard and into compiler-specific territory.

Tom Honermann

unread,
May 7, 2016, 9:35:53 PM5/7/16
to std-pr...@isocpp.org
On May 7, 2016, at 5:57 PM, Nicol Bolas <jmck...@gmail.com> wrote:
> What is wtutf8? I've never heard of this term

It's a colloquial term that might be more readily recognized if spelled "wtf utf-8".

I believe I first heard it used to describe how Clang attempts to interpret file names for presentation purposes (perhaps in diagnostics? Or when generating preprocessing line control directives? I don't recall).

Tom.

Nicol Bolas

unread,
May 7, 2016, 10:48:10 PM5/7/16
to ISO C++ Standard - Future Proposals
On Saturday, May 7, 2016 at 7:18:33 PM UTC-4, Thiago Macieira wrote:
On sábado, 7 de maio de 2016 12:23:21 PDT Tom Honermann wrote:
> If gcc doesn't consult the locale, then there isn't anything to fall
> back from (unrecognized -finput-charset operands are rejected).  The
> test I supplied demonstrated that gcc doesn't reject ill-formed UTF-8,
> so I think it is imprecise to state it uses UTF-8 as the default input
> encoding.  Per my response to Matthew, it looks like it uses wtutf8.

Your test was invalid because it contained invalid UTF-8 sequences and my
editor destroyed them.

An implementation of UTF-8 must also disallow illegal UTF-8 sequences. Just like an implementation of C++ must disallow illegal C++.
 
This discussion is also going off-topic. GCC behaviour should be discussed in
a GCC mailing list.

The point of discussing it here is to answer a very important question: what source character sets are the defaults for compilers, and how many don't default to some form of Unicode? If 20% of compilers (by use) default to ASCII or whatever, then literal inclusion of text files would be significantly hampered due to the inability of people to use it for Unicode strings or Internationalization of any kind.
 
Sending files over the network and sharing with other
people in other operating systems, with possibly different locale encodings,
is not part of the C++ standard.

I'm not joking. The standard doesn't take that into account. Sharing files with
other people and getting the same compilation requires stepping outside of the
standard and into compiler-specific territory.

That's true, the standard doesn't provide any such guarantees.

That doesn't mean that this file inclusion mechanism shouldn't. After all, the whole point of allowing binary inclusion is to allow people to be able to store literal binary data cross-platform, exactly as it was in the file. With `Fb`, each compiler is required to store the same stream of `unsigned char` that would have been read from an untranslated `fopen` or `iostream` or whatever.

That's as cross-compiler as it gets.
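
To make that concrete, the claimed equivalence looks like this (a sketch only: Fb is the proposed spelling, "data.bin" and the unsigned-char element type are assumptions at this point in the thread, and error handling is omitted):

    #include <cstdio>
    #include <cstring>

    // Proposed: the file's bytes, baked into the program at compile time.
    static const unsigned char embedded[] = Fb"data.bin";

    bool matches_runtime_read()
    {
        // Today's equivalent: an untranslated binary read at runtime.
        std::FILE* f = std::fopen("data.bin", "rb");
        unsigned char buf[sizeof embedded];
        std::fread(buf, 1, sizeof embedded, f);
        std::fclose(f);
        // The requirement: this holds on every conforming implementation.
        return std::memcmp(embedded, buf, sizeof embedded) == 0;
    }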

Allowing a way to do this with text, through the use of Unicode, is not a bad idea. It would also be a very useful idea. And unlike mandating some form of Unicode as the source character set for actual source files, it wouldn't impact any existing systems. Anything could implement it.

Richard Smith

unread,
May 8, 2016, 12:28:13 AM5/8/16
to std-pr...@isocpp.org
I think what you're referring to is WTF-8? (See https://simonsapin.github.io/wtf-8/).

Thiago Macieira

unread,
May 8, 2016, 2:36:45 AM5/8/16
to std-pr...@isocpp.org
On Saturday, 7 May 2016 19:48:09 PDT Nicol Bolas wrote:
> > Your test was invalid because it contained invalid UTF-8 sequences and my
> > editor destroyed them.
>
> An implementation of UTF-8 must also disallow illegal UTF-8 sequences. Just
> like an implementation of C++ must disallow illegal C++.

Right, it "disallowed" the invalid sequences by destroying them. They were
replaced by the replacement character.

> > This discussion is also going off-topic. GCC behaviour should be discussed
> > in
> > a GCC mailing list.
>
> The point of discussing it *here* is to answer a very important question:
> what source character sets are the defaults for compilers, and how many
> don't default to some form of Unicode? If 20% of compilers (by use) default
> to ASCII or whatever, then literal inclusion of text files would be
> significantly hampered due to the inability of people to use it for Unicode
> strings or Internationalization of any kind.

Given the discussion about trigraphs, I'm guessing only IBM currently still
cares about a non-ASCII encoding of source code.

Maybe if we ask Michael Wong directly for his opinion on this matter, we'll
get somewhere.

> > Sending files over the network and sharing with other
> > people in other operating systems, with possibly different locale
> > encodings,
> > is not part of the C++ standard.
> >
> > I'm not joking. The standard doesn't take that into account. Sharing files
> > with
> > other people and getting the same compilation requires stepping outside of
> > the
> > standard and into compiler-specific territory.
>
> That's true, the standard doesn't provide any such guarantees.
>
> That doesn't mean that this file inclusion mechanism *shouldn't*. After
> all, the whole point of allowing binary inclusion is to allow people to be
> able to store literal binary data cross-platform, *exactly* as it was in
> the file. With `Fb`, each compiler is required to store the same stream of
> `unsigned char` that would have been read from an untranslated `fopen` or
> `iostream` or whatever.
>
> That's as cross-compiler as it gets.

Well, you're assuming that the target platform has bytes the same size as the
host platform that is compiling the source code. The C++ standard cannot
guarantee that. That means the only possible solution is "implementation-
defined".

I don't see the point in making binary files any more sharable than source code
itself.

> Allowing a way to do this with text, through the use of Unicode, is not a
> bad idea. It would also be a very *useful* idea. And unlike mandating some
> form of Unicode as the source character set for actual source files, it
> wouldn't impact any existing systems. Anything could implement it.

The non-binary file inclusion options are much simpler, indeed.

Nicol Bolas

unread,
May 8, 2016, 10:12:11 AM5/8/16
to ISO C++ Standard - Future Proposals


On Sunday, May 8, 2016 at 2:36:45 AM UTC-4, Thiago Macieira wrote:
On Saturday, 7 May 2016 19:48:09 PDT Nicol Bolas wrote:
> > Your test was invalid because it contained invalid UTF-8 sequences and my
> > editor destroyed them.
>
> An implementation of UTF-8 must also disallow illegal UTF-8 sequences. Just
> like an implementation of C++ must disallow illegal C++.

Right, it "disallowed" the invalid sequences by destroying them. They were
replaced by the replacement character.

> > This discussion is also going off-topic. GCC behaviour should be discussed
> > in
> > a GCC mailing list.
>
> The point of discussing it *here* is to answer a very important question:
> what source character sets are the defaults for compilers, and how many
> don't default to some form of Unicode? If 20% of compilers (by use) default
> to ASCII or whatever, then literal inclusion of text files would be
> significantly hampered due to the inability of people to use it for Unicode
> strings or Internationalization of any kind.

Given the discussion about trigraphs, I'm guessing only IBM currently still
cares about a non-ASCII encoding of source code.

You don't seem to understand the goal here.

The goal is for a user to be able to write text file literals that can be used with every compiler. This is a very useful thing to be able to do, and if we're going to have file literals, they should be able to be written in a platform-neutral way. Something that you can actually rely upon.

More fragility is something that C++ can do without.

The question is whether that goal has already been de-facto achieved through the source character set (based on the default inputs for compilers) or whether we need a de-jure means of doing so with file literals. That's why we want to know about what compilers take. If most compilers can accept Unicode strings by default, then whether we need a real mechanism is in doubt. But if there are a lot of compilers that don't, or if they differ greatly on which Unicode encodings they take, then there is a clear need to build one into the system.

A compiler which can only take ASCII is not part of the "de-facto achieved" group. This is not a question of "ASCII vs. IBM's stuff". It's "Known Unicode format vs. other things."

Maybe if we ask Michael Wong directly for his opinion on this matter, we'll
get somewhere.

> > Sending files over the network and sharing with other
> > people in other operating systems, with possibly different locale
> > encodings,
> > is not part of the C++ standard.
> >
> > I'm not joking. The standard doesn't take that into account. Sharing files
> > with
> > other people and getting the same compilation requires stepping outside of
> > the
> > standard and into compiler-specific territory.
>
> That's true, the standard doesn't provide any such guarantees.
>
> That doesn't mean that this file inclusion mechanism *shouldn't*. After
> all, the whole point of allowing binary inclusion is to allow people to be
> able to store literal binary data cross-platform, *exactly* as it was in
> the file. With `Fb`, each compiler is required to store the same stream of
> `unsigned char` that would have been read from an untranslated `fopen` or
> `iostream` or whatever.
>
> That's as cross-compiler as it gets.

Well, you're assuming that the target platform has bytes the same size as the
host platform that is compiling the source code.

No, I am not. If the host platform can compile to the target platform, then the host platform knows the byte size of the target. Otherwise it would be unable to convert even a literal string to the target platform's `char` type. Armed with such knowledge, the host compiler can do whatever internal conversion is needed to make it work as if the target platform loaded it.

The `char` sequence you get from a binary file literal is as if you had read the file through a stream on the target platform. Generating that data is simply a matter of the host doing whatever gymnastics are needed to convert what the host platform sees to what the target platform would have seen had it read the file.

I fail to see how this would not be possible.

The C++ standard cannot
guarantee that. That means the only possible solution is "implementation-
defined".

I don't see the point in making binary files any more sharable than source code
itself.

You mean besides making it genuinely useful? And stopping people from writing brittle code that works on one platform but doesn't work on another?

Thiago Macieira

unread,
May 8, 2016, 2:33:53 PM5/8/16
to std-pr...@isocpp.org
On Sunday, 8 May 2016 07:12:11 PDT Nicol Bolas wrote:
> > Given the discussion about trigraphs, I'm guessing only IBM currently
> > still
> > cares about a non-ASCII encoding of source code.
>
> You don't seem to understand the goal here.
>
> The goal is for a user to be able to write text file literals that can be
> used with every compiler. This is a very useful thing to be able to do, and
> if we're going to have file literals, they should be able to be written in
> a platform-neutral way. Something that you can actually *rely* upon.

I got that, but I don't see the point in having that if we can't rely on being
able to have platform-neutral source code in the first place.

I don't mean the variants that ask the compiler to interpret as UTF-8, 16 or
32. Those are ok. I meant the binary and locale text variants of the literal.

> More fragility is something that C++ can do without.
>
> The question is whether that goal has already been *de-facto* achieved
> through the source character set (based on the default inputs for
> compilers) or whether we need a *de-jure* means of doing so with file
> literals. That's why we want to know about what compilers take. If most
> compilers can accept Unicode strings by default, then whether we need a
> real mechanism is in doubt. But if there are a lot of compilers that don't,
> or if they differ greatly on which Unicode encodings they take, then there
> is a clear need to build one into the system.

There's no de-facto solution. You cannot share a file that isn't strict US-
ASCII with colleagues and expect it to be compiled the same way, even with
modern compilers. You can share US-ASCII files, if you don't count IBM and
EBCDIC.

To be clear, I mean that this source code does not always produce the same
string literal:

auto x = u"é";
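
To spell out the failure mode with concrete bytes: suppose the file stores the é as the UTF-8 sequence 0xC3 0xA9. Then the same source line produces different literals depending on the input charset the compiler assumes:

    auto x = u"é";
    // File read as UTF-8:   x is { 0x00E9, 0x0000 }          -- one character, é
    // File read as Latin-1: x is { 0x00C3, 0x00A9, 0x0000 }  -- two characters, Ã©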

> > Well, you're assuming that the target platform has bytes the same size as
> > the
> > host platform that is compiling the source code.
>
> No, I am not. If the host platform can compile to the target platform, then
> the host platform knows the byte size of the target. Otherwise it would be
> unable to convert even a literal string to the target platform's `char`
> type. Armed with such knowledge, the host compiler can do whatever internal
> conversion is needed to make it work as if the target platform loaded it.

This was a comment on the binary include. Suppose you're building on a regular
8-bit-byte platform, targeting a 9-bit-byte platform. How are you going to
represent values 256 to 511 in that byte, in your source file, when you do:

Fb"data.bin"


> The `char` sequence you get from a binary file literal is as if you had
> read the file through a stream on the target platform. Generating that data
> is simply a matter of the host doing whatever gymnastics are needed to
> convert what the host platform sees to what the target platform would have
> seen had it read the file.

Transferring this file to the other platform implies a translation. How the
translator operates (signed bytes vs unsigned bytes) is not specified.
Moreover, files on that other machine may not be representable on the build
platform.

> > The C++ standard cannot
> > guarantee that. That means the only possible solution is "implementation-
> > defined".
> >
> > I don't see the point in making binary files any more sharable than source
> > code
> > itself.
>
> You mean besides making it genuinely useful? And stopping people from
> writing brittle code that works on one platform but doesn't work on another?

I'm not saying this isn't useful. I'm saying that making the interpretation of
locale text files (not binary, not Unicode) as "implementation-defined" suffices,
since it would be as useful as sharing code is today.

Summary:

* F"filename" has implementation-defined behaviour because it depends on the
locale charset
* Fu8"filename", Fu"filename", FU"filename" have specific behaviour because the
interpretation of the bytes in each is standardised
* Fb"filename" is 1:1 so long as bytes on the build and target platforms are
the same. Otherwise, implementation-defined.
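
Side by side, that would look like this in code (invented file names, spellings as proposed):

    F"messages.txt"    // locale charset: implementation-defined result
    Fu8"messages.txt"  // standardised: contents interpreted as UTF-8
    Fu"messages.txt"   // standardised: UTF-16
    FU"messages.txt"   // standardised: UTF-32
    Fb"blob.dat"       // byte-for-byte only when build and target byte sizes match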

Viacheslav Usov

unread,
May 8, 2016, 3:31:43 PM5/8/16
to ISO C++ Standard - Future Proposals
On Sun, May 8, 2016 at 8:33 PM, Thiago Macieira <thi...@macieira.org> wrote:

> This was a comment on the binary include. Suppose you're building on a regular 8-bit-byte platform, targetting a 9-bit-byte platform. How are you going to represent values 256 to 512 in that byte

"Binary" really means "binary", "made of binary digits", i.e., bits, not "made of bytes".

Then this "binary include" is a bit stream, and so, logically, it should be copied into a contiguous region of memory, placing bits "one by one" (this may need further specification to be fully unambiguous; or this can be implementation-defined).

Cheers,
V.

Tom Honermann

unread,
May 8, 2016, 3:57:46 PM5/8/16
to std-pr...@isocpp.org
So, apparently:

1) I misremembered the term.

2) Given that there is a specification, it isn't a colloquialism.

3) Having now read the specification, I can see that what it describes is not the behavior exhibited by gcc; wtf-8 addresses lone surrogates, but not ill-formed code unit sequences as I thought it did.

4) I am an agent of misinformation.

Sorry, and thanks for the link, Richard!

Tom.

Tom Honermann

unread,
May 8, 2016, 4:08:01 PM5/8/16
to std-pr...@isocpp.org
On 5/7/2016 7:18 PM, Thiago Macieira wrote:
> Your test was invalid because it contained invalid UTF-8 sequences and my
> editor destroyed them.
The test source code was invalid UTF-8, but that was the point; gcc
still accepted it demonstrating that its default input file encoding is
not (strict) UTF-8.

I hope your editor at least warned you that it was mutating the file ;)
> This discussion is also going off-topic. GCC behaviour should be discussed in
> a GCC mailing list. Sending files over the network and sharing with other
> people in other operating systems, with possibly different locale encodings,
> is not part of the C++ standard.
>
> I'm not joking. The standard doesn't take that into account. Sharing files with
> other people and getting the same compilation requires stepping outside of the
> standard and into compiler-specific territory.

I think Nicol adequately addressed this, so I won't comment further.

Tom.

Ross Smith

unread,
May 8, 2016, 5:13:03 PM5/8/16
to std-pr...@isocpp.org
On 2016-05-09 06:33, Thiago Macieira wrote:
>
> This was a comment on the binary include. Suppose you're building on a regular
> 8-bit-byte platform, targeting a 9-bit-byte platform. How are you going to
> represent values 256 to 511 in that byte, in your source file, when you do:
>
> Fb"data.bin"

This is kind of tangential to the thread, but honestly I think it's time
we started giving serious consideration to giving up making the C++
specification tie itself in knots trying to support platforms where a
byte is not 8 bits.

Ross Smith

Thiago Macieira

unread,
May 9, 2016, 2:06:48 AM5/9/16
to std-pr...@isocpp.org
On Sunday, 8 May 2016 16:07:53 PDT Tom Honermann wrote:
> On 5/7/2016 7:18 PM, Thiago Macieira wrote:
> > Your test was invalid because it contained invalid UTF-8 sequences and my
> > editor destroyed them.
>
> The test source code was invalid UTF-8, but that was the point; gcc
> still accepted it demonstrating that its default input file encoding is
> not (strict) UTF-8.

GCC tries to be pass-through unless it is forced to interpret the contents
(u8, u & U). And, of course, GIGO (garbage in, garbage out).

Thiago Macieira

unread,
May 9, 2016, 2:10:13 AM5/9/16
to std-pr...@isocpp.org
On Sunday, 8 May 2016 21:31:40 PDT Viacheslav Usov wrote:
> Then this "binary include" is a bit stream, and so, logically, it should be
> copied into a contiguous region of memory, placing bits "one by one" (this
> may need further specification to be fully unambiguous; or this can be
> implementation-defined).

Yeah, except that may not be the way it works. As I said, it depends on how
the file transfer over the network would happen (remember: networks use octets
as units, not bytes).

Then again, this is an academic discussion.

Thiago Macieira

unread,
May 9, 2016, 2:11:49 AM5/9/16
to std-pr...@isocpp.org
Agreed. C++14 made it mandatory to be at least 8 bits, which means we've
dropped support for the 6- and 7-bit byte platforms that used to exist. 9-bit
byte platforms also existed, but were even more rare.

Viacheslav Usov

unread,
May 9, 2016, 3:08:59 AM5/9/16
to ISO C++ Standard - Future Proposals
On Mon, May 9, 2016 at 8:10 AM, Thiago Macieira <thi...@macieira.org> wrote:

> Yeah, except that may not be the way it works. As I said, it depends on how the file transfer over the network would happen (remember: networks use octets as units, not bytes).

Networks use bits in their lower layers. Bytes and octets happen at higher layers.

A file can always be represented as a bit stream, and a bit-by-bit inclusion is always possible. To make this fully unambiguous, we can have specifiers in the include directive that say how source bytes are converted to bits (low end vs high end), and how bits are pushed into destination bytes (ditto). We can even specify padding. Compared with the rest of the discussion, this is trivial.
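
For instance, the specifiers would pin down something like the following repacking (a sketch only; the function name is invented and MSB-first order is assumed on both ends):

    #include <cstdint>
    #include <vector>

    // Repack 8-bit source bytes into target_bits-wide destination units,
    // MSB-first on both sides, zero-padding the final partial unit.
    std::vector<unsigned> repack(const std::vector<std::uint8_t>& in,
                                 unsigned target_bits)
    {
        std::vector<unsigned> out;
        unsigned acc = 0, nbits = 0;
        const unsigned mask = (1u << target_bits) - 1;
        for (std::uint8_t b : in) {
            acc = (acc << 8) | b;            // append 8 source bits
            nbits += 8;
            while (nbits >= target_bits) {   // emit full destination units
                nbits -= target_bits;
                out.push_back((acc >> nbits) & mask);
            }
        }
        if (nbits != 0)                      // pad the tail with zero bits
            out.push_back((acc << (target_bits - nbits)) & mask);
        return out;
    }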

Cheers,
V.

Thiago Macieira

unread,
May 9, 2016, 3:52:53 AM5/9/16
to std-pr...@isocpp.org
On Monday, 9 May 2016 09:08:57 PDT Viacheslav Usov wrote:
> A file can always be represented as a bit stream, and a bit-by-bit
> inclusion is always possible. To make this fully unambiguous, we can have
> specifiers in the include directive that say how source bytes are converted
> to bits (low end vs high end), and how bits are pushed into destination
> bytes (ditto). We can even specify padding. Compared with the rest of the
> discussion, this is trivial.

I'd rather just leave that part as implementation-defined, since I have
absolutely zero experience with machines with bytes that aren't of 8 bits.

Tom Honermann

unread,
May 9, 2016, 8:29:41 AM5/9/16
to std-pr...@isocpp.org
On 5/8/2016 2:33 PM, Thiago Macieira wrote:
> * F"filename" has implementation-defined behaviour because it depends on the
> locale charset
> * Fu8"filename", Fu"filename", FU"filename" have specific behaviour because the
> interpretation of the bytes in each is standardised
> * Fb"filename" is 1:1 so long as bytes on the build and target platforms are
> the same. Otherwise, implementation-defined.
I agree with the above. Just to be clear, the above expressions would
have the following types, yes?

* F"filename" -> const char[N]
* Fu8"filename" -> const char[N]
* Fu"filename" -> const char16_t[N]
* FU"filename" -> const char32_t[N]
* Fb"filename" -> const char[N]

Also, as discussed earlier, BOMs present in the Unicode variants would
be removed and all of the variants except for Fb would have a nul
character appended.
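
To make the size consequences concrete (a hypothetical file hello.txt holding the five bytes "hello", no BOM, spellings as proposed):

    static_assert(sizeof(Fu8"hello.txt") == 6, "content plus appended NUL");
    static_assert(sizeof(Fb"hello.txt") == 5, "raw content, nothing appended");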

The question of whether to require normalization of end of line
sequences remains. If that isn't specified, then I think the behavior
needs to be implementation-defined for all the above variants.

Tom.

Thiago Macieira

unread,
May 9, 2016, 12:47:22 PM5/9/16
to std-pr...@isocpp.org
On Monday, 9 May 2016 08:29:27 PDT Tom Honermann wrote:
> On 5/8/2016 2:33 PM, Thiago Macieira wrote:
> > * F"filename" has implementation-defined behaviour because it depends on
> > the locale charset
> >
> > * Fu8"filename", Fu"filename", FU"filename" have specific behaviour
> > because the interpretation of the bytes in each is standardised
> >
> > * Fb"filename" is 1:1 so long as bytes on the build and target platforms
> > are the same. Otherwise, implementation-defined.
>
> I agree with the above. Just to be clear, the above expressions would
> have the following types, yes?
>
> * F"filename" -> const char[N]
> * Fu8"filename" -> const char[N]
> * Fu"filename" -> const char16_t[N]
> * FU"filename" -> const char32_t[N]
> * Fb"filename" -> const char[N]

I was expecting all of those to be char[N] and that if I wanted a char16_t, I'd
have to write:

uF"filename"
uFu8"filename"
uFu"filename"
uFU"filename"

That's mighty ugly. Given my own requirement of being able to constexpr-
transform the input to the format of my liking, if I wanted a UCS-4-encoded
string out of a UTF-8 input source, I'd implement the UTF-8 decoder myself.
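
In that spirit, a bare-bones constexpr decoder for a single code point could look like this (a sketch only: no validation, assumes a well-formed sequence, names invented):

    constexpr unsigned u(char c) { return static_cast<unsigned char>(c); }

    constexpr char32_t decode1(const char* s)
    {
        return u(s[0]) < 0x80 ? u(s[0])
             : u(s[0]) < 0xE0 ? ((u(s[0]) & 0x1F) << 6) | (u(s[1]) & 0x3F)
             : u(s[0]) < 0xF0 ? ((u(s[0]) & 0x0F) << 12)
                              | ((u(s[1]) & 0x3F) << 6) | (u(s[2]) & 0x3F)
             : ((u(s[0]) & 0x07) << 18) | ((u(s[1]) & 0x3F) << 12)
                              | ((u(s[2]) & 0x3F) << 6) | (u(s[3]) & 0x3F);
    }

    static_assert(decode1("\xC3\xA9") == 0xE9, "C3 A9 decodes to U+00E9");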

> Also, as discussed earlier, BOMs present in the Unicode variants would
> be removed and all of the variants except for Fb would have a nul
> character appended.
>
> The question of whether to require normalization of end of line
> sequences remains. If that isn't specified, then I think the behavior
> needs to be implementation-defined for all the above variants.

For all non-binary variants, I'd expect line-ending normalisation.

Tom Honermann

unread,
May 9, 2016, 1:21:49 PM5/9/16
to std-pr...@isocpp.org
On 5/9/2016 12:47 PM, Thiago Macieira wrote:
>> * F"filename" -> const char[N]
>> * Fu8"filename" -> const char[N]
>> * Fu"filename" -> const char16_t[N]
>> * FU"filename" -> const char32_t[N]
>> * Fb"filename" -> const char[N]
> I was expecting all of those to be char[N]
Hmm. Would that include transcoding to the source character set for the
Fu8, Fu, and FU variants? If so, then these would all be inherently
implementation-defined. If not, then endianness needs to be addressed
for Fu and FU.
> and that if I wanted a char16_t, I'd
> have to write:
>
> uF"filename"
> uFu8"filename"
> uFu"filename"
> uFU"filename"
>
> That's mighty ugly. Given my own requirement of being able to constexpr-
> transform the input to the format of my liking, if I wanted a UCS-4-encoded
> string out of a UTF-8 input source, I'd implement the UTF-8 decoder myself.
I think I prefer the match-the-filename-encoding-to-the-string-type
approach I implied above over either of these approaches. That seems
more consistent with string literals to me anyway.
>
>> Also, as discussed earlier, BOMs present in the Unicode variants would
>> be removed and all of the variants except for Fb would have a nul
>> character appended.
>>
>> The question of whether to require normalization of end of line
>> sequences remains. If that isn't specified, then I think the behavior
>> needs to be implementation-defined for all the above variants.
> For all non-binary variants, I'd expect line-ending normalisation.
Good, me too.

Tom.

Viacheslav Usov

unread,
May 9, 2016, 1:44:15 PM5/9/16
to ISO C++ Standard - Future Proposals
On Mon, May 9, 2016 at 7:21 PM, Tom Honermann <t...@honermann.net> wrote:

> Hmm.  Would that include transcoding to the source character set for the Fu8, Fu, and FU variants?

I would say that being able to specify transcoding is a useful option. At least UTF-8 <-> UTF-16 <-> source encoding should just be there, and source/target encodings should be specifiable separately. Personally, I find the end-of-line transcoding a misfeature in C/C++ std libs, but, in principle, it could also be specifiable.

Cheers,
V.

Thiago Macieira

unread,
May 9, 2016, 2:25:35 PM5/9/16
to std-pr...@isocpp.org
On Monday, 9 May 2016 13:21:10 PDT Tom Honermann wrote:
> On 5/9/2016 12:47 PM, Thiago Macieira wrote:
> >> * F"filename" -> const char[N]
> >> * Fu8"filename" -> const char[N]
> >> * Fu"filename" -> const char16_t[N]
> >> * FU"filename" -> const char32_t[N]
> >> * Fb"filename" -> const char[N]
> >
> > I was expecting all of those to be char[N]
>
> Hmm. Would that include transcoding to the source character set for the
> Fu8, Fu, and FU variants? If so, then these would all be inherently
> implementation-defined. If not, then endianness needs to be addressed
> for Fu and FU.

Yes, that was the case. But that's an ugly solution, so I am not asking for
it. Instead, let's have what you suggested and no transcoding. Not even
endianness correction: let the BOM be imported too. You can fix the endianness
and strip the BOM at constexpr time.
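
A sketch of what that constexpr fix-up could look like (names invented; assumes the raw literal is available as a char16_t array):

    #include <cstddef>

    constexpr char16_t bswap16(char16_t c)
    {
        return static_cast<char16_t>(((c & 0xFF) << 8) | ((c >> 8) & 0xFF));
    }

    // Code unit i of a raw, possibly BOM-prefixed UTF-16 array.
    template <std::size_t N>
    constexpr char16_t unit(const char16_t (&raw)[N], std::size_t i)
    {
        return raw[0] == 0xFEFF ? raw[i + 1]           // native order: skip BOM
             : raw[0] == 0xFFFE ? bswap16(raw[i + 1])  // swapped: skip BOM, fix
             : raw[i];                                 // no BOM: use as-is
    }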

What should the compiler do to invalid sources?
* UTF-8 overlong sequences
* UTF-8 invalid sequences
* mismatched UTF-16 surrogate pairs
* out-of-range UTF-8 and UCS-4

Should it do any validation at all?

> >> Also, as discussed earlier, BOMs present in the Unicode variants would
> >> be removed and all of the variants except for Fb would have a nul
> >> character appended.
> >>
> >> The question of whether to require normalization of end of line
> >> sequences remains. If that isn't specified, then I think the behavior
> >> needs to be implementation-defined for all the above variants.
> >
> > For all non-binary variants, I'd expect line-ending normalisation.
>
> Good, me too.


Thiago Macieira

unread,
May 9, 2016, 2:26:32 PM5/9/16
to std-pr...@isocpp.org
On Monday, 9 May 2016 19:44:12 PDT Viacheslav Usov wrote:
> I would say that being able to specify transcoding is a useful option. At
> least UTF-8 <-> UTF-16 <-> source encoding should just be there, and
> source/target encodings should be specifiable separately. Personally, I
> find the end-of-line transcoding a misfeature in C/C++ std libs, but, in
> principle, it could also be specifiable.

You can do that with your own constexpr functions. That's a requirement of
this feature anyway.

As a side note, that's yet another example showing where a constexpr function
should not be used at runtime. At runtime, you want CPU-accelerated functions
to do this kind of transcoding.

Richard Smith

unread,
May 9, 2016, 3:07:39 PM5/9/16
to std-pr...@isocpp.org
On Sun, May 8, 2016 at 11:11 PM, Thiago Macieira <thi...@macieira.org> wrote:
On Monday, 9 May 2016 09:12:55 PDT Ross Smith wrote:
> On 2016-05-09 06:33, Thiago Macieira wrote:
> > This was a comment on the binary include. Suppose you're building on a
> > regular 8-bit-byte platform, targeting a 9-bit-byte platform. How are
> > you going to
> > represent values 256 to 511 in that byte, in your source file, when you do:
> >     Fb"data.bin"
>
> This is kind of tangential to the thread, but honestly I think it's time
> we started giving serious consideration to giving up making the C++
> specification tie itself in knots trying to support platforms where a
> byte is not 8 bits.

Agreed. C++14 made it mandatory to be at least 8 bits,

That's been mandatory since C89.

What C++14 disallowed is char being simultaneously signed, only 8 bits wide, and either sign-magnitude or one's complement (or more generally, since C++ allows representations other than two's complement, one's complement, and sign-magnitude, it requires that char have at least 256 distinct values).

Thiago Macieira

unread,
May 9, 2016, 3:32:25 PM5/9/16
to std-pr...@isocpp.org
On Monday, 9 May 2016 12:07:37 PDT Richard Smith wrote:
> That's been mandatory since C89.
>
> What C++14 disallowed is char being signed, only 8 bits wide, and either
> sign-magnitude or 1's complement (or more generally, as C++ allows
> representations other than 2s' complement, 1's complement, and
> sign-magnitude, it requires that char has at least 256 distinct values).

Thanks for the correction. I hadn't realised the difference.

Tom Honermann

unread,
May 9, 2016, 11:54:08 PM5/9/16
to std-pr...@isocpp.org
On 5/9/2016 2:25 PM, Thiago Macieira wrote:
On Monday, 9 May 2016 13:21:10 PDT Tom Honermann wrote:
>> On 5/9/2016 12:47 PM, Thiago Macieira wrote:
>>>> * F"filename" -> const char[N]
>>>> * Fu8"filename" -> const char[N]
>>>> * Fu"filename" -> const char16_t[N]
>>>> * FU"filename" -> const char32_t[N]
>>>> * Fb"filename" -> const char[N]
>>> I was expecting all of those to be char[N]
>> Hmm. Would that include transcoding to the source character set for the
>> Fu8, Fu, and FU variants? If so, then these would all be inherently
>> implementation-defined. If not, then endianness needs to be addressed
>> for Fu and FU.
> Yes, that was the case. But that's an ugly solution, so I am not asking for
> it. Instead, let's have what you suggested and no transcoding. Not even
> endianness correction: let the BOM be imported too. You can fix the endianness
> and strip the BOM at constexpr time.

I don't agree with skipping endian conversions and retaining BOMs as I
think that would be antithetical to common usage. If a user wants to
elide endian conversions and retain BOMs, then the Fb variant is
available to do so.

> What should the compiler do to invalid sources?
> * UTF-8 overlong sequences
> * UTF-8 invalid sequences
> * mismatched UTF-16 surrogate pairs
> * out-of-range UTF-8 and UCS-4
>
> Should it do any validation at all?

Yes, I think it should validate and reject such ill-formed sources with
a diagnostic. This would differ from handling of string literals, but
offers better correctness and portability guarantees. There will be no
ability to encode escape sequences or similar "pass-through" code unit
sequences in these sources (I hope, I don't recall anyone proposing such
support), so I see little reason to tolerate files with an ill-formed or
mismatched encoding. Again, the Fb variant is available to import such
files.

Updated summary:

* F"filename"
- Result type is const char[N]
- File encoding is implementation-defined.
- A non-normative note could mention that end of line sequences are
expected to be replaced with the \n character.
- A '\0' character is appended following the file content.

* Fu8"filename"
- Result type is const char[N]
- Any leading BOM is stripped.
- An ill-formed code unit sequence results in a diagnostic.
- Implementation-defined end of line sequences are normalized to u8'\n'.
- A u8'\0' character is appended following the file content.

* Fu"filename"
- Result type is const char16_t[N]
- Any leading BOM is stripped.
- Endianness is detected if a BOM is present and endian conversion
performed to match the encoding of char16_t string literals.
- An ill-formed code unit sequence results in a diagnostic.
- Implementation-defined end of line sequences are normalized to u'\n'.
- A u'\0' character is appended following the file content.

* FU"filename"
- Result type is const char32_t[N]
- Any leading BOM is stripped.
- Endianness is detected if a BOM is present and endian conversion
performed to match the encoding of char32_t string literals.
- An ill-formed code unit sequence results in a diagnostic.
- Implementation-defined end of line sequences are normalized to U'\n'.
- A U'\0' character is appended following the file content.

* Fb"filename"
- Result type is const char[N]. Or should it be const unsigned
char[N]? I'm leaning towards unsigned char.
- A NUL character is *not* appended following the file content.
- Code units have implementation-defined values.
- A non-normative note could mention that code unit values are
expected to match the byte values of the file, but that this may not be
possible when cross-compiling to a target that has a different byte size
than the local file system.
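
Under those rules, usage would come out roughly like so (invented file names; assuming, as with string literals, the results are lvalue arrays of static storage duration):

    constexpr auto& cfg = Fu8"config.json";  // const char[N]: validated UTF-8, NUL-terminated
    constexpr auto& sub = Fu"subtitles.srt"; // const char16_t[N]: BOM stripped, endianness fixed
    constexpr auto& img = Fb"icon.png";      // raw bytes: no terminator appended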

Tom.

Tom Honermann

unread,
May 10, 2016, 12:01:11 AM5/10/16
to std-pr...@isocpp.org
On 5/9/2016 11:54 PM, Tom Honermann wrote:
> Updated summary:

I guess for completeness, add:

* FL"filename"
- Result type is const wchar_t[N]
- File encoding is implementation-defined.
- A non-normative note could mention that end of line sequences are
expected to be replaced with the L'\n' character.
- A L'\0' character is appended following the file content.

Tom.

Viacheslav Usov

unread,
May 10, 2016, 3:56:51 AM5/10/16
to ISO C++ Standard - Future Proposals
On Mon, May 9, 2016 at 8:26 PM, Thiago Macieira <thi...@macieira.org> wrote:

> You can do that with your own constexpr functions.

My own transcoding for UTF-8 and UTF-16, in 2017 and later? That's not even funny.

I could agree that the standard library should have something that works together with the include directive at the compile time to effect transcoding that covers at least UTF-8, UTF-16 and the source encoding. But then this include proposal should be bundled with that library feature.

Cheers,
V.

Viacheslav Usov

unread,
May 10, 2016, 4:11:52 AM5/10/16
to ISO C++ Standard - Future Proposals
On Tue, May 10, 2016 at 9:56 AM, Viacheslav Usov <via....@gmail.com> wrote:

> I could agree that the standard library should have something that works together with the include directive at the compile time to effect transcoding that covers at least UTF-8, UTF-16 and the source encoding. But then this include proposal should be bundled with that library feature.

Scrap it. UTF-8 and UTF-16 transcoding should just be supported by the directive. If the input is ill-formed, the program shall be ill-formed. If somebody wants to be smart about that, the binary mode is there.

Cheers,
V.

Thiago Macieira

unread,
May 10, 2016, 1:03:59 PM5/10/16
to std-pr...@isocpp.org
On Tuesday, 10 May 2016 09:56:46 PDT Viacheslav Usov wrote:
> On Mon, May 9, 2016 at 8:26 PM, Thiago Macieira <thi...@macieira.org> wrote:
> > You can do that with your own constexpr functions.
>
> My own transcoding for UTF-8 and UTF-16, in 2017 and later? That's not even
> funny.

Indeed. I agree that they should exist in the Standard Library and that you
should be able to call them in constexpr contexts.

And that it should be fast at runtime. I can see if Intel is ok with making my
UTF-8 hardware-accelerated algorithm available under a more permissive licence
than it is today.

> I could agree that the standard library should have something that works
> together with the include directive at the compile time to effect
> transcoding that covers at least UTF-8, UTF-16 and the source encoding. But
> then this include proposal should be bundled with that library feature.

I don't think that's necessary, strictly speaking.

Tom Honermann

unread,
May 10, 2016, 1:18:48 PM5/10/16
to std-pr...@isocpp.org
On 5/10/2016 1:03 PM, Thiago Macieira wrote:
> On Tuesday, 10 May 2016 09:56:46 PDT Viacheslav Usov wrote:
>> On Mon, May 9, 2016 at 8:26 PM, Thiago Macieira <thi...@macieira.org> wrote:
>>> You can do that with your own constexpr functions.
>> My own transcoding for UTF-8 and UTF-16, in 2017 and later? That's not even
>> funny.
> Indeed. I agree that they should exist in the Standard Library and that you
> should be able to call them in constexpr contexts.

I'm trying to help with that:
P0244R1: Text_view: A C++ concepts and range based character encoding
and code point enumeration library
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0244r1.html

Note the first two sections under Future Directions. If anyone wishes
to comment further on this, please start a new email thread.

> And that it should be fast at runtime. I can see if Intel is ok with making my
> UTF-8 hardware-accelerated algorithm available under a more permissive licence
> as it is today.

That would be cool.

Tom.

Thiago Macieira

unread,
May 10, 2016, 2:03:57 PM5/10/16
to std-pr...@isocpp.org
On Tuesday, 10 May 2016 13:18:43 PDT Tom Honermann wrote:
> > And that it should be fast at runtime. I can see if Intel is ok with
> > making my UTF-8 hardware-accelerated algorithm available under a more
> > permissive licence than it is today.
>
> That would be cool.

I'll do that only after I see some movement about allowing different code to
be run at runtime, compared to constexpr time, for the same function.

std::u16string appendWorld(const char *base)
{
    constexpr auto s = convertToUtf16(" World");  // done at compile time
    return convertToUtf16(base) + s;              // done at runtime
}
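
For what it's worth, a concrete sketch of that request, spelled with std::is_constant_evaluated() (which C++20 eventually standardised for exactly this purpose):

    #include <cstring>
    #include <type_traits>

    constexpr std::size_t my_strlen(const char* s)
    {
        if (std::is_constant_evaluated()) {
            std::size_t n = 0;     // portable loop, usable in constant expressions
            while (s[n] != '\0') ++n;
            return n;
        }
        return std::strlen(s);     // optimised library routine at runtime
    }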

Tom Honermann

unread,
May 25, 2016, 10:09:07 AM5/25/16
to std-pr...@isocpp.org
On 05/06/2016 05:37 PM, Thiago Macieira wrote:
On Friday, 6 May 2016 11:03:34 PDT Matthew Woehlke wrote:
>> -finput-charset=CHARSET
>> Set the input character set, used for translation from the
>> character set of the input file to the source character set used by
>> GCC. *If the locale does not specify*, or GCC cannot get this
>> information from the locale, the default is UTF-8.
>>
>> The *fallback* is UTF-8. The "default" is 'as specified by the current
>> locale'. (At least, that's what the documentation claims; I haven't
>> actually attempted to test it.)
> The documentation is wrong. GCC does not attempt to identify the current
> locale's charset and directly falls back to UTF-8.
>
There is an existing bug to get the documentation updated to remove the
incorrect description with regard to the current locale:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61896

I proposed an update to the documentation to address the behavior
exhibited elsewhere in this email thread.

Tom.