Draft Proposal of File String Literals


Andrew Tomazos

Apr 23, 2016, 6:16:26 AM
to std-pr...@isocpp.org
Please find attached a draft 3-page proposal entitled "Proposal of File String Literals".

Feedback appreciated.

Thanks,
Andrew.

ProposalofFileStringLiterals.pdf

Moritz Klammler

Apr 23, 2016, 7:27:18 PM
to Andrew Tomazos, std-pr...@isocpp.org
I can see how the proposed feature would be useful in many contexts and
provide a clean solution to what is often handled in a messy way today.

I'm wondering why you have decided to handle a pre-processing action
with a syntax that doesn't look like a pre-processor directive. I
remember seeing somebody discuss a similar feature some time ago that
looked roughly like this:

#inplace char data "datafile.txt"

The effect of the above snippet would have been that after
pre-processing, a variable

char data[<file-size>] = {<file-data>};

would have been defined. Note that `data` is not `const` in this
example which might be useful for some applications. If I wanted the
data to be read-only, I could have written

#inplace const char data "datafile.txt"

explicitly. This would naturally also allow

#inplace const char data <datafile.txt>

to be consistent with how the pre-processor finds other files. You
couldn't have anonymous variables with the contents of a file in this
case but I'm wondering how useful these would be anyway.

I'm also not sure whether it is necessary to require the replacement
text to be a raw string literal. Especially in environments where
compatibility with C is a concern, this could be an unnecessary road
block. Couldn't the data equally well be inlined as an array of
integer literals or any offending characters be escaped via the good old
`\ddd` octal syntax? Granted, that might not be very readable for humans but
pre-processed files are not very pretty in general. It should be
allowed, though, to break the replacement over multiple lines using
either line-breaks in array syntax or else concatenation of string
literals. The reason I think this is important is that many text
editors perform very poorly or even crash when faced with extremely long
lines.

I'm assuming that you're assuming that -- no matter what the syntax
looks like -- pre-processors would handle those file string literals the same
way they handle other file `#include`s so dependency computation by
build systems would continue to work by only running the pre-processor.
(Not that the standard would have to specify this, but it is good to be
aware of.)

Another question that should be discussed is whether and how to support
non-text data. For example, if we have such a mechanism, I could
imagine it would be useful to embed small graphics or other binary data
into the program image as well. Would encoding get into the way here?
Would NUL-termination mean that I have to subtract one from the size of
the generated array? Not a big deal but something to be aware of.
Maybe there could be an additional "encoding" prefix for binary data
that would guarantee a verbatim copy and also suppress NUL-termination.

Finally, another option to include arbitrary data into program images
used today is deferring the combining until link-time. At least on
GNU/Linux systems, this can be done by use of the `objcopy` program [1].

extern "C" char data[];
extern "C" const std::size_t size;

I don't think that this approach can do anything that replacement at
pre-processing time could not accomplish. If you want to reduce
compile-time dependencies, you can always have a dedicated translation
unit that merely contains the lines

char data[] = F"datafile.txt";
const std::size_t size = sizeof(data);

to simulate the behavior of the `objcopy` solution but going the other
way round is not possible. I'm just bringing it up because I thought
you might be interested to mention this in the discussion of your
proposal.

[1] https://sourceware.org/binutils/docs-2.26/binutils/objcopy.html
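For reference, the `objcopy` route sketched above looks roughly like this (the output-target and architecture flags are platform-specific -- the values below assume x86-64 ELF -- and the `_binary_*` symbol names are derived by objcopy from the input file name):

```shell
# Embed a file into a linkable object with GNU objcopy (binutils).
printf 'hello' > datafile.txt
objcopy -I binary -O elf64-x86-64 -B i386:x86-64 datafile.txt datafile.o

# objcopy synthesizes symbols from the input path, e.g.
# _binary_datafile_txt_start, _binary_datafile_txt_end,
# _binary_datafile_txt_size -- which C++ code can then declare as externs.
nm datafile.o
```

The resulting object is simply added to the link line; no compiler support is needed, which is exactly the trade-off being discussed.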

Thiago Macieira

Apr 23, 2016, 9:39:58 PM
to std-pr...@isocpp.org
On Saturday, April 23, 2016 12:16:23 PDT Andrew Tomazos wrote:
> Please find attached a draft 3-page proposal entitled "Proposal of File
> String Literals".

Any thought about what to do if the input file isn't the exact binary data you
want? For example, suppose you need to encode or decode base64.

Can you show this can be done with constexpr expressions?

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Andrew Tomazos

Apr 24, 2016, 5:38:14 AM
to std-pr...@isocpp.org
On Sun, Apr 24, 2016 at 3:39 AM, Thiago Macieira <thi...@macieira.org> wrote:
> On Saturday, April 23, 2016 12:16:23 PDT Andrew Tomazos wrote:
> > Please find attached a draft 3-page proposal entitled "Proposal of File
> > String Literals".
>
> Any thought about what to do if the input file isn't the exact binary data you
> want? For example, suppose you need to encode or decode base64.
>
> Can you show this can be done with constexpr expressions?

As per a normal string literal, the text of the dedicated source file of a file string literal will be decoded from source encoding in the usual manner and then encoded in execution encoding (as determined by implementation settings or encoding prefix). From there it will behave like a normal string literal.  It is possible with constexpr programming to encode or decode base64.  String literals are "constexpr-compatible".
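Such a compile-time decode is indeed expressible. A minimal sketch (C++17; the decoder below is illustrative only, assumes well-formed input, and is not part of the proposal):

```cpp
#include <array>
#include <cstddef>

// Map one base64 alphabet character to its 6-bit value (-1 for '=' / invalid).
constexpr int b64_value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Decode a base64 string literal at compile time. N includes the
// terminating NUL; padded groups leave trailing zero bytes in the array.
template <std::size_t N>
constexpr std::array<char, ((N - 1) / 4) * 3> b64_decode(const char (&in)[N]) {
    std::array<char, ((N - 1) / 4) * 3> out{};
    std::size_t o = 0;
    for (std::size_t i = 0; i + 3 < N - 1; i += 4) {
        int v = (b64_value(in[i]) << 18) | (b64_value(in[i + 1]) << 12) |
                ((in[i + 2] == '=' ? 0 : b64_value(in[i + 2])) << 6) |
                (in[i + 3] == '=' ? 0 : b64_value(in[i + 3]));
        out[o++] = static_cast<char>((v >> 16) & 0xFF);
        if (in[i + 2] != '=') out[o++] = static_cast<char>((v >> 8) & 0xFF);
        if (in[i + 3] != '=') out[o++] = static_cast<char>(v & 0xFF);
    }
    return out;
}

// "TWFu" decodes to "Man" -- verified entirely at compile time.
constexpr auto decoded = b64_decode("TWFu");
static_assert(decoded[0] == 'M' && decoded[1] == 'a' && decoded[2] == 'n');
```

With the proposed feature, the literal argument could equally be a file string literal, e.g. `b64_decode(F"payload.b64")`.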

Moritz Klammler

Apr 24, 2016, 7:16:28 AM
to Andrew Tomazos, std-pr...@isocpp.org
Andrew Tomazos <andrew...@gmail.com> writes:

>> I'm also not sure whether it is necessary to require the replacement
>> text to be a raw string literal. Especially in environments where
>> compatibility with C is a concern, this could be an unnecessary road
>> block.
>
>
> Well, I guess it didn't block raw string literals, which are a feature
> of C++ and not of C (AFAIK).

What I wanted to say is that it would be nice if I could compile the
pre-processed file as C source code anyway. Given that the
pre-processor used to be the same for both C and C++, this seemed a
natural thing for me to think about.

>> Couldn't the data equally well be inlined as an array of integer
>> literals or any offending characters be escaped via the good old
>> `\0dd` syntax? Granted, that might not be very readable for humans
>> but pre-processed files are not very pretty in general. It should be
>> allowed, though, to break the replacement over multiple lines using
>> either line-breaks in array syntax or else concatenation of string
>> literals. The reason I think this is important is that many text
>> editors perform very poorly or even crash when faced with extremely
>> long lines.
>>
>
> I don't follow this, sorry.

I mean, instead of outputting

R"D(some "nasty"
stuff)D"

couldn't the pre-processor output

{115, 111, 109, 101, 32, 34, 110, 97, 115, 116, 121, 34, 10, 115,
116, 117, 102, 102, 0}

or

"some \x22nasty\x22\012stuff"

to achieve the same effect without having to depend on raw string
literals. These ways to encode string literals are inconvenient for
humans to type and read but the pre-processor shouldn't mind generating
them. It would be even simpler than having to figure out a valid escape
sequence for raw string literals. Not to mention the havoc that could
be done by UTF-8 strings that switch to RTL scripts...

In the second half of my earlier message, I wanted to allow the
pre-processor to generate replacement text broken up like this:

"........... some ................... very ................. long"
"......................... text ................................."
"..... broken ......... up ................... on ..............."
"............................ multiple .........................."
"................................................................"
"......... lines ......... using .................... implicit .."
"..... concatenation ............... of ....... string .........."
"............. literals ........................................."
"..................................... to ......................."
"........... keep .......... line ... lengths ..................."
"................................................................"
".......................................... reasonable .........."

Thiago Macieira

Apr 24, 2016, 4:03:38 PM
to std-pr...@isocpp.org
You're missing my point.

I want to be sure that I could pre-process the contents of the file into the
data I actually want. That data and only that data should be stored in the
read-only sections of my binary's image.

If you can't prove that, it means I will need to have an extra tool to do the
pre-processing and a build system that runs it before the source gets compiled.
That negates the benefit of having the feature in the first place, since the
extra tool could just as well generate a .h file with the contents I want.

Andrew Tomazos

Apr 25, 2016, 10:37:10 AM
to std-pr...@isocpp.org
You can "pre-process" the contents of a string literal using constexpr programming:

  constexpr auto original_version = "original string";

  constexpr fixed_string process_string(fixed_string input) { ... }

  constexpr auto processed_version = process_string(original_version);

  int main() { std::cout << processed_version; }

In practice processed_version will be in the program image, and original_version won't.

The above ordinary string literal can be replaced with a file string literal, and the same logic applies.

Isn't that what you mean?
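To make the sketch above concrete, here is a compilable version using `std::array` as a stand-in for the hypothetical `fixed_string`, with an uppercasing transform standing in for `process_string` (C++17; all names are illustrative):

```cpp
#include <array>
#include <cstddef>

// Transform a string literal into a new compile-time array.
// Here the "processing" is simply ASCII uppercasing.
template <std::size_t N>
constexpr std::array<char, N> process_string(const char (&in)[N]) {
    std::array<char, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = (in[i] >= 'a' && in[i] <= 'z')
                     ? static_cast<char>(in[i] - 'a' + 'A')
                     : in[i];
    return out;
}

// Only processed_version needs to exist at run time; the original
// literal is consumed during constant evaluation.
constexpr auto processed_version = process_string("original string");
static_assert(processed_version[0] == 'O' && processed_version[9] == 'S');
```

Replacing `"original string"` with a file string literal would not change anything in this pattern.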

Matthew Woehlke

Apr 25, 2016, 11:24:38 AM
to std-pr...@isocpp.org
On 2016-04-23 19:27, Moritz Klammler wrote:
> I'm wondering why you have decided to handle a pre-processing action
> with a syntax that doesn't look like this. I remember having seen
> somebody discuss a similar feature some time ago that looked somehow
> like this.
>
> #inplace char data "datafile.txt"

I was also wondering about that. In particular, I'll note that tools
that need to do non-compiler-assisted dependency scanning may have an
easier time with this format.

FWIW, I wouldn't ignore the point about being able to do path searches
for this feature. In fact, that may be a critical feature for some folks
(because the path cannot be determined when writing the source, but can
be specified by the build system). In particular, if no path search
occurs, it is impossible to include content from an external source
(i.e. a file not part of the source tree which uses it).

> Another question that should be discussed is whether and how to support
> non-text data.

No, that's not a question. That's a hard requirement ;-). I'm not sure
the proposal doesn't cover this, though?

auto& binary_data = F"image.png"; // const char (&)[N]
auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
auto image = parse_png(binary_data, binary_size);

Any form of this feature that cannot replace qrc (the Qt resource system) is a failure in my book.

That said... I also wonder if being able to select between text vs.
binary mode is important, especially if importing large text blocks is
really a desired feature? (Most of the use cases I've seen have been for
binary files such as images. Andrew's proposal seems to imply uses for
text.)

Note that by "text mode" I mean native-to-C line ending translation.

> Maybe there could be an additional "encoding" prefix for binary data
> that would guarantee a verbatim copy and also suppress NUL-termination.

That's probably a good idea. (That said, I suspect many binary formats
can cope with a superfluous trailing NUL already. Anyway, this might be
useful, but probably isn't critical.)

--
Matthew

Nicol Bolas

Apr 25, 2016, 12:21:48 PM
to ISO C++ Standard - Future Proposals, andrew...@gmail.com, mor...@klammler.eu
On Saturday, April 23, 2016 at 7:27:18 PM UTC-4, Moritz Klammler wrote:
> I can see how the proposed feature would be useful in many contexts and
> provide a clean solution to what is often handled in a messy way today.
>
> I'm wondering why you have decided to handle a pre-processing action
> with a syntax that doesn't look like this.  I remember having seen
> somebody discuss a similar feature some time ago that looked somehow
> like this.
>
>     #inplace char data "datafile.txt"

I despise the idea of making a new preprocessor directive. Even more so, I despise the idea of having a preprocessor directive declaring a variable.

I say we just reuse `#include`. At present, `#include` requires the next token to be a ", a <, or a preprocessor token. We could just expand that to a couple of extra options:

#include <options> <file-specifier/pp-token>

The <options> can be a command which alters the nature of the include. With no options, then it includes the file's tokens into the command stream.

If <options> is `text`, for example, it would include the file as a string literal, with text translation (for newlines and such). If <options> is `bin`, then it would include the file as a string literal with no translation. Note that embedded \0 characters would be allowed, and the string literal would work just as any C++ string literal does with them. So to use this, you would do this:

auto the_string = #include text "somefile.txt";

Of course, now we get into the question of what kind of string literal. That is, narrow string, UTF-8 string, UTF-16 string (why not allow inclusion of them?), and so forth. Obviously for `bin`, it would be a narrow string literal. Perhaps there would be different forms of text: `text` (narrow), `utf8`, `wide`, `utf16`, etc.

Note that this form of #include should not be converting the included file for these different formats. It is up to the user to make sure that the file actually stores data in the format the compiler was told to expect. The only conversion that might be allowed would be the removal of an initial BOM for the UTF formats.

Thiago Macieira

Apr 25, 2016, 12:23:01 PM
to std-pr...@isocpp.org
On Monday, April 25, 2016 16:37:08 PDT Andrew Tomazos wrote:
> > If you can't prove that, it means I will need to have an extra tool to do
> > the pre-processing and a build system that runs it before the source gets
> > compiled. That negates the benefit of having the feature in the first
> > place, since the extra tool could just as well generate a .h file with
> > the contents I want.
>
> You can "pre-process" the contents of a string literal using constexpr
> programming:
>
> constexpr auto original_version = "original string";
>
> constexpr fixed_string process_string(fixed_string input) { ... }
>
> constexpr auto processed_version = process_string(original_version);
>
> int main() { cout << processed_version; }
>
> In practice processed_version will be in the program image, and
> original_version won't.
>
> The above ordinary string literal can be replaced with a file string
> literal, and the same logic applies.
>
> Isn't that what you mean?

Yes.

I'd like to see a concrete example in the paper, doing some transformation of
the data. As a strawman, it would be nice to know if you could write a
constexpr function that would generate a perfect hashing table given the
complete population of source strings (newline separated).
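Not the requested perfect-hash generator, but as a minimal sketch of the underlying capability -- hashing string data entirely at compile time -- here is a constexpr FNV-1a (C++14; the hash choice and names are illustrative, not from the proposal):

```cpp
#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a over a character sequence, evaluable at compile time.
// A perfect-hash generator would layer table construction on top of this.
constexpr std::uint64_t fnv1a(const char* s, std::size_t n) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (std::size_t i = 0; i < n; ++i) {
        h ^= static_cast<unsigned char>(s[i]);
        h *= 1099511628211ull;                  // FNV prime
    }
    return h;
}

// The hash of a (file) string literal is a compile-time constant.
constexpr auto h = fnv1a("population", 10);
static_assert(fnv1a("a", 1) != fnv1a("b", 1), "distinct inputs hash apart");
```

Splitting a newline-separated file string literal into entries and bucketing them by such a hash is then ordinary (if tedious) constexpr programming.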

Andrew Tomazos

Apr 25, 2016, 1:17:29 PM
to std-pr...@isocpp.org
On Mon, Apr 25, 2016 at 5:24 PM, Matthew Woehlke <mwoehlk...@gmail.com> wrote:
> On 2016-04-23 19:27, Moritz Klammler wrote:
> > I'm wondering why you have decided to handle a pre-processing action
> > with a syntax that doesn't look like this.  I remember having seen
> > somebody discuss a similar feature some time ago that looked somehow
> > like this.
> >
> >     #inplace char data "datafile.txt"
>
> I was also wondering about that. In particular, I'll note that tools
> that need to do non-compiler-assisted dependency scanning may have an
> easier time with this format.

I don't think that is true.  For a tool to accurately do dependency scanning, it already has to do near full tokenization and preprocessing:

#include MACRO_REPLACE_ME

auto x = R"(
#include "skip_me"
)";

#if some_expression_that_is_false
#include "skip_me_too"
#endif

For approximate dependency scanning, which is of dubious benefit anyway, it is easy to add the proposed file string literal token pattern to the scanner.
 
> > Another question that should be discussed is whether and how to support
> > non-text data.
>
> No, that's not a question. That's a hard requirement ;-). I'm not sure
> the proposal doesn't cover this, though?
>
>   auto binary_data = F"image.png"; // char[]
>   auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
>   auto image = parse_png(binary_data, binary_size);

I don't think that is portable, or advisable.  In particular source files are decoded in an implementation-defined manner during translation.  Even if your source encoding was the same as the execution encoding, the implementation may still reject arbitrary binary sequences which are not valid for that encoding.

While we might be able to extend the proposal to make this work, I think the better way to include an image would be to use a link-time or run-time strategy.

As designed, a use of a file string literal should be interchangeable with a use of a raw string literal, and vice versa, by simply copy-and-pasting the body.  You would never put an image in a raw string literal (and I don't think it would work, for the reason given previously).

> That said... I also wonder if being able to select between text vs.
> binary mode is important, especially if importing large text blocks is
> really a desired feature? (Most of the use cases I've seen have been for
> binary files such as images. Andrew's proposal seems to imply uses for
> text.)
 
The use cases for file string literals are largely the same as for raw string literals.  File string literals simply allow you to factor them out into a dedicated source file, rather than having them in-line.

Matthew Woehlke

Apr 25, 2016, 1:38:05 PM
to std-pr...@isocpp.org
On 2016-04-25 13:17, Andrew Tomazos wrote:
> On Mon, Apr 25, 2016 at 5:24 PM, Matthew Woehlke wrote:
>> On 2016-04-23 19:27, Moritz Klammler wrote:
>>> Another question that should be discussed is whether and how to support
>>> non-text data.
>>
>> No, that's not a question. That's a hard requirement ;-). I'm not sure
>> the proposal doesn't cover this, though?
>>
>> auto binary_data = F"image.png"; // char[]
>> auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
>> auto image = parse_png(binary_data, binary_size);
>
> I don't think that is portable, or advisable. In particular source
> files are decoded in an implementation-defined manner during
> translation. Even if your source encoding was the same as the
> execution encoding, the implementation may still reject arbitrary
> binary sequences which are not valid for that encoding.
>
> While we might be able to extend the proposal to make this work, I
> think the better way [...]

In that case, I am Strongly Against your proposal. I probably, on some
occasions, want this feature for text. I *definitely* want it for binary
resources, and much more often than I might want it for text.

I think you are missing a significant and important use case, and, if
you don't account for that case, the feature is just begging to be
misused and abused and subject to confusion and surprise breakage.

> I think the better way to include an image would be to use a
> link-time or run-time strategy.

You're welcome to your opinion, but it does not match existing and
widespread practice.

--
Matthew

Nicol Bolas

Apr 25, 2016, 1:45:14 PM
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Monday, April 25, 2016 at 1:38:05 PM UTC-4, Matthew Woehlke wrote:
> On 2016-04-25 13:17, Andrew Tomazos wrote:
> > On Mon, Apr 25, 2016 at 5:24 PM, Matthew Woehlke wrote:
> >> On 2016-04-23 19:27, Moritz Klammler wrote:
> >>> Another question that should be discussed is whether and how to support
> >>> non-text data.
> >>
> >> No, that's not a question. That's a hard requirement ;-). I'm not sure
> >> the proposal doesn't cover this, though?
> >>
> >>   auto binary_data = F"image.png"; // char[]
> >>   auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
> >>   auto image = parse_png(binary_data, binary_size);
> >
> > I don't think that is portable, or advisable.  In particular source
> > files are decoded in an implementation-defined manner during
> > translation.  Even if your source encoding was the same as the
> > execution encoding, the implementation may still reject arbitrary
> > binary sequences which are not valid for that encoding.
> >
> > While we might be able to extend the proposal to make this work, I
> > think the better way [...]
>
> In that case, I am Strongly Against your proposal. I probably, on some
> occasions, want this feature for text. I *definitely* want it for binary
> resources, and much more often than I might want it for text.
>
> I think you are missing a significant and important use case, and, if
> you don't account for that case, the feature is just begging to be
> misused and abused and subject to confusion and surprise breakage.

I think your second point is the best reason to make sure that binary inclusions are well-supported. If you give people the ability to include files as strings, people are going to use it for including binary files. That is guaranteed. So the only options are to have it cause subtle breakage or to properly support it.

It's better not to do it at all and force us to use our current measures than to do it halfway.

Andrew Tomazos

Apr 25, 2016, 2:01:27 PM
to std-pr...@isocpp.org, mwoehlk...@gmail.com
Ok, you've convinced me.

We can add a new encoding-prefix "b" for binary.  So it would be:

  auto binary_data = bF"image.png";

I'd need to think about the wording; it will probably be implementation-defined with a note saying roughly that the data should undergo no decoding or encoding from source to execution.

Matthew Woehlke

Apr 25, 2016, 2:20:49 PM
to std-pr...@isocpp.org
On 2016-04-25 14:01, Andrew Tomazos wrote:
> Ok, you've convinced me.

Thanks. I do think the original idea is also good, I just really want it
for binary data, and... well, Nicol eloquently reiterated my concern
:-). I do think there are enough folks that want something like this for
binary data that explicitly supporting binary data will definitely
*help* the proposal. (And to be fair, when I say "binary data", I really
mean "resources". Some of which... will be text :-). For example, SVG,
GLSL... and I'm sure you could add to that list.)

> We can add a new encoding-prefix "b" for binary. So it would be:
>
> auto binary_data = bF"image.png";
>
> I'd need to think about the wording, it will probably be
> implementation-defined with a note saying roughly that the data should
> undergo no decoding or encoding from source to execution.

Right. (I was going to say that feels backwards, but missed that your
end point is "execution". Presumably escaping or something would happen
in the PP stage.)

BTW, what about path search? IIUC, the intent is that the file name is
looked up as if `#include "name"`, correct? It might not be terrible to
state this more explicitly, though...

--
Matthew

Andrew Tomazos

Apr 26, 2016, 11:08:24 AM
to std-pr...@isocpp.org
The wording says that F"foo" is equivalent to #include "foo" (with some modifications, none of which affect the mapping of paths to source files).  In practice you just put the files in one of your include paths as usual.

Arthur O'Dwyer

Apr 28, 2016, 5:39:09 PM
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Monday, April 25, 2016 at 11:01:27 AM UTC-7, Andrew Tomazos wrote:
> On Mon, Apr 25, 2016 at 7:45 PM, Nicol Bolas <jmck...@gmail.com> wrote:
> On Monday, April 25, 2016 at 1:38:05 PM UTC-4, Matthew Woehlke wrote:
> On 2016-04-25 13:17, Andrew Tomazos wrote:
> > On Mon, Apr 25, 2016 at 5:24 PM, Matthew Woehlke wrote:
> >> On 2016-04-23 19:27, Moritz Klammler wrote:
> >>> Another question that should be discussed is whether and how to support
> >>> non-text data.
> >>
> >> No, that's not a question. That's a hard requirement ;-). I'm not sure
> >> the proposal doesn't cover this, though?
> >>
> >>   auto binary_data = F"image.png"; // char[]
> >>   auto binary_size = sizeof(binary_data) / sizeof(*binary_data);
> >>   auto image = parse_png(binary_data, binary_size);
> >
> > I don't think that is portable, or advisable.  In particular source
> > files are decoded in an implementation-defined manner during
> > translation.  Even if your source encoding was the same as the
> > execution encoding, the implementation may still reject arbitrary
> > binary sequences which are not valid for that encoding. [...]
>
> If you give people the ability to include files as strings, people are going to use it for including binary files. That is guaranteed. So the only options are to have it cause subtle breakage or to properly support it.
>
> Ok, you've convinced me.
>
> We can add a new encoding-prefix "b" for binary.  So it would be:
>
>   auto binary_data = bF"image.png";
>
> I'd need to think about the wording, it will probably be implementation-defined with a note saying roughly that the data should undergo no decoding or encoding from source to execution.

I was just thinking before your post that this has shades of the icky "mode" behavior of fopen(); i.e., it's up to the programmer (and therefore often buggy) whether the file is opened in "b"inary or "t"ext mode. What makes for the "often buggy" part is that the path-of-least-resistance happens to work perfectly on Unix/Linux/OSX, and therefore the vast majority of working programmers never need to learn the icky parts.

What happens on a Windows platform when I write

    const char data[] = R"(
    )";

? Does data come out equivalent to "\n" or to "\r\n"? Does it depend on the compiler (MSVC versus Clang) or not? I don't have a Windows machine to find out for myself, sorry.
I would expect that

    const char data[] = F"just-a-blank-line.h";

would have the same behavior on Windows as the above program, whatever that behavior is.
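For what it's worth, the expected behavior on a conforming compiler can be written down as a compile-time check (a sketch; whether phase 1 must map "\r\n" to a single '\n' is exactly what the replies below debate):

```cpp
// One physical newline inside a raw string literal. If translation
// phase 1 maps each end-of-line indicator to a single new-line
// character, the array is {'\n', '\0'} on every platform.
constexpr char data[] = R"(
)";
static_assert(sizeof(data) == 2, "one new-line plus the NUL terminator");
static_assert(data[0] == '\n', "new-line is '\\n', not '\\r'");
```

A compiler that preserved a literal CR LF pair here would fail both assertions.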

I would offer that perhaps the syntax

    const char data[] = RF"image.png";

is available for "raw binary" (i.e., "I want exactly these bytes at runtime") file input. However, this may be overcomplicating the token grammar; it would be the one(?) case where R"..." did not introduce a raw string literal token.

Re source and runtime encodings: I have experience with Green Hills compilers in "euc2jis" mode, where the source encoding is EUC and the runtime encoding is Shift-JIS: string literal tokens are assumed to be encoded in EUC, and have to be transcoded to JIS before their bytes are written to the data section. It would indeed be horrible if the programmer wrote

    const char data[] = F"image.png";

and the bytes that got written to the .rodata section were the "EUC2JIS'ed" version of "image.png". However, it would be almost as horrible if the programmer wrote

    const char license_message[] = F"license.txt";

and the bytes that got written to the .rodata section were NOT the "EUC2JIS'ed" version of "license.txt". And we definitely can't rely on heuristics like "the extension of the include file" to determine whether we should be assuming "source encoding" or "raw encoding". So I would agree with your (Andrew's) idea that we need a way for the programmer to specify whether the file input should be treated as "source" or "raw".  I merely offer that prefix-"b" seems awkward to me, and I think prefix-"R" is available.

Either way, your proposal should include an example along the lines of

    const char example[] = RF"foo(bar.h)foo";

Does this mean "include bar.h", or "include foo(bar.h)foo" — and why?

Re phases of translation: I think your proposal should include an example along the lines of

    #define X(x) F ## #x
    #include X(foo) ".h"
    const char example[] = X(foo) ".h";

I think the behavior of the above #include is implementation-defined or possibly undefined; I haven't checked.
I'm curious whether you'd make the behavior of the above "example" equivalent to

    const char example[] = F"foo" /*followed by the two characters*/ ".h";

or

    const char example[] = F"foo.h";

or ill-formed / implementation-defined / undefined. However, if I'm right about the behavior of the above #include, I would accept any of these results, or even "I don't care — let the committee sort it out", because my impression is that the preprocessor has lots of these little corner cases that end up getting sorted out over decades by the vendors, rather than by the standard.

Re preprocessing (e.g. base64-decoding), I've often wished that the major vendors would provide some Perl-style
"input from shell-pipe" syntax; for example,

    #include "image.png.64 | base64 -d"
    #include "<(base64 -d image.png.txt)"
    #include "<(wget -O - http://example.com/static/image.png)"

This strikes me as not a matter for top-down standardization but rather for some vendor to take the plunge (and expose themselves to all the bad publicity that would come with enabling arbitrary code execution in a C++ compiler).

my $.02,
–Arthur

Nicol Bolas

Apr 29, 2016, 12:01:52 PM
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com

According to the standard:

> A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal.

So it would be a `\n`, not a `\r\n`.

Granted, the above quote is not in normative text (presumably because section 2.14.3 makes it more clear). But clearly that is the intent of the specification. So if VS doesn't do that correctly, then it's broken.

And since VS has had raw string literals for a while now, odds are good someone would have noticed it if they did it wrong.

Andrew Tomazos

Apr 29, 2016, 1:16:01 PM
to std-pr...@isocpp.org
Not so fast:

"Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined."

This mapping is commonly known as the "source encoding".  As part of the source file, the content of raw string literals is input, likewise, in source encoding.

Nicol Bolas

Apr 29, 2016, 2:01:08 PM
to ISO C++ Standard - Future Proposals

Considering that the specification has non-normative examples of raw string literals and their non-raw equivalents, and those examples explicitly show that a "source encoding" newline should be equivalent to "\n", then clearly the writers of the specification believe that the conversion is not implementation dependent. So either your interpretation or their interpretation of the spec is wrong.

In any case, it seems to me that "introducing new-line characters for end-of-line indicators" would mean that the implementation is not allowed to keep a platform-specific end-of-line indicator. It must use the source character set value of "\n". That is, what "\n" maps to is implementation-defined. That "end-of-line indicators" must become "\n" is not.

Andrew Tomazos

Apr 30, 2016, 6:16:20 PM
to std-pr...@isocpp.org
The examples you are referring to do not show how the new-line is encoded in the original physical source file.  They cannot, unless they show a hex dump of the bytes of the physical source file.  The examples just show that the "text" new line in the raw string literal (after decoding) can be equivalent to an escaped new line '\n' in an ordinary string literal.  I think the motivation of the example was just to show that raw string literals can contain embedded new lines (unlike ordinary string literals) - among other things.

In Table 7 it says '\n' maps to NL(LF) and that '\r' maps to CR, and offers no further definition of what NL(LF) and CR are.  I assume NL(LF) is the new line character that is a member of the basic source character set, and that CR is the carriage return that is a member of the basic execution character set.  I think these basic source character set new lines are the same "new-line characters" referred to in the "introducing new-line characters for end-of-line indicators" during phase 1 source decoding.

Nicol Bolas

unread,
Apr 30, 2016, 9:47:47 PM4/30/16
to ISO C++ Standard - Future Proposals
On Saturday, April 30, 2016 at 6:16:20 PM UTC-4, Andrew Tomazos wrote:
On Fri, Apr 29, 2016 at 8:01 PM, Nicol Bolas <jmck...@gmail.com> wrote:
On Friday, April 29, 2016 at 1:16:01 PM UTC-4, Andrew Tomazos wrote:
On Fri, Apr 29, 2016 at 6:01 PM, Nicol Bolas <jmck...@gmail.com> wrote:
On Thursday, April 28, 2016 at 5:39:09 PM UTC-4, Arthur O'Dwyer wrote:
I was just thinking before your post that this has shades of the icky "mode" behavior of fopen(); i.e., it's up to the programmer (and therefore often buggy) whether the file is opened in "b"inary or "t"ext mode. What makes for the "often buggy" part is that the path-of-least-resistance happens to work perfectly on Unix/Linux/OSX, and therefore the vast majority of working programmers never need to learn the icky parts.

What happens on a Windows platform when I write

    const char data[] = R"(
    )";

? Does data come out equivalent to "\n" or to "\r\n"? Does it depend on the compiler (MSVC versus Clang) or not? I don't have a Windows machine to find out for myself, sorry.

According to the standard:

>  A source-file new-line in a raw string literal results in a new-line in the resulting execution string-
literal.

So it would be a `\n`, not a `\r\n`.

Granted, the above quote is not in normative text (presumably because section 2.14.3 makes it more clear). But clearly that is the intent of the specification. So if VS doesn't do that correctly, then it's broken.

And since VS has had raw string literals for a while now, odds are good someone would have noticed it if they did it wrong.

Not so fast:

"Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined."

This mapping is commonly known as the "source encoding".  As a part of the source file, the contents of raw string literals are input, likewise, in source encoding.

Considering that the specification has non-normative examples of raw string literals and their non-raw equivalents, and those examples explicitly show that a "source encoding" newline should be equivalent to "\n", then clearly the writers of the specification believe that the conversion is not implementation dependent. So either your interpretation or their interpretation of the spec is wrong.

The examples you are referring to do not show how the new-line is encoded in the original physical source file.

It doesn't matter how it's encoded in the original physical source file. So long as the line endings in the physical source file are converted to proper C++ newline characters, everything works fine.

Consider what would happen if you perform #include on a file that consists solely of a raw string literal. Textual file inclusion should be the equivalent of that: it does whatever #include would do, if the bytes of text file being included had been wrapped in a raw string (of the appropriate encoding).
 
They cannot, unless they show the hex dump of the bytes of the physical source file.  The examples just show that the "text" new line in the raw string literal (after decoding) can be equivalent to an escaped new line '\n' in an ordinary string literal.  I think the motivation of the example was just to show that raw string literals can contain embedded new lines (unlike ordinary string literals) - among other things.

The example in the standard does not use the term "can be equivalent". Indeed, the example's terminology leaves no room for doubt. Observe:

Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
 
const char* p = R"(a\
b
c)"
;
assert(std::strcmp(p, "a\\\nb\nc") == 0);


"the assert will succeed" does not leave room for the ambiguity you seem to hold to. All of the other examples use the term "is equivalent", not "can be equivalent".

So either the standard has a defect that should be corrected or your view of how raw string literals are allowed to translate is wrong.

In Table 7 it says '\n' maps to NL(LF) and that '\r' maps to CR, and offers no further definition of what NL(LF) and CR are.

Sure it does. It defines '\n' to be "NL(LF)", but it also declares it to be the grammatical construct "new-line". Which is used in several places in the C++ grammar.

How they're encoded is essentially irrelevant. What matters is that the character set must provide a value that represents '\n', and when a file is loaded, platform-specific new-lines must be converted into 'new-line'.

I assume NL(LF) is the new line character that is a member of the basic source character set, and that CR is the carriage return that is a member of the basic execution character set.  I think these basic source character set new lines are the same "new-line characters" referred to in the "introducing new-line characters for end-of-line indicators" during phase 1 source decoding.

You don't have to think; it says so. "new-line" is used quite frequently in the grammar.

Arthur O'Dwyer

unread,
May 1, 2016, 1:31:11 AM5/1/16
to ISO C++ Standard - Future Proposals
On Sat, Apr 30, 2016 at 6:47 PM, Nicol Bolas <jmck...@gmail.com> wrote:
> On Saturday, April 30, 2016 at 6:16:20 PM UTC-4, Andrew Tomazos wrote:
>> On Fri, Apr 29, 2016 at 8:01 PM, Nicol Bolas <jmck...@gmail.com> wrote:
>>> On Friday, April 29, 2016 at 1:16:01 PM UTC-4, Andrew Tomazos wrote:
>>>> On Fri, Apr 29, 2016 at 6:01 PM, Nicol Bolas <jmck...@gmail.com> wrote:
>>>>> On Thursday, April 28, 2016 at 5:39:09 PM UTC-4, Arthur O'Dwyer wrote:
>>>>>>
>>>>>> What happens on a Windows platform when I write
>>>>>>
>>>>>>     const char data[] = R"(
>>>>>>     )";
>>>>>>
>>>>>> ? Does data come out equivalent to "\n" or to "\r\n"? Does it depend
>>>>>> on the compiler (MSVC versus Clang) or not? I don't have a Windows machine
>>>>>> to find out for myself, sorry.
>>>>>
>>>>> According to the standard:
>>>>>
>>>>> >  A source-file new-line in a raw string literal results in a new-line
>>>>> > in the resulting execution string-literal.
>>>>>
>>>>> So it would be a `\n`, not a `\r\n`.
>>>>>
>>>>> Granted, the above quote is not in normative text (presumably because
>>>>> section 2.14.3 makes it more clear). But clearly that is the intent of the
>>>>> specification. [...]

>>>>
>>>> Not so fast:
>>>>
>>>> "Physical source file characters are mapped, in an
>>>> implementation-defined manner, to the basic source character set
>>>> (introducing new-line characters for end-of-line indicators) if necessary.
>>>> The set of physical source file characters accepted is
>>>> implementation-defined."
>>>>
>>>> This mapping is commonly known as the "source encoding".  As a part of
>>>> the source file, the contents of raw string literals are input, likewise, in
>>>> source encoding.

That doesn't contradict what Nicol said. In fact, I'm pretty sure that you two (Andrew, Nicol) are in violent agreement on the points that matter. But regardless, I'm pretty sure that Nicol is right. :)


> It doesn't matter how it's encoded in the original physical source file. So
> long as the line endings in the physical source file are converted to proper
> C++ newline characters, everything works fine.

Right.


> Consider what would happen if you perform #include on a file that consists
> solely of a raw string literal. Textual file inclusion should be the
> equivalent of that: it does whatever #include would do, if the bytes of text
> file being included had been wrapped in a raw string (of the appropriate
> encoding).

This is the part that requires a separate "raw/binary" mode for the inclusion of data files.
Suppose I'm on a Windows platform.
If I use the proposed new "file string literal" construct to include a file "foo.bin" whose physical contents are (hex)

    89 50 4E 47 0D 0A 1A 0A ...

[ http://www.libpng.org/pub/png/spec/1.2/PNG-Structure.html ]
then I definitely do not want that "end-of-line indicator" (0D 0A) converted into a new-line character (0A) by any phase of translation. It is absolutely critical that

    const char mypng[] = RF"foo.bin";

produce exactly the same program as

    const char mypng[] = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, ... };

. Whereas, if the file contains

    68 69 0D 0A 62 79 65 0D 0A ...

then it is absolutely critical that

    const char mymsg[] = F"foo.bin";

produce exactly the same program as

    const char mymsg[] = { 0x68, 0x69, '\n', 0x62, 0x79, 0x65, '\n', ... };

Notice that I used my suggestion of "R" in the first case to mean "take the physical bytes", and left off the "R" in the second case to mean "take the file as if it were textually included as source code going through the various phases of translation".

Notice that in that last sentence I used the cumbersome phrase "as source code..." instead of just saying "encoded in the source character set." That's because the source character encoding isn't the only thing that's relevant here. The format of the end-of-line indicator is not determined by the source character encoding.


>> In Table 7 it says '\n' maps to NL(LF) and that '\r' maps to CR [...]

>> I assume NL(LF) is the new line character that is a member of the basic
>> source character set, and that CR is the carriage return that is a member of
>> the basic execution character set.

No; the four characters '\n' in the source correspond at runtime to the character value of NL in the execution character set.
You're correct that the four characters '\r' in the source correspond at runtime to the character value of CR in the execution character set.
By runtime, no vestiges of the source character encoding remain. By definition, the character encoding that matters at execution time is the execution character encoding. (The standard often abuses the term "character set" to mean "character encoding"; this is unimportant.)

Generally speaking, barring weirdnesses like Green Hills' "euc2jis" mode, the source and execution character sets are identical, and the source and execution character encodings are also identical. However, at the source level we also have the concept of "end-of-line indicator", which doesn't correspond to anything at execution time. On Windows the end-of-line indicator is the pair of characters 0D 0A (that is to say, CR NL). The standard is clear (per Nicol's explanation) that if the end-of-line indicator 0D 0A occurs inside a raw string literal in the source code, it must be translated to a new-line character (that is to say, NL; that is to say, 0A).

If you were on Windows, cross-compiling for an old Mac where the new-line character (in the execution character encoding) was 0D, then the standard would require that

assert(strcmp("\n", "\x0d") == 0);  // axiom: NL in the execution character set is 0D
const char data[] = R"(
)";
assert(strcmp(data, "\n") == 0);  // because the end-of-line indicator on line 2 was converted to NL
assert(strcmp(data, "\x0d") == 0);  // Q.E.D.

If you were on Windows, compiling for Windows where the new-line character (in the execution character encoding) was 0A, then the standard would require that

assert(strcmp("\n", "\x0a") == 0);  // axiom: NL in the execution character set is 0A
const char data[] = R"(
)";
assert(strcmp(data, "\n") == 0);  // because the end-of-line indicator on line 2 was converted to NL
assert(strcmp(data, "\x0a") == 0);  // Q.E.D.


I'm pretty sure I've got all that right. Anyone disagree?

–Arthur

Matthew Woehlke

unread,
May 3, 2016, 11:49:31 AM5/3/16
to std-pr...@isocpp.org
On 2016-04-28 17:39, Arthur O'Dwyer wrote:
> Either way, your proposal should include an example along the lines of
>
> const char example[] = RF"foo(bar.h)foo";
>
> Does this mean "include bar.h", or "include foo(bar.h)foo" — and why?

Certainly the latter; anything else is just overly complicating things
to no benefit.

If you really need a string like "foo" + contents of 'bar.h' + "foo",
use concatenation:

const char example[] = "foo" RF"bar.h" "foo";

> I think your proposal should include an example
> along the lines of
>
> #define X(x) F ## #x
> #include X(foo) ".h"
> const char example[] = X(foo) ".h";
>
> I think the behavior of the above #include is implementation-defined or
> possibly undefined; I haven't checked.
> I'm curious whether you'd make the behavior of the above "example"
> equivalent to
>
> const char example[] = F"foo" /*followed by the two characters*/ ".h";
>
> or
>
> const char example[] = F"foo.h";
>
> or ill-formed / implementation-defined / undefined [...] or even "I
> don't care — let the committee sort it out".

Does string literal concatenation even occur during the PP phase? Some
crude experiments with GCC¹ suggest otherwise, which makes this a moot
point (read: the first result is obviously and necessarily "correct").

To get the second, you would probably have to write something more like
(disclaimer: not tested):

#define S(s) #s
#define CAT_(a, b) a ## b
#define CAT(a, b) CAT_(a, b)  // indirection so S(x.h) expands before pasting
#define X(x) CAT(F, S(x.h))
const char example[] = X(foo); // F"foo.h" (a single token only under the proposed F prefix)

(¹ echo 'char const* c = "a" "b" "c";' | gcc -E -)

--
Matthew

Greg Marr

unread,
May 3, 2016, 4:50:41 PM5/3/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Tuesday, May 3, 2016 at 11:49:31 AM UTC-4, Matthew Woehlke wrote:
On 2016-04-28 17:39, Arthur O'Dwyer wrote:
> Either way, your proposal should include an example along the lines of
>
>     const char example[] = RF"foo(bar.h)foo";
>
> Does this mean "include bar.h", or "include foo(bar.h)foo" — and why?

Certainly the latter; anything else is just overly complicating things
to no benefit.

If you really need a string like "foo" + contents of 'bar.h' + "foo",
use concatenation:

I believe that Arthur was comparing to current raw string literals:

const char example[] = R"foo(bar.h)foo";

results in example containing "bar.h".

Does adding the F after the R change the delimiter semantics?

Nicol Bolas

unread,
May 3, 2016, 8:48:50 PM5/3/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com

I thought the point was that `R` already had a meaning and therefore shouldn't be used to mean "binary". There should be a syntax specifically for reading files that should be interpreted as binary data.

Arthur O'Dwyer

unread,
May 3, 2016, 10:04:18 PM5/3/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
Correct, that's what I was getting at.
And the fact that Matthew apparently interpreted the (non-)meaning of the string as something like

    const char example[] = "foo" F"bar.h" "foo";

— i.e. yet a third interpretation — shows that there's room for programmer confusion here. This is why I think an example or two is important, and why I think the exact syntax needs discussion and polishing, so that the final result avoids as much programmer confusion as possible.

–Arthur

Matthew Woehlke

unread,
May 4, 2016, 10:23:17 AM5/4/16
to std-pr...@isocpp.org
Ah... yes, I interpret the original example as being the same as
`F"foo(bar.h)foo"`, with the `R` serving only to specify binary vs. text
include mode. (As Nicol notes, this may be a good reason to use
something other than `R` for that purpose. Maybe we should use `Ft` and
`Fb` instead, with `F` by itself being a synonym for `Ft`?)

I doubt esoteric file names are going to be so common as to justify a
mechanism for naming them other than whatever can be used in e.g.
`#include <foo.h>`. Let's not overengineer the feature :-). If we
*really* need it, we could just specify that escapes are parsed within
the name; that's inconvenient but covers anything, and realistically I
doubt it will be needed much if at all.

(Can you #include a file name with e.g. a newline in its name? I've
never actually had occasion to need to find out...)

--
Matthew

Nicol Bolas

unread,
May 4, 2016, 11:08:53 AM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Wednesday, May 4, 2016 at 10:23:17 AM UTC-4, Matthew Woehlke wrote:
Ah... yes, I interpret the original example as being the same as
`F"foo(bar.h)foo"`, with the `R` serving only to specify binary vs. text
include mode. (As Nicol notes, this may be a good reason to use
something other than `R` for that purpose. Maybe we should use `Ft` and
`Fb` instead, with `F` by itself being a synonym for `Ft`?)

We need more than just `t` and `b` here. We need to be able to use the full range of encodings that C++ provides for string literals: narrow, wide, UTF-8, UTF-16, and UTF-32.

So `u8F` would mean that the file is encoded in UTF-8, so the generated literal should match. I would prefer to avoid `Fu8`, because that makes `Fu8"filename.txt"` seem like the `u8` applies to the filename literal rather than the generated one.

We need to have: `F` (narrow), `LF` (wide), `u8F` (UTF-8), `uF` (UTF-16), `UF` (UTF-32), and `bF` (no translation).

Nicol Bolas

unread,
May 4, 2016, 11:22:40 AM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com

Actually, something just occurred to me about `bF`. Namely, NUL termination.

All of the genuine string literals should be NUL terminated, since that's how we expect literals to behave. But `bF` shouldn't be NUL terminated. So... what do we do?

A string literal is always considered an array in C++, so sizing information is there. But people are very used to discarding sizing information. How many times have you seen the equivalent of this:

const char *str = "SomeLiteral";

We do this all the time. Now, to be fair to us, this is in part because of a long-time lack of `string_view` and an appropriate literal: `auto str = "SomeLiteral"sv;`.

But my point is that people frequently discard sizing information of string literals. Because of NUL termination, that's generally OK, if slow. NUL characters in strings are quite rare, thanks to long-standing practice of using NUL characters as terminators.

NUL characters in binary files however are very common. So using `strlen` to recompute the lost length is not workable.

Given that we're adding something new to the language anyway (the encoding formats I used above could simply be a new case for `string-literal`, using `encoding-prefix<opt>`, with the exception of `b`), maybe we could also add some language that will prevent this. Perhaps a `bF` string literal is a special "binary literal" that can have some different rules. It could have the type `const unsigned char[X]`, but we could add language so that the literal itself would not decay into a pointer. You could still do this:

const unsigned char var[] = bF"Filename.bin";
const unsigned char* pVar = var;

But you couldn't do this directly:

const unsigned char* pVar = bF"Filename.bin";

I do feel that we should make `bF` return an array of `unsigned char`'s though.

Matthew Woehlke

unread,
May 4, 2016, 12:24:17 PM5/4/16
to std-pr...@isocpp.org
On 2016-05-04 11:22, Nicol Bolas wrote:
> Actually, something just occurred to me about `bF`. Namely, NUL termination.
>
> All of the genuine string literals should be NUL terminated, since that's
> how we expect literals to behave. But `bF` shouldn't be NUL terminated.
> So... what do we do?

I'm not sure that's genuinely a problem¹. Many file formats, even
binary, are likely tolerant of a "stray" NUL at the end, and even if
not, I can't think how you would use such a string without specifying
the length, in which case it would be trivial to subtract 1.

(¹ Didn't we have this conversation already? It seems familiar...)

> A string literal is always considered an array in C++, so sizing
> information is there. But people are very used to *discarding* sizing
> information. How many times have you seen the equivalent of this:
>
> const char *str = "SomeLiteral";

Well, it's fine (if inefficient) for text.

> Given that we're adding something new to the language anyway (the encoding
> formats I used above could simply be a new case for `string-literal`, using
> `encoding-prefix<opt>`, with the exception of `b`), maybe we could also add
> some language that will prevent this. Perhaps a `bF` string literal is a
> special "binary literal" that can have some different rules. It could have
> the type `const unsigned char[X]`, but we could add language so that the
> literal itself would not decay into a pointer. You could still do this:
>
> const unsigned char var[] = bF"Filename.bin";
> const unsigned char* pVar = var;
>
> But you couldn't do this directly:
>
> const unsigned char* pVar = bF"Filename.bin";

Compilers should almost definitely *warn* about that. I'm less convinced
it needs to be forbidden in the standard.

--
Matthew

Arthur O'Dwyer

unread,
May 4, 2016, 4:09:34 PM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
Incorrect (for once ;)).
The prefixes u8, u, U, and L don't apply to the encoding of the *source code*. They apply to the encoding of the *runtime data*.
For example:

    const char d1[] = "abc";
    const char d2[] = u8"abc";
    assert(d1[0] == 'a');  // but not necessarily 0x61
    assert(d2[0] == 0x61);  // but not necessarily == 'a'

This source file can be encoded in ASCII or EBCDIC or otherwise, and compiled for a target platform that uses ASCII or EBCDIC or otherwise; no matter which of those 3x3 = 9 possibilities you pick, both asserts are guaranteed by the Standard to pass.

The source construct that looks like "a" (no matter what source-encoded bytes represent those three glyphs) is guaranteed to correspond at runtime to the bytes of data that represent the letter "a" in the runtime character encoding.
The source construct that looks like u8"a" (no matter what source-encoded bytes represent those five glyphs) is guaranteed to correspond at runtime to the bytes of data that represent the letter "a" in UTF-8, regardless of the runtime character encoding.

Therefore, leaving aside the issue of binary (raw) data and null-termination for a moment, the following constructs are unambiguous, if let's say "foo.txt" contains the text "foo" in source encoding:

    const char a[] = F"foo.txt";  assert(strcmp(a, "foo") == 0);
    const wchar_t b[] = LF"foo.txt";  assert(wcscmp(b, L"foo") == 0);
    const char c[] = u8F"foo.txt";  assert(memcmp(c, u8"foo", 4) == 0);
    const char16_t d[] = uF"foo.txt";  assert(memcmp(d, u"foo", 8) == 0);
    const char32_t e[] = UF"foo.txt";  assert(memcmp(e, U"foo", 16) == 0);

Andrew's proposal handles all of these cases perfectly.

The problems arise only when you don't want F"foo.txt" to behave like any variety of string literal — i.e., you don't want the compiler to run the data through the "decode source encoding into glyphs" pass.
Suppose I want to achieve the effect of

    const char single_byte_weights[] = { 134, 150, 150 };  // toy example; this array might be hundreds of elements long

So I write

    const char single_byte_weights[] = F"weights.bin";

where weights.bin is a 3-byte file containing the bytes 134, 150, 150 (that is, 0x86 0x96 0x96).
Now, I happen to be on an EBCDIC system, where

    const char f[] = "foo";
    assert(memcmp(f, "\x86\x96\x96", 4) == 0);

I compile my code on my EBCDIC platform, and it works fine.
Then I port my code to an ASCII platform. Obviously I don't do any kind of translation on my weights.bin file; that's raw data and I don't want those byte values to change. I compile my code on the ASCII platform:

    const char single_byte_weights[] = F"weights.bin";

and the compiler blows up! It's complaining that the byte 0x86 in "weights.bin" doesn't correspond to any ASCII character, so it's not legal to appear in (what's tantamount to) a string literal in my C++ source code.


Or suppose I'm on a Unix platform and I want the effect of

    const char single_byte_weights[] = { 13, 10 };

I compile

    const char single_byte_weights[] = F"weights.bin";

where weights.bin is a 2-byte file containing the bytes 13, 10.
It works fine on Unix.
Then I port my code to Windows. Obviously I don't do any kind of translation on my weights.bin file; that's raw data and I don't want those byte values to change. I compile my code on the Windows platform...
and the compiler happily accepts it...
but at runtime I notice that my weights are all wrong! The debugger tells me that my code has mysteriously changed to the equivalent of

    const char single_byte_weights[] = { 10 };

These are the kinds of problems that a "raw" or "binary" file-input mode would solve — they're problems related to source encoding. Problems related to the runtime encoding of glyphs into bytes are already thoroughly solved by the existing prefix/suffix system, which composes fine with the F-prefix.

HTH,
–Arthur

Arthur O'Dwyer

unread,
May 4, 2016, 4:18:16 PM5/4/16
to ISO C++ Standard - Future Proposals
On Wed, May 4, 2016 at 9:24 AM, Matthew Woehlke <mwoehlk...@gmail.com> wrote:
On 2016-05-04 11:22, Nicol Bolas wrote:
> Actually, something just occurred to me about `bF`. Namely, NUL termination.
>
> All of the genuine string literals should be NUL terminated, since that's
> how we expect literals to behave. But `bF` shouldn't be NUL terminated.
> So... what do we do?

I'm not sure that's genuinely a problem¹. Many file formats, even
binary, are likely tolerant of a "stray" NUL at the end, and even if
not, I can't think how you would use such a string without specifying
the length, in which case it would be trivial to subtract 1.

IMO this is a genuine problem.
The most obvious use-case for raw binary file input would be as a portable replacement for "xxd -i".
If you're trying to replace

    const char data[] = { 0x01, 0x02, 0x03 };

with a file-input construct, well, you just can't, unless the file-input construct allows you to specify an array that doesn't end with 0x00. If it shoves a 0x00 at the end of every array it creates, then not only will your executable wind up "too big" (and all your subsequent data wind up shoved down by one byte, which can indeed cause problems for embedded systems) but also sizeof(data) will be 4 instead of 3, so any code that uses sizeof(data) will be wrong... It's just a mess. Let's not do that.

I believe that prefix-b or prefix-R or whatever ends up getting proposed might as well do both things at once:
- don't apply source-encoding-to-character decoding
- don't add null terminator
that is, it should behave just like "xxd -i" (AFAIK).

–Arthur

Nicol Bolas

unread,
May 4, 2016, 7:05:20 PM5/4/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Wednesday, May 4, 2016 at 4:09:34 PM UTC-4, Arthur O'Dwyer wrote:
On Wed, May 4, 2016 at 8:08 AM, Nicol Bolas <jmck...@gmail.com> wrote:
On Wednesday, May 4, 2016 at 10:23:17 AM UTC-4, Matthew Woehlke wrote:
Ah... yes, I interpret the original example as being the same as
`F"foo(bar.h)foo"`, with the `R` serving only to specify binary vs. text
include mode. (As Nicol notes, this may be a good reason to use
something other than `R` for that purpose. Maybe we should use `Ft` and
`Fb` instead, with `F` by itself being a synonym for `Ft`?)