std::filesystem support for UTF-8 encoded std::string(s)


cez...@gmail.com

Sep 29, 2017, 6:59:16 PM
to ISO C++ Standard - Discussion
In the current specification for std::filesystem I'm finding problematic the allowance for platform specific encoding for the char type. To my knowledge the behavior is not configurable statically in the source or dynamically at runtime. This is annoying in cross-platform code when storing utf-8 encoded std::string since in Windows the construction of filesystem::path will treat the strings as local ANSI charset encoded. Correct cross-platform code should always convert UTF-8 to UTF-16 on Windows platform, and substitute all the occurrences of path::string() method with path::u8string(). Forgetting these precautions is quite easy, leading to incorrect i18n handling.

The solution I'm currently testing in my code is to fully alias std::filesystem, recreating a std::u8filesystem with classes and methods of the same name that handle std::string as UTF-8 encoded strings. Classes inherit from their counterparts in std::filesystem without adding any fields or virtual methods, to offer full interoperability. I have some incomplete proof-of-concept code on GitHub[1]. So far I'm testing on Windows only, but other popular platforms are already UTF-8, so it shouldn't be a problem to just alias std::filesystem with std::u8filesystem there.

The questions I have are: are my concerns about UTF-8 handling in std::filesystem shared by the c++ community? Are there alternative solutions that are in development or that I'm missing? Could my approach also be considered for adoption in the standard?

Cheers,
Francesco

Nicol Bolas

Sep 29, 2017, 11:54:07 PM
to ISO C++ Standard - Discussion, cez...@gmail.com
Really, is calling `std::u8path` that difficult? Sure, it's inconvenient, but it's not particularly difficult. Plus it makes it clear that you're using

On Friday, September 29, 2017 at 6:59:16 PM UTC-4, cez...@gmail.com wrote:
In the current specification for std::filesystem I'm finding problematic the allowance for platform specific encoding for the char type. To my knowledge the behavior is not configurable statically in the source or dynamically at runtime. This is annoying in cross-platform code when storing utf-8 encoded std::string since in Windows the construction of filesystem::path will treat the strings as local ANSI charset encoded. Correct cross-platform code should always convert UTF-8 to UTF-16 on Windows platform,

No, correct cross-platform code should use `std::filesystem::u8path` when creating paths from UTF-8 strings.
 
and substitute all the occurrences of path::string() method with path::u8string(). Forgetting these precautions is quite easy, leading to incorrect i18n handling.

The solution I'm currently testing in my code is to fully alias the std::filesystem, recreating a std::u8filesystem with classes and methods with same name that handle std::string as UTF-8 encoded strings. Classes inherit their corresponding in std::filesystem without adding any field or virtual method to offer full interoperability. I have some incomplete Proof of Concept code in github[1]. So far I'm testing in Windows only, but other popular platforms are already UTF-8 so it shouldn't be a problem to just alias std::filesystem with std::u8filesystem.

FYI: You're not allowed to add arbitrary stuff to the `std` namespace.

The questions I have are: are my concerns about UTF-8 handling in std::filesystem shared by the c++ community?

No more than any other issues around the `char` type being shared by narrow character strings and UTF-8 strings.

Are there alternative solutions that are in development or that I'm missing?

You mean, besides using `std::filesystem::u8path` and `path::u8string`?
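
For concreteness, a minimal sketch of that round trip. The helper names `path_from_utf8` and `utf8_from_path` are made up for illustration; only `u8path` and `u8string` come from the library. The copy through iterators is an assumption made for portability: `u8string()` returns `std::string` in C++17 but a `char8_t`-based string from C++20 on, where `u8path` is also deprecated and may produce a warning.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: build a path from bytes known to be UTF-8.
inline fs::path path_from_utf8(const std::string& utf8) {
    return fs::u8path(utf8);
}

// Hypothetical helper: get the path back as UTF-8 bytes in a std::string.
inline std::string utf8_from_path(const fs::path& p) {
    const auto u8 = p.u8string();              // std::string or std::u8string
    return std::string(u8.begin(), u8.end());  // copy the code units as char
}
```

On platforms whose native narrow encoding is already UTF-8 this is effectively a copy; on Windows it is where the UTF-8 to UTF-16 conversion happens.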

I don't consider what you've done to be a "solution" in any case, since it simply replaces `std::filesystem::u8path` with `std::u8filesystem::path`. Either way, you have to change your code. It's also not clear how interoperable this is. If a user uses `std::u8filesystem::path`, is that type interconvertible with `std::filesystem::path`?

Also, your `u8filesystem` namespace makes it impossible to work with narrow character strings and paths.

Overall, it seems absurd to create an entire new type in a new namespace just to replace the behavior of 2 overloads of 2 functions.

The only effective solution would be for the committee to adopt "P0482: A type for UTF-8 characters and strings." That would allow `using u8string = basic_string<char8_t, ...>` to exist, and therefore `path` could use that type to do any UTF conversions. But thus far, the committee has been (needlessly) hesitant to do so, despite how often this keeps cropping up.

Could my approach also be considered for adoption in the standard?

Your "solution" is to redeclare massive parts of the filesystem library, all to change the behavior of a very few functions.

Assuming the committee remains resistant to a UTF-8 character type, a far more reasonable library solution would be this:

struct utf8_path_t {};
inline constexpr utf8_path_t utf8_path;

template<typename Source>
auto operator|(const Source &str, utf8_path_t) { return filesystem::u8path(str); }
inline auto operator|(const filesystem::path &pth, utf8_path_t) { return pth.u8string(); }

So if you want to make a path from a UTF-8 string, you do `some_string | utf8_path`. If you want to extract a UTF-8 string from a path, you do `some_path | utf8_path`.

Feel free to make `utf16_path_t` and `utf32_path_t` versions for the sake of consistency.

I personally love using `operator|` with a `constexpr` variable to do conversions that aren't reasonable to be implicit, but I'd rather not have to rely on overloads to invoke. Also, `operator|` has much lower precedence than `+`, so you can still do string stuff like `some_string1 + u8".exe" | utf8_path`.
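
A self-contained sketch of the tag technique, including the precedence point. The `fs` alias and the `exe_path` helper are added here purely for illustration; some compilers may suggest parentheses around the `+` expression, which is only a warning.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

struct utf8_path_t {};
inline constexpr utf8_path_t utf8_path{};

// UTF-8 encoded string -> path
template <typename Source>
fs::path operator|(const Source& str, utf8_path_t) { return fs::u8path(str); }

// path -> UTF-8 encoded string (non-template, so it is preferred for fs::path)
inline auto operator|(const fs::path& pth, utf8_path_t) { return pth.u8string(); }

// Illustrative helper: `+` binds tighter than `|`, so the concatenation
// happens first and the whole string is converted.
inline fs::path exe_path(const std::string& stem) {
    return stem + ".exe" | utf8_path;
}
```

Note that the template overload accepts any left-hand type, so in real code it would probably need to be constrained or kept in its own namespace.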


Francesco Pretto

Sep 30, 2017, 3:21:13 AM
to ISO C++ Standard - Discussion, cez...@gmail.com

On Saturday, September 30, 2017 at 5:54:07 AM UTC+2, Nicol Bolas wrote:
Really, is calling `std::u8path` that difficult? Sure, it's inconvenient, but it's not particularly difficult. Plus it makes it clear that you're using

 
The problem is not only remembering to use a separate constructor function, the problem is also an implicit conversion from std::string to std::filesystem::path which is currently in place.

On Friday, September 29, 2017 at 6:59:16 PM UTC-4, cez...@gmail.com wrote:
In the current specification for std::filesystem I'm finding problematic the allowance for platform specific encoding for the char type. To my knowledge the behavior is not configurable statically in the source or dynamically at runtime. This is annoying in cross-platform code when storing utf-8 encoded std::string since in Windows the construction of filesystem::path will treat the strings as local ANSI charset encoded. Correct cross-platform code should always convert UTF-8 to UTF-16 on Windows platform,

No, correct cross-platform code should use `std::filesystem::u8path` when creating paths from UTF-8 strings.
 
I forgot about `std::filesystem::u8path`. Still it's something that one has to remember to use and be consistent.


FYI: You're not allowed to add arbitrary stuff to the `std` namespace.


Of course u8filesystem can be put in a different namespace without making it less convenient to use. Still, for my code I will ignore your suggestion ;)
 

I don't consider what you've done to be a "solution" in any case, since it simply replaces `std::filesystem::u8path` with `std::u8filesystem::path`. Either way, you have to change your code. It's also not clear how interoperable this is. If a user uses `std::u8filesystem::path`, is that type interconvertible with `std::filesystem::path`?

According to my tests yes, it is interconvertible. It's true that the core of my code is just redefining some constructors and methods, but interchangeability with `std::filesystem::path` is obtained much more elegantly than you suspect (I think):
 


Also, your `u8filesystem` namespace makes it impossible to work with narrow character strings and paths.
 
std::filesystem would still be there.


The only effective solution would be for the committee to adopt "P0482: A type for UTF-8 characters and strings." That would allow `using u8string = basic_string<char8_t, ...>` to exist, and therefore `path` could use that type to do any UTF conversions. But thus far, the committee has been (needlessly) hesitant to do so, despite how often this keeps cropping up.


 Proposing alternative solutions could also help urging the committee to take a decision on this.


Assuming the committee remains resistant to a UTF-8 character type, a far more reasonable library solution would be this:

struct utf8_path_t {};
inline constexpr utf8_path_t utf8_path; // [...]

My concern is that also in your approach there is something that the developer must tediously remember to do. But honestly: if there wasn't an *implicit* conversion from std::string, all your suggestions would be perfectly fine to me. In my opinion, implicit conversion tends to make the use of `std::filesystem::path` unsafe. The rationale of `std::u8filesystem` is also allowing to fix big parts of the code that use utf-8 encoded strings with few line changes.

Nicol Bolas

Sep 30, 2017, 10:58:59 AM
to ISO C++ Standard - Discussion, cez...@gmail.com
On Saturday, September 30, 2017 at 3:21:13 AM UTC-4, Francesco Pretto wrote:

On Saturday, September 30, 2017 at 5:54:07 AM UTC+2, Nicol Bolas wrote:
Really, is calling `std::u8path` that difficult? Sure, it's inconvenient, but it's not particularly difficult. Plus it makes it clear that you're using

 
The problem is not only remembering to use a separate constructor function, the problem is also an implicit conversion from std::string to std::filesystem::path which is currently in place.

On Friday, September 29, 2017 at 6:59:16 PM UTC-4, cez...@gmail.com wrote:
In the current specification for std::filesystem I'm finding problematic the allowance for platform specific encoding for the char type. To my knowledge the behavior is not configurable statically in the source or dynamically at runtime. This is annoying in cross-platform code when storing utf-8 encoded std::string since in Windows the construction of filesystem::path will treat the strings as local ANSI charset encoded. Correct cross-platform code should always convert UTF-8 to UTF-16 on Windows platform,

No, correct cross-platform code should use `std::filesystem::u8path` when creating paths from UTF-8 strings.
 
I forgot about `std::filesystem::u8path`. Still it's something that one has to remember to use and be consistent.

FYI: You're not allowed to add arbitrary stuff to the `std` namespace.

Of course u8filesystem can be put in a different namespace without making it less convenient to use. Still, for my code I will ignore your suggestion ;)

It's not "my suggestion"; it's the C++ standard. From [namespace.std]/1:

> The behavior of a C++ program is undefined if it adds declarations or definitions to namespace std or to a namespace within namespace std unless otherwise specified.

You can ignore that if you like. But that doesn't make your program rely any less on UB.
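
To stay clear of that rule, such a wrapper can live in a user namespace instead. A rough sketch follows; the `u8fs` namespace and the members shown are hypothetical and are not taken from the proof-of-concept code mentioned earlier in the thread.

```cpp
#include <filesystem>
#include <string>

namespace u8fs {  // hypothetical user namespace, deliberately not inside std

class path : public std::filesystem::path {
public:
    path() = default;
    path(const std::filesystem::path& p) : std::filesystem::path(p) {}

    // Interpret narrow strings as UTF-8 rather than the native narrow encoding.
    path(const std::string& utf8)
        : std::filesystem::path(std::filesystem::u8path(utf8)) {}

    // Hides the base-class string() and returns UTF-8 bytes instead.
    std::string string() const {
        const auto u8 = u8string();
        return std::string(u8.begin(), u8.end());
    }
};

}  // namespace u8fs
```

Because the class adds no data members, converting it to `std::filesystem::path` is an ordinary derived-to-base copy.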

I don't consider what you've done to be a "solution" in any case, since it simply replaces `std::filesystem::u8path` with `std::u8filesystem::path`. Either way, you have to change your code. It's also not clear how interoperable this is. If a user uses `std::u8filesystem::path`, is that type interconvertible with `std::filesystem::path`?

According to my tests yes, it is interconvertible . It's true that the core of my code is just redefining some constructors and method, but interchangeability with `std::filesystem::path` is obtained much more elegantly than you suspect (I think):
 


So, if there is a function:

void func(const std::filesystem::path &pth);

How do I force it to use `u8filesystem`? Oh sure, if I already have a `u8filesystem::path`, then I can convert that to a `filesystem::path` (and thanks for an extra copy/move operation). But if all I have is a `string`, then I have to avoid using `filesystem::path`'s implicit conversion. I have to call it with:

func(u8filesystem::path(some_string));

How is that easier to remember than:

func(filesystem::u8path(some_string));

Remember: it's not just your code out there. Lots of code will be taking `filesystem::path` objects; they're not going to switch to `u8filesystem::path`. So while this namespace may make your personal code easier, it won't help anything that needs to interoperate with the outside world.
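
A small sketch of the two call styles being compared. `native_length` is a made-up stand-in for outside code that only knows about `filesystem::path`; the wrappers around it are likewise illustrative.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Stand-in for third-party code taking fs::path.
inline std::size_t native_length(const fs::path& pth) {
    return pth.native().size();
}

// Compiles silently: the implicit conversion reads the string in the
// native narrow encoding, whatever that is on the platform.
inline std::size_t via_implicit(const std::string& s) {
    return native_length(s);
}

// States the encoding at the call site: the string is read as UTF-8.
inline std::size_t via_u8path(const std::string& s) {
    return native_length(fs::u8path(s));
}
```

For pure ASCII input both calls agree everywhere. For non-ASCII input they can differ on platforms where the native narrow encoding is not UTF-8, which is the case the thread is about.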

Not to mention the fact that you're creating a whole new type (and replicating existing APIs that use it) just to change the behavior of two functions.

Francesco Pretto

Oct 3, 2017, 5:21:04 AM
to ISO C++ Standard - Discussion, cez...@gmail.com
On Saturday, September 30, 2017 at 4:58:59 PM UTC+2, Nicol Bolas wrote:
> Not to mention the fact that you're creating a whole new type (and replicating existing APIs that use it) just to change the behavior of two functions.

Yes, I'm deliberately doing this to change the behavior of two functions, but for a reason. Maybe the committee didn't receive enough feedback on the behavior of the `std::filesystem::path` constructors, hence I will add mine: in my opinion, having an implicit conversion from `std::string` when the encoding of the string can be platform dependent is an error-prone API design choice. Either this gets addressed (`std::filesystem` is still experimental) by removing the implicit conversion and forcing the user to always declare the encoding of the string when creating a `std::filesystem::path` (for example using the utf8_path technique you suggested), or alternative structures are provided so that a consistent ecosystem using utf-8 encoded strings can be made safe relatively quickly and easily. It's true that this reasoning would apply to `fstream` as well. But that's not a good reason to persist with error-prone API design in a brand new C++ standard module that is focused on handling filesystem paths.

So, if there is a function:

void func(const std::filesystem::path &pth);

How do I force it to use `u8filesystem`? 

[...] Lots of code will be taking `filesystem::path` objects; they're not going to switch to `u8filesystem::path`. So while this namespace may make your personal code easier[...]

There would be no need to force any external code to use `u8filesystem`. And yes, `u8filesystem` would best fit self-contained projects that internally use utf-8 for their strings. In a sense, you convinced me that `u8filesystem` may not be the best choice in a public library with a wide user base. Still, it makes sense in a private/public ecosystem where everything is handled as utf-8 encoded.

Tom Honermann

Oct 3, 2017, 11:13:30 AM
to std-dis...@isocpp.org, cez...@gmail.com
On 10/03/2017 05:21 AM, Francesco Pretto wrote:
On Saturday, September 30, 2017 at 4:58:59 PM UTC+2, Nicol Bolas wrote:
> Not to mention the fact that you're creating a whole new type (and replicating existing APIs that use it) just to change the behavior of two functions.

Yes, I'm deliberately doing this to change the behavior of two functions, but for a reason. Maybe the committee didn't receive enough feedback on the behavior of `std::filesystem::path` constructors, hence I will add mine: to my advice, having and implicit conversion from `std::string` when the encoding of the string that can be platform dependent is an error prone API design choice. Either this get addressed (`std::filesystem` is still experimental), removing the implicit conversion and forcing the user to always declare the encoding of the string when creating `std::filesystem::path` (for example using the utf8_path technique you suggested), or alternative structures are provided so one consistent ecosystem using utf-8 encoded strings can be made safe relatively quickly and easily. It's true that this reasoning would apply to `fstream` as well. But that's not a good reason to pursue with error prone API design in a brand new C++ standard module that is focused on handling filesystem paths.
std::filesystem is in C++17; I don't think it is correct to categorize it as experimental at this point.

I share concerns regarding interoperability of UTF-8 encoded strings and, well, pretty much the entire standard library.  I wrote P0482 [1] based on these concerns; the motivation section discusses std::filesystem and u8path.  I find it troubling that use of UTF-8, the most popular character encoding of all time, requires special handling relative to other encodings.

Some members of the committee have taken the view that UTF-8 is the future, that all implementations will eventually converge to use of UTF-8 as the narrow character encoding, and we therefore don't need a distinct type (char8_t) for UTF-8 and can live with one-off workarounds like u8path until the future arrives.  They may be right.  I'm skeptical that this future will arise any time soon though given the amount of legacy code still in use today that was not written for UTF-8; particularly for Windows and z/OS and for countries where encodings such as GB18030 and ShiftJIS remain popular.  Visual Studio 2015 at least added an option [2] to allow use of UTF-8 as the source and narrow execution character sets.  The questions are, when will that option become the default behavior and how much legacy code will be updated to work correctly when compiled in that mode?  I'd like to see char8_t added in order to enable writing portable code using UTF-8 now without having to resort to odd workarounds like u8path.  Such a change would have no effect on any future convergence to UTF-8 for the narrow character encoding.

Tom.

[1]: https://wg21.link/p0482
[2]: https://docs.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8

Nicol Bolas

Oct 3, 2017, 11:20:23 AM
to ISO C++ Standard - Discussion, cez...@gmail.com
Not to mention, having `char8_t` as a character type means that it is at least theoretically possible for us to stop using `char*` for strings, reserving it to mean "byte buffer". Or more to the point, `char8_t` won't be allowed to alias with other types, which can improve efficiency, since the compiler doesn't have to assume that you might do something alias-y with such a pointer.

So even in this supposed UTF-8 future, there's a need to have a distinct character type.

Matthew Woehlke

Oct 3, 2017, 11:50:12 AM
to std-dis...@isocpp.org, Nicol Bolas, cez...@gmail.com
On 2017-10-03 11:20, Nicol Bolas wrote:
> Not to mention, having `char8_t` as a character type means that it is at
> least theoretically possible for us to stop using `char*` as for strings,
> reserving it to mean "byte buffer".

I was going to say that, but also someone must point out that we have
std::byte for that... OTOH...

> Or more to the point, `char8_t` *won't* be allowed to alias with
> other types, which can improve efficiency, since the compiler doesn't
> have to assume that you might do something alias-y with such a
> pointer.
...this seems more useful. For *this* reason, it possibly makes more
sense to deprecate use of `char` for text, with the goal of making
std::byte obsolete, vs. pursuing any future in which char8_t would
eventually become superfluous.

--
Matthew

Thiago Macieira

Oct 3, 2017, 12:15:50 PM
to std-dis...@isocpp.org
On terça-feira, 3 de outubro de 2017 08:13:25 PDT Tom Honermann wrote:
> I share concerns regarding interoperability of UTF-8 encoded strings
> and, well, pretty much the entire standard library. I wrote P0482 [1]
> based on these concerns; the motivation section discusses
> std::filesystem and u8path. I find it troubling that use of UTF-8, the
> most popular character encoding of all time, requires special handling
> relative to other encodings.

That's because there are two types of OS-level API:

1) those that use arbitrary 8-bit for filenames
2) those that use UTF-16 for filenames

As you can see, UTF-8 is not listed. Sure, the vast majority of the systems
using arbitrary 8-bit are choosing to use UTF-8, but that's not a requirement.
In fact, enforcing UTF-8 for those systems causes issues that Qt developers
are familiar with: files whose names fail to decode as UTF-8 simply
disappear from directory listings and cannot be accessed (we consider those
files to be filesystem corruption). I don't think that's acceptable for the C++
Standard Library.
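
To make "fail to decode as UTF-8" concrete, here is a sketch of a well-formedness check over raw filename bytes. The function name `is_valid_utf8` is made up for illustration, and this is not how Qt or any standard library implements decoding.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `s` is well-formed UTF-8. Rejects stray continuation
// bytes, truncated sequences, overlong forms, surrogates and code points
// above U+10FFFF.
inline bool is_valid_utf8(std::string_view s) {
    static const unsigned long min_cp[] = {0, 0x0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                        // invalid lead byte
        if (s.size() - i < len) return false;     // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}
```

A filename stored as Latin-1 bytes, for example `caf` followed by the single byte 0xE9, is a perfectly legal name on an arbitrary-8-bit filesystem but fails this check.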

When interfacing with the Win32 API, you'll get the UTF-16 conversion. When
interfacing with the Unix API, you'll get the arbitrary 8-bit data. Converting
from the UTF-16 to UTF-8 is extremely uncommon.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Thiago Macieira

Oct 3, 2017, 12:17:25 PM
to std-dis...@isocpp.org
On terça-feira, 3 de outubro de 2017 08:50:06 PDT Matthew Woehlke wrote:
> On 2017-10-03 11:20, Nicol Bolas wrote:
> > Not to mention, having `char8_t` as a character type means that it is at
> > least theoretically possible for us to stop using `char*` as for strings,
> > reserving it to mean "byte buffer".
>
> I was going to say that, but also someone must point out that we have
> std::byte for that... OTOH...

std::byte exists *because* char8_t doesn't. Among (unsigned) char, char8_t and
std::byte, we need only two.

Hyman Rosen

Oct 3, 2017, 12:31:13 PM
to std-dis...@isocpp.org
On Tue, Oct 3, 2017 at 12:15 PM, Thiago Macieira <thi...@macieira.org> wrote:
That's because there are two types of OS-level API:
 1) those that use arbitrary 8-bit for filenames
 2) those that use UTF-16 for filenames

Yeah, no: <https://mjtsai.com/blog/2017/06/27/apfs-native-normalization/>.
There are filesystems which normalize utf-8 names, so that different sequences
of 8-bit characters become the same file name in the filesystem, and when those
names are read back, they have their normalized form. 

Thiago Macieira

Oct 3, 2017, 12:48:52 PM
to std-dis...@isocpp.org
On terça-feira, 3 de outubro de 2017 09:30:50 PDT Hyman Rosen wrote:
> Yeah, no: <https://mjtsai.com/blog/2017/06/27/apfs-native-normalization/>.
> There are filesystems which normalize utf-8 names, so that different
> sequences
> of 8-bit characters become the same file name in the filesystem, and when
> those
> names are read back, they have their normalized form.

Though it normalises UTF-8, it's still encoding-agnostic and you can create
malformed-UTF-8 entries. So you cannot enforce UTF-8 in the API either.

Matthew Woehlke

Oct 3, 2017, 12:53:14 PM
to std-dis...@isocpp.org, Thiago Macieira
On 2017-10-03 12:17, Thiago Macieira wrote:
> On terça-feira, 3 de outubro de 2017 08:50:06 PDT Matthew Woehlke wrote:
>> On 2017-10-03 11:20, Nicol Bolas wrote:
>>> Not to mention, having `char8_t` as a character type means that it is at
>>> least theoretically possible for us to stop using `char*` as for strings,
>>> reserving it to mean "byte buffer".
>>
>> I was going to say that, but also someone must point out that we have
>> std::byte for that... OTOH...
>
> std::byte exists *because* char8_t doesn't. Among (unsigned) char, char8_t and
> std::byte, we need only two.

Well, yes, but *someone* is almost certainly going to argue that
std::byte was here first, and thus we shouldn't go adding a third type
that is born superfluous.

*I* am not making that argument. If you re-read the part of my previous
post that you snipped, you'll note that I basically agreed with your
closing statement.

--
Matthew

Richard Smith

Oct 3, 2017, 2:45:20 PM
to std-dis...@isocpp.org
On 3 October 2017 at 09:17, Thiago Macieira <thi...@macieira.org> wrote:
On terça-feira, 3 de outubro de 2017 08:50:06 PDT Matthew Woehlke wrote:
> On 2017-10-03 11:20, Nicol Bolas wrote:
> > Not to mention, having `char8_t` as a character type means that it is at
> > least theoretically possible for us to stop using `char*` as for strings,
> > reserving it to mean "byte buffer".
>
> I was going to say that, but also someone must point out that we have
> std::byte for that... OTOH...

std::byte exists *because* char8_t doesn't. Among (unsigned) char, char8_t and
std::byte, we need only two.

I don't think that's true at all.

We desperately need a type that unambiguously indicates its content is UTF-8. Despite having both char and std::byte, we still don't have such a type. What we do have is experience that tells us that using "char" as that type does not work -- "char" is just used far too much to mean "the current narrow character encoding". (It's *also* used to mean "raw storage", but now we have std::byte, hopefully that usage will diminish.)

Ross Smith

Oct 3, 2017, 4:10:57 PM
to std-dis...@isocpp.org
On 2017-10-04 05:15, Thiago Macieira wrote:
>
> That's because there are two types of OS-level API:
>
> 1) those that use arbitrary 8-bit for filenames
> 2) those that use UTF-16 for filenames

Not quite. In fact there are three:

1. Those that use arbitrary 8-bit filenames (pretty much all Unix
systems except Apple).
2. Those that use UTF-8 (Apple's HFS+ and APFS, though in slightly
different ways, and as Hyman has pointed out, APFS can also be
used in arbitrary-8-bit-filename mode).
3. Those that use arbitrary 16-bit filenames (Windows).

Contrary to popular belief, Windows filenames are not UTF-16; they are
strings of arbitrary 16-bit unsigned integers, i.e. UCS-2. Apart from
a short list of banned characters (control characters 0-31, and 9
punctuation marks: <>:"/\|?* ), Windows filenames can contain any
16-bit code unit, including unpaired surrogates. You can easily create
a file whose name is not valid UTF-16, although some of the Windows
APIs will have trouble with it.

This causes problems for anything that tries to be a universal file
system API: it means that there is no universal string format you
can use on all systems, since Unix systems can have arbitrary 8-bit
names that can't be translated to UTF-16, while Windows can have
arbitrary 16-bit names that can't be translated to UTF-8.
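
The "can't be translated to UTF-8" case on the 16-bit side comes down to unpaired surrogates. A sketch of a check for that, with the hypothetical name `is_well_formed_utf16`:

```cpp
#include <cstddef>
#include <string_view>

// True if every surrogate code unit in `s` is part of a high+low pair.
inline bool is_well_formed_utf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {             // high surrogate
            if (i + 1 == s.size()) return false;      // nothing follows it
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                      // skip the low half
        } else if (u >= 0xDC00 && u <= 0xDFFF) {      // low surrogate on its own
            return false;
        }
    }
    return true;
}
```

A name for which this returns false has no UTF-8 representation, so any conversion must either fail or substitute something for the offending code unit.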

In my own file utility library I've simply accepted the need for two
different representations, using std::string on Unix and std::wstring
on Windows.
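
A sketch of that two-representation approach. The alias `native_string` and the helper `native_name` are hypothetical names, not taken from the library mentioned above. The `static_assert` encodes an assumption about mainstream implementations, where `std::filesystem::path::string_type` makes the same split.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>

#ifdef _WIN32
using native_string = std::wstring;   // 16-bit code units on Windows
#else
using native_string = std::string;    // 8-bit code units elsewhere
#endif

// Assumption: the implementation's path uses the same native string type.
static_assert(std::is_same_v<native_string, std::filesystem::path::string_type>);

// Returns the name exactly as the OS API sees it, with no transcoding.
inline native_string native_name(const std::filesystem::path& p) {
    return p.native();
}
```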

Ross Smith

Nicol Bolas

Oct 3, 2017, 4:35:27 PM
to ISO C++ Standard - Discussion, jmck...@gmail.com, cez...@gmail.com
We will always need byte buffers. And we will always need an array of characters.

What we don't need is for "an array of characters" to mean "byte buffer". That's the problem with `char*`; it means both.

The purpose of `std::byte` is to give us a way to say "byte buffer" which doesn't also say "array of characters". What we don't have is a way to say "array of characters" without also saying "byte buffer".

`char8_t` can do that for us. Or at least, with UTF-8 characters at any rate.

So given the "among (unsigned) char, char8_t and std::byte, we need only two." thing, it's `char*` that needs to go.



Thiago Macieira

Oct 3, 2017, 4:55:41 PM
to std-dis...@isocpp.org
On terça-feira, 3 de outubro de 2017 11:44:57 PDT Richard Smith wrote:
> > std::byte exists *because* char8_t doesn't. Among (unsigned) char, char8_t
> > and
> > std::byte, we need only two.
>
> I don't think that's true at all.
>
> We desperately need a type that unambiguously indicates its content is
> UTF-8. Despite having both char and std::byte, we still don't have such a
> type. What we do have is experience that tells us that using "char" as that
> type does not work -- "char" is just used far too much to mean "the current
> narrow character encoding". (It's *also* used to mean "raw storage", but
> now we have std::byte, hopefully that usage will diminish.)

Actually, let me rephrase what I said, after your reply and Matthew's: we do
need char8_t. The one we don't need is std::byte, since we already have
(unsigned) char. That already is a byte of unknown encoding.

Richard Smith

Oct 3, 2017, 4:59:38 PM
to std-dis...@isocpp.org, Nicol Bolas, cez...@gmail.com
We still need a type for the native narrow character encoding (the narrow analogue of wchar_t). That must not be char8_t, or else char8_t fails to serve its purpose as a type that indicates its content is encoded in UTF-8. Ideally it would not be an aliases-everything type... but the baggage we inherited from C indicates that this type is named 'char' and does alias everything.

We need (at least) three one-byte types: byte, native narrow character, and UTF-8 character. "char" is wrong for all three use cases. "unsigned char" would make a fairly decent byte, but that ship has sailed. (Arguably, an int8_t / uint8_t would be useful too, and is none of those three.)

Thiago Macieira

Oct 3, 2017, 5:15:38 PM
to std-dis...@isocpp.org
On terça-feira, 3 de outubro de 2017 13:59:14 PDT Richard Smith wrote:
> We need (at least) three one-byte types: byte, native narrow character, and
> UTF-8 character. "char" is wrong for all three use cases. "unsigned char"
> would make a fairly decent byte, but that ship has sailed. (Arguably, an
> int8_t / uint8_t would be useful too, and is none of those three.)

Do we need a native narrow character? Can't we just say that everything that
isn't Unicode text (UTF-8) is a bag of bytes that could be binary?

Tom Honermann

Oct 3, 2017, 5:34:54 PM
to std-dis...@isocpp.org
On 10/03/2017 05:15 PM, Thiago Macieira wrote:
> On terça-feira, 3 de outubro de 2017 13:59:14 PDT Richard Smith wrote:
>> We need (at least) three one-byte types: byte, native narrow character, and
>> UTF-8 character. "char" is wrong for all three use cases. "unsigned char"
>> would make a fairly decent byte, but that ship has sailed. (Arguably, an
>> int8_t / uint8_t would be useful too, and is none of those three.)
> Do we need a native narrow character? Can't we just say that everything that
> isn't Unicode text (UTF-8) is a bag of bytes that could be binary?
>
As Richard indicated, we need a type that is distinct from std::byte to
avoid the aliasing issues, and a type that is distinct from char8_t to
avoid character encoding confusion.

Tom.

Thiago Macieira

Oct 3, 2017, 5:50:54 PM
to std-dis...@isocpp.org
I understood the motivation. I just disagree with it.

Non-UTF8 encoding does not need further optimisation. It can stay aliasing
everything else and be considered nothing more than "bag of bytes".

Richard Smith

Oct 3, 2017, 6:14:36 PM
to std-dis...@isocpp.org
On 3 October 2017 at 14:50, Thiago Macieira <thi...@macieira.org> wrote:
On terça-feira, 3 de outubro de 2017 14:34:52 PDT Tom Honermann wrote:
> On 10/03/2017 05:15 PM, Thiago Macieira wrote:
> > On terça-feira, 3 de outubro de 2017 13:59:14 PDT Richard Smith wrote:
> >> We need (at least) three one-byte types: byte, native narrow character,
> >> and
> >> UTF-8 character. "char" is wrong for all three use cases. "unsigned char"
> >> would make a fairly decent byte, but that ship has sailed. (Arguably, an
> >> int8_t / uint8_t would be useful too, and is none of those three.)
> >
> > Do we need a native narrow character? Can't we just say that everything
> > that isn't Unicode text (UTF-8) is a bag of bytes that could be binary?
> As Richard indicated, we need a type that is distinct from std::byte to
> avoid the aliasing issues, and a type that is distinct from char8_t to
> avoid character encoding confusion.

I understood the motivation. I just disagree with it.

Non-UTF8 encoding does not need further optimisation. It can stay aliasing
everything else and be considered nothing more than "bag of bytes".

I basically agree. It'd be preferable if applications that do a lot of heavy text transcoding work didn't suffer from the aliasing problems of char, but I don't think it's essential if we envision the future of C++ as using char8_t for all internal text processing.

I think the question is then whether text in the native narrow character encoding uses std::byte or char. And given where we are today -- with char used for plain string literals and by all the library functions dealing with narrow character strings -- char seems to be clearly the better choice.

So that leaves us with char8_t for UTF-8 text, char for text in the native narrow encoding, and std::byte for raw storage... which may not be perfect, but would be a significant improvement over the status quo in my view. (We still don't have a good representation for int8_t or uint8_t, but implementations can choose to provide those as extended integer types if they like.)
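
One practical consequence of having three distinct types is that an API can overload on them. A minimal sketch, with `kind` as a made-up function name; the `char8_t` overload is only compiled where the `__cpp_char8_t` feature-test macro is defined (C++20 and later).

```cpp
#include <cstddef>
#include <string_view>

// Each one-byte type selects a different overload, so the intended
// meaning of a buffer is visible in the type system.
constexpr std::string_view kind(const char*)      { return "native narrow text"; }
constexpr std::string_view kind(const std::byte*) { return "raw storage"; }
#ifdef __cpp_char8_t
constexpr std::string_view kind(const char8_t*)   { return "UTF-8 text"; }
#endif
```

With these in place, `kind("abc")` picks the native narrow overload, a `std::byte` buffer picks the raw-storage one, and `kind(u8"abc")` picks the UTF-8 one where `char8_t` exists.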

Thiago Macieira

Oct 3, 2017, 7:03:49 PM
to std-dis...@isocpp.org
On terça-feira, 3 de outubro de 2017 15:14:13 PDT Richard Smith wrote:
> So that leaves us with char8_t for UTF-8 text, char for text in the native
> narrow encoding, and std::byte for raw storage... which may not be perfect,
> but would be a significant improvement over the status quo in my view. (We
> still don't have a good representation for int8_t or uint8_t, but
> implementations can choose to provide those as extended integer types if
> they like.)

Having uint8_t be of a different type from unsigned char will be a lot of
headache. But, it's the implementation's choice.