In the current specification of std::filesystem, I find the allowance for a platform-specific encoding of the char type problematic. To my knowledge, the behavior is configurable neither statically in the source nor dynamically at runtime. This is annoying in cross-platform code that stores UTF-8 encoded std::string values, since on Windows the filesystem::path constructor treats such strings as encoded in the local ANSI code page. Correct cross-platform code should always convert UTF-8 to UTF-16 on Windows and replace every call to path::string() with path::u8string(). These precautions are easy to forget, which leads to incorrect i18n handling.
The solution I'm currently testing in my code is to fully alias std::filesystem, recreating it as std::u8filesystem with classes and methods of the same names that treat std::string as UTF-8 encoded. Each class inherits from its std::filesystem counterpart without adding any fields or virtual methods, to offer full interoperability. I have some incomplete proof-of-concept code on GitHub [1]. So far I'm testing on Windows only, but other popular platforms already use UTF-8, so there it should be enough to simply alias std::u8filesystem to std::filesystem.
The questions I have are: Are my concerns about UTF-8 handling in std::filesystem shared by the C++ community?
Are there alternative solutions that are in development or that I'm missing?
Could my approach also be considered for adoption in the standard?
struct utf8_path_t {};
inline constexpr utf8_path_t utf8_path{};
// UTF-8 encoded string -> path
template<typename Source>
auto operator|(const Source &str, utf8_path_t) { return std::filesystem::u8path(str); }
// path -> UTF-8 encoded string
inline auto operator|(const std::filesystem::path &pth, utf8_path_t) { return pth.u8string(); }
Really, is calling `std::u8path` that difficult? Sure, it's inconvenient, but it's not particularly difficult. Plus it makes it clear that you're using
On Friday, September 29, 2017 at 6:59:16 PM UTC-4, cez...@gmail.com wrote:
> In the current specification for std::filesystem I'm finding problematic the allowance for platform specific encoding for the char type. To my knowledge the behavior is not configurable statically in the source or dynamically at runtime. This is annoying in cross-platform code when storing utf-8 encoded std::string since in Windows the construction of filesystem::path will treat the strings as local ANSI charset encoded. Correct cross-platform code should always convert UTF-8 to UTF-16 on Windows platform,
No, correct cross-platform code should use `std::filesystem::u8path` when creating paths from UTF-8 strings.
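For illustration, a minimal C++17 sketch of that approach. The helper names `path_from_utf8` and `utf8_from_path` are hypothetical and exist only for this example; the only library calls involved are `std::filesystem::u8path` and `path::u8string()`.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: build a path from a std::string known to hold UTF-8.
// u8path converts to the native encoding where needed (UTF-16 on Windows);
// on typical POSIX systems the bytes are stored unchanged.
// Note: u8path is the C++17 spelling and is deprecated in C++20.
inline fs::path path_from_utf8(const std::string &utf8) {
    return fs::u8path(utf8);
}

// Hypothetical helper: get the path back as UTF-8 bytes in a std::string.
// u8string() returns std::string in C++17 and std::u8string in C++20, so the
// bytes are copied through iterators to compile with either.
inline std::string utf8_from_path(const fs::path &p) {
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}
```

Code that goes through these two helpers never touches `path::string()` or the `std::string` constructor of `path`, which are the two places where the native narrow encoding would be applied.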
FYI: You're not allowed to add arbitrary stuff to the `std` namespace.
I don't consider what you've done to be a "solution" in any case, since it simply replaces `std::filesystem::u8path` with `std::u8filesystem::path`. Either way, you have to change your code. It's also not clear how interoperable this is. If a user uses `std::u8filesystem::path`, is that type interconvertible with `std::filesystem::path`?
Also, your `u8filesystem` namespace makes it impossible to work with narrow character strings and paths.
The only effective solution would be for the committee to adopt "P0482: A type for UTF-8 characters and strings." That would allow `using u8string = basic_string<char8_t, ...>` to exist, and therefore `path` could use that type to do any UTF conversions. But thus far, the committee has been (needlessly) hesitant to do so, despite how often this keeps cropping up.
Assuming the committee remains resistant to a UTF-8 character type, a far more reasonable library solution would be this:
struct utf8_path_t {};
inline constexpr utf8_path_t utf8_path; // [...]
On Saturday, September 30, 2017 at 5:54:07 AM UTC+2, Nicol Bolas wrote:
> Really, is calling `std::u8path` that difficult? Sure, it's inconvenient, but it's not particularly difficult. Plus it makes it clear that you're using
The problem is not only remembering to use a separate constructor function, the problem is also an implicit conversion from std::string to std::filesystem::path which is currently in place.
> On Friday, September 29, 2017 at 6:59:16 PM UTC-4, cez...@gmail.com wrote:
> > In the current specification for std::filesystem I'm finding problematic the allowance for platform specific encoding for the char type. To my knowledge the behavior is not configurable statically in the source or dynamically at runtime. This is annoying in cross-platform code when storing utf-8 encoded std::string since in Windows the construction of filesystem::path will treat the strings as local ANSI charset encoded. Correct cross-platform code should always convert UTF-8 to UTF-16 on Windows platform,
> No, correct cross-platform code should use `std::filesystem::u8path` when creating paths from UTF-8 strings.
I forgot about `std::filesystem::u8path`. Still it's something that one has to remember to use and be consistent.
> FYI: You're not allowed to add arbitrary stuff to the `std` namespace.
Of course u8filesystem can be put in a different namespace without making it less convenient to use. Still, for my code I will ignore your suggestion ;)
> I don't consider what you've done to be a "solution" in any case, since it simply replaces `std::filesystem::u8path` with `std::u8filesystem::path`. Either way, you have to change your code. It's also not clear how interoperable this is. If a user uses `std::u8filesystem::path`, is that type interconvertible with `std::filesystem::path`?
According to my tests, yes, it is interconvertible. It's true that the core of my code just redefines some constructors and methods, but interchangeability with `std::filesystem::path` is obtained much more elegantly than you suspect (I think):
void func(const std::filesystem::path &pth);
func(u8filesystem::path(some_string));
func(filesystem::u8path(some_string));
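The two calls above work the same way because a publicly derived object binds directly to `const std::filesystem::path &`. A minimal C++17 sketch of that idea follows; it is not the actual proof-of-concept code, and the namespace `u8fs_sketch`, its `path` class, and the function `leaf_name` are assumptions made up for illustration.

```cpp
#include <filesystem>
#include <string>

namespace u8fs_sketch {

// Derives publicly from std::filesystem::path and adds no data members or
// virtual functions, so it can be passed wherever the standard type is expected.
class path : public std::filesystem::path {
public:
    path() = default;

    // Wrap an existing standard path unchanged.
    path(const std::filesystem::path &p) : std::filesystem::path(p) {}

    // Narrow strings are always interpreted as UTF-8, never as the
    // platform's native narrow encoding.
    path(const std::string &utf8)
        : std::filesystem::path(std::filesystem::u8path(utf8)) {}
    path(const char *utf8)
        : std::filesystem::path(std::filesystem::u8path(utf8)) {}

    // Hides std::filesystem::path::string() and always returns UTF-8 bytes.
    // u8string() yields std::string in C++17 and std::u8string in C++20.
    std::string string() const {
        const auto u8 = u8string();
        return std::string(u8.begin(), u8.end());
    }
};

} // namespace u8fs_sketch

// A function written against the standard type only.
inline std::string leaf_name(const std::filesystem::path &p) {
    const auto u8 = p.filename().u8string();
    return std::string(u8.begin(), u8.end());
}
```

One limitation of this sketch: inherited members such as `filename()` or `parent_path()` still return a plain `std::filesystem::path`, so calling `.string()` on their result goes back to the native narrow encoding unless each of them is wrapped as well.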
So, if there is a function:
void func(const std::filesystem::path &pth);
How do I force it to use `u8filesystem`?
[...] Lots of code will be taking `filesystem::path` objects; they're not going to switch to `u8filesystem::path`. So while this namespace may make your personal code easier[...]
On Saturday, September 30, 2017 at 4:58:59 PM UTC+2, Nicol Bolas wrote:
> Not to mention the fact that you're creating a whole new type (and replicating existing APIs that use it) just to change the behavior of two functions.
Yes, I'm deliberately doing this to change the behavior of two functions, but for a reason. Maybe the committee didn't receive enough feedback on the behavior of the `std::filesystem::path` constructors, so I will add mine: in my opinion, an implicit conversion from `std::string`, when the encoding of that string can be platform dependent, is an error-prone API design choice. Either this gets addressed (`std::filesystem` is still experimental) by removing the implicit conversion and forcing the user to always declare the encoding of the string when creating a `std::filesystem::path` (for example with the utf8_path technique you suggested), or alternative structures should be provided so that a consistent ecosystem built on UTF-8 encoded strings can be made safe relatively quickly and easily. It's true that this reasoning would apply to `fstream` as well. But that's not a good reason to persist with error-prone API design in a brand new C++ standard module that is focused on handling filesystem paths.
That's because there are two types of OS-level API:
1) those that use arbitrary 8-bit for filenames
2) those that use UTF-16 for filenames
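The split is visible in the library itself through `path::value_type`, which is `wchar_t` on Windows and `char` on POSIX systems. A small sketch; the names `native_is_wide` and `native_unit_count` are made up for illustration.

```cpp
#include <cstddef>
#include <filesystem>
#include <type_traits>

namespace fs = std::filesystem;

// True on platforms whose native path API uses wide characters (Windows),
// false where file names are sequences of 8-bit units (POSIX systems).
constexpr bool native_is_wide =
    std::is_same_v<fs::path::value_type, wchar_t>;

// Number of native code units stored in a path; native() returns the
// string in the platform's own representation without converting it.
inline std::size_t native_unit_count(const fs::path &p) {
    return p.native().size();
}
```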
On Tuesday, 3 October 2017 08:50:06 PDT Matthew Woehlke wrote:
> On 2017-10-03 11:20, Nicol Bolas wrote:
> > Not to mention, having `char8_t` as a character type means that it is at
> > least theoretically possible for us to stop using `char*` for strings,
> > reserving it to mean "byte buffer".
>
> I was going to say that, but also someone must point out that we have
> std::byte for that... OTOH...
std::byte exists *because* char8_t doesn't. Among (unsigned) char, char8_t and
std::byte, we need only two.
--
Matthew
On Tuesday, 3 October 2017 14:34:52 PDT Tom Honermann wrote:
> On 10/03/2017 05:15 PM, Thiago Macieira wrote:
> > On Tuesday, 3 October 2017 13:59:14 PDT Richard Smith wrote:
> >> We need (at least) three one-byte types: byte, native narrow character, and
> >> UTF-8 character. "char" is wrong for all three use cases. "unsigned char"
> >> would make a fairly decent byte, but that ship has sailed. (Arguably, an
> >> int8_t / uint8_t would be useful too, and is none of those three.)
> >
> > Do we need a native narrow character? Can't we just say that everything
> > that isn't Unicode text (UTF-8) is a bag of bytes that could be binary?
> As Richard indicated, we need a type that is distinct from std::byte to
> avoid the aliasing issues, and a type that is distinct from char8_t to
> avoid character encoding confusion.
I understood the motivation. I just disagree with it.
Non-UTF8 encoding does not need further optimisation. It can stay aliasing
everything else and be considered nothing more than "bag of bytes".