On 17.03.2020 19:07, Felix Palmen wrote:
> * Alf P. Steinbach <alf.p.stein...@gmail.com>:
>> Additionally, at least for the standard library's filesystem, the
>> encoding is system specific, UTF-8 in Linux and Windows ANSI in Windows.
>
> Without knowing this particular API: This can't be entirely true. It
> seems to be something handling filenames. The thing is: A filename on a
> Unix system is just an array of octets, with no encoding information
> attached or implied.
The "implied" is incorrect.
Today the implied encoding of a Linux filename is UTF-8.
I guess a lot of tools will have difficulties with filenames that are
invalid as UTF-8, even though technically, wrt. the OS API, they're valid
filenames. 20 years ago such names would have been fine; that has changed
since then.
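To illustrate the gap between what the OS accepts and what tools expect, here is a minimal sketch of a UTF-8 well-formedness check that could be applied to the bytes of a filename. The function name `is_valid_utf8` is hypothetical, not part of the standard library or of any particular tool:

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(std::string_view s)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                      // Stray continuation or invalid lead byte.

        if (i + len > s.size()) return false;   // Sequence is truncated.

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;   // Not a continuation byte.
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_cp[len]) return false;                 // Overlong encoding.
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;     // UTF-16 surrogate.
        if (cp > 0x10FFFF) return false;                    // Beyond Unicode range.
        i += len;
    }
    return true;
}
```

For example, `is_valid_utf8("caf\xC3\xA9")` is true, while the Latin-1 bytes `"caf\xE9"` are rejected, although Linux accepts both as filenames.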
> So it's UTF-8 IFF the file in question was named
> while a UTF-8 locale was in effect.
That's true as far as it goes, which is a micrometer or so.
> On Windows, it's a different thing: Filenames are 16-bit wide characters,
> encoded in UTF-16. That seems to be the reason why the Windows
> implementation returns a newly created object: It must convert the
> native Windows filename to an 8-bit string.
Right, but only because the library chooses to use UTF-16 internally in
Windows.
IIRC the original filesystem v3 spec had that as a requirement.
With the standardization, instead the iostreams were outfitted with
constructors taking `std::filesystem::path`.
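A minimal sketch of what those constructors allow, assuming a C++17 standard library; the helper name `write_then_read_line` is made up for the example:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: writes `text` as one line to `file`, then reads it back.
std::string write_then_read_line(const fs::path& file, const std::string& text)
{
    {
        std::ofstream out(file);    // C++17 constructor taking a filesystem path.
        out << text << '\n';
    }                               // Stream is closed here, flushing the data.

    std::ifstream in(file);         // Same kind of constructor for input.
    std::string line;
    std::getline(in, line);
    return line;
}
```

Since the stream is given the `path` object itself rather than a `char` string, the filename should not need a round trip through the narrow encoding in the calling code.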
> This whole thing is a
> recurring PITA, as Windows API calls based on `char` (and also e.g. the
> Standard C library functions like `fopen()`) indeed use a Windows ANSI
> encoding, so information is lost.
No, information is only lost when the process Windows ANSI codepage is
not UTF-8.
Support for UTF-8 Windows ANSI process codepages was added in May 2019
(Windows 10 version 1903).
> IMHO the only sane thing to do on
> Windows when you need 8-bit strings is to enforce using UTF-8,
That has now, in the last few years, become a good idea.
Because now also the system compiler, Visual C++, supports UTF-8 as
execution character set, as well as default source encoding.
Still, it's worth noting that without setting a UTF-8 process codepage:
* `main` arguments will be ANSI encoded,
* ditto for `char` based environment variables,
* narrow output to wide streams will be treated as ANSI,
* narrow filenames will be assumed to be ANSI,
* all the locale dependent functionality will be ungood.
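A program can check at run time which case it is in. The sketch below assumes the Win32 `GetACP` function and the `CP_UTF8` constant from `<windows.h>`; the function name `narrow_strings_are_utf8` is hypothetical, and the non-Windows branch simply assumes UTF-8 by convention rather than checking anything:

```cpp
#ifdef _WIN32
#   include <windows.h>     // GetACP, CP_UTF8
#endif

// Hypothetical helper: true if `char`-based strings from the system can be
// treated as UTF-8 in this process.
bool narrow_strings_are_utf8()
{
#ifdef _WIN32
    return GetACP() == CP_UTF8;     // Active (ANSI) codepage of this process.
#else
    return true;                    // Assumption: UTF-8 by convention, not verified.
#endif
}
```

When this returns false in Windows, the items in the list above apply and narrow strings need explicit conversion.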
> and convert it to/from UTF-16 whenever needed to use exclusively the
> wide-string APIs.
That used to be my advice for Windows.
However, if one doesn't have to support Windows versions earlier than
May 2019, then there is no need to do all this conversion.
- Alf