[boost] [review] Review of Nowide (Unicode) starts today


Frédéric Bron via Boost

Jun 12, 2017, 12:21:33 AM
to bo...@lists.boost.org, Frédéric Bron
Hi Everyone,

The formal review of Artyom Beilis' Nowide library starts today and
will last until Wed. 21st of June.

Your participation is encouraged, as the proposed library is
uncoupled, focused and rather small. Nowadays everyone needs to
handle Unicode, but this is very difficult to do in a
platform-independent way using only the standard library. Nowide
offers a very simple way to handle Unicode identically on
Windows/MacOS/Linux.

Key features:

* Work with UTF-8 in your code; Nowide converts to the OS encoding
* Easy to use functions for converting UTF-8 to/from UTF-16
* A class to fix the argc, argv and env main parameters to use UTF-8
* UTF-8 aware functions:
- stdio.h functions (fopen, freopen, remove, rename)
- stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
- fstream (filebuf, fstream/ofstream/ifstream)
- iostream (cout, cerr, clog, cin)

Documentation:
http://cppcms.com/files/nowide/html

GitHub:
https://github.com/artyom-beilis/nowide
git clone https://github.com/artyom-beilis/nowide.git

Latest tarballs:
- to be unzipped in boost source:
https://github.com/artyom-beilis/nowide/archive/master.zip
- as a standalone library: http://cppcms.com/files/nowide/nowide_standalone.zip

Nowide has the standard Boost layout, so just put it into the Boost
source tree under the libs/nowide directory for building and running
tests.
Alternatively, for the header-only part of the library, add the
nowide/include path to the include path of your projects.
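
To give a feel for what this looks like in practice, here is a minimal
sketch of a program written against the library (header and class names
taken from the documentation linked above; please double-check them there):

    #include <boost/nowide/args.hpp>     // fixes argc/argv/env to be UTF-8 on Windows
    #include <boost/nowide/fstream.hpp>  // UTF-8 aware ifstream/ofstream
    #include <boost/nowide/iostream.hpp> // UTF-8 aware cout/cerr/cin

    int main(int argc, char **argv)
    {
        boost::nowide::args a(argc, argv); // no-op on POSIX, converts UTF-16 arguments on Windows

        if (argc < 2) {
            boost::nowide::cerr << "usage: show <utf-8 file name>\n";
            return 1;
        }

        boost::nowide::ifstream in(argv[1]); // the file name is treated as UTF-8 on every platform
        if (!in) {
            boost::nowide::cerr << "failed to open " << argv[1] << "\n";
            return 1;
        }
        boost::nowide::cout << in.rdbuf();
        return 0;
    }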

Please post your comments and review to the boost mailing list
(preferably), or privately to the Review Manager (to me ;-). Here are
some questions you might want to answer in your review:

- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any
problems?
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
- Are you knowledgeable about the problem domain?

And most importantly:
- Do you think the library should be accepted as a Boost library? Be
sure to say this explicitly so that your other comments don't obscure
your overall opinion.

For more information about Boost Formal Review Process, see:
http://www.boost.org/community/reviews.html

Thank you very much for your time and efforts.

Frédéric


Groke, Paul via Boost

Jun 12, 2017, 4:09:22 AM
to bo...@lists.boost.org, Groke, Paul
Frédéric Bron wrote
> Your participation is encouraged, as the proposed library is uncoupled,
> focused and rather small. Nowadays, everyone needs to handle Unicode but
> this is very difficult using only the standard library in a platform independant
> way. Nowide offers a very simple way to handle Unicode the same way on
> Windows/MacOS/Linux.
>
> Key features:
>
> * work with UTF-8 in your code, Nowide converts to OS encoding
> * Easy to use functions for converting UTF-8 to/from UTF-16
> * A class to fixing argc, argc and env main parameters to use UTF-8
> * UTF-8 aware functions:
> - stdio.h functions (fopen, freopen, remove, rename)
> - stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
> - fstream (filebuf, fstream/ofstream/ifstream)
> - iostream (cout, cerr, clog, cin)

Not a review, just a question (actually two): how does Nowide deal with modified UTF-8? I'm asking because if modified UTF-8 is fully supported (converting from/to, passing modified UTF-8 directly to the UTF-8 aware functions etc.), then Nowide could be valuable to us. Also: are any other platforms aside from Windows/OS X/Linux supported (e.g. AIX, z/OS)?

Thanks,
Paul Groke

Artyom Beilis via Boost

Jun 12, 2017, 4:16:13 AM
to bo...@lists.boost.org, Artyom Beilis
> Not a review, just a question (actually two): how does Nowide deal with modified UTF-8?

No, modified UTF-8 is not supported; since it isn't UTF-8, it will be
considered an invalid encoding.

> Also: are any other platforms aside from Windows/OS X/Linux supported (e.g. AIX, z/OS)?

Actually, if you take a look at the docs, the support for POSIX-based
OSes (or OSes with a native narrow API) is transparent.

>
> Thanks,
> Paul Groke

Artyom Beilis

Yakov Galka via Boost

Jun 12, 2017, 5:03:43 AM
to bo...@lists.boost.org, Yakov Galka
On Mon, Jun 12, 2017 at 11:15 AM, Artyom Beilis via Boost <
bo...@lists.boost.org> wrote:

> No, modified utf-8 is not supported since it isn't utf-8 it will be
> considered invalid encoding.
>

On a related note, does it support WTF-8? I.e. encoding lone UTF-16
surrogates (malformed UTF-16 sequences) within the UTF-8 scheme. It is
needed to guarantee UTF-16 → UTF-8 → UTF-16 roundtrip of invalid UTF-16
data on Windows, and is not an invalid behavior per se, because all valid
UTF-16 sequences still map bijectively onto valid UTF-8 sequences.
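
For readers unfamiliar with WTF-8, a byte-level illustration of the round
trip being discussed (illustration only, not code from the library):

    // U+D800 is a lead surrogate with no trailing pair: malformed UTF-16,
    // and it has no valid UTF-8 form. WTF-8 encodes it as if it were an
    // ordinary code point:
    unsigned short lone_surrogate = 0xD800;
    unsigned char  wtf8_bytes[]   = { 0xED, 0xA0, 0x80 };
    // A strict UTF-8 decoder must reject ED A0 80; a WTF-8 decoder maps it
    // back to the single code unit 0xD800, which is what makes the
    // UTF-16 -> WTF-8 -> UTF-16 round trip lossless.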

--
Yakov Galka
http://stannum.co.il/

Groke, Paul via Boost

Jun 12, 2017, 5:20:51 AM
to bo...@lists.boost.org, Groke, Paul
Artyom Beilis wrote
> > Not a review, just a question (actually two): how does Nowide deal with
> modified UTF-8?
>
> No, modified utf-8 is not supported since it isn't utf-8 it will be considered
> invalid encoding.

I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I think it could make sense to support it anyway though. Round tripping (strictly invalid, but possible) file names on Windows, easier interoperability with stuff like JNI, ...
OTOH it would add overhead for systems with native UTF-8 APIs, because Nowide would at least have to check every string for "modified UTF-8 encoded" surrogate pairs and convert the string if necessary. Which of course is a good argument for not supporting modified UTF-8, because then Nowide could just pass the strings through unmodified on those systems.

Yakov Galka via Boost

Jun 12, 2017, 5:29:52 AM
to bo...@lists.boost.org, Yakov Galka
On Mon, Jun 12, 2017 at 12:20 PM, Groke, Paul via Boost <
bo...@lists.boost.org> wrote:

> I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I
> think it could make sense to support it anyway though. Round tripping
> (strictly invalid, but possible) file names on Windows, easier
> interoperability with stuff like JNI, ...
>

Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes the
null character with an overlong sequence, and thus is incompatible with
standard UTF-8, unlike WTF-8 which is a compatible extension.


> OTOH it would add overhead for systems with native UTF-8 APIs, because
> Nowide would at least have to check every string for "modified UTF-8
> encoded" surrogate pairs and convert the string if necessary. Which of
> course is a good argument for not supporting modified UTF-8, because then
> Nowide could just pass the strings through unmodified on those systems.
>

Implementing WTF-8 removes a check in UTF-8 → UTF-16 conversion, and
doesn't change anything in the reverse direction when there is a valid
UTF-16. I suspect it isn't slower.

--
Yakov Galka
http://stannum.co.il/

_______________________________________________

Groke, Paul via Boost

Jun 12, 2017, 5:55:18 AM
to bo...@lists.boost.org, Groke, Paul
Yakov Galka wrote:

> On Mon, Jun 12, 2017 at 12:20 PM, Groke, Paul via Boost
> <mailto:bo...@lists.boost.org> wrote:
>> I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I
>> think it could make sense to support it anyway though. Round tripping
>> (strictly invalid, but possible) file names on Windows, easier
>> interoperability with stuff like JNI, ...
>
> Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes
> the null character with an overlong sequence, and thus is incompatible
> with standard UTF-8, unlike WTF-8 which is a compatible extension.

No, I mean modified UTF-8. Modified UTF-8 is UTF-8 plus the following extensions:
- Allow encoding UTF-16 surrogates as if they were code points (=what "WTF-8" does)
- Allow an over-long 2 byte encoding of the NUL character

Both are not strictly UTF-8 compatible, but neither introduces significant
overhead in most situations. I don't see how over-long NUL encodings are
"more incompatible" than UTF-8 encoded surrogates, but then again
that's not really important.
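
To make the difference concrete, a byte-level illustration of the two
extensions (illustration only):

    // 1) Overlong NUL: modified UTF-8 (as used by JNI) encodes U+0000 in two
    //    bytes so that the encoded string contains no 0x00 byte.
    unsigned char modified_nul[]      = { 0xC0, 0x80 };       // invalid in strict UTF-8 and in WTF-8
    // 2) Encoded surrogates: both modified UTF-8 and WTF-8 write a lone UTF-16
    //    surrogate such as 0xDC00 as a three-byte sequence.
    unsigned char encoded_surrogate[] = { 0xED, 0xB0, 0x80 }; // invalid in strict UTF-8, allowed in WTF-8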

>> OTOH it would add overhead for systems with native UTF-8 APIs, because
>> Nowide would at least have to check every string for "modified UTF-8
>> encoded" surrogate pairs and convert the string if necessary. Which of
>> course is a good argument for not supporting modified UTF-8, because
>> then Nowide could just > pass the strings through unmodified on those
>> systems.
>
> Implementing WTF-8 removes a check in UTF-8 -> UTF-16 conversion, and
> doesn't change anything in the reverse direction when there is a valid
> UTF-16. I suspect it isn't slower.

Supporting modified UTF-8 or WTF-8 adds overhead on systems where the
native OS API accepts UTF-8, but only strictly valid UTF-8.
When some UTF-8 enabled function of the library is called on such a
system, it would have to check for WTF-8 encoded surrogates and
convert them to "true" UTF-8 before passing the string to the OS API.
Because you would expect and want the "normal" UTF-8 encoding for
a string to refer to the same file as the WTF-8 encoding of the same
string.

Regards,
Paul Groke

Artyom Beilis via Boost

Jun 12, 2017, 6:53:01 AM
to bo...@lists.boost.org, Artyom Beilis
>
> On a related note, does it support WTF-8? I.e. encoding lone UTF-16
> surrogates (malformed UTF-16 sequences) within the UTF-8 scheme. It is
> needed to guarantee UTF-16 → UTF-8 → UTF-16 roundtrip of invalid UTF-16 data
> on Windows, and is not an invalid behavior per se, because all valid UTF-16
> sequences still map bijectively onto valid UTF-8 sequences.
>
> --
> Yakov Galka
> http://stannum.co.il/


No, it does not.

I considered it before, but I think that creating or accepting
malformed UTF-8 or UTF-16 is a security risk.

Converting invalid UTF-16 to WTF-8 and the other way around is not
obvious behavior and has the potential for security problems, especially
for users who are not aware of such an issue. So invalid UTF-8/16
sequences are rejected by design.


Artyom

Peter Dimov via Boost

Jun 12, 2017, 8:11:50 AM
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:

> Converting invalid UTF-16 to WTF-8 and other way around is not obvious
> behavior and has potential of security risk especially for users that are
> not aware of such an issue. So invalid UTF-8/16 sequences are rejected by
> design.

Implying that there are Windows file names that can't be handled by the
library?

Peter Dimov via Boost

Jun 12, 2017, 8:57:47 AM
to bo...@lists.boost.org, Peter Dimov
> Documentation:
> http://cppcms.com/files/nowide/html

Question: what do the functions do (on Windows) when the passed UTF-8 is
invalid? From a brief look at the documentation, I don't see this specified,
but I may have missed it.

degski via Boost

Jun 12, 2017, 10:17:57 AM
to boost, degski
On 12 June 2017 at 15:11, Peter Dimov via Boost <bo...@lists.boost.org>
wrote:

> Artyom Beilis wrote:
>
> Converting invalid UTF-16 to WTF-8 and other way around is not obvious
>> behavior and has potential of security risk especially for users that are
>> not aware of such an issue. So invalid UTF-8/16 sequences are rejected by
>> design.
>>
>
> Implying that there are Windows file names that can't be handled by the
> library?
>

Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
as UB?" How important is this point? I'm oblivious to this problem, and it
sounds like I would like to keep it that way.

degski
--
"*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend,
Schmerzen aus Schwäche stillend.*" - Novalis 1798

Artyom Beilis via Boost

Jun 12, 2017, 10:55:09 AM
to bo...@lists.boost.org, Artyom Beilis
On Mon, Jun 12, 2017 at 3:57 PM, Peter Dimov via Boost
<bo...@lists.boost.org> wrote:
>> Documentation:
>> http://cppcms.com/files/nowide/html
>
>
> Question: what do the functions do (on Windows) when the passed UTF-8 is
> invalid? From a brief look at the documentation, I don't see this specified,
> but I may have missed it.
>

It is documented in the reference documentation:

http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c539754076b11e95c535be192

fopen/ifstream etc. would fail the usual way these functions fail -
they fail to open the file.

widen/narrow will return null or throw
boost::locale::conv::conversion_error, depending on the function.

Artyom
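
For reference, a small sketch of the failure mode described above, based on
the reference documentation quoted later in the thread ("If invalid UTF-8
given, NULL is returned and errno is set to EINVAL"); the header name is
taken from the library layout and should be verified against the docs:

    #include <boost/nowide/cstdio.hpp>
    #include <cerrno>
    #include <cstdio>

    int main()
    {
        // "\xFF" is not valid UTF-8, so on Windows the name cannot be
        // converted to UTF-16 and the open fails with errno == EINVAL.
        // On POSIX the bytes are passed through untouched and the call
        // behaves exactly like ::fopen.
        std::FILE *f = boost::nowide::fopen("\xFF-bad-name.txt", "r");
        if (!f) {
            std::perror("nowide::fopen");
            return 1;
        }
        std::fclose(f);
        return 0;
    }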

Peter Dimov via Boost

Jun 12, 2017, 10:57:31 AM
to bo...@lists.boost.org, Peter Dimov
degski wrote:

> Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
> as UB?"

Of course not. Why would one need to use the library then? It defeats the
whole purpose of it.

Artyom Beilis via Boost

Jun 12, 2017, 10:58:42 AM
to bo...@lists.boost.org, Artyom Beilis
On Mon, Jun 12, 2017 at 3:11 PM, Peter Dimov via Boost
<bo...@lists.boost.org> wrote:
> Artyom Beilis wrote:
>
>> Converting invalid UTF-16 to WTF-8 and other way around is not obvious
>> behavior and has potential of security risk especially for users that are
>> not aware of such an issue. So invalid UTF-8/16 sequences are rejected by
>> design.
>
>
> Implying that there are Windows file names (i.e. names containing invalid UTF-16) that can't be handled by the
> library?
>

Yes, indeed: if you use something like FindFileW and it returns
invalid UTF-16, you will get an error trying to convert it to UTF-8.

By definition you can't handle file names that can't be represented
in UTF-8, as no valid UTF-8 representation exists.

Artyom

Vadim Zeitlin via Boost

Jun 12, 2017, 11:06:05 AM
to bo...@lists.boost.org, Vadim Zeitlin
On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost <bo...@lists.boost.org> wrote:

AB> By definition: you can't handle file names that can't be represented
AB> in UTF-8 as there is no valid UTF-8 representation exist.

This is a nice principle to have in theory, but very unfortunate in
practice because at least under Unix systems such file names do occur in
the wild (maybe less often now than 10 years ago, when UTF-8 was less
ubiquitous, but it's still hard to believe that the problem has completely
disappeared). And there are ways to solve it, e.g. I think glib represents
such file names using special characters from a PUA and there are other
possible approaches, even if, admittedly, none of them is perfect.

While it is reasonable to restrict the file names created by the program
to valid UTF-8, not allowing certain existing files to be opened at all is,
IMHO, not a good idea and probably has security implications as well.

Regards,
VZ

Peter Dimov via Boost

Jun 12, 2017, 11:12:37 AM
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:

> > Question: what do the functions do (on Windows) when the passed UTF-8 is
> > invalid? From a brief look at the documentation, I don't see this
> > specified, but I may have missed it.
>
>
> It is documented in reference documentation.
>
> http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c539754076b11e95c535be192

Yes, I see now that it's documented for some of them, such as for instance fopen:

If invalid UTF-8 given, NULL is returned and errno is set to EINVAL

but not all, for instance getenv:

UTF-8 aware getenv. Returns 0 if the variable is not set.

or setenv:

UTF-8 aware setenv, key - the variable name, value is a new UTF-8
value,.

and it's not clear what happens when one feeds invalid UTF-8 to cout/cerr.

Another minor point is that the library assumes Cygwin is POSIX and tries
`using ::setenv;`, which is a compile error. For the purposes of Nowide,
Cygwin should be considered Windows.

Artyom Beilis via Boost

Jun 12, 2017, 1:39:28 PM
to bo...@lists.boost.org, Artyom Beilis
On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost
<bo...@lists.boost.org> wrote:
> On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost <bo...@lists.boost.org> wrote:
>
> AB> By definition: you can't handle file names that can't be represented
> AB> in UTF-8 as there is no valid UTF-8 representation exist.
>
> This is a nice principle to have in theory, but very unfortunate in
> practice because at least under Unix systems such file names do occur in
> the wild (maybe less often now than 10 years ago, when UTF-8 was less
> ubiquitous, but it's still hard to believe that the problem has completely
> disappeared). And there are ways to solve it, e.g. I think glib represents
> such file names using special characters from a PUA and there are other
> possible approaches, even if, admittedly, none of them is perfect.
>

Please note: under POSIX platforms no conversions are performed
and no UTF-8 validation is done, as doing so would be incorrect:

http://cppcms.com/files/nowide/html/index.html#qna

The only case is when the Windows wide API returns/creates
invalid UTF-16 - which can happen only when invalid UTF-16 surrogate
pairs are generated - and those have no valid UTF-8 representation.

On the other hand, deliberately creating invalid UTF-8 is a very problematic idea.

Regards,

Artyom Beilis via Boost

Jun 12, 2017, 1:45:54 PM
to bo...@lists.boost.org, Artyom Beilis
On Mon, Jun 12, 2017 at 6:03 PM, Peter Dimov via Boost
<bo...@lists.boost.org> wrote:
> Artyom Beilis wrote:
>
>> > Question: what do the functions do (on Windows) when the passed UTF-8 is
>> > invalid? From a brief look at the documentation, I don't see this
>> > specified, but I may have missed it.
>>
>>
>> It is documented in reference documentation.
>>
>>
>> http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c539754076b11e95c535be192
>
>
> Yes, I see now that it's documented on some, such as for instance fopen:
>
> If invalid UTF-8 given, NULL is returned and errno is set to EINVAL
>
> but not all, for instance getenv:
>
> UTF-8 aware getenv. Returns 0 if the variable is not set.
>
> or setenv:
>
> UTF-8 aware setenv, key - the variable name, value is a new UTF-8 value,.
>

I try to keep the interface similar to the standard one. Having getenv
change errno isn't something users expect.

> and it's not clear what happens when one feeds invalid UTF-8 to cout/cerr.

Yes, you are right, it should be clearer. In case of invalid UTF-8
the fail bit will be set.

>
> Another minor point is that the library assumes Cygwin is POSIX and tries
> `using ::setenv;`, which is a compile error. For the purposes of Nowide,
> Cygwin should be considered Windows

Not exactly.

Since Cygwin 1.7 it uses UTF-8 by default, so all operations like fopen
cooperate properly with the Windows API.
So it is correct to consider Cygwin as POSIX and not Windows.

Regarding setenv/getenv specifically, and how (and whether) it handles
narrow/wide conversions, I indeed need to check.

Thanks,
Artyom

Peter Dimov via Boost

Jun 12, 2017, 1:53:14 PM
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:
> > Implying that there are Windows file names (that their names contain
> > invalid UTF-16) that can't be handled by the library?
>
> Yes indeed if you will try to use stuff like FindFileW and it returns
> invalid UTF-16 you will get an error trying to convert it to UTF-8.

That's what I meant, already existing files that can't be manipulated by the
library.

Incidentally, the mention of FindFileW leads me to my next question: by
looking at my own collection of functions that basically mirror what Nowide
does (everyone has his own Nowide, I suppose, as this is the only sane way
to program), I see 'stat' and a 'readdir' equivalent.

Have you considered providing these? One could go the filesystem route, of
course, but sometimes one may wish to minimize dependencies.

> By definition: you can't handle file names that can't be represented in
> UTF-8 as there is no valid UTF-8 representation exist.

Well yes in principle, but the file already exists. Although one can well
argue that this is no different from putting ':' in the name or something
like that.

Yakov Galka via Boost

Jun 12, 2017, 2:08:04 PM
to bo...@lists.boost.org, Yakov Galka
On Mon, Jun 12, 2017 at 8:39 PM, Artyom Beilis via Boost <
bo...@lists.boost.org> wrote:

> Please note: Under POSIX platforms no conversions are performed
> and no UTF-8 validation is done as this is incorrect:
>
> http://cppcms.com/files/nowide/html/index.html#qna
>
> The only case is when Windows Wide API returns/creates
> invalid UTF-16 - which can happen only when invalid surrogate
> UTF-16 pairs are generated - and they have no valid UTF-8
> representation.
>
> On the other hand creating deliberately invalid UTF-8 is very problematic
> idea.
>

That's not convincing. Why does Windows have to be different in this regard?
In both cases, Windows or POSIX, valid input gives valid UTF-8, and invalid
input can give 'invalid' UTF-8. I expect the library to be as transparent
a layer as possible and not to interfere with such issues.

--
Yakov Galka
http://stannum.co.il/

Yakov Galka via Boost

Jun 12, 2017, 2:12:52 PM
to bo...@lists.boost.org, Yakov Galka
On Mon, Jun 12, 2017 at 12:55 PM, Groke, Paul via Boost <
bo...@lists.boost.org> wrote:

> Supporting modified UTF-8 or WTF-8 adds overhead on systems where the
> native OS API accepts UTF-8, but only strictly valid UTF-8.
> When some UTF-8 enabled function of the library is called on such a
> system, it would have to check for WTF-8 encoded surrogates and
> convert them to "true" UTF-8 before passing the string to the OS API.
> Because you would expect and want the "normal" UTF-8 encoding for
> a string to refer to the same file as the WTF-8 encoding of the same
> string.
>

That's the point: if the string is at all representable in UTF-8, then its
WTF-8 representation is already a valid UTF-8 representation and no
conversion has to be done. Thus you don't even have to check anything at
all. This is how WTF-8 is 'more compatible' with UTF-8 than Modified UTF-8
is.

By analogy, you don't need to do special checks if you want to pass a UTF-8
string to an ASCII only API, because UTF-8 is a strict superset.

--
Yakov Galka
http://stannum.co.il/

Groke, Paul via Boost

Jun 12, 2017, 2:14:45 PM
to bo...@lists.boost.org, Groke, Paul
Artyom Beilis wrote:
> On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost
> <bo...@lists.boost.org> wrote:
> > On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost
> <bo...@lists.boost.org> wrote:
> >
> > AB> By definition: you can't handle file names that can't be
> > AB> represented in UTF-8 as there is no valid UTF-8 representation exist.
> >
> > This is a nice principle to have in theory, but very unfortunate in
> > practice because at least under Unix systems such file names do occur
> > in the wild (maybe less often now than 10 years ago, when UTF-8 was
> > less ubiquitous, but it's still hard to believe that the problem has
> > completely disappeared). And there are ways to solve it, e.g. I think
> > glib represents such file names using special characters from a PUA
> > and there are other possible approaches, even if, admittedly, none of
> them is perfect.
> >
>
> Please note: Under POSIX platforms no conversions are performed and no
> UTF-8 validation is done as this is incorrect:
>
> http://cppcms.com/files/nowide/html/index.html#qna

Well... what's correct on POSIX platforms is a matter of opinion. If you go with the strict interpretation, then in fact conversion from the current locale to UTF-8 must be considered incorrect. But then you cannot rely on *anything*, except that 0x00 is NUL and 0x2F is the path separator, which makes any kind of isdigit/toupper/tolower/... string parsing/processing "incorrect".

> The only case is when Windows Wide API returns/creates invalid UTF-16 -
> which can happen only when invalid surrogate
> UTF-16 pairs are generated - and they have no valid UTF-8 representation.
>
> On the other hand creating deliberately invalid UTF-8 is very problematic
> idea.

Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) will always be valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16 bit strings? If it's documented that way it shouldn't matter. Especially since users of the library cannot rely on the strings being in UTF-8 anyway, at least not in portable applications.

I agree that the over-long zero/NUL encoding part of modified UTF-8 might still be problematic though, and therefore WTF-8 might be the better choice. Now that still leaves some files that can theoretically exist on a Windows system inaccessible (i.e. those with embedded NUL characters), but those are not accessible via the "usual" Windows APIs either (CreateFileW etc.). So this should be acceptable.

Peter Dimov via Boost

Jun 12, 2017, 2:31:15 PM
to bo...@lists.boost.org, Peter Dimov
Yakov Galka wrote:

> That's not convincing. Why Windows has to be different in this regard?

Because it is. The file name on Windows physically is a sequence of
uint16_t, whereas on Linux, it's a sequence of uint8_t. On Linux, any
null-terminated byte sequence results in a perfect roundtrip, as it's stored
unmodified, it's literally the exact same byte sequence as the one you gave.
If the library tried to mess with the file name in any way, this would
result in nothing but problems.

On Windows, it is simply not possible to have the exact same byte sequence
stored as the file name. The file system does not use byte sequences, it
uses uint16_t sequences.

(Mac OS X also stores uint16_t by the way, and the functions require UTF-8.
So it's not quite POSIX vs Windows.)
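
A tiny illustration of that asymmetry (U+1F600 is the pair D83D DE00 in
UTF-16 and the bytes F0 9F 98 80 in UTF-8):

    #ifdef _WIN32
    #include <windows.h>
    void open_example()
    {
        // File names reach the Windows kernel as 16-bit code units, so any
        // narrow string must be converted before this call.
        HANDLE h = CreateFileW(L"\xD83D\xDE00.txt", GENERIC_READ, 0, 0,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
        if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
    }
    #else
    #include <fcntl.h>
    #include <unistd.h>
    void open_example()
    {
        // File names reach the POSIX kernel as raw bytes; only '\0' and '/'
        // are special, so any byte sequence round-trips exactly.
        int fd = open("\xF0\x9F\x98\x80.txt", O_RDONLY);
        if (fd != -1) close(fd);
    }
    #endif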

Zach Laine via Boost

Jun 12, 2017, 2:35:33 PM
to bo...@lists.boost.org, Zach Laine
On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost <
bo...@lists.boost.org> wrote:

> Since the UTF-8 conversion is only done on/for Windows, and Windows
> doesn't guarantee that all wchar_t paths (or strings in general) will
> always be valid UTF-16, wouldn't it make more sense to just *define* that
> the library always uses WTF-8, which allows round-tripping of all possible
> 16 bit strings? If it's documented that way it shouldn't matter. Especially
> since users of the library cannot rely on the strings being in UTF-8
> anyway, at least not in portable applications.
>

I agree that round-tripping to wchar_t paths on Windows is very important.
I also agree that not detecting invalid UTF-8, or failing to produce an
error in such a case, is very important to *avoid*.

Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in
user-selectable string processing cases?

Zach

Yakov Galka via Boost

Jun 12, 2017, 2:36:52 PM
to bo...@lists.boost.org, Yakov Galka, Peter Dimov
On Mon, Jun 12, 2017 at 9:30 PM, Peter Dimov via Boost <
bo...@lists.boost.org> wrote:

> Yakov Galka wrote:
>
> That's not convincing. Why Windows has to be different in this regard?
>>
>
> Because it is. The file name on Windows physically is a sequence of
> uint16_t, whereas on Linux, it's a sequence of uint8_t. On Linux, any
> null-terminated byte sequence results in a perfect roundtrip, as it's
> stored unmodified, it's literally the exact same byte sequence as the one
> you gave. If the library tried to mess with the file name in any way, this
> would result in nothing but problems.
>
> On Windows, it is simply not possible to have the exact same byte sequence
> stored as the file name. The file system does not use byte sequences, it
> uses uint16_t sequences.
>

Yeah, so? I say that the library can provide a Windows -> std::string ->
Windows roundtrip just as it does with any other platform. If FreeBSD ->
std::string conversion can return invalid UTF-8, then so does Windows ->
std::string conversion.

--
Yakov Galka
http://stannum.co.il/

Peter Dimov via Boost

Jun 12, 2017, 2:52:11 PM
to bo...@lists.boost.org, Peter Dimov
Yakov Galka wrote:

> Yeah, so? I say that the library can provide a Windows -> std::string ->
> Windows roundtrip just as it does with any other platform. If FreeBSD ->
> std::string conversion can return invalid UTF-8, then so does Windows ->
> std::string conversion.

The security concern here is that under FreeBSD the file name is what it is,
and different byte sequences refer to different names, whereas under Windows
if invalid UTF-8 is allowed many different byte sequences may map to the
same file name.

This does not necessarily preclude handling free surrogate pairs though. In
practice the main problem is probably with overlong encoding of '.', '/' and
'\'.
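
A concrete example of why the overlong forms are the dangerous part
(illustration only):

    // Both of these decode to "../" if a decoder wrongly accepts overlong
    // sequences, but only the first contains the literal bytes '.' '.' '/'
    // that a naive path-traversal filter would look for.
    unsigned char canonical[] = { 0x2E, 0x2E, 0x2F };                    // "../"
    unsigned char overlong[]  = { 0xC0, 0xAE, 0xC0, 0xAE, 0xC0, 0xAF };  // overlong "../": invalid UTF-8
    // Rejecting overlong forms, as strict UTF-8 (and WTF-8) requires, keeps
    // the mapping from byte sequences to file names one-to-one.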

Last time this came up I argued that if you rely on finding '.' as the
literal 8 bit '.' your input validation is wrong anyway, but requiring
strictly valid UTF-8 is a reasonable first line of defense.

And realistically, if you want to validate the input in the correct manner,
you have to #ifdef for each OS anyway, which kind of makes the library
redundant. So in the specific use case where you _do_ use the library to
avoid #ifdef'ing, it does make sense for it to protect you from invalid
UTF-8 on Windows.

With all that said, I don't quite see the concern with WTF-8. What's the
attack we're defending from by disallowing it?

Yakov Galka via Boost

Jun 12, 2017, 3:02:58 PM
to bo...@lists.boost.org, Yakov Galka, Peter Dimov
On Mon, Jun 12, 2017 at 9:51 PM, Peter Dimov via Boost <
bo...@lists.boost.org> wrote:

> ... whereas under Windows if invalid UTF-8 is allowed many different byte
> sequences may map to the same file name.
>

This is a false presumption. Nobody here proposes allowing absolutely ANY
byte sequences, only using WTF-8 as a means of guaranteeing a round-trip.
And as far as WTF-8 goes, there is a unique representation for every 16-bit
code unit sequence.


> With all that said, I don't quite see the concern with WTF-8. What's the
> attack we're defending from by disallowing it?
>

There are some concerns with WTF-8, specifically if you concatenate two
WTF-8 strings where one ends in an unpaired surrogate whereas the other
begins with one, then the result is an invalid WTF-8 string. Filenames are
usually parsed and concatenated on ASCII separators, so I don't see a
problem in the typical use-case. As for the non-typical use cases, I would
argue that they are beyond the responsibility of this library.

--
Yakov Galka
http://stannum.co.il/

Groke, Paul via Boost

Jun 12, 2017, 3:07:02 PM
to bo...@lists.boost.org, Groke, Paul
Zach Laine wrote:
> On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost <
> bo...@lists.boost.org> wrote:
>
> > Since the UTF-8 conversion is only done on/for Windows, and Windows
> > doesn't guarantee that all wchar_t paths (or strings in general) will
> > always be valid UTF-16, wouldn't it make more sense to just *define*
> > that the library always uses WTF-8, which allows round-tripping of all
> > possible
> > 16 bit strings? If it's documented that way it shouldn't matter.
> > Especially since users of the library cannot rely on the strings being
> > in UTF-8 anyway, at least not in portable applications.
> >
>
> I agree that round-tripping to wchar_t paths on Windows is very important.
> I also agree that not detecting invalid UTF-8, or failing to produce an error in
> such a case, is very important to *avoid*.
>
> Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in
> user-selectable string processing cases?

Well, I don't see why detecting invalid UTF-8 would be important. In my initial mail I (wrongly) assumed that the library would be translating between the native encoding and UTF-8 also on e.g. Linux (or non-Windows platforms in general). But since this isn't so, I guess the library can simply pass-through strings on all platforms that have narrow APIs. In fact I think it should, since checking for valid UTF-8 would make some files inaccessible on systems like Linux, where you can very easily create file names that aren't valid UTF-8.

In that case the terms "Unicode" and "UTF-8" should not be used in describing the library (name and documentation). The documentation should just say that it transforms strings to some unspecified encoding with the following properties:
- Byte-based
- ASCII compatible (*)
- Self-synchronizing
- Able to 100% round-trip all characters/NUL-terminated strings from the native platform encoding
And for Windows this unspecified encoding then would just happen to be WTF-8.
(* For POSIX systems one cannot even 100% rely on that... so maybe using an even wider set of constraints would be good.)

On platforms like OS X, the API-wrappers of Boost.Nowide would then simply produce the same result as the native APIs would -- because the narrow string would simply be passed through unmodified. If it's invalid UTF-8, and the OS decides to fail the call because of the invalid UTF-8, then so be it - using the library wouldn't change anything.

Almost the same for Windows: the library would simply round-trip invalid UTF-16 to the same invalid UTF-16. If the native Windows API decides to fail the call because of that, OK, then it just fails -- it would also have failed in a wchar_t application.

But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)

Artyom Beilis via Boost

Jun 12, 2017, 3:37:44 PM
to bo...@lists.boost.org, Artyom Beilis
> But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)
>


OK, I want to summarize my response regarding WTF-8, especially from a
security point of view.

There are lots of dangers in this assumption - generating WTF-8 as if it
were UTF-8 can lead to serious issues.

It is almost never OK to generate invalid UTF-8, especially since in all
likelihood 99% of users will not understand what we are talking about with
round-trips/WTF-8 and so on - and I'm speaking from experience.

Just to make clear how serious it can be, this is a CVE regarding one
bug in Boost.Locale:

http://people.canonical.com/~ubuntu-security/cve/2013/CVE-2013-0252.html

Let's show a trivial example of what WTF-8 can lead to.

Tool:

- A user-monitoring system that monitors all files and reports all
changes via XML to some central server

Denial of Service attack example:

- The user creates a file with invalid UTF-16
- The system monitors the file system and adds it to the XML report in
WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- The user does whatever he wants without monitoring
- The user removes the file
- No reports were generated during the period the user needed - a DoS attack

Bottom line:

(a) Since 99% of programmers are barely aware of various Unicode
issues, it is dangerous to assume that providing such a round trip in a
trivial way is OK.

(b) If you need to do complex manipulations of the file system, you may
consider Boost.Filesystem, which keeps the internal representation in
native format and converts to UTF-8 for other uses (Nowide provides
integration with Boost.Filesystem).

(c) It is possible to add WTF-8 support under the following restrictions:
(c.1) only by adding wtf8_to_wide and wide_to_wtf8 - if the user wants it
explicitly(!)
(c.2) no "C/C++"-like API should accept it - i.e.
nowide::fopen/nowide::fstream must not do it.


Artyom Beilis

Zach Laine via Boost

Jun 12, 2017, 4:39:58 PM
to bo...@lists.boost.org, Zach Laine
On Mon, Jun 12, 2017 at 2:37 PM, Artyom Beilis via Boost <
bo...@lists.boost.org> wrote:

> > But maybe I missed something here. If there really is a good reason for
> enforcing valid UTF-8 in some situation, please let me know :)
> >
> [snip]

Bottom line:
>
> [snip]


To add a bottom line under that, it seems that nowide does not support
round-tripping from arbitrary Windows path names to UTF-8 and back. Is
that correct?

Zach

Peter Dimov via Boost

Jun 12, 2017, 5:06:13 PM
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:
> Deny of Service Attack Example:
>
> - User creates a file with invalid UTF-16
> - System monitors the file system and adds it to the XML report in
> WTF-8 format
> - The central server does not accept the XML since it fails UTF-8
> validation
> - User does whatever he wants without monitoring
> - It removes the file
> - There were no reports generated during the period user needed -DOS
> attack

I can't help but note that the same attack would work under Unix. The user
can easily create a file with an invalid UTF-8 name. And, since the library
doesn't enforce valid UTF-8 on POSIX (right?) it would pass through.

Niall Douglas via Boost

Jun 12, 2017, 6:52:54 PM
to bo...@lists.boost.org, Niall Douglas
On 12/06/2017 22:05, Peter Dimov via Boost wrote:
> Artyom Beilis wrote:
>> Deny of Service Attack Example:
>>
>> - User creates a file with invalid UTF-16
>> - System monitors the file system and adds it to the XML report in
>> WTF-8 format
>> - The central server does not accept the XML since it fails UTF-8
>> validation
>> - User does whatever he wants without monitoring
>> - It removes the file
>> - There were no reports generated during the period user needed
>> -DOS attack
>
> I can't help but note that the same attack would work under Unix. The
> user can easily create a file with an invalid UTF-8 name. And, since the
> library doesn't enforce valid UTF-8 on POSIX (right?) it would pass
> through.

Also, running sanitisation on UTF sequences is laudable and all, but
what about the 80-90% of code which will never, ever feed strings from
untrusted remote parties to system APIs? Do they pay for the UTF
sequence validation too?

Like Peter and many of us, I have a personal "nowide" library too. It
simply wraps the *W() Win32 APIs consuming 16 bit characters with a call
to the NT kernel function for converting UTF-8 to UTF-16 which is
RtlUTF8ToUnicodeN().

It does nothing else fancy, it leaves it entirely up to the Windows
kernel what that conversion means. I'd suggest that the nowide library
ought to keep things similarly simple.
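
For the curious, a sketch of roughly what such a thin wrapper can look like
in user mode; the RtlUTF8ToUnicodeN() prototype follows the MSDN page linked
later in this thread, and loading it via GetProcAddress is just one way to
get at it (linking against ntdll also works):

    #include <windows.h>
    #include <string>

    typedef LONG (NTAPI *RtlUTF8ToUnicodeN_t)(PWSTR dest, ULONG dest_bytes,
                                              PULONG written_bytes,
                                              const char *src, ULONG src_bytes);

    std::wstring utf8_to_wide(const std::string &u8)
    {
        static RtlUTF8ToUnicodeN_t rtl = (RtlUTF8ToUnicodeN_t)
            GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "RtlUTF8ToUnicodeN");
        if (!rtl || u8.empty())
            return std::wstring();

        // A UTF-16 result never has more code units than the UTF-8 input has
        // bytes, so one wchar_t per input byte is always a big enough buffer.
        std::wstring out(u8.size(), L'\0');
        ULONG written = 0;
        rtl(&out[0], (ULONG)(out.size() * sizeof(wchar_t)), &written,
            u8.c_str(), (ULONG)u8.size());
        out.resize(written / sizeof(wchar_t));
        return out;       // error handling of the NTSTATUS result elided
    }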

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Artyom Beilis via Boost

Jun 12, 2017, 11:49:17 PM
to bo...@lists.boost.org, Artyom Beilis
On Tue, Jun 13, 2017 at 12:05 AM, Peter Dimov via Boost
<bo...@lists.boost.org> wrote:
> Artyom Beilis wrote:
>>
>> Deny of Service Attack Example:
>>
>> - User creates a file with invalid UTF-16
>> - System monitors the file system and adds it to the XML report in
>> WTF-8 format
>> - The central server does not accept the XML since it fails UTF-8
>> validation
>> - User does whatever he wants without monitoring
>> - It removes the file
>> - There were no reports generated during the period user needed -DOS
>> attack
>
>
> I can't help but note that the same attack would work under Unix. The user
> can easily create a file with an invalid UTF-8 name. And, since the library
> doesn't enforce valid UTF-8 on POSIX (right?) it would pass through.
>
>

Note: under POSIX the user takes strings as-is and can't trust the source.
Under Windows he needs to convert them using Nowide, which can give him
the false assumption that he receives valid UTF-8.

Once again, I have no problem providing wtf8-to-wide (and the other way
around) functions when the user EXPLICITLY asks for it.
But it shall not be the default behavior, or some behavior you turn on with
some global define.


Artyom

degski via Boost

Jun 13, 2017, 2:20:45 AM
to boost, degski
On 12 June 2017 at 17:57, Peter Dimov via Boost <bo...@lists.boost.org>
wrote:

> degski wrote:
>
>> Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
>> as UB?"
>
> Of course not. Why would one need to use the library then? It defeats the
> whole purpose of it.


From WP (read up on it now): "RFC 3629 states "Implementations of the
decoding algorithm MUST protect against decoding invalid sequences."
*The Unicode Standard* requires decoders to "...treat any ill-formed
code unit sequence as an error condition. This guarantees that it will
neither interpret nor emit an ill-formed code unit sequence.""

So not UB then, but it should not pass either.

Are we talking FAT32 or NTFS? What Windows versions are affected? I also
think, as some posters below (and in another thread) state, that Windows
should not be treated differently. A new Boost library should not
accommodate bad/sloppy historic Windows quirks. The library *can* require
that the system it is used on and its users adhere to the standard.

Then WP on Overlong encodings: "The standard specifies that the correct
encoding of a code point use only the minimum number of bytes required to
hold the significant bits of the code point. Longer encodings are called
*overlong* and are not valid UTF-8 representations of the code point. This
rule maintains a one-to-one correspondence between code points and their
valid encodings, so that there is a unique valid encoding for each code
point."

The key being: "... are not valid UTF-8 representations ...", i.e. we're
back to the case above.

degski

WP: https://en.wikipedia.org/wiki/UTF-8


--
"*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend,
Schmerzen aus Schwäche stillend.*" - Novalis 1798


Groke, Paul via Boost

Jun 13, 2017, 5:31:47 AM
to bo...@lists.boost.org, Groke, Paul
Artyom Beilis wrote:
> On Tue, Jun 13, 2017 at 12:05 AM, Peter Dimov via Boost
> <bo...@lists.boost.org> wrote:
>> Artyom Beilis wrote:
>>>
>>> - User creates a file with invalid UTF-16
>>> - System monitors the file system and adds it to the XML report
>>> in
>>> WTF-8 format
>>> - The central server does not accept the XML since it fails UTF-8
>>> validation
>>> - User does whatever he wants without monitoring
>>> - It removes the file
>>> - There were no reports generated during the period user needed
>>> -DOS attack
>>
>> I can't help but note that the same attack would work under Unix. The
>> user can easily create a file with an invalid UTF-8 name. And, since
>> the library doesn't enforce valid UTF-8 on POSIX (right?) it would pass
>> through.
>
> Note, under POSIX user takes strings as is and can't trust the source.
> Under Windows it need to convert them using nowide which can give him
> false assumption that it receives valid UTF-8.

OK, thanks for explaining, I understand your concern now. I don't agree that a library like Nowide should necessarily try to address that kind of issue though.

Regarding your example... isn't the whole point of Nowide to make it possible to write portable applications without #ifdef-ing stuff? So ... for this to create a problem on Windows (and at the same time not create a problem on POSIX), a bunch of things would have to coincide:
- Programmer didn't read the manual and therefore assumes he gets UTF-8 on Windows
- Application then wants to write some cmdline argument/path/... into some document that requires valid UTF-8
- The software component that builds/writes that document fails to validate the UTF-8 itself
- Programmer decides to skip the UTF-8 validation on Windows, because he assumes it has already been done
Assuming that the application doesn't need an #ifdef _WIN32 in that part of the code anyway, that would mean that the programmer deliberately introduced an #ifdef _WIN32, just to selectively skip some validation on Windows, because he thinks that it might already have been done.

No, I wouldn't say that's something that should be driving the design decisions of a library like Nowide.

Additionally I don't see what validating the UTF-8 would improve here. In that case the user would still create a file with invalid UTF-16. Only then the application would even fail to enumerate the directory or convert the path. And I'm not convinced that kind of error is more likely to be handled correctly.

> Once again I have no problem providing wtf8 to wide and other way around
> functions when user EXPLICITLY says it.
> But it shell not be default behavior or some behavior you turn on with some
> global define.

The whole appeal of a library like Nowide is that it works more or less transparently - i.e. without having to write conversion calls all over the place. And duplicating all APIs is something that I would deem a very bad choice - that would be confusing as hell.

Peter Dimov via Boost

Jun 13, 2017, 11:09:49 AM
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:

> Note, under POSIX user takes strings as is and can't trust the source.
> Under Windows it need to convert them using nowide which can give him
> false assumption that it receives valid UTF-8.

What use case are we discussing here?

Is it the one where the user gets a path from an external source and passes
it to Nowide? In which case Nowide does no validation on POSIX and strict
UTF-8 validation on Windows? Giving the Windows user the false assumption
that Nowide validates, which turns out to not happen in a later POSIX port.

Or is it the one where the user receives a path from Nowide? Which currently
doesn't occur anywhere, because the library does not return paths, it only
takes them as arguments. (It would occur if you add `readdir`.)

> Once again I have no problem providing wtf8 to wide and other way around
> functions when user EXPLICITLY says it.

What would portable user code look like in this scenario?

Artyom Beilis via Boost

Jun 13, 2017, 11:43:11 AM
to bo...@lists.boost.org, Artyom Beilis
On Tue, Jun 13, 2017 at 6:08 PM, Peter Dimov via Boost
<bo...@lists.boost.org> wrote:
> Artyom Beilis wrote:
>
>> Note, under POSIX user takes strings as is and can't trust the source.
>> Under Windows it need to convert them using nowide which can give him
>> false assumption that it receives valid UTF-8.
>
>
> What use case are we discussing here?
>
> Is it the one where the user gets a path from an external source and passes
> it to Nowide? In which case Nowide does no validation on POSIX and strict
> UTF-8 validation on Windows? Giving the Windows user the false assumption
> that Nowide validates, which turns out to not happen in a later POSIX port.
>
> Or is it the one where the user receives a path from Nowide? Which currently
> doesn't occur anywhere, because the library does not return paths, it only
> takes them as arguments. (It would occur if you add `readdir`.)
>

For that purpose you have Boost.Filesystem, which internally keeps wide
strings on Windows and narrow strings on POSIX platforms.

Nowide does not try to replace Boost.Filesystem. However, once you try to get
a std::string from Boost.Filesystem (using the Nowide integration), you will
get either an error or a valid UTF-8 string.

>> Once again I have no problem providing wtf8 to wide and other way around
>> functions when user EXPLICITLY says it.
>
>
> How would portable user code look like in this scenario?
>

Nowide does not try to provide a solution for every possible portable API.
But it provides a toolset to make such portable solutions easier.

However, if you convert some wide string from some Windows API, you
will expect to get valid UTF-8 and not WTF-8.

Artyom

Peter Dimov via Boost

Jun 13, 2017, 11:50:38 AM
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:
> On Tue, Jun 13, 2017 at 6:08 PM, Peter Dimov via Boost
> <bo...@lists.boost.org> wrote:
> > Artyom Beilis wrote:
> >
> >> Note, under POSIX user takes strings as is and can't trust the source.
> >> Under Windows it need to convert them using nowide which can give him
> >> false assumption that it receives valid UTF-8.
> >
> >
> > What use case are we discussing here?
> >
> > Is it the one where the user gets a path from an external source and
> > passes it to Nowide? In which case Nowide does no validation on POSIX
> > and strict UTF-8 validation on Windows? Giving the Windows user the
> > false assumption that Nowide validates, which turns out to not happen in
> > a later POSIX port.
> >
> > Or is it the one where the user receives a path from Nowide? Which
> > currently doesn't occur anywhere, because the library does not return
> > paths, it only takes them as arguments. (It would occur if you add
> > `readdir`.)
> >
>
> For that purpose you have Boost.Filesystem that internally keeps wide
> strings on Windows and narrow on POSIX platforms.
>
> Nowide does not try to replace Boost.Filestem. However once you try to get
> an std::string from Boost.Filesystem (using nowide integration) you will
> get either error or valid UTF-8 string.

This doesn't answer my question.

Let me rephrase.

The user gets a path from an external source and passes it to Nowide, for
example, to nowide::fopen.

Nowide does no validation on POSIX and strict UTF-8 validation on Windows.

Why is the Windows-only strict validation a good thing?

What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY
under Windows, and passing everything through unvalidated on POSIX?

If the program originates on Windows and as a result comes to rely on
Nowide's strict validation, and is later ported to POSIX, are not the users
left with a false sense of security?

Peter Dimov via Boost

Jun 13, 2017, 1:34:38 PM
to bo...@lists.boost.org, Peter Dimov
> Another minor point is that the library assumes Cygwin is POSIX and tries
> `using ::setenv;`, which is a compile error.

I fixed this by changing cenv.hpp as follows:

#if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
#include <boost/nowide/windows.hpp>
#endif

namespace boost {
namespace nowide {
#if !defined(BOOST_WINDOWS) && !defined(__CYGWIN__) && \
    !defined(BOOST_NOWIDE_DOXYGEN)

and now everything compiles under Cygwin. I get the following two test
failures:

testing.capture-output
..\..\..\bin.v2\libs\nowide\test\test_system_n.test\gcc-cxx03-6.3.0\debug\test_system_n.run
====== BEGIN OUTPUT ======
Failed Error boost::nowide::getenv("BOOST_NOWIDE_TEST") in
test_system.cpp:25 main
Failed Error boost::nowide::system(command.c_str()) == 0 in
test_system.cpp:75 main

EXIT STATUS: 1
====== END OUTPUT ======

Artyom Beilis via Boost

Jun 13, 2017, 3:49:49 PM
to bo...@lists.boost.org, Artyom Beilis
On Tue, Jun 13, 2017 at 8:34 PM, Peter Dimov via Boost
<bo...@lists.boost.org> wrote:
>> Another minor point is that the library assumes Cygwin is POSIX and tries
>> `using ::setenv;`, which is a compile error.
>
>
> I fixed this by changing cenv.hpp as follows:
>
> #if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
> #include <boost/nowide/windows.hpp>
> #endif
>
> namespace boost {
> namespace nowide {
> #if !defined(BOOST_WINDOWS) && !defined(__CYGWIN__) &&
> !defined(BOOST_NOWIDE_DOXYGEN)
>
> and now everything compiles under Cygwin. I get the following two test
> failures:
>
> testing.capture-output
> ..\..\..\bin.v2\libs\nowide\test\test_system_n.test\gcc-cxx03-6.3.0\debug\test_system_n.run
> ====== BEGIN OUTPUT ======
> Failed Error boost::nowide::getenv("BOOST_NOWIDE_TEST") in
> test_system.cpp:25 main
> Failed Error boost::nowide::system(command.c_str()) == 0 in
> test_system.cpp:75 main
>
> EXIT STATUS: 1
> ====== END OUTPUT ======
>


OK, I'll test Cygwin (I haven't worked with it for a while) and make sure
it works on this platform. Also I need to check how the
native Cygwin getenv/setenv behaves with regard to Unicode.

Artyom

Artyom Beilis via Boost

Jun 13, 2017, 4:22:21 PM
to bo...@lists.boost.org, Artyom Beilis
>
> Let me rephrase.
>
> The user gets a path from an external source and passes it to Nowide, for
> example, to nowide::fopen.
>
> Nowide does no validation on POSIX and strict UTF-8 validation on Windows.
>
> Why is the Windows-only strict validation a good thing?
>
> What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY
> under Windows, and passing everything through unvalidated on POSIX?

Ok...

This is a very good question.

On Windows I need to convert, for obvious reasons.

Now the question is what I accept as valid, what is not valid, and
where do I draw the line.
---------------------------------------------------------------------------------------------------------------------------

Now, as you have seen, there are many possible "non-standard" UTF-8 variants.

What should I accept?

Should I accept CESU-8 (non-BMP characters encoded as pairs of 3-byte
"UTF-8"-like sequences) as well as UTF-8?

Should I accept WTF-8 with only non-paired surrogates? What if these
surrogates can be combined into correct UTF-16 - is that valid?
(In fact, concatenation of WTF-8 strings isn't a trivial operation, as
a simple string + string does not work and leads to invalid WTF-8.)

Should I accept modified UTF-8 - it was already asked - i.e. values not
encoded with the shortest sequence? For example, should I accept "x"
encoded in two bytes? What about "."?
How should I treat stuff like "\xFF.txt" <- invalid UTF-8? Should I
convert it to L"\xFF.txt"? Should I convert it to some sort of WTF-16
to preserve the string?

Now, despite what it may look like from the discussion, WTF-8 is far from
being a "standard" for representing invalid UTF-16.
Some may substitute it with "?", others with U+FFFD, others just remove
the offending unit. I actually tested some cases with real "bad"
file names on different systems and each did something different.

I don't think WTF-8 is some widely used industry standard; it is
just one of many variants of extending UTF-8.

But there MUST be a clear line between what is accepted and what is not,
and the safest and most standard line to draw is well-defined
UTF-8 and UTF-16, which are (a) 100% convertible one from the other and (b)
widely used, accepted standards.

So that was my decision - based on safety and standards (and there is
no such thing as non-strict UTF-8/16).

Does it fit everybody? Of course not!
Is there some line that fits everybody? There is no such thing!
Does it fit the common case for the vast majority of users/developers? IMHO yes.

So as a policy I decided to use UTF-8 and UTF-16, as a selection of the
encoding for each side of widen/narrow is required.


>
> If the program originates on Windows and as a result comes to rely on
> Nowide's strict validation, and is later ported to POSIX, are not the users
> left with a false sense of security?
>

You have a valid point.

But adding validation on POSIX systems in general would be wrong (as I
noted in the Q&A),
because there is no single encoding for a POSIX OS - it is a runtime
parameter - unlike the Windows API, which has a
wide UTF-16 API. So such a validation on Linux/POSIX would likely cause
more issues than it solves.

Thanks,
Artyom

Niall Douglas via Boost

Jun 13, 2017, 4:33:11 PM
to bo...@lists.boost.org, Niall Douglas
>> Why is the Windows-only strict validation a good thing?
>>
>> What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY
>> under Windows, and passing everything through unvalidated on POSIX?
>
> Ok...
>
> This is a very good question.
>
> On windows I need to convert for obvious reason.
>
> Now the question is what I accept as valid and what is not valid and
> where do I draw the line.
> ---------------------------------------------------------------------------------------------------------------------------
>
> Now as you have seen there are many possible "non-standard" UTF-8 variants.
>
> What should I accept?

I still strongly suggest you simply call RtlUTF8ToUnicodeN()
(https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx)
to do the UTF-8 conversion. Do **nothing** else.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/


Zach Laine via Boost

Jun 13, 2017, 4:52:25 PM
to bo...@lists.boost.org, Zach Laine
On Tue, Jun 13, 2017 at 3:33 PM, Niall Douglas via Boost <
bo...@lists.boost.org> wrote:

> > Now as you have seen there are many possible "non-standard" UTF-8
> variants.
> >
> > What should I accept?
>
> I still strongly suggest you simply call RtlUTF8ToUnicodeN()
> (https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx)
> to do the UTF-8 conversion. Do **nothing** else.


Niall, could you explain why? I don't know any of the Windows-relevant
details.

Zach

Niall Douglas via Boost

unread,
Jun 13, 2017, 4:59:26 PM6/13/17
to bo...@lists.boost.org, Niall Douglas
> > Now as you have seen there are many possible "non-standard" UTF-8 variants.
> >
> > What should I accept?
>
> I still strongly suggest you simply call RtlUTF8ToUnicodeN()
> (https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx
> <https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx>)
> to do the UTF-8 conversion. Do **nothing** else.
>
>
> Niall, could you explain why? I don't know any of the Windows-relevant
> details.

1. RtlUTF8ToUnicodeN() is what the NT kernel uses and isn't polluted by
Win32.

2. RtlUTF8ToUnicodeN() has a well designed API unlike the awful Win32
MultiByteToWideChar() function.

3. RtlUTF8ToUnicodeN() is close to as fast as any implementation of the
same thing, unlike MultiByteToWideChar() and some STL implementations of
<codecvt>.

4. RtlUTF8ToUnicodeN() treats invalid input in the way that the rest of
the NT kernel is built to expect. I caveat this by saying that Win32
functions can mangle input in a really unhelpful way; I would only trust
RtlUTF8ToUnicodeN() output being fed directly to NT kernel APIs. That, I
know, works well.

5. I've been using RtlUTF8ToUnicodeN() in my own code for years and have
found it unsurprising and unproblematic, unlike MultiByteToWideChar() or
C++11's UTF-8 support, which doesn't quite work right in some standard
library implementations.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/


Peter Dimov via Boost

unread,
Jun 13, 2017, 7:34:04 PM6/13/17
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:

> Now as you have seen there are many possible "non-standard" UTF-8
> variants.
>
> What should I accept?

Others have made their case for WTF-8, which has the desirable property of
only allowing one encoding for any uint16_t sequence (taken in isolation).

My personal choice here is to be more lenient and accept any combination of
valid UTF-8 and UTF-8 encoded surrogates (a superset of CESU-8 and WTF-8),
but I'm not going to argue very strongly for it over WTF-8, if at all.

I think that overlongs should NOT be accepted, including overlong zero.

Peter Dimov via Boost

unread,
Jun 13, 2017, 10:02:27 PM6/13/17
to bo...@lists.boost.org, Peter Dimov
> - What is your evaluation of the design?

The general principles on which the library is built are solid, the design
is good. As already noted, I would prefer `stat` and `opendir`/`readdir` as
additions, while we're at it. Yes, we could use Filesystem. On the other
hand, we _could_ use Filesystem for remove, rename, and fstream, too.

> - What is your evaluation of the implementation?

I didn't look at the implementation too deeply, but I did take a look at,
and run, the test suite.

There are a number of copy/paste references to Boost.System here
(test/Jamfile) and Boost.Locale there (index.html), which should ideally be
corrected. Also, the test Jamfile is missing

import testing ;

which I do not hold against the author in any way as the Boost.System
Jamfile also had it missing (it works without it unless invoked from the
Boost root.)

The doc/ directory is missing a Jamfile, and there's also no
meta/libraries.json.

There are no .travis.yml and appveyor.yml files in the root, which is odd in
this day and age. :-)

The global BOOST_USE_WINDOWS_H macro should probably be respected.

> - What is your evaluation of the documentation?

Does the job, although not all functions had their behavior documented for
the case where the passed UTF-8 sequence is invalid.

> - What is your evaluation of the potential usefulness of the library?

The library, or something isomorphic to it, is indispensable for any serious
Windows work.

> - Did you try to use the library? With what compiler? Did you have any
> problems?

As mentioned, I ran the test suite on various versions of msvc (all passed),
MinGW gcc (warnings in c++03 mode, errors in c++11 and above), and Cygwin
gcc/clang (errors because of missing ::getenv and friends).

The problem MinGW gcc 7 (and Cygwin clang, once I enabled the Windows path
in cenv.hpp) identified was

In file included from test\test_system.cpp:11:0:
..\../boost/nowide/cenv.hpp: In function 'int boost::nowide::putenv(char*)':
..\../boost/nowide/cenv.hpp:104:41: warning: comparison between pointer and
zero character constant [-Wpointer-compare]
while(*key_end!='=' && key_end!='\0')
^~~~

which indicates that (1) the library is missing a test for putenv with the
string not containing '=', and (2) the author should consider testing on
Cygwin/Mingw, either locally, or on Appveyor. These two may not be important
production platforms, but they enable testing the Windows-specific code
paths with g++ and clang++, which as the above shows, may uncover potential
problems that testing on MSVC alone would miss.

> - How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?

Quick reading.

> - Are you knowledgeable about the problem domain?

More or less.

> - Do you think the library should be accepted as a Boost library?

Yes. Nowide should be accepted.

Artyom Beilis via Boost

unread,
Jun 14, 2017, 12:51:56 AM6/14/17
to bo...@lists.boost.org, Artyom Beilis
>> This is a very good question.
>>
>> On windows I need to convert for obvious reason.
>>
>> Now the question is what I accept as valid and what is not valid and
>> where do I draw the line.
>> ---------------------------------------------------------------------------------------------------------------------------
>>
>> Now as you have seen there are many possible "non-standard" UTF-8 variants.
>>
>> What should I accept?
>
> I still strongly suggest you simply call RtlUTF8ToUnicodeN()
> (https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx)
> to do the UTF-8 conversion. Do **nothing** else.
>
> Niall
>

Actually, I think you pointed me in a good direction I hadn't considered before.

RtlUTF8ToUnicodeN and its counterpart in the other direction do something
very simple:

They substitute invalid code points/encodings with U+FFFD - REPLACEMENT
CHARACTER, which is the standard Unicode way to say "I failed to convert
something".

It is similar to the current ANSI/wide conversions producing '?' instead.

That looks like a better way to do it than failing to convert the
entire string altogether.

If you get an invalid string, the conversion will succeed, but you'll get
replacement characters (usually shown as � in the UI)
that actually tell you something was wrong.

This way, for example, getenv on a valid key will not return NULL and
create ambiguity about what happened, and it is actually
closer to the common behavior on Windows.

I like it, and I think I'll change the behavior of the conversion
functions in Boost.Nowide accordingly.
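
Something along these lines - a simplified standalone sketch of the policy
(not the actual Nowide code; it assumes a 16-bit wchar_t as on Windows):
invalid lead bytes, truncated sequences, overlongs and encoded surrogates
all come out as U+FFFD, and everything else converts normally.

#include <cstdint>
#include <string>

std::wstring widen_replacing(std::string const& utf8)
{
    std::wstring out;
    unsigned char const* p = reinterpret_cast<unsigned char const*>(utf8.c_str());
    unsigned char const* end = p + utf8.size();
    while(p != end) {
        std::uint32_t cp = 0xFFFD;   // assume failure until proven otherwise
        unsigned char lead = *p++;
        int trail = 0;
        if(lead < 0x80)                { cp = lead; }
        else if((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trail = 1; }
        else if((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trail = 2; }
        else if((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trail = 3; }
        else                           { trail = -1; }  // invalid lead byte
        for(int i = 0; i < trail; i++, p++) {
            if(p == end || (*p & 0xC0) != 0x80) { cp = 0xFFFD; trail = -1; break; }
            cp = (cp << 6) | (*p & 0x3F);
        }
        // reject overlong forms, surrogate code points and out-of-range values
        static std::uint32_t const min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
        if(trail >= 0 && (cp < min_cp[trail] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF))
            cp = 0xFFFD;
        if(cp < 0x10000) {
            out += static_cast<wchar_t>(cp);
        } else {  // needs a surrogate pair
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}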

Thanks!

Artyom Beilis

Artyom Beilis via Boost

unread,
Jun 14, 2017, 1:06:25 AM6/14/17
to bo...@lists.boost.org, Artyom Beilis
>
> There are a number of copy/paste references to Boost.System here
> (test/Jamfile) and Boost.Locale there (index.html), which should ideally be
> corrected.

Ok I'll fix it

> Also, the test Jamfile is missing
> import testing ;
> which I do not hold against the author in any way as the Boost.System
> Jamfile also had it missing (it works without it unless invoked from the
> Boost root.)
>

What does it do and why do I need it?


> The doc/ directory is missing a Jamfile, and there's also no
> meta/libraries.json.
>

Ok.

What I remember from Boost.Locale is that Doxygen wasn't run during the doc
build; rather, the HTML was pushed into the directory. If today there is
support for automatically building docs with vanilla Doxygen (outside the
QuickBook scope), I'll add one.

> There are no .travis.yml and appveyor.yml files in the root, which is odd in
> this day and age. :-)
>

What am I missing?

> The global BOOST_USE_WINDOWS_H macro should probably be respected.
>

Ok good point I'll fix it.


>> - What is your evaluation of the documentation?
>
>
> Does the job, although not all functions had their behavior documented in
> the case the passed UTF-8 sequence is invalid.
>

Good point.

>> - What is your evaluation of the potential usefulness of the library?
>
>
> The library, or something isomorphic to it, is indispensable for any serious
> Windows work.
>
>> - Did you try to use the library? With what compiler? Did you have any
>> problems?
>
>
> As mentioned, I ran the test suite on various versions of msvc (all passed),
> MinGW gcc (warnings in c++03 mode, errors in c++11 and above), and Cygwin
> gcc/clang (errors because of missing ::getenv and friends).
>
> The problem MinGW gcc 7 (and Cygwin clang, once I enabled the Windows path
> in cenv.hpp) identified was
>
> In file included from test\test_system.cpp:11:0:
> ..\../boost/nowide/cenv.hpp: In function 'int boost::nowide::putenv(char*)':
> ..\../boost/nowide/cenv.hpp:104:41: warning: comparison between pointer and
> zero character constant [-Wpointer-compare]
> while(*key_end!='=' && key_end!='\0')
> ^~~~
>

Oops, that is serious, I'll fix it.

https://github.com/artyom-beilis/nowide/issues/14
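
The offending scan just needs the missing dereference - roughly, in
isolation (a sketch, not the actual cenv.hpp code; the surrounding putenv
logic is omitted):

#include <cstddef>

// Length of the key part of a "KEY=VALUE" string; stops at '=' or at the
// end of the string. The old code compared the pointer itself against '\0'.
std::size_t key_length(char const* string)
{
    char const* key_end = string;
    while(*key_end != '=' && *key_end != '\0')   // was: key_end != '\0'
        key_end++;
    return static_cast<std::size_t>(key_end - string);
}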


> which indicates that (1) the library is missing a test for putenv with the
> string not containing '=', and (2) the author should consider testing on
> Cygwin/Mingw, either locally, or on Appveyor. These two may not be important
> production platforms, but they enable testing the Windows-specific code
> paths with g++ and clang++, which as the above shows, may uncover potential
> problems that testing on MSVC alone would miss.
>

In general I tested using MinGW and MSVC; I also never tried clang on
Windows - it is probably time to.

>> - How much effort did you put into your evaluation? A glance? A quick
>> reading? In-depth study?
>
>
> Quick reading.
>
>> - Are you knowledgeable about the problem domain?
>
>
> More or less.
>
>> - Do you think the library should be accepted as a Boost library?
>
>
> Yes. Nowide should be accepted.
>

Thank You for your effort and valuable inputs.

Artyom Beilis

Groke, Paul via Boost

unread,
Jun 14, 2017, 4:12:26 AM6/14/17
to bo...@lists.boost.org, Groke, Paul
Peter Dimov wrote:
> Others have made their case for WTF-8, which has the desirable property of
> only allowing one encoding for any uint16_t sequence (taken in isolation).
>
> My personal choice here is to be more lenient and accept any combination of
> valid UTF-8 and UTF-8 encoded surrogates (a superset of CESU-8 and WTF-8),
> but I'm not going to argue very strongly for it over WTF-8, if at all.

I have to admit that I was confusing WTF-8 and CESU-8. When converting from wide to narrow I would favor CESU-8, mostly because it makes the rules simpler and creates fewer special cases (e.g. when concatenating).

When converting from narrow to wide I agree that accepting any UTF-8/WTF-8/CESU-8 mix would be good.

> I think that overlongs should NOT be accepted, including overlong zero.

Agreed.

Niall Douglas via Boost

unread,
Jun 14, 2017, 6:41:23 AM6/14/17
to bo...@lists.boost.org, Niall Douglas
> Actually I think you provided me a good direction I hadn't considered before.
>
> RtlUTF8ToUnicodeN and other way around function does something very simple:
>
> It substitutes invalid codepoints/encoding with U+FFFD - REPLACEMENT CHARACTER
> which is standard Unicode way to say I failed to convert a thing.
>
> It is something similar to current ANSI / Wide conversions creating ? instead.
>
> It looks like it is better way to do it instead of failing to convert
> entire string all together.
>
> If you get invalid string conversion will success but you'll get
> special characters (that are usually marked as � in UI)
> that will actually tell you something was wrong.

Some additional information about how the NT kernel treats the FFFD
character might be useful to you.

Quite a lot of kernel APIs treat UNICODE_STRING as just a bunch of bytes
e.g. the filesystem. You can supply any path with any characters at all,
including one containing zero bytes. The only character you can't use is
the backslash as that is the path separator. This matches how most Unix
filesystems work, and indeed at the NT kernel level path comparisons are
case sensitive as well as byte sensitive (unless you ask otherwise)
because it's just a memcmp(). You can have real fun here creating lots
of paths completely unparseable by Win32, as in, files totally
inaccessible to any Win32 API with some really random Win32 error codes
being returned.

Other NT kernel APIs will refuse strings containing FFFD or illegal
UTF-16 characters, and if they do it's generally because accepting them
would be an obvious security risk. But those are definitely a minority.

If the Win32 layer doesn't get in the way, doing as RtlUTF8ToUnicodeN()
does should be safe in so far as the kernel team feel it is. They have
placed appropriate checks on appropriate kernel APIs. But in the end,
it's down to the programmer to correctly validate and check all
untrusted input. Doing so always is an unnecessary expense for most end
users who have trusted input.

Niall

Peter Dimov via Boost

unread,
Jun 14, 2017, 9:23:45 AM6/14/17
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:
> > Also, the test Jamfile is missing
> > import testing ;

...

> What it does and why do I need one?

It contains the definitions of the testing rules such as 'run'. Not having
it is like a missing #include in C++. It only works because someone happened
to #include it first.

> > There are no .travis.yml and appveyor.yml files in the root, which is
> > odd in this day and age. :-)
>
>
> What do I miss?

If you're not familiar with the Travis and Appveyor CIs, you should check
them out, because they are very valuable. What they do is integrate with
Github and perform an automatic test on every repo push, which, if set up
correctly, includes pull requests. Github then displays the test results on
the PR page. This is very useful if you want to decide whether to merge a
PR.

You can look at the Boost.System .travis.yml and appveyor.yml files as
examples.

Frédéric Bron via Boost

unread,
Jun 15, 2017, 2:44:34 AM6/15/17
to bo...@lists.boost.org, Frédéric Bron, Peter Dimov
> Or is it the one where the user receives a path from Nowide? Which currently
> doesn't occur anywhere, because the library does not return paths,

this is not fully true as you can get a path with getenv.

Frédéric

Frédéric Bron via Boost

unread,
Jun 15, 2017, 3:00:10 AM6/15/17
to bo...@lists.boost.org, Frédéric Bron
> RtlUTF8ToUnicodeN and other way around function does something very simple:
> It substitutes invalid codepoints/encoding with U+FFFD - REPLACEMENT CHARACTER
> which is standard Unicode way to say I failed to convert a thing.

so does this allow roundtrip conversion UTF-16->UTF-8->UTF-16?

Frédéric

Frédéric Bron via Boost

unread,
Jun 15, 2017, 8:32:54 AM6/15/17
to bo...@lists.boost.org, Frédéric Bron
This simple program compiles nicely on Linux+gcc 6.3.0 but fails to
compile on x86_64-w64-mingw (6.3.0, std=c++14):

#include "nowide/fstream.hpp"
int main() {
auto f = nowide::fstream("toto.log");
return 0;
}

Instead of the auto f = ... form,
nowide::fstream f("toto.log");
works fine.

The error is:
toto.cpp:4:38: error: use of deleted function «
nowide::basic_fstream<char>::basic_fstream(const
nowide::basic_fstream<char>&) »
auto f = nowide::fstream("toto.log");
^
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: note : «
nowide::basic_fstream<char>::basic_fstream(const
nowide::basic_fstream<char>&) » is implicitly deleted because the
default definition would be ill-formed:
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
nowide/fstream.hpp:190:11: error: use of deleted function «
std::basic_iostream<_CharT, _Traits>::basic_iostream(const
std::basic_iostream<_CharT, _Traits>&) [with _CharT = char; _Traits =
std::char_t
raits<char>] »
In file included from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/iterator:65:0,
from ./nowide/encoding_utf.hpp:14,
from ./nowide/convert.hpp:12,
from nowide/fstream.hpp:13,
from toto.cpp:1:
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/istream:863:7:
note : declared here
basic_iostream(const basic_iostream&) = delete;
^~~~~~~~~~~~~~
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: error: use of deleted function «
std::basic_ios<_CharT, _Traits>::basic_ios(const
std::basic_ios<_CharT, _Traits>&) [with _CharT = char; _Traits =
std::char_traits<char>] »
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
In file included from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/ios:44:0,
from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/ostream:38,
from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/iterator:64,
from ./nowide/encoding_utf.hpp:14,
from ./nowide/convert.hpp:12,
from nowide/fstream.hpp:13,
from toto.cpp:1:
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/bits/basic_ios.h:475:7:
note : declared here
basic_ios(const basic_ios&) = delete;
^~~~~~~~~
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: error: «
nowide::scoped_ptr<T>::scoped_ptr(const nowide::scoped_ptr<T>&) [with
T = nowide::basic_filebuf<char>] » is private within this context
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
In file included from nowide/fstream.hpp:14:0,
from toto.cpp:1:
./nowide/scoped_ptr.hpp:31:5: note : declared private here
scoped_ptr(scoped_ptr const &);

Artyom Beilis via Boost

unread,
Jun 15, 2017, 2:01:22 PM6/15/17
to bo...@lists.boost.org, Artyom Beilis
On Thu, Jun 15, 2017 at 3:32 PM, Frédéric Bron via Boost
<bo...@lists.boost.org> wrote:
> This simple program compiles nicely on Linux+gcc 6.3.0 but fails to
> compile on x86_64-w64-mingw (6.3.0, std=c++14):
>
> #include "nowide/fstream.hpp"
> int main() {
> auto f = nowide::fstream("toto.log");
> return 0;
> }
>
> instead of
> auto f=...,
> nowide::fstream f("toto.log")
> works fine.
>
> The error is:
> toto.cpp:4:38: erreur : use of deleted function «
> nowide::basic_fstream<char>::basic_fstream(const
> nowide::basic_fstream<char>&) »
> auto f = nowide::fstream("toto.log");
> ^


Ok, I see.

Nowide is missing some of the C++11 interfaces - for example the move
constructor - but it shouldn't be hard to implement them.

Also, I wasn't aware that auto f = nowide::fstream("toto.log"); can
actually call the move constructor.
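
(Before C++17's guaranteed copy elision, auto f = T("..."); requires an
accessible copy or move constructor even though the move is normally
elided, which is why the MinGW build - where nowide::fstream is the
library's own class - rejects it.) The C++11 pattern is roughly the
following; a standalone sketch, not the Nowide code, which holds its
filebuf through a scoped_ptr:

#include <fstream>
#include <istream>
#include <utility>

// A movable iostream wrapper: move-construct the stream base, move the
// filebuf member, then point rdbuf() at this object's own buffer again.
class movable_fstream : public std::iostream
{
public:
    explicit movable_fstream(char const* name)
        : std::iostream(nullptr)
    {
        buf_.open(name, std::ios_base::in | std::ios_base::out);
        rdbuf(&buf_);
    }

    movable_fstream(movable_fstream&& other)
        : std::iostream(std::move(other)),
          buf_(std::move(other.buf_))
    {
        rdbuf(&buf_);  // the moved-from base no longer points at a buffer
    }

private:
    std::filebuf buf_;
};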

Artyom

Frédéric Bron via Boost

unread,
Jun 15, 2017, 2:27:15 PM6/15/17
to bo...@lists.boost.org, Frédéric Bron
> Also I wasn't aware that auto f = nowide::fstream("toto.log"); Can
> actually call move
> constructor.

I am also surprised; I thought it would be equivalent.
And I use the same gcc version (6.3.0) and it causes no issue on
Linux, only with the MinGW cross compiler...

Frédéric

Artyom Beilis via Boost

unread,
Jun 15, 2017, 3:47:09 PM6/15/17
to Artyom Beilis, bo...@lists.boost.org
On Linux nowide fstream is just an alias for std::fstream.

Artyom


On 15 June 2017 at 9:27 PM, "Frédéric Bron" <freder...@m4x.org>
wrote:

Frédéric Bron via Boost

unread,
Jun 15, 2017, 3:54:04 PM6/15/17
to bo...@lists.boost.org, Frédéric Bron
> On Linux nowide fstream is
> just an alias for std::fstream

Of course!

Frédéric Bron via Boost

unread,
Jun 16, 2017, 1:01:00 AM6/16/17
to bo...@lists.boost.org, Frédéric Bron
> Please note: Under POSIX platforms no conversions are performed
> and no UTF-8 validation is done as this is incorrect:
>
> http://cppcms.com/files/nowide/html/index.html#qna

I do not quite understand the rationale behind not converting to UTF-8
on Posix platforms. I naively thought I got UTF-8 in argv because my
system is configured in UTF-8 but I discover that this is not
necessarily always the case. In the example you highlight, I do not see
the difference from the Windows case. You could convert to UTF-8 in
argv and back to the local encoding in nowide::remove. I understand it
is not efficient if you do not really use the content of the filename,
but if you have to write, say, an XML report in UTF-8, you would have
to convert anyway.

Today, what is the portable way to convert argv to UTF-8? i.e. without
having to #ifdef _WIN32...?

Artyom Beilis via Boost

unread,
Jun 16, 2017, 3:00:54 AM6/16/17
to bo...@lists.boost.org, Artyom Beilis
On Fri, Jun 16, 2017 at 8:00 AM, Frédéric Bron <freder...@m4x.org> wrote:
>> Please note: Under POSIX platforms no conversions are performed
>> and no UTF-8 validation is done as this is incorrect:
>>
>> http://cppcms.com/files/nowide/html/index.html#qna
>
> I do not quite understand the rationale behind not converting to UTF-8
> on Posix platforms. I naively thought I got UTF-8 in argv because my
> system is configured in UTF-8 but I discover that this is not
> necessarily always the case. In the example you highlight, I do not see
> the difference from the Windows case. You could convert to UTF-8 in
> argv and back to the local encoding in nowide::remove. I understand it
> is not efficient if you do not really use the content of the filename
> but if you have to write, say an xml report in UTF-8, you would have
> to convert anyway.
>
> Today, what is the portable way to convert argv to UTF-8? i.e. without
> having to #ifdef _WIN32...?
>
> Frédéric

Hello Frederic,

There are several reasons for this.


One of them is actually the original purpose of the library: use the same
type of strings internally without creating broken software on Windows;
and since only Windows uses a native wide API instead of the narrow API
that is native for C++, only the Windows case requires encoding conversion.

However, there is another, deeper issue. Unlike on Windows, where the
native wide API has a well-defined UTF-16 encoding, this is not the case
for Unix-like OSes.

The encoding is defined by the current locale, which can be set
globally, per user, or per process, and can even change trivially within
the same process at runtime.

There are also several sources of the locale/encoding information:

Environment variables:

- LANG/LC_CTYPE - which is UTF-8 on the vast majority of modern Unix-like
platforms but can frequently be undefined or defined as the "C" locale
without encoding information.

This one is what the OS defines for the process.

- C locale: the setlocale API - which is by default the "C" locale per
the standard unless explicitly set otherwise

- C++ locale: the std::locale::global() API - which is by default the "C"
locale per the standard unless explicitly set otherwise

They can all be changed at runtime, they aren't synchronized, and
they can be modified to whatever encoding the user wants.

Additionally, setting std::locale::global to something other than the "C"
locale can lead to some really nasty things, like broken CSV files due to
"," being inserted into numbers.

So the safest and most correct way to handle it is to pass narrow
strings as-is, without any conversion.
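
To see how loosely these are related, a small standalone program is enough
(an illustration, not library code) - each of the three sources can report
a different answer within the same process:

#include <clocale>
#include <cstdlib>
#include <iostream>
#include <locale>

int main()
{
    char const* lang = std::getenv("LANG");
    std::cout << "LANG           : " << (lang ? lang : "(unset)") << "\n";
    std::cout << "C locale       : " << std::setlocale(LC_CTYPE, 0) << "\n";  // "C" unless changed
    std::cout << "C++ global     : " << std::locale().name() << "\n";         // "C" unless changed

    std::setlocale(LC_CTYPE, "");           // now follow the environment for the C locale...
    std::locale::global(std::locale(""));   // ...and for the C++ global one (may throw if the
                                            // environment names an unknown locale)
    std::cout << "C locale now   : " << std::setlocale(LC_CTYPE, 0) << "\n";
    std::cout << "C++ global now : " << std::locale().name() << "\n";
    return 0;
}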

Regards,
Artyom Beilis

Frédéric Bron via Boost

unread,
Jun 16, 2017, 3:41:14 AM6/16/17
to bo...@lists.boost.org, Frédéric Bron
> The encoding is defined by current Locale that can be defined
> globally, per user,
> per process and even change trivially in the same process during the runtime.
> They are all can be changed at runtime, they aren't synchronized and
> they be modified to whatever encoding user wants.
>
> So the safest and the most correct way to handle it is to pass narrow
> strings as is
> without any conversion.

I understand that this complexity can be a nightmare.

How would you therefore do this simple task:
1. get a directory from nowide::getenv -> base_dir (UTF-8 on Windows,
unknown narrow on Posix)
2. create a file in base_dir which name is file_name encoded in UTF-8
(because it is created by the application).

If I understand well, I should NOT do this:

auto f = nowide::ofstream((boost::filesystem::path(base_dir) /
file_name).string());

because this is guaranteed to work only on Windows where I have the
guarantee that base_dir is UTF-8, right?

On Posix, there is a risk that base_dir is for example ISO-8859-1
while file_name is UTF-8 so that the combination made by filesystem is
wrong.
Am I right?

Frédéric

Artyom Beilis via Boost

unread,
Jun 16, 2017, 5:27:35 AM6/16/17
to Artyom Beilis, bo...@lists.boost.org
> How would you therefore do this simple task:
> 1. get a directory from nowide::getenv -> base_dir (UTF-8 on Windows,
> unknown narrow on Posix)
> 2. create a file in base_dir which name is file_name encoded in UTF-8
> (because it is created by the application).
>
> If I understand well, I should NOT do this:
>
> auto f = nowide::ofstream((boost::filesystem::path(base_dir) /
> file_name).string());
>
> because this is guaranteed to work only on Windows where I have the
> guarantee that base_dir is UTF-8, right?


Actually, that is exactly what you should do. Once you turn on the **nowide
integration** with Boost.Filesystem, path::string() does not perform any
conversions on POSIX platforms.

So you can safely concatenate two different strings and a valid file will
be created, even if the directory is in ISO 8859-1 and the file name in
UTF-8. The file will be valid even if its name is not representable in any
single encoding.

Under Windows you'll get the correct conversions from narrow to wide using
the UTF-8/UTF-16 facet installed by the Boost.Nowide/Filesystem
integration, as on Windows Boost.Filesystem uses wide strings.
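
Roughly, the usage looks like this (a sketch only - the header paths and the
environment variable are examples, not something to copy verbatim):

#include <boost/nowide/integration/filesystem.hpp>  // nowide_filesystem()
#include <boost/nowide/cenv.hpp>                    // boost::nowide::getenv
#include <boost/nowide/fstream.hpp>
#include <boost/filesystem/path.hpp>
#include <string>

int main()
{
    boost::nowide::nowide_filesystem();  // install the UTF-8 facet into filesystem::path

    char const* base_dir = boost::nowide::getenv("BASE_DIR");  // example variable name
    if(!base_dir)
        return 1;

    std::string file_name = "r\xC3\xA9sum\xC3\xA9.xml";  // UTF-8, created by the application

    boost::filesystem::path p = boost::filesystem::path(base_dir) / file_name;

    // On Windows p.string() goes through the UTF-8/UTF-16 facet;
    // on POSIX the bytes are passed through unchanged.
    boost::nowide::ofstream f(p.string().c_str());
    f << "<report/>\n";
    return 0;
}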

Artyom

Frédéric Bron via Boost

unread,
Jun 16, 2017, 5:43:37 AM6/16/17
to bo...@lists.boost.org, Frédéric Bron
> And you can safely concatenate two different strings and valid file will be
> created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be
> valid even if not representable in any encoding.

but it seems to me that in this case, what I need is a UTF-8 -> ISO 8859-1
conversion of the file name before concatenation with the directory.
Otherwise, OK, I will get a file because the system just asks for a narrow
string, but its name will be wrong in the OS user interface.

Frédéric

Artyom Beilis via Boost

unread,
Jun 16, 2017, 8:20:19 AM6/16/17
to Frédéric Bron, Artyom Beilis, bo...@lists.boost.org
On Fri, Jun 16, 2017 at 12:42 PM, Frédéric Bron <freder...@m4x.org> wrote:
>> And you can safely concatenate two different strings and valid file will be
>> created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be
>> valid even if not representable in any encoding.
>
> but it seems to me that in this case, what I need is UTF-8->ISO 8859-1
> conversion of the file name before concatenation with the directory.
> Otherwise, OK I will get a file because the system just ask for narrow
> string but its name will be wrong in the OS user interface.
>
> Frédéric


You actually **assume** that the encoding you received from the system
(e.g. via getenv) matches the current locale encoding.

But it is not necessarily the same:

1. The file/directory was created by a user running in a different locale
2. The locale isn't defined properly or was modified
3. You got these files/directories from some other location (e.g.
unzipped some archive)

In reality the OS does not care about encoding (most of the time).

Unlike Windows, where wchar_t also fixes the encoding to UTF-16, under POSIX
platforms a "char *" can contain whatever encoding, and that can change.

Also, UTF-8 is the most common encoding on all modern Unix-like
systems: Linux, BSD, Mac OS X.

So I don't think it is necessary to perform any conversions between UTF-8
and whatever "char *" encoding you get, because:

(a) You can't reliably know what encoding is actually in use.
(b) The same "char *" may contain parts in different encodings and
still be a valid path.


Artyom

Peter Dimov via Boost

unread,
Jun 16, 2017, 8:41:59 AM6/16/17
to bo...@lists.boost.org, Peter Dimov
Artyom Beilis wrote:
> Actually it is what exactly you should do. Once you turn on **nowide
> integration** with boost filesystem
>
> path::string does not perform any conversions on POSIX platforms.
>
> And you can safely concatenate two different strings and valid file will
> be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will
> be valid even if not representable in any encoding.

"POSIX" is not correct here. File names under Mac OS X are UTF-8. Always
UTF-8, not merely by convention, and no mater what locale is set. (They're
stored in UTF-16 in the filesystem, just like in NTFS. Unlike NTFS, it's
_valid_ UTF-16.)

For encoding-neutral Unix-es such as Linux and Solaris, the above is
correct.

In any case, a library that attempts to reencode from the current locale to
UTF-8 in the narrow case would be useless. (If the locale isn't UTF-8,
nothing will work anyway, the library trying to handle this case would just
make things worse.) This is well established empirically by now.

Niall Douglas via Boost

unread,
Jun 16, 2017, 10:15:58 AM6/16/17
to bo...@lists.boost.org, Niall Douglas
>> And you can safely concatenate two different strings and valid file
>> will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The
>> file will be valid even if not representable in any encoding.
>
> "POSIX" is not correct here. File names under Mac OS X are UTF-8. Always
> UTF-8, not merely by convention, and no mater what locale is set.

You were right but are now wrong, I am afraid, Peter:
https://mjtsai.com/blog/2017/03/24/apfss-bag-of-bytes-filenames/

Apple's shiny new APFS treats paths as a dumb sequence of bytes, and
they are retrofitting HFS+ to behave the same to match.

> (They're stored in UTF-16 in the filesystem, just like in NTFS. Unlike
> NTFS, it's _valid_ UTF-16.)

You should assume that invalid UTF will be present on all filing systems
because almost all of them treat paths as dumb byte strings. It's the
only portable assumption. It's why AFIO v2 opts out of case
insensitivity on Windows, which no doubt will surprise quite a few end
users, but it does prevent many security attacks at the AFIO layer (by
pushing the problem further up the stack).

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/


Edward Diener via Boost

unread,
Jun 16, 2017, 5:06:26 PM6/16/17
to bo...@lists.boost.org, Edward Diener
On 6/12/2017 12:20 AM, Frédéric Bron via Boost wrote:
> Hi Everyone,
>
> The formal review of Artyom Beilis' Nowide library starts today and
> will last until Wed. 21st of June.
>
> Your participation is encouraged, as the proposed library is
> uncoupled, focused and rather small. Nowadays, everyone needs to
> handle Unicode but this is very difficult using only the standard
> library in a platform independant way. Nowide offers a very simple way
> to handle Unicode the same way on
> Windows/MacOS/Linux.
>
> Key features:
>
> * work with UTF-8 in your code, Nowide converts to OS encoding
> * Easy to use functions for converting UTF-8 to/from UTF-16
> * A class to fixing argc, argc and env main parameters to use UTF-8
> * UTF-8 aware functions:
> - stdio.h functions (fopen, freopen, remove, rename)
> - stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
> - fstream (filebuf, fstream/ofstream/ifstream)
> - iostream (cout, cerr, clog, cin)
>
> Documentation:
> http://cppcms.com/files/nowide/html
>
> GitHub:
> https://github.com/artyom-beilis/nowide
> git clone https://github.com/artyom-beilis/nowide.git
>
> Latest tarballs:
> - to be unzipped in boost source:
> https://github.com/artyom-beilis/nowide/archive/master.zip
> - as a standalone library: http://cppcms.com/files/nowide/nowide_standalone.zip
>
> Nowide has standard boost layout. So, just put it into the boost
> source tree under libs/nowide directory for building and running
> tests.
> Or alternatively - for header only part of the library add the
> nowide/include path to the include path of your projects.

How do I build the documentation for the nowide library ?

>
> Please post your comments and review to the boost mailing list
> (preferably), or privately to the Review Manager (to me ;-). Here are
> some questions you might want to answer in your review:
>
> - What is your evaluation of the design?
> - What is your evaluation of the implementation?


> - What is your evaluation of the documentation?

> - What is your evaluation of the potential usefulness of the library?

> - Did you try to use the library? With what compiler? Did you have any
> problems?

> - How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?

> - Are you knowledgeable about the problem domain?
>

> And most importantly:
> - Do you think the library should be accepted as a Boost library? Be
> sure to say this explicitly so that your other comments don't obscure
> your overall opinion.
>
> For more information about Boost Formal Review Process, see:
> http://www.boost.org/community/reviews.html
>
> Thank you very much for your time and efforts.
>
> Frédéric

Artyom Beilis via Boost

unread,
Jun 16, 2017, 5:14:02 PM6/16/17
to bo...@lists.boost.org, Artyom Beilis
On Sat, Jun 17, 2017 at 12:06 AM, Edward Diener via Boost
<bo...@lists.boost.org> wrote:
>
>> Nowide has standard boost layout. So, just put it into the boost
>> source tree under libs/nowide directory for building and running
>> tests.
>> Or alternatively - for header only part of the library add the
>> nowide/include path to the include path of your projects.
>
>
> How do I build the documentation for the nowide library ?
>


Just run doxygen (it hasn't been integrated into Boost.Build yet).

Also note the full docs can be found here: http://cppcms.com/files/nowide/html/

Artyom

Zach Laine via Boost

unread,
Jun 19, 2017, 1:19:36 PM6/19/17
to bo...@lists.boost.org, Zach Laine
On Sun, Jun 11, 2017 at 11:20 PM, Frédéric Bron via Boost <
bo...@lists.boost.org> wrote:

> Hi Everyone,
>
> The formal review of Artyom Beilis' Nowide library starts today and
> will last until Wed. 21st of June.

> [snip]


> Please post your comments and review to the boost mailing list
> (preferably), or privately to the Review Manager (to me ;-). Here are
> some questions you might want to answer in your review:
>

> - What is your evaluation of the design?
>

1) I'd really much rather have an iterator-based interface for the
narrow/wide conversions. There's an existing set of iterators in
Boost.Regex already, and I've recently written one here:

https://github.com/tzlaine/text/blob/master/boost/text/utf8.hpp

The reliance on a new custom string type is a substantial mistake, IMO
(boost::nowide::basic_stackstring). Providing an iterator interface
(possibly cribbing the one of the two implementations above) would negate
the need for this new string type -- I could use the existing std::string,
MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer
that the new interfaces be defined in terms of string_view instead of
string/basic_stackstring (there's also a string_view implementation already
in Boost.Utility). string_view is simply far more usable, since it binds
effortlessly to either a char const * or a string.

2) I don't really understand what happens when a user passes a valid
Windows filename that is *invalid* UTF-16 to a program using Nowide. Is
the invalid UTF-16 filename just broken in the process of trying to convert
it to UTF-8? This is partially a documentation problem, but until I
understand how this is intended to work, I'm also counting it as a design
issue.

> - What is your evaluation of the implementation?

I did not look.

> - What is your evaluation of the documentation?

I think the documentation needs a bit of work. The non-reference portion
is quite thin, and drilling down into the reference did not answer at least
one question I had (the one above, about invalid UTF-16):

Looking at some example code in the "Using the Library" section, I saw this:

"
To make this program handle Unicode properly, we do the following changes:

#include <boost/nowide/args.hpp>
#include <boost/nowide/fstream.hpp>
#include <boost/nowide/iostream.hpp>
int main(int argc,char **argv)
{
boost::nowide::args a(argc,argv); // Fix arguments - make them UTF-8
"

Ok, so I clicked "boost::nowide::args", hoping for an answer. The detailed
description for args says:

"
args is a class that fixes standard main() function arguments and changes
them to UTF-8 under Microsoft Windows.

The class uses GetCommandLineW(), CommandLineToArgvW() and
GetEnvironmentStringsW() in order to obtain the information. It does not
relates to actual values of argc,argv and env under Windows.

It restores the original values in its destructor
"

It tells me nothing about what happens when invalid UTF-16 is encountered.
Is there an exception? Is 0xfffd inserted? If the latter, am I just
stuck? I should not have to read any source code to figure this out, but
it looks like I have to.

This criticism can be applied to most of the documentation. My preference
is that the semantics of primary functionality of the library should be
explained in tutorials or other non-reference formats. The current state
of the docs doesn't even explain things in the references. This must be
fixed before this library can be accepted.

> - What is your evaluation of the potential usefulness of the library?

I think this library is attempting to address a real and important issue.
I just can't figure out if it's a complete solution, because how invalid
UTF-16 is treated remains a question.

> - Did you try to use the library? With what compiler? Did you have any

> problems?
>

I did not.

- How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?
>

A quick reading, plus a bit of discussion on the list.


> - Are you knowledgeable about the problem domain?
>

I understand the UTF-8 issues reasonably well, but am ignorant of the
Windows-specific issues.


> And most importantly:
> - Do you think the library should be accepted as a Boost library? Be
> sure to say this explicitly so that your other comments don't obscure
> your overall opinion.
>

I do not think the library should be accepted in its current form. It
seems not to handle malformed UTF-16, which is a requirement for processing
Windows file names (as I understand it -- please correct this if I'm
wrong). Independent of this, I don't find the docs to be sufficient.

Zach

Frédéric Bron via Boost

unread,
Jun 19, 2017, 1:38:19 PM6/19/17
to bo...@lists.boost.org, Frédéric Bron
Question to all:

Why should we try to handle wrong UTF-16 (or wrong UTF-8)?
1. such files should not exist
2. if they exist, why?
- if it is because it is an old file, can the user just rename it properly?
- if it is because it was produced by a program, why should this
program continue to work without being fixed? Isn't that the best way to
get wrong filenames forever?

I do not understand why we cannot just issue an error.

Thanks for explanations,

Frédéric

--
Frédéric Bron

-----------------------------------------------------------
Frédéric Bron (freder...@m4x.org)
Villa des 4 chemins, Centre Hospitalier, BP 208
38506 Voiron Cedex
tél. fixe : +33 4 76 67 17 27, tél. port.: +33 6 67 02 77 35

Zach Laine via Boost

unread,
Jun 19, 2017, 1:43:45 PM6/19/17
to bo...@lists.boost.org, Zach Laine
On Mon, Jun 19, 2017 at 12:37 PM, Frédéric Bron via Boost <
bo...@lists.boost.org> wrote:

> Question to all:
>
> Why should we try to handle wrong UTF-16 (or wrong UTF-8)?
> 1. such files should not exist
> 2. if they exist, why?
> - if it is because it is an old file, can the user just rename it
> properly?
> - if it is because it was produced by a program, why should this
> program continue to work without fixing? Isn't it the best way that we
> get wrong filenames forever?


This came up at C++Now last month. My understanding, from talking to
people who seem to know about this, is that such filenames are considered
valid by Windows. To issue an error in such a case means not allowing
users to access files that Windows sees as well-formed.

Frédéric Bron via Boost

unread,
Jun 19, 2017, 3:41:45 PM6/19/17
to bo...@lists.boost.org, Frédéric Bron
> This came up at C++Now last month. My understanding, from talking to
> people who seem to know about this, is that such filenames are considered
> valid by Windows. To issue an error in such a case means not allowing
> users to access files that Windows sees as well-formed.

but who/what needs ill-formed filenames? For what reason?

Frédéric

Peter Dimov via Boost

unread,
Jun 19, 2017, 3:53:28 PM6/19/17
to bo...@lists.boost.org, Peter Dimov
Frédéric Bron wrote:
> > This came up at C++Now last month. My understanding, from talking to
> > people who seem to know about this, is that such filenames are
> > considered valid by Windows. To issue an error in such a case means not
> > allowing users to access files that Windows sees as well-formed.
>
> but who/what needs ill-formed filenames? For what reason?

The question whether to support such filenames is more a matter of
principle. Namely, whether it is the job of the Nowide library to tell you
what filenames you may use or not. One might argue that it should be of no
concern to it whether, or for what reason, I need such filenames.

Artyom Beilis via Boost

unread,
Jun 19, 2017, 3:57:50 PM6/19/17
to bo...@lists.boost.org, Artyom Beilis
On Mon, Jun 19, 2017 at 8:19 PM, Zach Laine via Boost
<bo...@lists.boost.org> wrote:
> On Sun, Jun 11, 2017 at 11:20 PM, Frédéric Bron via Boost <
> bo...@lists.boost.org> wrote:
>
>
>> - What is your evaluation of the design?
>>
>
> 1) I'd really much rather have an iterator-based interface for the
> narrow/wide conversions. There's an existing set of iterators in
> Boost.Regex already, and I've recently written one here:
>
> https://github.com/tzlaine/text/blob/master/boost/text/utf8.hpp

There is such functionality in Boost.Locale as well - Boost.Nowide uses the
UTF conversions from the header-only part of Boost.Locale.


>
> The reliance on a new custom string type is a substantial mistake, IMO
> (boost::nowide::basic_stackstring).

> [snip]


> (possibly cribbing the one of the two implementations above) would negate
> the need for this new string type -- I could use the existing std::string,
> MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer
> that the new interfaces be defined in terms of string_view instead of
> string/basic_stackstring (there's also a string_view implementation already
> Boost.Utility). string_view is simply far more usable, since it binds
> effortlessly to either a char const * or a string.
>

stackstring is merely an on-stack buffer optimization - it isn't a general
string and was never intended to be one.

It isn't placed in details because it can actually be useful outside the
library's scope.
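
To illustrate the idea only (a simplified sketch, not the actual
basic_stackstring interface): small results live in the embedded array,
larger ones fall back to the heap.

#include <cstddef>

template<std::size_t BufferSize = 256>
class stack_wide_buffer {
public:
    stack_wide_buffer() : data_(buffer_), heap_(0) {}
    ~stack_wide_buffer() { delete[] heap_; }

    // Reserve space for n wide characters plus a terminator and
    // return a pointer the conversion can write into.
    wchar_t* reserve(std::size_t n)
    {
        if(n + 1 <= BufferSize)
            return data_ = buffer_;
        delete[] heap_;
        heap_ = new wchar_t[n + 1];
        return data_ = heap_;
    }
    wchar_t const* c_str() const { return data_; }

private:
    wchar_t buffer_[BufferSize];
    wchar_t* data_;
    wchar_t* heap_;
    stack_wide_buffer(stack_wide_buffer const&);             // non-copyable in this sketch
    stack_wide_buffer& operator=(stack_wide_buffer const&);
};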


> Providing an iterator interface

There is no iterator interface in the communication with the OS,
so there is no sense in adding one there. If you want a character-by-character
interface, it exists in Boost.Locale.


> 2) I don't really understand what happens when a user passes a valid
> Windows filename that is *invalid* UTF-16 to a program using Nowide. Is
> the invalid UTF-16 filename just broken in the process of trying to convert
> it to UTF-8? This is partially a documentation problem, but until I
> understand how this is intended to work, I'm also counting it as a design
> issue.
>

You are right, it wasn't clear in the docs. My policy was to return an error
in case of invalid encoding, but after the comments I received I will rather
replace invalid sequences with the U+FFFD replacement character,
since that is what WinAPI usually does.

https://github.com/artyom-beilis/nowide/issues/16

It is something easy to fix.


>> - What is your evaluation of the documentation?
>
> I think the documentation needs a bit of work. The non-reference portion
> is quite thin, and drilling down into the reference did not answer at least
> one question I had (the one above, about invalid UTF-16):
>

Actually, the documentation points out that in case of invalid character
encoding an error is returned, and how (in most places - but
there are some misses).

However, I agree that it can be improved,
and a special section regarding the bad-encoding conversion policy
should and will be added.

>
> It tells me nothing about what happens when invalid UTF-16 is encountered.
> Is there an exception? Is 0xfffd inserted? If the latter, am I just
> stuck? I should not have to read any source code to figure this out, but
> it looks like I have to.

Valid point.

>
>> And most importantly:
>> - Do you think the library should be accepted as a Boost library? Be
>> sure to say this explicitly so that your other comments don't obscure
>> your overall opinion.
>>
>
> I do not think the library should be accepted in its current form. It
> seems not to handle malformed UTF-16, which is a requirement for processing
> Windows file names (as I understand it -- please correct this if I'm
> wrong).

I strongly disagree with that. Converting an invalid ANSI encoding
shouldn't lead to invalid UTF-16, and the same holds the other way around.

Since there is no way to represent invalid UTF-16 using valid UTF-8,
just as there is no way to do it in ANSI<->UTF-16, there is
no unambiguous way to provide such support.


>
> Zach
>

Thank You for your review.

Artyom

Chris Glover via Boost

unread,
Jun 19, 2017, 11:44:40 PM6/19/17
to bo...@lists.boost.org, Chris Glover
Hi,

The following is my review of the proposed Boost.Nowide library. This is my
first official review of a boost library, so please forgive me if I've
missed anything.


- What is your evaluation of the design?

At first I was surprised by the lack of a platform-agnostic encoding. However,
through reading the docs and the discussion on the list, I believe this to
be the correct choice. If a user has a need for a common encoding, I
believe that would be best served by a different library that would
ultimately have different issues and performance characteristics.

One thing I'm not sure about is whether the overlap between the fstream
implementations in Filesystem and Nowide is necessary. It seems to me that
if I set filesystem to be nowide aware (via
boost::nowide::nowide_filesystem) then there's little reason for the nowide
versions of the stream objects to exist, other than possibly eliminating
one string copy on non-windows platforms and avoiding the filesystem
dependency (and maybe those are good enough reasons).


- What is your evaluation of the implementation?

I only gave the implementation a glance. It looks good: clean, organized,
with a full test suite; nothing weird stands out. It seems to be of the
typical quality I expect from a Boost library.


- What is your evaluation of the documentation?

The documentation is clear and complete, but I think it needs to explain at
the top that the library does not present a uniform encoding for strings.
This is discussed in the documentation near the end, under "Q & A", but I
think it should be near the top. A suggestion is to add a section called
"What Boost.Nowide is *not*" directly under the section titled "What is
Boost.Nowide", otherwise I expect some users to be surprised.


- What is your evaluation of the potential usefulness of the library?

It is very useful. I am already using it.


- Did you try to use the library? With what compiler? Did you have any
problems?

I converted one application that was using boost::filesystem::path to
transport filenames around to use nowide instead. I had no issues,
everything just worked. Some code was also simplified as a result -- not
dramatically, but enough that it felt better.

I tested the application on windows only using Visual Studio 2017.

I also ran the tests under MSYS2 on Windows using gcc-6.3, on Windows using
MSVC 2017 and on Ubuntu using gcc-7.

Under MSYS, one test failed (Fail: Error boost::nowide::cin.putback(c) in
test_iostream.cpp:17 main), but I haven't had a chance to look into why
yet.

All of the tests passed on the other platforms.


- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?

I followed the discussion, read the docs, integrated the library, used the
API and tested the resulting program. All in, I think that took me about 4
hours. This seems quick, but to me that is a testament to a solid design
and clear documentation because the conversion process was very simple.


- Are you knowledgeable about the problem domain?

I am aware of the problem and have worked around it in the past by using
filesystem::path as a string. However, most of the environments I work in
professionally are controlled in a way that allows us to avoid this problem
most of the time, so I have not had extensive experience with it.

- Do you think the library should be accepted as a Boost library?

Yes. I am already using it and expect I will be using it a lot more going
forward.

-- chris

degski via Boost

unread,
Jun 20, 2017, 1:16:40 AM6/20/17
to boost, degski
On 19 June 2017 at 22:52, Peter Dimov via Boost <bo...@lists.boost.org>
wrote:

> The question whether to support such filenames is more a matter of
> principle. Namely, whether it is the job of the Nowide library to tell you
> what filenames you may use or not. One might argue that it should be of no
> concern to it whether, or for what reason, I need such filenames.
>

I think the ruling principle should be the Unicode standard. Nowide should
support the Unicode standard and no more (at least in its intentions). One
of the intentions of the standard, as I read it, is to guarantee that a
conversion can *unambiguously (and safely) round-trip*. On Windows, if
WTF-8 (what's in a name? Apart from the obvious, even Wobbly sounds
bizarre), CESU-8 or Modified UTF-8 are ways to achieve that, I think that
they should be supported. If not, an exception should be thrown when
encountering these invalid encodings, as this is in my view an IO issue
(the uncertain environment in which an application has to live), in which
context throwing seems to be the norm.

I'm with Frédéric Bron on this one, though. I don't understand why invalid
encodings are found in the wild in the first place and why they should
continue to exist in the wild. The whole thing sounds like a vector for
exploits, malicious code to generate invalid encodings after which
buffer-overruns open up the system.

Something I cannot understand is that some of those on this list who are
most critical of Windows in view of security concerns are also the same
people who happily perpetuate these weaknesses. Microsoft is very much in
front here by dropping support for old OSes (and therefore their associated
compilers and CRTs); Boost should do the same and adopt rigour: "Ceterum
autem censeo Carthaginem esse delendam"

What's not terribly clear either is whether we are talking about NTFS or
Explorer (and then there is WinFS lurking in the shadows, bound to be
released someday). Windows does (many) other things that are weird, like
the way it deals with capitalized names, long paths, etc.; NTFS itself does
not actually have these issues and does quite a bit more than what the
Explorer shell can do.

degski
--
"*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend,
Schmerzen aus Schwäche stillend.*" - Novalis 1798

Artyom Beilis via Boost

unread,
Jun 20, 2017, 4:01:35 AM6/20/17
to bo...@lists.boost.org, Artyom Beilis
On Tue, Jun 20, 2017 at 6:44 AM, Chris Glover via Boost
<bo...@lists.boost.org> wrote:
> Hi,
>
> The following is my review of the proposed Boost.Nowide library.

> One thing I'm not sure about is if the overlap between the fstream
> implementations in filesystem and nowide is necessary. It seems to me that
> if I set filesystem to be nowide aware (via
> boost::nowide::nowide_filesystem) then there's little reason for the nowide
> versions of the stream objects to exist, other than possibly eliminating
> one string copy on non-windows platforms and avoiding the filesystem
> dependency (and maybe those are good enough reasons).
>

The reason Nowide implements its own fstream is that filesystem::fstream
internally uses the default std::basic_filebuf and thus relies on the
standard library to provide filebuf::open(wchar_t const *).

While that works OK for MSVC, MinGW's libstdc++ does not provide the wchar_t
interface. For Filesystem the fix is far from trivial, as it requires an
implementation of filebuf, so Nowide implemented one.
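
For reference, the non-portable piece in question is roughly this (a sketch;
the wchar_t overload of open() is a Microsoft extension):

#include <fstream>

// Try to open a filebuf from a wide path. MinGW's libstdc++ has no such
// overload, which is why Nowide carries its own filebuf implementation.
bool open_wide(std::filebuf& buf, wchar_t const* name)
{
#if defined(_MSC_VER)
    return buf.open(name, std::ios_base::in | std::ios_base::binary) != 0;
#else
    (void)buf;
    (void)name;
    return false;  // no wide overload available here
#endif
}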


> - What is your evaluation of the documentation?
>
> The documentation is clear and complete, but I think it needs to explain at
> the top that the library does not present a uniform encoding for strings.
> This is discussed in the documentation near the end, under "Q & A", but I
> think it should be near the top. A suggestion is to add a section called
> "What Boost.Nowide is *not*" directly under the section titled "What is
> Boost.Nowide", otherwise I expect some users to be surprised.
>

I think that is a very good point.

>
> Under MSYS, one test failed (Fail: Error boost::nowide::cin.putback(c) in
> test_iostream.cpp:17 main), but I haven't had a chance to look into why
> yet.
>
> All of the tests passed on the other platforms.
>

Ok, I hope that once I have the library running under the full Boost test
suite it will be easier to track down such things.

>
> Yes. I am already using it and expect I will be using it a lot more going
> forward.
>

Thank You very much for the review

> -- chris
>

Artyom

Chris Glover via Boost

unread,
Jun 20, 2017, 7:48:49 AM6/20/17
to bo...@lists.boost.org, Chris Glover
On Tue, 20 Jun 2017 at 04:01 Artyom Beilis via Boost <bo...@lists.boost.org>
wrote:

>
>
> The reason nowide implemented fstream is because filesystem::fstream
> uses internally the default std::basic_filebuf thus relays on std library
> to provide filebuf::open(wchar_t const *).
>
> Also it works OK for MSVC, MinGW libstdc++ does not provide wchar_t
> interface. For filesytem the fix is far from trivial as requires
> implementation
> of filebuf so nowide implemented one.
>
>
>
OK, makes sense.


>
> Thank You very much for the review
>
>
You're very welcome.

-- chris

Norbert Wenzel via Boost

unread,
Jun 20, 2017, 5:12:39 PM6/20/17
to bo...@lists.boost.org, Norbert Wenzel
This is my review of the proposed Boost.Nowide library.

On 06/12/2017 06:20 AM, Frédéric Bron via Boost wrote:
> - What is your evaluation of the design?
I had no troubles finding my way. Most APIs are already well known as
they resemble APIs from the standard library. I understand (after
reading the discussions on this list) why Nowide does not convert all
input to a common encoding on all platforms but only deals with wide
strings on Windows. It was a bit of a surprise at first, so this design
decision should be documented more prominently as others have already
pointed out.
In general I like the idea that only valid Unicode is accepted as input,
although I had an issue with invalid encoding written to nowide::cout
that set the stream's failbit and ignored all subsequent output. The
author was already aware of that issue and said invalid characters will
be replaced with U+FFFD (invalid character) after the review. I prefer
this solution to stopping all output after any failure.

> - What is your evaluation of the implementation?
I did not look at the implementation. It worked as expected/documented
when I tested the library.

> - What is your evaluation of the documentation?

It is enough to start working with the library. I would like it to state more
clearly, for every function, when and how input strings are converted to a
different encoding and what happens in case of a conversion error
(exception, failbit of the stream is set, invalid character, etc.).

> - What is your evaluation of the potential usefulness of the library?
The library is very useful and I intend to use it as a portable
fire-and-forget solution to typical Unicode problems on Windows. I think
it will be a very helpful library especially when porting software from
Posix to Windows, since the library enables dealing with Unicode without
dealing with (and first learning about) the Windows Unicode idiosyncrasies.
As it came up in the discussion, I do not think I ever encountered
non-normalized Unicode file paths on Windows so I do not think the lack
of support for such filepaths is detrimental to the usefulness of this
library.

> - Did you try to use the library? With what compiler? Did you have any
> problems?

I tested with MSVC14 and had no technical problems building or using the
library.

> - How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?

I read the discussions on the list, the documentation and wrote a small
toy program to test my expected usage of Nowide. All in all I invested
about three hours.

> - Are you knowledgeable about the problem domain?

Somewhat. I had to work around typical Unicode issues on Windows
multiple times, but always stopped when the issue was solved. So my
knowledge is mostly driven by the specific issues I had so far and not
profound.

> - Do you think the library should be accepted as a Boost library?
Yes.

Norbert

Artyom Beilis via Boost

unread,
Jun 22, 2017, 4:31:50 PM6/22/17
to bo...@lists.boost.org, Artyom Beilis
On Wed, Jun 21, 2017 at 12:14 AM, Norbert Wenzel via Boost
<bo...@lists.boost.org> wrote:
> This is my review of the proposed Boost.Nowide library.
>
> On 06/12/2017 06:20 AM, Frédéric Bron via Boost wrote:
>> - What is your evaluation of the design?
> I had no troubles finding my way. Most APIs are already well known as
> they resemble APIs from the standard library. I understand (after
> reading the discussions on this list) why Nowide does not convert all
> input to a common encoding on all platforms but only deals with wide
> strings on Windows. It was a bit of a surprise at first, so this design
> decision should be documented more prominently as others have already
> pointed out.

Yes, indeed, I'll have to provide a section about the library policy
in the tutorial, in one of the first sections.

>> - What is your evaluation of the implementation?
> I did not look at the implementation. It worked as expected/documented
> when I tested the library.
>
>> - What is your evaluation of the documentation?
> It is enough to start working with the library. I would like to see more
> clearly for every function when and how input strings are converted to a
> different encoding and what happens in case of a conversion error
> (exception, failbit of stream is set, invalid character, etc.).
>

Indeed - it should be improved.

>> - Do you think the library should be accepted as a Boost library?
> Yes.
>
> Norbert
>

Thank You very much for the review and the comments.

Artyom Beilis
