The formal review of Artyom Beilis' Nowide library starts today and
will last until Wed. 21st of June.
Your participation is encouraged, as the proposed library is
uncoupled, focused and rather small. Nowadays everyone needs to
handle Unicode, but this is very difficult to do in a
platform-independent way using only the standard library. Nowide
offers a very simple way to handle Unicode identically on
Windows/macOS/Linux.
Key features:
* Work with UTF-8 in your code; Nowide converts to the OS encoding
* Easy-to-use functions for converting UTF-8 to/from UTF-16
* A class that fixes the argc, argv and env main() parameters to use UTF-8
* UTF-8 aware functions:
- stdio.h functions (fopen, freopen, remove, rename)
- stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
- fstream (filebuf, fstream/ofstream/ifstream)
- iostream (cout, cerr, clog, cin)
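For readers unfamiliar with what such a conversion involves, here is a
minimal, hand-rolled sketch of a strict UTF-8 to UTF-16 decoder. It is
NOT Nowide's implementation (the library provides its own conversion
helpers); it only illustrates the validation and surrogate-pair work a
library like this hides, using the strict "reject invalid input" policy
discussed later in this thread:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Minimal strict UTF-8 -> UTF-16 conversion sketch (illustration only).
// Invalid input is rejected by throwing, mirroring a strict policy.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        int extra;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= in.size())
            throw std::runtime_error("truncated sequence");
        for (int k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        // Reject overlong forms, surrogates and out-of-range code points.
        static const std::uint32_t min_cp[4] = {0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else { // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```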
Documentation:
http://cppcms.com/files/nowide/html
GitHub:
https://github.com/artyom-beilis/nowide
git clone https://github.com/artyom-beilis/nowide.git
Latest tarballs:
- to be unzipped in boost source:
https://github.com/artyom-beilis/nowide/archive/master.zip
- as a standalone library: http://cppcms.com/files/nowide/nowide_standalone.zip
Nowide has the standard Boost layout, so just put it into the Boost
source tree under the libs/nowide directory to build and run the
tests.
Alternatively, for the header-only part of the library, add the
nowide/include path to the include path of your projects.
Please post your comments and review to the boost mailing list
(preferably), or privately to the Review Manager (to me ;-). Here are
some questions you might want to answer in your review:
- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any
problems?
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
- Are you knowledgeable about the problem domain?
And most importantly:
- Do you think the library should be accepted as a Boost library? Be
sure to say this explicitly so that your other comments don't obscure
your overall opinion.
For more information about Boost Formal Review Process, see:
http://www.boost.org/community/reviews.html
Thank you very much for your time and efforts.
Frédéric
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
> No, Modified UTF-8 is not supported; since it isn't UTF-8, it will be
> considered an invalid encoding.
>
On a related note, does it support WTF-8? I.e. encoding lone UTF-16
surrogates (malformed UTF-16 sequences) within the UTF-8 scheme. It is
needed to guarantee UTF-16 → UTF-8 → UTF-16 roundtrip of invalid UTF-16
data on Windows, and is not an invalid behavior per se, because all valid
UTF-16 sequences still map bijectively onto valid UTF-8 sequences.
--
Yakov Galka
http://stannum.co.il/
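For concreteness, here is a tiny self-contained sketch (not Nowide code)
of how WTF-8 encodes a lone surrogate with the ordinary three-byte UTF-8
bit pattern and decodes it back, which is what makes the round-trip
lossless. It deliberately ignores paired surrogates, which real WTF-8
must first combine into four-byte sequences:

```cpp
#include <string>

// WTF-8 sketch: encode a 16-bit code unit, including lone surrogates
// (0xD800-0xDFFF), with the generic UTF-8 bit patterns. Strict UTF-8
// decoders reject the surrogate-range bytes; WTF-8 accepts them.
std::string wtf8_encode_unit(char16_t u) {
    std::string out;
    if (u < 0x80) {
        out += static_cast<char>(u);
    } else if (u < 0x800) {
        out += static_cast<char>(0xC0 | (u >> 6));
        out += static_cast<char>(0x80 | (u & 0x3F));
    } else { // includes lone surrogates: the WTF-8 extension
        out += static_cast<char>(0xE0 | (u >> 12));
        out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (u & 0x3F));
    }
    return out;
}

// Decode a single three-byte sequence back to its 16-bit code unit.
char16_t wtf8_decode_3byte(const std::string& s) {
    return static_cast<char16_t>(((s[0] & 0x0F) << 12) |
                                 ((s[1] & 0x3F) << 6)  |
                                  (s[2] & 0x3F));
}
```

For example, the lone high surrogate U+D800 round-trips through the
bytes ED A0 80, which strict UTF-8 would reject.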
> I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I
> think it could make sense to support it anyway though. Round tripping
> (strictly invalid, but possible) file names on Windows, easier
> interoperability with stuff like JNI, ...
>
Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes the
null character with an overlong sequence, and thus is incompatible with
standard UTF-8, unlike WTF-8 which is a compatible extension.
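To make the distinction concrete: in Modified UTF-8 (as used by JNI),
U+0000 is written as the overlong two-byte sequence 0xC0 0x80 so that
encoded strings contain no embedded zero bytes, and strict UTF-8 forbids
exactly such overlong forms. A small illustrative check (hypothetical
helper, not from any library):

```cpp
// Returns true when a two-byte sequence is overlong, i.e. encodes a
// code point that would fit in fewer bytes. Modified UTF-8's 0xC0 0x80
// for U+0000 is exactly such a sequence, so strict decoders reject it.
bool is_overlong_2byte(unsigned char b0, unsigned char b1) {
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return false; // not a two-byte sequence at all
    unsigned cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    return cp < 0x80; // two-byte forms must encode U+0080..U+07FF
}
```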
> OTOH it would add overhead for systems with native UTF-8 APIs, because
> Nowide would at least have to check every string for "modified UTF-8
> encoded" surrogate pairs and convert the string if necessary. Which of
> course is a good argument for not supporting modified UTF-8, because then
> Nowide could just pass the strings through unmodified on those systems.
>
Implementing WTF-8 removes a check in UTF-8 → UTF-16 conversion, and
doesn't change anything in the reverse direction when there is a valid
UTF-16. I suspect it isn't slower.
No, it does not.
I considered it before, but I think there is a security risk in creating
or accepting malformed UTF-8 or UTF-16.
Converting invalid UTF-16 to WTF-8 and back is not obvious behavior
and carries a potential security risk, especially for users who
are not aware of the issue. So invalid UTF-8/16 sequences are
rejected by design.
Artyom
> Artyom Beilis wrote:
>
> Converting invalid UTF-16 to WTF-8 and back is not obvious behavior
>> and carries a potential security risk, especially for users who are not
>> aware of the issue. So invalid UTF-8/16 sequences are rejected by design.
>>
>
> Implying that there are Windows file names that can't be handled by the
> library?
>
Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
as UB?" How important is this point? I'm oblivious to this problem, and it
sounds like I would like to keep it that way.
degski
--
"*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend,
Schmerzen aus Schwäche stillend.*" - Novalis 1798
> degski wrote:
>
> Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
>> as UB?"
>>
>
> Of course not. Why would one need to use the library then? It defeats the
> whole purpose of it.
From WP (I read up on it just now): "RFC 3629 states "Implementations of
the decoding algorithm MUST protect against decoding invalid sequences."
*The Unicode Standard* requires decoders to "...treat any ill-formed code
unit sequence as an error condition. This guarantees that it will neither
interpret nor emit an ill-formed code unit sequence.""
So not UB then, but it should not pass either.
Are we talking FAT32 or NTFS? What Windows versions are affected? I also
think, as some posters below (and in another thread) state, that Windows
should not be treated differently. A new Boost library should not
accommodate Windows' bad/sloppy historic quirks. The library *can* require
that its use depends on the system and its users adhering to the standard.
Then WP on Overlong encodings: "The standard specifies that the correct
encoding of a code point use only the minimum number of bytes required to
hold the significant bits of the code point. Longer encodings are called
*overlong* and are not valid UTF-8 representations of the code point. This
rule maintains a one-to-one correspondence between code points and their
valid encodings, so that there is a unique valid encoding for each code
point."
The key being: "... are not valid UTF-8 representations ...", i.e. we're
back to the case above.
degski
WP: https://en.wikipedia.org/wiki/UTF-8
Actually, I think you provided me with a good direction I hadn't considered before.
RtlUTF8ToUnicodeN and its counterpart do something very simple:
they substitute invalid code points/encodings with U+FFFD, the
REPLACEMENT CHARACTER, which is the standard Unicode way to say
"I failed to convert something".
It is similar to the current ANSI/Wide conversions producing "?" instead.
It looks like a better way to do it than failing to convert the
entire string altogether.
If you pass an invalid string, the conversion will succeed, but you'll
get special characters (usually shown as � in a UI)
that tell you something was wrong.
This way, for example, getenv on a valid key will not return NULL and
create ambiguity about what happened, and it is actually
the behavior that is more common on Windows.
I like it, and I think I'll change the behavior of the conversion
functions in Boost.Nowide to this one.
Thanks!
Artyom Beilis
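The replacement policy described above is easy to picture in code. The
following is a deliberately simplified, ASCII-only illustration (not
RtlUTF8ToUnicodeN and not Nowide's code): invalid bytes become U+FFFD
instead of failing the whole conversion:

```cpp
#include <string>

// Lenient sketch of the U+FFFD policy. For brevity only ASCII bytes are
// treated as valid here; every other byte is replaced with U+FFFD
// instead of aborting the conversion, so the output length and content
// still tell the caller that something went wrong.
std::u16string decode_with_replacement(const std::string& in) {
    std::u16string out;
    for (unsigned char b : in) {
        if (b < 0x80)
            out.push_back(static_cast<char16_t>(b)); // valid ASCII
        else
            out.push_back(char16_t(0xFFFD)); // REPLACEMENT CHARACTER
    }
    return out;
}
```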
Some additional information about how the NT kernel treats the FFFD
character might be useful to you.
Quite a lot of kernel APIs treat UNICODE_STRING as just a bunch of bytes
e.g. the filesystem. You can supply any path with any characters at all,
including one containing zero bytes. The only character you can't use is
the backslash as that is the path separator. This matches how most Unix
filesystems work, and indeed at the NT kernel level path comparisons are
case sensitive as well as byte sensitive (unless you ask otherwise)
because it's just a memcmp(). You can have real fun here creating lots
of paths completely unparseable by Win32, as in, files totally
inaccessible to any Win32 API with some really random Win32 error codes
being returned.
Other NT kernel APIs will refuse strings containing FFFD or illegal
UTF-16 characters, and if they do it's generally because accepting them
would be an obvious security risk. But those are definitely a minority.
If the Win32 layer doesn't get in the way, doing as RtlUTF8ToUnicodeN()
does should be safe in so far as the kernel team feel it is. They have
placed appropriate checks on appropriate kernel APIs. But in the end,
it's down to the programmer to correctly validate and check all
untrusted input. Doing so always is an unnecessary expense for most end
users who have trusted input.
Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
This is not fully true, as you can get a path via getenv.
Frédéric
So does this allow round-trip conversion UTF-16 -> UTF-8 -> UTF-16?
Frédéric
#include "nowide/fstream.hpp"
int main() {
    auto f = nowide::fstream("toto.log");
    return 0;
}
Using nowide::fstream f("toto.log"); instead of auto f = ...; works fine.
The error is:
toto.cpp:4:38: error: use of deleted function «
nowide::basic_fstream<char>::basic_fstream(const
nowide::basic_fstream<char>&) »
auto f = nowide::fstream("toto.log");
^
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: note: «
nowide::basic_fstream<char>::basic_fstream(const
nowide::basic_fstream<char>&) » is implicitly deleted because the
default definition would be ill-formed:
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
nowide/fstream.hpp:190:11: error: use of deleted function «
std::basic_iostream<_CharT, _Traits>::basic_iostream(const
std::basic_iostream<_CharT, _Traits>&) [with _CharT = char; _Traits =
std::char_traits<char>] »
In file included from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/iterator:65:0,
from ./nowide/encoding_utf.hpp:14,
from ./nowide/convert.hpp:12,
from nowide/fstream.hpp:13,
from toto.cpp:1:
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/istream:863:7:
note: declared here
basic_iostream(const basic_iostream&) = delete;
^~~~~~~~~~~~~~
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: error: use of deleted function «
std::basic_ios<_CharT, _Traits>::basic_ios(const
std::basic_ios<_CharT, _Traits>&) [with _CharT = char; _Traits =
std::char_traits<char>] »
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
In file included from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/ios:44:0,
from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/ostream:38,
from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/iterator:64,
from ./nowide/encoding_utf.hpp:14,
from ./nowide/convert.hpp:12,
from nowide/fstream.hpp:13,
from toto.cpp:1:
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/bits/basic_ios.h:475:7:
note: declared here
basic_ios(const basic_ios&) = delete;
^~~~~~~~~
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: error: «
nowide::scoped_ptr<T>::scoped_ptr(const nowide::scoped_ptr<T>&) [with
T = nowide::basic_filebuf<char>] » is private within this context
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
In file included from nowide/fstream.hpp:14:0,
from toto.cpp:1:
./nowide/scoped_ptr.hpp:31:5: note: declared private here
scoped_ptr(scoped_ptr const &);
OK, I see.
Nowide is missing some of the C++11 interfaces, for example the move
constructor, but it shouldn't be hard to implement them.
Also, I wasn't aware that auto f = nowide::fstream("toto.log"); can
actually call the move constructor.
Artyom
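The compile error above can be reproduced without Nowide. Before C++17's
guaranteed copy elision, auto f = T(...) requires an accessible copy or
move constructor, while direct initialization T f(...) does not. A
minimal stand-alone illustration with a hypothetical stream-like type:

```cpp
#include <string>
#include <utility>

// A stream-like type with copying disabled, as in iostreams. Without
// the explicitly defaulted move constructor, `auto f = Stream("x");`
// would fail to compile in C++11/14, much like the nowide::fstream
// error quoted above.
struct Stream {
    explicit Stream(std::string name) : name_(std::move(name)) {}
    Stream(const Stream&) = delete;            // streams are not copyable
    Stream& operator=(const Stream&) = delete;
    Stream(Stream&&) = default;                // ...but can be movable
    std::string name_;
};

// Both forms now compile:
//   Stream a("toto.log");        // direct-initialization
//   auto b = Stream("toto.log"); // needs the move constructor (pre-C++17)
```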
I am also surprised; I thought it would be equivalent.
And I use the same gcc version (6.3.0) and it causes no issue on
Linux, only with the mingw cross-compiler...
Frédéric
Artyom
On June 15, 2017 at 9:27 PM, "Frédéric Bron" <freder...@m4x.org>
wrote:
Of course!
I do not quite understand the rationale behind not converting to UTF-8
on POSIX platforms. I naively thought I got UTF-8 in argv because my
system is configured in UTF-8, but I discovered that this is not
necessarily always the case. In the example you highlight, I do not see
the difference from the Windows case. You could convert to UTF-8 in
argv and back to the local encoding in nowide::remove. I understand it
is not efficient if you do not really use the content of the filename,
but if you have to write, say, an XML report in UTF-8, you would have
to convert anyway.
Today, what is the portable way to convert argv to UTF-8? i.e. without
having to #ifdef _WIN32...?
Hello Frederic,
There are several reasons for this.
One of them is actually the original purpose of the library: to use the
same type of strings internally without creating broken software on
Windows. Since only Windows uses a native wide API instead of the narrow
API that is native for C++, only the Windows case requires encoding
conversion.
However, there is another, deeper issue. Unlike on Windows, where the
native wide API has a well-defined UTF-16 encoding, this isn't the case
for Unix-like OSes. The encoding is defined by the current locale, which
can be set globally, per user, per process, and can even change
trivially within the same process at runtime.
There are also several sources of locale/encoding information:
- Environment variables (LANG/LC_CTYPE): UTF-8 on the vast majority of
modern Unix-like platforms, but frequently undefined or defined as the
"C" locale without encoding information. This is what the OS defines
for the process.
- The C locale (setlocale API): the "C" locale by default per the
standard, unless explicitly set otherwise.
- The C++ locale (std::locale::global() API): the "C" locale by default
per the standard, unless explicitly set otherwise.
They can all be changed at runtime, they aren't synchronized, and they
can be set to whatever encoding the user wants.
Additionally, setting std::locale::global to a non-"C" locale can lead
to some really nasty things, like failing to create valid CSV files
because "," is added to numbers.
So the safest and most correct way to handle it is to pass narrow
strings as-is, without any conversion.
Regards,
Artyom Beilis
I understand that this complexity can be a nightmare.
How would you therefore do this simple task:
1. get a directory from nowide::getenv -> base_dir (UTF-8 on Windows,
unknown narrow encoding on POSIX)
2. create a file in base_dir whose name is file_name, encoded in UTF-8
(because it is created by the application).
If I understand well, I should NOT do this:
auto f = nowide::ofstream((boost::filesystem::path(base_dir) /
file_name).string());
because this is guaranteed to work only on Windows, where I have the
guarantee that base_dir is UTF-8, right?
On POSIX, there is a risk that base_dir is, for example, ISO-8859-1
while file_name is UTF-8, so that the combination made by filesystem is
wrong.
Am I right?
Frédéric
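The risk is visible at the byte level: the same character is stored with
different byte sequences in ISO-8859-1 and UTF-8, so concatenating a
Latin-1 directory with a UTF-8 file name yields a path that is valid in
neither encoding as a whole. Illustrative helpers (hypothetical names,
no Nowide involved):

```cpp
#include <string>

// "é" (U+00E9) as raw bytes in each encoding. These helpers exist only
// to show that the byte sequences differ, which is why mixing encodings
// within one path string is a problem.
inline std::string e_acute_latin1() { return "\xE9"; }     // one byte
inline std::string e_acute_utf8()   { return "\xC3\xA9"; } // two bytes

// e_acute_latin1() + "/" + e_acute_utf8() is a 4-byte path that is not
// wholly valid in either ISO-8859-1-as-UTF-8 or UTF-8 interpretation.
```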
but it seems to me that in this case, what I need is a UTF-8 -> ISO-8859-1
conversion of the file name before concatenation with the directory.
Otherwise, OK, I will get a file because the system just asks for a narrow
string, but its name will be wrong in the OS user interface.
Frédéric
You actually **assume** that the encoding you received from the system
(e.g. via getenv) matches the current locale encoding.
But it is not necessarily the same:
1. The file/directory was created by a user running in a different locale.
2. The locale isn't defined properly or was modified.
3. You got these files/directories from some other location (e.g.
unzipped from an archive).
In reality the OS does not care about encoding (most of the time).
Unlike Windows, where wchar_t also defines the encoding (UTF-16), under
POSIX platforms a "char *" can contain any encoding, and it can change.
Also, UTF-8 is the most common encoding on all modern Unix-like
systems: Linux, BSD, Mac OS X.
So I don't think it is necessary to perform any conversion between UTF-8
and whatever "char *" encoding you get, because:
(a) You can't reliably know what kind of encoding is in use.
(b) The same "char *" may contain parts in different encodings and
still be a valid path.
Artyom
How do I build the documentation for the Nowide library?
>
> Please post your comments and review to the boost mailing list
> (preferably), or privately to the Review Manager (to me ;-). Here are
> some questions you might want to answer in your review:
>
> - What is your evaluation of the design?
> - What is your evaluation of the implementation?
> - What is your evaluation of the documentation?
> - What is your evaluation of the potential usefulness of the library?
> - Did you try to use the library? With what compiler? Did you have any
> problems?
> - How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?
> - Are you knowledgeable about the problem domain?
>
> And most importantly:
> - Do you think the library should be accepted as a Boost library? Be
> sure to say this explicitly so that your other comments don't obscure
> your overall opinion.
>
> For more information about Boost Formal Review Process, see:
> http://www.boost.org/community/reviews.html
>
> Thank you very much for your time and efforts.
>
> Frédéric
> Hi Everyone,
>
> The formal review of Artyom Beilis' Nowide library starts today and
> will last until Wed. 21st of June.
> [snip]
> Please post your comments and review to the boost mailing list
> (preferably), or privately to the Review Manager (to me ;-). Here are
> some questions you might want to answer in your review:
>
> - What is your evaluation of the design?
>
1) I'd really much rather have an iterator-based interface for the
narrow/wide conversions. There's an existing set of iterators in
Boost.Regex already, and I've recently written one here:
https://github.com/tzlaine/text/blob/master/boost/text/utf8.hpp
The reliance on a new custom string type is a substantial mistake, IMO
(boost::nowide::basic_stackstring). Providing an iterator interface
(possibly cribbing one of the two implementations above) would negate
the need for this new string type -- I could use the existing std::string,
MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer
that the new interfaces be defined in terms of string_view instead of
string/basic_stackstring (there's also a string_view implementation already
in Boost.Utility). string_view is simply far more usable, since it binds
effortlessly to either a char const * or a string.
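To make the suggestion concrete, here is a hedged sketch of what such an
iterator-based interface could look like. This is not Nowide's actual
API; the name and signature are invented for illustration:

```cpp
#include <iterator>
#include <string>

// Hypothetical output-iterator style conversion: reads UTF-16 code
// units from [first, last) and writes UTF-8 bytes through `out`.
// Because the destination is an iterator, callers can target a
// std::string, QString, char buffer, etc., without a dedicated string
// type. (No validation of lone surrogates here; a real implementation
// would reject or replace them.)
template <class InIt, class OutIt>
OutIt utf16_to_utf8(InIt first, InIt last, OutIt out) {
    while (first != last) {
        char32_t cp = *first++;
        if (cp >= 0xD800 && cp <= 0xDBFF && first != last) {
            char32_t lo = *first;
            if (lo >= 0xDC00 && lo <= 0xDFFF) { // combine a surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++first;
            }
        }
        if (cp < 0x80) {
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Usage: std::string s;
//        utf16_to_utf8(w.begin(), w.end(), std::back_inserter(s));
```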
2) I don't really understand what happens when a user passes a valid
Windows filename that is *invalid* UTF-16 to a program using Nowide. Is
the invalid UTF-16 filename just broken in the process of trying to convert
it to UTF-8? This is partially a documentation problem, but until I
understand how this is intended to work, I'm also counting it as a design
issue.
> - What is your evaluation of the implementation?
I did not look.
> - What is your evaluation of the documentation?
I think the documentation needs a bit of work. The non-reference portion
is quite thin, and drilling down into the reference did not answer at least
one question I had (the one above, about invalid UTF-16):
Looking at some example code in the "Using the Library" section, I saw this:
"
To make this program handle Unicode properly, we do the following changes:
#include <boost/nowide/args.hpp>
#include <boost/nowide/fstream.hpp>
#include <boost/nowide/iostream.hpp>
int main(int argc,char **argv)
{
boost::nowide::args a(argc,argv); // Fix arguments - make them UTF-8
"
Ok, so I clicked "boost::nowide::args", hoping for an answer. The detailed
description for args says:
"
args is a class that fixes standard main() function arguments and changes
them to UTF-8 under Microsoft Windows.
The class uses GetCommandLineW(), CommandLineToArgvW() and
GetEnvironmentStringsW() in order to obtain the information. It does not
relates to actual values of argc,argv and env under Windows.
It restores the original values in its destructor
"
It tells me nothing about what happens when invalid UTF-16 is encountered.
Is there an exception? Is 0xfffd inserted? If the latter, am I just
stuck? I should not have to read any source code to figure this out, but
it looks like I have to.
This criticism can be applied to most of the documentation. My preference
is that the semantics of primary functionality of the library should be
explained in tutorials or other non-reference formats. The current state
of the docs doesn't even explain things in the references. This must be
fixed before this library can be accepted.
> - What is your evaluation of the potential usefulness of the library?
I think this library is attempting to address a real and important issue.
I just can't figure out if it's a complete solution, because how invalid
UTF-16 is treated remains a question.
> - Did you try to use the library? With what compiler? Did you have any
> problems?
>
I did not.
- How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?
>
A quick reading, plus a bit of discussion on the list.
> - Are you knowledgeable about the problem domain?
>
I understand the UTF-8 issues reasonably well, but am ignorant of the
Windows-specific issues.
> And most importantly:
> - Do you think the library should be accepted as a Boost library? Be
> sure to say this explicitly so that your other comments don't obscure
> your overall opinion.
>
I do not think the library should be accepted in its current form. It
seems not to handle malformed UTF-16, which is a requirement for processing
Windows file names (as I understand it -- please correct this if I'm
wrong). Independent of this, I don't find the docs to be sufficient.
Zach
Why should we try to handle wrong UTF-16 (or wrong UTF-8)?
1. such files should not exist
2. if they exist, why?
- if it is because it is an old file, can the user just rename it properly?
- if it is because it was produced by a program, why should this
program continue to work without being fixed? Isn't that the best way to
ensure we get wrong filenames forever?
I do not understand why we cannot just issue an error.
Thanks for explanations,
Frédéric
--
Frédéric Bron
-----------------------------------------------------------
Frédéric Bron (freder...@m4x.org)
Villa des 4 chemins, Centre Hospitalier, BP 208
38506 Voiron Cedex
landline: +33 4 76 67 17 27, mobile: +33 6 67 02 77 35
> Question to all:
>
> Why should we try to handle wrong UTF-16 (or wrong UTF-8)?
> 1. such files should not exist
> 2. if they exist, why?
> - if it is because it is an old file, can the user just rename it
> properly?
> - if it is because it was produced by a program, why should this
> program continue to work without fixing? Isn't it the best way that we
> get wrong filenames forever?
This came up at C++Now last month. My understanding, from talking to
people who seem to know about this, is that such filenames are considered
valid by Windows. To issue an error in such a case means not allowing
users to access files that Windows sees as well-formed.
But who/what needs ill-formed filenames? For what reason?
Frédéric
The question whether to support such filenames is more a matter of
principle. Namely, whether it is the job of the Nowide library to tell you
what filenames you may use or not. One might argue that it should be of no
concern to it whether, or for what reason, I need such filenames.
The same applies to Boost.Locale: Boost.Nowide uses the UTF conversions
from the header-only part of Boost.Locale.
>
> The reliance on a new custom string type is a substantial mistake, IMO
> (boost::nowide::basic_stackstring).
> [snip]
> (possibly cribbing the one of the two implementations above) would negate
> the need for this new string type -- I could use the existing std::string,
> MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer
> that the new interfaces be defined in terms of string_view instead of
> string/basic_stackstring (there's also a string_view implementation already
> Boost.Utility). string_view is simply far more usable, since it binds
> effortlessly to either a char const * or a string.
>
stackstring is merely an on-stack buffer optimization; it isn't a general
string and was never intended to be one.
It isn't in the detail namespace because it can actually be useful
outside the library's scope.
> Providing an iterator interface
There is no iterator interface in the communication with the OS,
so there is no sense in adding one there. If you want a
character-by-character interface, it exists in Boost.Locale.
> 2) I don't really understand what happens when a user passes a valid
> Windows filename that is *invalid* UTF-16 to a program using Nowide. Is
> the invalid UTF-16 filename just broken in the process of trying to convert
> it to UTF-8? This is partially a documentation problem, but until I
> understand how this is intended to work, I'm also counting it as a design
> issue.
>
You are right, it wasn't clear in the docs. My policy was to return an
error in case of invalid encoding, but after the comments I received I
would rather replace invalid sequences with the U+FFFD replacement
character, since that is what the WinAPI usually does.
https://github.com/artyom-beilis/nowide/issues/16
It is something easy to fix.
>> - What is your evaluation of the documentation?
>
> I think the documentation needs a bit of work. The non-reference portion
> is quite thin, and drilling down into the reference did not answer at least
> one question I had (the one above, about invalid UTF-16):
>
Actually, the documentation points out that an error is returned in case
of invalid character encoding, and how (in most places, but there are
some misses).
However, I agree that it can be improved,
and a special section regarding the bad-encoding conversion policy
should and will be added.
>
> It tells me nothing about what happens when invalid UTF-16 is encountered.
> Is there an exception? Is 0xfffd inserted? If the latter, am I just
> stuck? I should not have to read any source code to figure this out, but
> it looks like I have to.
Valid point.
>
>> And most importantly:
>> - Do you think the library should be accepted as a Boost library? Be
>> sure to say this explicitly so that your other comments don't obscure
>> your overall opinion.
>>
>
> I do not think the library should be accepted in its current form. It
> seems not to handle malformed UTF-16, which is a requirement for processing
> Windows file names (as I understand it -- please correct this if I'm
> wrong).
I strongly disagree with this, just as converting an invalid ANSI
encoding shouldn't lead to invalid UTF-16 or the other way around.
Since there is no way to represent invalid UTF-16 using valid UTF-8,
just as there is no way to do it in ANSI<->UTF-16, there is
no unambiguous way to provide such support.
>
> Zach
>
Thank you for your review.
Artyom
> The question whether to support such filenames is more a matter of
> principle. Namely, whether it is the job of the Nowide library to tell you
> what filenames you may use or not. One might argue that it should be of no
> concern to it whether, or for what reason, I need such filenames.
>
I think the ruling principle should be the Unicode standard. Nowide should
support the Unicode standard and no more (at least in its intentions). One
of the intentions of the standard, as I read it, is to guarantee that a
conversion can *unambiguously (and safely) round-trip*. On Windows, if
WTF-8 (what's in a name? Apart from the obvious, even Wobbly sounds
bizarre), CESU-8 or Modified UTF-8 are ways to achieve that, I think
they should be supported. If not, an exception should be thrown when
encountering these invalid encodings, as this is in my view an I/O issue
(the uncertain environment in which an application has to live), in which
context throwing seems to be the norm.
I'm with Frédéric Bron on this one, though. I don't understand why invalid
encodings are found in the wild in the first place and why they should
continue to exist in the wild. The whole thing sounds like a vector for
exploits: malicious code generates invalid encodings, after which
buffer overruns open up the system.
Something I cannot understand is that some of those on this list who are
most critical of Windows in view of security concerns are also the same
people who happily perpetuate these weaknesses. Microsoft is very much in
front here, by dropping support for OSes (and therefore their associated
compilers and CRTs); Boost should do the same and adopt rigour: "Ceterum
autem censeo Carthaginem esse delendam"
What's not terribly clear either is whether we are talking about NTFS or
Explorer (and then there is WinFS lurking in the shadows, bound to be
released someday). Windows does (many) other things that are weird, like
the way it deals with capitalized names, long paths, etc. NTFS itself
actually does not have these issues and does quite a bit more than what
the Explorer shell can do.
degski
On 06/12/2017 06:20 AM, Frédéric Bron via Boost wrote:
> - What is your evaluation of the design?
I had no troubles finding my way. Most APIs are already well known as
they resemble APIs from the standard library. I understand (after
reading the discussions on this list) why Nowide does not convert all
input to a common encoding on all platforms but only deals with wide
strings on Windows. It was a bit of a surprise at first, so this design
decision should be documented more prominently as others have already
pointed out.
In general I like the idea that only valid Unicode is accepted as input,
although I had an issue with invalid encoding written to nowide::cout
that set the stream's failbit and ignored all subsequent output. The
author was already aware of that issue and said invalid characters will
be replaced with U+FFFD (the replacement character) after the review. I
prefer this solution to stopping all output after any failure.
> - What is your evaluation of the implementation?
I did not look at the implementation. It worked as expected/documented
when I tested the library.
> - What is your evaluation of the documentation?
It is enough to start working with the library. I would like to see more
clearly for every function when and how input strings are converted to a
different encoding and what happens in case of a conversion error
(exception, failbit of stream is set, invalid character, etc.).
> - What is your evaluation of the potential usefulness of the library?
The library is very useful and I intend to use it as a portable
fire-and-forget solution to typical Unicode problems on Windows. I think
it will be a very helpful library especially when porting software from
Posix to Windows, since the library enables dealing with Unicode without
dealing with (and first learning about) the Windows Unicode idiosyncrasies.
As it came up in the discussion, I do not think I ever encountered
non-normalized Unicode file paths on Windows so I do not think the lack
of support for such filepaths is detrimental to the usefulness of this
library.
> - Did you try to use the library? With what compiler? Did you have any
> problems?
I tested with MSVC14 and had no technical problems building or using the
library.
> - How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?
I read the discussions on the list, the documentation and wrote a small
toy program to test my expected usage of Nowide. All in all I invested
about three hours.
> - Are you knowledgeable about the problem domain?
Somewhat. I had to work around typical Unicode issues on Windows
multiple times, but always stopped when the issue was solved. So my
knowledge is mostly driven by the specific issues I had so far and not
profound.
> - Do you think the library should be accepted as a Boost library?
Yes.
Norbert
Yes, indeed, I'll have to provide a section regarding the library policy
in the tutorial, in one of its first sections.
>> - What is your evaluation of the implementation?
> I did not look at the implementation. It worked as expected/documented
> when I tested the library.
>
>> - What is your evaluation of the documentation?
> It is enough to start working with the library. I would like to see more
> clearly for every function when and how input strings are converted to a
> different encoding and what happens in case of a conversion error
> (exception, failbit of stream is set, invalid character, etc.).
>
Indeed - it should be improved.
>> - Do you think the library should be accepted as a Boost library?
> Yes.
>
> Norbert
>
Thank you very much for the review and the comments.
Artyom Beilis