path::value_type is
wchar_t, according to the docs.
—John
On Windows you should convert it to utf16.
_______________________________________________
Boost-users mailing list
Boost...@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Word of warning: the boost utf8 codecvt will cause undefined
behavior if you have any code points above U+FFFF. You'll have to hack
do_in and do_out in order to emit/parse surrogate pairs. Also, hack
do_length to increment the counter by 2 for cp > 0xFFFF.
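To illustrate the surrogate-pair handling described above, here is a minimal sketch of the arithmetic involved. The names utf16_units and append_utf16 are hypothetical helpers invented for this example; they are not part of Boost or of any codecvt facet, and a real do_in/do_out/do_length would also need to track buffer limits and conversion state.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-16 code units one code point needs.
// 1 inside the BMP, 2 (a surrogate pair) above U+FFFF.
inline std::size_t utf16_units(char32_t cp) {
    return cp > 0xFFFF ? 2 : 1;
}

// Hypothetical helper: append one code point to a UTF-16 string,
// emitting a surrogate pair when the code point is above U+FFFF.
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);             // Unicode upper limit (RFC 3629)
    assert(cp < 0xD800 || cp > 0xDFFF); // lone surrogates are not valid input
    if (cp <= 0xFFFF) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;                  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
}
```

For example, U+1F600 becomes the two code units 0xD83D 0xDE00, which is why the length counter has to advance by 2 for such code points.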
For my rewrite of UTF-8 to UTF-16/32, look at
https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp.
While it can still decode above U+10FFFF, it's still more RFC 3629
compliant than utf8_codecvt_facet. It also supports true UTF-16.
I know that is how it stores it internally.
My question is "how". Given that I have data that are file names and encoded in UTF-8,
how do I make the Boost path class accept them, and operate conveniently enough to be
worth using instead of plain strings?
On Fri, Nov 4, 2011 at 22:54, Andrey Moshbear <andre...@gmail.com> wrote:
> For my rewrite of UTF-8 to UTF-16/32, look at
> https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp.
So this is a codecvt that I should use as the extra argument, that works better than the
undocumented one that came with Boost?
And, the implicit answer is that this is indeed how I do it?
But:
1) When I write something like
path p2= p1 / "Foo" / s1 / name;
there is no place to pass the extra codecvt argument. I thought it might take strings and
keep the existing encoding, but it actually uses the default code page. How can I use
path in a simple and convenient manner given that in this program all the strings I will
use with it are already in UTF-8?
2) How can I write a line like:
path p2 (somestring, codecvt());
in a portable manner? On the Mac the internal representation is char, so will it object
to having the codecvt passed? Once I set things up, I want the bulk of the source code to
be the same on all platforms, so writing the argument on Windows and leaving it out on Mac
is not acceptable.
Thanks,
--John
I don't think the "path" object can do such a conversion automatically, so
you should convert it yourself using the CRT, WinAPI, ATL macros, or any
other facility.
And the boost utf8<->utf32 one is indeed documented:
http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/codecvt.html.
It's just not going to work correctly with code points above U+FFFF if
you decide to use a 16-bit wchar_t as the char type.
The code itself isn't that self-documenting, though, which makes
hacking in the U+10FFFF limit and surrogate pair parsing more
work than simply rewriting the codecvt.
>
> And, the implicit answer is that this is indeed how I do it?
>
> But:
>
> 1) When I write something like
> path p2= p1 / "Foo" / s1 / name;
> there is no place to pass the extra codecvt argument. I thought it might
> take strings and keep the existing encoding, but it actually uses the
> default code page. How can I use path in a simple and convenient manner
> given that in this program all the strings I will use with it are already in
> UTF-8?
>
Make a std::wstringstream.
Imbue it with locale(locale::classic(), new Utf8_cvt).
Use operator<< to build up a path.
Call .str() to get the string.
Pass that to the path constructor.
> 2) How can I write a line like:
> path p2 (somestring, codecvt());
> in a portable manner? On the Mac the internal representation is char, so
> will it object to having the codecvt passed? Once I set things up, I want
> the bulk of the source code to be the same on all platforms, so writing the
> argument on Windows and leaving it out on Mac is not acceptable.
>
Since the Mac assumes char, wide UTF isn't going to work there: the
libraries look for a char 0 as the terminator, not a wchar_t 0.
The best solution is to #ifdef _WIN32 the utf-8 to utf-16 code.
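A minimal sketch of what such an #ifdef could look like, assuming all input strings are UTF-8. The names native_string and to_native are hypothetical and chosen only for this example; the Windows branch uses the Win32 MultiByteToWideChar call with CP_UTF8, and the other branch assumes the platform's narrow paths are already UTF-8 and passes the bytes through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef _WIN32
#  include <windows.h>
#  include <stdexcept>
#endif

#ifdef _WIN32
// On Windows, path::value_type is wchar_t, so produce UTF-16.
typedef std::wstring native_string;

inline native_string to_native(const std::string& utf8) {
    if (utf8.empty()) return native_string();
    const int len = static_cast<int>(utf8.size());
    // First call: ask how many wide characters are needed.
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), len, NULL, 0);
    if (n <= 0) throw std::runtime_error("invalid UTF-8 input");
    native_string out(static_cast<std::size_t>(n), L'\0');
    // Second call: perform the conversion into the sized buffer.
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), len, &out[0], n);
    assert(written == n);
    (void)written;
    return out;
}
#else
// Assumption: elsewhere path::value_type is char and the bytes are already UTF-8.
typedef std::string native_string;

inline native_string to_native(const std::string& utf8) { return utf8; }
#endif
```

With a helper like this, a line such as path p2(to_native(somestring)); could be written identically on every platform, since the path constructor then always receives its own value_type and no conversion through the imbued locale should be needed.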
Uh, I don't think you understood the point of the question at all, nor know about the class.
"If the value type of [begin,end) or source arguments for member functions is not
value_type, and no cvt argument is supplied, conversion to value_type occurs using an
imbued locale."
"For Windows-like implementations, including Cygwin and MinGW, path::value_type is
wchar_t. The default imbued locale provides a codecvt facet that invokes Windows
MultiByteToWideChar or WideCharToMultiByte API's with a codepage of CP_THREAD_ACP if
Windows AreFileApisANSI() is true, otherwise codepage CP_OEMCP."
It DOES CONVERT, and that is the starting point of my issue. See? It does convert, and
not in the way I want (if I indeed wanted it to).