[Boost-users] boost::filesystem::path in UTF-8 on Windows

John M. Dlugosz

Nov 3, 2011, 7:24:04 PM
to boost...@lists.boost.org
If I have a string that is in UTF-8, how do I tell the path constructor?

   path p1 ("my utf8 data", SOME_CODECVT);

I think it is a matter of passing the right SOME_CODECVT.  What is it?
The path::value_type is wchar_t, according to the docs.

—John

Igor R

Nov 4, 2011, 11:28:39 AM
to boost...@lists.boost.org

On Windows you should convert it to utf16.
_______________________________________________
Boost-users mailing list
Boost...@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users

Andrey Moshbear

Nov 4, 2011, 10:54:03 PM
to boost...@lists.boost.org
On Fri, Nov 4, 2011 at 11:28, Igor R <boost...@gmail.com> wrote:
>> If I have a string that is in UTF-8, how do I tell the path constructor?
>>
>>    path p1 ("my utf8 data", SOME_CODECVT);
>>
>> I think it is a matter of passing the right SOME_CODECVT.  What is it?
>> The path::value_type is wchar_t, according to the docs.
>
> On Windows you should convert it to utf16.

Word of warning: the Boost utf8 codecvt will cause undefined
behavior if you have any code points above U+FFFF. You'll have to hack
do_in and do_out in order to emit/parse surrogate pairs. Also, hack
do_length to increment the counter by 2 for cp > 0xFFFF.

Andrey Moshbear

Nov 5, 2011, 7:43:54 AM
to boost...@lists.boost.org
On Fri, Nov 4, 2011 at 22:54, Andrey Moshbear <andre...@gmail.com> wrote:
> On Fri, Nov 4, 2011 at 11:28, Igor R <boost...@gmail.com> wrote:
>>> If I have a string that is in UTF-8, how do I tell the path constructor?
>>>
>>>    path p1 ("my utf8 data", SOME_CODECVT);
>>>
>>> I think it is a matter of passing the right SOME_CODECVT.  What is it?
>>> The path::value_type is wchar_t, according to the docs.
>>
>> On Windows you should convert it to utf16.
>
> Word of warning: the Boost utf8 codecvt will cause undefined
> behavior if you have any code points above U+FFFF. You'll have to hack
> do_in and do_out in order to emit/parse surrogate pairs. Also, hack
> do_length to increment the counter by 2 for cp > 0xFFFF.
>

For my rewrite of UTF-8 to UTF-16/32, look at
https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp.

While it can still decode above U+10FFFF, it's still more RFC 3629
compliant than utf8_codecvt_facet. It also supports true UTF-16.
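
As a rough illustration of what RFC 3629 compliance involves on the decoding
side, here is a minimal sketch of a strict single-sequence decoder. It is not
taken from utf8_cvt.cpp or from Boost; decode_utf8 is a hypothetical name, and
unlike the rewrite described above it rejects values beyond U+10FFFF outright.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at s[i].
// On success, stores the code point in cp, advances i past the sequence and
// returns true. On malformed input, returns false and leaves i and cp alone.
inline bool decode_utf8(const std::string& s, std::size_t& i, char32_t& cp)
{
    if (i >= s.size()) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len;
    char32_t v, min;  // min = smallest value allowed for this sequence length
    if (b0 < 0x80)                { len = 1; v = b0;        min = 0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; v = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; v = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; v = b0 & 0x07; min = 0x10000; }
    else return false;  // stray continuation byte or a 5/6-byte lead byte

    if (s.size() - i < len) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
        v = (v << 6) | (b & 0x3F);
    }
    if (v < min) return false;                     // overlong encoding
    if (v > 0x10FFFF) return false;                // outside the Unicode range
    if (v >= 0xD800 && v <= 0xDFFF) return false;  // surrogates are not valid in UTF-8
    cp = v;
    i += len;
    return true;
}
```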

John M. Dlugosz

Nov 5, 2011, 12:43:21 PM
to boost...@lists.boost.org
On Fri, Nov 4, 2011 at 11:28, Igor R <boost...@gmail.com> wrote:
>>> On Windows you should convert it to utf16.

I know that is how it stores it internally.
My question is "how". Given that I have data that are file names and encoded in UTF-8,
how do I make the Boost path class accept them, and operate conveniently enough to be
worth using instead of plain strings?

On Fri, Nov 4, 2011 at 22:54, Andrey Moshbear <andre...@gmail.com> wrote:
> For my rewrite of UTF-8 to UTF-16/32, look at
> https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp.

So this is a codecvt that I should use as the extra argument, that works better than the
undocumented one that came with Boost?

And, the implicit answer is that this is indeed how I do it?

But:

1) When I write something like
   path p2 = p1 / "Foo" / s1 / name;
there is no place to pass the extra codecvt argument. I thought it might take strings and
keep the existing encoding, but it actually uses the default code page. How can I use
path in a simple and convenient manner given that in this program all the strings I will
use with it are already in UTF-8?

2) How can I write a line like:
   path p2 (somestring, codecvt());
in a portable manner? On the Mac the internal representation is char, so will it object
to having the codecvt passed? Once I set things up, I want the bulk of the source code to
be the same on all platforms, so writing the argument on Windows and leaving it out on Mac
is not acceptable.

Thanks,
--John

Igor R

Nov 5, 2011, 1:57:24 PM
to boost...@lists.boost.org
>>>> On Windows you should convert it to utf16.
>
> I know that is how it stores it internally.
> My question is "how".  Given that I have data that are file names and
> encoded in UTF-8, how do I make the Boost path class accept them, and
> operate conveniently enough to be worth using instead of plain strings?


I don't think "path" object can do such a conversion automatically, so
you should convert it on your own using CRT, WinAPI, ATL macros or any
other facilities.

Andrey Moshbear

Nov 5, 2011, 2:00:30 PM
to boost...@lists.boost.org
On Sat, Nov 5, 2011 at 12:43, John M. Dlugosz <mpbec...@snkmail.com> wrote:
> On Fri, Nov 4, 2011 at 11:28, Igor R <boost...@gmail.com> wrote:
>>>>
>>>> On Windows you should convert it to utf16.
>
> I know that is how it stores it internally.
> My question is "how".  Given that I have data that are file names and
> encoded in UTF-8, how do I make the Boost path class accept them, and
> operate conveniently enough to be worth using instead of plain strings?
>
> On Fri, Nov 4, 2011 at 22:54, Andrey Moshbear <andre...@gmail.com> wrote:
>>
>> For my rewrite of UTF-8 to UTF-16/32, look at
>> https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp.
>
> So this is a codecvt that I should use as the extra argument, that works
> better than the undocumented one that came with Boost?
>

And the boost utf8<->utf32 one is indeed documented:
http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/codecvt.html.
It's just not going to work correctly with extended Unicode if you
decide to use 16-bit char as the char type.

The code itself isn't that self-documenting, though, which makes
hacking in the U+10FFFF limit and surrogate pair parsing more
work than simply rewriting the codecvt.

>
> And, the implicit answer is that this is indeed how I do it?
>
> But:
>
> 1) When I write something like
>   path p2= p1 / "Foo" / s1 / name;
> there is no place to pass the extra codecvt argument.  I thought it might
> take strings and keep the existing encoding, but it actually uses the
> default code page.  How can I use path in a simple and convenient manner
> given that in this program all the strings I will use with it are already in
> UTF-8?
>

Make a std::wstringstream.
Imbue it with locale(locale::classic(), new Utf8_cvt).
Use operator<< to build up a path.
Call .str() to get the string.
Pass that to the path constructor.


> 2) How can I write a line like:
>   path p2 (somestring, codecvt());
> in a portable manner?  On the Mac the internal representation is char, so
> will it object to having the codecvt passed?  Once I set things up, I want
> the bulk of the source code to be the same on all platforms, so writing the
> argument on Windows and leaving it out on Mac is not acceptable.
>

Since the Mac libraries assume char, wide UTF isn't going to work there:
they look for a char 0 as the terminator, not a wchar_t 0.

The best solution is to #ifdef _WIN32 the utf-8 to utf-16 code.
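
One way the #ifdef could be confined to a single helper, so that calling code
stays identical on every platform, is sketched below. The names native_string
and utf8_to_native are hypothetical, and returning an empty string on a failed
conversion is an assumption made for brevity. On Windows the sketch uses the
Win32 MultiByteToWideChar call with CP_UTF8 rather than a codecvt facet.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical names for illustration: native_string, utf8_to_native.
#ifdef _WIN32
typedef std::wstring native_string;  // wchar_t, like path::value_type on Windows

inline native_string utf8_to_native(const std::string& utf8)
{
    if (utf8.empty()) return native_string();
    const int in_len = static_cast<int>(utf8.size());
    // First call only asks how many wchar_t units the result needs.
    const int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, NULL, 0);
    if (n <= 0) return native_string();  // assumption: report failure as empty
    native_string out(static_cast<std::size_t>(n), L'\0');
    // Second call performs the UTF-8 to UTF-16 conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, &out[0], n);
    return out;
}
#else
typedef std::string native_string;   // char-based platforms keep the bytes as-is

inline native_string utf8_to_native(const std::string& utf8)
{
    return utf8;
}
#endif
```

With something like this in place, a line such as
path p2 (utf8_to_native(somestring)); would read the same on Windows and Mac,
at the cost of wrapping each UTF-8 string explicitly.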

John M. Dlugosz

Nov 5, 2011, 5:42:09 PM
to boost...@lists.boost.org
On 11/5/2011 12:57 PM, Igor R wrote:
>>>>> On Windows you should convert it to utf16.
>
> I don't think "path" object can do such a conversion automatically, so
> you should convert it on your own using CRT, WinAPI, ATL macros or any
> other facilities.

Uh, I don't think you understood the point of the question at all, nor know about the class.

"If the value type of [begin,end) or source arguments for member functions is not
value_type, and no cvt argument is supplied, conversion to value_type occurs using an
imbued locale."

"For Windows-like implementations, including Cygwin and MinGW, path::value_type is
wchar_t. The default imbued locale provides a codecvt facet that invokes Windows
MultiByteToWideChar or WideCharToMultiByte API's with a codepage of CP_THREAD_ACP if
Windows AreFileApisANSI() is true, otherwise codepage CP_OEMCP."

It DOES CONVERT, and that is the starting point of my issue. See? It does convert, and
not in the way I want (if I indeed wanted it to).
