[boost] [locale] [filesystem] Windows local 8 bit encoding


Thiel, Bjoern

Oct 31, 2012, 10:07:15 AM
to bo...@lists.boost.org
Hi,

Developing platform-independent code, I really like the convenience functions
conv::to_utf, conv::from_utf, and conv::utf_to_utf from Boost.Locale.
Why not add something like conv::local8bit_to_utf and conv::local8bit_from_utf,
following the rationale from Boost.Filesystem (path encoding conversions):

template< typename CharType >
std::basic_string< CharType > local8bit_to_utf
    ( std::string const & text, method_type how = default_method )
{
    char const * encoding = impl::local8bit_encoding() ;
    return to_utf< CharType >( text, encoding, how ) ;
}

template< typename CharType >
std::string local8bit_from_utf
    ( std::basic_string< CharType > const & text, method_type how = default_method )
{
    char const * encoding = impl::local8bit_encoding() ;
    return from_utf< CharType >( text, encoding, how ) ;
}

with

char const * local8bit_encoding()
{
#ifdef _WIN32
    // The file APIs use either the ANSI or the OEM code page.
    UINT codepage = AreFileApisANSI() ? GetACP() : GetOEMCP() ;
    return windows_codepage_to_encoding( codepage ) ;
#else
    return "UTF-8" ;
#endif
}

and with (better using a map)

char const * windows_codepage_to_encoding( int const codepage )
{
    switch (codepage)
    {
        case 874: return "windows-874" ;

        case 932: return "Shift_JIS" ; // but should be "Windows-31J"
        case 936: return "GB2312" ;
        case 949: return "KS_C_5601-1987" ;
        case 950: return "Big5" ;

        case 1250: return "windows-1250" ;
        case 1251: return "windows-1251" ;
        case 1252: return "windows-1252" ;
        case 1253: return "windows-1253" ;
        case 1254: return "windows-1254" ;
        case 1255: return "windows-1255" ;
        case 1256: return "windows-1256" ;
        case 1257: return "windows-1257" ;
        case 1258: return "windows-1258" ;

        case 20127: return "US-ASCII" ;

        case 20866: return "KOI8-R" ;
        case 20932: return "EUC-JP" ;
        case 21866: return "KOI8-U" ;

        case 28591: return "ISO-8859-1" ;
        case 28592: return "ISO-8859-2" ;
        case 28593: return "ISO-8859-3" ;
        case 28594: return "ISO-8859-4" ;
        case 28595: return "ISO-8859-5" ;
        case 28596: return "ISO-8859-6" ;
        case 28597: return "ISO-8859-7" ;
        case 28598: return "ISO-8859-8" ;
        case 28599: return "ISO-8859-9" ;
        case 28603: return "ISO-8859-13" ;
        case 28605: return "ISO-8859-15" ;

        case 50220: return "ISO-2022-JP" ;
        case 50225: return "ISO-2022-KR" ;

        case 51949: return "EUC-KR" ;
        case 54936: return "GB18030" ;

        case 65001: return "UTF-8" ;

        default:
        {
            std::ostringstream message ;
            message << "Unknown codepage " << codepage ;
            throw std::invalid_argument( message.str() ) ;
        }
    }
}
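
[Editor's sketch] The "better using a map" remark could look roughly like the following. This is only an illustration, not part of the proposal: the name codepage_to_encoding_name is hypothetical, only a handful of the entries from the switch are repeated, and returning std::string instead of char const * is an arbitrary choice.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical map-based variant of windows_codepage_to_encoding().
// Only a few entries are repeated here; the remaining code pages from
// the switch statement would be added to the table in the same way.
inline std::string codepage_to_encoding_name(unsigned int codepage)
{
    static const std::map<unsigned int, std::string> table = {
        { 874,   "windows-874"  },
        { 932,   "Shift_JIS"    },
        { 1252,  "windows-1252" },
        { 28591, "ISO-8859-1"   },
        { 65001, "UTF-8"        },
    };

    const auto it = table.find(codepage);
    if (it == table.end()) {
        // Same error behaviour as the default branch of the switch.
        std::ostringstream message;
        message << "Unknown codepage " << codepage;
        throw std::invalid_argument(message.str());
    }
    return it->second;
}
```

A table like this keeps the data separate from the lookup logic, so adding a code page is a one-line change.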

Best regards

Bjoern.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Yakov Galka

Oct 31, 2012, 12:41:32 PM
to bo...@lists.boost.org
On Wed, Oct 31, 2012 at 4:07 PM, Thiel, Bjoern
<bjoern...@mpibpc.mpg.de> wrote:

> Hi,
>

Hi,


> developing platform independent code I really like the convenience
> functions
> conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale.
> Why not add something like conv::local8bit_to_utf and
> conv::local8bit_from_utf
> following the rational from filesystem (path encoding conversions):
>

I cannot speak for Artyom, but IMO there is little use for such functions. On
Windows, 'ANSI' encodings exist solely for legacy reasons, and their use is
limited to legacy code and code that gives up Unicode support in the first
place. Boost.Filesystem uses 'ANSI' for narrow strings because Beman
decided that compatibility with the Dinkumware CRT implementation is more
important than portability of Unicode-correct code. The same is true for
all other parts of Boost (except Locale).

If you are really into platform independent code, take a look at
Boost.Nowide (http://cppcms.com/files/nowide/html/) waiting for review. In
principle, on Windows, you need only two conversions: UTF-8 into UTF-16 and
vice versa.
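
[Editor's sketch] To illustrate what the UTF-8 into UTF-16 direction involves, here is a minimal hand-written decoder that uses only the standard library. It is a sketch under assumptions: the function name utf8_to_utf16 is made up, and the validation is deliberately incomplete (overlong forms and surrogates encoded in UTF-8 are not rejected). In real code one would more likely call conv::utf_to_utf from Boost.Locale or MultiByteToWideChar with CP_UTF8.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 into UTF-16 code units.
// Rejects invalid lead bytes, truncated sequences, bad continuation
// bytes and code points above U+10FFFF. Overlong forms are NOT checked.
inline std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else if (cp <= 0x10FFFF) {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}
```

On Windows the resulting code units could then be copied into a std::wstring and passed to the wide (W) variants of the Win32 functions.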

Cheers,
--
Yakov

Jookia

Oct 31, 2012, 12:56:33 PM
to bo...@lists.boost.org
On 01/11/12 03:41, Yakov Galka wrote:
> On Wed, Oct 31, 2012 at 4:07 PM, Thiel, Bjoern
> <bjoern...@mpibpc.mpg.de>wrote:
>
>> Hi,
>>
>
> Hi,
>
>
>> developing platform independent code I really like the convenience
>> functions
>> conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale.
>> Why not add something like conv::local8bit_to_utf and
>> conv::local8bit_from_utf
>> following the rational from filesystem (path encoding conversions):
>>
>
> Cannot talk for Artyom, but IMO there is little use to such functions. On
> Windows, 'ANSI' encodings exist solely for legacy reasons, and their use is
> limited to legacy code and code that gives up Unicode support in the first
> place. Boost.Filesystem uses 'ANSI' for narrow strings because Beman
> decided that compatibility with the dinkumware CRT implementation is more
> important than portability of Unicode correct code. The same is true for
> all other parts of boost (except Locale).
>
> If you are really into platform independent code, take a look at
> Boost.Nowide (http://cppcms.com/files/nowide/html/) waiting for review. In
> principle, on Windows, you need only two conversions: UTF-8 into UTF-16 and
> vice versa.
>
> Cheers,
>

Hello all.

Although this is right, I do think locales themselves should specify
legacy encodings if they're using them.

As I understand it, wouldn't you be able to use from_utf/to_utf and achieve
the same behaviour without a separate set of functions?

Jookia.

Stephan T. Lavavej

Oct 31, 2012, 3:53:32 PM
to bo...@lists.boost.org
[Yakov Galka]
> Cannot talk for Artyom, but IMO there is little use to such functions. On
> Windows, 'ANSI' encodings exist solely for legacy reasons, and their use is
> limited to legacy code and code that gives up Unicode support in the first
> place. Boost.Filesystem uses 'ANSI' for narrow strings because Beman
> decided that compatibility with the dinkumware CRT implementation is more
> important than portability of Unicode correct code.

FYI, MSVC's C++ Standard Library implementation is licensed from Dinkumware, but MSVC's CRT is not.

Stephan T. Lavavej
Visual C++ Libraries Developer

Artyom Beilis

Nov 1, 2012, 4:57:50 AM
to bo...@lists.boost.org


>________________________________
> From: "Thiel, Bjoern" <bjoern...@mpibpc.mpg.de>
>To: "bo...@lists.boost.org" <bo...@lists.boost.org>
>Sent: Wednesday, October 31, 2012 4:07 PM
>Subject: [boost] [locale] [filesystem] Windows local 8 bit encoding
>
>Hi,
>
>developing platform independent code I really like the convenience functions
>conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale.
>Why not add something like conv::local8bit_to_utf and conv::local8bit_from_utf

First of all, the locale encoding is not constant. For example, there are
numerous ways to change the locale:

by calling C functions


   setlocale(LC_ALL,"en_US.ISO-8859-1") 
or

   setlocale(LC_ALL,"English_USA.1251") 


you can change it in C++ as


   std::locale::global(std::locale("en_US.ISO-8859-1"));

or

   std::locale::global(std::locale("English_USA.1251"));


Of course, on POSIX platforms even something like


   setenv("LANG","en_US.ISO-8859-1",1)

right after main() starts would effectively change the process locale.

Some functions will be affected by such changes and others will not;
it depends on the implementation and other things.
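
[Editor's sketch] The point that the locale is mutable process state can be seen with the C standard library alone. The helper names current_c_locale_name and switch_c_locale below are hypothetical. Which locale names are accepted depends on the platform and the installed locales, so only the "C" locale is relied on here.

```cpp
#include <clocale>
#include <string>

// Returns the name of the current C locale as reported by the C runtime.
// Passing a null pointer queries the locale without changing it.
inline std::string current_c_locale_name()
{
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Tries to switch the C locale for the whole process.
// Returns false if the runtime does not accept the given name,
// in which case the current locale is left unchanged.
inline bool switch_c_locale(const char* name)
{
    return std::setlocale(LC_ALL, name) != nullptr;
}
```

Any code that caches "the" local 8-bit encoding once would therefore miss a later call such as switch_c_locale("") made elsewhere in the process.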

Thus the "concept" of the OS locale is quite uncertain and not well
defined especially under Microsoft Windows.


Using Boost.Locale you can convert to locale encoding of a given
std::locale() object generated with Boost.Locale.

boost::locale::generator allows you to select the legacy "ANSI" encoding
instead of UTF-8 as the default when creating the locale object that
corresponds to the system locale.

You can use this object with the to_utf and from_utf functions.
>following the rational from filesystem (path encoding conversions):
>
> [snip]
>
I think that Boost.Filesystem's approach is too simplistic: it uses
the default Windows encoding as its default behavior under Windows,
which makes cross-platform development harder.

So if you want to write cross-platform software, stick to UTF-8
and, at the boundary of the Win32 API, convert it for the wide API,
which is the native Windows API and the correct one to use.


>Best regards
>
>Bjoern.
>
Best,


Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

Thiel, Bjoern

Nov 1, 2012, 10:43:28 AM
to bo...@lists.boost.org

________________________________________
From: boost-...@lists.boost.org [boost-...@lists.boost.org] on behalf of Artyom Beilis [arty...@yahoo.com]
Sent: Thursday, November 01, 2012 09:57
To: bo...@lists.boost.org
Subject: Re: [boost] [locale] [filesystem] Windows local 8 bit encoding

>________________________________
> From: "Thiel, Bjoern" <bjoern...@mpibpc.mpg.de>
>To: "bo...@lists.boost.org" <bo...@lists.boost.org>
>Sent: Wednesday, October 31, 2012 4:07 PM
>Subject: [boost] [locale] [filesystem] Windows local 8 bit encoding

Hi,

> >Hi,
> >
> >developing platform independent code I really like the convenience functions
> >conv::to_utf, conv::from_utf, and conv::utf_to_utf from locale.
> >Why not add something like conv::local8bit_to_utf and conv::local8bit_from_utf
>
> First of all locale encoding is not constant, for example there are numerous
> way to change locale
>
> [...]
>
> Thus the "concept" of the OS locale is quite uncertain and not well
> defined especially under Microsoft Windows.

Right

> Using Boost.Locale you can convert to locale encoding of a given
> std::locale() object generated with Boost.Locale.
>
> boost::locale::generator allows to select legacy "ANSI" encoding
> instead of UTF-8 to be default upon creation of the locale object that
> corresponds to the system locale.
>
> This object you can use with to_utf and from_utf functions.

Unfortunately that does not work under Microsoft Windows as
  generator locale_generator ;
  locale_generator.use_ansi_encoding( true ) ;
  std::locale const current_locale = locale_generator.generate( name ) ;
needs a name.

If I use the application locale name
  std::string const name = std::locale().name() ;
I get "C" which gives me "US-ASCII" encoding and not the "windows-1252"
encoding I have.

Even if I use the system locale name
  std::string const name = std::locale( "" ).name() ;
I get "English_United States.1252" which gives me the codepage "1252"
as encoding and not "windows-1252" either (conv::to_utf and conv::from_utf
just throw "Invalid or unsupported charset:1252" in this case).
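
[Editor's sketch] One possible workaround for the bare "1252" suffix is to post-process the locale name before handing the encoding to a converter. The helper encoding_from_locale_name is hypothetical and uses only the standard library. Prefixing "windows-" is an assumption that fits code pages such as 1250 to 1258; whether a converter accepts the resulting name for other code pages depends on the backend.

```cpp
#include <cctype>
#include <string>

// Extracts the encoding part of a locale name such as
// "English_United States.1252" or "en_US.ISO-8859-15@euro".
// A purely numeric suffix is treated as a Windows code page and
// turned into "windows-<n>"; other suffixes are returned unchanged.
// Returns an empty string if the name has no encoding part (e.g. "C").
inline std::string encoding_from_locale_name(const std::string& name)
{
    const std::string::size_type dot = name.rfind('.');
    if (dot == std::string::npos)
        return "";

    std::string suffix = name.substr(dot + 1);

    // Drop a POSIX modifier such as "@euro" if present.
    const std::string::size_type at = suffix.find('@');
    if (at != std::string::npos)
        suffix.erase(at);

    if (suffix.empty())
        return "";

    bool numeric = true;
    for (char ch : suffix) {
        if (!std::isdigit(static_cast<unsigned char>(ch))) {
            numeric = false;
            break;
        }
    }
    return numeric ? "windows-" + suffix : suffix;
}
```

With this helper, the name "English_United States.1252" yields "windows-1252", which is the encoding name the message above says it expected.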

> [...]
>
> So if you want to write cross platform software stick to UTF-8
> and on the boundary of Win32 API convert it to Wide API
> which is the native Windows API and the correct one to use.

Actually I'm trying to make a shared object (a dll) platform independent
that has to do some character conversions according to the current application
locale.

Best regards

Bjoern.

Artyom Beilis

Nov 1, 2012, 10:59:29 AM
to bo...@lists.boost.org
>> Using Boost.Locale you can convert to locale encoding of a given

>> std::locale() object generated with Boost.Locale.
>>
>> boost::locale::generator allows to select legacy "ANSI" encoding
>> instead of UTF-8 to be default upon creation of the locale object that
>> corresponds to the system locale.
>>
>> This object you can use with to_utf and from_utf functions.
>
>Unfortunately that does not work under Microsoft Windows as
>  generator locale_generator ;
>  locale_generator.use_ansi_encoding( true ) ;
>  std::locale const current_locale = locale_generator.generate( name ) ;
>needs a name.
>

Similar to creating std::locale(""), calling
locale_generator.generate("") gives the expected result, i.e. the
system default locale.


See: http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/locale_gen.html



>Best regards
>
>Bjoern.
>

Regards


Artyom Beilis

Jookia

Nov 1, 2012, 11:00:02 AM
to bo...@lists.boost.org
On 02/11/12 01:43, Thiel, Bjoern wrote:
> Unfortunately that does not work under Microsoft Windows as
> generator locale_generator ;
> locale_generator.use_ansi_encoding( true ) ;
> std::locale const current_locale = locale_generator.generate( name ) ;
> needs a name.
>
> If I use the application locale name
> std::string const name = std::locale().name() ;
> I get "C" which gives me "US-ASCII" encoding and not the "windows-1252"
> encoding I have.
>
> Even if I use the system locale name
> std::string const name = std::locale( "" ).name() ;
> I get "English_United States.1252" which gives me the codepage "1252"
> as encoding and not "windows-1252" either (conv::to_utf and conv::from_utf
> just throw "Invalid or unsupported charset:1252" in this case).
>
> Best regards
>
> Bjoern.

Hey!

Sorry if this sounds silly, but have you tried util::get_system_locale?

Jookia.

Thiel, Bjoern

Nov 1, 2012, 12:08:21 PM
to bo...@lists.boost.org

________________________________________
From: boost-...@lists.boost.org [boost-...@lists.boost.org] on behalf of Artyom Beilis [arty...@yahoo.com]
Sent: Thursday, November 01, 2012 15:59
To: bo...@lists.boost.org
Subject: Re: [boost] [locale] [filesystem] Windows local 8 bit encoding

> >> Using Boost.Locale you can convert to locale encoding of a given
> >> std::locale() object generated with Boost.Locale.
> >>
> >> boost::locale::generator allows to select legacy "ANSI" encoding
> >> instead of UTF-8 to be default upon creation of the locale object that
> >> corresponds to the system locale.
> >>
> >> This object you can use with to_utf and from_utf functions.
> >
> >Unfortunately that does not work under Microsoft Windows as
> > generator locale_generator ;
> > locale_generator.use_ansi_encoding( true ) ;
> > std::locale const current_locale = locale_generator.generate( name ) ;
> >needs a name.
> >
>
> Similar to creating std::locale("") the generation
> locale_generator.generate("") gives the expected result, i.e.
> system default locale.

Unfortunately under Microsoft Windows this only gives the 'hardwired'
"UTF-8" encoding:

void prepare_data()
{
    ...
    if(locale_id_.empty()) {
        real_id_ = util::get_system_locale(true); // always UTF-8
        ...
    }
    ...
}

where util::get_system_locale(false) would at least do part of the mapping
from codepage to encoding (see my initial posting).

Best regards

Bjoern.

Thiel, Bjoern

Nov 6, 2012, 5:47:53 AM
to bo...@lists.boost.org

________________________________________
From: boost-...@lists.boost.org [boost-...@lists.boost.org] on behalf of Artyom Beilis [arty...@yahoo.com]
Sent: Thursday, November 01, 2012 09:57
To: bo...@lists.boost.org
Subject: Re: [boost] [locale] [filesystem] Windows local 8 bit encoding

>________________________________
> From: "Thiel, Bjoern" <bjoern.thiel <at> mpibpc.mpg.de>
>To: "boost <at> lists.boost.org" <boost <at> lists.boost.org>
>Sent: Wednesday, October 31, 2012 4:07 PM
>Subject: [boost] [locale] [filesystem] Windows local 8 bit encoding

Hi Artyom,

> Using Boost.Locale you can convert to locale encoding of a given
> std::locale() object generated with Boost.Locale.
>
> boost::locale::generator allows to select legacy "ANSI" encoding
> instead of UTF-8 to be default upon creation of the locale object that
> corresponds to the system locale.
>
> This object you can use with to_utf and from_utf functions.

Right - you can use them. But they are not very helpful.
If you want the SYSTEM locale on Microsoft Windows:
  generator locale_generator ;
  locale_generator.use_ansi_encoding( true ) ;
  wstring = conv::to_utf< wchar_t >( string, locale_generator( "" ) ) ;
unfortunately gives UTF-8 encoding as well.

And if you want the CURRENT locale on Microsoft Windows, I simply can't
see how to get that.

But why not add generator.generate( void ) giving exactly that.
Together with 'really' use_ansi_encoding( true ) it would be perfect.

Best regards

Bjoern.

Artyom Beilis

Nov 6, 2012, 6:54:15 AM
to bo...@lists.boost.org
Which backend do you use? There are several applicable for Windows:

- icu - based on ICU library
- win32 - based on win32 API
- std - based on standard C++ library

They are selected in this order if compiled in.

Currently the win32 backend supports only UTF-8 encodings, so you need to do one of the following:

- compile with ICU
- select std backend (because the default without ICU on windows is win32)
- disable the win32 backend (in build options) so that only the std backend is used. Note, however, that under Windows only MSVC has std backend support; for gcc and ANSI encodings you need ICU.


 
Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/



>________________________________
> From: "Thiel, Bjoern" <bjoern...@mpibpc.mpg.de>
>To: "bo...@lists.boost.org" <bo...@lists.boost.org>
>Sent: Tuesday, November 6, 2012 12:47 PM