[boost] Silly Boost.Locale default narrow string encoding in Windows


Alf P. Steinbach

Oct 27, 2011, 12:25:31 PM
to bo...@lists.boost.org
When I engage the compiler-in-my-mind to the example given at

http://cppcms.sourceforge.net/boost_locale/html/

namely

<code>
#include <boost/locale.hpp>
#include <boost/filesystem/path.hpp>
#include <boost/filesystem/fstream.hpp>

int main()
{
    // Create and install global locale
    std::locale::global(boost::locale::generator().generate(""));
    // Make boost.filesystem use it
    boost::filesystem::path::imbue(std::locale());
    // Now works perfectly fine with UTF-8!
    boost::filesystem::ofstream hello("שלום.txt");
}
</code>

then it fails to work when the literal string is replaced with a `main`
argument.

A conversion is then necessary and must be added.

It breaks the principle of least surprise.

It breaks the principle of not paying for what you don't (want to) use.

I understand, from discussions elsewhere, that the author(s) have chosen
a narrow string encoding that requires inefficient & awkward conversions
in all directions, for political/religious reasons. Maybe my
understanding of that is faulty, that it's no longer politics & religion
but outright war (and maybe that war is even over, with even Luke
Skywalker dead or deadly wounded). However, I still ask:

why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do
it right, using the platforms' native encodings.


Cheers,

- Alf


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Peter Dimov

Oct 27, 2011, 12:47:42 PM
to bo...@lists.boost.org
Alf P. Steinbach wrote:

> However, I still ask:
>
> why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do
> it right, using the platforms' native encodings.

Comment out the imbue line.

(The platform's native encoding is UTF-16. The "ANSI" code page, which is
not necessarily ANSI or ANSI-like at all, despite your assertion, is not
"native"; the OS just converts from/to it as needed. Your program will work
fine until it's given a file name that is not representable in the ANSI CP.)

Artyom Beilis

Oct 27, 2011, 1:06:45 PM
to bo...@lists.boost.org
>
> then it fails to work when the literal string is replaced
> with a `main` argument.
>
> A conversion is then necessary and must be added.
>
> It breaks the principle of least surprise.
>
> It breaks the principle of not paying for what you don't
> (want to) use.
>

Did you read this?

http://beta.boost.org/doc/libs/1_48_0_beta1/libs/locale/doc/html/default_encoding_under_windows.html

You can **easily** switch to ANSI as default...

But you don't want to (you'd rather switch to UTF-16 or UTF-8),
especially when you actually use localization... :-)

> I understand, from discussions elsewhere, that the
> author(s) have chosen a narrow string encoding that requires
> inefficient & awkward conversions in all directions, for
> political/religious reasons.

No, you haven't read the rationale correctly, and you didn't read
what is written in the link I gave.

If you write "Windows only" software, you should either set the
ANSI option or use the native encoding, UTF-16.

If not, stick to cross-platform UTF-8.

> Maybe my understanding of that
> is faulty, that it's no longer politics & religion but
> outright war (and maybe that war is even over, with even
> Luke Skywalker dead or deadly wounded). However, I still
> ask:
>
> why FORCE INEFFICIENCY & AWKWARDNESS on Boost
> users  --  why not just do it right, using the
> platforms' native encodings.
>

Windows native encoding is not ANSI. It is Wide/UTF-16 encoding.

-----------------------------------------------------

If you are still not convinced: using UTF-8 by default was one
of the important pluses this library brings, and it was noted
by many reviewers.

Artyom

Alf P. Steinbach

Oct 27, 2011, 1:17:45 PM
to bo...@lists.boost.org
On 27.10.2011 18:47, Peter Dimov wrote:
> Alf P. Steinbach wrote:
>
>> However, I still ask:
>>
>> why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do
>> it right, using the platforms' native encodings.
>
> Comment out the imbue line.

But that line is much of the point, isn't it?


> (The platform's native encoding is UTF-16. The "ANSI" code page, which
> is not necessarily ANSI or ANSI-like at all, despite your assertion,

The article you responded to did not contain the word "ANSI".

Thus, when you refer to an assertion about "ANSI", you have fantasized
something.

I hope you are not going to go on like that.


> [ANSI] is not "native"; the OS just converts from/to it as needed.

OK, you need to learn quite a bit, but

(1) you appear to be very sure that you're already knowledgeable, and

(2) you attribute things to me that you have just fantasized.

That makes it very difficult to teach you.

For narrow character strings in Windows, "native" and "ANSI" are
interchangeable terms.

They mean the same, namely the codepage identified by the GetACP() function.

This is not a particular codepage, it is configurable.

On my machine, and most probably on yours, it is codepage 1252, Windows
ANSI Western.

"Native" means the encoding used and expected by the OS' API functions.

For narrow character strings in Windows, that's Windows ANSI.


> Your program

No, again you're wrong: it's the Boost.Locale documentation's program.


> will work fine until it's given a file name that is not representable in
> the ANSI CP.)

Nope, sorry, for any /reasonable interpretation/ of what you're writing.

I can imagine that maybe you're thinking about setting ANSI CP to 65001,
which however is not reasonable.


Cheers & hth.,

- Alf

Alf P. Steinbach

Oct 27, 2011, 1:19:47 PM
to bo...@lists.boost.org
On 27.10.2011 19:06, Artyom Beilis wrote:
>
> Windows native encoding is not ANSI. It is Wide/UTF-16 encoding.

Try using UTF-16 with narrow strings.


Cheers & hth.,

- Alf


Peter Dimov

Oct 27, 2011, 2:01:52 PM
to bo...@lists.boost.org
Alf P. Steinbach wrote:
> On 27.10.2011 18:47, Peter Dimov wrote:
> > Alf P. Steinbach wrote:
> >
> >> However, I still ask:
> >>
> >> why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do
> >> it right, using the platforms' native encodings.
> >
> > Comment out the imbue line.
>
> But that line is much of the point, isn't it?

There wouldn't be much point in calling imbue if you didn't want a change in
the boost::filesystem default behavior, which is to convert using the ANSI
CP (or the OEM CP if AreFileApisANSI() returns false, if I'm not mistaken).


> > (The platform's native encoding is UTF-16. The "ANSI" code page, which
> > is not necessarily ANSI or ANSI-like at all, despite your assertion,
>
> The article you responded to did not contain the word "ANSI".
>
> Thus, when you refer to an assertion about "ANSI", you have fantasized
> something.

http://boost.2283326.n4.nabble.com/Making-Boost-Filesystem-work-with-GENERAL-filenames-with-g-in-Windows-a-solution-tp3936857p3944493.html

> I hope you are not going to go on like that.
>
>
> > [ANSI] is not "native"; the OS just converts from/to it as needed.
>
> OK, you need to learn a quite bit but
>
> (1) you appear to be very sure that you're already knowledgeable, and
>
> (2) you attribute things to me that you have just fantasized.
>
> That makes it very difficult to teach you.
>
> For narrow character strings in Windows, "native" and "ANSI" are
> interchangeable terms.

I will accept your definition for the time being and restate what I just
said without using "native":

Under Windows (NT+ and NTFS), the narrow character API is a wrapper over the
wide character API. The system converts from/to the ANSI code page as
needed. The narrowing conversion may lose data.

> > Your program
>
> No, again you're wrong: it's the Boost.Locale documentation's program.
>
>
> > will work fine until it's given a file name that is not representable in
> > the ANSI CP.)
>
> Nope, sorry, for any /reasonable interpretation/ of what you're writing.

File names on NTFS are not necessarily representable in the ANSI code page.
A program that uses narrow strings in the ANSI code page to represent paths
will not necessarily be able to open all files on the system.

Mateusz Łoskot

Oct 27, 2011, 2:27:10 PM
to bo...@lists.boost.org
On 27 October 2011 18:19, Alf P. Steinbach
<alf.p.stein...@gmail.com> wrote:
> On 27.10.2011 19:06, Artyom Beilis wrote:
>>
>> Windows native encoding is not ANSI. It is Wide/UTF-16 encoding.
>
> Try using UTF-16 with narrow strings.

You simply don't do that, do you, without conversion to wide string type.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org
Member of ACCU, http://accu.org

Alf P. Steinbach

Oct 27, 2011, 2:41:09 PM
to bo...@lists.boost.org
On 27.10.2011 20:01, Peter Dimov wrote:
> Alf P. Steinbach wrote:
>> On 27.10.2011 18:47, Peter Dimov wrote:
>> > Alf P. Steinbach wrote:
>> >
>> >> However, I still ask:
>> >>
>> >> why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do
>> >> it right, using the platforms' native encodings.
>> >
>> > Comment out the imbue line.
>>
>> But that line is much of the point, isn't it?
>
> There wouldn't be much point in calling imbue if you didn't want a
> change in the boost::filesystem default behavior, which is to convert
> using the ANSI CP (or the OEM CP if AreFileApisANSI() returns false, if
> I'm not mistaken).

Oh there is.

It is a level of indirection.

You want Boost.Filesystem to assume /the same/ narrow character encoding
as Boost.Locale, whatever it is.

And to quote the docs where I found that program,

"Boost Locale fully supports both narrow and wide API. The default
character encoding is assumed to be UTF-8 on Windows."


>> > (The platform's native encoding is UTF-16. The "ANSI" code page, which
>> > is not necessarily ANSI or ANSI-like at all, despite your assertion,
>>
>> The article you responded to did not contain the word "ANSI".
>>
>> Thus, when you refer to an assertion about "ANSI", you have fantasized
>> something.
>
> http://boost.2283326.n4.nabble.com/Making-Boost-Filesystem-work-with-GENERAL-filenames-with-g-in-Windows-a-solution-tp3936857p3944493.html

That's a different context and a different discussion, where it was
neither necessary nor natural to dot the i's and cross the t's to
perfection.

Talk about dragging in things from out of the blue.

If you wanted to point out the possibility of e.g. a Japanese codepage
as ANSI, then you should have done that over there, in that thread. I
mean in the context where it could make sense and where it could help
prevent readers getting a wrong impression. If it was that important.


[snippety]


> Under Windows (NT+ and NTFS), the narrow character API is a wrapper over
> the wide character API. The system converts from/to the ANSI code page
> as needed. The narrowing conversion may lose data.

OK, we're just talking about two different meanings of "native", for two
different contexts: windows internals, and windows apps.

The relevant context for discussing Boost.Locale's treatment of narrow
strings, is the application level.


>> > [the program] will work fine until it's given a file name that is not
>> > representable in the ANSI CP.)
>>
>> Nope, sorry, for any /reasonable interpretation/ of what you're writing.
>
> File names on NTFS are not necessarily representable in the ANSI code
> page. A program that uses narrow strings in the ANSI code page to
> represent paths will not necessarily be able to open all files on the
> system.

Right, that's one reason why modern Windows programs should best be
wchar_t based. Other reasons include efficiency (avoiding conversions)
and simple convenience. Some API functions do not have narrow wrappers.

However, a default assumption of UTF-8 encoding for narrow strings, as
in Boost.Locale, seems to me to clash with most uses of narrow strings.

For example, if you output UTF-8 on standard output, and then try to
pipe that through `more` in Windows' [cmd.exe], you get this:


<example>
d:\dave> chcp 65001
Active code page: 65001

d:\dave> echo "imagine this is utf8" | more
Not enough memory.

d:\dave> _
</example>


So utf-8 is, to put it less than strongly, not very practical as a
general narrow-character encoding in Windows.

The example that I gave at top of the thread was passing a `main`
argument further on, when using Boost.Locale. It causes trouble because
in Windows `main` arguments are by convention encoded as ANSI, while
Boost.Locale has UTF-8 as default. Treating ANSI as UTF-8 generally
yields gobbledygook, except for the pure ASCII common subset.

But with ANSI as Boost.Locale default, with that more reasonable choice
of default, the imbue call would not cause trouble, but would instead
help to avoid trouble -- which is surely the original intention.


Cheers & hth.,

- Alf


Peter Dimov

Oct 27, 2011, 3:07:25 PM
to bo...@lists.boost.org
Alf P. Steinbach wrote:
> On 27.10.2011 20:01, Peter Dimov wrote:
...

> > File names on NTFS are not necessarily representable in the ANSI code
> > page. A program that uses narrow strings in the ANSI code page to
> > represent paths will not necessarily be able to open all files on the
> > system.
>
> Right, that's one reason why modern Windows programs should best be
> wchar_t based.

This is one of the two options. The other is using UTF-8 for representing
paths as narrow strings. The first option is more natural for Windows-only
code, and the second is better, in practice, for portable code because it
avoids the need to duplicate all path-related functions for char/wchar_t.
The motivation for using UTF-8 is practical, not political or religious.

> The example that I gave at top of the thread was passing a `main` argument
> further on, when using Boost.Locale. It causes trouble because in Windows
> `main` arguments are by convention encoded as ANSI, while Boost.Locale has
> UTF-8 as default. Treating ANSI as UTF-8 generally yields gobbledygook,
> except for the pure ASCII common subset.

Yes. If you (generic second person, not you specifically) want to take your
paths from the narrow API, an UTF-8 default is not practical. But then
again, you shouldn't take your paths from the narrow API, because it can't
represent the names of all the files the user may have.

Artyom Beilis

Oct 27, 2011, 4:17:53 PM
to bo...@lists.boost.org
> From: Alf P. Steinbach <alf.p.stein...@gmail.com>
>
> [...]

>
> It is a level of indirection.
>
> You want Boost.Filesystem to assume /the same/ narrow
> character encoding as Boost.Locale, whatever it is.
>
> And to quote the docs where I found that program,
>
> "Boost Locale fully supports both narrow and wide API. The
> default character encoding is assumed to be UTF-8 on
> Windows."
>


I will say it once again, and for the last time.

1. Boost.Locale is a **localization** library, and localization
today is done using **Unicode**, not cp1252, cp936 or cp1255.

And UTF-8 is the **Unicode** encoding for narrow strings.

So _any_ localization library **must** use a Unicode encoding,
otherwise it will be useless crap.

2. If you write software for Windows and want to use the ANSI encoding
by default, all you need is to add a _single_ line to your code.

I give you a choice to use whatever you want. But the default
should be suitable for **localization**, the reason this library
was written.

Now, you may not like the design of the Boost.Locale library, or
you may not like its defaults. That is legitimate. But using UTF-8
by default was one of the few points that had total agreement among
all Boost.Locale reviewers.

Using UTF-8 by default is indeed a strategic decision. You may
call it political, I may call it practical. You may not like it,
but this is what will remain, because it is the way the library
is designed and it is one of its central parts.

You don't like it? OK... I have given you an option to change it.
I think you and other users will survive the one extra line that
changes the default encoding to ANSI instead of cross-platform
UTF-8.

Best Regards,


Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/

Alf P. Steinbach

Oct 27, 2011, 5:12:53 PM
to bo...@lists.boost.org
On 27.10.2011 21:07, Peter Dimov wrote:
> Alf P. Steinbach wrote:
>> On 27.10.2011 20:01, Peter Dimov wrote:
> ...
>> > File names on NTFS are not necessarily representable in the ANSI code
>> > page. A program that uses narrow strings in the ANSI code page to
>> > represent paths will not necessarily be able to open all files on the
>> > system.
>>
>> Right, that's one reason why modern Windows programs should best be
>> wchar_t based.
>
> This is one of the two options. The other is using UTF-8 for
> representing paths as narrow strings. The first option is more natural
> for Windows-only code, and the second is better, in practice, for
> portable code because it avoids the need to duplicate all path-related
> functions for char/wchar_t. The motivation for using UTF-8 is practical,
> not political or religious.

Thanks for that clarification of the current thinking at Boost.

I suspected that people envisioned those two choices as an exhaustive
set of alternatives, what to choose from, but I wasn't sure.

Anyway, happily, the apparent forced choice between two inefficient
ungoods is not necessary -- i.e. it's a false dichotomy.

For, there are at least THREE options for representing paths and other
strings internally in the program, in portable single-source code:

1. wide character based (UTF-16 in Windows, possibly UTF-32 in *nix),
as you described above,

2. narrow character based (UTF-8), as you described above, and

3. the most natural sufficiently general native encoding, 1 or 2
depending on the platform that the source is being built for.

Option 3 means -- it requires, as far as I can see -- some
abstraction that hides the narrow/wide representation so as to get
source code level portability, which is all that matters for C++. It
doesn't need to involve very much. Some typedefs, traits, references.

Prior art in this direction, includes Microsoft's [tchar.h].

For example, write a portable string literal like this:

PS( "This is a portable string literal" )

As compared to options 1 and 2, the benefits of option 3 include:

* no inefficient conversions except at the external boundary of the
program (and then in practice only in Windows, where it already happens),

* no problems with software and tools that don't understand a chosen
"universal" (option 1 or 2) encoding,

* no need to duplicate functions to adapt to underlying OS: one has
at hand exactly what the OS API wants.

The main drawback is IMO the need to use something like a PS macro for
string and character literals, or a C++11 /user defined literal/.
Windows programmers are used to that, writing _T("blah") all the time as
if Windows 95 was still extant. So, considering that all that current
labor is being done for no reward whatsoever, I think it should be no
problem convincing programmers that writing a few characters more in
order to get portable string literals, is worth it; it just needs
exposure to examples from some authoritative source...


>> The example that I gave at top of the thread was passing a `main`
>> argument further on, when using Boost.Locale. It causes trouble
>> because in Windows `main` arguments are by convention encoded as ANSI,
>> while Boost.Locale has UTF-8 as default. Treating ANSI as UTF-8
>> generally yields gobbledygook, except for the pure ASCII common subset.
>
> Yes. If you (generic second person, not you specifically) want to take
> your paths from the narrow API, an UTF-8 default is not practical. But
> then again, you shouldn't take your paths from the narrow API, because
> it can't represent the names of all the files the user may have.

That's an unrelated issue, really, but I think Boost could use a "get
undamaged program arguments in portable strings" thing, if it isn't
there already?


Cheers & hth.,

- Alf


Peter Dimov

Oct 27, 2011, 5:56:59 PM
to bo...@lists.boost.org
Alf P. Steinbach wrote:
> On 27.10.2011 21:07, Peter Dimov wrote:
> > Alf P. Steinbach wrote:
...

> >> Right, that's one reason why modern Windows programs should best be
> >> wchar_t based.
> >
> > This is one of the two options. The other is using UTF-8 for
> > representing paths as narrow strings. The first option is more natural
> > for Windows-only code, and the second is better, in practice, for
> > portable code because it avoids the need to duplicate all path-related
> > functions for char/wchar_t. The motivation for using UTF-8 is practical,
> > not political or religious.
>
> Thanks for that clarification of the current thinking at Boost.

My opinion is not representative of all of Boost, although I've found that
there is substantial agreement between people who write portable software
that needs to deal with paths (#2, UTF-8, as the way to go).

> 3. the most natural sufficiently general native encoding, 1 or 2
> depending on the platform that the source is being built for.

Yes, with its various suboptions. 3a, TCHAR, 3b, template on char_type, 3c,
providing both char and wchar_t overloads. They all have their problems;
people don't move to UTF-8 merely out of spite.

> Prior art in this direction, includes Microsoft's [tchar.h].

This works, more or less, once you've accumulated the appropriate library of
_T macros, _t functions and T/t typedefs. I've never heard of it actually
being used for a portable code base, but I admit that it's possible to do
things this way, even if it's somewhat alien to POSIX people.

The advantage of using UTF-8 is that, apart from the border layer that calls
the OS (and that needs to be ported either way), the rest of the code is
happily char[]-based. There's no need to be aware of the fact that literals
need to be quoted or that strlen should be spelled _tcslen. There's no need
to convert paths to an external representation when writing them into a
portable config/project file.

> That's an unrelated issue, really, but I think Boost could use a "get
> undamaged program arguments in portable strings" thing, if it isn't there
> already?

We'll be back to the question of what constitutes a portable string. I'd
prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer
TCHAR[].

Alf P. Steinbach

Oct 27, 2011, 10:23:01 PM
to bo...@lists.boost.org

[tchar.h], plus the similar support in <windows.h>, was heavily used for
porting applications between Windows 9x ANSI and Windows NT Unicode,
before Microsoft introduced the Layer for Unicode in 2001 or thereabouts
(the layer allowed wchar_t-apps to run in Windows 9x).

I'm not saying it's a good C++ approach for that porting -- it's not,
since it was designed for the C language.

I just gave it as an example of prior art, which includes a neat header
where the names of the relevant functions to wrap (or whatever) can be
extracted by a small Python script. ;-)


> but I admit that it's
> possible to do things this way, even if it's somewhat alien to POSIX
> people.
>
> The advantage of using UTF-8 is that, apart from the border layer that
> calls the OS (and that needs to be ported either way), the rest of the
> code is happily char[]-based.

Oh.

I would be happy to learn this.

How do I make the following program work with Visual C++ in Windows,
using narrow character string?


<code>
#include <stdio.h>
#include <fcntl.h> // _O_U8TEXT
#include <io.h> // _setmode, _fileno
#include <windows.h>

int main()
{
    //SetConsoleOutputCP( 65001 );
    //_setmode( _fileno( stdout ), _O_U8TEXT );
    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>


The commented-out code is from my random efforts to Make It Work(TM).

It refused.

By the way, I'm hoping Boost isn't supporting old versions of g++.

Because old versions of g++ choked on a BOM at the start of UTF-8 encoded
source code, while Visual C++ requires that BOM... So, UTF-8 source code
is ungood with old versions of g++, if Visual C++ is also used.


> There's no need to be aware of the fact
> that literals need to be quoted or that strlen should be spelled
> _tcslen. There's no need to convert paths to an external representation
> when writing them into a portable config/project file.

Hm, I'm not so sure.

I'd like to see this magic in action before believing in it, e.g., the
program above working with narrow chars and printf, with Visual C++.


>> That's an unrelated issue, really, but I think Boost could use a "get
>> undamaged program arguments in portable strings" thing, if it isn't
>> there already?
>
> We'll be back to the question of what constitutes a portable string. I'd
> prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer
> TCHAR[].

No, not TCHAR, which was designed for the C language (and is an ugly
uppercase name to boot).

Instead, like this:


<code>
#include "u/stdio_h.h" // u::CodingValue, u::sprintf, U

#undef UNICODE
#define UNICODE
#include <windows.h> // MessageBox

int main()
{
    u::CodingValue buffer[80];

    sprintf( buffer, U( "The answer is %d!" ), 6*7 ); // Koenig lookup.
    MessageBox(
        0,
        buffer->rawPtr(),
        U( "This is a title!" )->rawPtr(),
        MB_ICONINFORMATION | MB_SETFOREGROUND
    );
}
</code>


I coded up that support after reading the article I'm responding to now,
because I felt that without coding it up I would be just spewing gut
feelings and hunches. Well-informed ones, but still. So I coded. :-)


Cheers & hth.,

- Alf


Yakov Galka

Oct 28, 2011, 6:36:58 AM
to bo...@lists.boost.org

How will you make this program portable?

> The commented-out code is from my random efforts to Make It Work(TM).
>
> It refused.
>

This is because Windows narrow chars can't be UTF-8. You could make it
portable by:

int main()
{
    boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
}


>
> By the way, I'm hoping Boost isn't supporting old versions of g++.
>
> Because old versions of g++ chocked on a BOM at start of UTF-8 encoded
> source code, while Visual C++ requires that BOM... So, UTF-8 source code
> ungood with old versions of g++, if Visual C++ is also used.


If you don't use wide chars, you can cheat VC++ into using UTF-8 string
literals. Just save the file as UTF-8 *without* a BOM. It will then embed
them verbatim into the executable.

>> There's no need to be aware of the fact
>> that literals need to be quoted or that strlen should be spelled
>> _tcslen. There's no need to convert paths to an external representation
>> when writing them into a portable config/project file.
>>
>
> Hm, I'm not so sure.
>
> I'd like to see this magic in action before believing in it, e.g., the
> program above working with narrow chars and printf, with Visual C++.


See above and see
http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036


>
>>> That's an unrelated issue, really, but I think Boost could use a "get
>>> undamaged program arguments in portable strings" thing, if it isn't
>>> there already?
>>>
>>
>> We'll be back to the question of what constitutes a portable string. I'd
>> prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer
>> TCHAR[].
>>
>
> No, not TCHAR, which was designed for the C language (and is an ugly
> uppercase name to boot).
>
> Instead, like this:
>
>
> <code>
> #include "u/stdio_h.h" // u::CodingValue, u::sprintf, U
>
> #undef UNICODE
> #define UNICODE
> #include <windows.h> // MessageBox
>
> int main()
> {
> u::CodingValue buffer[80];
>
> sprintf( buffer, U( "The answer is %d!" ), 6*7 ); // Koenig lookup.
> MessageBox(
> 0,
> buffer->rawPtr(),
> U( "This is a title!" )->rawPtr(),
> MB_ICONINFORMATION | MB_SETFOREGROUND
> );
> }
> </code>
>

You judge from a non-portable code point-of-view. How about:

#include <cstdio>
#include "gtkext/message_box.h" // for gtkext::message_box

int main()
{
    char buffer[80];
    sprintf(buffer, "The answer is %d!", 6*7);
    gtkext::message_box(buffer, "This is a title!", gtkext::icon_blah_blah,
        ...);
}

And unlike your code, it's magically portable! (thanks to GTK using UTF-8 on
Windows)

Sincerely,
--
Yakov

Stewart, Robert

Oct 28, 2011, 7:11:08 AM
to bo...@lists.boost.org
Alf P. Steinbach wrote:
>
> Option 3 means -- it requires, as far as I can see -- some
> abstraction that hides the narrow/wide representation so as to
> get source code level portability, which is all that matters
> for C++. It doesn't need to involve very much. Some typedefs,
> traits, references.
>
> For example, write a portable string literal like this:
>
> PS( "This is a portable string literal" )
[snip]

> The main drawback is IMO the need to use something like a PS
> macro for string and character literals, or a C++11 /user
> defined literal/.
> Windows programmers are used to that, writing _T("blah") all
> the time as if Windows 95 was still extant. So, considering
> that all that current labor is being done for no reward
> whatsoever, I think it should be no problem convincing
> programmers that writing a few characters more in order to get
> portable string literals, is worth it; it just needs exposure
> to examples from some authoritative source...

The problem with that approach is that existing, non-Windows, code must be painstakingly altered to introduce such manual portability constructs. If code was already written using the Microsoft facilities for portability, it's a relatively easy transition to make (s/_T/PS/, for example).

Regardless of authoritative examples, inertia is against your idea.

_____
Rob Stewart robert....@sig.com
Software Engineer using std::disclaimer;
Dev Tools & Components
Susquehanna International Group, LLP http://www.sig.com



Alf P. Steinbach

Oct 28, 2011, 7:17:06 AM
to bo...@lists.boost.org
On 28.10.2011 12:36, Yakov Galka wrote:
> On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach<
> alf.p.stein...@gmail.com> wrote:
>
>> On 27.10.2011 23:56, Peter Dimov wrote:
>>>
>>> The advantage of using UTF-8 is that, apart from the border layer that
>>> calls the OS (and that needs to be ported either way), the rest of the
>>> code is happily char[]-based.
>>
>> Oh.
>>
>> I would be happy to learn this.
>>
>> How do I make the following program work with Visual C++ in Windows, using
>> narrow character string?
>>
>>
>> <code>
>> #include<stdio.h>
>> #include<fcntl.h> // _O_U8TEXT
>> #include<io.h> // _setmode, _fileno
>> #include<windows.h>
>>
>> int main()
>> {
>> //SetConsoleOutputCP( 65001 );
>> //_setmode( _fileno( stdout ), _O_U8TEXT );
>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>> }
>> </code>
>>
>
> How will you make this program portable?

Well, that was *my* question.

The claim that this minimal "Hello, world!" program puts to the test is
that "the rest of the [UTF-8 based] code is happily char[]-based".

Apparently that is not so.


> The out-commented code is from my random efforts to Make It Work(TM).
>>
>> It refused.
>>
>
> This is because windows narrow-chars can't be UTF-8. You could make it
> portable by:
>
> int main()
> {
> boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
> }

Thanks, TIL boost::printf.

The idea of UTF-8 as a universal encoding seems now to be to use some
workaround such as boost::printf for each and every case where it turns
out that it doesn't work portably.

When every portability problem has been diagnosed and special-cased to
use functions that translate to/from UTF-8, and ignoring the
efficiency aspect of that, then UTF-8 just magically works, hurray.

E.g., if 'fopen( "rød.txt", "r" )' fails in the universal UTF-8 code,
then just replace with 'boost::fopen', or 'my_special_casing::fopen'.

However, with these workaround details made manifest, it is /much less/
convincing than the original general vague claim that UTF-8 just works.


[snip]


> You judge from a non-portable code point-of-view. How about:
>

> #include <cstdio>


> #include "gtkext/message_box.h" // for gtkext::message_box
>
> int main()
> {
> char buffer[80];
> sprintf(buffer, "The answer is %d!", 6*7);
> gtkext::message_box(buffer, "This is a title!", gtkext::icon_blah_blah,
> ...);
> }
>
> And unlike your code, it's magically portable! (thanks to gtk using UTF-8 on
> windows)

Aha. When you use a library L that translates in platform-specific ways
to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.

However, try to pass a `main` argument over to gtkext::message_box.

Then you have involved some /other code/ (namely the runtime library
code that calls 'main') that may not necessarily translate for you, and
in fact in Windows is extremely unlikely to translate for you.

Such code is prevalent.

Most code does not translate to/from UTF-8.


Cheers & hth., & thanks for mention of boost::printf,

- Alf


PS: With C++11 there is no longer any reason to use <cstdio> instead of
<stdio.h>, because <cstdio> no longer formally guarantees to not pollute
the global namespace (and in practice it has never honored its C++98
guarantee). The code above is a good example why <stdio.h> is preferable
-- it is too easy to write non-portable code with <cstdio>, such as
using unqualified sprintf (not to mention size_t!).

Yakov Galka

unread,
Oct 28, 2011, 7:31:40 AM10/28/11
to bo...@lists.boost.org
On Fri, Oct 28, 2011 at 13:17, Alf P. Steinbach <
alf.p.stein...@gmail.com> wrote:

> On 28.10.2011 12:36, Yakov Galka wrote:
>
>> On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach <
>> alf.p.stein...@gmail.com> wrote:


My point is that you cannot talk about things without comparison.


> The out-commented code is from my random efforts to Make It Work(TM).
>>
>>>
>>> It refused.
>>>
>>>
>> This is because windows narrow-chars can't be UTF-8. You could make it
>> portable by:
>>
>> int main()
>> {
>> boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
>> }
>>
>
> Thanks, TIL boost::printf.
>
> The idea of UTF-8 as a universal encoding seems now to be to use some
> workaround such as boost::printf for each and every case where it turns out
> that it doesn't work portably.
>

You pull things out of context. We should COMPARE the UTF-8 approach to
the wide-char-on-Windows / narrow-char-on-non-Windows approach. Your
approach involves using your own printf just as well:

#include "u/stdio_h.h" // u::CodingValue, u::printf, U
printf(U("Blåbærsyltetøy! 日本国 кошка!\n")); // ADL?
u::printf(U("Blåbærsyltetøy! 日本国 кошка!\n")); // or not ADL? depends on what
exactly U is.

but anyway you have to do O(N) work to wrap the N library functions you use.

Your approach is no way better.


> [...]


>
> [snip]
>
>> You judge from a non-portable code point-of-view. How about:
>>
>> #include <cstdio>
>>
>> #include "gtkext/message_box.h" // for gtkext::message_box
>>
>> int main()
>> {
>> char buffer[80];
>> sprintf(buffer, "The answer is %d!", 6*7);
>> gtkext::message_box(buffer, "This is a title!",
>> gtkext::icon_blah_blah,
>> ...);
>> }
>>
>> And unlike your code, it's magically portable! (thanks to gtk using UTF-8
>> on
>> windows)
>>
>
> Aha. When you use a library L that translates in platform-specific ways
> to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.
>
> However, try to pass a `main` argument over to gtkext::message_box.
>

See the argv explanation in
http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036

--
Yakov

Peter Dimov

unread,
Oct 28, 2011, 7:58:08 AM10/28/11
to bo...@lists.boost.org
Alf P. Steinbach wrote:

> How do I make the following program work with Visual C++ in Windows, using
> narrow character string?
>
> <code>
> #include <stdio.h>
> #include <fcntl.h> // _O_U8TEXT
> #include <io.h> // _setmode, _fileno
> #include <windows.h>
>
> int main()
> {
> //SetConsoleOutputCP( 65001 );
> //_setmode( _fileno( stdout ), _O_U8TEXT );
> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
> }
> </code>

Output to a console wasn't our topic so far (and is not one of my strong
points), but the specific problem with this program is that the embedded
literal is not UTF-8, as the warning C4566 tells us, so there is no way for
you to get UTF-8 in the output. (You should be able to set VC++'s code page
to 65001, but I don't think you can.)

int main()
{
printf( utf8_encode( L"кошка" ).c_str() );
}

This is not a practical problem for "proper" applications because Russian
text literals should always come from the equivalent of gettext and never be
embedded in code.

int main()
{
printf( gettext( "cat" ).c_str() );
}

So, yes, I admit that you can't easily write a portable application (or a
command-line utility) that has its Russian texts hardcoded, if that's your
point. But you can write a command-line utility that can take кошка.txt as
input and work properly, which is what I've been saying, and what sparked
the original debate (argv[1]).
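The utf8_encode used above is hypothetical. As a rough sketch of what such a helper might do -- assuming wchar_t holds whole code points, as on most non-Windows platforms; a real Windows version would instead call WideCharToMultiByte and handle UTF-16 surrogate pairs -- it could look like:

```cpp
#include <string>

// Hypothetical utf8_encode: wide string -> UTF-8 narrow string.
// Assumes each wchar_t is a full Unicode code point (no surrogate pairs).
std::string utf8_encode(const std::wstring& ws)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < ws.size(); ++i)
    {
        unsigned long c = static_cast<unsigned long>(ws[i]);
        if (c < 0x80) {                       // 1 byte: ASCII
            out += static_cast<char>(c);
        } else if (c < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With such a helper, utf8_encode(L"кошка") yields the UTF-8 bytes that printf can pass through under chcp 65001.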

Beman Dawes

unread,
Oct 28, 2011, 8:38:26 AM10/28/11
to bo...@lists.boost.org
Alf,

On Thu, Oct 27, 2011 at 5:12 PM, Alf P. Steinbach
<alf.p.stein...@gmail.com> wrote:
>...


> Thanks for that clarification of the current thinking at Boost.

>...

Please understand that Boost isn't a single library, but rather a
collection of 100 or so individual libraries. So there isn't any
single "current thinking at Boost" on any topic that has library or
application dependent aspects.

That said, Peter Dimov's replies do represent the thinking of many
Boost developers and library maintainers, including me :-)

--Beman

Peter Dimov

unread,
Oct 28, 2011, 8:38:28 AM10/28/11
to bo...@lists.boost.org
Alf P. Steinbach wrote:
> On 28.10.2011 12:36, Yakov Galka wrote:
> > This is because windows narrow-chars can't be UTF-8. You could make it
> > portable by:
> >
> > int main()
> > {
> > boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
> > }
>
> Thanks, TIL boost::printf.

No, I don't think that this works. The problem here is not the printf call,
it's the literal. When a char[] that does contain the proper UTF-8 text is
passed, printf works under chcp 65001.

In principle, you should still need to use the hypothetical boost::printf,
though, if you want the program to properly support arbitrary code pages
(not that the text above can be output in any code page other than 65001).

> When every portability problem has been diagnosed and special cased to use
> functions that translate to/from UTF-8 translation, and ignoring the
> efficiency aspect of that, then UTF-8 just magically works, hurray.
>
> E.g., if 'fopen( "rød.txt", "r" )' fails in the universal UTF-8 code, then
> just replace with 'boost::fopen', or 'my_special_casing::fopen'.

Yes, exactly. It's not a silver bullet, but... try coming up with a better
alternative.

Yakov Galka

unread,
Oct 28, 2011, 8:41:23 AM10/28/11
to bo...@lists.boost.org
On Fri, Oct 28, 2011 at 13:58, Peter Dimov <pdi...@pdimov.com> wrote:

> Alf P. Steinbach wrote:
>
> How do I make the following program work with Visual C++ in Windows, using
>> narrow character string?
>>
>> <code>
>> #include <stdio.h>
>> #include <fcntl.h> // _O_U8TEXT
>> #include <io.h> // _setmode, _fileno
>> #include <windows.h>
>>
>> int main()
>> {
>> //SetConsoleOutputCP( 65001 );
>> //_setmode( _fileno( stdout ), _O_U8TEXT );
>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>> }
>> </code>
>>
>
> Output to a console wasn't our topic so far (and is not one of my strong
> points), but the specific problem with this program is that the embedded
> literal is not UTF-8, as the warning C4566 tells us, so there is no way for
> you to get UTF-8 in the output. (You should be able to set VC++'s code page
> to 65001, but I don't think you can.)
>
> int main()
> {
> printf( utf8_encode( L"кошка" ).c_str() );
> }
>

You don't need to configure anything; in fact, you cannot do it properly
in VS. What you can do is:

1) don't use wide-char literals with non ascii characters
2) use UTF-8 literals for narrow-char.

All you need is to save the source as UTF-8 WITHOUT BOM. Works like a
charm on VS2005 and VS2010. Apparently it's portable. The IDE can detect
UTF-8 even without BOM ("☑ Auto-detect UTF-8 encoding without signature").


> This is not a practical problem for "proper" applications because Russian
> text literals should always come from the equivalent of gettext and never be
> embedded in code.
>

+1

Personally I'm happy with

printf( "Blåbærsyltetøy! 日本国 кошка!\n" );

writing UTF-8. Even if I cannot configure the console, I still can redirect
it to a file, and it will correctly save this as UTF-8. Preventing data-loss
is more important for me.

--
Yakov

Peter Dimov

unread,
Oct 28, 2011, 9:00:57 AM10/28/11
to bo...@lists.boost.org
Yakov Galka wrote:
> Personally I'm happy with
>
> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>
> writing UTF-8. Even if I cannot configure the console, I still can
> redirect
> it to a file, and it will correctly save this as UTF-8.

You can configure the console. Select Consolas or Lucida Console as the
font, then issue chcp 65001. chcp 65001 apparently breaks .bat files though.
:-)

Alf P. Steinbach

unread,
Oct 28, 2011, 9:34:37 AM10/28/11
to bo...@lists.boost.org

I think that means that I failed to communicate to you what I compared.

There was a claim that the UTF-8 based code should just work, but the
minimal hello world like code in my example does /not/ work.

Thus, it is a comparison between (1) reality, and (2) the claim, OK?


>> The out-commented code is from my random efforts to Make It Work(TM).
>>>
>>>>
>>>> It refused.
>>>>
>>>>
>>> This is because windows narrow-chars can't be UTF-8. You could make it
>>> portable by:
>>>
>>> int main()
>>> {
>>> boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
>>> }
>>>
>>
>> Thanks, TIL boost::printf.
>>
>> The idea of UTF-8 as a universal encoding seems now to be to use some
>> workaround such as boost::printf for each and every case where it turns out
>> that it doesn't work portably.
>>
>
> You pull things out of context. We should COMPARE the UTF-8 approach to the
> wide-char on windows narrow-char on non-windows approach. Your approach
> involves using your own printf just as well:
>
> #include "u/stdio_h.h" // u::CodingValue, u::printf, U
> printf(U("Blåbærsyltetøy! 日本国 кошка!\n")); // ADL?
> u::printf(U("Blåbærsyltetøy! 日本国 кошка!\n")); // or not ADL? depends on what
> exactly U is.

The relevant difference is in my opinion between

* re-implementing, e.g., the standard library to support UTF-8 (like
boost::printf; although I haven't tested the claim that it works for
the program we discussed, it is enough for me that it /could/ work), or

* wrapping it with some constant-time data conversions (e.g. u::printf).

The hello world program demonstrated that one or the other is necessary.

So, we can forget the earlier silly claim that UTF-8 just magically
works, and now really compare, for a simplest relevant program.

And yes, with the functionality that I sketched and coded up a demo of,
you get strong type checking and argument-dependent lookup. It is
however possible to design this in e.g. C-level ways where it would be
much less convenient. I think the opinions in the community may have
been influenced by one particularly bad such design, the [tchar.h]... ;-)

For a UTF-16 platform a printf wrapper can simply be like this:

    inline int printf( CodingValue const* format, ... )
    {
        va_list args;
        va_start( args, format );
        int const result = ::vwprintf( format->rawPtr(), args );
        va_end( args );
        return result;
    }

The sprintf wrapper that I used in my example is more interesting, though:

    inline int sprintf( CodingValue* buffer, size_t count, CodingValue const* format, ... )
    {
        va_list args;
        va_start( args, format );
        int const result = ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args );
        va_end( args );
        return result;
    }

    inline int sprintf( CodingValue* buffer, CodingValue const* format, ... )
    {
        va_list args;
        va_start( args, format );
        int const result = ::vswprintf( buffer->rawPtr(), size_t( -1 ), format->rawPtr(), args );
        va_end( args );
        return result;
    }

The problem that the above solves is that standard vswprintf is not a
simple wchar_t version of standard vsprintf. As I recall Microsoft's
[tchar.h] relies on a compiler-specific overload, but that approach does
not cut it for platform independent code. For wchar_t/char independent
code, one solution (as above) is two offer both signatures.

Note that these wrappers do not (and do not have to) do data conversion.

Whereas re-implementations for the UTF-8 scheme have to convert data.


> but anyway you have to do O(N) work to wrap the N library functions you use.

Not quite.

It is so for the UTF-8 scheme for platform independent things such as
standard library i/o, and it is so also for the native string scheme for
platform independent things such as standard library i/o.

But when you're talking about the OS API, then with the UTF-8 scheme you
need inefficient string data conversions and N wrappers, while with the
native string scheme no string data conversions and no wrappers are
needed. Only simple "get raw pointer" calls are needed, as illustrated
in my example. Those calls could even be made implicit, but I think it's
best to have them explicit in order to avoid unexpected effects.

This difference in conversion & wrapping effort was the reason that I
used both the standard library and the OS API in my original example.

The standard library call used a thin wrapper, as shown above, while the
OS API function (MessageBoxW) could be and was called directly.


> Your approach is no way better.

I hope to convince you that the native string approach is objectively
better for portable code, for any reasonable criteria, e.g.:


* Native encoded strings avoid the inefficient string data conversions
of the UTF-8 scheme for OS API calls and for calls to functions that
follow OS conventions.

* Native encoded strings avoid many bug traps such as passing a UTF-8
string to a function expecting ANSI, or vice versa.

* Native encoded strings work seamlessly with the largest amount of code
(Windows code and nix code), while the UTF-8 approach only works
seamlessly with nix-oriented code.


Conversely, points such as those above mean that the UTF-8 approach is
objectively much worse for portable code.

In particular, the UTF-8 approach violates the principle of not paying
for what you don't (need to or want to) use, by adding inefficient
conversions in all directions; it violates the principle of least
surprise (where did that gobbledygook come from?); and it violates the
KISS principle ("Keep It Simple, Stupid!") by forcing Windows
programmers to deal with 3 internal string encodings instead of just 2.


>>> You judge from a non-portable coed point-of-view. How about:
>>>
>>> #include<cstdio>
>>>
>>> #include "gtkext/message_box.h" // for gtkext::message_box
>>>
>>> int main()
>>> {
>>> char buffer[80];
>>> sprintf(buffer, "The answer is %d!", 6*7);
>>> gtkext::message_box(buffer, "This is a title!",
>>> gtkext::icon_blah_blah,
>>> ...);
>>> }
>>>
>>> And unlike your code, it's magically portable! (thanks to gtk using UTF-8
>>> on
>>> windows)
>>
>> Aha. When you use a library L that translates in platform-specific ways
>> to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.
>>
>> However, try to pass a `main` argument over to gtkext::message_box.
>
> See the argv explanation in
> http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036

I'm sorry, I don't see what's relevant there. You suggest there that
boost::program_options can be used if it is fixed to support UTF-8;
quote "she can use boost::program_options (assuming it's also changed to
follow the UTF-8 convention)". I think that suggestion is probably
misguided. For as far as I can see, boost::program_options does not
provide any way to obtain the undamaged command line in Windows (and
anyway that command line is UTF-16 encoded). Without a portable way to
obtain undamaged program arguments, portable support for parsing them
with this encoding or that encoding seems to me to be irrelevant.
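For what it's worth, recovering undamaged arguments on Windows is possible via the Win32 API. A hypothetical helper -- a sketch only, with a simple pass-through on other platforms -- might look like:

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>

// Hypothetical helper: recover the undamaged UTF-16 command line and
// convert each argument to UTF-8.
std::vector<std::string> utf8_args(int, char**)
{
    int argc = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
    std::vector<std::string> args;
    for (int i = 0; i < argc; ++i)
    {
        // First call measures the needed buffer (including the NUL).
        int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, 0, 0, 0, 0);
        std::string s(n > 0 ? n : 0, '\0');
        if (n > 0)
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, &s[0], n, 0, 0);
        if (!s.empty())
            s.resize(s.size() - 1);  // drop the trailing NUL
        args.push_back(s);
    }
    LocalFree(wargv);
    return args;
}
#else
// On most *nix systems argv is already in the locale's (often UTF-8)
// encoding; pass it through unchanged.
std::vector<std::string> utf8_args(int argc, char** argv)
{
    return std::vector<std::string>(argv, argv + argc);
}
#endif
```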

Anyway, where does this introduction of special cases end?

At every point where UTF-8 does not work, the suggested solution is to
add an inefficient data conversion and support that on all platforms.


Cheers & hth.,

- Alf


Peter Dimov

unread,
Oct 28, 2011, 9:48:05 AM10/28/11
to bo...@lists.boost.org
Alf P. Steinbach wrote:

> * wrapping
> it with some constant time data conversions (e.g. u::printf).

In this particular example, wrapping doesn't work, because wprintf is
broken. (At least I haven't been able to make it work.) You'll still need
the hypothetical boost::wprintf here.

Peter Dimov

unread,
Oct 28, 2011, 10:10:43 AM10/28/11
to bo...@lists.boost.org
Alf P. Steinbach wrote:

> This difference in conversion & wrapping effort was the reason that I used
> both the standard library and the OS API in my original example.

Using the OS API makes your program non-portable, so it's not clear what the
example is supposed to demonstrate. You may as well stick to wchar_t;
Windows 95 is ancient history and the whole wrapping effort is completely
unnecessary.

The portable version of your example would be something along the lines of:

#include "message_box.hpp"
#include <stdio.h>

int main()
{
char buffer[ 80 ];

sprintf( buffer, "The answer is %d!", 6*7 );

message_box( buffer, "Title", mb_icon_information );
}

where message_box has implementations for the various OSes the program
supports. On Windows, it will utf8_decode its arguments and call
MessageBoxW.
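The utf8_decode step mentioned here is likewise hypothetical. A portable sketch of the decoding it would perform -- assuming well-formed UTF-8 input and a wchar_t wide enough for the decoded code points; a real Windows implementation would call MultiByteToWideChar(CP_UTF8, ...) and produce UTF-16:

```cpp
#include <string>

// Hypothetical utf8_decode: UTF-8 narrow string -> wide string.
// Assumes well-formed input; no validation or surrogate-pair output.
std::wstring utf8_decode(const std::string& s)
{
    std::wstring out;
    for (std::string::size_type i = 0; i < s.size(); )
    {
        unsigned char b = static_cast<unsigned char>(s[i]);
        unsigned long c;
        int extra;
        if      (b < 0x80) { c = b;        extra = 0; }  // 1-byte sequence
        else if (b < 0xE0) { c = b & 0x1F; extra = 1; }  // 2-byte sequence
        else if (b < 0xF0) { c = b & 0x0F; extra = 2; }  // 3-byte sequence
        else               { c = b & 0x07; extra = 3; }  // 4-byte sequence
        ++i;
        // Fold in the continuation bytes (6 payload bits each).
        for (int k = 0; k < extra && i < s.size(); ++k, ++i)
            c = (c << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out += static_cast<wchar_t>(c);
    }
    return out;
}
```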

A localizable version would not embed readable texts:

#include "message_box.hpp"
#include "get_text.hpp"
#include <stdio.h>

int main()
{
char buffer[ 80 ]; // ignore buffer overflow for now
sprintf( buffer, get_text( "the_answer_is" ).c_str(), 6*7 );

message_box( buffer, get_text( "title" ), mb_icon_information );
}

Now get_text may return something in Chinese (UTF-8 encoded) and it will all
work.

It's also possible to use wchar_t for human-readable text throughout the
code base - this provides a layer of type safety. You'll have to replace
sprintf with swprintf then. Paths, however, are better kept as char[].

Alf P. Steinbach

unread,
Oct 28, 2011, 11:47:20 AM10/28/11
to bo...@lists.boost.org

This is interesting in a perverse sort of way.

In order to make Visual C++ produce UTF-8 encoded compiled narrow
strings, one must /lie/ to the compiler. The source code is UTF-8. And
one lies and tells the Visual C++ compiler that it's ANSI.

And in order to make g++ produce ANSI encoded compiled narrow strings,
one must /lie/ to the compiler. The source code is ANSI. And one lies
and tells the g++ compiler that it's UTF-8.

As I see it, there's something wrong here.

Notwithstanding the limitation that codepage 65001 is impractical in the
Windows command interpreter -- e.g. the 'more' command CRASHES.


>> This is not a practical problem for "proper" applications because Russian
>> text literals should always come from the equivalent of gettext and never be
>> embedded in code.
>
> +1

I find that a very narrow minded view.

Would you like to be the one telling Norwegian student Åshild Bjørnson
that you favor the notion that she should waste hours or days installing
Boost and some other nix-oriented library and use 'gettext', in order to
be able to display her name in her first C++ program?

That text representation and output in C++ has been designed (with your
not just willing but enthusiastic vote) to be so inherently complex that
it requires hours and days of effort just to display your name?


> Personally I'm happy with
>
> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>
> writing UTF-8. Even if I cannot configure the console, I still can redirect
> it to a file, and it will correctly save this as UTF-8. Preventing data-loss
> is more important for me.

I find it thoroughly disgusting to have to lie to your tools, and to
rely on an assumption that the tools will not wisen up in the future.

However, I concede the point that IF one is happy with output that's
encoded so that most Windows command line tools fail (e.g. `more`
crashes), and IF one is happy with lying to the compiler about the
source encoding, and IF one is happy assuming that the compiler won't
wisen up about encodings in a future version, then -- the UTF-8 scheme
allows literals with national language characters, not just A through Z.

However, those are pretty constricting conditions.


Cheers & hth.,

- Alf


Alf P. Steinbach

unread,
Oct 28, 2011, 11:47:57 AM10/28/11
to bo...@lists.boost.org
On 28.10.2011 15:00, Peter Dimov wrote:
> Yakov Galka wrote:
>> Personally I'm happy with
>>
>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>
>> writing UTF-8. Even if I cannot configure the console, I still can
>> redirect
>> it to a file, and it will correctly save this as UTF-8.
>
> You can configure the console. Select Consolas or Lucida Console as the
> font, then issue chcp 65001. chcp 65001 apparently breaks .bat files
> though. :-)

it breaks a hell of a lot more than batch files. try `more`.

cheers & hth.,

- Alf

Alf P. Steinbach

unread,
Oct 28, 2011, 11:59:01 AM10/28/11
to bo...@lists.boost.org
On 28.10.2011 16:10, Peter Dimov wrote:
> Alf P. Steinbach wrote:
>
>> This difference in conversion & wrapping effort was the reason that I
>> used both the standard library and the OS API in my original example.
>
> Using the OS API makes your program non-portable, so it's not clear what
> the example is supposed to demonstrate.

It demonstrates what I said.

That no data conversion is needed for calling API functions.

Your argument would to some extent make sense if you assumed that
the two worlds of OS-specific and portable code will always be
completely separate, never touching -- but I find that unrealistic.


> You may as well stick to
> wchar_t; Windows 95 is ancient history and the whole wrapping effort is
> completely unnecessary.

Sorry, that's meaningless to me.

It sounds like free association.


> The portable version of your example would be something along the lines of:
>
> #include "message_box.hpp"
> #include <stdio.h>
>
> int main()
> {
> char buffer[ 80 ];
> sprintf( buffer, "The answer is %d!", 6*7 );
>
> message_box( buffer, "Title", mb_icon_information );
> }
>
> where message_box has implementations for the various OSes the program
> supports. On Windows, it will utf8_decode its arguments and call
> MessageBoxW.

There is no need to add inefficient translation to the mix.


> A localizable version would not embed readable texts:
>
> #include "message_box.hpp"
> #include "get_text.hpp"
> #include <stdio.h>
>
> int main()
> {
> char buffer[ 80 ]; // ignore buffer overflow for now
> sprintf( buffer, get_text( "the_answer_is" ).c_str(), 6*7 );
>
> message_box( buffer, get_text( "title" ), mb_icon_information );
> }
>
> Now get_text may return something in Chinese (UTF-8 encoded) and it will
> all work.

This may conceivably make sense at some enterprise level.


> It's also possible to use wchar_t for human-readable text throughout the
> code base - this provides a layer of type safety. You'll have to replace
> sprintf with swprintf then. Paths, however, are better kept as char[].

Sorry, again I fail to discern the underlying thoughts.

It sounds like free association.


Cheers & sorry, no comprende,

- Alf

Peter Dimov

unread,
Oct 28, 2011, 12:36:13 PM10/28/11
to bo...@lists.boost.org
Alf P. Steinbach wrote:

> Would you like to be the one telling Norwegian student Åshild Bjørnson
> that you favor the notion that she should waste hours or days installing
> Boost and some other nix-oriented library and use 'gettext', in order to
> be able to display her name in her first C++ program?

No, of course not. Our original topic was Boost.Locale, not the first
program of a Norwegian student. But, consider the topic changed, and please
do let me know what you suggest said student should do.

Yakov Galka

unread,
Oct 29, 2011, 8:14:29 AM10/29/11
to bo...@lists.boost.org
On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach <
alf.p.stein...@gmail.com> wrote:

> On 28.10.2011 15:00, Peter Dimov wrote:
>
>> Yakov Galka wrote:
>>
>>> Personally I'm happy with
>>>
>>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>>
>>> writing UTF-8. Even if I cannot configure the console, I still can
>>> redirect
>>> it to a file, and it will correctly save this as UTF-8.
>>>
>>
>> You can configure the console. Select Consolas or Lucida Console as the
>> font, then issue chcp 65001. chcp 65001 apparently breaks .bat files
>> though. :-)
>>
>
> it break a hell of a lot more than batch files. try `more`.
>
> cheers & hth.,
>
> - Alf
>
>

So I tried to make YOUR approach work (i.e. use wchar_t):

Created a file with:

#include <cstdio>
int main() {
::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

saved as UTF-8 with BOM. Compiled with VS2005, windows XP.

M:\bin> a.exe
Blσbµrsyltet°y!

M:\bin> a.exe > a.txt
Contents of a.txt:
42 6C E5 62 E6 72 73 79 6C 74 65 74 F8 79 21 20

What happened to the Japanese and Russian? Where does the mojibake come
from? Maybe the compiler corrupted the string? Let's see, change to:

wchar_t s[] = L"Blåbærsyltetøy! 日本国 кошка!\n";
::wprintf( s );

Recompile, step into the debugger. No. It's your favorite, correct UTF-16
that's passed to wprintf. Same result. Let's try a European codepage:

M:\bin> chcp 1252
M:\bin> a.exe
Blåbærsyltetøy!

Somewhat better. But how do I get to see the whole string?

M:\bin> chcp 65001
M:\bin> a.exe
Blbrsyltety!

M:\bin> chcp 1200
Invalid code page

OK, let's drop the requirement that the user sees the string at all. Let's
restrict to a simpler case: a.exe writes unicode to stdout, b.exe reads it
from stdin and writes verbatim to a file. Here is program b.exe:

#include <stdio.h>   // _getws (Microsoft-specific)
#include <wchar.h>   // wcslen
#include <fstream>   // std::ofstream

int main() {
    wchar_t s[256];
    _getws(s);
    std::ofstream fout("out.txt", std::ios::binary);
    fout.write((const char*)s, 2*wcslen(s)); // I want to see what I really get
}

Compile, run.

M:\a> a.exe | b.exe

Independent of chcp I get:

42 00 6C 00 E5 00 62 00 E6 00 72 00 73 00 79 00 6C 00 74 00 65 00 74 00 F8
00 79 00 21 00 20 00

Why the hell is this lossy‽ Where IS my lovely Japanese? What am I doing
wrong⸘

Ah! it's IMPOSSIBLE with wprintf!

Let's try UTF-8 instead. Write the program as we've written it for 40 years,
even before UTF-8 and the whole wide-char crap was introduced†.

Open VS2005:

#include <stdio.h>
int main() {
    printf("Blåbærsyltetøy! 日本国 кошка!\n");
}

† I mean the C functions used. Of course we couldn't mix Japanese and
Russian back then.

Save in UTF-8 WITHOUT BOM. Compile to a-utf8.exe.

#include <stdio.h>   // gets
#include <string.h>  // strlen
#include <fstream>   // std::ofstream

int main() {
    char s[256];
    gets(s);
    std::ofstream fout("out.txt", std::ios::binary);
    fout.write(s, strlen(s));
}

Compile b-utf8.exe;

M:\> a-utf8.exe
Blåbærsyltetøy! 日本国 кошка!

Something is bad. [The user goes to the documentation/support. Alright, I
need UTF-8. This software is Unicode aware! Good, they care about their
customers!]:

M:\> chcp 65001
M:\> a-utf8.exe
Blåbærsyltetøy! 日本国 кошка!

Correct! (Ok, I see squares for the Japanese because I don't have a
monospace font for it, but copy/paste works correctly.)

M:\> a-utf8.exe > a.txt

a.txt:
42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC
E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21 0D 0A

Correct!

M:\> a-utf8.exe | b-utf8.exe
M:\> type out.txt
Blåbærsyltetøy! 日本国 кошка!

out.txt:
42 6C C3 A5 62 C3 A6 72 73 79 6C 74 65 74 C3 B8 79 21 20 E6 97 A5 E6 9C AC
E5 9B BD 20 D0 BA D0 BE D1 88 D0 BA D0 B0 21

It works! MAGIC! More importantly: ***It's the only way to make it work!***

⇒ What if it's automatic and the user cannot intervene to change the
codepage?
‽ If it's automatic, then you don't care how it's displayed in the console.
You will log it to a file anyway. The case of:
M:\> a-utf8.exe | b-utf8.exe
Works correctly independent of what the current codepage was set.

⟹ more doesn't work.
‽ Report the bug to Microsoft. UTF-8 is a documented codepage. Microsoft
itself encourages using either UTF-8 or UTF-16. Other 'ANSI' codepages
are unofficially deprecated.
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx says:

Note ANSI code pages can be different on different computers, or can
be changed for a single computer, leading to data corruption. For the most
consistent results, applications should use Unicode, such as UTF-8 or
UTF-16, instead of a specific code page.

Since you cannot set a UTF-16 codepage for the console, UTF-8 is your
only option from the above. Furthermore, if people pester Microsoft, we
will get more benefit (no pun intended) than by rewriting our code to
use some unknown encoding that is different on each platform.

--
Yakov

Peter Dimov

unread,
Oct 29, 2011, 10:41:03 AM10/29/11
to bo...@lists.boost.org
Alf P. Steinbach wrote, about chcp 65001:

> it break a hell of a lot more than batch files. try `more`.

Yes. Life isn't perfect. Incidentally, 'more' demonstrates once again the
superiority of UTF-8 (if it worked):

C:\Projects\testbed\tmp>dir
Volume in drive C has no label.
Volume Serial Number is 34C7-A38D

Directory of C:\Projects\testbed\tmp

29.10.2011 17:28 <DIR> .
29.10.2011 17:28 <DIR> ..
29.10.2011 17:25 0 Blåbærsyltetøy! 日本国 кошка!.txt
1 File(s) 0 bytes
2 Dir(s) 856,726,167,552 bytes free

C:\Projects\testbed\tmp>dir | more
Volume in drive C has no label.
Volume Serial Number is 34C7-A38D

Directory of C:\Projects\testbed\tmp

29.10.2011 17:28 <DIR> .
29.10.2011 17:28 <DIR> ..
29.10.2011 17:25 0 Blåbærsyltetoy! ??? ?????!.txt
1 File(s) 0 bytes
2 Dir(s) 856,726,167,552 bytes free

The "dir" command has no problem displaying arbitrary file names directly to
the console (presumably via WriteConsoleW), but once it has to write to a
file, it needs to convert to narrow and no code page other than 65001 can
express the above file name. (My default console code page is 437, which
doesn't even have ø. The Consolas font doesn't have glyphs for 日本国, but the
characters are present, just not displayable, which is why I could copy and
paste them here.)

It would've been nice for Microsoft to set all the narrow code pages to
UTF-8 in Windows NT (or Windows 64 bit, the other transition point), but
they didn't, so here we are.

Anders Dalvander

unread,
Oct 29, 2011, 10:59:55 AM10/29/11
to bo...@lists.boost.org
On 20:59, Alf P. Steinbach wrote:
> Would you like to be the one telling Norwegian student Åshild Bjørnson
> that you favor the notion that she should waste hours or days installing
> Boost and some other nix-oriented library and use 'gettext', in order to
> be able to display her name in her first C++ program?

Or you could use only ASCII in source code, encoding the strings in
UTF-8 manually using octal escape sequences.

#include <iostream>
int main()
{
std::cout << "\303\205shild Bj\303\270rnson\n";
}

Not that nice to the eye, but anyway...

Regards,
Anders Dalvander

--
WWFSMD?

Yakov Galka

unread,
Oct 29, 2011, 11:12:12 AM10/29/11
to bo...@lists.boost.org
On Sat, Oct 29, 2011 at 16:41, Peter Dimov <pdi...@pdimov.com> wrote:

> It would've been nice for Microsoft to set all the narrow code pages to
> UTF-8 in Windows NT (or Windows 64 bit, the other transition point), but
> they didn't, so here we are.


They can do it anytime. It won't break anything. You already cannot rely on
a specific narrow code page, and it can even be variable-length (e.g.
Shift-JIS). They don't do it intentionally (http://bit.ly/2Pdaa).

--
Yakov

Peter Dimov

Oct 29, 2011, 11:21:32 AM
to bo...@lists.boost.org
Yakov Galka wrote:
> On Sat, Oct 29, 2011 at 16:41, Peter Dimov <pdi...@pdimov.com> wrote:
>
> > It would've been nice for Microsoft to set all the narrow code pages to
> > UTF-8 in Windows NT (or Windows 64 bit, the other transition point), but
> > they didn't, so here we are.
>
> They can do it anytime. It won't break anything. You already cannot rely
> on
> a specific narrow code page, and it even can be variable-length (e.g.
> Shift-JIS). They don't do it intentionally (http://bit.ly/2Pdaa).

They can, but it will be a lot of pain in the short term. It will break all
programs that require a specific code page (such as Latin-1 or Shift-JIS)
and can afford to do so because all Windows installations in the country
use the same code page and hardly anyone uses file names outside it.

Yakov Galka

Oct 29, 2011, 11:33:33 AM
to bo...@lists.boost.org
On Sat, Oct 29, 2011 at 17:21, Peter Dimov <pdi...@pdimov.com> wrote:

> Yakov Galka wrote:
>
>> On Sat, Oct 29, 2011 at 16:41, Peter Dimov <pdi...@pdimov.com> wrote:
>>
>> > It would've been nice for Microsoft to set all the narrow code pages to
>> > UTF-8 in Windows NT (or Windows 64 bit, the other transition point), but
>> > they didn't, so here we are.
>>
>> They can do it anytime. It won't break anything. You already cannot rely
>> on
>> a specific narrow code page, and it even can be variable-length (e.g.
>> Shift-JIS). They don't do it intentionally (http://bit.ly/2Pdaa).
>>
>
> They can, but it will be a lot of pain in the short term. It will break all
> programs that require a specific code page (such as Latin-1 or Shift-JIS)
> and can afford to do so because all Windows installations in the country are
> on the same page and hardly anyone uses file names outside this code page.
>

OK, the problem here is not that it's not the default (we have a long way to
go before that) but that they don't even implement it as an option. I can't
even imagine how hard they had to fail in order to implement 'more' so that it
doesn't work with UTF-8. Maybe it's intentional? Who volunteers to RE
'more'?

--
Yakov

Alf P. Steinbach

Oct 29, 2011, 12:21:23 PM
to bo...@lists.boost.org
On 29.10.2011 14:14, Yakov Galka wrote:
> On Fri, Oct 28, 2011 at 17:47, Alf P. Steinbach<
> alf.p.stein...@gmail.com> wrote:
>
>> On 28.10.2011 15:00, Peter Dimov wrote:
>>
>>> Yakov Galka wrote:
>>>
>>>> Personally I'm happy with
>>>>
>>>> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>>>
>>>> writing UTF-8. Even if I cannot configure the console, I still can
>>>> redirect
>>>> it to a file, and it will correctly save this as UTF-8.
>>>>
>>>
>>> You can configure the console. Select Consolas or Lucida Console as the
>>> font, then issue chcp 65001. chcp 65001 apparently breaks .bat files
>>> though. :-)
>>>
>>
>> it break a hell of a lot more than batch files. try `more`.
>>
>>
> So I tried to make YOUR approach work (i.e. use wchar_t):

I am afraid that you are misrepresenting me a bit here.

But I am sure it is not intentional.

Let's walk through this.


> Created a file with:
>
> #include<cstdio>
> int main() {
> ::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
> }
>
> saved as UTF-8 with BOM. Compiled with VS2005, windows XP.

Except that <cstdio> is not guaranteed to place wprintf in the global
namespace (I commented on that before, better use <stdio.h>), that code
works OK in the sense of doing what you have specified should happen.

Which apparently is not what you think, heh.

You have specified a conversion to narrow characters using the C++
executable narrow character set, i.e. a conversion to Windows ANSI. It
surprises a lot of programmers that that's what 'wcout' does: a
NARROWING CONVERSION. It did surprise me at one time in the 1990's. I
was very disappointed. After that I have become more and more sure that
there was no design of the C++ iostreams, but that's another story...


> M:\bin> a.exe
> Blσbµrsyltet°y!

Yes -- that's what Windows ANSI Western, which you asked for, looks like
when it is presented with the original IBM PC character set, codepage
437. Switch to codepage to 1252, the codepage number for Windows ANSI,
to get the Windows ANSI result that you asked for to display properly.
Of course it will lack the Unicode-only characters:

<example>
P:\test> type jam.cpp
#include <cstdio>
int main() {
::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp
#include <cstdio>


int main() {
::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Bl�b�rsyltet�y!
P:\test> chcp 437
Active code page: 437

P:\test> jam
Blσbµrsyltet°y!
P:\test> chcp 1252
Active code page: 1252

P:\test> jam
Blåbærsyltetøy!
P:\test> _
</example>


[snip]


>
> M:\bin> chcp 1252
> M:\bin> a.exe
> Blåbærsyltetøy!
>
> Somewhat better. But how do I get to see the whole string?

Not with any single-byte-per-character encoding. ;-)

You can use UTF-8 or UTF-16 for the output.

UTF-8 is a bit problematic because the Windows support is really flaky.


[snip effort with wide text]


> Ah! it's IMPOSSIBLE with wprintf!

No no, you're jumping to conclusions.

The Microsoft runtime has special support for this at the C library
level, but unfortunately, as far as I know, not at the C++ level.

Still, since you're using 'wprintf', that's at the C level, so it's no
problem:


<example>
P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp
#include <stdio.h>
#include <io.h> // _setmode
#include <fcntl.h> // _O_U8TEXT

int main()
{


_setmode( _fileno( stdout ), _O_U8TEXT );

::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test> g++ jam.cpp
jam.cpp: In function 'int main()':
jam.cpp:7: error: '_O_U8TEXT' was not declared in this scope

P:\test> g++ jam.cpp -D __MSVCRT_VERSION__=0x0800

P:\test> a
Blåbærsyltetøy! 日本国 кошка!

P:\test> _
</example>

> Let's try UTF-8 instead.

[snip effort]

> It works! MAGIC! More importantly: ***It's the only way to make it work!***

See above.

Those statements are wrong in two important respects.

First wrongness, the Windows console window support for UTF-8 is really
really flaky, so that you get more or less arbitrary "errors". They seem
to be connected with timing or something. So UTF-8 is not good: I showed
above how to generate UTF-8 from wide char literals just to be exactly
comparable to your example code, and the big difference is that I did
not have to lie to the compiler and hope for the best. Instead, the code
I presented above is well-defined. The result, for my program and for
yours (since both output UTF-8) isn't well defined though -- it
depends somewhat on the phase of the moon in Seattle, or something.

Second wrongness, it's not the only way.

I started very near the top of this thread by giving a concrete example
that worked very neatly. It gets tiresome repeating myself. But as you
could see in that example, the end programmer does not have to deal with
the dirty platform-specific details any more than with all-UTF8.

---

And you absolutely don't want to work with codepage 65001 in the
console: it causes batch files and 'more' and pipes etc. to fail.

But, you may ask, what about Alf's program, then, it's the same for
heaven's sake?

Well, let's check:


<example>
P:\test> chcp 1252
Active code page: 1252

P:\test> a
Blåbærsyltetøy! 日本国 кошка!

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test>
</example>

He he. :-)

It works also with the more practical codepage 1252 in the console.

The reason is probably that it uses WriteConsole internally, but it
doesn't matter much how the runtime library accomplishes this.

On the other hand, as with much else from Microsoft, there are probably
hidden costs.

It is possible that invoking this C level support may wreak havoc at the
C++ iostreams level, so that a good solution may have to provide custom
iostream buffers working around the Microsoft bugs.


[snip about reporting one of the myriad console bugs, to Microsoft]


> Since you cannot set UTF-16 codepage for the console, UTF-8 is your only
> options from the said above.

No that's incorrect.

In my (limited) experience UTF-16 is more reliable for this.

However, UTF-16 as an external encoding feels sort of wrong, even if it
is very efficient for Japanese network traffic.


> Furthermore, if people will pester microsoft we
> will get more benefit (no pun intended) than rewriting our code to use some
> unknown encoding that is different on each platform.

I believe that could greatly ease the porting of *nix tools to Windows.


Cheers & hth.,

- Alf


Daniel James

Oct 29, 2011, 12:23:15 PM
to bo...@lists.boost.org
On Saturday, 29 October 2011, Peter Dimov wrote:
>
>
> The "dir" command has no problem displaying arbitrary file names directly
> to the console (presumably via WriteConsoleW), but once it has to write to
> a file, it needs to convert to narrow and no code page other than 65001 can
> express the above file name.
>

This is not that relevant to the wider issue, but wide streams will work
for console output if you first do this:

if (_isatty(_fileno(stdout))) _setmode(_fileno(stdout), _O_U16TEXT);
if (_isatty(_fileno(stderr))) _setmode(_fileno(stderr), _O_U16TEXT);

i.e. set the output mode to UTF-16 when writing to the console. This only
works for recent versions of Visual C++. Obviously doesn't fix piped output.

Peter Dimov

Oct 29, 2011, 12:59:16 PM
to bo...@lists.boost.org
Alf P. Steinbach wrote:

> But, you may ask, what about Alf's program, then, it's the same for
> heaven's sake?
>
> Well, let's check:
>
> <example>
> P:\test> chcp 1252
> Active code page: 1252
>
> P:\test> a
> Blåbærsyltetøy! 日本国 кошка!

Neat trick. Apparently, _O_U8TEXT switches to Unicode mode when stdout is a
console. Let me try...

| C:\Projects\testbed>chcp
| Active code page: 437
|
| C:\Projects\testbed>release\testbed.exe
| Blåbærsyltetøy! 日本国 кошка!

Yeah.

| C:\Projects\testbed>release\testbed.exe | more


| Blåbærsyltetøy! 日本国 кошка!

Well. You can't have everything. :-)

| C:\Projects\testbed>release\testbed.exe > testbed.txt
|
| C:\Projects\testbed>type testbed.txt


| Blåbærsyltetøy! 日本国 кошка!
|

| C:\Projects\testbed>chcp 65001
| Active code page: 65001
|
| C:\Projects\testbed>type testbed.txt
| Blåbærsyltetøy! 日本国 кошка!

Of course, chcp 65001 breaks everything and more. Not that more worked in
the first place. :-)

Alf P. Steinbach

Oct 29, 2011, 1:07:56 PM
to bo...@lists.boost.org
On 29.10.2011 18:23, Daniel James wrote:
> On Saturday, 29 October 2011, Peter Dimov wrote:
>>
>>
>> The "dir" command has no problem displaying arbitrary file names directly
>> to the console (presumably via WriteConsoleW), but once it has to write to
>> a file, it needs to convert to narrow and no code page other than 65001 can
>> express the above file name.
>>
>
> This is not that relevant to the wider issue, but wide streams will work
> for console output if you first do this:
>
> if (_isatty(_fileno(stdout))) _setmode(_fileno(stdout), _O_U16TEXT);
> if (_isatty(_fileno(stderr))) _setmode(_fileno(stderr), _O_U16TEXT);
>
> i.e. set the output mode to UTF-16 when writing to the console. This only
> works for recent versions of Visual C++. Obviously doesn't fix piped output.

Right.

But the added 'if's produce another problem, namely that redirection to
a file is prevented from working.


<example>
P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp


#include <stdio.h>
#include <io.h> // _setmode
#include <fcntl.h> // _O_U8TEXT

int main()
{
//_setmode( _fileno( stdout ), _O_U8TEXT );
if( _isatty( _fileno( stdout ) ) )
{
_setmode( _fileno( stdout ), _O_U16TEXT );


}
::wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> cl jam.cpp
jam.cpp

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test> jam >x

P:\test> type x
Bl�b�rsyltet�y!
P:\test> _
</example>


Without the added 'if's, and instead adding a Unicode BOM to the start
of the text, it works fine for redirection:


<example>
P:\test> chcp 65001
Active code page: 65001

P:\test> type jam.cpp


#include <stdio.h>
#include <io.h> // _setmode

#include <fcntl.h> // _O_U16TEXT

int main()
{
_setmode( _fileno( stdout ), _O_U16TEXT );
::wprintf( L"\uFEFF" L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

P:\test> cl jam.cpp
jam.cpp

jam.cpp(8) : warning C4428: universal-character-name encountered in source

P:\test> jam
Blåbærsyltetøy! 日本国 кошка!

P:\test> jam >x

P:\test> type x
Blåbærsyltetøy! 日本国 кошка!

P:\test> chcp 437
Active code page: 437

P:\test> type x
Blåbærsyltetøy! 日本国 кошка!

P:\test> _
</example>


UTF-8 is even more forgiving as an external format. You don't see the
BOM. Oh, I see that it's disappeared above, difficult to copy-paste, but
it's there in the direct output as a rectangle.


Cheers & hth.,

- Alf


Peter Dimov

Oct 29, 2011, 1:34:48 PM
to bo...@lists.boost.org
Alf P. Steinbach wrote:
> int main()
> {
> _setmode( _fileno( stdout ), _O_U16TEXT );
> ::wprintf( L"\uFEFF" L"Blåbærsyltetøy! 日本国 кошка!\n" );
> }

This produces a UTF-16 text file though. It works with "type", but would
probably confuse most other programs. And more.

C:\Projects\testbed>release\testbed.exe > testbed.txt

C:\Projects\testbed>type testbed.txt

Blåbærsyltetøy! 日本国 кошка!

C:\Projects\testbed>type testbed.txt | more
Blåbærsyltetoy! ??? ?????!

C:\Projects\testbed>cat testbed.txt
▒▒B l ▒ b ▒ r s y l t e t ▒ y ! ▒e,g▒V :♦>♦H♦:♦0♦!

Yakov Galka

Oct 30, 2011, 1:28:23 PM
to bo...@lists.boost.org
On Sat, Oct 29, 2011 at 18:21, Alf P. Steinbach <
alf.p.stein...@gmail.com> wrote:

> [...]


>
>>
>> M:\bin> chcp 1252
>> M:\bin> a.exe
>> Blåbærsyltetøy!
>>
>> Somewhat better. But how do I get to see the whole string?
>>
>
> Not with any single-byte-per-character encoding. ;-)
>

That's why ANSI codepages other than UTF-8 are crap: they're not suitable
for internationalization.


> UTF-8 is a bit problematic because the Windows support is really flaky.
>

It's a problem of Windows, not of UTF-8. Report it to Microsoft, demand UTF-8
support, and meanwhile develop workarounds that let people use UTF-8 portably.
In 20 years we may get working UTF-8 support. I understand that you don't give
a damn about what things will be like in 20 years, but I do care.


> Still, since you're using 'wprintf', that's at the C level, so it's no
> problem:
>

Congratulations! You found a WORKAROUND to properly support WIDE-CHAR, when
UTF-8 support is ALREADY THERE. But you know what? There's a similar
workaround to output UTF-8 when UTF-8 is not set for the console. Now
explain, how is this:

int main()
{
_setmode( _fileno( stdout ), _O_U8TEXT );
wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
}

M:\>chcp 1252
Active code page: 1252

M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n

better than this:

int main()
{
SetConsoleOutputCP(CP_UTF8);


printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}

M:\>chcp 1252
Active code page: 1252

M:\>a.exe
Blåbærsyltetøy! 日本国 кошка!\n

How will you explain to Åshild Bjørnson why she can use plain old printf on
the workstations at the university but needs to use all the w and L prefixes
(or your proprietary Unicode wrappers) on her private computer at home? 'w'
stands for Windows? Or perhaps you want to infect the non-Windows world with
wchar_t too?

> They seem to be connected with timing or something. So UTF-8 is not good: I
> showed above how to generate UTF-8 from wide char literals just to be
> exactly comparable to your example code,


I showed you how you can continue to use UTF-8, resulting in portable code
(modulo a call to SetConsoleOutputCP) which behaves the same as yours.


> and the big difference is that I did not have to lie to the compiler and
> hope for the best.


It's not lying. It's just not telling the truth. And in C++11 you won't
need it either:

int main()
{
SetConsoleOutputCP(CP_UTF8);
printf( u8"Blåbærsyltetøy! 日本国 кошка!\n" );
}

> Instead, the code I presented above is well-defined. The result, for my
> program and for yours (since both output UTF-8) isn't well defined though
> -- it depends somewhat on the phase of the moon in Seattle, or something.
>

What? It's well defined: both will write UTF-8 bytes to stdout. If you
redirect to a file, it's well defined. If you redirect to another program,
it's well defined. What may not be well defined is how the receiver
interprets this. It will break only when the receiver tries to convert the
data to UTF-16 if it doesn't know that it's UTF-8. But then it's again not
restricted to UTF-8. The problem is the same for any 'ANSI' encoding. This
is why standardizing on UTF-8 is important.

> Second wrongness, it's not the only way.
>

You don't have stdin and wstdin. stdin has a byte-oriented encoding and thus
the only way to transfer unicode data through it is with UTF-8. If you want
to use wprintf—good, the library will do the conversion for you. But it
still has to be translated to UTF-8. If you don't use UTF-8 you won't be
Unicode-compatible. If you're not Unicode compatible, that means you're
stuck in the 20th century.

⚠ The importance of Unicode is not only in multilingual support; it's
important even within one language such as English—“fiflffffiffl”… No 'ANSI'
non-UTF-8 codepage can encode these.

> I started very near the top of this thread by giving a concrete example
> that worked very neatly. It gets tiresome repeating myself. But as you
> could see in that example, the end programmer does not have to deal with
> the dirty platform-specific details any more than with all-UTF8.
>

She does. She needs to use your redundant u::sprintf when the
narrow-character STANDARD sprintf works just fine.

> It works also with the more practical codepage 1252 in the console.
>

My default is not 1252. Stop being Euro-centric. UTF-8 works with 1252 too
as shown above.

[...]


> In my (limited) experience UTF-16 is more reliable for this.
>

How is it more reliable?

--
Yakov

Yakov Galka

Oct 30, 2011, 1:38:59 PM
to bo...@lists.boost.org
On Sun, Oct 30, 2011 at 19:28, Yakov Galka <ybunga...@gmail.com> wrote:

> M:\>a.exe
> Blåbærsyltetøy! 日本国 кошка!\n
>

And don't tell me that it wasn't copied from the console. Here is a real
copy-paste, for your peace of mind:

M:\censored>chcp 1252
Active code page: 1252

M:\censored>a.exe
Blåbærsyltetøy! 日本国 кошка!

Yakov Galka

Oct 30, 2011, 2:54:24 PM
to bo...@lists.boost.org
On Fri, Oct 28, 2011 at 15:34, Alf P. Steinbach <
alf.p.stein...@gmail.com> wrote:

> [...]


> There was a claim that the UTF-8 based code should just work,


I can't recall anyone saying this. What people were saying is that it's the
most sane way to write portable code. And if the vendors hadn't been
resisting UTF-8 adoption, it would just work.


> [...]


> * re-implementing
> e.g. the standard library to support UTF-8 (like boost::printf, and
> although I haven't tested the claim that it works for the program we
> discussed, it is enough for me that it /could/ work), or
>
> * wrapping
> it with some constant time data conversions (e.g. u::printf).
>
> The hello world program demonstrated that one or the other is necessary.
>

My last mail demonstrated that we need neither of them on Windows. printf
just works.


> So, we can forget the earlier silly claim that UTF-8 just magically works,
> and now really compare, for a simplest relevant program.
>

Now we can recall this claim and continue to apply it to your silly claim
that wrapping everything is easier.

[...]


> For an UTF-16 platform a printf wrapper can simply be like this:
>
> inline int printf( CodingValue const* format, ... )
> {
> va_list args;
> va_start( args, format );
> return ::vwprintf( format->rawPtr(), args );
> }
>

Apparently we don't need it. In the Linux world, requesting that the user
use UTF-8 is legitimate; it's already the default almost everywhere. On some
non-Linux systems UTF-8 is the default too (Mac OS X?). On Windows we can
use narrow printf just fine.


> The sprintf wrapper that I used in my example is more interesting, though:
>
> inline int sprintf( CodingValue* buffer, size_t count, CodingValue
> const* format, ... )
> {
> va_list args;
> va_start( args, format );
> return ::vswprintf( buffer->rawPtr(), count, format->rawPtr(), args
> );
> }
>
> inline int sprintf( CodingValue* buffer, CodingValue const* format, ...
> )
> {
> va_list args;
> va_start( args, format );
> return ::vswprintf( buffer->rawPtr(), size_t( -1 ),
> format->rawPtr(), args );
> }
>

Oh! Thank you! You suggest wrapping each function that comes in two kinds...
You don't need to either wrap or re-implement sprintf for the UTF-8
approach. The whole point of UTF-8 is that it already works with most of
the existing narrow library functions (strlen, strstr, str*, std::string,
etc.). It's simpler, ah!?

> The problem that the above solves is that standard vswprintf is not a
> simple wchar_t version of standard vsprintf. As I recall Microsoft's
> [tchar.h] relies on a compiler-specific overload, but that approach does
> not cut it for platform independent code. For wchar_t/char independent
> code, one solution (as above) is to offer both signatures.
>

No such problems in the UTF-8 world.


>> but anyway you have to do O(N) work to wrap the N library functions you
>> use.
>>
>
> Not quite.
>
> It is so for the UTF-8 scheme for platform independent things such as
> standard library i/o, and it is so also for the native string scheme for
> platform independent things such as standard library i/o.
>

As we see it's the other way around...


> But when you're talking about the OS API, then with the UTF-8 scheme you
> need inefficient string data conversions


It's quite efficient. In fact, it has never been a bottleneck. Invoking the
OS usually triggers complex operations anyway. Moreover, even in the
non-English-speaking world, most of the text internal to programs is still
ASCII. UTF-8 saves space and saves cache usage, which compensates for the
conversion penalty. To make definite statements you must measure. Otherwise
it's premature optimization, if it's an optimization at all.

Also note that in a multi-threaded world with hierarchical memory,
computation becomes faster than memory access.

> and N wrappers, while with the native string scheme no string data
> conversions and no wrappers are needed.


The difference is in what you wrap: the standard interface or the proprietary
OS interface. We benefit more from wrapping the latter, as has been done
hundreds of times in every portable library that tries to accomplish
something beyond primitive file I/O. This is because you get a portable
library as a side product.


>
> Your approach is no way better.
>>
>
>
> I hope to convince you that the native string approach is objectively
> better for portable code, for any reasonable criteria, e.g.:
>
>
> * Native encoded strings avoid the inefficient string data conversions of
> the UTF-8 scheme for OS API calls and for calls to functions that follow OS
> conventions.
>

Stop calling it inefficient. If you store portable data somewhere, or
receive it through the network (as any serious application does today), you can't
avoid conversions. You just have to decide where you do them, closer to the
OS or further. Anyway, see above.

> * Native encoded strings avoid many bug traps such as passing a UTF-8
> string to a function expecting ANSI, or vice versa.
>

Yeah, and "multiple inheritance causes multiple abuse of multiple
inheritance"[1], as Microsoft said? UTF-8 avoids many bug traps such as
forgetting that UTF-16 is actually a variable-length encoding. EVERYBODY
knows that UTF-8 has vaaariiable-length codepoints.

> * Native encoded strings work seamlessly with the largest amount of code
> (Windows code and nix code), while the UTF-8 approach only works seamlessly
> with nix-oriented code.
>

Hmmm... I prefer the latter, just to avoid all the boilerplate wrappers for
what has been standard for years. And I'm a Windows programmer. Besides,
how will you return Unicode from std::exception::what() if not as UTF-8?

> Conversely, points such as those above mean that the UTF-8 approach is
> objectively much worse for portable code.
>

Since I'm tired of repeating the same again and again, see "Using the
native encoding" in
http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036

> In particular, the UTF-8 approach violates the principle of not paying for
> what you don't (need to or want to) use


UTF-16 violates the principle that you don't pay for what you don't use: if
most of your text is ASCII (which is true for internal text even in
non-English countries), you don't want to waste twice as much memory.


> , by adding inefficient conversions in all directions;


Again? seekg(0) and read(). You'll have to do conversions anyway, e.g. when
you read from a file. You don't store the native encoding in a portable file,
do you?


> [...] and it violates the KISS principle ("Keep It Simple, Stupid!",


> forcing Windows programmers to deal with 3 internal string encodings
> instead of just 2).


If you're working with 2 encodings, you're doing something terribly wrong.
Seriously, it looks like you're still living in the 20th century. You shall
not use ANSI encodings (other than UTF-8) on Windows because they don't
work with Unicode. They are mostly deprecated. Microsoft encourages you to
use either UTF-8 or UTF-16 (
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx
).

Now, assuming you stopped using legacy 'ANSI' encodings, you're left with only
UTF-16 (internal) and UTF-8 (external). Replace internal UTF-16 with UTF-8,
and you're left with only ONE encoding used for EVERYTHING, internal and
external. UTF-16 at OS calls doesn't count as it's not stored anywhere
(you're not 'dealing' with it).

[1] From some C# book by Microsoft I glanced at a few years ago.

--
Yakov

Alf P. Steinbach

Oct 30, 2011, 10:54:43 PM
to bo...@lists.boost.org
On 30.10.2011 18:28, Yakov Galka wrote:
> On Sat, Oct 29, 2011 at 18:21, Alf P. Steinbach<
>> Upthread, Yakov Galka wrote:
>>>
>>> Somewhat better. But how do I get to see the whole string?
>>
>> Not with any single-byte-per-character encoding. ;-)
>
> That's why ANSI codepages other than UTF-8 are crap, they're not suitable
> for internationalization.

Nobody has suggested using Windows ANSI for internationalization.

So your use of the four-letter word "crap" is, so to speak, wasted.


>> UTF-8 is a bit problematic because the Windows support is really flaky.
>>
>
> It's a problem of Windows, not of UTF-8. Report it to Microsoft, demand UTF-8
> support, and meanwhile develop workarounds that let people use UTF-8 portably.
> In 20 years we may get working UTF-8 support. I understand that you don't give
> a damn about what things will be like in 20 years, but I do care.

Uh, four letter word again. I suggest reserving them for where they
suitably describe reality. E.g., I used a four letter word once in this
discussion, namely "hell of a lot more" about the Windows console bugs.

By the way, I can assure you that telepathy does not work:

the claimed insight into my motivations etc. is incorrect (making such a
claim is also an invalid form of rhetoric, but that's less important).


>> Still, since you're using 'wprintf', that's at the C level, so it's no
>> problem:
>
> Congratulations! You found a WORKAROUND to properly support WIDE-CHAR, when
> UTF-8 support is ALREADY THERE.

Please reserve all uppercase for macro names.

And no, I have so far not had the pleasure of learning anything
technical from this thread, unless you count the lie-to-g++ trick
applied to the Visual C++ compiler, but that's more psychological.

I think that this one-sided learning (that various aspects of reality have
apparently not been well known to Boosters) means that the review process at
Boost in this case probably did not involve the right kind of critical
people knowledgeable in the domain.


> But you know what? There's a similar
> workaround to output UTF-8 when UTF-8 is not set for the console. Now
> explain, how is this:
>
> int main()
> {
> _setmode( _fileno( stdout ), _O_U8TEXT );
> wprintf( L"Blåbærsyltetøy! 日本国 кошка!\n" );
> }
>
> M:\>chcp 1252
> Active code page: 1252
>
> M:\>a.exe
> Blåbærsyltetøy! 日本国 кошка!\n
>
> better than this:
>
> int main()
> {
> SetConsoleOutputCP(CP_UTF8);
> printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
> }

The first program, with wide string literal, does not require you to lie
to the Visual C++ compiler about the source code encoding.

The second program does require you to lie to the compiler.

Hence, (1) wide string literals with non-ASCII characters will be
mangled, cutting off use of an otherwise well defined language feature,
(2) a later version of the compiler may be able to infer the UTF-8
encoding in spite of lacking BOM, then mangling the text, (3) you have
to invoke inefficient data conversions for any use of functions that
adhere to Windows conventions, which includes most Windows libraries and
of course the Windows API -- e.g., try MessageBox, (4) you force Windows
programmers to deal with 3 text encodings (ANSI, UTF-8 and UTF-16)
instead of just 2 (ANSI and UTF-16), and (5) by "overloading" the narrow
character strings with two main character encodings, you make it easy to
introduce encoding-related bugs which can only be found by laborious
run-time testing.

So, it is an inefficient hack that can stop working, that cuts off a
well defined language feature, forces complexity and attracts bugs.

In my opinion it is not a good idea to base a Boost library on an
inefficient hack that can stop working and that cuts off a language
feature that's much used in Windows, and that on top of that forces
complexity and attracts bugs that can only be found by testing.


[snip]


> How will you explain to Åshild Bjørnson why she can use plain old printf on
> the workstations at the university but needs to use all the w and L prefixes
> (or your proprietary Unicode wrappers) on her private computer at home?

I would not, since that's not the case.

Also, I do not have any proprietary wrappers, that's also incorrect.


> 'w' stands for windows?

AFAIK "w" has not appeared in this thread, unless you're thinking of the
standard library's wprintf etc. I do not know what it otherwise stands
for or is. Note that using wprintf or "L" literals directly is not
portable, so if that's what you thinking of then it's a non-issue.


> Or perhaps you want to infect the non-windows world with
> wchar_t too?

I am baffled by your assumption that wchar_t is not used at all in the
*nix world.

And I am also baffled by your lack of understanding of the scheme I have
described many times. So for the record, I have not been talking about
using an unnatural representation for the platform at hand. Instead I
have argued for the opposite, namely using the natural encoding for the
platform -- which to my mind is much of what C++ is all about: diversity,
adaptation & raw efficiency, and, instead of the Java idea of binary-level
portability, C++-like efficient (but less convenient) source-code-level
portability.

For that matter I'm also baffled by the attack, with a four-letter word, on
Windows ANSI for internationalization, which is impossible and so is not
done; it is a non-existent scheme you attacked there.


[snip]


> It's not lying. It's just not telling the truth.

To lie is to intentionally make someone believe something that one
thinks one knows is not true. One can lie by stating the truth. And in
this case, one lies by omitting a crucial fact (namely the BOM).


> And in C++11 you won't
> need it either:
>
> int main()
> {
> SetConsoleOutputCP(CP_UTF8);
> printf( u8"Blåbærsyltetøy! 日本国 кошка!\n" );
> }

Yes, this is indeed a point in favor of the UTF-8 scheme: that C++11
partially supports it.

Knowledge of the encoding is however discarded: the end result is just
an array of 'char', which unfortunately, on the Windows platform, by
convention is expected to be encoded as ANSI... -> bugs.


>> Instead, the code I presented above is well-defined. The result, for my
>> program and for yours (since both output UTF-8) isn't well defined though
>> -- it depends somewhat on the phase of the moon in Seattle, or something.
>>
>
> What? It's well defined: both will write UTF-8 bytes to stdout. If you
> redirect to a file, it's well defined. If you redirect to another program,
> it's well defined. What may not be well defined is how the receiver
> interprets this. It will break only when the receiver tries to convert the
> data to UTF-16 if it doesn't know that it's UTF-8. But then it's again not
> restricted to UTF-8. The problem is the same for any 'ANSI' encoding. This
> is why standardizing on UTF-8 is important.

No, I was talking about the console window support. A console window
will itself often partially mangle UTF-8 output, in particular the first
letter. At least it has done that when I have tested out the examples
for this thread. However, supporting UTF-8 more directly with e.g. a
SetConsoleOutputCP call appears to work for direct presentation.


>> Second wrongness, it's not the only way.
>>
> You don't have stdin and wstdin. stdin has a byte-oriented encoding and thus
> the only way to transfer unicode data through it is with UTF-8. If you want
> to use wprintf—good, the library will do the conversion for you. But it
> still has to be translated to UTF-8. If you don't use UTF-8 you won't be
> Unicode-compatible. If you're not Unicode compatible, that means you're
> stuck in the 20th century.

I am not sure what you're arguing here. The bit about "the only way" is
technically wrong. However, I /think/ what you're trying to communicate
is that UTF-8 is good as a kind of universal external encoding.

And if so, then I wholeheartedly agree.

However, we have been discussing internal text representation.


> ⚠ The importance of Unicode is not only in multilingual support, it's
> important even within one language such as English—“fiflffffiffl”… No 'ANSI'
> non-UTF-8 codepage can encode these.

Yes, you have words like "maneuver", which properly is spelled with an
oe contraction that I once mistakenly thought was a Norwegian "æ"!


>> I started very near the top of this thread by giving a concrete example
>> that worked very neatly. It gets tiresome repeating myself. But as you
>> could see in that example, the end programmer does not have to deal with
>> the dirty platform-specific details any more than with all-UTF8.
>
> She does. She needs to use your redundant u::sprintf when the
> narrow-character STANDARD sprintf works just fine.

Oh, the standard sprintf starts yielding incorrect results as soon as
some ANSI text has sneaked into the mix, or when Visual C++ 12 (say) has
discovered that your BOM-less source code is UTF-8 encoded.

With something like u::sprintf one is to some extent protected by having
the encoding statically type-checked.

You can say that with C++ compared to C, more and stronger static type
checking is a large part of what C++ is all about. ;-)


>> It works also with the more practical codepage 1252 in the console.
>
> My default is not 1252. Stop being Euro-centric. UTF-8 works with 1252 too
> as shown above.
>
> [...]
>> In my (limited) experience UTF-16 is more reliable for this.
>
> How it's more reliable?

A console window will in some cases mangle the first character of UTF-8
output. I don't know why. And the [cmd.exe] "/u" option for supporting
Unicode in pipes is reportedly UTF-16 (disclaimer: I haven't used it).


Cheers & hth.,

- Alf

PS: Sorry that I don't have time to answer all responses.

Alf P. Steinbach

unread,
Oct 30, 2011, 10:59:34 PM10/30/11
to bo...@lists.boost.org
On 30.10.2011 18:38, Yakov Galka wrote:
> On Sun, Oct 30, 2011 at 19:28, Yakov Galka<ybunga...@gmail.com> wrote:
>
>> M:\>a.exe
>> Blåbærsyltetøy! 日本国 кошка!\n
>>
>
> And don't come upon me that it wasn't copied from the console. Here is a
> real copy-pasta for your calm:
>
> M:\censored>chcp 1252
> Active code page: 1252
>
> M:\censored>a.exe
> Blåbærsyltetøy! 日本国 кошка!

Hey, have I ever indicated that I don't believe you?

He he.

I think what that indicates is that console windows and some other tools
apparently use statistical methods to infer the encoding. There was,
once, an infamous bug in that functionality. Which meant that when you
wrote a particular sentence about George Bush in Notepad, and saved as
Unicode, and reloaded the file, then Notepad told you he's a liar... :-)


Cheers & hth.,

- Alf


bugpower

unread,
Oct 31, 2011, 1:18:01 PM10/31/11
to bo...@lists.boost.org
Alf, All,

What replies seem to be missing here is that what you call the "least
surprise" behavior of the code with argument of main(), is simply incorrect
from the software engineering point of view. Let me explain:

> 3. the most natural sufficiently general native encoding, 1 or 2
> depending on the platform that the source is being built for.

Now, when accepting filename from the user's command line on Windows, it is
simply not possible to use narrow-string version of main(). Your code cannot
enforce your user to limit his input to characters representable in the
current ANSI codepage. If the command line parameter is a filename as in the
example you suggested, you cannot tell them "never double click on some
files" (if a program used in a file association). Supporting is always
better than white-listing, so the only acceptable way of using command line
parameter which is a filename on windows is with UTF-16 version - _tmain().

Then, proceed as Artyom explained. The surprise is then justified - it
prevented a hard-to-spot bug. My preference on Windows though would be
different (and not due to religious reasons) - convert all API-returned
strings to UTF-8 as soon as possible and forget about encoding issues for
good. See
http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful

--
View this message in context: http://boost.2283326.n4.nabble.com/Silly-Boost-Locale-default-narrow-string-encoding-in-Windows-tp3945105p3956482.html
Sent from the Boost - Dev mailing list archive at Nabble.com.

Alf P. Steinbach

unread,
Nov 1, 2011, 11:03:23 AM11/1/11
to bo...@lists.boost.org
On 31.10.2011 18:18, bugpower wrote:
> Alf, All,
>
> What replies seem to be missing here is that what you call the "least
> surprise" behavior of the code with argument of main(), is simply incorrect
> from the software engineering point of view. Let me explain:
>
>> 3. the most natural sufficiently general native encoding, 1 or 2
>> depending on the platform that the source is being built for.
>
> Now, when accepting filename from the user's command line on Windows, it is
> simply not possible to use narrow-string version of main().

Well, there are three aspects of that claim:

1 The limitations of `main` in Windows.

Regarding aspect (1), the C++ Standard does describe the `main`
arguments as "MBCS" strings, meaning they can (should) be encoded with
possibly two or more bytes per character, as in for example UTF-8, which
to me is strongly implied. However, the Windows convention for C++
executable narrow character set predates the C++ standard by a long
shot, and even predates the C standard, and is Windows ANSI. And that
convention is /very/ deeply embedded, not only in the runtime library
implementations but e.g. in how Visual C++ translates string literals.

2 What you're trying to communicate.

Regarding aspect (2), more quoted concrete context could help make it
clearer to readers what you're trying to say.

I'm not a telepath. But it does sound like you're arguing against a
straw man of your own devising. As if someone had argued for using
ANSI-encoded arguments in general, or as some solution of i18n.

So, I will put my initial remark above, more strongly:

Please always /quote/ what you're referring to.

Especially when you are offering something that sounds as an argument
against something, then please /quote/ what you're referring to.

3 The literal claim that "it is simply not possible to use
narrow-string version of main()".

Regarding aspect (3), this claim is incorrect.

However, many people think that one has to use a non-standard startup
function like WinMain, that one has to ditch some parts of the C++
standard as soon as one does Windows. So, some basic technical facts:

The GNU toolchain (the g++ compiler) happily accepts a standard `main`
startup function without further ado, regardless of Windows subsystem.

The Microsoft toolchain (the Visual C++ compiler and linker), however,
is less adept at recognizing your startup function as such. So with the
MS toolchain you have to specify the startup function explicitly if
you're building a GUI subsystem program and want a standard `main`. The
relevant linker options: "/entry:mainCRTStartup /subsystem:windows".

---

Finally, note how I had to cover a lot of bases and spend a lot of time
responding to your single little sentence.

That's because that sentence was very *unclear* and *misleading*, and,
given the next sentence, quoted below, I hope it was not so by design.


> Your code cannot
> enforce your user to limit his input to characters representable in the
> current ANSI codepage.

Ignoring the misleading "your", and responding to the technical content
only:

the previous sentence talked about `main` arguments, and this following
sentence talks about "input", so it seems that you are confusing two
different aspects that have very different behaviors in Windows.

In Windows the program arguments are always passed to the process as a
single UTF-16 encoded command line string, available via the API
function GetCommandLine.

For the command line it is therefore meaningless to talk about
restricting the user.

Standard /input/, OTOH, is always passed via some narrow character
encoding, which does not include UTF-16, and which by convention is
neither ANSI nor UTF-16 but the extraordinarily impractical OEM codepage
(on an English PC that codepage is the original IBM PC character set).
Happily it is possible to change the narrow character encoding used for
input. For example, it can be changed to UTF-8 as the external encoding.
This is called the "active codepage" in a console window, and it can also
be changed by the user, e.g. with the commands 'mode' and 'chcp'.

The dangers of selecting UTF-8 as active codepage in a command
interpreter console window, have been discussed else-thread; in short,
Microsoft has a large number of ridiculous bugs in their support.

But that discussion also showed that it's (very probably) OK under
program control.


> If the command line parameter is a filename as in the
> example you suggested, you cannot tell them "never double click on some
> files" (if a program used in a file association).

What example?

Please always quote what you refer to, and quote enough of that context:
don't be ambiguous, don't leave it to readers to infer a context based
on your possibly wrong understanding of it.

Anyway, text passed as `main` arguments can be e.g. the user's name,
which is not necessarily a filename.


> Supporting is always
> better than white-listing, so the only acceptable way of using command line
> parameter which is a filename on windows is with UTF-16 version - _tmain().

Oh dear.

Are you seriously suggesting using `_tmain` to keep compatible with
Windows 9x? Note that for Windows 9x, `_tmain` maps to standard `main`.
And note that in Windows, `main` has Windows ANSI-encoded arguments...

`_tmain` is a Microsoft macro that helped support compatibility with
Windows 9x before the Layer for Unicode was introduced in 2001.

`_tmain` maps to narrow character standard `main` or wide character
non-standard `wmain` depending on the `_UNICODE` macro symbol.

`_tmain` was an abomination even in its day, and today there are no
reasons whatsoever to obfuscate the code that way.


> Then, proceed as Artyom explained. The surprise is then justified - it
> prevented a hard-to-spot bug. My preference on Windows though would be
> different (and not due to religious reasons) - convert all API-returned
> strings to UTF-8 as soon as possible and forget about encoding issues for
> good.

No, it does not let you forget about encoding issues.

Rather it introduces extra bug attractors, since you have then
overloaded the meaning of char-based text. By convention in Windows and
in most existing Windows code, char is ANSI-encoded, so other code will
expect ANSI encoding from the UTF-8 based code, which will tend to
introduce bugs. And other code will produce ANSI encoded text to the
UTF-8 based code, which will tend to introduce bugs. Thus adding another
possible encoding is absolutely not a good idea wrt. bugs.

And you're adding inefficiency for all the myriad internal conversions.

And you're adding either an utterly silly restriction to English A-Z in
literals, or, with let's-lie-to-the-compiler BOM-less UTF-8 source served
to the Visual C++ compiler, a requirement that the wide character literal
language feature not be used, plus a hope for the best with respect to
how smart later versions of the compiler will be.

And most Windows libraries abide by Windows conventions, so it means
extra work for supporting most library code. I.e. O(n) work for writing
inefficient data converting wrappers for an unbounded set of functions
instead of just O(1) work for writing efficient pointer type converting
wrappers for a fixed set of functions. Think about it.

As far as I know there is not *one single technical problem* that the
all-UTF-8 scheme solves. I.e., AFAIK from a purely technical POV it's dumb.


> See
> http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful

Hm, was that an associative reference?

Let me quote from the question:


<quote>
For example, try to create file names in Windows that include these
characters; try to delete these characters with a "backspace" to see how
they behave in different applications that use UTF-16. I did some tests
and the results are quite bad:

- Opera has problems editing them (deleting requires 2 presses on
  backspace).
- Notepad can't deal with them correctly (deleting requires 2 presses
  on backspace).
- File name editing in Windows dialogs is broken (deleting requires 2
  presses on backspace).
- All QT3 applications can't deal with them - they show two empty
  squares instead of one symbol.
- Python encodes such characters incorrectly when used directly:
  u'X' != unicode('X','utf-16') on some platforms when X is a character
  outside the BMP.
- Python 2.5 unicodedata fails to get properties of such characters
  when Python is compiled with UTF-16 Unicode strings.
- StackOverflow seems to remove these characters from the text if
  edited directly as Unicode characters (these characters are shown
  using HTML Unicode escapes).
- WinForms TextBox may generate an invalid string when limited with
  MaxLength.
</quote>


Here the poster lists concrete examples of how many common applications
already have bugs in their Unicode handling.

Showing by example that Unicode is tricky to get right.

Is it then a good idea to needlessly, and at great cost, add further
confusion about whether narrow characters are encoded as ANSI or UTF-8?


Cheers & hth.,

- Alf

