
Why does this UTF-8 conversion code work with Visual C++ but not with g++?


Alf P. Steinbach

May 13, 2017, 2:52:22 AM
I'm working on a (raw C++) minimal library I've called “stdlib”, a
portable wrapper for the ordinary C++ standard library that

• sets up working Unicode based console i/o for the standard streams,
in particular so they'll work for international text in Windows,
• adds necessary defines for <math.h> to get M_PI etc,
• provides functionality-area headers, e.g. all of i/o,

etc.

Because I realized that this is more fundamental than the Expressive C++
stuff.

For example, the following should work fine with general Unicode text in
Windows:

#include <stdlib/iostream.hpp>
#include <stdlib/string.hpp>
using namespace std;

auto main() -> int
{
cout << "Hi, what’s your name? ";
string name;
getline( cin, name );
cout << "Pleased to meet you, " << name << "!" << endl;
}

The header wrappers used here install custom iostream buffers that do
UTF-8 / UTF-16 conversion via std::codecvt_utf8_utf16<wchar_t>, and
access the Windows console Unicode API directly without dragging in the
<windows.h> header.
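
As a rough illustration of the “custom iostream buffer” idea (not the library's actual code), a stream buffer can collect the bytes written to a stream and hand them to a sink when the stream is flushed. In the real thing the sink would presumably do the UTF-8 to UTF-16 conversion and call the console API; here it is just a callback. The class name `Forwarding_buffer` and the sink signature are invented for this sketch:

```cpp
#include <cassert>      // assert
#include <cstddef>      // std::size_t
#include <functional>   // std::function
#include <ostream>      // std::ostream, which the buffer gets attached to
#include <streambuf>    // std::streambuf
#include <string>       // std::string
#include <utility>      // std::move

// Hypothetical sketch: collects output bytes and forwards them on flush.
class Forwarding_buffer
    : public std::streambuf
{
    std::function<void( std::string const& )>   sink_;
    std::string                                 pending_;

protected:
    // Called for single characters, since no put area is set up.
    auto overflow( int_type const ch ) -> int_type override
    {
        if( !traits_type::eq_int_type( ch, traits_type::eof() ) )
        {
            pending_ += traits_type::to_char_type( ch );
        }
        return traits_type::not_eof( ch );
    }

    // Called for runs of characters, e.g. string literals.
    auto xsputn( char const* const s, std::streamsize const n )
        -> std::streamsize override
    {
        pending_.append( s, static_cast<std::size_t>( n ) );
        return n;
    }

    // Called on flush: hand the collected bytes to the sink.
    auto sync() -> int override
    {
        if( !pending_.empty() )
        {
            sink_( pending_ );
            pending_.clear();
        }
        return 0;
    }

public:
    explicit Forwarding_buffer( std::function<void( std::string const& )> sink )
        : sink_( std::move( sink ) )
    {
        assert( static_cast<bool>( sink_ ) );   // A sink is required.
    }
};
```

Such a buffer is attached with `std::ostream out( &buffer );`, or installed in an existing stream with `rdbuf`. Nothing reaches the sink until the stream is flushed.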

And it works nicely with Visual C++, but with g++ I get Chinese or
whatever it is (garbage?), showing in the console as just squares:

[C:\my\dev\libraries\stdlib\examples\hello_world]
> g++ hello_world.cpp && a
䠀椀Ⰰ 眀栀愀琀ᤠ猀 礀漀甀爀 渀愀洀攀㼀 my ÆØÅ-input
倀氀攀愀猀攀搀 琀漀 洀攀攀琀 礀漀甀Ⰰ 洀礀 였�씀ⴀ椀渀瀀甀琀℀਀
[C:\my\dev\libraries\stdlib\examples\hello_world]
> cl hello_world.cpp /Feb /wd4373 && b
hello_world.cpp
Hi, what’s your name? my ÆØÅ-input
Pleased to meet you, my ÆØÅ-input!

[C:\my\dev\libraries\stdlib\examples\hello_world]
> _

With wide text i/o the thing works also with g++, so it's not the
no-windows.h-binding to the API that's at fault, hence my strong
assumption that it's the UTF-8 / UTF-16 conversion.
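
One way to probe that assumption is to compare the garbage with what the expected text would look like if each 16-bit code unit had its two bytes swapped. The sketch below is only an illustration with an invented function name: 'H' is U+0048, and byte-swapped it becomes U+4800, which is the '䠀' that starts the g++ output above. Likewise the ’ (U+2019) becomes U+1920.

```cpp
#include <cassert>  // assert
#include <string>   // std::u16string

// Swaps the two bytes of one UTF-16 code unit.
constexpr auto byte_swapped( char16_t const unit )
    -> char16_t
{
    return static_cast<char16_t>( ((unit & 0x00FFu) << 8) | ((unit & 0xFF00u) >> 8) );
}

// Swaps the bytes of every code unit in a UTF-16 string.
inline auto byte_swapped( std::u16string text )
    -> std::u16string
{
    for( char16_t& unit : text )
    {
        char16_t const swapped = byte_swapped( unit );
        assert( byte_swapped( swapped ) == unit );  // Swapping is its own inverse.
        unit = swapped;
    }
    return text;
}
```

If the swapped text matches the garbage, then the code units most likely have the wrong byte order somewhere between the conversion and the console call.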

The library is header only, at <url:
https://github.com/alf-p-steinbach/stdlib>. The conversion sources are
all in the “workarounds” folder. Possibly the bug resides / the bugs reside
in “source/workarounds/impl/windows_console_io/Byte_to_wide_converter.hpp”:


------------------------------------------------------------------------
#pragma once // Source encoding: utf-8 ∩
// #include <stdlib/workarounds/impl/windows_console_io/Codecvt.hpp>
// Copyright © 2017 Alf P. Steinbach, distributed under Boost license 1.0.

#include <codecvt> // std::codecvt_utf8_utf16

namespace stdlib{ namespace impl{ namespace windows_console_io{
using std::codecvt_utf8_utf16;

using Codecvt = codecvt_utf8_utf16<wchar_t>;
using Codecvt_result = decltype( Codecvt::ok );
using Codecvt_state = Codecvt::state_type;

}}} // namespace stdlib::impl::windows_console_io
------------------------------------------------------------------------

------------------------------------------------------------------------
#pragma once // Source encoding: utf-8 ∩
// #include <stdlib/workarounds/impl/windows_console_io/Byte_to_wide_converter.hpp>
// Copyright © 2017 Alf P. Steinbach, distributed under Boost license 1.0.

#include <assert.h> // assert

#include <stdlib/workarounds/impl/Size.hpp> // Size
#include <stdlib/workarounds/impl/windows_console_io/Codecvt.hpp> // Codecvt, Codecvt_result
#include <stdlib/workarounds/impl/windows_console_io/constants.hpp> // ascii::del

namespace stdlib{ namespace impl{ namespace windows_console_io{
using std::begin;
using std::copy;
using std::end;

class Byte_to_wide_converter
{
public:
static Size constexpr in_buf_size = general_buffer_size;

private:
Codecvt codecvt_{};
Codecvt_state conversion_state_{}; // mb_state
char in_buf_[in_buf_size];
Size n_buffered_ = 0;

auto start_of_buffer() -> char* { return begin( in_buf_ ); }
auto put_position() -> char* { return begin( in_buf_ ) + n_buffered_; }
auto beyond_buffer() -> char const* { return end( in_buf_ ); }

public:
auto n_buffered() const -> Size { return n_buffered_; }
auto available_space() const -> Size { return in_buf_size - n_buffered_; }

void add( Size const n, char const* const chars )
{
assert( n <= available_space() );
copy( chars, chars + n, put_position() );
n_buffered_ += n;
}

auto convert_into( wchar_t* const result, Size const result_size )
-> Size
{
char const* p_next_in = start_of_buffer();
wchar_t* p_next_out = result;

for( ;; )
{
auto const p_start_in = p_next_in;
auto const p_start_out = p_next_out;
auto const result_code = static_cast<Codecvt_result>(
codecvt_.in(
conversion_state_,
p_start_in, put_position(), p_next_in, // begin, end, beyond processed
p_start_out, result + result_size, p_next_out // begin, end, beyond processed
) );

switch( result_code )
{
case Codecvt::ok:
case Codecvt::partial:
case Codecvt::noconv:
{
copy<char const*>( p_next_in, put_position(),
start_of_buffer() );
n_buffered_ = put_position() - p_next_in;
return p_next_out - result;
}

case Codecvt::error:
{
*p_next_out++ = static_cast<wchar_t>( ascii::del );
break; // p_next_in points past the offending byte.
}

default:
{
assert(( "Should never get here.", false ));
throw 0;
break;
}
}
}
}
};

}}} // namespace stdlib::impl::windows_console_io
------------------------------------------------------------------------
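
The key idea in `convert_into` above is that bytes belonging to an incomplete UTF-8 sequence at the end of the buffer are kept for the next call. Independent of `codecvt`, that boundary can be found by inspecting the last few bytes. A minimal sketch with invented names, which checks only sequence lengths and does no full validation:

```cpp
#include <cassert>  // assert
#include <cstddef>  // std::size_t

// Length promised by a UTF-8 lead byte; 0 for a continuation or invalid byte.
inline auto sequence_length( unsigned char const lead )
    -> std::size_t
{
    if( lead < 0x80 )             { return 1; }
    if( (lead & 0xE0) == 0xC0 )   { return 2; }
    if( (lead & 0xF0) == 0xE0 )   { return 3; }
    if( (lead & 0xF8) == 0xF0 )   { return 4; }
    return 0;
}

// Number of leading bytes that do not end in the middle of a sequence.
// Bytes from that position up to `n` should be kept for the next chunk.
inline auto complete_prefix_length( char const* const bytes, std::size_t const n )
    -> std::size_t
{
    assert( bytes != nullptr || n == 0 );
    std::size_t i = n;
    // The last lead byte is at most 4 bytes from the end.
    for( int steps = 0; i > 0 && steps < 4; ++steps )
    {
        --i;
        auto const b = static_cast<unsigned char>( bytes[i] );
        if( (b & 0xC0) != 0x80 )        // Not a continuation byte.
        {
            std::size_t const promised = sequence_length( b );
            // Incomplete only if the lead byte promises more than is present.
            return (promised > n - i? i : n);
        }
    }
    return n;   // No lead byte in reach: malformed, leave it to the converter.
}
```

With the input "bl" followed by the single byte 0xC3 (the first half of “å”), the function returns 2, so the 0xC3 would be carried over until its continuation byte arrives.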


Maybe someone has encountered the same phenomenon? Or maybe someone just
by looking at it can see a glaringly obvious bug? I know I often fail to
see my own bugs, while I can spot others' bugs easily.

Hopefully!


Cheers!

- Alf

Christian Gollwitzer

May 13, 2017, 4:57:15 AM
Am 13.05.17 um 08:51 schrieb Alf P. Steinbach:
> I'm working on a (raw C++) minimal library I've called “stdlib”, a
> portable wrapper for the ordinary C++ standard library that
>
> • sets up working Unicode based console i/o for the standard streams,
> in particular so they'll work for international text in Windows,
> • adds necessary defines for <math.h> to get M_PI etc,
> • provides functionality-area headers, e.g. all of i/o,
>
> etc.
>
> Because I realized that this is more fundamental than the Expressive C++
> stuff.

So essentially it fixes a broken platform.

>
> For example, the following should work fine with general Unicode text in
> Windows:

...on Linux and OSX terminals it works without such quirks. Which is why
many programmers are annoyed by developing under Windows, at least if
they target platform-independent code.

I think this is useful, indeed.

About your "Expressive C++" ;) stuff - I think it does not result in a
usable language, but that is mostly the fault of the limited power of
the preprocessor. It does indeed have a point, IMHO - it shows that a
statically compiled zero-overhead language like C++ /could/ have a
feature-rich expressive syntax competing with modern "scripting"
languages such as Python. I am quite sure that C++, if designed from
the ground up with today's compiler technology available, would give a
clean, superfast and comfortable programming environment. There is
simply too much ballast along the way now. Template metaprogramming,
seriously? You misuse the static type system as a functional programming
language preprocessor embedded in the compiler with a brainfuck-like
syntax and almost no overlap with the main language syntax? This sounds
sooo wrong, and yet many "modern" features depend on it????

Christian

Alf P. Steinbach

May 14, 2017, 2:58:21 AM
On 13-May-17 10:56 AM, Christian Gollwitzer wrote:
> Am 13.05.17 um 08:51 schrieb Alf P. Steinbach:
>> I'm working on a (raw C++) minimal library I've called “stdlib”, a
>> portable wrapper for the ordinary C++ standard library that
>>
>> • sets up working Unicode based console i/o for the standard streams,
>> in particular so they'll work for international text in Windows,
>> • adds necessary defines for <math.h> to get M_PI etc,
>> • provides functionality-area headers, e.g. all of i/o,
>>
>> etc.
>>
>> Because I realized that this is more fundamental than the Expressive
>> C++ stuff.
>
> So essentially it fixes a broken platform.

Literally, yes. But we have probably different ideas about what that
platform is. :) I think the C++ standard library's i/o and text handling
is fundamentally broken, because it was designed with the codepage model
in mind, and now is used for variable length encodings.


>> For example, the following should work fine with general Unicode text
>> in Windows:
>
> ...on Linux and OSX terminals it works without such quirks. Which is why
> many programmers are annoyed by developing under Windows, at least if
> they target platform-independent code.
>
> I think this is useful, indeed.

Thank you.


> About your "Expressive C++" ;) stuff - I think it does not result in a
> usable language, but that is mostly the fault of the limited power of
> the preprocessor. It does indeed have a point, IMHO - it shows that a
> statically compiled zero-overhead language like C++ /could/ have a
> feature-rich expressive syntax competing with modern "scripting"
> languages such as Python. I am quite sure that C++, if designed from
> the ground up with today's compiler technology available, would give a
> clean, superfast and comfortable programming environment. There is
> simply too much ballast along the way now. Template metaprogramming,
> seriously? You misuse the static type system as a functional programming
> language preprocessor embedded in the compiler with a brainfuck-like
> syntax and almost no overlap with the main language syntax? This sounds
> sooo wrong, and yet many "modern" features depend on it????

Again thanks for that input.

I've identified three + 1 problems contributing to the gobbledygook
produced by g++ for hello world, namely

1. g++ compiler. With my MinGW g++ 6.3.0 `std::codecvt_utf8_utf16`
produces big-endian `wchar_t` values by default. That's rather
impractical, nonsensical behavior on any little-endian platform, since
`wchar_t` is usually used internally in the program, not as data to be
sent over the network. It /may/ however be how the standard (impractically)
defines the default functionality. In that case MSVC is non-conforming
but practical.

2. C++ standard. The `codecvt...` types offer a template parameter where
one can explicitly request little-endian encoding of the result, but the
standard library offers no way to detect the endianness at compile time.

3. g++ compiler. With little-endian specified g++'s codecvt converts
pure ASCII text successfully, but fails with e.g. Norwegian text.

4. C++ standard. The `codecvt...` specialized classes are deprecated in
C++17.

I suspect strongly that point (4) is related to points (1) and (2) and
(3). :(
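
Regarding point 2: without a compile-time facility, a program can still check the byte order at run time. The following is a minimal sketch with an invented function name, assuming 8-bit bytes. (C++20 later added `std::endian` in `<bit>` for the compile-time case.)

```cpp
#include <cassert>  // assert
#include <cstdint>  // std::uint16_t
#include <cstring>  // std::memcpy

// True if the least significant byte of an integer is stored first.
inline auto is_little_endian()
    -> bool
{
    std::uint16_t const probe = 0x0102;
    unsigned char bytes[sizeof( probe )];
    std::memcpy( bytes, &probe, sizeof( probe ) );

    // The two bytes must be 0x01 and 0x02 in one order or the other.
    assert( (bytes[0] == 0x01 && bytes[1] == 0x02)
         || (bytes[0] == 0x02 && bytes[1] == 0x01) );
    return bytes[0] == 0x02;
}
```

The result could only select between two conversion code paths at run time. It can't be used as a template argument such as the `codecvt` mode, which is the limitation point 2 complains about.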

Oh well, I've implemented DIY UTF-8 conversion before, to do this kind
of thing in C++03. As I recall I then found out later that the Unicode
consortium had some example code that I could just have used directly.
So maybe I can still find that, hopefully!
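
For reference, a hand-rolled UTF-8 to UTF-16 conversion of the kind mentioned above can be quite small. The following is only a sketch under stated assumptions, not the Unicode consortium's example code and not the library's code: the function name is invented, the input is a whole string rather than a stream of chunks, and each malformed byte becomes U+FFFD (the converter class posted earlier uses `ascii::del` instead).

```cpp
#include <cassert>  // assert
#include <cstddef>  // std::size_t
#include <string>   // std::string, std::u16string

// Hypothetical helper: converts UTF-8 bytes to UTF-16 code units.
inline auto utf8_to_utf16( std::string const& in )
    -> std::u16string
{
    // Smallest code point that legitimately needs a sequence of each length.
    static char32_t const min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };

    std::u16string out;
    std::size_t i = 0;
    while( i < in.size() )
    {
        auto const lead = static_cast<unsigned char>( in[i] );
        std::size_t len = 0;
        char32_t cp = 0;
        if( lead < 0x80 )                { len = 1; cp = lead; }
        else if( (lead & 0xE0) == 0xC0 ) { len = 2; cp = lead & 0x1F; }
        else if( (lead & 0xF0) == 0xE0 ) { len = 3; cp = lead & 0x0F; }
        else if( (lead & 0xF8) == 0xF0 ) { len = 4; cp = lead & 0x07; }

        bool ok = (len != 0 && i + len <= in.size());
        for( std::size_t k = 1; ok && k < len; ++k )
        {
            auto const b = static_cast<unsigned char>( in[i + k] );
            ok = ((b & 0xC0) == 0x80);          // Must be a continuation byte.
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogate code points and values > U+10FFFF.
        if( ok && (cp < min_cp[len] || cp > 0x10FFFF
                || (0xD800 <= cp && cp <= 0xDFFF)) )
        {
            ok = false;
        }

        if( !ok )
        {
            out += char16_t( 0xFFFD );          // Replacement character.
            ++i;                                // Skip just the offending byte.
            continue;
        }
        assert( 1 <= len && len <= 4 );
        if( cp < 0x10000 )
        {
            out += static_cast<char16_t>( cp );
        }
        else                                    // Needs a surrogate pair.
        {
            cp -= 0x10000;
            out += static_cast<char16_t>( 0xD800 + (cp >> 10) );
            out += static_cast<char16_t>( 0xDC00 + (cp & 0x3FF) );
        }
        i += len;
    }
    return out;
}
```

The result holds plain numeric code unit values, so no byte order question arises until the units are written out as bytes.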

• • •

The code I used to explore these issues:


#include <assert.h>
#include <stdlib.h> // EXIT_FAILURE
#include <codecvt>
#include <iostream>
#include <string>
using namespace std;

using Codecvt = std::codecvt_utf8_utf16<wchar_t, 0x10ffff, little_endian>;
using Codecvt_result = decltype( Codecvt::ok );
using Codecvt_state = Codecvt::state_type;

auto to_wstring( Codecvt_result const result_code )
-> wstring
{
switch( result_code )
{
case Codecvt::ok: return L"ok";
case Codecvt::partial: return L"partial";
case Codecvt::noconv: return L"noconv";
case Codecvt::error: return L"error";

default:
{
return L"Codecvt_result(" + std::to_wstring( +result_code )
+ L")";
}
}
}

auto main()
-> int
{
#if defined( NON_ENGLISH )
static char const s[] = u8"blåbærsyltetøy";
#else
static char const s[] = u8"blaabaersyltetoey";
#endif
int const s_len = sizeof( s ) - 1;

//wcout << s << endl;
wcout << s << endl;

Codecvt codecvt;
Codecvt_state conversion_state{};

wstring buffer( 80, L'#' );

char const* const s_end = s + s_len;
wchar_t* const buffer_end = &buffer[0] + buffer.size();

char const* p_next_in = s;
wchar_t* p_next_out = &buffer[0];
while( p_next_in != s_end )
{
auto const new_chunk_begin = p_next_out;
auto const result_code = static_cast<Codecvt_result>( codecvt.in(
conversion_state,
p_next_in, s_end, p_next_in, // begin, end, beyond processed
p_next_out, buffer_end, p_next_out // begin, end, beyond processed
) );

switch( result_code )
{
case Codecvt::ok:
case Codecvt::partial:
case Codecvt::noconv:
{
int const n_values = p_next_out - new_chunk_begin;
wcout
<< L"Produced " << n_values << L" wide values,"
<< L" result code " << to_wstring( result_code )
<< endl;
wcout << "`" << wstring{new_chunk_begin,
p_next_out} << "`" << endl;
for( int i = 0; i < n_values; ++i )
{
wcout << +buffer[i] << " ";
}
wcout << endl;

wstring dummy; getline( wcin, dummy ); // Press return to go on.
break;
}

case Codecvt::error:
{
wcout << L"Oops, error." << endl;
return EXIT_FAILURE;
}

default:
{
assert(( "Should never get here.", false )); throw 0;
}
}
}
}


Cheers!,

- Alf



Scott Lurndal

May 15, 2017, 9:12:00 AM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>On 13-May-17 10:56 AM, Christian Gollwitzer wrote:
>> Am 13.05.17 um 08:51 schrieb Alf P. Steinbach:
>>> I'm working on a (raw C++) minimal library I've called “stdlib”, a
>>> portable wrapper for the ordinary C++ standard library that
>>>
>>> • sets up working Unicode based console i/o for the standard streams,
>>> in particular so they'll work for international text in Windows,
>>> • adds necessary defines for <math.h> to get M_PI etc,
>>> • provides functionality-area headers, e.g. all of i/o,
>>>
>>> etc.
>>>
>>> Because I realized that this is more fundamental than the Expressive
>>> C++ stuff.
>>
>> So essentially it fixes a broken platform.
>
>Literally, yes. But we have probably different ideas about what that
>platform is. :) I think the C++ standard library's i/o and text handling
>is fundamentally broken, because it was designed with the codepage model
>in mind, and now is used for variable length encodings.

I don't believe the SGI folks ever even considered windows, or codepages
when designing the STL.

Alf P. Steinbach

May 15, 2017, 5:04:59 PM
On 15-May-17 3:11 PM, Scott Lurndal wrote:
> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>> On 13-May-17 10:56 AM, Christian Gollwitzer wrote:
>>> Am 13.05.17 um 08:51 schrieb Alf P. Steinbach:
>>>> I'm working on a (raw C++) minimal library I've called “stdlib”, a
>>>> portable wrapper for the ordinary C++ standard library that
>>>>
>>>> • sets up working Unicode based console i/o for the standard streams,
>>>> in particular so they'll work for international text in Windows,
>>>> • adds necessary defines for <math.h> to get M_PI etc,
>>>> • provides functionality-area headers, e.g. all of i/o,
>>>>
>>>> etc.
>>>>
>>>> Because I realized that this is more fundamental than the Expressive
>>>> C++ stuff.
>>>
>>> So essentially it fixes a broken platform.
>>
>> Literally, yes. But we have probably different ideas about what that
>> platform is. :) I think the C++ standard library's i/o and text handling
>> is fundamentally broken, because it was designed with the codepage model
>> in mind, and now is used for variable length encodings.
>
> I don't believe the SGI folks ever even considered windows, or codepages
> when designing the STL.

The STL was designed by Alexander Stepanov, starting in 1992 at HP.

See <url: https://en.wikipedia.org/wiki/Standard_Template_Library#History>.

“SGI folks” did not design the STL but provided an STL implementation,
apparently still available at <url: https://www.sgi.com/tech/stl/>.
Here's a nice article about it: <url:
http://www.drdobbs.com/cpp/the-sgi-standard-template-library/184410249>.

The STL has roughly nothing to do with text handling and streams.
Stepanov focused on (the separation of) algorithms and containers.
Possibly you're conflating the STL with the standard library. That's
natural because it's a common misconception that's floating around.

Codepages are not particular to Windows. “Codepage” is a term that
originally denoted only single byte encodings and that stems from IBM,
as far as I know. Today that's the main meaning. Most systems including
Unix were based on the codepage model (single byte encodings, with some
special support for Shift-JIS etc.) before the advent of Unicode. See
<url: https://en.wikipedia.org/wiki/Code_page>.

Summing up, your comment provides many opportunities to learn. ;-)


Cheers & hth.,

- Alf

Juha Nieminen

May 16, 2017, 2:08:08 AM
Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
> using namespace std;

Haven't we discussed this already?

That line doesn't make the code more readable. In fact, it does the
exact opposite.

Alf P. Steinbach

May 16, 2017, 2:53:47 AM
On 16-May-17 8:08 AM, Juha Nieminen wrote:
> Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
>> using namespace std;
>
> Haven't we discussed this already?

No, I can't say I remember that.


> That line doesn't make the code more readable. In fact, it does the
> exact opposite.

Works fine for me.

You seem to have a silly hangup about it. Speculation about why:
possibly you prefer absolute mechanical rules for programming, instead
of intelligence. I know many do. And this thing about not using the
quoted statement can work for beginners. Teaches them things and avoids
some problems. Plus it gives some of them a sense of group membership,
usually in fan-boy groups of some sort. But if absolute mechanical rules
worked well for programmers in general (they don't), not just beginners,
then you'd have been replaced by a robot by now. ;-)

All this said, this is the second ** NOISE **-posting in this thread,
and I think I'm on much firmer ground than above when I speculate about
why people would like to inject noise here.

Namely, to remove focus from some grave problems with the g++ compiler
and standard library implementation, such as its codecvt producing silly
big-endian wchar_t values in Windows, and that it does not work for e.g.
Norwegian even when correct endianness is specified.

After all, g++ is a perfect perfect perfect compiler, yes?


Cheers!,

- Alf

Juha Nieminen

May 17, 2017, 2:42:16 AM
Alf P. Steinbach <alf.p.stein...@gmail.com> wrote:
> You seem to have a silly hangup about it. Speculation about why:
> possibly you prefer absolute mechanical rules for programming, instead
> of intelligence. I know many do. And this thing about not using the
> quoted statement can work for beginners. Teaches them things and avoids
> some problems. Plus it gives some of them a sense of group membership,
> usually in fan-boy groups of some sort. But if absolute mechanical rules
> worked well for programmers in general (they don't), not just beginners,
> then you'd have been replaced by a robot by now. ;-)

The std:: prefixes nicely mark all uses of the standard library in the
code, making them much easier to distinguish from other names, and thus
making the code easier to read and understand at a glance. Even when you
don't know or remember what a particular function or class does, the
std:: prefix immediately gives you the hint that it indeed *is* a standard
library element, and you thus immediately know where to look for the info,
rather than having to chase wild geese. It's like color-coding of the
source code, but it works even in contexts where there is no code coloring
(such as here, or in many text editors).

I actually find it a curious psychological phenomenon why so many people
think that code that's "littered" with tons of std:: prefixes is somehow
ugly and hard to read. When I had this discussion in the past with someone,
he tried to demonstrate how "bad" code becomes when there are tons of
those prefixes by posting a piece of code that had a large amount of
calls to standard library functions and names. Ironically, it was
precisely the std:: prefixes that made the code actually *easier* to
understand, rather than harder (because of the abovementioned reason that
they give you an immediate visual cue about where the standard library
names are used.)

> All this said, this is the second ** NOISE **-posting in this thread,
> and I think I'm on much firmer ground than above when I speculate about
> why people would like to inject noise here.

My comment was beside the point of the original post for sure, but
so what? Is there a law that forbids tangential discussion?

Your conspiracy theory is really strange.

Alf P. Steinbach

May 17, 2017, 6:11:56 AM
On 17-May-17 8:42 AM, Juha Nieminen wrote:
>
> Your conspiracy theory is really strange.

I think now you're off your rocker.

Cheers!,

- Alf

