How to emit UTF-8 from a console-mode program?

Siegfried Heintze

Oct 1, 2008, 2:25:59 PM

The following Perl program works when I run it from a urxvt-X console
under Cygwin/X on Microsoft Windows XP:

LC_CTYPE=en_US.UTF-8 urxvt-X.exe&
perl -wle "binmode STDOUT, q[:utf8]; print chr() for 0x410 .. 0x430;"

This little one-liner prints the Russian alphabet in Cyrillic. With some
slight modification it will also print a lot of other alphabets too --
including Hebrew, Chinese and Japanese.

It does not work with cmd.exe because apparently cmd.exe cannot deal with
UTF-8.

Can someone help me translate it into C++? I would not expect it to work
from cmd.exe with C++, but I am hopeful it will work with urxvt-X!

This does not work:

for(int ii = 0x410; ii < 0x430; ++ii) std::wcout << (wchar_t) ii;

I obviously need to tell urxvt-X that I want to use UTF-8, but I don't know
how! I suppose UTF-16 would be fine too. I just want to see some Chinese and
Cyrillic glyphs.

Thanks,
Siegfried

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Martin T.

Oct 2, 2008, 12:59:05 PM

Siegfried Heintze wrote:
> The following perl program works when I run it from urxvt-X console on
> cygwin-x windows when running on Microsoft Windows XP:
>
> LC_CTYPE=en_US.UTF-8 urxvt-X.exe&
> perl -wle "binmode STDOUT, q[:utf8]; print chr() for 0x410 .. 0x430;"
>
> This little one-liner prints the Russian alphabet in Cyrillic. With some
> slight modification it will also print a lot of other alphabets too --
> including Hebrew, Chinese and Japanese.
>
> It does not work with cmd.exe because apparently cmd.exe cannot deal with
> UTF-8.
>
> Can someone help me translate it into C++? I would not expect it to work
> from cmd.exe with C++, but I am hopeful it will work with urxvt-X!
>
> This does not work:
>
> for(int ii = 0x410; ii < 0x430; ++ii) std::wcout << (wchar_t) ii;
>

To output UTF-8 from a C++ program, use std::cout and char rather than
the wide-character streams; as far as I can tell, the whole point of the
wide versions is to allow for UTF-16/32.

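On Windows you can use the WideCharToMultiByte API to convert your
wchar_t string to a UTF-8 string:
// pStr: NUL-terminated wchar_t (UTF-16) string. Needs <windows.h> and <vector>.
// First call: ask for the required buffer size in bytes, including the NUL.
int buf_len = WideCharToMultiByte(CP_UTF8, 0, pStr, -1, NULL, 0, NULL, NULL);
std::vector<char> buf(buf_len);
// Second call: convert into the buffer.
WideCharToMultiByte(CP_UTF8, 0, pStr, -1, buf.data(), buf_len, NULL, NULL);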
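
For readers who are not on Windows, here is a minimal sketch of the same wchar_t-to-UTF-8 conversion in plain C++. The name wide_to_utf8 is hypothetical, and replacing invalid input with U+FFFD is an assumption of this sketch, not a statement about how WideCharToMultiByte behaves. It accepts either one code point per wchar_t or UTF-16 surrogate pairs.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: wide string to UTF-8 without any platform API.
// Unpaired surrogates and out-of-range values become U+FFFD (an assumed policy).
std::string wide_to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        unsigned long u = static_cast<unsigned long>(ws[i]);

        // Combine a UTF-16 surrogate pair into a single code point.
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < ws.size()) {
            unsigned long lo = static_cast<unsigned long>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF))
            u = 0xFFFD;

        // Emit 1 to 4 bytes depending on the size of the code point.
        if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else if (u < 0x10000) {
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (u >> 18));
            out += static_cast<char>(0x80 | ((u >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

The result can be written with std::cout like any other char string, for example std::cout << wide_to_utf8(L"\x0410\x0411").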

> I obviously need to tell urxvt-X that I want to use utf-8 but I don't know
> how! I suppose UTF-16 would be fine too. I just want to see some Chinese and
> Cyrillic glyphs.
>

You might have to tell your terminal explicitly, but you should
not have to do this from C++. I thought setting LC_CTYPE=en_US.UTF-8 was
how you tell the terminal to use UTF-8 ... ?

br,
Martin

Alberto Ganesh Barbati

Oct 2, 2008, 7:40:25 PM

Siegfried Heintze wrote:

> The following perl program works when I run it from urxvt-X console on
> cygwin-x windows when running on Microsoft Windows XP:
>
> LC_CTYPE=en_US.UTF-8 urxvt-X.exe&
> perl -wle "binmode STDOUT, q[:utf8]; print chr() for 0x410 .. 0x430;"
>
> This little one-liner prints the Russian alphabet in Cyrillic. With some
> slight modification it will also print a lot of other alphabets too --
> including Hebrew, Chinese and Japanese.
>
> It does not work with cmd.exe because apparently cmd.exe cannot deal with
> UTF-8.
>
> Can someone help me translate it into C++? I would not expect it to work
> from cmd.exe with C++, but I am hopeful it will work with urxvt-X!
>
> This does not work:
>
> for(int ii = 0x410; ii < 0x430; ++ii) std::wcout << (wchar_t) ii;
>
> I obviously need to tell urxvt-X that I want to use utf-8 but I don't know
> how! I suppose UTF-16 would be fine too. I just want to see some Chinese and
> Cyrillic glyphs.
>

I'm afraid it can't be done portably. Assuming you work on Windows
and UTF-16 is good enough for you, you could use:

for(int ii = 0x410; ii < 0x430; ++ii)
    std::cout
        << (char)(unsigned char)(ii & 0xff)
        << (char)(unsigned char)((ii >> 8) & 0xff);

Notice that I used cout and not wcout.

However, there's a big pitfall that you should be aware of! Because cout
is opened as a *text* stream, if you ever try to output a character such as
U+040A (CYRILLIC CAPITAL LETTER NJE), the output will get corrupted:
its low byte '\x0a' == '\n' triggers CR/LF expansion.
Therefore I discourage using this approach at all.
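
To make the pitfall concrete, here is a small sketch that builds the same two bytes as the loop above and checks whether one of them is 0x0A. The names utf16le_bytes and has_newline_byte are hypothetical, introduced only for illustration.

```cpp
#include <string>

// UTF-16LE bytes of a BMP code point, low byte first, as in the loop above.
std::string utf16le_bytes(int cp)
{
    std::string bytes;
    bytes += static_cast<char>(cp & 0xff);
    bytes += static_cast<char>((cp >> 8) & 0xff);
    return bytes;
}

// True if the encoded bytes contain 0x0A, which a text-mode stream
// on Windows would expand to CR LF.
bool has_newline_byte(const std::string& bytes)
{
    return bytes.find('\n') != std::string::npos;
}
```

With these, has_newline_byte(utf16le_bytes(0x040A)) is true while has_newline_byte(utf16le_bytes(0x0410)) is false. The same applies to any code point whose high byte is 0x0A, such as U+0A05.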

For UTF-8 the situation is slightly better, but you must implement the
algorithm that converts a Unicode code point to its UTF-8 encoding
yourself:

std::string utf8encode(int u);

for(int ii = 0x410; ii < 0x430; ++ii)
    std::cout << utf8encode(ii);

I say that it's slightly better because the byte '\n' can occur as
a UTF-8 code unit only when encoding U+000A, so you never trigger CR/LF
expansion inadvertently.
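
One possible body for the utf8encode function called in the loop above is sketched here. Returning an empty string for values that are not valid Unicode scalar values is an assumption of this sketch; the post does not say how errors should be reported. It also assumes int is at least 32 bits wide.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Returns an empty string for negative values, values above U+10FFFF,
// and surrogates U+D800..U+DFFF (an assumed error convention).
std::string utf8encode(int u)
{
    std::string out;
    if (u < 0 || u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF))
        return out;

    if (u < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(u);
    } else if (u < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (u >> 6));
        out += static_cast<char>(0x80 | (u & 0x3F));
    } else if (u < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (u >> 12));
        out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (u & 0x3F));
    } else {                              // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (u >> 18));
        out += static_cast<char>(0x80 | ((u >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (u & 0x3F));
    }
    return out;
}
```

Every byte of a multi-byte sequence has its top bit set, which is why 0x0A never appears unless the input is U+000A itself. For example, U+040A encodes as the bytes D0 8A.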

Other options include writing a codecvt<> facet that performs the wchar_t
to UTF-8 conversion (not an easy task!), making a locale object with it, and
then imbuing that locale in an ofstream. Imbuing the locale in cout/wcout
wouldn't solve your problem because only file stream buffers actually
use the codecvt facet. The advantage of this approach is that it's going
to be portable.
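
A rough, output-only sketch of the facet approach follows. The names Utf8OutFacet and write_utf8_file are hypothetical. The sketch uses std::wofstream because the internal character type is wchar_t, and it assumes each wchar_t holds one complete code point, so it does not pair UTF-16 surrogates. It needs C++11 for override and noexcept. The input direction (do_in, do_length) is left as inherited, so the facet is not suitable for reading.

```cpp
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical output-only facet: wchar_t code points to UTF-8 bytes.
class Utf8OutFacet : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    result do_out(state_type&,
                  const intern_type* from, const intern_type* from_end,
                  const intern_type*& from_next,
                  extern_type* to, extern_type* to_end,
                  extern_type*& to_next) const override
    {
        static const unsigned char lead[] = {0, 0x00, 0xC0, 0xE0, 0xF0};
        for (from_next = from, to_next = to; from_next != from_end; ++from_next) {
            unsigned long u = static_cast<unsigned long>(*from_next);
            if (u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF))
                return error;                 // not a Unicode scalar value
            int n = u < 0x80 ? 1 : u < 0x800 ? 2 : u < 0x10000 ? 3 : 4;
            if (to_end - to_next < n)
                return partial;               // output buffer is full
            for (int i = n - 1; i > 0; --i) { // continuation bytes, last first
                to_next[i] = static_cast<char>(0x80 | (u & 0x3F));
                u >>= 6;
            }
            to_next[0] = static_cast<char>(lead[n] | u);
            to_next += n;
        }
        return ok;
    }

    // UTF-8 is stateless, so there is nothing to flush at the end.
    result do_unshift(state_type&, extern_type* to, extern_type*,
                      extern_type*& to_next) const override
    {
        to_next = to;
        return noconv;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }   // variable width
    int do_max_length() const noexcept override { return 4; }
};

// Writes text to the file at path through the facet.
// Returns true if the stream reports no failure.
bool write_utf8_file(const char* path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening; the locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new Utf8OutFacet));
    out.open(path, std::ios::binary);
    out << text;
    out.close();
    return !out.fail();
}
```

Opening the file in binary mode sidesteps the CR/LF expansion discussed earlier; whether that is wanted depends on the consumer of the file.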

HTH,

Ganesh