LC_CTYPE=en_US.UTF-8 urxvt-X.exe&
perl -wle "binmode STDOUT, q[:utf8]; print chr() for 0x410 .. 0x430;"
This little one-liner prints the Russian alphabet in Cyrillic. With some
slight modification it will also print a lot of other alphabets,
including Hebrew, Chinese and Japanese.
It does not work with cmd.exe because apparently cmd.exe cannot deal with
UTF-8.
Can someone help me translate it into C++? I would not expect it to work
from cmd.exe with C++, but I am hopeful it will work with urxvt-X!
This does not work:
for(int ii = 0x410; ii < 0x430; ++ii) std::wcout << (wchar_t) ii;
I obviously need to tell urxvt-X that I want to use utf-8 but I don't know
how! I suppose UTF-16 would be fine too. I just want to see some Chinese and
Cyrillic glyphs.
Thanks,
Siegfried
When you want to output UTF-8 from a C++ program you should use
std::cout and char, not the wide-character versions; as far as I can
tell, the whole point of those is to allow for UTF-16/32.
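On Windows you can use the WideCharToMultiByte API to convert your
wchar_t string to a UTF-8 string. Call it twice: once to get the
required buffer size, then again to do the conversion:
const wchar_t* pStr = L"\x0410\x0411\x0412"; // sample input
// With a null output buffer the call returns the required size in bytes
// (including the terminating NUL, because the input length is -1).
int buf_len = WideCharToMultiByte(CP_UTF8, 0, pStr, -1, NULL, 0, NULL, NULL);
std::vector<char> buf(buf_len);
WideCharToMultiByte(CP_UTF8, 0, pStr, -1, buf.data(), buf_len, NULL, NULL);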
> I obviously need to tell urxvt-X that I want to use utf-8 but I don't know
> how! I suppose UTF-16 would be fine too. I just want to see some Chinese and
> Cyrillic glyphs.
>
You might have to configure your terminal explicitly, but you should
not have to do this from C++. I thought setting LC_CTYPE=en_US.UTF-8 is
what tells the terminal that it should use UTF-8 ... ?
br,
Martin
I'm afraid it can't be done portably. Assuming you work on Windows
and UTF-16 is good enough for you, you could use:
for(int ii = 0x410; ii < 0x430; ++ii)
std::cout
<< (char)(unsigned char)(ii & 0xff)
<< (char)(unsigned char)((ii >> 8) & 0xff);
Notice that I used cout and not wcout; the two bytes written per
character are the UTF-16 code unit, low byte first (little-endian).
However, there's a big pitfall that you should be aware of! As cout is
opened as a *text* stream, if you ever try to output a character such as
U+040A (CYRILLIC CAPITAL LETTER NJE) the output will get corrupted,
because the byte '\x0a' == '\n' will trigger CR/LF expansion.
Therefore I discourage using this approach at all.
For UTF-8 the situation is slightly better, but you must implement
the conversion from a Unicode code point to its UTF-8 encoding
yourself:
std::string utf8encode(int u);
for(int ii = 0x410; ii < 0x430; ++ii)
std::cout << utf8encode(ii);
I say that it's slightly better because the byte '\n' can occur as
a UTF-8 code unit only when encoding U+000A, so you never trigger CR/LF
expansion inadvertently.
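
For what it's worth, here is a minimal sketch of what utf8encode could
look like. Only the declaration comes from the post above; the body is my
own assumption of a straightforward encoder following the usual UTF-8
bit layout (1 to 4 bytes per code point). Rejecting surrogates and
out-of-range values with an exception is a choice made for this sketch,
not something the thread asks for.

```cpp
#include <stdexcept>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
// Sketch only: throws on values that are not Unicode scalar values.
std::string utf8encode(int u) {
    if (u < 0 || u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (u < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(u);
    } else if (u < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (u >> 6));
        out += static_cast<char>(0x80 | (u & 0x3F));
    } else if (u < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (u >> 12));
        out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (u & 0x3F));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (u >> 18));
        out += static_cast<char>(0x80 | ((u >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (u & 0x3F));
    }
    return out;
}
```

With this in place the loop from the post works unchanged, and
utf8encode(0x410) yields the two bytes 0xD0 0x90. Every byte of a
multi-byte sequence has its high bit set, which is why 0x0A can only
ever come from U+000A itself.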
Another option is to write a codecvt<> facet that performs the wchar_t
to UTF-8 conversion (not an easy task!), make a locale object with it
and then imbue that locale in an ofstream. Imbuing the locale in
cout/wcout wouldn't solve your problem, because only file stream buffers
are required to use the codecvt facet. The advantage of this approach is
that it's going to be portable.
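
To make the facet idea a bit more concrete, here is a rough sketch,
assuming that every wchar_t holds one complete code point (true where
wchar_t is 32 bits; on Windows, surrogate pairs would need extra
handling that is not shown). The names Utf8OutFacet and write_utf8_file
are made up for this example. Only the output direction (do_out) is
implemented, so the facet is good for writing, not for reading.

```cpp
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical facet (name invented for this sketch): treats every
// wchar_t as one Unicode code point and writes it out as UTF-8.
class Utf8OutFacet : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_out(state_type&,
                  const intern_type* from, const intern_type* from_end,
                  const intern_type*& from_next,
                  extern_type* to, extern_type* to_end,
                  extern_type*& to_next) const override {
        from_next = from;
        to_next = to;
        for (; from_next != from_end; ++from_next) {
            const unsigned long u = static_cast<unsigned long>(*from_next);
            if (u > 0x10FFFF) return error;
            const int n = u < 0x80 ? 1 : u < 0x800 ? 2 : u < 0x10000 ? 3 : 4;
            if (to_end - to_next < n) return partial;  // output buffer is full
            if (n == 1) {
                *to_next++ = static_cast<char>(u);
                continue;
            }
            // Lead byte: n high bits set, then the top bits of the code point.
            static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
            *to_next++ = static_cast<char>(lead[n] | (u >> (6 * (n - 1))));
            // Continuation bytes: 10xxxxxx, six bits each, high bits first.
            for (int i = n - 2; i >= 0; --i)
                *to_next++ = static_cast<char>(0x80 | ((u >> (6 * i)) & 0x3F));
        }
        return ok;
    }
    // UTF-8 is stateless, so there is nothing to flush at the end.
    result do_unshift(state_type&, extern_type* to, extern_type*,
                      extern_type*& to_next) const override {
        to_next = to;
        return noconv;
    }
    int do_encoding() const noexcept override { return 0; }       // variable width
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 4; }     // bytes per wchar_t
};

// Hypothetical helper: writes text to path as UTF-8; returns false on failure.
bool write_utf8_file(const char* path, const std::wstring& text) {
    std::wofstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new Utf8OutFacet));
    out.open(path, std::ios::binary);  // binary: no CR/LF translation
    out << text;
    out.close();
    return !out.fail();
}
```

The locale is imbued before the file is opened so that the file buffer
picks up the facet before any character is written. Calling something
like write_utf8_file("out.txt", L"\u0410\u0411") should then produce a
file that a UTF-8 terminal can display with cat.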
HTH,
Ganesh