I'm sure this has been addressed before but I've hunted all over the
web and no one seems to provide a comprehensive answer. I just want
to do one thing: Under CentOS, in a simple C++ program, I'd like to be
able to print Unicode characters to a console output. For example,
I'd like to print the musical flat, natural, and sharp signs.
Here's what I've done so far:
1. Using Eclipse, created a small C++ console project.
2. Declared three variables, each of type wchar_t, and assigned them their
Unicode values (0x266d, 0x266e, 0x266f).
3. Attempted to print them out using wprintf().
4. Set my output console to a font which can represent the characters
(glyphs?) - Lucida Console
A few observations:
1. I can go to a Unicode code page website and copy the characters
displayed and paste them into my source file which is in the same font
(that was my first trick which ultimately blew me out of the water
because Eclipse was bitching about not being able to save the files
due to encoding...tried changing it...then it promptly deleted all my
lines and left me with a bunch of NULs).
2. Mixing cout and wprintf results in the wprintf statements being
totally ignored.
3. Using only wprintf results in "Sign: ?" displayed in the console
output, even though the console can display the glyphs correctly when
I pasted them (see 1.)
4. Calling setlocale() as directed by an example has no effect on my
program.
5. Using fwide() to determine if my setup is legit works, because I
don't hit the exit condition that I wrote for that test.
So, I don't know what else to try to get this to work. There's a lot
of stuff about Unicode on Windows out there but I'm not doing Windows,
and figured the Linux community might have an answer.
Thanks.
> Greetings,
>
> I'm sure this has been addressed before but I've hunted all over the
> web and no one seems to provide a comprehensive answer. I just want
> to do one thing: Under CentOS, in a simple C++ program, I'd like to be
> able to print Unicode characters to a console output. For example,
> I'd like to print the musical flat, natural, and sharp signs.
>
> Here's what I've done so far:
> 1. Using Eclipse, created a small C++ console project.
> 2. Declared three variables, each of type wchar_t, and assigned them their
> Unicode values (0x266d, 0x266e, 0x266f).
> 3. Attempted to print them out using wprintf().
> 4. Set my output console to a font which can represent the characters
> (glyphs?) - Lucida Console
I am not sure about CentOS, but in Linux generally UTF-8 is used. One
should have a UTF-8 locale (e.g. LANG=en_US.utf8). If your code
internally uses wchar_t, then it should be converted to UTF-8 before
output. I am not sure if wprintf() or wcout can do that automatically.
In our software we use UTF-8 and std::string internally, and it is
working perfectly in Linux.
hth
Paavo
Hi Paavo,
Here's my locale setting:
(mfeher) mfeher-l4 [~] > locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C
I was under the impression that I had more of an "environment setup"
issue than a coding issue, i.e. I was unaware that I had to do
anything more to the code than change from cout/printf to wprintf.
Also, from a brief, brief reading of all this material on the
Internet, I don't want UTF-8 because that's too small to hold the
character codes I wish to print. Here's the code I am trying:
#include <clocale>   // setlocale
#include <cstdio>    // stdout
#include <cwchar>    // fwide, wprintf
#include <iostream>
using namespace std;

int main() {
    // cout << "Testing Unicode" << endl; // prints Testing Unicode
    // If you try to mix Unicode printing with non-Unicode printing,
    // the switch causes you to lose output!
    setlocale(LC_ALL, ""); // Does nothing
    // Let's check our orientation...it never fails
    if (fwide(stdout, 1) < 0)
    {
        cerr << "ERROR: Output not set to wide. Exiting..." << endl;
        return -1;
    }
    // Declare a Unicode character and try to print it out
    wchar_t mychar = 0x266d; // The music flat sign
    wprintf(L"Here's mychar: %lc\n", mychar);
    return 0;
}
You can't output raw unicode values and expect your terminal emulator
to understand them. You have to output them *encoded* with the same
encoding scheme as your terminal. Usually this will be UTF-8.
Either output the encoded values directly, or use a UTF-8 encoding
library to convert your raw unicode values into UTF-8 codes. One such
library is, for example: http://utfcpp.sourceforge.net/
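For instance, the first option can be as simple as this sketch
(assuming the terminal really is UTF-8; 0xE2 0x99 0xAD is the UTF-8
encoding of U+266D):

#include <iostream>

int main() {
    std::cout << "\xE2\x99\xAD\n"; // the three UTF-8 bytes of the flat sign
}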
I think you have a misunderstanding of what UTF-8 is. UTF-8 can
represent the entire unicode address space.
You might not "want" it, but you have no option because your terminal
emulator most probably wants UTF-8. It doesn't want raw unicode values.
I probably do have a misunderstanding, but like I said, it appears
that UTF-8 is "smaller" or more restrictive than UTF-16, UTF-32, etc.
Is that the case? If the -8 means 8 bits, there's no way I can
convert numbers in the upper ranges (e.g. 0x266d) into 8 bits and even
expect to get my glyphs out on-screen.
So it sounds like my terminal/environment is set up to UTF-8, and I
just have to add a little code to my program before, during, or after
the wprintf() call to make sure they are displayed properly on-
screen. At least this is what I gather from your responses.
Mike
> On Aug 28, 2:03 pm, Juha Nieminen <nos...@thanks.invalid> wrote:
>> Zerex71 wrote:
>> > I don't want UTF-8 because that's too small to hold the
>> > character codes I wish to print.
>>
>> I think you have a misunderstanding of what UTF-8 is. UTF-8 can
>> represent the entire unicode address space.
>>
>> You might not "want" it, but you have no option because your
>> terminal emulator most probably wants UTF-8. It doesn't want raw
>> unicode values.
>
> I probably do have a misunderstanding, but like I said, it appears
> that UTF-8 is "smaller" or more restrictive than UTF-16, UTF-32, etc.
> Is that the case? If the -8 means 8 bits, there's no way I can
> convert numbers in the upper ranges (e.g. 0x266d) into 8 bits and even
> expect to get my glyphs out on-screen.
You are mistaken. Larger Unicode code values are encoded as multibyte
sequences in UTF-8. See e.g. http://en.wikipedia.org/wiki/Utf-8
Paavo
No.
> If the -8 means 8 bits
It means that the unicode values are encoded into units of 8 bits.
Larger unicode values are encoded into more than one 8-bit unit.
There is also UTF-7, where each unit is 7 bits (in other words, no
byte will have a value larger than 127). Larger values are encoded
using even more units.
UTF-16 uses units of 16 bits. Unicode values which don't fit in them
are encoded using two 16-bit units, similarly to the previous two.
UTF-8 is the most popular because it "compresses" typical English text
the best, while still allowing the full range of unicode characters to
be represented.
> So it sounds like my terminal/environment is set up to UTF-8, and I
> just have to add a little code to my program before, during, or after
> the wprintf() call to make sure they are displayed properly on-
> screen. At least this is what I gather from your responses.
What you do is that you convert your unicode values into a stream of
UTF-8 encoded bytes, and then output those bytes to the terminal. As I
mentioned in another post, there are libraries to help you do this.
> I'm sure this has been addressed before but I've hunted all
> over the web and no one seems to provide a comprehensive
> answer. I just want to do one thing: Under CentOS, in a
> simple C++ program, I'd like to be able to print Unicode
> characters to a console output.
I've never heard of CentOS, so I can't address any
system-specific problems here (and they would be off topic).
> For example, I'd like to print the musical flat, natural, and
> sharp signs.
> Here's what I've done so far:
> 1. Using Eclipse, created a small C++ console project.
> 2. Declared three variables, each of type wchar_t, and assigned them their
> Unicode values (0x266d, 0x266e, 0x266f).
> 3. Attempted to print them out using wprintf().
> 4. Set my output console to a font which can represent the characters
> (glyphs?) - Lucida Console
What locale are you using? And what encoding does the font use?
You need to ensure that the encoding in the locale is the same
as the one used by the renderer for the font.
> A few observations:
> 1. I can go to a Unicode code page website and copy the
> characters displayed and paste them into my source file which
> is in the same font (that was my first trick which ultimately
> blew me out of the water because Eclipse was bitching about
> not being able to save the files due to encoding...tried changing
> it...then it promptly deleted all my lines and left me with a
> bunch of NULs).
First, a source file isn't in a "font". A source file is a
sequence of text characters, in a certain encoding. A font
defines how specific characters will be rendered.
Secondly, in order to be displayable everywhere, I think that
the Unicode code pages use images, and not characters, for the
characters in the code pages. This allows displaying characters
which aren't in any font installed on the machine. There's no
way copy/pasting an image to your source file can possibly work.
> 2. Mixing cout and wprintf results in the wprintf statements being
> totally ignored.
You've raised an interesting point. According to the C standard
(relevant to wprintf), you can't mix wide and narrow output on
the same stream (in this case, stdout). C++ has a similar
restriction---if you've output to cout, use of wcout becomes
illegal, and vice versa. And since stdout and cout/wcout are
supposed to use the same stream, and are synchronized with one
another (by default), I'm pretty sure that the intent is not to
allow this either. In general, all of your IO to a given source
or sink should be of the same type; if you want to output
wchar_t somewhere, all output should be as wchar_t.
> 3. Using only wprintf results in "Sign: ?" displayed in the
> console output, even though it can display the glyphs
> correctly when I pasted them (1.)
Probably a question of locale. In the "C" locale, most
implementations only allow characters in the range 0...127 when
converting wchar_t to char.
For wprintf, you'll have to set the global locale. For
std::wcout, you'll have to imbue the desired locale (since the
object was constructed using the global locale before you could
modify the global locale).
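Something along these lines (just a sketch, assuming the environment
supplies a UTF-8 locale):

#include <clocale>
#include <iostream>
#include <locale>

int main() {
    std::setlocale(LC_ALL, "");        // global locale, used by wprintf
    std::wcout.imbue(std::locale("")); // wcout keeps the locale it was
                                       // constructed with, so imbue it
    std::wcout << wchar_t(0x00E9) << L'\n';
    return 0;
}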
> 4. Calling setlocale() as directed by an example has no effect
> on my program.
What did you use as an argument to setlocale()? (But this is
very OS dependent. I know how it works under Unix, but not for
other systems.)
> 5. Using fwide() to determine if my setup is legit works
> because I don't hit the exit condition that I wrote for that
> test.
> So, I don't know what else to try to get this to work.
> There's a lot of stuff about Unicode on Windows out there but
> I'm not doing Windows, and figured the Linux community might
> have an answer.
Linux is pretty simple. Just use a UTF-8 locale and a UTF-8
encoded font, and everything works pretty well. For that
matter, under Unix, if all you're concerned with is a few
special characters, I'd just manually encode them as strings in
UTF-8, and output them as char. Most (if not all) of the
locales simply pass all char straight through, without worrying
whether they're legal or not. So instead of a wchar_t with
0x266D, you'd use:
char const flat[] = "\xE2\x99\xAD" ;
and output that directly. (At least, that's what I think should
happen. I don't get any output for the above, but it works with
other Unicode characters, so I suspect that the problem is
simply that my fonts don't contain the characters you give. All
of the symbols in that block (codes 2600 to 26FF) display as a
simple blank on my Linux machine.)
--
James Kanze (GABI Software) email:james...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
> > > I'm sure this has been addressed before but I've hunted
> > > all over the web and no one seems to provide a
> > > comprehensive answer. I just want to do one thing: Under
> > > CentOS, in a simple C++ program, I'd like to be able to
> > > print Unicode characters to a console output. For
> > > example, I'd like to print the musical flat, natural, and
> > > sharp signs.
> > > Here's what I've done so far:
> > > 1. Using Eclipse, created a small C++ console project.
> > > 2. Declared three variables, each of type wchar_t, and assigned them their
> > > Unicode values (0x266d, 0x266e, 0x266f).
> > > 3. Attempted to print them out using wprintf().
> > > 4. Set my output console to a font which can represent the characters
> > > (glyphs?) - Lucida Console
> > I am not sure about CentOS, but in Linux generally UTF-8
> > is used. One should have a UTF-8 locale (e.g.
> > LANG=en_US.utf8). If your code internally uses wchar_t, then
> > it should be converted to UTF-8 before output. I am not sure
> > if wprintf() or wcout can do that automatically. In our
> > software we use UTF-8 and std::string internally, and it is
> > working perfectly in Linux.
> Here's my locale setting:
> (mfeher) mfeher-l4 [~] > locale
> LANG=en_US.UTF-8
> LC_CTYPE="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_COLLATE="C"
> LC_MONETARY="C"
> LC_MESSAGES="C"
> LC_PAPER="C"
> LC_NAME="C"
> LC_ADDRESS="C"
> LC_TELEPHONE="C"
> LC_MEASUREMENT="C"
> LC_IDENTIFICATION="C"
> LC_ALL=C
> I was under the impression that I had more of an "environment
> setup" issue than a coding issue, i.e. I was unaware that I
> had to do anything more to the code than change from
> cout/printf to wprintf. Also, from a brief, brief reading of
> all this material on the Internet, I don't want UTF-8 because
> that's too small to hold the character codes I wish to print.
UTF-8, UTF-16 and UTF-32 are "transformation formats",
specifying how to "present" any Unicode (UCS-4) character as a
sequence of 8-bit bytes, 16-bit words, or 32-bit words. Since
all of the data interfaces under Unix are 8 bits, UTF-8 is the
transformation format you need.
> Here's the code I am trying:
>
> #include <clocale>   // setlocale
> #include <cstdio>    // stdout
> #include <cwchar>    // fwide, wprintf
> #include <iostream>
> using namespace std;
>
> int main() {
>     // cout << "Testing Unicode" << endl; // prints Testing Unicode
>     // If you try to mix Unicode printing with non-Unicode printing,
>     // the switch causes you to lose output!
>     setlocale(LC_ALL, ""); // Does nothing
>     // Let's check our orientation...it never fails
>     if (fwide(stdout, 1) < 0)
>     {
>         cerr << "ERROR: Output not set to wide. Exiting..." << endl;
>         return -1;
>     }
>     // Declare a Unicode character and try to print it out
>     wchar_t mychar = 0x266d; // The music flat sign
>     wprintf(L"Here's mychar: %lc\n", mychar);
>     return 0;
> }
That should work, unless the font doesn't have a rendering for
0x266D (the ones I have installed under Linux don't). This is
easily checked---try some more "usual" Unicode character, e.g.
0x00E9 (an é). If that displays, then the problem is almost
certainly that the font doesn't contain a rendering for the
character you want. In which case, there's no way you'll be
able to display it (other than by finding some font which does
support it, installing it and using it).
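Something like this, with the same setlocale() call as in your
program:

#include <clocale>
#include <cwchar>

int main() {
    std::setlocale(LC_ALL, "");
    wchar_t test = 0x00E9; // e with acute accent, present in most fonts
    std::wprintf(L"Test: %lc\n", test);
    return 0;
}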
Hi James,
The font I am using (Lucida Console) supports the characters. My
assertion of this is based on the fact that I can go to one of the
websites with extended character maps, copy the symbol(s) desired, and
stick them into my source file, which also uses the same font. The
characters appeared just fine, but I couldn't save the file and ran
immediately into encoding problems (I am using Eclipse C++) that
resulted in me basically being unable to save or open the file
anymore, so I copied my source into a new project and started over.
But I was able to copy the original symbols and drop them directly in
my file editor (Lucida Console) and they displayed fine.
Also, help me understand in your example how my code 0x266D gets
turned into "\xE2\x99\xAD".
Mike
That encoding library looked way too involved for what I want to do,
and in the end, I didn't see any simple method to set my encoding or
do whatever I need to do to print my characters. I just want to pass
my Unicode code string to a function and have it print out correctly.
Thanks.
How do I check what encoding the font has?
So let me see if I can explain my understanding of this whole thing
(because I want to finally solve this problem, having been trying to
figure it out off and on for quite a while):
1. Let's say I have a file, and it's nothing more than a string of 1s
and 0s when you get right down to it.
2. The encoding that I will use to read/display the file specifies to
the OS how to group and treat the bits.
3. A selected encoding then specifies (for lack of a better term) a
set of codepages from which to select the characters to display (i.e.
based on a particular grouping of bits/bytes, this will index into an
appropriate set of characters).
4. The bytes are presented to the display portion of the OS and it
will reference the operable font in the window, editor, dialog, etc.
to display the individual characters.
5. If the specified font doesn't have a glyph for a given byte
combination, the resulting behavior will be unpredictable.
6. If it does, it will basically do a table lookup for the appropriate
glyph, and fetch that glyph and dump it to the screen.
Is any of this correct?
Mike
That's because you didn't really find out how to use it. You were most
probably confused by the large example at the beginning. The library is
really simple to use.
std::string encodedString;
for(size_t i = 0; i < unicodeValues.size(); ++i)
    utf8::append(unicodeValues[i], std::back_inserter(encodedString));
std::cout << encodedString;
> and in the end, I didn't see any simple method to set my encoding or
> do whatever I need to do to print my characters. I just want to pass
> my Unicode code string to a function and have it print out correctly.
> Thanks.
The above code does exactly that.
Thanks for your information, but you know what, I really don't care to
spend hours looking for something that should be fairly simple to do.
I wasn't "confused" by anything. I just don't have time an interest
in becoming a Unicode expert to do something very simple. I wonder if
this means I have to download and install the utfcpp library, or if I
can just do this as-is in the code above.
>
> Also, help me understand in your example how my code 0x266D gets
> turned into "\xE2\x99\xAD".
Presumably this is the UTF-8 encoding of your character.
One thing is the encoding your source file uses, and the other is what
you want to output. I'm not familiar with Eclipse so I cannot comment on
the former. If needed, you can use iconv() to convert from your encoding
to UTF-8.
The following program works for me on a SuSE Linux and produces some kind
of music sign on the console. My locale is LANG=en_US.utf8.
#include <stdio.h>
int main() {
    const unsigned char test[4] = {0xE2, 0x99, 0xAD, 0};
    printf("Test: %s\n", (const char*)test); // cast: %s expects char*
}
hth
Paavo
Hi Paavo,
Thanks for the help. I will try that. I still do not see how 0x266D
=> E299AD. Where is the conversion for that explained?
I just tried that but it did not work for me - but, I'm running the
console output to the Eclipse console tab, not within an xterm.
UTF sequences are usually not expressed as a single number because they
are of variable length.
> Where is the conversion for that explained?
In the URL I posted few days ago: http://en.wikipedia.org/wiki/UTF-8
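In short: 0x266D is 0010 0110 0110 1101 in binary, which needs the
three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx. Splitting the 16
bits as 0010 / 011001 / 101101 and dropping them into the pattern
gives 11100010 10011001 10101101, i.e. E2 99 AD.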
Can you check what is your LANG setting in this console? Maybe you should
turn to an Eclipse forum?
hth
Paavo
Ok so far.
> 2. The encoding that I will use to read/display the file specifies to
> the OS how to group and treat the bits.
I think the OS really does not care, at least the Linux OS. The visual
output is produced by certain applications, e.g. xterm. The application
has to know the encoding of the input data it receives. The encoding of
the file can be sometimes extracted from the file, like in case of XML
files or BOM markers in Unicode files, and sometimes it is determined
otherwise, like by the locale settings.
> 3. A selected encoding then specifies (for lack of a better term) a
> set of codepages from which to select the characters to display (i.e.
> based on a particular grouping of bits/bytes, this will index into an
> appropriate set of characters).
This is internal business of the visual application or the X window
system (don't know the details).
> 4. The bytes are presented to the display portion of the OS and it
> will reference the operable font in the window, editor, dialog, etc.
> to display the individual characters.
This is internal business of the visual application or the X window
system (don't know the details).
> 5. If the specified font doesn't have a glyph for a given byte
> combination, the resulting behavior will be unpredictable.
This is internal business of the visual application or the X window
system (don't know the details).
> 6. If it does, it will basically do a table lookup for the appropriate
> glyph, and fetch that glyph and dump it to the screen.
This is internal business of the visual application or the X window
system (don't know the details).
>
> Is any of this correct?
Maybe, but hardly anything of this is relevant. The terminal program
expects some kind of encoding, and you have to provide it. In Linux the
encoding is usually UTF-8. If all your files use the same UTF-8 encoding
internally, then there is no problem, you just output the data. If you
still insist on using wchar_t and UCS-4 encoding internally, then you
have to perform the translation by yourself. It's as simple as that.
hth
Paavo
Yet you have already spent days asking about it in this newsgroup. If
you had googled about it instead and read a few pieces of documentation,
you would have probably saved yourself a lot of trouble.
(Not that it's wrong to ask here or anywhere else for help. It's just
that your attitude feels a bit picky. When someone suggests a relatively
easy solution to your problem you dismiss it without even trying to see
how that solution works.)
> I wasn't "confused" by anything. I just don't have time an interest
> in becoming a Unicode expert to do something very simple.
Unicode and its encodings are, unfortunately, not a simple matter.
Fortunately people have already gone through the trouble and offer free
libraries to do the hard part.
> I wonder if
> this means I have to download and install the utfcpp library, or if I
> can just do this as-is in the code above.
You don't have to install it. It's just a set of header files. You put
them anywhere your compiler will find them (e.g. inside your project
directory) and then just #include the appropriate header and start using
it. I gave you a simple example of the usage.
Don't immediately dismiss a solution just because you don't understand
it in 10 seconds.
The conversion itself is relevant only if you are going to do it
yourself (or are genuinely interested in how it works, which wouldn't be
a bad idea, really; general knowledge about things never hurts).
There exist libraries (like the one I gave a link to in another post)
to do the conversion for you.
Ah, I see that. Thanks.
I don't know if there is a LANG setting for this console, but I will
check. I can check the encoding of the files/project, but that's
about it. I'll see what I can find for this.
Well, I'm trying to ascertain how much of the problem needs to be
fixed in the code and/or in the output environment. I thought that by
making the output use wide format that that would solve the problem,
but apparently not. Right now I am trying to find out if a font
change in the output console is in order, but I still maintain that my
selected font is capable of displaying these characters properly, so
I'm assuming I'm doing something wrong in the code. However, the fact
that I am getting things like "Here's your character: ???" is somewhat
encouraging, in that it is attempting to print it out, but can't fetch
a suitable glyph for a variety of reasons.
Incidentally, in Java, I didn't have this problem. I was able to use
its Unicode facilities and life was easy, once I figured out how to do
it. I can get it to print most chars. When I went back to look for
that old code which I knew I'd done, I realized I didn't ever try to
do this in C++, and even if I had, it was on WinXP.
And don't immediately dismiss someone because they don't have the
interest or inclination to spend a lot of time on what seems like
it should be a simple answer. Moreover, don't lecture me. See
Paavo's posts for an example of how to do this.
I don't understand your statement that hardly any of this is
relevant. I am describing to you my understanding of how characters
are stored and displayed, and am asking for corrections on the model.
It's a closely related tangent to my problem of direct screen output
(not file-based, because I don't yet have that issue to deal with).
In Unix they say everything is file-based. And what is relevant is the
communication between you (your program) and the one who is listening to
you (in this case xterm or "eclipse console tab"). Yes it is nice to know
how the things go further, in what exact format the fonts are stored and
how the LCD display would make them appear in color, how the cone
receptors in the eye are transforming this into the nerve impulses, how
the brain visual cortex is interpreting the signals and translating back
to visually distinguishable characters, how they are further interpreted
as symbols carrying a specific meaning - yes, that would be nice!
FWIW, I suspect that Unicode fonts are not stored as flat lookup arrays,
but rather as piecewise arrays, because of size considerations. But I'm not
at all sure I'm right here.
Paavo
Okay, I tried the following code:
const unsigned char test[4] = {0xE2, 0x99, 0xAD, 0}; // the conversion from 0x266d; NUL to pad
printf("Test: %s\n", test);
and it didn't work for me. I get the output: "Test: ���". I did some
more digging on Eclipse and apparently there is a startup option and
also the same option for each defined run configuration:
-Dfile.encoding=UTF8
I'm not sure that I want to change the file encoding, only the output
encoding.
I am also trying to find out how to generate a binary in Eclipse that
can be run outside of Eclipse, i.e. the program executable, so that I
can just run it in an xterm to see what happens. I am cross-posting
to their forums to see how to get this to work. If that fails, I may
fall back to the utf8 library referenced earlier.
Mike
That's your problem: You want a solution but you are not ready to do
the necessary work to learn the solution. Even when someone outright
gave you a simple answer to your problem, you still immediately
dismissed it because you didn't go through the trouble of spending a few
minutes learning the solution.
> Moreover, don't lecture me. See
> Paavo's posts for an example of how to do this.
Which post? The one where he basically instructs you to make the UTF-8
encoding by hand?
Well, if you *really* want to encode all your strings to UTF-8 by
hand, then please go right ahead. I won't stop you.
On the other hand, you said you wanted an *easy* solution to this
problem. Using an encoding library is at least a hundred times easier
than trying to do the encoding yourself. But whatever floats your boat.
Using the library I mentioned, encoding one unicode character to UTF-8
is basically one single utf8::append() call. Encoding it by hand
requires quite a lot of work. You could, of course, write an encoder
yourself, but then you would be basically replicating utf8::append().
What would be the point? (Especially since you don't seem to have the
time for this.)
Since you seem to detest the library solution and prefer to make the
UTF-8 encoding yourself, please let me know how that worked for you. I'm
honestly curious.
What I'm getting at is, is it really necessary for me to incorporate
all of that stuff just for three lines of code?
> On Sep 1, 10:31 am, Juha Nieminen <nos...@thanks.invalid> wrote:
>
>> Using the library I mentioned, encoding one unicode character to
>> UTF-8 is basically one single utf8::append() call. Encoding it by
>> hand requires quite a lot of work. You could, of course, write an
>> encoder yourself, but then you would be basically replicating
>> utf8::append(). What would be the point? (Especially since you
>> don't seem to have the time for this.)
>>
>> Since you seem to detest the library solution and prefer to make
>> the UTF-8 encoding yourself, please let me know how that worked
>> for you. I'm honestly curious.
Implementing a Unicode-to-UTF-8 converter is not really so hard. It
takes about 20-30 lines of C code IIRC.
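Something like this sketch (untested here; it covers the full range
but does no validation of surrogates or out-of-range values):

#include <string>

void encode_utf8(unsigned long cp, std::string& out) {
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += char(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
}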
>
> What I'm getting at is, is it really necessary for me to incorporate
> all of that stuff just for three lines of code?
>
Not at all. If all your source and input files are in the right encoding,
there should be no need to do anything. I am sure emacs can handle
files in UTF-8 encoding; I don't know about Eclipse.
The reason why everything is UTF-8 in Linux is that all the string
interfaces are 8-bit ASCIIZ for historic reasons, with zero bytes used
as string terminators. UTF-8 fits in here perfectly, allowing Unicode
content to pass through such interfaces even though they were not
devised especially for that.
<rant>
On the other hand, Microsoft made a premature attempt to standardize 16-
bit Unicode, but landed on UTF-16 later when they realized 16 bits are
not enough, ending up with basically using the same trick of passing
variable-length elements through fixed-element-size interfaces which
already were present. Sadly, UTF-16 has no benefits over UTF-8
whatsoever, at least in Western countries. The final outcome for Windows
is that each SDK function taking string arguments is present in two
versions (narrow and wide), and there is a huge pile of nasty macros
trying to leverage that for the user programs.
Well, thousands of programmers have to be kept busy by something,
right? I hope Linux developers do not have time for such nonsense.
</rant>
hth
Paavo
Are you planning on having more unicode data in your program than just
that one symbol? Or do you think in some future program you might want
more extensive support for unicode?
If your answer to either question was yes, then it definitely will pay
off learning how to handle unicode, UTF-8 and related libraries. It will
save you a lot of work in the future.
Note that handling UTF-8 encoded text directly (without ever
converting it to raw unicode values and back) is not feasible in
every situation. For example, advancing in a UTF-8 encoded string one
character at a time is not trivial because UTF-8 is a variable-length
encoding: some characters will take more than one byte (between 2 and
4), and in fact, some characters can be composite (in other words,
composed of more than one unicode value).
Thus if you ever need to write a program which needs to distinguish
between different unicode characters (let's say, for example, count
the number of characters in a line), using a Unicode/UTF-8 library
will make it enormously easier than trying to do it yourself.
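For example, merely counting the code points of a UTF-8 string by
hand already requires knowing the encoding (a sketch; note that it
still counts the parts of a composite character separately):

#include <string>

std::string::size_type count_code_points(const std::string& s) {
    std::string::size_type n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        // continuation bytes have the bit pattern 10xxxxxx
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n;
    return n;
}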
So, as per my posts, I set the file encoding to UTF-8...made sure my
environment is UTF-8 (Linux locale)...and am trying to determine how
to set the runtime (console) output in Eclipse to be UTF-8 (I have
posted to an Eclipse forum, still no response). I figure I shouldn't
have to do anything else in my code. I don't know why it's not
working.
I'm not planning on doing anything extensive with Unicode, which is
why I'm not pursuing a more encompassing route. Obviously if I had a
big Unicode issue on my hand, I would have started to think about a
more long-range solution. I definitely would be thinking of a UTF-8
library or incorporating some functionality but right now this is just
on the nit level.
Well, suit yourself.
Maybe it's just me, but I still don't find using the library as
difficult as you make it sound. And once you have learned it, using it
again in the future will be a breeze.