How to replace Umlauts?

Heinz-Mario Frühbeis

Apr 16, 2016, 10:54:19 AM

Hi,

I have an issue with XDrawString from Xlib, because it prints only
special characters instead of ü, ö, and so on.
E.g.:
string test = "Hüh";
XDrawString(..., test, ...) // isn't printing ü

But what works is:
string test = "h";
char nChar = (char) 252; // is ASCII ü
test = test + nChar;
test = test + "h";
XDrawString(..., test, ...) // is now printing ü

So I wanted to replace ü with (char) 252...
But I can't get it working; this is one of my tries:

strimg nCaption = "ausführen";

if(nCaption != ""){
if(nCaption.find("u") > 0){
char nChar = (char) 252;
const char* nChar1 = &nChar;
std::string umlaut = "ü";
nCaption.replace(nCaption.find("ü"), umlaut.length() , nChar1);
}
}

The error is:
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::replace

Can someone please help me out?

Regards
Heinz-Mario Frühbeis

Kalle Olavi Niemitalo

Apr 16, 2016, 12:52:25 PM

Heinz-Mario Frühbeis <D...@Earlybite.individcore.de> writes:

> char nChar = (char) 252;
> const char* nChar1 = &nChar;
> std::string umlaut = "ü";
> nCaption.replace(nCaption.find("ü"), umlaut.length() , nChar1);

I'm not sure why it'd throw std::out_of_range but that code has
at least one bug: because nChar1 is a const char*, the replace
function assumes it points to a null-terminated string but you
didn't put any null character there. Also, if basic_string::find
doesn't find anything, it returns npos rather than 0.

You could do it this way instead:

const string umlaut = "ü";
for (string::size_type pos = nCaption.find(umlaut);
     pos != string::npos;
     pos = nCaption.find(umlaut, pos + 1)) {
    nCaption.replace(pos, umlaut.length(), 1, (char) 252);
}

(Using pos + 1 in the second find call to prevent an infinite
loop if "ü" is "\xfc" in the execution character set.)
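
The loop above can be wrapped in a small reusable function. This is only a sketch: the name replace_all_with_byte is made up for illustration, and the test strings spell the UTF-8 bytes of "ü" as "\xC3\xBC" so that the result does not depend on how the source file is encoded.

```cpp
#include <string>

// Hypothetical helper (not part of any library): replaces every occurrence
// of `needle` in `text` with the single byte `replacement`.
std::string replace_all_with_byte(std::string text,
                                  const std::string& needle,
                                  char replacement)
{
    if (needle.empty())
        return text;  // an empty needle would match at every position

    std::string::size_type pos = text.find(needle);
    while (pos != std::string::npos) {
        // Overload replace(pos, len, count, ch): no C string needed.
        text.replace(pos, needle.length(), 1, replacement);
        // Continue behind the inserted byte so it is never matched again.
        pos = text.find(needle, pos + 1);
    }
    return text;
}
```

For example, replace_all_with_byte("aus\xC3\xBChren", "\xC3\xBC", '\xFC') yields the ISO 8859-1 bytes of "ausführen".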

However, a proper solution would convert all characters to the
correct encoding, rather than only "ü". Which font are you using
in X11?

AFAIK, X11 core text requests are not recommended nowadays,
and toolkits use X Rendering Extension instead.
http://www.x.org/wiki/Development/X12/#char2b

Jens Thoms Toerring

Apr 16, 2016, 1:22:56 PM

Heinz-Mario Frühbeis <D...@earlybite.individcore.de> wrote:
> I have an issue with XDrawString from Xlib, because it prints only
> special characters instead of ü, ö, and so on.
> E.g.:
> string test = "Hüh";
> XDrawString(..., test, ...) // isn't printing ü

> But what works is:
> string test = "h";
> char nChar = (char) 252; // is ASCII ü

That's where things start to go wrong: there is no 'ü' in
ASCII. ASCII defines only the values up to 127 (ASCII is an
abbreviation for American Standard Code for Information
Interchange, the Americans don't use umlauts and, moreover,
back when it was designed it wasn't uncommon to use only 7
bits for representing characters). There are lots of different
encodings that use the values above 127, one of them being
'iso_8859-1' (commonly used for Western European languages),
and that's probably where you got the idea that 'ü' is
represented by the value 252 (the same holds in a number of
other iso_8859-x encodings but not all of them: in the
encoding used for Cyrillic, 'iso_8859-5', 252 represents 'ќ').
And XDrawString() will render that value as 'ü' only if you
also use a font that is made for these encodings.

Next problem: if you have

> string test = "Hüh";

in your code then what is stored in 'test' depends on what
encoding your editor uses. Nowadays it's not unlikely that
this is UTF-8 and, if you look at the individual bytes, e.g.
by going through 'test.c_str()', you will find that it contains
4 chars, the first one being 0x48, the second 0xC3, the third
0xBC and the fourth 0x68. The 0x48 and 0x68 are 'H' and 'h' as
in ASCII (ASCII characters, i.e. everything up to 127, are
encoded the same way in UTF-8) and the combination of 0xC3
and 0xBC is the way UTF-8 encodes 'ü', officially called
"LATIN SMALL LETTER U WITH DIAERESIS". If you use a font
for iso_8859-1 encoding, XDrawString() will render them as
'Ã' and '¼', but if you'd use a UTF-8 font it would be
rendered as 'ü'.
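
To look at the individual bytes as described, something like the following sketch may help. The helper name hex_bytes is an assumption made for illustration; the UTF-8 bytes are written as escapes so the example behaves the same whatever encoding the editor uses.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats every byte of `s` as two uppercase hex
// digits, separated by single spaces.
std::string hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : s) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(byte));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

Called as hex_bytes("H\xC3\xBCh") it returns "48 C3 BC 68", the four bytes mentioned above. Passing a string literal that contains a literal 'ü' instead shows what your own editor and compiler actually stored.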

It gets trickier when you use input coming from outside
the program: what you will read depends on the encoding
used by whatever sends the data. If you e.g. try to draw
strings entered into a terminal, what 'test' will contain
depends on what encoding the terminal uses. And if you use
input you got from Xlib functions, things get even a bit
more "interesting".

> test = test + nChar;
> test = test + "h";
> XDrawString(..., test, ...) // is now printing ü

> So I wanted to replace ü with (char) 252...
> But I can't get it working; this is one of my tries:

Well, it works somehow because you use a (probably iso_8859-1)
font and forced the value of 252 into the string. std::string
doesn't care at all about the encoding; you can actually store
any binary data in a std::string. And the length() method doesn't
tell you how many "letters" there are, since that would depend
on how letters are encoded, but just the plain number of bytes.
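
A small sketch of the difference between bytes and "letters", assuming the string holds valid UTF-8 (the function name utf8_code_points is made up for illustration and does no validation):

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes, which always
// have the bit pattern 10xxxxxx. Assumes `s` is valid UTF-8.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For "H\xC3\xBCh" the length() method reports 4 while utf8_code_points() reports 3. Note that a code point is still not always one visible letter, for example when combining accents are used.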

The following is for sure not the code you're using (it
won't even compile), so anything one can say about it may
have nothing to do with the problems you're facing...

> strimg nCaption = "ausführen";
      ^

> if(nCaption != ""){

What's that test for - the find() method of std::string will
work quite fine on an empty string? And why, if you insist on
this test, not use the empty() method?

> if(nCaption.find("u") > 0){

Why do you look for "u" in the string? And the find() method
does return the position of what you were looking for, which
can include 0 (the very start of the string). What you should
compare to is std::string::npos, which is what gets returned if
the string does not contain what you were looking for. So your
test asks: is there a "u" somewhere in the string beyond the
first byte, or none at all? What you need here is

if (nCaption.find("ü") != std::string::npos)

to make sense since it asks: is there an "ü" in that string?
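
The two tests can be compared side by side. Both function names below are made up for illustration; the second one only mirrors the condition from the original post.

```cpp
#include <string>

// True if `needle` occurs anywhere in `text`, including at position 0.
bool contains(const std::string& text, const std::string& needle)
{
    return text.find(needle) != std::string::npos;
}

// The condition from the original post, kept only for comparison: it is
// false for a match at position 0 and true when nothing is found, because
// npos is the largest possible size_type value.
bool found_by_greater_than_zero(const std::string& text,
                                const std::string& needle)
{
    return text.find(needle) > 0;
}
```

With the "> 0" test a caption without the character still enters the branch. A following replace() called with the npos returned by find() throws std::out_of_range, which may be what happened in the real program.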

> char nChar = (char) 252;
> const char* nChar1 = &nChar;
> std::string umlaut = "ü";
> nCaption.replace(nCaption.find("ü"), umlaut.length() , nChar1);

The third argument to the replace() method must, when you
call it with a char pointer, be a pointer to a C string
(which must include a '\0' at the end!), but what you pass
it is a pointer to a single char with no '\0' following it
(or only just by chance). Be prepared for lots of strange
looking stuff to get inserted for the 'ü'...

> }
> }

All that could have been written much simpler and cleaner as

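std::string nCaption = "ausführen";
// "\xFC" is byte 252 plus the terminating '\0'; a plain 252 in a braced
// char initializer is a narrowing error since C++11 where char is signed
const char iso_8859_1_uuml[] = "\xFC";
const size_t len = std::string("ü").length();
size_t pos = 0;

// resume behind each replacement so the inserted byte is never rescanned
while ((pos = nCaption.find("ü", pos)) != std::string::npos) {
    nCaption.replace(pos, len, iso_8859_1_uuml);
    ++pos;
}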

But that only "solves" the problem for 'ü', you've got to do
the same replacements for all non-ASCII characters you're
prepared to deal with (and decide what to do with those that can't
be represented in your chosen encoding)... And if you switch
to a font that's made for UTF-8 this will actually break things.
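
Handling all characters that ISO 8859-1 can represent, rather than only 'ü', could look roughly like the sketch below. It assumes the input is UTF-8; the function name utf8_to_latin1 and the '?' fallback are choices made for illustration, not an established API.

```cpp
#include <cstddef>
#include <string>

// Sketch: converts UTF-8 to ISO 8859-1. Code points above U+00FF and
// malformed sequences are replaced by `fallback`.
std::string utf8_to_latin1(const std::string& in, char fallback = '?')
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            // ASCII is encoded identically in both encodings.
            out += static_cast<char>(b0);
            ++i;
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < in.size()
                   && (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
            // Two-byte sequence covering U+0080..U+00FF.
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            out += static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F));
            i += 2;
        } else {
            // Not representable: emit the fallback once and skip the
            // continuation bytes that belong to this character.
            out += fallback;
            ++i;
            while (i < in.size()
                   && (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80)
                ++i;
        }
    }
    return out;
}
```

With this, "aus\xC3\xBChren" becomes "aus\xFChren", while a euro sign (three bytes in UTF-8, absent from ISO 8859-1) turns into a single '?'.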

Regards, Jens
--
\ Jens Thoms Toerring ___ j...@toerring.de
\__________________________ http://toerring.de

Heinz-Mario Frühbeis

Apr 16, 2016, 1:51:40 PM

On 16.04.2016 at 18:46, Kalle Olavi Niemitalo wrote:
> Heinz-Mario Frühbeis <D...@Earlybite.individcore.de> writes:
>
>> char nChar = (char) 252;
>> const char* nChar1 = &nChar;
>> std::string umlaut = "ü";
>> nCaption.replace(nCaption.find("ü"), umlaut.length() , nChar1);
>
> I'm not sure why it'd throw std::out_of_range but that code has
> at least one bug: because nChar1 is a const char*, the replace
> function assumes it points to a null-terminated string but you
> didn't put any null character there. Also, if basic_string::find
> doesn't find anything, it returns npos rather than 0.
>
> You could do it this way instead:
>
> const string umlaut = "ü";
> for (string::size_type pos = nCaption.find(umlaut);
> pos != string::npos;
> pos = nCaption.find(umlaut, pos + 1)) {
> nCaption.replace(pos, umlaut.length(), 1, (char) 252);
> }
>

This is working. Thanks!

I still struggle with strings, chars, and whether a terminating 0 byte is
needed or not. I think you know what I mean. I come from more than 20 years
of VB6, and I have been working with C++ and GNU/Linux for about 2 years now.

> (Using pos + 1 in the second find call to prevent an infinite
> loop if "ü" is "\xfc" in the execution character set.)
>
> However, a proper solution would convert all characters to the
> correct encoding, rather than only "ü". Which font are you using
> in X11?
>

I'm using different fonts...
And the program can change the font at runtime.

> AFAIK, X11 core text requests are not recommended nowadays,
> and toolkits use X Rendering Extension instead.
> http://www.x.org/wiki/Development/X12/#char2b
>

I'm not sure if I know what you mean... Currently I use Xlib and
XListFonts, so I have many 'misc', some 'schumacher', and fewer 'sony' fonts.
AFAIK those fonts are available on every GNU/Linux system...
Maybe later I will implement other fonts too, but for now it is OK for me.

Thank you very much for your help!

Regards
Heinz-Mario Frühbeis

Heinz-Mario Frühbeis

Apr 16, 2016, 1:56:29 PM

On 16.04.2016 at 18:46, Kalle Olavi Niemitalo wrote:

Maybe a follow-up question:
How do you handle e.g. a 10-page, or even a 100-page document?
Does it really need to be parsed in this way?

Regards
Heinz-Mario Frühbeis

Kalle Olavi Niemitalo

Apr 16, 2016, 4:17:01 PM

Heinz-Mario Frühbeis <D...@Earlybite.individcore.de> writes:

> I'm using different fonts...
> And the program can change the font at runtime.

XCreateFontSet + either XmbDrawString or Xutf8DrawString seems
the best solution, as long as you're using X11 core text.
Then your program doesn't itself have to convert the strings
to the encoding of the font.

Jens Thoms Toerring

Apr 16, 2016, 4:24:54 PM

Heinz-Mario Frühbeis <D...@earlybite.individcore.de> wrote:
> Maybe a follow-up question:
> How do you handle e.g. a 10-page, or even a 100-page document?
> Does it really need to be parsed in this way?

I'm not really sure what this question means. If you need to
re-encode something, then yes, something has to run over all
of it and do the conversion. But, of course, it doesn't have
to be your program that does it again and again if this is
about some text that you read in from some file. There are
tools for re-encoding, the usual suspect being 'iconv'. If
you want to convert some file from say UTF-8 to ISO_8859-1
you'd do

iconv -f UTF-8 -t ISO_8859-1 -o outputfile inputfile

Use 'iconv -l' to get a list of all encodings known to it
(there are more than 1000 ;-). Of course, there are limits:
if the file in its source encoding contains characters that
don't exist in the target encoding then there's nothing
useful it could do and it will complain about an "illegal
input sequence" at the first character that can't be
converted.

And then there's libiconv that you can use within your
program to do re-encodings (though it's a C library, but
there's also a C++ wrapper for it, called 'iconvpp', and
available on github).
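
Calling the C interface directly could look like the following sketch. It assumes the iconv(3) functions from <iconv.h> as provided by glibc or libiconv (on some systems you have to link with -liconv, and the set of supported encoding names varies). The function name reencode is made up for illustration.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch: re-encodes `input` from encoding `from` to encoding `to`.
// Throws std::runtime_error if the conversion is not possible.
std::string reencode(const std::string& input, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string in = input;             // iconv() wants a non-const char*
    std::string out;
    char* src = &in[0];
    std::size_t src_left = in.size();
    char buffer[256];

    while (src_left > 0) {
        char* dst = buffer;
        std::size_t dst_left = sizeof buffer;
        const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        // Save errno before anything else can overwrite it.
        const int err = (rc == (std::size_t)-1) ? errno : 0;
        out.append(buffer, sizeof buffer - dst_left);
        // E2BIG only means the output buffer was full: loop again.
        if (err != 0 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: cannot convert input");
        }
    }
    // Stateful target encodings would need one more iconv() call with a
    // null input pointer to flush; ISO 8859-1 and UTF-8 do not.
    iconv_close(cd);
    return out;
}
```

For example, reencode("aus\xC3\xBChren", "UTF-8", "ISO-8859-1") gives the same bytes as the command line call above would write to the output file.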

Heinz-Mario Frühbeis

Apr 16, 2016, 4:33:57 PM

I will have a look at it. Thanks.

Regards
Heinz-Mario Frühbeis

Kalle Olavi Niemitalo

Apr 17, 2016, 3:38:36 AM

j...@toerring.de (Jens Thoms Toerring) writes:

> If you use a font for iso_8859-1 encoding XDrawString() will
> render them as 'Ã' and '¼', but if you'd use a UTF-8 font it
> would be rendered as 'ü'.

I don't think UTF-8 fonts are feasible in the X11 core protocol.
http://www.x.org/releases/X11R7.7/doc/xproto/x11protocol.html#glossary
(X Window System Protocol; X Version 11, Release 7.7; Version 1.0)
says about Font, "The client simply indicates values used to
index the glyph array." These values are 1-byte or 2-byte.
If you used UTF-8 there, it would only support U+0000 to U+07FF
(encoded as 0xDF 0xBF).
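
The two-byte limit can be checked with a small encoder. This is only a sketch: the name encode_utf8 is made up for illustration, and it does not reject surrogates or values above U+10FFFF.

```cpp
#include <string>

// Encodes one code point as UTF-8 (no validation of the input value).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                    // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                            // 4 bytes: 11110xxx and three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

U+07FF is the last code point that fits into two bytes (0xDF 0xBF); U+0800 already needs three (0xE0 0xA0 0x80), which no longer fits into a 2-byte glyph index.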

Kalle Olavi Niemitalo

Apr 17, 2016, 3:01:07 PM

j...@toerring.de (Jens Thoms Toerring) writes:

> Next problem: if you have
>
>> string test = "Hüh";
>
> in your code then what is stored in 'test' depends on what
> encoding your editor uses.

Even if the text in the source file is UTF-8, that doesn't mean
the bytes in the string variable will be UTF-8 too. The compiler
may recode the string. GCC nowadays has -finput-charset and
-fexec-charset options for controlling this, and Microsoft's
compiler likewise has /source-charset and /execution-charset:
https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/

For example, if the source code says "Hüh" and you use
g++ -fexec-charset=EBCDIC-AT-DE, then the string will contain
{ 0xC8, 0xD0, 0x88 } at run time. But if the source code
says u8"Hüh", then you get the UTF-8 { 0x48, 0xC3, 0xBC, 0x68 }
regardless of the execution character set.
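
A sketch of how to check this in a program, assuming a C++11 or later compiler. The universal character name \u00FC is used so that the source file itself contains only ASCII, and the names utf8_literal_size and utf8_literal_byte are made up for illustration. Since C++20 the elements of a u8 literal have type char8_t, so the sketch only inspects the bytes.

```cpp
#include <cstddef>

// A u8 literal is always UTF-8: 'H', 0xC3, 0xBC, 'h' and the final '\0'.
constexpr std::size_t utf8_literal_size = sizeof(u8"H\u00FCh");

// Returns byte `i` of the literal; the caller must pass i < utf8_literal_size.
unsigned char utf8_literal_byte(std::size_t i)
{
    const auto& literal = u8"H\u00FCh";
    return static_cast<unsigned char>(literal[i]);
}
```

utf8_literal_size is 5, and bytes 1 and 2 are 0xC3 and 0xBC whatever -fexec-charset says. The same check on a plain "H\u00FCh" literal would give results that depend on the execution character set.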

Heinz-Mario Frühbeis

Apr 28, 2016, 2:24:46 AM

On 16.04.2016 at 22:10, Kalle Olavi Niemitalo wrote:

Hi,

what I noticed is that the printing of the umlauts depends on the font name.
E.g. currently my Label area uses a different font name than my
Textbox area.
Umlauts in the Label come out as special characters, but the Textbox
prints 'Ü', 'ü', 'ß', etc...

So it depends on the given/selected font.

Til then
Heinz-Mario Frühbeis