
unicode and string


Fulvio Esposito

Jan 5, 2012, 12:05:14 PM
Hi all,
I was recently studying Unicode and internationalization, and
some questions came to mind about C++ strings.

Correct me if I'm wrong. std::string and std::wstring member
functions simply don't work for UTF-8 or UTF-16 encoded
strings because they wrongly assume "code_point == code_unit"
(for example, length() returns the number of code units in
the sequence and not the number of code points in the Unicode
string, operator[] cannot return the code point if it's
represented by two or more code units, etc.).
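
For illustration, a minimal sketch of what I mean (the literal below
is just the two-byte UTF-8 encoding of U+00E9, "é"):

#include <string>
#include <cassert>

int main()
{
    std::string s = "\xC3\xA9";  // one code point, two UTF-8 code units
    assert(s.length() == 2);     // length() counts code units (bytes), not code points
    char c = s[0];               // operator[] yields a single byte (0xC3), not the code point
    (void)c;
}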

So in the end, what's the best strategy to handle Unicode
strings in C++? Many suggest using UTF-8 and std::string,
many others UTF-16 and std::wstring (but on Linux wchar_t
is often 32 bits wide :S); ICU uses UTF-16 by default but
has its own UnicodeString.

As a use case, imagine a GUI Toolkit, what should be the
type of the Text property for a TextBox?

Fulvio Esposito


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

cpisz

Jan 5, 2012, 4:17:02 PM
On Jan 5, 11:05 am, Fulvio Esposito <esposito.ful...@gmail.com> wrote:
> Hi all,
> I was recently studying unicode and internationalization and
> some questions come to my mind about C++ string.
>
> Correct me if I'm wrong. std::string and std::wstring member
> functions simply don't work for UTF-8 or UTF-16 encoded
> strings 'cause they wrongly assume "code_point==code_unit"
> (for example length() returns the length of the sequence
> and not the size of the unicode string, operator[] could
> not return the code point if it's represented by two or
> more code units, etc.).
>
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.
>
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?

I'm going through the same kind of thing. In a Windows environment all
the literals are 'UTF-16', but more accurately it is the subset of
UTF-16 that fits into 2 bytes. I have services that talk across the
network, some talking UTF-8, some talking UTF-16BE, some talking
UTF-16LE, some talking ANSI C strings. It is a royal pain in my arse,
especially when the data goes bad somewhere and I have to figure out
whether the bytes obtained are indeed valid in the expected encoding.

The best advice I can give for now is to treat the data as bytes to
and from files, across sockets, etc. Use standard streams otherwise,
and keep in mind that you can only represent those characters that
fit into 2 bytes.

C++11 is supposed to change this with better locales and facets, I
believe.

Martin B.

Jan 5, 2012, 4:17:10 PM
On 05.01.2012 18:05, Fulvio Esposito wrote:
> Hi all,
> I was recently studying unicode and internationalization and
> some questions come to my mind about C++ string.
>
> (...)
> So in the end, what's the best strategy to handle unicode
> strings in C++? ...
>
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?
>

For this use case, I'd use *the string class that the GUI toolkit
provides*. If it doesn't provide one, it should document what kind of
"strings" it expects and you're left to choose.

cheers,
Martin

Fulvio Esposito

Jan 6, 2012, 12:28:06 AM
> For this use case, I'd use *the string class that
> the GUI toolkit provides*. If it doesn't provide one,
> it should document what kind of "strings" it expects
> and you're left to choose.
>

I was thinking about writing a GUI Toolkit, and how to
tackle localization problems. If I write the toolkit
and want to use only the standard library, how can I
best deal with Unicode strings? What's a reasonable
way to handle it?

Regards,
Fulvio

Goran

Jan 6, 2012, 3:54:53 PM
On Jan 5, 6:05 pm, Fulvio Esposito <esposito.ful...@gmail.com> wrote:
> Hi all,
> I was recently studying unicode and internationalization and
> some questions come to my mind about C++ string.
>
> Correct me if I'm wrong. std::string and std::wstring member
> functions simply don't work for UTF-8 or UTF-16 encoded
> strings 'cause they wrongly assume "code_point==code_unit"
> (for example length() returns the length of the sequence
> and not the size of the unicode string, operator[] could
> not return the code point if it's represented by two or
> more code units, etc.).
>
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.
>
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?

I think you need to put aside the idea that a Unicode string
needs to have one Unicode code point per element, because this isn't
the case on either Unix variants or Windows. It isn't the case
because neither uses the Unicode encoding that would allow it
(UTF-32). If you need per-code-point processing, go from whatever
representation you get from the system to e.g. ICU.

Finally, cpisz is wrong: Windows does know UTF-16 (what he describes
is UCS-2, which was abandoned in Windows about a decade ago).

As for "best strategy", I would go for "play with the system", that
is, use system's native encoding as much as possible, and, for cross-
system storage or transport, use encoding that you "main" system knows
best, and convert when on another system. So for example, if your main
system is Linux, go for UTF-8 for storage and transport. But never,
ever, forget to convert that to UTF-16 when using your stuff to
interact with Windows. That would typically mean using
MultiByteToWideChar with CP_UTF8 to get your wstring.
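
For concreteness, a minimal sketch of that conversion (Windows only;
error handling omitted, and the helper name is just made up here):

#include <windows.h>
#include <string>

std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call asks for the required length, second call converts.
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8.data(), (int)utf8.size(), 0, 0);
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &wide[0], n);
    return wide;
}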

Goran.

Alf P. Steinbach

Jan 6, 2012, 3:57:27 PM
On 05.01.2012 18:05, Fulvio Esposito wrote:
>
[snip]
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.

It's a bit more complex.

On a typical modern Linux `char` means UTF-8 and `wchar_t` means
UTF-32, ...

while in Windows `char` definitely means Windows ANSI (which is a
locale-specific encoding, defined by the GetACP API function) and
`wchar_t` definitely means UTF-16 or, in consoles, the UCS-2 subset.

The Windows meanings of the built-in types are at odds with C++98, e.g.
the arguments to `main`, and they're at odds with C++11, e.g. `u8`
literals, which produce entirely the wrong type in Windows.

So, with the built-in types as victims of some apparent political war,
and therefore ungood, IMHO the only reasonable thing to do, starting at
the fundamental level, is to define a new basic encoding value type, one
that is type-wise different from `char` and `wchar_t`.

One hurdle is then to make such a new encoding value type work with
std::basic_string, which is desirable.

If you define the type as e.g.

struct EncodingValue { char value; };

then, while in practice the size of that beast will be very suitable, as
soon as you define a constructor you will likely run into problems with
std::basic_string implementations, since for the short string
optimization the implementation may put such in a union, and if you
don't define a constructor then you can't support existing code that
does things like char_type( intValue ), for a generic char_type.

C++11 provides a way out, namely the based enum,

enum EncodingValue: char {};

Then you can both support existing char_type( intValue ) constructs, and
std::basic_string implementations with dirty unions inside.
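
A minimal sketch, just to make the shape of the idea concrete (the
names here are placeholders, not a worked-out design):

enum EncodingValue : char {};   // C++11 "based" enum: a distinct type with the size of char

static_assert(sizeof(EncodingValue) == sizeof(char),
              "fits wherever a char fits, e.g. inside a short-string-optimization union");

int main()
{
    EncodingValue v = EncodingValue(0x41);   // char_type( intValue ) style construction still works
    char raw = static_cast<char>(v);         // and it converts back to the underlying char
    (void)raw;
}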

However, given this beast that Possibly Can Do The Job(TM), the question
now becomes what the job really is. Personally on the most basic level I
want such a type that is defined in a system-specific manner, like
`char` in Linux and like `wchar_t` in Windows. But an alternative is
such a type that is defined like `Uint32` everywhere.

Both have trade-offs.


> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?

Oh, I think definitely some type based on a custom encoding value type
as discussed above. But then? The question of what the Job is, is
difficult, and has perhaps many possible answers...

Cheers & hth.,

- Alf

Jean-Marc Bourguet

Jan 6, 2012, 4:04:34 PM
Fulvio Esposito <esposit...@gmail.com> writes:

> Correct me if I'm wrong. std::string and std::wstring member functions
> simply don't work for UTF-8 or UTF-16 encoded strings 'cause they
> wrongly assume "code_point==code_unit" (for example length() returns
> the length of the sequence and not the size of the unicode string,
> operator[] could not return the code point if it's represented by two
> or more code units, etc.).

I think you are wrong for string.

The character handling model in C and C++ is:

* char is used for what Unicode TR17 calls a compound CES whose CEFs can
be variable width (the C and C++ standardese terms indicating this are
"multibyte characters" and "shift states");

* wchar_t is indeed used for a simple CES associated with a CEF of
fixed width one (one code unit per character).

Those two CES are locale dependent.

So with an adequate locale, having UTF-8 data in std::string seems the
correct thing.

C11 and C++11 add UTF-8 string literals (u8" " with type char), UTF-16
string and character literals (u" ", u' ' char16_t) and UTF-32 string
and character literals (U" ", U' ', char32_t). There are some more
things available in the library, but the support is pretty minimal.
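
A quick sketch of those literal forms (assuming a conforming C++11
compiler):

// u8"", u"" and U"" literals and their element types in C++11.
const char*     s8  = u8"caf\u00E9";   // UTF-8 encoded, element type is plain char
const char16_t* s16 = u"caf\u00E9";    // UTF-16 encoded, element type char16_t
const char32_t* s32 = U"caf\u00E9";    // UTF-32 encoded, element type char32_t

int main() { (void)s8; (void)s16; (void)s32; }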

> So in the end, what's the best strategy to handle unicode strings in
> C++? Many suggest to use UTF-8 and std::string,

This corresponds to the intended use with a Unicode locale.

> many others UTF-16 and std::wstring (but on linux wchar_t are often
> 32-bit wide :S),

A locale could meaningfully use UCS-2 with a 16-bit wchar_t, or UTF-32
with a wchar_t of more than 21 bits.

> As a use case, imagine a GUI Toolkit, what should be the type of the
> Text property for a TextBox?

For the interface, I'd tend to use the locale mechanism appropriately,
i.e. accepting and returning std::string and std::wstring in the narrow
and wide encoding of the global locale (or a user-defined one, but I
don't think it's worth the pain).

Internally I'd convert to a Unicode representation (with a fast path
without conversion for Unicode locales, which are in common use
nowadays). To do the conversion to Unicode, you'll have to rely on
implementation dependence, as there is no C++ way to get access to the
CES used by a given locale (for instance, POSIX has nl_langinfo, which
gives that information for the C locale mechanism) nor to convert it
to Unicode (C11 and C++11 have mbrtoc16 and mbrtoc32, but there is no
interface with the C++ locale mechanism and I've no idea how widely
available they are -- they come from an earlier TR on the subject,
which could have helped their availability --; POSIX has iconv for the
conversion).
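
For illustration only, a rough sketch of the POSIX/iconv route
(assuming a glibc-style iconv, a little-endian host, and that
setlocale(LC_ALL, "") has already been called; error handling
omitted):

#include <iconv.h>
#include <langinfo.h>
#include <string>
#include <vector>

std::u32string narrow_to_utf32(const std::string& in)
{
    // Convert from the current locale's narrow encoding to UTF-32.
    iconv_t cd = iconv_open("UTF-32LE", nl_langinfo(CODESET));
    std::vector<char32_t> out(in.size() + 1);
    char* src = const_cast<char*>(in.data());
    char* dst = reinterpret_cast<char*>(out.data());
    size_t src_left = in.size();
    size_t dst_left = out.size() * sizeof(char32_t);
    iconv(cd, &src, &src_left, &dst, &dst_left);   // error handling omitted
    iconv_close(cd);
    size_t bytes_written = out.size() * sizeof(char32_t) - dst_left;
    return std::u32string(out.data(), bytes_written / sizeof(char32_t));
}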

Yours,

--
Jean-Marc

Thiago Adams

Jan 6, 2012, 4:10:10 PM
> I was thinking about writing a GUI Toolkit, and how to
> tackle localization problems. If I write the toolkit
> and I wanna use only standard library, how can I deal
> best with Unicode strings? What's a resonably way to
> handle it?

Use wstring; it will work.

You may have problems if you have to save or transmit strings; in that
case you have to choose an encoding.

The sample below shows how to save and load a std::wstring using UTF-8
encoding.


#include <fstream>
#include <codecvt>
#include <string>
using namespace std;

int main()
{
    // writing
    {
        std::locale ulocale(locale(), new codecvt_utf8<wchar_t>);
        std::wofstream ofs("test.txt");
        ofs.imbue(ulocale);
        ofs << L"maçã"; // apple in Portuguese
    }

    // reading
    {
        std::locale ulocale(locale(), new codecvt_utf8<wchar_t>);
        std::wifstream ifs("test.txt");
        ifs.imbue(ulocale);
        std::wstring ws;
        std::getline(ifs, ws);
    }
}


---
http://www.thradams.com/

cpp4ever

Jan 6, 2012, 8:25:04 PM
On 05/01/12 17:05, Fulvio Esposito wrote:
> Hi all,
> I was recently studying unicode and internationalization and
> some questions come to my mind about C++ string.
>
> Correct me if I'm wrong. std::string and std::wstring member
> functions simply don't work for UTF-8 or UTF-16 encoded
> strings 'cause they wrongly assume "code_point==code_unit"
> (for example length() returns the length of the sequence
> and not the size of the unicode string, operator[] could
> not return the code point if it's represented by two or
> more code units, etc.).
>
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.
>
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?
>
> Fulvio Esposito
>
>

Hmmmm, not something I'd like to try to do, but then I use the Qt GUI
toolkit, which already provides Unicode string handling via QString.
Rest assured your concerns are well justified; I have experienced
problems with Japanese characters when Unicode was not maintained for
all strings.

regards

cpp4ever

Martin B.

Jan 12, 2012, 2:50:18 PM
On 06.01.2012 21:57, Alf P. Steinbach wrote:
> The Windows meanings of the built-in types are at odds with C++98, e.g.
> the arguments to `main`, and they're at odds with C++11, e.g. `u8`
> literals, which produce entirely the wrong type in Windows.

I'm confused. Which VS version has u8 string literals implemented?
And in what way is it broken?

cheers,
Martin

Alf P. Steinbach

Jan 12, 2012, 6:06:00 PM
On 12.01.2012 20:50, Martin B. wrote:
> On 06.01.2012 21:57, Alf P. Steinbach wrote:
>> The Windows meanings of the built-in types are at odds with C++98, e.g.
>> the arguments to `main`, and they're at odds with C++11, e.g. `u8`
>> literals, which produce entirely the wrong type in Windows.
>
> I'm confused. Which VS version has u8 string literals implemented?
> And in what way is it broken?

AFAIK no current version of Visual C++ implements the new C++11 string
literal prefixes (up to and including the technical preview of MSVC
11), although the new C++11 types are there.

MinGW g++ 4.6.1 for Windows does, however, implement the u8 prefix.

C++11 §2.14.5/7 defines the type of a u8 string literal as an array of
`char`. That works nicely for the *nix world, where `char` now by
default means UTF-8 encoding. In Windows, however, `char` means
Windows ANSI encoding (i.e., that's the execution character set for
MSVC).

That means that the C++ type checking does not prevent you from ending
up with gobbledygook, treating a UTF-8 encoded string as Windows ANSI.
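
A tiny illustration of that gap (hypothetical, assuming a compiler
that implements u8 literals, such as MinGW g++ 4.6.1): both pointers
below have exactly the same C++ type, so the compiler cannot tell the
encodings apart.

const char* ansi = "bl\u00E5b\u00E6r";    // narrow literal: execution character set bytes
const char* utf8 = u8"bl\u00E5b\u00E6r";  // u8 literal: UTF-8 bytes, yet the same type
// Any function taking const char* accepts either; the encoding mismatch is invisible.

int main() { (void)ansi; (void)utf8; }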

One might say that the `char` type was inadvertently overloaded with
too many meanings (default single-byte character set encoding value
type, byte, UTF-8 value type), but given that the problems with
overloaded `char` meanings are well known, the addition of an extra
meaning that will only surface as problematic in Windows smells a bit
of politics to me -- and if so, it probably means: difficult to fix...

Cheers,

- Alf

Jean-Marc Bourguet

Jan 13, 2012, 9:25:06 PM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> C++11 §2.14.5/7 defines the type of an u8 string literal as an array of
> `char`. That works nicely for the *nix world, where `char` now by default
> means UTF-8 encoding. In Windows, however, `char` means Windows ANSI
> encoding (e.g., that's the execution character set for MSVC).

char in C++ means encoded in a multibyte, possibly stateful encoding
which depends on the locale. In the "C" locale it is often just ASCII
(7 bits).

> That means that the C++ type checking does not prevent you ending up with
> gobbledygook, treating an UTF-8 encoded string as Windows ANSI.

What is ANSI? Code page 1250? 1251? 1252? 1253? Something else (Shift
JIS, for instance)? One of those, dependent on the Windows version and
configuration? I fear it is the latter.

> One might say that the `char` type was inadvertently too much overloaded
> with meanings (default single-byte character set encoding value type,
> byte, UTF-8 value type), but given that the problems with overloaded
> `char` meanings are well known the addition of an extra meaning that will
> only surface as problematic in Windows, smells a bit of politics to me --
> and if so it probably means: difficult to fix...

In the standard model (from C89 and Amd1 in 94/95) there was never an
intention to make the encoding part of the type. Just two encodings
per locale: a stateful multibyte one in char, a one-unit-per-character
one in wchar_t. And the precise encoding choice has been dependent on
the locale for as long as there has been encoding support in C and C++
(you can set things up so that wchar_t in some locales is related to
one of the EUC encodings -- those are variable length, but IIRC just
prepending 0 bytes to make them fixed width will work -- and in others
to UTF-32, for instance). The encoding used for literals has always
been implementation dependent.

IMHO, the problem isn't choosing between alternate models better suited to
the current days, it is finding one which provides a path of transition
from the current one without impacting those who depend on it.

Yours,

--
Jean-Marc

Alf P. Steinbach

Jan 14, 2012, 4:47:13 PM
On 14.01.2012 03:25, Jean-Marc Bourguet wrote:
> "Alf P. Steinbach"<alf.p.stein...@gmail.com> writes:
>
>> C++11 §2.14.5/7 defines the type of an u8 string literal as an array of
>> `char`. That works nicely for the *nix world, where `char` now by default
>> means UTF-8 encoding. In Windows, however, `char` means Windows ANSI
>> encoding (e.g., that's the execution character set for MSVC).
>
> char in C++ means encoded in a multibyte statefull encoding which depend on
> the locale. In "C" locale it is often just ASCII (7 bits).

Yes, you can say that the problem resides with the standard not
reflecting and catering to actual practice.


>> That means that the C++ type checking does not prevent you ending up with
>> gobbledygook, treating an UTF-8 encoded string as Windows ANSI.
>
> What is ANSI? Code page 1250? 1251? 1252? 1253? Something else (Shift JIS
> for instance)? One of those dependent on the Windows version and
> configuration? I fear it is the later one.

Right you are: it's a locale dependent encoding.

As a result, with Visual C++ having that encoding as its C++ narrow
character execution character set, the executable that you get when
you build with the Visual C++ compiler depends on the configured
locale, i.e. the same source code produces different binary
executables.

Oh, that was just trivia, but it serves to illustrate that this is
truly a mess, not just in the C++ standard.


>> One might say that the `char` type was inadvertently too much overloaded
>> with meanings (default single-byte character set encoding value type,
>> byte, UTF-8 value type), but given that the problems with overloaded
>> `char` meanings are well known the addition of an extra meaning that will
>> only surface as problematic in Windows, smells a bit of politics to me --
>> and if so it probably means: difficult to fix...
>
> In the standard model (from C89 and Amd1 in 94/95) there was never an
> intention to have the encoding part of the type. Just two encodings per
> locale, a statefull multibyte one in char, a one unit per character one in
> wchar_t. And the precise encoding choice has been dependent on the locale
> for as long as there was encoding support in C and C++ (you can setup
> things so that wchar_t in some locales is related to one of the EUC
> encodings -- those are variable length but IIRC just prepending 0 bytes to
> make them fixed width will work --, in other UTF-32 for instance). The
> encoding used for litterals has always been implementation dependant.

The encoding for `char` literals is implementation dependent, yes,
because the C++ execution character set is implementation dependent.

But you could rely on the encoding of `char` literals being the C++
execution character set.

Now, with C++11, you cannot rely on that.


> IMHO, the problem isn't choosing between alternate models better suited to
> the current days, it is finding one which provides a path of transition
> from the current one without impacting those who depend on it.

That sentence sounds as if finding a better way would be somehow
difficult. Well, that's meaningless and highly misleading: the C++11
standard does employ a better way for the other new prefixes, just not
for "u8", which will probably cause a bit of trouble for Windows
programmers. The sentence above also sounds as if finding a better way
is somehow in conflict with finding something better suited to the
present day, and that too is meaningless and highly misleading.


Cheers & hth.,

- Alf



Martin B.

Jan 14, 2012, 4:57:33 PM
On 13.01.2012 00:06, Alf P. Steinbach wrote:
> On 12.01.2012 20:50, Martin B. wrote:
>> On 06.01.2012 21:57, Alf P. Steinbach wrote:
>>> The Windows meanings of the built-in types are at odds with C++98, e.g.
>>> the arguments to `main`, and they're at odds with C++11, e.g. `u8`
>>> literals, which produce entirely the wrong type in Windows.
>>
>> I'm confused. Which VS version has u8 string literals implemented?
>> And in what way is it broken?
>
> AFAIK no current version of Visual C++ implements the new C++11 string
> literal prefixes (up to and including the technical preview of MSVC 11),
> although the new C++11 types are there.
>
> MinGW g++ 4.6.1 for Windows does, however, implement the u8 prefix.
>
> C++11 §2.14.5/7 defines the type of an u8 string literal as an array of
> `char`. That works nicely for the *nix world, where `char` now by
> default means UTF-8 encoding. In Windows, however, `char` means Windows
> ANSI encoding (e.g., that's the execution character set for MSVC).
>
> That means that the C++ type checking does not prevent you ending up
> with gobbledygook, treating an UTF-8 encoded string as Windows ANSI.
>

Ah yes. Now I remember. I actually started a thread a while back
regarding this:

http://groups.google.com/group/comp.lang.c++.moderated/browse_thread/thread/e0206cac5b8c1417/152a0e2f7a0dd8ed

and

http://groups.google.com/group/comp.std.c++/browse_thread/thread/24c6cc6ae3713c94/1b0e33fbf0f8120c

Personally I feel it was a *very* bad decision not to have a distinct
UTF-8 character (literal) type. (I mean, we have char16_t and char32_t,
why the hell not char8_t and be done with it!)

cheers,
Martin

Miles Bader

Jan 15, 2012, 8:13:25 AM
"Martin B." <0xCDC...@gmx.at> writes:
> Personally I feel it was a *very* bad decision not to have a distinct
> UTF-8 character (literal) type. (I mean, we have char16_t and char32_t,
> why the hell not char8_t and be done with it!)

Presumably the issue is that in sane (non-MS) implementations, utf-8
literals work perfectly well with existing char-based infrastructure,
and an increasingly large number of interfaces simply assume all char*
strings are encoded using utf-8, and they didn't want the giant ball
of hair that would come with a really distinct type.

[Granted, MS-style wide-strings result in a similar giant ball of hair
("hey, can you duplicate all yer interfaces and datatypes, only with
wchar_t? ... hey, now how about with char16_t? ... er, hey, .."), but
if anything that serves as a _warning_...]

Maybe they could have had some sort of automatic promotion
(automagically converting char8_t* to char*) and made it all work
out, I dunno, but given the widespread use of utf-8 in char* strings
and the potential for snowballing complexity, this may be the most
practical route-of-least-effort.

-Miles

--
Come now, if we were really planning to harm you, would we be waiting here,
beside the path, in the very darkest part of the forest?

Jean-Marc Bourguet

Jan 15, 2012, 8:17:00 AM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> On 14.01.2012 03:25, Jean-Marc Bourguet wrote:
>> "Alf P. Steinbach"<alf.p.stein...@gmail.com> writes:
>>
>>> C++11 §2.14.5/7 defines the type of an u8 string literal as an array of
>>> `char`. That works nicely for the *nix world, where `char` now by default
>>> means UTF-8 encoding. In Windows, however, `char` means Windows ANSI
>>> encoding (e.g., that's the execution character set for MSVC).
>>
>> char in C++ means encoded in a multibyte statefull encoding which depend on
>> the locale. In "C" locale it is often just ASCII (7 bits).
>
> Yes, you can say that the problem resides with the standard not
> reflecting and catering for the in-practice.

AFAIK, what was standardized in C89 and Amd1 was the existing practice
at the time. And I'd not be surprised if the practice is continued by
those who started it.

> Oh, that was just trivia, but it serves to illustrate that this is truly
> a mess, not just in the C++ standard.

The mess in character encoding issues started in the 19th century at
the latest.

> The encoding for `char` literals is implementation dependent yes, because
> the C++ execution character set is implementation dependent.
>
> But you could rely on the encoding for `char` literals being the C++
> execution character set.
>
> Now with C++11 you can not rely on that.

"The execution character set" has a content which is locale dependant, in
other words there is one of them per locale. If one want to use the
Unicode terminology (instead of the, IMHO misguiding, standard one): there
is one ACR per locale with two CES:

- a narrow one using char, which is a variable width (multibyte) and
compound (notion of shift state) onem

- a wide one using wchar_t, which is a simple one of fixed width one.

You have some constraints on the encoding used by these two CES for the
characters in the basic execution character set, but they don't say
anything about what isn't there.

>> IMHO, the problem isn't choosing between alternate models better suited to
>> the current days, it is finding one which provides a path of transition
>> from the current one without impacting those who depend on it.
>
> That sentence sounds as if finding a better way would be somehow
> difficult.

I'll try to be clearer. Finding a better model is easy. Finding a
better model for which there is an easy path of transition for those
who use the current model isn't. Deciding to have a hard transition
for those who use the current model isn't either (there was enough
opposition to prevent the removal of the trigraphs, and that was far
more consensual).

Yours,

--
Jean-Marc

Alf P. Steinbach

Jan 15, 2012, 6:02:34 PM
On 15.01.2012 14:17, Jean-Marc Bourguet wrote:
> "Alf P. Steinbach"<alf.p.stein...@gmail.com> writes:
>
[snipped low s/n part]
>
>>> IMHO, the problem isn't choosing between alternate models better suited to
>>> the current days, it is finding one which provides a path of transition
>>> from the current one without impacting those who depend on it.
>>
>> That sentence sounds as if finding a better way would be somehow
>> difficult.
>
> I'll try to be clearer. Finding a better model is easy. Finding a better
> model for which there is an easy path of transition for those who use the
> current model isn't.

I'm sorry but that is wrong and highly misleading.

C++11 does use a better model for the other new prefixes.

Using that model for "u8" would not make any transition harder.

On the contrary, not using that model is, to the extent that lack of
type checking is a problem, a problem for a transition to C++11 on
Windows.

Contrary to your claim, finding a better model was trivial: it was
already there, adopted for the similar prefixes.


> Deciding to have a hard transition for those who use
> the current model isn't either

I can't make heads or tails of that, sorry.


Cheers & hth.,

- Alf



Martin B.

Jan 15, 2012, 6:14:07 PM
On 15.01.2012 14:13, Miles Bader wrote:
> "Martin B."<0xCDC...@gmx.at> writes:
>> Personally I feel it was a *very* bad decision not to have a distinct
>> UTF-8 character (literal) type. (I mean, we have char16_t and char32_t,
>> why the hell not char8_t and be done with it!)
>
> Presumably the issue is that in sane (non-MS) implementations, utf-8
> literals work perfectly well with existing char-based infrastructure,
> and an increasingly large number of interfaces simply assume all char*
> strings are encoded using utf-8,

Examples! Numbers! :-)

>
> [Granted, MS-style wide-strings result in a similar giant ball of hair
> but if anything that serves as a _warning_...]
>
> Maybe they could have ...
> made it all work
> out, I dunno, but given the widespread use of utf-8 in char* strings

Examples! Numbers! :-)

> and the potential for snowballing complexity, this may be the most
> practical route-of-least-effort.
>

I still fail to see the point.

Since you seem to imply that char* == utf-8 is very widespread, let me
throw in two statements:

+ libxml2 -- as far as I can tell a widespread XML library -- uses
*unsigned* char as its utf-8 datatype, and specifically *not* char.

+ I strongly believe that *most* Windows C++ applications that use
narrow char (`char`) do *not* use it as utf-8. Indeed, I strongly
believe that there are a bazillion Windows C++ apps out there for
which the situation `char == utf-8` is completely broken:

++ Most Windows apps that (still) use narrow char to interface with the
narrow Windows API versions would not use char with utf-8.

++ Last I checked, very many programs on Windows that write text files
for some purpose do so in a narrow 8-bit encoding on a western Windows
(some variation of the ISO Latin encoding). I think we can assume
these programs use char for those strings, and it's not utf-8 either.


To sum up, and to phrase it a bit more strongly:

The fact that C++11 introduces char16_t and char32_t but no char8_t is
crappy.

The fact that u8"" literals map to `char` of all things is rather
horrible! I, personally, would be better served if they mapped to
`unsigned char`, but I can imagine that this could create problems
elsewhere.


cheers,
Martin



James K. Lowden

Jan 19, 2012, 5:56:03 PM
On Sun, 15 Jan 2012 05:13:25 -0800 (PST)
Miles Bader <mi...@gnu.org> wrote:

> utf-8
> literals work perfectly well with existing char-based infrastructure,
> and an increasingly large number of interfaces simply assume all char*
> strings are encoded using utf-8,

Well, not "perfectly", right? char* strings and std::string lack
character semantics when used with utf-8. That is,

std::string::operator[]

returns a byte, and no operator returns a character.

> and they didn't want the giant ball
> of hair that would come with a really distinct type.

Why is a distinct type of character a hairball? With a basic type such
as

template <typename C>
class code_point
{
    enum encoding_t { ... } encoding;
    C value;
};

a class std::encoded_string could be derived from basic_string with
code_point as the first template argument. encoded_string then has
character semantics, and every code_point carries enough information to
map it to another encoding.

We talk about environments and strings having an encoding, but really,
by definition, each character is encoded. ISTM representing that
reality in a class is a very basic OO choice, not a hairball at all.

--jkl

