On 05.01.2012 18:05, Fulvio Esposito wrote:
>
[snip]
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.
It's a bit more complex.
On a typical modern Linux system `char` means UTF-8 and `wchar_t` means UTF-32,
while in Windows `char` definitely means Windows ANSI (which is a
locale-specific encoding, identified by the GetACP API function) and `wchar_t`
definitely means UTF-16 or, in consoles, the UCS-2 subset.
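To see the `wchar_t` difference on a given system, here's a minimal sketch. The function names are made up for illustration, and it assumes the compiler encodes wide literals as UTF-16 or UTF-32, which is the usual case but not guaranteed:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of one wchar_t in bits: typically 16 in Windows, 32 in Linux.
inline int wide_unit_bits()
{
    return static_cast<int>( sizeof( wchar_t ) * CHAR_BIT );
}

// Number of wchar_t units used for one character outside the BMP.
// Assumption: wide literals are UTF-16 or UTF-32 encoded.
inline std::size_t units_for_non_bmp()
{
    std::size_t const n = std::wcslen( L"\U0001F600" );
    assert( n == 1 || n == 2 );     // 1 unit as UTF-32, 2 (a surrogate pair) as UTF-16
    return n;
}
```

With 16-bit `wchar_t` the count should come out as 2, with 32-bit `wchar_t` as 1, so the same source text has different lengths on the two systems.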
The Windows meanings of the built-in types are at odds with C++98, e.g.
the arguments to `main`, and they're at odds with C++11, e.g. `u8`
literals, which produce entirely the wrong type in Windows.
So, with the built-in types as victims of some apparent political war,
and therefore ungood, IMHO the only reasonable thing to do, starting at
the fundamental level, is to define a new basic encoding value type, one
that is type-wise different from `char` and `wchar_t`.
One hurdle is then to make such a new encoding value type work with
std::basic_string, which is desirable.
If you define the type as e.g.
struct EncodingValue { char value; };
then, while in practice the size of that beast will be very suitable, you
are stuck either way. As soon as you define a constructor you will likely
run into problems with std::basic_string implementations, since for the
short string optimization the implementation may put values of that type
in a union, and a C++98 union can't have members with non-trivial
constructors. And if you don't define a constructor then you can't support
existing code that does things like char_type( intValue ), for a generic
char_type.
C++11 provides a way out, namely an enum with a fixed underlying type,
enum EncodingValue: char {};
Then you can both support existing char_type( intValue ) constructs, and
std::basic_string implementations with dirty unions inside.
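A rough sketch of those two properties, not a full solution. `EncodingValueStruct`, `from_int`, `SmallBuffer` and `round_trip` are names invented here for illustration, and the union only imitates what a short string optimization might do internally. Actually instantiating std::basic_string with the type would in addition need a suitable char traits class, which is not shown:

```cpp
#include <cassert>
#include <type_traits>

// The enum from the text: distinct from char, same size, no constructors.
enum EncodingValue : char {};

// For contrast: a struct with a user-provided constructor (hypothetical).
struct EncodingValueStruct
{
    char value;
    EncodingValueStruct( int v ): value( static_cast<char>( v ) ) {}
};

// The kind of generic code that the text wants to keep working.
template< class char_type >
char_type from_int( int const intValue )
{
    return char_type( intValue );
}

// Imitation of a short-string buffer: the enum is fine as a union member.
union SmallBuffer
{
    EncodingValue   units[16];
    EncodingValue*  pointer;
};

static_assert( sizeof( EncodingValue ) == sizeof( char ), "same size as char" );
static_assert( std::is_trivially_default_constructible<EncodingValue>::value,
    "the enum is trivially constructible" );
static_assert( !std::is_trivially_default_constructible<EncodingValueStruct>::value,
    "the struct with a constructor is not" );

// Stores a value via the generic construct and reads it back from the union.
inline char round_trip( int const intValue )
{
    SmallBuffer buffer = {};
    buffer.units[0] = from_int<EncodingValue>( intValue );
    assert( static_cast<int>( buffer.units[0] ) == intValue );
    return static_cast<char>( buffer.units[0] );
}
```

The same from_int works for both types, but only the enum can sit in the union without the union needing constructors of its own.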
However, given this beast that Possibly Can Do The Job(TM), the question
now becomes what the job really is. Personally on the most basic level I
want such a type that is defined in a system-specific manner, like
`char` in Linux and like `wchar_t` in Windows. But an alternative is
such a type that is defined like `Uint32` everywhere.
Both have trade-offs.
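To make the two alternatives concrete, here's a sketch under some assumptions: `_WIN32` is used to detect Windows, everything else is treated like Linux, and all the names (`NativeUnit`, `SystemEncodingValue`, `FixedEncodingValue` and the two functions) are made up for this example, with std::uint32_t standing in for `Uint32`:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Alternative 1: a system-specific unit, wchar_t in Windows, char elsewhere.
#ifdef _WIN32
    typedef wchar_t     NativeUnit;     // UTF-16 code unit in the Windows API
#else
    typedef char        NativeUnit;     // usually a UTF-8 code unit in Linux
#endif
enum SystemEncodingValue : NativeUnit {};

// Alternative 2: the same 32-bit unit everywhere.
enum FixedEncodingValue : std::uint32_t {};

static_assert( sizeof( SystemEncodingValue ) == sizeof( NativeUnit ),
    "system-specific type matches the native unit" );
static_assert( sizeof( FixedEncodingValue ) == sizeof( std::uint32_t ),
    "fixed type is 32 bits wide" );

// With the fixed type one value can hold any Unicode code point.
inline FixedEncodingValue from_code_point( std::uint32_t const codePoint )
{
    assert( codePoint <= 0x10FFFF );    // upper limit of the Unicode code space
    return FixedEncodingValue( codePoint );
}

inline std::uint32_t to_code_point( FixedEncodingValue const v )
{
    return static_cast<std::uint32_t>( v );
}
```

The first alternative can be handed to the system's API functions with at most a pointer cast, but one character may span several values. The second needs a conversion at every API boundary, but has one value per code point.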
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?
Oh, I think definitely some type based on a custom encoding value type
as discussed above. But then? The question of what the Job is, is
difficult, and has perhaps many possible answers...
Cheers & hth.,
- Alf