On Sun, 22 Feb 2015 14:15:13 +0000, JiiPee wrote:
> So what encoding you guys use? UTF-8 or UTF-16? What is the
> recommendation and your experiences.
Use UTF-8 for anything written to a file or sent over a byte-oriented
communication channel (e.g. most network protocols).
Internally, use whatever works best for what you're going to do with the
data.
E.g. on Windows, you typically need to use wchar_t* (ignore the question
of whether it's UTF-16 or UCS-2; it isn't either of those, it's JUST an
array of wchar_t). Windows filenames are arrays of wchar_t; if you use the
char* APIs (e.g. CreateFileA() or fopen()), the program will only be able
to access files whose name is representable in the active "codepage"
(essentially Microsoft-speak for "encoding").
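For a concrete, Windows-only sketch of the difference (the filename here is
just an arbitrary example):

#include <windows.h>

int main() {
    // A name that a typical Western codepage cannot represent; CreateFileA()
    // or fopen() would have no way to pass it through.
    const wchar_t *name = L"\u043E\u0442\u0447\u0451\u0442.txt"; // "отчёт.txt"

    HANDLE h = CreateFileW(name, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    CloseHandle(h);
    return 0;
}
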
> I read on the web and there was argument whether UTF-8 or UTF-16 was
> better and they both had strong arguments. But seems like here people
> prefer UTF-8? And can you please shortly tell how to practically use
> UTF-8? Like how to get its length, find a certain character, how to
> store it (well, i guess just a char [] array does the job, or even
> std::string).
>
> Does UTF-8 work with all normal string functions like find, replace etc.
> If not, how do you deal with these and what needs to be done so they can
> be used. Say I use Russian letters, how I practically find a certain
> letter and use all the SDT functions/classes.
If you need to do almost anything which "interprets" text, you need a
library such as ICU (site.icu-project.org).
The built-in methods of a std::string work with bytes, not "characters"
(in any sense of the word). Similarly, the built-in methods of a
std::wstring work with wchar_t values, not characters (the fact that a wchar_t
is closer to being equivalent to a character just makes the bugs less
obvious than if you'd used char instead).
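To make the byte-vs-character point concrete, a small sketch (the Cyrillic
string is just an example):

#include <iostream>
#include <string>

int main() {
    // "Язык" (4 Cyrillic letters) encoded as UTF-8, 2 bytes per letter.
    std::string s = "\xD0\xAF\xD0\xB7\xD1\x8B\xD0\xBA";

    std::cout << s.size() << "\n";           // prints 8 (bytes), not 4
    // find() also works on bytes: it can locate an exact byte sequence,
    // but it knows nothing about codepoints, let alone characters.
    std::cout << s.find("\xD1\x8B") << "\n"; // prints 4, the byte offset of "ы"
    return 0;
}
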
A 16-bit wchar_t means either that it takes 1 or 2 values to represent a
single Unicode codepoint (UTF-16) or that you are limited to the basic
multilingual plane (UCS-2).
A 32-bit wchar_t gives you the full Unicode range with a 1:1
correspondence between values and codepoints, at the expense of using more
memory (potentially four times what you need if you mostly deal with
Latin-based languages).
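To see the difference, using the fixed-width char16_t/char32_t types (so the
sizes don't depend on the platform's wchar_t):

#include <iostream>
#include <string>

int main() {
    // U+1F600, a codepoint outside the basic multilingual plane.
    std::u16string utf16 = u"\U0001F600"; // stored as a surrogate pair
    std::u32string utf32 = U"\U0001F600"; // stored as a single value

    std::cout << utf16.size() << "\n"; // 2 code units (0xD83D, 0xDE00)
    std::cout << utf32.size() << "\n"; // 1 codepoint
    return 0;
}
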
But even that isn't really a "character" because of the existence of
combining characters, e.g. a lower-case letter a with an acute accent
could be represented as either the precomposed character U+00E1 = LATIN
SMALL LETTER A WITH ACUTE, or the sequence U+0061 = LATIN SMALL LETTER A
followed by U+0301 = COMBINING ACUTE ACCENT.
For comparisons, these forms ought to be equivalent, meaning that
strings need to be normalised (and there are 4 standard normal forms).
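With ICU, a rough sketch of such a comparison (Normalizer2 is ICU's
normalisation interface; error handling kept minimal):

#include <iostream>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

int main() {
    // The same "a with acute" spelled two ways.
    icu::UnicodeString precomposed((UChar32)0x00E1);             // U+00E1
    icu::UnicodeString decomposed;
    decomposed.append((UChar32)0x0061).append((UChar32)0x0301);  // a + combining acute

    // A plain comparison sees two different codepoint sequences.
    std::cout << (precomposed == decomposed ? "same" : "different") << "\n"; // different

    // Normalise both to the same form (NFC here) before comparing.
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;
    bool equal = (nfc->normalize(precomposed, status) ==
                  nfc->normalize(decomposed, status));
    std::cout << (equal ? "same" : "different") << "\n";         // same
    return 0;
}
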
But software which deals with wide strings often treats them simply as
arrays of codepoints. This includes core Windows APIs, which will happily
allow you to create two files in the same directory with the same
(apparent) name but which differ in whether accented characters are
pre-composed.
Note that there is no normal form which guarantees that each "character"
is pre-composed, as not all characters have pre-composed forms (e.g.
Hangul (Korean) only has pre-composed forms for "modern" Korean, which
is insufficient for a number of uses).
You also have to deal with issues such as capitalisation being more
complex in languages other than English. E.g. the upper-case equivalent of
a German "sharp s" (looks a bit like lower-case "beta") is "SS" (two
characters). Turkish has dotted and un-dotted "I" characters, each with
lower-case and upper-case versions; lower-case dotted-I (i) and upper-case
un-dotted-I (I) are the same characters as the ordinary Latin i and I, but
case conversion using the default (English) rules will give the wrong result.
Ligatures often have lower-case, upper-case and title-case variants; to
convert a string to title case, the first character of each word must be
converted to the title-case variant, not the upper-case one.
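ICU's locale-aware case mapping handles these rules; roughly like this
(assuming ICU again; the upperUtf8 helper is just mine for illustration):

#include <iostream>
#include <string>
#include <unicode/locid.h>
#include <unicode/unistr.h>

// Upper-case UTF-8 text according to a given locale's rules.
static std::string upperUtf8(const std::string &utf8, const icu::Locale &loc) {
    std::string out;
    icu::UnicodeString::fromUTF8(utf8).toUpper(loc).toUTF8String(out);
    return out;
}

int main() {
    // German sharp s: one letter becomes two ("straße" -> "STRASSE").
    std::cout << upperUtf8("stra\xC3\x9F" "e", icu::Locale::getGerman()) << "\n";
    // Turkish: "i" upper-cases to a dotted capital I (U+0130), not to "I".
    std::cout << upperUtf8("istanbul", icu::Locale("tr", "TR")) << "\n";
    // The same input under English rules gives plain "ISTANBUL".
    std::cout << upperUtf8("istanbul", icu::Locale::getEnglish()) << "\n";
    return 0;
}
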