JiiPee <n...@notvalid.com> wrote:
> I am trying to learn how to use Unicode strings.. it's not so
> easy really. And it's difficult to find guidelines on how to do
> it. So still searching (some say use UTF-8, some UTF-16, but
> using UTF-8 in code would make life difficult as many functions
> like length() would not work). Everybody says that we should
> use Unicode in our code. If so, then why do almost all
> tutorials and C++ books use char as the character type
> (1 byte)? Why use examples which are not used in the real
> world? This I do not understand.
As some others have pointed out, it adds an additional burden
because in a tutorial you now also have to explain how UTF-8
works, which distracts from the actual topic.
The std::string method length() doesn't stop working when you
use UTF-8 - of course, that depends on what you call "work". It
still tells you how many bytes the string occupies. What it
doesn't do anymore is tell you how many "letters" (or "glyphs"
or whatever name you prefer - I try to avoid the word character
since it's too easily confused with the concept of a 'char')
are contained in the string. As you know, in UTF-8 a "letter"
can be represented by anything from 1 to 4 bytes. Thus you
can't equate "number of bytes" with "number of letters"
anymore, as has been traditionally done with ASCII. So you need
a new function for counting those "letters". Fortunately,
writing that isn't too hard: by inspecting the upper bits of
the first byte you can easily determine how many bytes that
"letter" occupies, which makes iterating over a string to count
the number of letters relatively simple.
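For illustration, such a counting function could look something
like this (just a rough sketch: the name utf8_length is made up
for the example, and it assumes the input is already valid
UTF-8):

    #include <cstddef>
    #include <string>

    std::size_t utf8_length(const std::string &s)
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i < s.size(); ) {
            unsigned char b = static_cast<unsigned char>(s[i]);
            if      (b < 0x80)           i += 1; // 0xxxxxxx: 1 byte (ASCII)
            else if ((b & 0xE0) == 0xC0) i += 2; // 110xxxxx: 2 bytes
            else if ((b & 0xF0) == 0xE0) i += 3; // 1110xxxx: 3 bytes
            else                         i += 4; // 11110xxx: 4 bytes
            ++count;
        }
        return count;
    }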
There are a few pitfalls, though: not all 1 to 4 byte long byte
sequences are valid UTF-8, so, if you deal with external input,
you must check for that possibility and design some strategy
for dealing with such cases.
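A structural check can be done with the same bit-fiddling.
Here's a simplified sketch (again with a made-up name) that
only verifies the leading-byte/continuation-byte pattern - a
complete validator would also have to reject overlong
encodings, the surrogate range U+D800..U+DFFF and values above
U+10FFFF:

    #include <cstddef>
    #include <string>

    bool looks_like_valid_utf8(const std::string &s)
    {
        for (std::size_t i = 0; i < s.size(); ) {
            unsigned char b = static_cast<unsigned char>(s[i]);
            int extra;                         // continuation bytes expected
            if      (b < 0x80)           extra = 0;
            else if ((b & 0xE0) == 0xC0) extra = 1;
            else if ((b & 0xF0) == 0xE0) extra = 2;
            else if ((b & 0xF8) == 0xF0) extra = 3;
            else return false;   // 10xxxxxx or 11111xxx can't start a letter
            if (i + extra >= s.size())
                return false;                  // sequence is cut off
            for (int k = 1; k <= extra; ++k)   // each must be 10xxxxxx
                if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                    return false;
            i += extra + 1;
        }
        return true;
    }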
An important aspect is dealing with the environment: if, for
example, the user's keyboard is set up to send LATIN1-encoded
characters but you're expecting UTF-8 input, that will end in
grief. Or when the output medium is set up to use a different
encoding than what your program emits, the output will look
rather strange. So you will have to spend some time giving more
attention to locale settings etc., which in a pure-ASCII world
are usually taken to be arcane stuff.
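As a small sketch of that: a program can at least adopt the
user's locale and warn if it doesn't look like UTF-8. The check
below is a crude heuristic of my own, not a reliable test:

    #include <clocale>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        // Switch from the default "C" locale to the user's settings.
        const char *loc = std::setlocale(LC_ALL, "");
        // Crude heuristic: many UTF-8 locale names contain "UTF-8"
        // or "utf8" (e.g. "en_US.UTF-8"); this is no guarantee.
        if (!loc || (!std::strstr(loc, "UTF-8")
                     && !std::strstr(loc, "utf8")))
            std::fprintf(stderr,
                         "warning: locale doesn't look like UTF-8\n");
    }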
Another aspect is that, if you're serious about it, you have
to start thinking about questions like: how do I enter Chinese
or Japanese or Greek etc. characters using a standard
US-English keyboard (or what tools does my system supply for
that purpose)?
Dealing with UTF-8 in a program actually is relatively trivial:
you have to distinguish between byte count and letter count,
you should check if the input is "legal" UTF-8, and you may
have to write some UTF-8 aware iterator for looping over a
string (so it gives you the next letter, not the next 'char')
etc. And, if this is for an already existing application,
you'll have to check wherever strings are used whether what you
want is the "length" in bytes or in "letters".
I've recently done the switch from pure ASCII to UTF-8 for a
legacy library from about 20 years ago. I dragged my feet for a
long time before doing that since I always thought the
"char count equals letter count" assumption would be so
deep-rooted in a piece of software that old that it would be
nearly impossible to fix. But when I finally made the attempt
I was positively surprised that it was a lot easier than I'd
ever imagined - in most places dealing with strings it was
immediately clear whether this was about the letters or the
bytes in a string, and, with a few functions for dealing with
UTF-8, it took me a very short time.
From that experience I tend to conclude that most of the
"angst" about UTF-8 stems more from unfamiliarity than anything
else. The actual problems are often more in the environment the
user is working in - if the keyboard is set up to send LATIN1
or CJK or whatever other legacy encoding, that's where the real
problems are. So, it's a new world definitely, and one has to
learn a few new things and become aware of new potential
problems (and existing solutions ;-).
I can only recommend doing a few experiments with some "toy"
programs. The concept behind UTF-8, while ingenious, is IMHO
surprisingly simple, so I found it more helpful to write a few
functions for counting "letters" in a string or detecting
invalid byte sequences than to try to understand some rather
complex libraries that do all the work for you. Not that I'd
consider those libraries useless, but to understand what
they're doing for you it's good to have spent a bit of time
trying to solve the simpler problems and get a feel for what's
involved - otherwise the documentation can often be hard to
understand ;-)
Best regards, Jens