
What is the best encoding (experiences...) for Unicode?


JiiPee

Feb 22, 2015, 9:15:28 AM
We already started talking about this, but I will start a new one as it
is a separate issue.

So what encoding do you guys use? UTF-8 or UTF-16? What is the
recommendation, and what are your experiences?
I read on the web, and there was an argument about whether UTF-8 or
UTF-16 was better, and both sides had strong arguments. But it seems
like people here prefer UTF-8? And can you please briefly tell me how to
use UTF-8 in practice? Like how to get its length, find a certain
character, how to store it (well, I guess just a char [] array does the
job, or even std::string).

Does UTF-8 work with all the normal string functions like find, replace,
etc.? If not, how do you deal with these, and what needs to be done so
they can be used? Say I use Russian letters: how do I practically find a
certain letter and use all the SDT functions/classes?

I am just quite new to this and trying to implement it in my first real
projects. So I'm looking for direction.

People already gave instructions and I read them; I'm just asking if
there is more.

Öö Tiib

Feb 22, 2015, 10:34:22 AM
On Sunday, 22 February 2015 16:15:28 UTC+2, JiiPee wrote:
> We already started talking about this, but I will start a new one as it
> is a separate issue.
>
> So what encoding do you guys use? UTF-8 or UTF-16? What is the
> recommendation, and what are your experiences?
> I read on the web, and there was an argument about whether UTF-8 or
> UTF-16 was better, and both sides had strong arguments. But it seems
> like people here prefer UTF-8? And can you please briefly tell me how to
> use UTF-8 in practice? Like how to get its length, find a certain
> character, how to store it (well, I guess just a char [] array does the
> job, or even std::string).

In the general case UTF-8 is superior since it:
1) is compatible with ASCII (ASCII text is a subset of UTF-8)
2) does not have alignment issues (a UTF-16 code unit may need to be at an even address)
3) does not have endianness issues (UTF-16 may be LE or BE)
4) fits into std::string (whether std::wstring is UTF-16 or UTF-32 or
something else entirely is unspecified)
5) is the encoding of the majority of internet text content

UTF-16 may be more convenient on Windows or with Qt. Still, if a significant
part of the input or output is UTF-8 (I already mentioned the internet), then
I would pick UTF-8 as the internal representation for texts in your application.

> Does UTF-8 work with all the normal string functions like find, replace, etc.?

You just have to accept that 'char' is a byte (not a text character) and
'std::string' is a contiguous container of such bytes (it gives no
guarantee about the specific encoding of any text in it). Once that is
accepted, everything works.
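
For illustration, a minimal sketch of that point (the Russian word is
just hypothetical example data): the byte count of a UTF-8 string is
not its letter count.

#include <iostream>
#include <string>

int main()
{
    // "привет": 6 Cyrillic letters, each a 2-byte UTF-8 sequence
    std::string s = "\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82";
    std::cout << s.size() << '\n'; // prints 12 (bytes), not 6 (letters)
}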

> If not, how do you deal with these, and what needs to be done so
> they can be used? Say I use Russian letters: how do I practically find a
> certain letter and use all the SDT functions/classes?

Not sure what you mean by "SDT". You have to keep in mind that when your
program receives some text from somewhere, it may need to be
converted to UTF-8, or at least checked to see whether it *is* UTF-8;
and when your program outputs text to somewhere, it may need to be
converted to what is expected on the other side (plus the inevitable
error handling). C++ itself offers too few and too inconvenient methods
for that, so we typically seek help for converting and checking from
outside the C++ standard.
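
On Windows, for example, the boundary conversion is commonly done with
the Win32 API. A minimal sketch (the helper name is mine, error
handling simplified):

#include <windows.h>
#include <stdexcept>
#include <string>

std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call computes the required length; MB_ERR_INVALID_CHARS
    // makes the call fail on malformed UTF-8 instead of substituting.
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8.data(), (int)utf8.size(), nullptr, 0);
    if (n == 0) throw std::runtime_error("invalid UTF-8");
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &wide[0], n);
    return wide;
}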

> I am just quite new to this and trying to implement it in my first real
> projects. So I'm looking for direction.
>
> People already gave instructions and I read them; I'm just asking if
> there is more.

The other tricky thing you will eventually stumble upon is that people
sometimes expect your program to ignore the case of characters, or to
convert to upper case, lower case, or even title case, and how such
things are done may be specific to local traditions.
Again, the implementations of C++ tend to be quite unhelpful with this.
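
With a library such as ICU, locale-aware case conversion is short; a
sketch (assuming ICU is installed and linked):

#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main()
{
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("istanbul");
    s.toUpper(icu::Locale("tr")); // Turkish rules: i -> İ (dotted capital I)
    std::string out;
    s.toUTF8String(out);          // back to UTF-8
    std::cout << out << '\n';
}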

Paavo Helde

Feb 22, 2015, 10:51:10 AM
JiiPee <n...@notvalid.com> wrote in news:WvlGw.914189$Fr6.5...@fx14.am4:

> We already started talking about this, but I will start a new one as it
> is a separate issue.
>
> So what encoding do you guys use? UTF-8 or UTF-16? What is the
> recommendation, and what are your experiences?

Fully UTF-8, for portability. When the OS interface requires something else
(notably UTF-16 on Windows), the translation is done at the application
border.

This works well, but may become a bit tedious if large parts of the
application use UTF-16-only frameworks or libraries (like MFC or its newer
cousins). MFC also attempts to do its own narrow-to-wide conversions, but
using the wrong codepage, so these automatic conversions need to be switched
off to avoid surprises.

hth
Paavo

JiiPee

Feb 22, 2015, 11:15:49 AM
On 22/02/2015 15:34, Öö Tiib wrote:
> On Sunday, 22 February 2015 16:15:28 UTC+2, JiiPee wrote:
>> We already started talking about this, but I will start a new one as it
>> is a separate issue.
>>
>> So what encoding do you guys use? UTF-8 or UTF-16? What is the
>> recommendation, and what are your experiences?
>> I read on the web, and there was an argument about whether UTF-8 or
>> UTF-16 was better, and both sides had strong arguments. But it seems
>> like people here prefer UTF-8? And can you please briefly tell me how to
>> use UTF-8 in practice? Like how to get its length, find a certain
>> character, how to store it (well, I guess just a char [] array does the
>> job, or even std::string).
> In the general case UTF-8 is superior since it:
> 1) is compatible with ASCII (ASCII text is a subset of UTF-8)
> 2) does not have alignment issues (a UTF-16 code unit may need to be at an even address)
> 3) does not have endianness issues (UTF-16 may be LE or BE)
> 4) fits into std::string (whether std::wstring is UTF-16 or UTF-32 or
> something else entirely is unspecified)
> 5) is the encoding of the majority of internet text content

Yes. I knew most of these and agree.

>
> UTF-16 may be more convenient on Windows or with Qt. Still, if a significant
> part of the input or output is UTF-8 (I already mentioned the internet), then
> I would pick UTF-8 as the internal representation for texts in your application.

Yes, with Windows maybe UTF-16. I do Windows programming....
OK, so I think I'd better actually use UTF-8 and switch to 16 when
needed with Windows.

>
>> Does UTF-8 work with all the normal string functions like find, replace, etc.?
> You just have to accept that 'char' is a byte (not a text character) and
> 'std::string' is a contiguous container of such bytes (it gives no
> guarantee about the specific encoding of any text in it). Once that is
> accepted, everything works.

OK, but let's say I use 3 Russian letters. Many times I want to know
that there are 3 letters, rather than that the byte size is like 7.

>> If not, how do you deal with these, and what needs to be done so
>> they can be used? Say I use Russian letters: how do I practically find a
>> certain letter and use all the SDT functions/classes?
> Not sure what you mean by "SDT". You have to keep in mind that when your
> program receives some text from somewhere, it may need to be
> converted to UTF-8, or at least checked to see whether it *is* UTF-8;
> and when your program outputs text to somewhere, it may need to be
> converted to what is expected on the other side (plus the inevitable
> error handling). C++ itself offers too few and too inconvenient methods
> for that, so we typically seek help for converting and checking from
> outside the C++ standard.

I meant std:: - it was a typo.

I mean, if I have a Russian word with 3 Russian letters, I want to know
how many letters there are rather than how many bytes it is in total.
Sometimes we need that, right?

>
>> I am just quite new to this and trying to implement it in my first real
>> projects. So I'm looking for direction.
>>
>> People already gave instructions and I read them; I'm just asking if
>> there is more.
> The other tricky thing you will eventually stumble upon is that people
> sometimes expect your program to ignore the case of characters, or to
> convert to upper case, lower case, or even title case, and how such
> things are done may be specific to local traditions.

Yeah, I understand; I've read about it....

JiiPee

Feb 22, 2015, 11:20:45 AM
On 22/02/2015 15:50, Paavo Helde wrote:
> JiiPee <n...@notvalid.com> wrote in news:WvlGw.914189$Fr6.5...@fx14.am4:
>
>> We already started talking about this, but I will start a new one as it
>> is a separate issue.
>>
>> So what encoding do you guys use? UTF-8 or UTF-16? What is the
>> recommendation, and what are your experiences?
> Fully UTF-8, for portability.

OK, I'll take your word for it, and others' words as well, as people here
seem to agree with that. I'm sure people have good experience.
And it's good to hear an opinion stated like this. On the web they did not
give many recommendations, just the pros and cons of each. But I want to
hear recommendations.

> When the OS interface requires something else
> (notably UTF-16 on Windows), the translation is done at the application
> border.

OK, sounds logical.

>
> This works well, but may become a bit tedious if large parts of the
> application use UTF-16-only frameworks or libraries (like MFC or its newer
> cousins).

Hmm, I am. But on the other hand I also use the TinyXml C++ library
(http://www.grinninglizard.com/tinyxmldocs/index.html) a lot there to save
my data, and it uses the UTF-8 format. So I guess that means I should
use UTF-8.

> MFC also attempts to do its own narrow-to-wide conversions, but
> using the wrong codepage, so these automatic conversions need to be switched
> off to avoid surprises.
>

Well, I guess I'd better check that all the characters work OK, then.

Chris Vine

Feb 22, 2015, 12:01:29 PM
On Sun, 22 Feb 2015 14:15:13 +0000
JiiPee <n...@notvalid.com> wrote:
> We already started talking about this, but I will start a new one as it
> is a separate issue.
>
> So what encoding do you guys use? UTF-8 or UTF-16? What is the
> recommendation, and what are your experiences?
> I read on the web, and there was an argument about whether UTF-8 or
> UTF-16 was better, and both sides had strong arguments. But it seems
> like people here prefer UTF-8? And can you please briefly tell me how to
> use UTF-8 in practice? Like how to get its length, find a certain
> character, how to store it (well, I guess just a char [] array does the
> job, or even std::string).

The point you need to realise about Unicode is that there is no way to
index into Unicode code points with either UTF-8 or UTF-16, because they
are variable-length encodings. To find a character at a given character
index, you have to go to the start of the string and work your way
along. Furthermore, any given UTF-8 or UTF-16 string must be treated as
const, because altering a given character may change the length (in 8-
or 16-bit code units) of the entire string, so indexing is not
particularly useful.

You might think that would not be true of UTF-32, and you would be half
right. But only half right, because although you can index into a
code point with UTF-32, combining characters also come into play. For
example, there are two renditions of characters with a diaeresis. Lower
case o with diaeresis could be given as either 'U+006F U+0308' (two code
points) or 'U+00F6' (one code point). Any reasonable equality test
should treat them as the same. So how would you describe the "length"
of that string?

There are libraries available to help with this, such as ICU and
UTF8-CPP. In practice I have rarely found them necessary. For UTF-8 I
use std::string, and if I want to progress by code points along the
string I have an iterator which does that for me, for which operator++
and operator-- iterate by whole Unicode code points, and which when
dereferenced returns a 32-bit value (the current Unicode code point
value) rather than a char.
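
Not the actual iterator, but a minimal sketch of the decoding step such
an iterator rests on (the helper name is illustrative; no validation,
well-formed UTF-8 is assumed):

#include <cstdint>
#include <string>

// Decodes the code point whose lead byte is at index i, and advances
// i to the start of the next code point.
std::uint32_t decode_at(const std::string& s, std::size_t& i)
{
    unsigned char c = s[i];
    // Number of continuation bytes: 0 for ASCII, otherwise 1 to 3.
    int extra = (c < 0x80) ? 0 : (c < 0xE0) ? 1 : (c < 0xF0) ? 2 : 3;
    std::uint32_t cp = (extra == 0) ? c : (c & (0x3F >> extra));
    for (int k = 0; k < extra; ++k)
        cp = (cp << 6) | (s[++i] & 0x3F);
    ++i;
    return cp;
}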

Chris

JiiPee

Feb 22, 2015, 12:22:00 PM
OK, it is good to know what people here use. If others also tell what
they use, that is helpful...
OK, so you use just string... I guess that might actually work for me too.
So how do you calculate the length of a Russian word, for example? Let's
say you have a 4-letter Russian word... what function do you use to
calculate that length in letters?

>
> Chris

JiiPee

Feb 22, 2015, 12:25:36 PM
On 22/02/2015 17:01, Chris Vine wrote:
> There are libraries available to help with this, such as ICU and
> UTF8-CPP. In practice I have rarely found them necessary. For UTF-8 I
> use std::string, and if I want to progress by code points along the
> string I have an iterator which does that for me, for which operator++
> and operator-- iterate by whole Unicode code points, and which when
> dereferenced returns a 32-bit value (the current Unicode code point
> value) rather than a char.

Interesting. So the data, let's say 10000 characters, is stored as
UTF-8 (to save space). Then when you need one character you get it as a
32-bit value. Hmm...

Paavo Helde

Feb 22, 2015, 12:59:32 PM
JiiPee <n...@notvalid.com> wrote in news:BeoGw.779195$et7.3...@fx45.am4:

> So how do you calculate the length of a Russian word, for example? Let's
> say you have a 4-letter Russian word... what function do you use to
> calculate that length in letters?

Why would you need that? Sure, there are programs for which this is
relevant, like a program for solving or composing Russian crosswords, but
for the type of software you are writing, why would you need this? Just
curious...


If I needed this (and nothing else Unicode-related) then I would probably
use a little function like this (not tested):

#include <stddef.h>

// Counts UTF-8 lead bytes (bytes that do not match the 10xxxxxx
// continuation-byte pattern), which for valid UTF-8 equals the
// number of code points.
size_t LengthInCodepoints(const unsigned char* utf8, size_t sizeInBytes)
{
    size_t result = 0;
    for (size_t i = 0; i < sizeInBytes; ++i) {
        if ((utf8[i] & 0x80) == 0 || (utf8[i] & 0x40) != 0) {
            ++result;
        }
    }
    return result;
}

If I needed more, I would use the ICU library or something.

Cheers
Paavo

JiiPee

Feb 22, 2015, 1:10:44 PM
OK, I'll give a better example: say you have a Russian sentence and you
want to change a certain word inside it, say "car" to "vehicle". How
would you do it? So how do you do a replace? That is surely needed...

Jens Thoms Toerring

Feb 22, 2015, 1:43:04 PM
JiiPee <n...@notvalid.com> wrote:
> OK, but let's say I use 3 Russian letters. Many times I want to know
> that there are 3 letters, rather than that the byte size is like 7.

Here's a function for counting the number of UTF-8 letters in a
C string (this is from a project written in C, but it should
not be too much trouble converting it to C++, using std::string
instead).

#include <sys/types.h>   /* for ssize_t */

/***************************************
 * Function for determining the number of (UTF-8) characters in
 * a string. If it's not a valid UTF-8 string -1 is returned.
 ***************************************/

ssize_t
utf8_length( const char * str )
{
    const unsigned char * p = ( const unsigned char * ) str;
    ssize_t cnt = 0;

    if ( ! str )
        return -1;

    for ( ; *p; p++, cnt++ )
    {
        if ( *p <= 0x7F )                    // ASCII
            /* empty */ ;
        else if ( ( *p & 0xE0 ) == 0xC0 )    // should be 2 bytes
        {
            if ( ( *++p & 0xC0 ) != 0x80 )
                return -1;
        }
        else if ( ( *p & 0xF0 ) == 0xE0 )    // should be 3 bytes
        {
            if ( ( *++p & 0xC0 ) != 0x80
                 || ( *++p & 0xC0 ) != 0x80 )
                return -1;
        }
        else if ( ( *p & 0xF8 ) == 0xF0 )    // should be 4 bytes
        {
            if ( ( *++p & 0xC0 ) != 0x80
                 || ( *++p & 0xC0 ) != 0x80
                 || ( *++p & 0xC0 ) != 0x80 )
                return -1;
        }
        else                                 // anything else is invalid
            return -1;
    }

    return cnt;
}

You can probably already see the elements of an iterator lurking
in there ;-)
Best regards, Jens
--
\ Jens Thoms Toerring ___ j...@toerring.de
\__________________________ http://toerring.de

Robert Wessel

Feb 22, 2015, 2:09:31 PM
You might want to read this:

http://utf8everywhere.org/

Robert Wessel

Feb 22, 2015, 2:15:15 PM
On 22 Feb 2015 18:42:54 GMT, j...@toerring.de (Jens Thoms Toerring)
wrote:
That will count code points, but not what you'd think of as characters
if you consider things like combining code points.

Paavo Helde

Feb 22, 2015, 2:23:52 PM
JiiPee <n...@notvalid.com> wrote in news:nYoGw.674692$4b6.2...@fx44.am4:
Here you go:

std::string sentence = ...;
// "автомобиль" (car), spelled out as UTF-8 bytes:
std::string carInRussian =
    "\xd0\xb0\xd0\xb2\xd1\x82\xd0\xbe\xd0\xbc"
    "\xd0\xbe\xd0\xb1\xd0\xb8\xd0\xbb\xd1\x8c";
// "машина" (vehicle/machine), spelled out as UTF-8 bytes:
std::string vehicleInRussian =
    "\xd0\xbc\xd0\xb0\xd1\x88\xd0\xb8\xd0\xbd\xd0\xb0";

std::string::size_type pos = sentence.find(carInRussian);
if (pos != std::string::npos) {
    sentence.replace(pos, carInRussian.length(), vehicleInRussian);
}

No Unicode knowledge needed for such replacements. Zilch. Nada.

hth
Paavo



Chris Vine

Feb 22, 2015, 2:26:48 PM
On Sun, 22 Feb 2015 18:10:22 +0000
JiiPee <n...@notvalid.com> wrote:
> OK, I'll give a better example: say you have a Russian sentence and you
> want to change a certain word inside it, say "car" to "vehicle". How
> would you do it? So how do you do a replace? That is surely needed...

You would normalize to either use or not use precomposed characters (a
Unicode issue), decide how to deal with singular and plural forms and
other grammatical inflection (a language issue bearing on your problem
space), and then partition on the word(s) in question and construct a new
string from the two parts with the substitute word inserted, having regard
to whatever decisions you have reached about the grammatical forms. (Be
pleased that, unlike Celtic languages such as Welsh or Irish, Russian does
not as far as I am aware have initial mutations to contend with as well
as suffixed inflexions.)

The unicode question is but one of your issues here.

On your earlier question of how you find the length of a Russian
word: no answer can be given until you specify what you are
measuring as your unit of length. Code points? If so, with or without
precomposed characters? Or glyphs? Or graphemes? On the latter
(and looking more widely than just Russian), how do you treat ligatures?
For example, is eszett (ß) one or two "characters", and if one, does it
become two in its upper-case form? Is a ligatured fi one or two
"characters"? If using Hangul, how do you deal with Hangul jamos? If
using Indic scripts, how are you counting consonant clusters? Asking
the question is usually completely pointless.

Chris

Paavo Helde

Feb 22, 2015, 2:43:29 PM
Paavo Helde <myfir...@osa.pri.ee> wrote in
news:XnsA449D9A6D3B04m...@216.196.109.131:

> JiiPee <n...@notvalid.com> wrote in news:nYoGw.674692$4b6.2...@fx44.am4:
>> OK, I'll give a better example: say you have a Russian sentence and you
>> want to change a certain word inside it, say "car" to "vehicle". How
>> would you do it? So how do you do a replace? That is surely needed...
>
> No Unicode knowledge needed for such replacements. Zilch. Nada.

Maybe I should have pointed out that the UTF-8 (and UTF-16) encodings have
been carefully designed to make such things work.

Complications arise if the strings are not normalized in the same way and
use different representations of the same letters. But such things
can happen with ASCII as well, e.g. "$3400" and "$3,400", a tab versus a
space, capitalization, etc. Anyway, the point is that, assuming the texts
are normalized to the needed extent, processing them without any knowledge
of code point borders is often trivial.

Cheers
Paavo

Jens Thoms Toerring

Feb 22, 2015, 3:00:56 PM
Robert Wessel <robert...@yahoo.com> wrote:
> That will count code points, but not what you'd think of as characters
> if you consider things like combining code points.

Good catch! I hadn't thought of that. Back to the drawing
board ;-)
Best regards, Jens

JiiPee

Feb 22, 2015, 3:48:07 PM
OK thanks, I'll save this.

Geoff

Feb 22, 2015, 3:52:04 PM
On Sun, 22 Feb 2015 16:15:34 +0000, JiiPee <n...@notvalid.com> wrote:

>I meant std:: - it was a typo.
>
>I mean, if I have a Russian word with 3 Russian letters, I want to know
>how many letters there are rather than how many bytes it is in total.
>Sometimes we need that, right?

std::wstring wstr = L"123";

std::cout << "There are " << wstr.length() << " characters in wstr, "
          << "its size is " << wstr.length() * sizeof(wchar_t)
          << " bytes" << std::endl;

JiiPee

Feb 22, 2015, 4:04:29 PM
I already read that... but then there was somebody arguing strongly
there that UTF-16 is better, and it seemed like he won the argument and
the author backed down a bit... dunno... that's why I'm still asking.

Nobody

Feb 22, 2015, 8:22:36 PM
On Sun, 22 Feb 2015 14:15:13 +0000, JiiPee wrote:

> So what encoding do you guys use? UTF-8 or UTF-16? What is the
> recommendation, and what are your experiences?

Use UTF-8 for anything written to a file or sent over a byte-oriented
communication channel (e.g. most network protocols).

Internally, use whatever works best for what you're going to do with the
data.

E.g. on Windows, you typically need to use wchar_t* (ignore the question
of whether it's UTF-16 or UCS-2; it isn't either of those, it's JUST an
array of wchar_t). Windows filenames are arrays of wchar_t; if you use the
char* APIs (e.g. CreateFileA() or fopen()), the program will only be able
to access files whose names are representable in the active "codepage"
(essentially Microsoft-speak for "encoding").
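
A minimal sketch of that point (the filename is hypothetical example
data): the wide API takes the name as-is, with no codepage involved.

#include <windows.h>

int main()
{
    // L"привет.txt", spelled out as UTF-16 code units
    HANDLE h = CreateFileW(L"\x043f\x0440\x0438\x0432\x0435\x0442.txt",
                           GENERIC_READ, 0, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
}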

> I read on the web, and there was an argument about whether UTF-8 or
> UTF-16 was better, and both sides had strong arguments. But it seems
> like people here prefer UTF-8? And can you please briefly tell me how to
> use UTF-8 in practice? Like how to get its length, find a certain
> character, how to store it (well, I guess just a char [] array does the
> job, or even std::string).
>
> Does UTF-8 work with all the normal string functions like find, replace,
> etc.? If not, how do you deal with these, and what needs to be done so
> they can be used? Say I use Russian letters: how do I practically find a
> certain letter and use all the SDT functions/classes?

If you need to do almost anything which "interprets" text, you need a
library such as ICU (site.icu-project.org).

The built-in methods of a std::string work with bytes, not "characters"
(in any sense of the word). Similarly, the built-in methods of a
std::wstring work with wchar_t-s, not characters (the fact that a wchar_t
is closer to being equivalent to a character just makes the bugs less
obvious than if you'd used char instead).

A 16-bit wchar_t either means that it takes 1 or 2 values to represent a
single Unicode codepoint (UTF-16) or limits you to the basic multilingual
plane (UCS-2).

A 32-bit wchar_t gives you the full Unicode range with a 1:1
correspondence between values and codepoints, at the expense of using more
memory (potentially four times what you need if you mostly deal with
Latin-based languages).
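
C++11 does offer one (clunky) way to expand UTF-8 into 32-bit code
points; a sketch (the helper name is mine):

#include <codecvt>
#include <locale>
#include <string>

std::u32string toCodepoints(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8); // throws std::range_error on bad input
}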

But even that isn't really a "character" because of the existence of
combining characters, e.g. a lower-case letter a with an acute accent
could be represented as either the precomposed character U+00E1 = LATIN
SMALL LETTER A WITH ACUTE, or the sequence U+0061 = LATIN SMALL LETTER A
followed by U+0301 = COMBINING ACUTE ACCENT.

For comparisons, these forms ought to be equivalent, meaning that
strings need to be normalised (and there are 4 standard normal forms).
But software which deals with wide strings often treats them simply as
arrays of codepoints. This includes core Windows APIs, which will happily
allow you to create two files in the same directory with the same
(apparent) name but which differ in whether accented characters are
pre-composed.

Note that there is no normal form which guarantees that each "character"
is pre-composed, as not all characters have pre-composed forms (e.g.
Hangul (Korean) only has pre-composed forms for "modern" Korean, which
is insufficient for a number of uses).
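
With a library such as ICU, normalising before comparison is short; a
sketch (the helper name is mine; assumes ICU is available):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

icu::UnicodeString toNFC(const icu::UnicodeString& in)
{
    UErrorCode status = U_ZERO_ERROR;
    // NFC: compose wherever a precomposed form exists
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    icu::UnicodeString out = nfc->normalize(in, status);
    if (U_FAILURE(status)) { /* handle the error */ }
    return out;
}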

You also have to deal with issues such as capitalisation being more
complex in languages other than English. E.g. the upper-case equivalent of
a German "sharp s" (looks a bit like lower-case "beta") is "SS" (two
characters). Turkish has dotted and un-dotted "I" characters, each with
lower-case and upper-case versions; lower-case dotted-I (i) and upper-case
un-dotted-I (I) are the same characters as Latin, but case conversion
using the rules for Latin will give the wrong result. Ligatures often have
lower-case, upper-case and title-case variants (i.e. to convert a string
to title case, the first character of each word must be converted to
title case, not upper case).

Juha Nieminen

Feb 23, 2015, 12:56:34 AM
Öö Tiib <oot...@hot.ee> wrote:
> In the general case UTF-8 is superior since it:
> 1) is compatible with ASCII (ASCII text is a subset of UTF-8)
> 2) does not have alignment issues (a UTF-16 code unit may need to be at an even address)
> 3) does not have endianness issues (UTF-16 may be LE or BE)
> 4) fits into std::string (whether std::wstring is UTF-16 or UTF-32 or
> something else entirely is unspecified)
> 5) is the encoding of the majority of internet text content

That's quite a one-sided view, as you didn't contrast it with the
advantages of UTF-16:

1) UTF-16 is faster to handle, especially if you can limit it to UCS-2
(but even if you don't.)

2) UTF-16 takes less space with many non-western languages, such as
Japanese. (Most, if not all, Japanese characters take 3 bytes with
UTF-8 but only 2 bytes with UTF-16.)

--- news://freenews.netfront.net/ - complaints: ne...@netfront.net ---

JiiPee

Feb 23, 2015, 2:39:11 AM
So does the ICU library handle these correctly, the upper-casing? And the others?

Paavo Helde

Feb 23, 2015, 3:47:36 AM
JiiPee <n...@notvalid.com> wrote in news:mOAGw.61559$7E1....@fx31.am4:

>
> so does ICU -library handle these correctly, the uppercases? And
> others?

It very much depends on what is considered "correct". And this very much
depends on the intended purpose and scope of the program. Filenames, for
example, should be left intact by most of the code - if the user wants to
put mixed upper case, surrogate pairs or zero-width spaces in filenames,
then that's what s/he ought to get. No ICU needed here.

On the other hand, when you are implementing a Google-like text search, an
arguably correct way is to find all possible cases/variants of the word
(e.g. "ferry" and "ferries", "color" and "colour", etc.) in all of the
languages of the world (English being one of the simplest in this respect),
or at least in the languages whose alphabets have been included in the
Unicode standard. ICU most probably does not cover this (but may help to
some extent).

hth
Paavo

Öö Tiib

Feb 23, 2015, 12:18:45 PM
On Monday, 23 February 2015 07:56:34 UTC+2, Juha Nieminen wrote:
> Öö Tiib <oot...@hot.ee> wrote:
> > In the general case UTF-8 is superior since it:
> > 1) is compatible with ASCII (ASCII text is a subset of UTF-8)
> > 2) does not have alignment issues (a UTF-16 code unit may need to be at an even address)
> > 3) does not have endianness issues (UTF-16 may be LE or BE)
> > 4) fits into std::string (whether std::wstring is UTF-16 or UTF-32 or
> > something else entirely is unspecified)
> > 5) is the encoding of the majority of internet text content
>
> That's quite a one-sided view, as you didn't contrast it with the
> advantages of UTF-16:

I mentioned that UTF-16 can be more convenient depending on platform
/ framework.

> 1) UTF-16 is faster to handle, especially if you can limit it to UCS-2
> (but even if you don't.)
>
> 2) UTF-16 takes less space with many non-western languages, such as
> Japanese. (Most, if not all, Japanese characters take 3 bytes with
> UTF-8 but only 2 bytes with UTF-16.)

These are really only performance factors. A megabyte of Japanese takes
days for a person to read, regardless of whether it is UTF-8 or UTF-16.
If it is meant for a computer to read and I need extreme performance,
then I don't see the reason to choose Japanese (or text I/O in general).

Juha Nieminen

Feb 24, 2015, 3:39:37 AM
Öö Tiib <oot...@hot.ee> wrote:
>> 1) UTF-16 is faster to handle, especially if you can limit it to UCS-2
>> (but even if you don't.)
>>
>> 2) UTF-16 takes less space with many non-western languages, such as
>> Japanese. (Most, if not all, Japanese characters take 3 bytes with
>> UTF-8 but only 2 bytes with UTF-16.)
>
> These are really only performance factors. A megabyte of Japanese takes
> days for a person to read, regardless of whether it is UTF-8 or UTF-16.
> If it is meant for a computer to read and I need extreme performance,
> then I don't see the reason to choose Japanese (or text I/O in general).

You are only thinking about it in terms of a human reading the text.
There *are* other possible situations, you know.

For example, you may have a word game with a dictionary, and you might
need to implement a puzzle solver algorithm which finds, from a set of
given characters, all the possible words that can be formed using those
characters, as fast as possible.

UTF-8 would be horrible in terms of speed for this, while UTF-16
would be quite optimal.

Paavo Helde

Feb 24, 2015, 3:58:39 AM
Juha Nieminen <nos...@thanks.invalid> wrote in
news:mchdbq$24pk$1@adenine.netfront.net:

> Öö Tiib <oot...@hot.ee> wrote:
>> These are really only performance factors. A megabyte of Japanese takes
>> days for a person to read, regardless of whether it is UTF-8 or UTF-16.
>> If it is meant for a computer to read and I need extreme performance,
>> then I don't see the reason to choose Japanese (or text I/O in general).
>
> You are only thinking about it in terms of a human reading the text.
> There *are* other possible situations, you know.
>
> For example, you may have a word game with a dictionary, and you might
> need to implement a puzzle solver algorithm which finds, from a set of
> given characters, all the possible words that can be formed using those
> characters, as fast as possible.
>
> UTF-8 would be horrible in terms of speed for this, while UTF-16
> would be quite optimal.

I do not see any big difference between UTF-8 and UTF-16 here; both are
variable-length formats. Maybe you assumed that the task is limited to
Japanese and one can use UCS-2 instead of UTF-16? UCS-2 would probably be
faster than UTF-8 indeed, but I'm not sure by how much, because most
Japanese words are quite short. It might well be that the dictionary search
dominates the word construction and there is actually not much difference.

Cheers
Paavo

Öö Tiib

Feb 24, 2015, 1:01:46 PM
On Tuesday, 24 February 2015 10:39:37 UTC+2, Juha Nieminen wrote:
> Öö Tiib <oot...@hot.ee> wrote:
> >> 1) UTF-16 is faster to handle, especially if you can limit it to UCS-2
> >> (but even if you don't.)
> >>
> >> 2) UTF-16 takes less space with many non-western languages, such as
> >> Japanese. (Most, if not all, Japanese characters take 3 bytes with
> >> UTF-8 but only 2 bytes with UTF-16.)
> >
> > These are really only performance factors. A megabyte of Japanese takes
> > days for a person to read, regardless of whether it is UTF-8 or UTF-16.
> > If it is meant for a computer to read and I need extreme performance,
> > then I don't see the reason to choose Japanese (or text I/O in general).
>
> You are only thinking about it in terms of a human reading the text.
> There *are* other possible situations, you know.

Certainly there are tons of situations that I have not measured.

> For example, you may have a word game with a dictionary, and you might
> need to implement a puzzle solver algorithm which finds, from a set of
> given characters, all the possible words that can be formed using those
> characters, as fast as possible.

In situations like that, the performance bottlenecks are more likely in the
design of the dictionary and the algorithms run on it. In those areas I
have seen improvements made that raised average performance 4 to 200 times.

> UTF-8 would be horrible in terms of speed for this, while UTF-16
> would be quite optimal.

Yes, I have yet to see differences between UTF-16 and UTF-8 that
profile as "horrible". Your own extreme all-Japanese case
is only a 1.5-times difference in storage. Things typically run
sufficiently fast, so I can rarely convince stakeholders of maintenance
that supposedly improves performance by less than 2 times.

Juha Nieminen

Feb 26, 2015, 3:12:04 AM
Öö Tiib <oot...@hot.ee> wrote:
>> UTF-8 would be horrible in terms of speed for this, while UTF-16
>> would be quite optimal.
>
> Yes, I have yet to see differences between UTF-16 and UTF-8 that
> profile as "horrible". Your own extreme all-Japanese case
> is only a 1.5-times difference in storage. Things typically run
> sufficiently fast, so I can rarely convince stakeholders of maintenance
> that supposedly improves performance by less than 2 times.

If the tight inner loop of your dictionary search consists of
character comparisons, and you are performing millions of such
comparisons (as is very easily the case with e.g. puzzle solvers),
UTF-16/UCS-2 vs. UTF-8 makes a significant difference.

It also makes the code simpler and shorter.

JiiPee

Feb 26, 2015, 3:34:32 AM
Both are variable-length byte strings. You mean it's faster because it
contains fewer bytes on average than UTF-8? That's why it's faster? Or
because the encoding rules are faster?

Martijn Lievaart

Feb 26, 2015, 5:00:46 AM
Assuming he meant UCS-2 where he said UTF-16, I would guess because
indexing is faster. In applications like the puzzle solver above, that
would be the bottleneck.


M4