On Sat, 28 Mar 2015 22:07:05 +0000
Vir Campestris <vir.cam...@invalid.invalid> wrote:
> On 28/03/2015 08:57, Jorgen Grahn wrote:
> > Surely utf8 /is/ unicode?
>
> Oh no, not at all.
Oh yes.
> Unicode is a wide character set
That is wrong. Unicode is a defined set of code points. There are a
number of encodings for those code points: UTF-8, UTF-16 and UTF-32,
which the unicode standard defines, and UTF-7, which is specified
separately in RFC 2152. Because UTF-32 is in fact a one to one encoding
of code points, it is in practice identical to UCS-4.
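To make the distinction concrete, here is a minimal Python sketch (my
own illustration; the choice of U+20AC EURO SIGN is arbitrary) that
encodes one code point three ways and checks that the UTF-32 code unit
is numerically the code point itself:

```python
# One code point, U+20AC EURO SIGN, in three encodings.
ch = "\u20ac"
utf8 = ch.encode("utf-8")        # three 8-bit code units
utf16 = ch.encode("utf-16-be")   # one 16-bit code unit
utf32 = ch.encode("utf-32-be")   # one 32-bit code unit

# The single UTF-32 code unit has the same value as the code point.
assert int.from_bytes(utf32, "big") == ord(ch) == 0x20AC

print(utf8.hex(), utf16.hex(), utf32.hex())
# prints: e282ac 20ac 000020ac
```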
> Now most of us use characters that are only in the first 256; so a
> byte-wide set works fine. On the other hand in Chinese there are way
> more characters than that. UTF-8 is a way of encoding the characters
> that is efficient for us westerners. It isn't bad for China either, as
> a lot of the characters can be compressed down to 2 bytes - which
> means UTF-8 is no bigger than 16-bit Unicode (UCS-2?).
UTF-16 is the 16 bit encoding for unicode. UCS-2 only supports the
basic multilingual plane, whereas UTF-16 can represent all unicode code
points (and accordingly is a variable length encoding as it uses
surrogate pairs of 16 bit code units). UTF-16 occupies more space than
UTF-8 for the average european script. It is reputed to occupy slightly
less for the average Japanese script.
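A rough Python sketch of both points follows. The sample strings are
my own and far too short to say anything about "average" text, and the
helper name utf16_units is just for illustration; real Japanese text
usually also carries ASCII (markup, digits, spaces), which narrows the
gap.

```python
def utf16_units(s):
    # Number of 16-bit code units in the UTF-16 encoding of s.
    return len(s.encode("utf-16-le")) // 2

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the basic multilingual
# plane, so UTF-16 needs a surrogate pair (two code units) for it.
clef = "\U0001d11e"
assert utf16_units(clef) == 2
assert clef.encode("utf-16-be") == b"\xd8\x34\xdd\x1e"

# Byte sizes for two tiny samples (escapes used to keep this ASCII).
european = "na\u00efve caf\u00e9"      # mostly ASCII letters
japanese = "\u65e5\u672c\u8a9e"        # three kanji
for text in (european, japanese):
    print(len(text.encode("utf-8")), len(text.encode("utf-16-le")))
# prints: 12 20
#         9 6
```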
> Windows uses 16-bit characters for its native APIs. But to meet all
> the characters - and this includes some musical ones - you have to
> have more than 16 bits. UTF-8 does this - but takes up to 4 bytes per
> character. And ++ on a char string doesn't work any more - you have
> to know you're using UTF-8 and take special measures to get whole
> characters.
Many "characters" (given the meaning most people think it has) require
more than one unicode code point in normalized non-precomposed form.
Some "characters" are not representable in precomposed form at all.
Such representations require more than one UTF-32 code unit, and can
require more than two UTF-16 code units and more than four UTF-8 code
units.
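As a sketch of the counts involved, using Python's standard
unicodedata module (the helper name units and the sample sequences are
my own choices for illustration):

```python
import unicodedata

def units(s):
    # (UTF-32, UTF-16, UTF-8) code unit counts for the string s.
    return (len(s.encode("utf-32-le")) // 4,
            len(s.encode("utf-16-le")) // 2,
            len(s.encode("utf-8")))

# e-acute is one code point precomposed, two once decomposed (NFD).
decomposed = unicodedata.normalize("NFD", "\u00e9")
assert decomposed == "e\u0301"
assert units(decomposed) == (2, 2, 3)

# q + combining tilde has no precomposed form, so even NFC
# normalization leaves it as two code points.
no_precomposed = unicodedata.normalize("NFC", "q\u0303")
assert units(no_precomposed) == (2, 2, 3)

# A base letter with two stacked combining marks (macron, acute)
# already needs five UTF-8 code units.
stacked = "x\u0304\u0301"
assert units(stacked) == (3, 3, 5)
```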
Because UTF-16 is a variable length encoding, your '++' does not work
(for your meaning of "work") with UTF-16 either. Because of combining
characters, nor does UTF-32 if by "character" you mean a grapheme,
which is what most people think it means (namely, what they see as a
"character" in their terminal).
> AIUI Unix has used UTF-8 since the year dot, and hence Linux since
> birth. DOS had national variants :( - which is why the Japanese (used
> to?) use the yen currency character for backslash in path separators.
No. Unix used to be as incoherent in its narrow codesets as microsoft
was with its code pages. ISO-8859 was common for non-cyrillic european
scripts, KOI8-R for cyrillic, and EUC ("Extended Unix Code") for CJK
scripts. JIS and Shift-JIS were also in use for Japanese scripts and
GB 2312 for Chinese scripts.
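The practical consequence was that a byte stream meant nothing without
knowing its codeset. A small Python sketch using the standard codecs
(the byte value and sample text are arbitrary choices of mine):

```python
# The same byte is a different letter in each single-byte codeset.
raw = bytes([0xC1])
latin = raw.decode("iso-8859-1")   # LATIN CAPITAL LETTER A WITH ACUTE
cyrillic = raw.decode("koi8_r")    # CYRILLIC SMALL LETTER A
assert latin == "\u00c1"
assert cyrillic == "\u0430"

# The same Japanese text is a different byte sequence in each of the
# two common Japanese codesets, and only round-trips when decoded
# with the codec it was encoded with.
text = "\u65e5\u672c"              # two kanji
euc = text.encode("euc_jp")
sjis = text.encode("shift_jis")
assert euc != sjis
assert euc.decode("euc_jp") == sjis.decode("shift_jis") == text
```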
Chris