On 08/01/2022 20:12, Mateusz Viste wrote:
> 2022-01-08 at 10:53 -0800, Öö Tiib wrote:
>> That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
>> varying length. That makes UTF-8 the sole sane one.
> You are right, the implementations of UTF-16 I worked on were limited
> to the BMP (ie. always 2 bytes), hence my simplified view.
When Unicode was young, the intention was that every glyph was one
character, and it would all fit in 16 bits - that was UCS2, as used
originally by Windows NT, Java, Python, Qt, and other systems, languages
and libraries. But it was quickly discovered that 16 bits were far from
enough.
> Still, UTF-32 is always 4 bytes for any possible glyph, isn't it?
The terminology of Unicode can be a little confusing. (And I'm sure
someone will correct me if I get it wrong.)
A "code point" is an entry in the Unicode tables. Each code point is
uniquely identified by a 32-bit number. The code points are organised
in "planes" for convenience, and designed so that the first 128 code
points match ASCII and that a wide range of languages can be covered by
the code units in the range 0x0000 .. 0xffff (excluding 0xd800 ..
0xdfff) so that 16 bits would often be enough.
A "code unit" is the container for the bits of the encoding. In UTF-8,
a code unit is an 8-bit unit. In UTF-16, it is 16-bit, in UTF-32 it is
UTF-8 takes up to four code units (32 bits total) per code point, UTF-16
takes up to two code units, and UTF-32 takes exactly one code unit per
code point. UTF-8 is always at least as compact as UTF-32, and will be
more or less compact than UTF-16 depending on the content. These are
just different encodings - different ways to write the code points.
There are others, such as GB18030, which is a variable-width encoding
(one, two or four bytes per code point) popular in China because it is
backwards compatible with their traditional GB encodings in the same
way UTF-8 is backwards compatible with ASCII.
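To make the code unit counts concrete, here is a small C11 example
(using the standard u8/u/U string literal prefixes) that writes the
U+1F44C code point quoted above in each of the three UTF encodings and
prints how many code units each one needs:

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    static const char     s8[]  = u8"\U0001F44C";  /* UTF-8  */
    static const char16_t s16[] = u"\U0001F44C";   /* UTF-16 */
    static const char32_t s32[] = U"\U0001F44C";   /* UTF-32 */

    /* Array sizes minus the terminating zero code unit. */
    printf("UTF-8 : %zu code units of 8 bits\n",
           sizeof s8 / sizeof s8[0] - 1);
    printf("UTF-16: %zu code units of 16 bits\n",
           sizeof s16 / sizeof s16[0] - 1);
    printf("UTF-32: %zu code units of 32 bits\n",
           sizeof s32 / sizeof s32[0] - 1);
    return 0;
}

That prints 4, 2 and 1 respectively.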
A "grapheme" is a written mark - a letter, punctuation, accent, etc.,
that conveys meaning. Sometimes it is useful to break them down into
their parts, sometimes it is useful to treat them as a single unit.
For example, "é" can be considered as a single grapheme, or as a
grapheme "e" followed by a combining acute accent. The same grapheme
can also correspond to more than one code point - a Latin alphabet
capital A looks identical to a Greek alphabet capital Alpha, yet they
are distinct code points.
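To see the grapheme/code point distinction in bytes, here is a small C
example (just a sketch - the byte values are the standard UTF-8
encodings of U+00E9, and of U+0065 followed by U+0301):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *precomposed = "\xC3\xA9";   /* U+00E9, one code point */
    const char *decomposed  = "e\xCC\x81";  /* U+0065 + U+0301        */

    /* Both render as "é", but the byte sequences differ. */
    printf("precomposed: %zu bytes, decomposed: %zu bytes\n",
           strlen(precomposed), strlen(decomposed));
    printf("bytewise equal: %s\n",
           strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
    return 0;
}

Treating those two as equal needs Unicode normalisation, not just a
byte comparison.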
A "glyph" is a rendering of a grapheme - the letter "A" in different
fonts are different glyphs of the same grapheme.
What the reader perceives as a "character" is often a single grapheme,
but might be several graphemes together.
So, with that in mind, all three UTF formats require multiple code units
to cover all graphemes. But UTF-32 always gets one code point per code
unit, making it simpler and more consistent for processing Unicode text.
As a file or transfer encoding, it has the big inconvenience of being
endian-specific as well as being bulkier than UTF-8. UTF-16 combines
the worst features of UTF-8 with the worst features of UTF-32, with none
of the benefits - it exists solely because early Unicode adopters
committed too strongly to UCS2.
People are often concerned that UTF-8 is difficult or complex to decode
or split up. It is not, in practice. It is actually quite rare that
you need to divide up a string based on characters or even find its
length in code points - for most uses of strings, you just pass them
around without bothering about the details of the contents. You need to
know how much memory the string takes, not how many code points it has.
And simply treating it as an abstract stream of bytes terminated by a
zero character can be enough to give you a usable ordering and
uniqueness comparison for many uses. The point where you need to
decode the code units and know what they mean is when you are doing
rendering, locale-aware sorting, or other human interaction - and then
you have such a vastly bigger task that turning UTF-8 code units into
UTF-32 code points is negligible effort in comparison.
(And UTF-8 is not much harder to encode or decode than UTF-16.)
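To back up the claim that UTF-8 decoding is not hard, here is a minimal
decoder sketch in C. It checks sequence lengths and continuation bytes
but, to stay short, does not reject overlong forms or the surrogate
range, so treat it as an illustration rather than a validating decoder:

#include <stddef.h>

/* Decode one UTF-8 sequence from s (at most len bytes).  Returns the
   number of bytes consumed, or 0 on malformed input; the decoded code
   point is written to *cp. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                        /* 0xxxxxxx: ASCII       */
        *cp = s[0];
        return 1;
    }

    size_t n;
    if ((s[0] & 0xE0) == 0xC0) {              /* 110xxxxx: 2 bytes     */
        n = 2; *cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {       /* 1110xxxx: 3 bytes     */
        n = 3; *cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {       /* 11110xxx: 4 bytes     */
        n = 4; *cp = s[0] & 0x07;
    } else {
        return 0;                             /* stray continuation    */
    }

    if (len < n)
        return 0;

    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)            /* must be 10xxxxxx      */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}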