albert@cherry.(none) (albert) writes:
>In article <2023Mar1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>Zbig <zbigni...@gmail.com> writes:
>>>While UTF-16 does take up less space than UTF-8 for some Asian languages
>>
>>Often claimed, but often not true. E.g., consider the web page
>>
>>https://ctee.com.tw/news/tech/823656.html
>>
>>This is encoded in UTF-8. Let's see how big it would be in UTF-16:
>>
>>wget https://ctee.com.tw/news/tech/823656.html
>>recode utf8..utf16 <823656.html >823656-utf16.html
>>ls -l 823656*
>>
>>This shows:
>>
>>-rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
>>-rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html
>>
>>So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
>>than UTF-8.
>
>Viewing the ridiculous waste of website bandwidth for pictures,
>I think size is hardly relevant.

So even if there happened to be a case where UTF-16 was smaller, it
would be hardly relevant according to your argument.

>While at the moment English is the "lingua franca" of the Internet
>and science, Chinese will become more important.

That may or may not be the case, but does not make UTF-16 more
relevant than it is now, because pictures will still take more space
and because there will be just as much ASCII in Chinese text as now
(as demonstrated in the HTML page above).

Lest one think that I cherry-picked a web page to demonstrate my
point, here are the numbers for Daniel Lemire's unicode_lipsum
<https://github.com/lemire/unicode_lipsum>:

  utf8   utf16   utf32  16/8
 81685   91530  183056 1.120 Arabic-Lipsum.$u.txt
 69840   46922   93840 0.671 Chinese-Lipsum.$u.txt
 65542   65542   65544 1.000 Emoji-Lipsum.$u.txt
 66495   74612  149220 1.122 Hebrew-Lipsum.$u.txt
 87997   65532  131060 0.744 Hindi-Lipsum.$u.txt
 67808   46750   93496 0.689 Japanese-Lipsum.$u.txt
 66600   54290  108576 0.815 Korean-Lipsum.$u.txt
 86940  173882  347760 2.000 Latin-Lipsum.$u.txt
104770  115962  231920 1.106 Russian-Lipsum.$u.txt

The first three columns are in bytes, the fourth gives the size
disadvantage factor of UTF-16 compared to UTF-8 (<1 means UTF-16 has
an advantage).
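
For anyone who wants to reproduce this kind of comparison without recode, here is a minimal Python sketch. The function name encoded_sizes and the sample strings are made up for illustration; they are not taken from unicode_lipsum. It uses the explicit "-le" codecs, which do not write a byte order mark, so the UTF-16 counts will differ by two bytes from files that start with a BOM.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts of text in UTF-8, UTF-16 and UTF-32 (no BOM), plus the 16/8 ratio."""
    utf8 = len(text.encode("utf-8"))
    # The "-le" codecs fix the byte order and therefore do not prepend a BOM.
    utf16 = len(text.encode("utf-16-le"))
    utf32 = len(text.encode("utf-32-le"))
    return {"utf8": utf8, "utf16": utf16, "utf32": utf32, "16/8": utf16 / utf8}


# Hypothetical sample strings, chosen only to show the three typical cases.
samples = {
    "ASCII": "lorem ipsum",
    "CJK": "火星是太陽系的行星",
    "Cyrillic with spaces": "Марс это планета",
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

A ratio below 1 in the last field means UTF-16 is smaller, as in the fourth column of the table.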
Lemire also gives Wikipedia entries on Mars (the utf* numbers are from
conversion to "text" (looks like Markdown to me)):

   html   utf8  utf16   utf32
 954430 533857 849690 1699376 Mar 20 13:33 arabic
 382079 181321 274418  548832 Mar 20 13:33 chinese
 368442 152721 287666  575328 Mar 20 13:33 czech
1005060 390368 775020 1550036 Mar 20 13:33 english
 192461  86963 168252  336500 Mar 20 13:33 esperanto
1032638 446908 869736 1739468 Mar 20 13:33 french
 397376 205779 402432  804860 Mar 20 13:33 german
 326722 181348 286000  571996 Mar 20 13:33 greek
 327412 190114 292704  585404 Mar 20 13:33 hebrew
 712465 396593 547918 1095832 Mar 20 13:33 hindi
 304786 164355 237784  475564 Mar 20 13:33 japanese
 193001  97859 145838  291672 Mar 20 13:33 korean
 293677 156209 249390  498776 Mar 20 13:33 persan
 692409 280660 547232 1094456 Mar 20 13:33 portuguese
 713817 407095 624076 1248148 Mar 20 13:33 russian
1088085 593589 809518 1619032 Mar 20 13:33 thai
 387007 195078 370886  741768 Mar 20 13:33 turkish
 674255 319029 564840 1129676 Mar 20 13:33 vietnamese

For the latter files the UTF-16 variants are always bigger than the
UTF-8 variants. Looking at chinese.utf8.txt, there is a lot of ASCII
there in the links/URLs (where non-ASCII is encoded in ASCII,
e.g. "/wiki/Wikipedia:%E6%B6%88%E6%AD%A7%E4%B9%89"), and also a bit in
the form of Markdown markup (e.g., []() in links, or **...**). There
are also numbers and percentages shown in ASCII; temperatures use
a combined "degrees-C" sign rather than the degree sign followed by
"C". There are also references to sources that are predominantly
ASCII.
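
To make the point about links concrete, here is a small Python sketch using the standard urllib.parse functions on the escaped part of the path quoted above. The variable names are mine. It compares the size of the three ideograms themselves with the size of their percent-encoded form in both encodings.

```python
from urllib.parse import quote, unquote

# The percent-escaped part of the link quoted above.
escaped_in_link = "%E6%B6%88%E6%AD%A7%E4%B9%89"

title = unquote(escaped_in_link)  # decodes the UTF-8 escapes to 3 ideograms
escaped = quote(title)            # encodes them again as pure ASCII

for label, s in (("raw", title), ("percent-encoded", escaped)):
    print(label, len(s.encode("utf-8")), len(s.encode("utf-16-le")))
```

The raw ideograms take 9 bytes in UTF-8 and 6 in UTF-16, but once they are percent-encoded the string is 27 ASCII characters, so 27 bytes in UTF-8 and 54 in UTF-16.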
The Lipsum Chinese text, OTOH, contains just ideograms and newlines,
not even a blank or an ASCII digit in sight (and using ASCII digits,
judging by the Taiwanese web page linked to above, seems to have
become customary at least in Taiwan). The Russian text also contains
no digits, but it does contain spaces and punctuation marks in ASCII.
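
The effect of mixing ASCII into such text can be estimated with a little arithmetic. The sketch below is my own simplification, not a measurement: it assumes every non-ASCII character is in the BMP (2 bytes in UTF-16), takes 3 bytes in UTF-8 for ideograms or 2 bytes for Cyrillic letters, and ignores the BOM. The function name utf16_over_utf8 is hypothetical.

```python
def utf16_over_utf8(ascii_chars: int, other_chars: int, utf8_bytes_per_other: int) -> float:
    """UTF-16/UTF-8 size ratio for BMP-only text without a BOM.

    ascii_chars take 1 byte in UTF-8; other_chars take utf8_bytes_per_other
    bytes in UTF-8. Every character takes 2 bytes in UTF-16.
    """
    utf8 = ascii_chars + utf8_bytes_per_other * other_chars
    utf16 = 2 * (ascii_chars + other_chars)
    return utf16 / utf8


print(utf16_over_utf8(0, 100, 3))    # ideograms only: 2/3
print(utf16_over_utf8(100, 100, 3))  # as much ASCII as ideograms: break-even
print(utf16_over_utf8(20, 100, 2))   # Cyrillic plus ASCII spaces: above 1
```

Under these assumptions UTF-16 wins for ideographic text only while the ASCII characters are outnumbered by the ideograms, and for 2-byte scripts such as Cyrillic any ASCII at all makes UTF-16 the bigger one.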
So the Lipsum texts seem to be the best case for UTF-16. And indeed,
there the Chinese, Hindi, Japanese, and Korean UTF-16 texts are
smaller than their UTF-8 counterparts, but for Arabic, Hebrew, and
Russian UTF-8 is smaller. And of course UTF-8 is smaller for the
pseudo-Latin Lorem Ipsum, where the UTF-16 version is more than twice
as big as the UTF-8 version (More? How so? My guess is that the BOM
causes two extra bytes for UTF-16).
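
The BOM guess is easy to check in Python, where the generic "utf-16" codec writes a byte order mark and the explicit "utf-16-le" codec does not. Whether the tool that produced the Lipsum files behaves the same way is an assumption on my part, but the arithmetic for the Latin row fits: 2 * 86940 + 2 = 173882.

```python
import codecs

text = "Lorem ipsum"  # hypothetical pure-ASCII sample

with_bom = text.encode("utf-16")        # generic codec: BOM, then native byte order
without_bom = text.encode("utf-16-le")  # explicit byte order: no BOM

print(len(text.encode("utf-8")), len(without_bom), len(with_bom))

# The first two bytes of the generic encoding are one of the two BOM forms.
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Same arithmetic for the Latin-Lipsum row: twice the UTF-8 size plus 2 bytes.
print(2 * 86940 + 2 == 173882)
```

For pure ASCII the UTF-16 size is exactly twice the UTF-8 size plus the two BOM bytes, which is why the factor comes out marginally above 2.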