
OT: Why use UTF-16 for simple text?


Lorem Ipsum

Mar 18, 2023, 2:09:59 PM
I'm posting this here, because this group seems to have fairly intelligent members who have more than basic knowledge on things computerish.

I'm working with LTspice and started using command scripts to generate measurement output files. But they display in my editor as all the characters being separated by nulls. On asking in the LTspice group about this, it seems that while most of the textual output files spit out something viewable in a text editor, (namely UTF-8), this one file is generated in UTF-16!
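
A minimal sketch of why such a file looks like "characters separated by nulls" in a byte-oriented editor, assuming the output is little-endian UTF-16. The sample string and the helper name `show_as_ascii_editor` are made up for illustration; this is not actual LTspice output.

```python
# Hypothetical sample text; not real LTspice output.
sample = "gain: 12dB"

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")  # little-endian, no byte-order mark


def show_as_ascii_editor(data: bytes) -> str:
    """Render bytes as a naive single-byte viewer might, with NUL shown as '_'."""
    return "".join("_" if b == 0 else chr(b) for b in data)


print(show_as_ascii_editor(utf8_bytes))   # reads normally
print(show_as_ascii_editor(utf16_bytes))  # every ASCII character is followed by a NUL
```

For pure ASCII content the UTF-16 file is also twice as long, since each character occupies two bytes.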

The whole point of using the command script is to facilitate getting the results expediently, but now I have to convert the durn files before I can usefully view them.

There's no facility to write anything into this file other than simple text. Given the other file formats this program generates are either UTF-8 or Western (ISO-8859-1), can anyone think of a reason why they would spit out UTF-16 for this one file format???

LTspice is free, but it's not cheap. Every time I use it, I run into problems like this that waste my time trying to work around them. It's like the user interface was designed by asylum inmates, *for* asylum inmates.

I know there's no real fix for this. I'm not looking for ways to convert the file and I can't change LTspice. I'm mostly just venting my frustration for the last week of dealing with the poor documentation and the religious fanaticism of the support group.

--

Rick C.

- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209

minforth

Mar 18, 2023, 2:30:54 PM
FWIW the free Notepad++ text editor has a menu item Encoding for such conversions.

Lorem Ipsum

Mar 18, 2023, 2:35:16 PM
Yes, I can convert the file many ways. But that is a silly step. Here's a file you can't use, but you can use this other program to convert it to a format that works.

This is not the only issue with LTspice. Most of it has to do with the terse documentation. The guy who designed the program is a bit of a genius, really. But he knows F**K ALL about UIs. I understand it's managed by a committee now. Oh, the horror!

--

Rick C.


dxforth

Mar 18, 2023, 9:43:59 PM
On 19/03/2023 5:09 am, Lorem Ipsum wrote:
> ...
> I know there's no real fix for this. I'm not looking for ways to convert the file and I can't change LTspice. I'm mostly just venting my frustration for the last week of dealing with the poor documentation and the religious fanaticism of the support group.

You've come to the right place to vent. We always knew you were one of us.
I liked the appeal at the beginning of your post. It was different from your usual :)

Zbig

Mar 19, 2023, 12:17:19 PM
> The whole point of using the command script is to facilitate getting the results expediently, I now have to convert the durn files before I can usefully view them.

Googling around I've found this thread:
https://www.quora.com/When-should-UTF-16-encoding-be-preferred-over-UTF-8

To me the conclusion is:
„UTF-16 should only be used for interoperability with existing APIs that are incompatible with UTF-8. Absent such requirements, UTF-8 should be preferred to UTF-16. UTF-8 has a few clear advantages over UTF-16, such as:

* compatibility with ASCII
* self-synchronizing property
* endianness-independence

On the other hand, UTF-16 has zero clear advantages over UTF-8. While UTF-16 does take up less space than UTF-8 for some Asian languages, you can always just compress the UTF-8 encoding. The case for using UTF-8 everywhere is so compelling that this should be considered a minor inconvenience.”
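
The three listed properties can be checked directly. The following is a small sketch, not a reference implementation; the helper name `start_of_code_point` is made up for illustration.

```python
def start_of_code_point(data: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the UTF-8 sequence containing it."""
    # Continuation bytes always have the bit pattern 0b10xxxxxx.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


text = "a\u20acb"               # 'a', EURO SIGN (three bytes in UTF-8), 'b'
data = text.encode("utf-8")

# Self-synchronization: landing in the middle of the euro sign (offsets 2 or 3)
# resyncs to offset 1, where its lead byte sits.
print([start_of_code_point(data, i) for i in range(len(data))])

# ASCII compatibility: pure ASCII text encodes to the very same bytes.
print("plain".encode("ascii") == "plain".encode("utf-8"))

# Endianness: UTF-16 has two byte orders, UTF-8 has only one.
print("a".encode("utf-16-le"), "a".encode("utf-16-be"))
```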

Lorem Ipsum

Mar 19, 2023, 2:09:05 PM
Meanwhile, in the LTspice group, I'm being labeled a troll for talking about this.

I get that various groups have a common interest and may not be very interested in hearing about issues with a tool. But the LTspice group seems to really come down on people for even mentioning that problems exist.

People don't have that shortcoming here. They mostly just come down on people for not much at all. lol

But thanks for the reference. I may share that with the LTspice developers.

--

Rick C.


Anton Ertl

Mar 19, 2023, 2:15:00 PM
Zbig <zbigni...@gmail.com> writes:
>While UTF-16 does take up less space than UTF-8 for some Asian languages

Often claimed, but often not true. E.g., consider the web page

https://ctee.com.tw/news/tech/823656.html

This is encoded in UTF-8. Let's see how big it would be in UTF-16:

wget https://ctee.com.tw/news/tech/823656.html
recode utf8..utf16 <823656.html >823656-utf16.html
ls -l 823656*

This shows:

-rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
-rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html

So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
than UTF-8.
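
The effect can be reproduced without downloading anything. This is a rough sketch with made-up fragments (not taken from the page above): bare CJK text is smaller in UTF-16, but the same text wrapped in ASCII markup is not.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Made-up fragments for illustration only.
pure_cjk = "火星是太陽系的行星"
html_cjk = '<p class="intro"><a href="/wiki/Mars">火星</a>是太陽系的行星</p>'

for label, text in [("pure CJK", pure_cjk), ("CJK in HTML", html_cjk)]:
    u8, u16 = sizes(text)
    print(f"{label}: utf8={u8} utf16={u16} ratio={u16 / u8:.2f}")
```

Each of these CJK characters costs 3 bytes in UTF-8 and 2 in UTF-16, while each ASCII character costs 1 and 2, so the balance tips as soon as markup outnumbers the ideographs by a modest margin.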

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2022: https://euro.theforth.net

S Jack

Mar 19, 2023, 2:20:00 PM
On Saturday, March 18, 2023 at 1:35:16 PM UTC-5, Lorem Ipsum wrote:
> Yes, I can convert the file many ways. But that is a silly step. Here's a file you can't use, but you can use this other program to convert it to a format that works for.

UTF was a new toy, just as color was decades ago, when one could go
to an office and see women who changed their display background
to magenta and print many color memos so that 100 dollar ink jets
replaced 2 dollar ribbons.

Like the 5 year old after getting into her mother's makeup stands
in front of a mirror and proudly gazes at her new visage, heavily
powdered face and rouged cheeks with smeared crimson lips and eyes
darkened almost black.
--
me

Lorem Ipsum

Mar 19, 2023, 5:54:59 PM
Someone please explain that to the crowd at the groups.io LTspice group.

--

Rick C.


dxforth

Mar 19, 2023, 8:16:57 PM
On 20/03/2023 5:09 am, Lorem Ipsum wrote:
> On Sunday, March 19, 2023 at 12:17:19 PM UTC-4, Zbig wrote:
>>> The whole point of using the command script is to facilitate getting the results expediently, I now have to convert the durn files before I can usefully view them.
>> Googling around I've found this thread:
>> https://www.quora.com/When-should-UTF-16-encoding-be-preferred-over-UTF-8
>>
>> To me the conclusion is:
>> „UTF-16 should only be used for interoperability with existing APIs that are incompatible with UTF-8. Absent such requirements, UTF-8 should be preferred to UTF-16. UTF-8 has a few clear advantages over UTF-16, such as:
>>
>> * compatibility with ASCII
>> * self-synchronizing property
>> * endianness-independence
>>
>> On the other hand, UTF-16 has zero clear advantages over UTF-8. While UTF-16 does take up less space than UTF-8 for some Asian languages, you can always just compress the UTF-8 encoding. The case for using UTF-8 everywhere is so compelling that this should be considered a minor inconvenience.”
>
> Meanwhile, in the LTspice group, I'm being labeled a troll for talking about this.
>
> I get that various groups have a common interest and may not be very interested in hearing about issues with a tool. But the LTspice group seems to really come down on people for even mentioning that problems exist.
>
> People don't have that shortcoming here. They mostly just come down on people for not much at all. lol

Forth tells each programmer he can be a genius. This can result in over-achievers.
Every system has a way of self-regulation. Other languages tell users they exist to
be seen, not heard. Forth has tried that with mixed results.

Ron AARON

Mar 20, 2023, 12:40:43 AM
Not entirely. UTF-16 has an advantage that seeking to a specific
character offset is O(1) whereas it's O(n) for UTF-8. Likewise seeking
backwards through a string is easier for UTF-16.

That said, 8th uses UTF-8 because it takes up less space in general
(especially when most code and text is not multibyte), and modern
systems understand it perfectly well. Only Windows (among the popular
OSes) insists on UTF-16, and at least has conversion routines for it.

Anton Ertl

Mar 20, 2023, 2:35:34 AM
Ron AARON <c...@8th-dev.com> writes:
>UTF-16 has an advantage that seeking to a specific
>character offset is O(1) whereas it's O(n) for UTF-8.

Wrong. Even seeking to a specific code point offset is O(n) for
UTF-16. Even UTF-32 does not give us O(1) character seeking, because
a character can be composed of several code points; UTF-32 does give
us O(1) code-point seeking, but why would one want that?
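
To illustrate the point, here is a sketch in Python; the helper names `utf16_units` and `unit_offset_of_code_point` are made up for this example. A code point outside the BMP takes two 16-bit units in UTF-16, so finding the n-th code point means scanning from the start.

```python
import struct


def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text (no byte-order mark)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def unit_offset_of_code_point(units: list[int], index: int) -> int:
    """Find the code-unit offset of the index-th code point by scanning: O(n)."""
    offset = 0
    for _ in range(index):
        # A high surrogate (0xD800..0xDBFF) starts a two-unit pair.
        offset += 2 if 0xD800 <= units[offset] <= 0xDBFF else 1
    return offset


text = "a\U0001F4A9b"          # 'a', U+1F4A9 (outside the BMP), 'b'
units = utf16_units(text)
print(len(text), len(units))   # 3 code points, 4 code units
print(unit_offset_of_code_point(units, 2))  # 'b' sits at unit offset 3, not 2

# Even with UTF-32, one user-visible character may span several code points:
# 'e' followed by U+0301 COMBINING ACUTE ACCENT displays as a single character.
print(len("e\u0301"))
```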

>Likewise seeking
>backwards through a string is easier for UTF-16.

In what way?

Ron AARON

Mar 20, 2023, 3:36:05 AM


On 20/03/2023 8:28, Anton Ertl wrote:
> Ron AARON <c...@8th-dev.com> writes:
>> UTF-16 has an advantage that seeking to a specific
>> character offset is O(1) whereas it's O(n) for UTF-8.
>
> Wrong. Even seeking to a specific code point offset is O(n) for
> UTF-16. Even UTF-32 does not give us O(1) character seeking, because
> a character can be composed of several code points; UTF-32 does give
> us O(1) code-point seeking, but why would one want that?

Ah, you are correct; I was thinking of UCS-2.

As for seeking in O(1) it's useful if you're splitting strings on X
characters. Admittedly less frequently useful for most people.

none albert

Mar 20, 2023, 7:53:10 AM
In article <2023Mar1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>Zbig <zbigni...@gmail.com> writes:
>>While UTF-16 does take up less space than UTF-8 for some Asian languages
>
>Often claimed, but often not true. E.g., consider the web page
>
>https://ctee.com.tw/news/tech/823656.html
>
>This is encoded in UTF-8. Let's see how big it would be in UTF-16:
>
>wget https://ctee.com.tw/news/tech/823656.html
>recode utf8..utf16 <823656.html >823656-utf16.html
>ls -l 823656*
>
>This shows:
>
>-rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
>-rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html
>
>So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
>than UTF-8.

Viewing the ridiculous waste of website bandwidth for pictures,
I think size is hardly relevant.

Working with D1 I come across source files with comments in Chinese. I
can decipher them with my Youdao pen (or Google) and I prefer this
situation over no comment.
While at the moment English is the "lingua franca" of the Internet
and science, Chinese will become more important.

>
>- anton

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat spinning. - the Wise from Antrim -

dxforth

Mar 20, 2023, 8:11:08 AM
On 20/03/2023 10:53 pm, albert wrote:
> In article <2023Mar1...@mips.complang.tuwien.ac.at>,
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Zbig <zbigni...@gmail.com> writes:
>>> While UTF-16 does take up less space than UTF-8 for some Asian languages
>>
>> Often claimed, but often not true. E.g., consider the web page
>>
>> https://ctee.com.tw/news/tech/823656.html
>>
>> This is encoded in UTF-8. Let's see how big it would be in UTF-16:
>>
>> wget https://ctee.com.tw/news/tech/823656.html
>> recode utf8..utf16 <823656.html >823656-utf16.html
>> ls -l 823656*
>>
>> This shows:
>>
>> -rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
>> -rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html
>>
>> So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
>> than UTF-8.
>
> Viewing the ridiculous waste of website bandwidth for pictures,
> I think size is hardly relevant.
>
> Working with D1 I come accross source files with comment in Chinese. I
> can decipher it with my youdoa pen (or google) and I prefer this
> situation over no comment.
> While at the moment English is the "lingua franca" of the Internet
> and science, Chinese will become more important.

Back to ideograms?

What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.


Anton Ertl

Mar 20, 2023, 9:41:52 AM
albert@cherry.(none) (albert) writes:
>In article <2023Mar1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>Zbig <zbigni...@gmail.com> writes:
>>>While UTF-16 does take up less space than UTF-8 for some Asian languages
>>
>>Often claimed, but often not true. E.g., consider the web page
>>
>>https://ctee.com.tw/news/tech/823656.html
>>
>>This is encoded in UTF-8. Let's see how big it would be in UTF-16:
>>
>>wget https://ctee.com.tw/news/tech/823656.html
>>recode utf8..utf16 <823656.html >823656-utf16.html
>>ls -l 823656*
>>
>>This shows:
>>
>>-rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
>>-rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html
>>
>>So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
>>than UTF-8.
>
>Viewing the ridiculous waste of website bandwidth for pictures,
>I think size is hardly relevant.

So even if there happened to be a case where UTF-16 was smaller, it
would be hardly relevant according to your argument.

>While at the moment English is the "lingua franca" of the Internet
>and science, Chinese will become more important.

That may or may not be the case, but does not make UTF-16 more
relevant than it is now, because pictures will still take more space
and because there will be just as much ASCII in Chinese text as now
(as demonstrated in the HTML page above).

Lest one think that I cherry-picked a web page to demonstrate my
point, here's the numbers for Daniel Lemire's unicode_lipsum
<https://github.com/lemire/unicode_lipsum>:

  utf8   utf16   utf32   16/8  file
 81685   91530  183056  1.120  Arabic-Lipsum.$u.txt
 69840   46922   93840  0.671  Chinese-Lipsum.$u.txt
 65542   65542   65544  1.000  Emoji-Lipsum.$u.txt
 66495   74612  149220  1.122  Hebrew-Lipsum.$u.txt
 87997   65532  131060  0.744  Hindi-Lipsum.$u.txt
 67808   46750   93496  0.689  Japanese-Lipsum.$u.txt
 66600   54290  108576  0.815  Korean-Lipsum.$u.txt
 86940  173882  347760  2.000  Latin-Lipsum.$u.txt
104770  115962  231920  1.106  Russian-Lipsum.$u.txt

The first three columns are in bytes, the fourth gives the size
disadvantage factor of UTF-16 compared to UTF-8 (<1 means UTF-16 has
an advantage).

Lemire also gives Wikipedia entries on Mars (the utf* numbers are from
conversion to "text" (looks like Markdown to me)):

   html    utf8   utf16    utf32  date          file
 954430  533857  849690  1699376  Mar 20 13:33  arabic
 382079  181321  274418   548832  Mar 20 13:33  chinese
 368442  152721  287666   575328  Mar 20 13:33  czech
1005060  390368  775020  1550036  Mar 20 13:33  english
 192461   86963  168252   336500  Mar 20 13:33  esperanto
1032638  446908  869736  1739468  Mar 20 13:33  french
 397376  205779  402432   804860  Mar 20 13:33  german
 326722  181348  286000   571996  Mar 20 13:33  greek
 327412  190114  292704   585404  Mar 20 13:33  hebrew
 712465  396593  547918  1095832  Mar 20 13:33  hindi
 304786  164355  237784   475564  Mar 20 13:33  japanese
 193001   97859  145838   291672  Mar 20 13:33  korean
 293677  156209  249390   498776  Mar 20 13:33  persan
 692409  280660  547232  1094456  Mar 20 13:33  portuguese
 713817  407095  624076  1248148  Mar 20 13:33  russian
1088085  593589  809518  1619032  Mar 20 13:33  thai
 387007  195078  370886   741768  Mar 20 13:33  turkish
 674255  319029  564840  1129676  Mar 20 13:33  vietnamese

For the latter files the UTF-16 variants are always bigger than the
UTF-8 variants. Looking at chinese.utf8.txt, there is a lot of ASCII
there in the links/URLs (where non-ASCII is encoded in ASCII,
e.g. "/wiki/Wikipedia:%E6%B6%88%E6%AD%A7%E4%B9%89"), and also a bit in
the form of Markdown markup (e.g., []() in links, or **...**). There
are also numbers and percentages shown in ASCII; temperatures use
a combined "degrees-C" sign rather than the degree sign followed by
"C". There are also references to sources that are predominantly
ASCII.

The Lipsum Chinese text, OTOH, contains just ideograms and newlines,
not even a blank or an ASCII digit in sight (which, looking at the
Taiwanese web page linked to above seems to have become customary at
least in Taiwan). The Russian text also contains no digits, but it
does contain spaces and punctuation marks in ASCII.

So the Lipsum texts seem to be the best case for UTF-16. And indeed,
there the Chinese, Hindi, Japanese, and Korean UTF-16 texts are
smaller than their UTF-8 counterparts, but for Arabic and Russian
UTF-8 is smaller. And of course the same holds for the pseudo-Latin
Lorem Ipsum, where the UTF-16 version is more than twice as big as the
UTF-8 version (More? How so? My guess is that the BOM causes two extra
bytes for UTF-16).
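
The BOM guess is easy to check. A small sketch, assuming Python's standard codecs, where "utf-16" prepends a byte-order mark and "utf-16-le" does not; the sample text is arbitrary:

```python
import codecs

text = "Lorem ipsum dolor sit amet"      # pure ASCII sample

utf8 = text.encode("utf-8")
utf16_with_bom = text.encode("utf-16")   # this codec prepends a BOM
utf16_no_bom = text.encode("utf-16-le")  # explicit byte order, no BOM

print(len(utf8), len(utf16_no_bom), len(utf16_with_bom))
print(utf16_with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# The Latin row of the Lipsum table fits the same pattern.
print(2 * 86940 + 2 == 173882)
```

So for ASCII-only text the UTF-16 size is exactly twice the UTF-8 size plus two bytes, which matches the Latin numbers above.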

Lorem Ipsum

Mar 20, 2023, 10:03:13 AM
On Monday, March 20, 2023 at 7:53:10 AM UTC-4, none albert wrote:
> In article <2023Mar1...@mips.complang.tuwien.ac.at>,
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> >Zbig <zbigni...@gmail.com> writes:
> >>While UTF-16 does take up less space than UTF-8 for some Asian languages
> >
> >Often claimed, but often not true. E.g., consider the web page
> >
> >https://ctee.com.tw/news/tech/823656.html
> >
> >This is encoded in UTF-8. Let's see how big it would be in UTF-16:
> >
> >wget https://ctee.com.tw/news/tech/823656.html
> >recode utf8..utf16 <823656.html >823656-utf16.html
> >ls -l 823656*
> >
> >This shows:
> >
> >-rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
> >-rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html
> >
> >So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
> >than UTF-8.
> Viewing the ridiculous waste of website bandwidth for pictures,
> I think size is hardly relevant.

It's not always about the Internet. My problem is compatibility. I use ASCII tools, such as my text editor. I don't know of a single reason why a program would output files in UTF-8, UTF-16 and Western (ISO-8859-1, an 8-bit extension of ASCII), with the encoding chosen not by the user but by which file is being output. Of 8 different file types, the one that is generated as a textual record of measurements taken, i.e. very likely to be read by another program, is in UTF-16, which is not byte-compatible with ASCII.


> Working with D1 I come accross source files with comment in Chinese. I
> can decipher it with my youdoa pen (or google) and I prefer this
> situation over no comment.
> While at the moment English is the "lingua franca" of the Internet
> and science, Chinese will become more important.

Does UTF-16 work with Chinese better than UTF-8? As others have said, UTF-8 is a superset of 7-bit ASCII and interoperable with it.

--

Rick C.


S Jack

Mar 20, 2023, 12:02:44 PM
On Monday, March 20, 2023 at 7:11:08 AM UTC-5, dxforth wrote:
> Back to ideograms?

Pure ideograms are very nice in an environment comprised of many
dialects. Knowing the ideograms, one sounds in his mind his
own dialect without translation to disturb the mind's harmony.

But Chinese got corrupted long ago when some progressive
"improved" it by adding phoneme elements to characters and
the LC (language committee) got carried away and produced thousands
of characters providing job security for the scribes.

Romaji works in Japan so would assume it should work in China
and elsewhere.
--
me

Paul Rubin

Mar 20, 2023, 4:13:11 PM
dxforth <dxf...@gmail.com> writes:
> What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.

Touché.

dxforth

Mar 20, 2023, 7:29:47 PM
Humans do have a habit of romanticising language and culture - especially
if they view it as being on the ascendency or as possessing something they
don't. The expression of the human condition in any language is fine by me
but let's keep it simple :)


dxforth

Mar 20, 2023, 7:42:00 PM
Precisely :)


S Jack

Mar 21, 2023, 10:32:45 AM
Contrary to what many think, major languages were designed, not
something that evolved naturally. What was natural was their erosion
to better suit the general speaking public, the prime example being
English. For a couple hundred years it was spoken by the uneducated;
the educated spoke and wrote in French, the language used by the
court. As a result, inflections were replaced by prepositions, the
former more efficient for the court scribes that had to write many
official documents and the latter with fewer rules to learn and
more facilitating to the ear of the general speaking public many of
whom didn't know how to write. The same natural trend as seen in
vulgate Latin, vulgate ancient Egyptian and vulgate anything (Koo in
the extreme).

One could think that the designers lacked skill but anyone involved
in standard Forth would be very sympathetic to those attempting such
task. Don't think it's prudent to take on such an endeavor for real;
imagine the push back from all the world. But long ago for fun made
an attempt. The idea was to come up with a simple workable grammar
and use different sets of vocabularies. One set with words rich in
labials giving the pleasant lilt of an African language, another set
using gutturals for harsh sounds of Klingon. The grammar would be
the same and words of different vocabulary sets would have one to
one correspondence but from their sound and looks it wouldn't be
obvious. Hollywood actors would have a simple grammar and some basic
words to learn which sounds could easily be modified to fit the
scenes they were performing.

Esperanto, Ido and such had the daunting task of having to create
very large vocabularies to be accepted. But for the above that task
could be circumvented by picking Latin roots that correspond to
basic English and adapting them. (Basic English used for selecting a
complete working set of words. The words wouldn't appear as English
nor Latin in the vocabularies.)
--
me

Doug Hoffman

Mar 22, 2023, 5:20:03 AM
Users in various countries may differ. For example, the euro glyph is common ( € ) .

OT, but:
This epitaph was taken from a real-life tombstone found in Tombstone, Arizona.
The headstone reads, "Here lies Lester Moore, Four slugs from a .44, No Les No more."

-Doug

dxforth

Mar 22, 2023, 6:06:58 AM
On 22/03/2023 8:20 pm, Doug Hoffman wrote:
> On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
>> On 21/03/2023 7:13 am, Paul Rubin wrote:
>>> dxforth <dxf...@gmail.com> writes:
>>>> What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.
>>>
>>> Touché.
>> Precisely :)
>
> Users in various countries may differ. For example, the euro glyph is common ( € ) .

Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
and 128 slots for whatever else one believes is important.

S Jack

Mar 22, 2023, 8:57:10 AM
On Wednesday, March 22, 2023 at 5:06:58 AM UTC-5, dxforth wrote:
> Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
> and 128 slots for whatever else one believes is important.

That's what I do.
I'm on a UTF hterm, not my choice, and I use code page to map
128 characters to the upper register.
--
me

Brian Fox

Mar 22, 2023, 2:05:28 PM
On Monday, March 20, 2023 at 7:53:10 AM UTC-4, none albert wrote:

> While at the moment English is the "lingua franca" of the Internet
> and science, Chinese will become more important.

<sidebar>
As famous American baseball player, Yogi Berra, was reputed to say:
"It's hard to tell what's gonna happen, especially when it's in the future"

I won't be around to see it but China is on a demographic precipice.
Some are saying by early in the 3rd quarter of this century the population
will be half of current number.

Who knows what that does? It might put Hindi in ascent.
</sidebar>

Doug Hoffman

Mar 22, 2023, 2:25:32 PM
I think I'll use shift-option-2 for € (whatever of the remaining 128 slots that is). Wonder
what others will use?

-Doug

Lorem Ipsum

Mar 22, 2023, 2:59:17 PM
On Wednesday, March 22, 2023 at 6:06:58 AM UTC-4, dxforth wrote:
UTF-8 is code compatible with ASCII, while supporting as many characters as you would like. If you use ASCII, it is also UTF-8 encoded, automagically. I'm pretty sure UTF-8 includes the euro glyph without machinations.

--

Rick C.


Lorem Ipsum

Mar 22, 2023, 3:04:46 PM
I don't think the selection of the language for business is done by a popular vote of the world's population. Until India changes course and finds another dominant export, other than telemarketing phone calls purporting to be selling medical insurance or credit card services, they will remain other than a first world country.

I have to give them credit for ingenuity though. Who would have thought you could turn fraud into an export?

--

Rick C.


Marcel Hendrix

Mar 22, 2023, 3:15:41 PM
On Wednesday, March 22, 2023 at 7:25:32 PM UTC+1, Doug Hoffman wrote:
[..]
> I think I'll use shift-option-2 for € (whatever of the remaining 128 slots that is). Wonder
> what others will use?

What about "Euro" ? It definitely looks better in a sentence, column heading or caption.

-marcel

Paul Rubin

Mar 22, 2023, 3:52:35 PM
Lorem Ipsum <gnuarm.del...@gmail.com> writes:
> I'm pretty sure UTF-8 includes the euro glyph without machinations.

The codepoint is U+20AC so the utf-8 encoding is 3 bytes long. In
Windows-1252 it has a single byte encoding (0x80). It doesn't seem to
exist in ISO-8859-1. In ISO-8859-15 it is 0xa4. Especially in the
Forth milieu on limited systems, I can understand the attraction of
having a single byte encoding for every character, even if that limits
the character set. I think Unicode was originally intended to be a 16
bit character set corresponding to the Unicode BMP (basic multilingual
plane), but the BMP ran out of characters and now we have a contorted
mess with slightly over 20 bits but plenty of literally crap characters
(viz. U+1F4A9, the poop emoji).
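
The encodings mentioned here can be checked with Python's standard codecs. A quick sketch; the helper name `encodable` is made up for illustration:

```python
euro = "\u20ac"  # EURO SIGN

# Hex dump of the euro sign in each encoding.
for codec in ("utf-8", "cp1252", "iso8859-15"):
    print(codec, euro.encode(codec).hex())


def encodable(ch: str, codec: str) -> bool:
    """Report whether ch has an encoding in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# ISO-8859-1 has no slot for the euro sign.
print(encodable(euro, "iso8859-1"))
```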

Zbig

Mar 22, 2023, 5:28:42 PM
> I think I'll use shift-option-2 for € (whatever of the remaining 128 slots that is). Wonder
> what others will use?

In Linux Alt-5 is used.

dxforth

Mar 22, 2023, 9:05:55 PM
On 23/03/2023 5:59 am, Lorem Ipsum wrote:
> On Wednesday, March 22, 2023 at 6:06:58 AM UTC-4, dxforth wrote:
>> On 22/03/2023 8:20 pm, Doug Hoffman wrote:
>>> On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
>>>> On 21/03/2023 7:13 am, Paul Rubin wrote:
>>>>> dxforth <dxf...@gmail.com> writes:
>>>>>> What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.
>>>>>
>>>>> Touché.
>>>> Precisely :)
>>>
>>> Users in various countries may differ. For example, the euro glyph is common ( € ) .
>> Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
>> and 128 slots for whatever else one believes is important.
>
> UTF-8 is code compatible with ASCII, while supporting as many characters as you would like. If you use ASCII, it is also UTF-8 encoded, automagically. I'm pretty sure UTF-8 includes the euro glyph without machinations.
>

Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
for every currency there is corresponding ASCII abbreviation e.g. AUD

Ron AARON

Mar 23, 2023, 1:03:44 AM
And they'll keep adding garbage characters because we've got all that
space now. And then the space aliens will show up and we'll need
characters for their language, but we won't have any space left, so
they'll zap us.

Lorem Ipsum

Mar 23, 2023, 2:50:25 AM
Didn't you read the post that started this thread???

--

Rick C.


Anton Ertl

Mar 23, 2023, 3:00:17 AM
dxforth <dxf...@gmail.com> writes:
>Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
>for every currency there is corresponding ASCII abbreviation e.g. AUD

For AUD there is even an ASCII character: $

However, this demonstrates the advantage of currency codes over
currency signs: Currency signs are ambiguous.

The currency code for the Euro is EUR.

Anton Ertl

Mar 23, 2023, 3:44:04 AM
Ron AARON <c...@8th-dev.com> writes:
>And they'll keep adding garbage characters because we've got all that
>space now. And then the space aliens will show up and we'll need
>characters for their language, but we won't have any space left, so
>they'll zap us.

Unicode has grown from 110,117 code points in September 2012 to
149,186 characters in September 2022. UTF-16 supports 1,112,064 code
points, while UTF-8 would straightforwardly support 2G code points,
but software should complain about code points outside the 1,112,064
ones, so a lot of UTF-8 software probably does not support more code
points. UTF-32 obviously can support 4G code points, and eliminating
the limit of 1,112,064 code points will be pretty easy for UTF-32.
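
The limits quoted above follow from a little arithmetic. This sketch uses made-up constant names; the 31-bit figure assumes the original UTF-8 design with sequences of up to 6 bytes.

```python
CODE_SPACE = 0x110000            # code points U+0000 .. U+10FFFF
SURROGATES = 0xE000 - 0xD800     # U+D800 .. U+DFFF are reserved for UTF-16 pairs

print(CODE_SPACE - SURROGATES)   # code points that UTF-16 can actually encode

# Surrogate pairs: 1024 high x 1024 low values cover the 16 planes beyond the BMP.
print(0x400 * 0x400 == 16 * 0x10000)

# The original 6-byte UTF-8 design carried up to 31 payload bits ("2G").
print(2 ** 31)

# Python enforces today's limit: chr() rejects anything above U+10FFFF.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```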

As for "garbage characters", looking at the notes in
<https://en.wikipedia.org/wiki/Unicode>, the vast majority of
additions are not for stuff like new emojis (which probably would not
have been introduced if space was tight), with 4,192 out of 4,489 code
points added in the most recent version of Unicode being CJK
ideographs, but also adding 20 emojis. Interestingly, the emojis are
making the headlines, and they are probably more widely used than the
newly added CJK ideographs or control characters for Egyptian
hieroglyphs, so who are we to say that they are garbage characters.

It's interesting that, while in Forth standardization we have strong
resistance against new optional features from implementors, Unicode
standardization seems to have little resistance to adding stuff
(probably due to its roots: they want to support all writing systems
rather than computer-established practice). And there can be quite
substantial implementation cost. As a result, implementation tends to
lag behind: at one point I wanted to run the program I had used to
produce Figure 1 of
<https://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26paysan05.pdf>,
but had trouble finding a font that supported all the scripts I had
used. But over time, more stuff seems to be supported.

dxforth

Mar 23, 2023, 5:04:34 AM
Yes - I loved it.

On 19/03/2023 5:09 am, Lorem Ipsum wrote:
> ...
> I know there's no real fix for this. I'm not looking for ways to convert the file and I can't change LTspice. I'm mostly just venting my frustration for the last week of dealing with the poor documentation and the religious fanaticism of the support group.

Anton Ertl

Mar 23, 2023, 5:16:18 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> (at one point I wanted to run the program I had used to
>produce Figure 1 of
><https://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26paysan05.pdf>,
>but had trouble finding a font that supported all the scripts I had
>used. But over time, more stuff seems to be supported.

I just tried it again. It works fine on the xterm setup I normally
use (with the face "Noto Sans Mono"), and the program also displays
fine on the emacs setup I use. In Emacs I looked at various
characters (using C-u C-x =) to see what fonts are used, and they are:

ASCII: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso8859-1
Runic: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
Thai: ftcrhb:-PfEd-Tlwg Typist-normal-normal-normal-*-15-*-*-*-*-0-iso10646-1
Hebrew: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
Cyrillic: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
Greek: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1

Doug Hoffman

Mar 23, 2023, 7:51:56 AM
On Thursday, March 23, 2023 at 3:00:17 AM UTC-4, Anton Ertl wrote:
> dxforth <dxf...@gmail.com> writes:
> >Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
> >for every currency there is corresponding ASCII abbreviation e.g. AUD
> For AUD there is even an ASCII character: $
>
> However, this demonstrastes trhe advantage of currency codes over
> currency signs: Currency signs are ambiguous.
>
> The currency code for the Euro is EUR.
> - anton

Interesting to know that the world has bent its glyph usage to ASCII (7-bit).
I did not know that. I guess that makes things much simpler as dxforth suggests.
Makes me wonder why we bothered with the XCHAR extension in the Forth standard.

-Doug

dxforth

Mar 23, 2023, 9:27:32 AM
On 23/03/2023 5:50 pm, Anton Ertl wrote:
> dxforth <dxf...@gmail.com> writes:
>> Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
>> for every currency there is corresponding ASCII abbreviation e.g. AUD
>
> For AUD there is even an ASCII character: $

Unless it's ANS-Forth or 200x


Anton Ertl

Apr 2, 2023, 1:19:00 PM
Paul Rubin <no.e...@nospam.invalid> writes:
>Especially in the
>Forth milieu on limited systems, I can understand the attraction of
>having a single byte encoding for every character, even if that limits
>the character set.

The limited system does not have a display. It sends its output to a
computer that knows how to display UTF-8. There is no technical
reason for limiting yourself to single-byte encodings.

none albert

Apr 2, 2023, 2:45:03 PM
In article <87h6uci...@nightsong.com>,
ciforth may be the simplest simple Forth around. It has no
problems with huge character strings with whatever encoding,
provided EMIT is adapted a little bit.
ciforth follows linux that the primitive for char output is the
string. As long as the length is known in bytes the terminal can
take care of it. I can see a Chinese VT100, automatically displaying
the Chinese character for DROP.

So EMIT is defined as
: EMIT DSP@ 1 TYPE DROP ;
It could equally well be
: EMIT DSP@ 2 TYPE DROP ;
(As long as a single character is at most 8 bytes.)

I have had no complaints from Chinese users, this far.

Groetjes Albert