unicode: UTF / UCS


Johannes Köhler

Jul 27, 2021, 12:09:22 PM7/27/21
to vim...@googlegroups.com

Beloved vim'ers!

Until recently, I never came up with the idea of thinking about the text-file encoding of the files on my hard disk.

I had Unicode set in my locales and kept in mind that my files are utf-8 encoded.

BUT, after a file crash - while playing with an old ext2 filesystem and GNU tar, I had a file header without file contents in my inodes, like a capacitor without charge :) - AND, out of curiosity, I probed a bit with Vim, utf-8 files (this time on btrfs) and an up-to-date Arch Linux.

Then I realized that there are three encoding views: keyboard, display (terminal), and Vim - like decoding pipes feeding into an encoded socket. The encoded socket, the file itself, works partly inconsistently together with Vim, xterm and the Unix tool file.

Setup: I create a file in an xterm console with touch, then open it with Vim.

Vim: enc & fenc = utf-8
BUT file -i: us-ascii

The resulting file has 2 bytes per character, like us-ascii inside a Unicode container. However, I would like to have real Unicode, not some endianness trick that stores us-ascii in 2 bytes instead of 1.
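A quick way to check what is actually on disk (sketched here in Python purely for illustration; the sample character is arbitrary): an ASCII-only file written as UTF-8 contains exactly the same bytes as a US-ASCII file, and the extra byte is usually the trailing newline that Vim writes, not a wider encoding.

```python
# An ASCII-only string encodes to identical bytes in US-ASCII and UTF-8,
# so the `file` tool has no way to tell the two apart.
text = "a"
assert text.encode("ascii") == text.encode("utf-8")

# Vim normally writes a final end-of-line byte, so a one-character file
# is 2 bytes on disk: the character plus '\n' -- not 2 bytes per character.
on_disk = (text + "\n").encode("utf-8")
assert len(on_disk) == 2
```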

Then, in Vim, I change the encoding to ucs-2 with :set fenc=ucs-2; I read in the Vim documentation that ucs-2 and utf-8 are similar on Linux. Now on :write, Vim tells me [converted], and file (sometimes) reports utf-8 as expected. The file size grows to 4 bytes per character, as expected for ucs-4. Rereading in Vim then shows unreadable content, and I have to reopen the file with ++enc=ucs-2. So inside Vim, ucs-2 and utf-8 seem to be different, and on Linux ucs-2 takes as much file space as ucs-4.

Imaginary reasoning: my system-wide (or kernel-level) utf-8 differs from real Unicode utf-8 by an endianness abuse, maybe for compatibility. That is why the file tool works inconsistently (partly reporting binary data instead of a text encoding).

Is there a way to ensure I am working with true utf-8, or better utf-16, files? The aim is to keep source files in Unicode and exclude the deprecated ascii...

Sincerely
-kefko

--
Wonderful vim doku:
When a mapping triggers itself, it will run forever
WEB www.johannes-koehler.de

Gabriele F

Jul 27, 2021, 2:44:04 PM7/27/21
to 'Johannes Köhler' via vim_use
Hi, first of all you seem to have misunderstandings about what UTF-8 and
the other Unicode encodings are. If you're interested and confident with
low-level things I advise you to learn exactly what they are. The
relevant portions of the Unicode specification (unicode.org) are not
very long or exceedingly hard to understand, but maybe you can find some
more accessible description.

Most of all, UTF-8 is (normally) absolutely indistinguishable from
normal US-ASCII until you use characters that were not in US-ASCII; so
for example most English files will be bit-per-bit identical whether
written in US-ASCII or UTF-8.
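This identity is easy to verify, for example in Python (the sample strings below are arbitrary illustrations, not from the original discussion):

```python
# Pure ASCII text: UTF-8 and US-ASCII produce bit-for-bit identical bytes.
english = "plain English text"
assert english.encode("ascii") == english.encode("utf-8")

# The first non-ASCII character breaks the identity: UTF-8 switches to a
# multi-byte sequence that US-ASCII cannot express at all.
name = "K\u00f6hler"
assert name.encode("utf-8") == b"K\xc3\xb6hler"  # 'ö' becomes two bytes
```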

Then, there are many fairly complex issues in how files are read, converted and written by the various parts of the system. Vim is an especially problematic part; I made an attempt at understanding it in the message
https://www.mail-archive.com/vim...@googlegroups.com/msg57383.html and
the rest of that thread. But you probably won't make much out of it until you know how at least UTF-8 is encoded.

Finally, if you really want to be sure of having all your files encoded
in Unicode (in UTF-8 or other encodings), then I applaud you and agree
with your concern, and I suggest the way I do it (yes there actually is
a way):
https://www.mail-archive.com/vim...@googlegroups.com/msg57385.html .
The BOM mentioned there is a byte sequence that can be placed at the
beginning of text files and will be interpreted by unicode-aware
software as a sort of invisible declaration that the file is in a
certain Unicode encoding.
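For the curious, these BOMs are short, fixed byte sequences; Python's codecs module exposes them directly (a sketch for illustration only):

```python
import codecs

# The BOM for each Unicode encoding is a fixed byte prefix.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
assert codecs.BOM_UTF16_BE == b"\xfe\xff"

# The generic "utf-16" codec writes a BOM so a reader can detect the
# byte order; the explicit -le/-be codecs write none.
encoded = "hi".encode("utf-16")
assert encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert "hi".encode("utf-16-le")[:2] == b"h\x00"  # no BOM, low byte first
```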

By the way, all of this means that it's not ascii that is "deprecated", but the various complementary or alternative encodings that were (and still partly are) used to support non-English characters.

Kind regards,
Gabriele



P.S. I'm not sure I'll be able to further reply in the next days, I'm in
a complex situation

Gabriele F

Jul 27, 2021, 2:46:51 PM7/27/21
to vim...@googlegroups.com
Apologies for top-quoting in the previous message, I forgot to delete
the quote

Johannes Köhler

Jul 27, 2021, 2:50:20 PM7/27/21
to vim...@googlegroups.com
On 27.07.21 18:09, Johannes Köhler wrote:
>
> [...]
>
> Is there a way to ensure working with true utf-8
> or better utf-16 files? Aim is to work with source
> files in unicode to exclude the deprecated ascii...
>

*disorientation*

The Unix _manpage_ for utf-8 describes Unicode as a 2-byte encoding, but _Wikipedia_ also mentions a 1-byte Unicode encoding with ascii compatibility.

So my issue is _partly_ obsolete.

I realized that when I set the 'bomb' option in Vim, there is no inconsistent behavior anymore when using 2-byte Unicode ucs-2. The BOM header also tells the endianness.

Furthermore, I am interested in the filesystem behavior with 2-byte Unicode ucs-2. Is it possible, in principle, to use a Linux filesystem with a 2-byte Unicode encoding? I ask because Linux creates a 2-byte file (1-byte character & 1-byte EOF) when I create it with touch and insert one character with Vim. The bottom line is a 1-byte ascii file... or a 1-byte Unicode with ascii compatibility (that is what I meant by the appearance of endian abuse).

At present, I am teaching myself about electric circuits and their logical behavior. With that in mind, it seems it should be faster to use 2 bytes throughout instead of making a 1-byte/2-byte decision in the encoder and decoder.

- kefko




Gabriele F

Jul 27, 2021, 3:37:11 PM7/27/21
to 'Johannes Köhler' via vim_use
'Johannes Köhler' via vim_use wrote:
> *disorientation*
>
> The unix _manpage_ utf-8 describes unicode with 2-byte encoding. But
> _wikipedia_ indicates also 1-byte unicode
> with ascii compatibility.

(If I remember correctly) the first versions of Unicode had only a
2-byte encoding, so that (part of the) manpage is very old.



> Furthermore, be interested myself in the filesystem behavior
> and unicode with ucs-2. Is it possible to use a linux
> filesystem with 2-byte unicode encoding on principle.

I'm not so strong on Linux, but filesystems shouldn't have anything to do with text file encodings.



> Due to the cause that linux creates a 2-byte file
> (1-byte character & 1-byte EOF) when creating it with
> touch, and inserting one character into it with vim.

I think it's Vim that adds that final end-of-line byte (see :help 'fixendofline'), not the touch program or Linux.



> The bottom line is a 1-byte ascii file... Or a 1-byte
> unicode with ascii compatibility (that what i meant with
> endian abuse appearance).

I haven't understood this or other parts of the first message, but you're probably thinking too far ahead; these issues likely have nothing to do with endianness.



> Present, i study autodidactic with electric circuits and
> the logical behavior. With that in mind it should be
> faster to use 2-byte all over instead of a 1-byte, 2-byte
> decision with the encoder, decoder.
>

It's not that simple unfortunately, UTF-16 (let's leave aside UCS-2, it
shouldn't matter) cannot be assumed to always have two bytes per
character, and some tests indicated that UTF-8 usually ends up being
better overall (utf8everywhere.org is certainly worth a look, I don't
remember if I agreed with it completely but it for sure is an
interesting document).



All in all, it's nice if you want to understand how things are at the
lower levels, it's quite fun to know it, but in order to achieve that
for text files these days you need to read the Unicode specification, at
least in its first parts; other sources are quite likely to cause more
confusion than clarity. To tackle the varied things you can run into on
the web and other information sources you'll probably also need to know
some of the earlier history of Unicode and the older encodings /
character sets.


Kind regards,
Gabriele

Johannes Köhler

Jul 27, 2021, 3:38:45 PM7/27/21
to vim...@googlegroups.com
On 27.07.21 20:42, Gabriele F wrote:
> relevant portions of the Unicode specification (unicode.org) are not
> very long or exceedingly hard to understand, but maybe you can find some
> more accessible description.

I will - I just prefer to trust the working passion of the crowd (not cloud :) ). Meaning: when my Linux is already configured for Unicode, it should not use ascii anymore, and without requiring me to dig deep into the science of charsets.

Now I have become interested in learning more about Unicode charsets, especially on Linux. I will keep myself busy with this from now on.

> Most of all, UTF-8 is (normally) absolutely indistinguishable from
> normal US-ASCII until you use characters that were not in US-ASCII; so
> for example most English files will be bit-per-bit identical whether
> written in US-ASCII or UTF-8.

I spent some time with the Unicode pages. From that I thought I remembered that ascii characters are encoded in Unicode using a different endianness. But maybe I am wrong about this.
> https://www.mail-archive.com/vim...@googlegroups.com/msg57383.html and
> the others of that thread. But you probably won't make much out of it
> until you know how at least UTF-8 is encoded.
> https://www.mail-archive.com/vim...@googlegroups.com/msg57385.html .

thx, i appreciate *entropy*

> By the way, all of this means that it's not ascii that is "deprecated",
> but the various complimentary or alternative encodings that were (and
> still partly are) used to support non-English characters.

In my amateur, dilettante mind I think of it like this: "right now our CPUs push themselves to 64 bit, so why should I regard a 7-bit process as anything but deprecated?"

*I like to write with green and comical slang for the entertainment value - hopefully that's OK with our netiquette!*

sincerely
-kefko

Gabriele F

Jul 27, 2021, 4:08:49 PM7/27/21
to 'Johannes Köhler' via vim_use
'Johannes Köhler' via vim_use wrote:
> I spent some time with unicode pages. Therefore i thought to
> remember that ascii characters encoded in unicode using
> an different endian. But maybe i am wrong with this in
> mind.

By the way, the word endianness is only used when speaking about the order of bytes; the order of the individual bits inside each byte is a different thing (and it's virtually always something that only the lowest levels - electronic components - concern themselves with).

Gabriele F

Jul 27, 2021, 4:15:12 PM7/27/21
to 'Johannes Köhler' via vim_use
Johannes Köhler wrote:
> when my linux
> is already configured with unicode it should
> not use ascii anymore, and without the order that i have
> to stick deep into the science of charsets.

Unfortunately the only way to achieve that, BOMs, is particularly disputed on Linux, where indeed it is more likely to cause problems.
By the way, many things "should"


> I spent some time with unicode pages. Therefore i thought to
> remember that ascii characters encoded in unicode using
> an different endian. But maybe i am wrong with this in
> mind.

The multi-byte encodings (UTF-16, UTF-32 and the legacy UCSs) do have
endianness, the specific versions are called UTF-16LE, UTF-16BE etc. .
In Unicode these latter are called "Encoding Schemes" (Unicode chapter
3.10), while the higher level concept that doesn't concern itself with
endianness is called "Encoding Form" (Unicode ch. 3.9).
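The difference between the two encoding schemes is just the byte order, as a quick Python check shows (illustrative only):

```python
# Same code point (U+0041, 'A'), two encoding schemes, opposite byte order.
assert "A".encode("utf-16-le") == b"\x41\x00"  # little-endian: low byte first
assert "A".encode("utf-16-be") == b"\x00\x41"  # big-endian: high byte first
```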
By the way, I'm actually quite rusty in all these things, take my words
with a grain of salt

Gabriele F

Jul 27, 2021, 4:23:50 PM7/27/21
to vim...@googlegroups.com
Gabriele F wrote:
> By the way, many things "should"
I meant to put a " :) " after that, didn't mean to be rude

Johannes Köhler

Jul 31, 2021, 6:37:17 AM7/31/21
to vim...@googlegroups.com
On 27.07.21 21:37, Gabriele F wrote:
> (If I remember correctly) the first versions of Unicode had only a
> 2-byte encoding, so that (part of the) manpage is very old.

"_that would appeal to me_
... working unsalaried for the Linux manpage people, to keep them far from being "muddle-headed", always up-to-date and well referenced...
I ❤ reading manpages"

Is there a global internet group for that, like there is for Vim development?

>> Furthermore, be interested myself in the filesystem behavior
>> and unicode with ucs-2. Is it possible to use a linux
>> filesystem with 2-byte unicode encoding on principle.
>
> I'm not so strong on Linux but filesystems shouldn't have anything to do
> with text files encodings

I thought it was, because the Vim option ":set fenc" suggests a file->system connection...

> you're probably thinking too much ahead, these issues have likely
> nothing to do with endianness

I myself aspire to think by inference rather than ahead, but maybe you meant "keep to the point"; then I agree ;)

> It's not that simple unfortunately, UTF-16 (let's leave aside UCS-2, it
> shouldn't matter) cannot be assumed to always have two bytes per

UCS: _Uni_versal _Cod_ed Character Set

In my mind, UCS is the mathematical quantity and UTF the encoding/decoding function operating on it:
magnitudes: 16 (32) bit
plurality: charset / coded character

> All in all, it's nice if you want to understand how things are at the
> lower levels, it's quite fun to know it, but in order to achieve that

I have experienced that "low-level" knowledge makes it possible to reason by inference when discussing with others... (without having to dig deep into the science of the electronic subject).

Such as,

a hardware HDD controller uses a bit buffer for transferring the bit word to static memory. This controller can handle a defined bit length (normally the bus width).

Assume that the data of the HDD partition tables (e.g. UIDs), used by the operating system, are encoded in 16-bit Unicode. Well, my inferring thought was that UCS-2 is a hardware encoding, UTF-8 is for ascii purposes, UTF-32 is a high-level programmer attitude, and UTF-16 is the real Unicode.

In the end that means the controller is made for 2 bytes. The old ASCII code needs 7 bits, and probably one more for something, so now UTF-8 has to work with a different endianness.

And... why should I use a deprecated ASCII scheme on my system when I can have lots of advantages using utf-16 (e.g. control/hash functions)? It feels like utf-8 is a "workaround" wrapper for the ASCII scheme...

And... ucs (in principle) probably exploits the technological leap from block-oriented sequential access (HDD) to byte-oriented random-access memory (SSD). Maybe it plays with the one to four bit-octets and the endianness. ASCII seems to have been developed on the sequential encoding form.

sincerely
-kefko

Eike Rathke

Aug 1, 2021, 10:05:13 PM8/1/21
to vim...@googlegroups.com
Hi 'Johannes,

On Saturday, 2021-07-31 12:37:08 +0200, 'Johannes Köhler' via vim_use wrote:

> > It's not that simple unfortunately, UTF-16 (let's leave aside UCS-2, it
> > shouldn't matter) cannot be assumed to always have two bytes per
>
> UCS: _Uni_versal _Cod_ed Character Set
>
> In my mind, UCS is the mathematical quantum and UTF the
> encoding/decoding function using this:
> magnitudes: 16(32)bit
> plurality: charset / coded character

You are confusing things.

UCS-4, and UTF-32 as its subset, can hold and encode, respectively, assigned Unicode characters as direct representations of the characters' code points.

UCS-2 is a 2-byte fixed width character set capable of encoding 65536
characters, or just the Unicode Basic Multilingual Plane (BMP).

UTF-16 is capable of encoding the entire Unicode character range. It is almost identical to UCS-2 in the first 64k characters, except for the "escape sequences" it needs to represent surrogate pairs for characters of the higher planes.
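A short Python illustration of the surrogate-pair mechanism (the example characters are arbitrary):

```python
# A BMP character occupies one 16-bit unit in UTF-16 ...
euro = "\u20ac"                      # U+20AC EURO SIGN
assert len(euro.encode("utf-16-be")) == 2

# ... but a character outside the BMP needs a surrogate pair: two 16-bit
# units drawn from the reserved ranges D800-DBFF and DC00-DFFF.
clef = "\U0001d11e"                  # U+1D11E MUSICAL SYMBOL G CLEF
assert clef.encode("utf-16-be") == b"\xd8\x34\xdd\x1e"
```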


> Assuming that the data of the hdd partition tables (e.g.UID),
> used by the operating system, are encoded in 16bit Unicode.
> Well, my inferring thoughts were that UCS-2 is a
> hardware encoding, UTF-8 for ASCII purpose, UTF-32 a
> high level programmer attitude and UTF-16 the real unicode.

That's all nonsense. Really.

> In the end that means, the controller is made for 2-byte.
> The old ASCII code needs 7bit and probably one for
> sth., now than UTF-8 has to work with a different endian.

There is no endianness in UTF-8. Unless your hardware has fewer than 8 bits per word...

> And... why should i use a deprecated ASCII scheme
> at my system, when i can have lots of advantage
> using utf-16 (e.g. control/hash functions). It fells
> like utf-8 is a "work around" wrapper for
> the ASCII scheme...

UTF-8 is an efficient encoding that needs only 1 byte per character for Unicode characters < 128 (which happen to be identical with ASCII and are a subset of Unicode), whereas UTF-16 needs at least 2 bytes for each character.
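The size trade-off is easy to measure (a Python sketch; the sample strings are arbitrary):

```python
# ASCII-range text: UTF-8 is half the size of UTF-16.
assert len("hello".encode("utf-8")) == 5       # 1 byte per character
assert len("hello".encode("utf-16-le")) == 10  # 2 bytes per character

# For many non-Latin scripts both need 2 bytes per character ...
assert len("\u03b1\u03b2".encode("utf-8")) == 4       # Greek: 2 bytes each
assert len("\u03b1\u03b2".encode("utf-16-le")) == 4

# ... and East Asian text can even favor UTF-16 (3 vs 2 bytes).
assert len("\u65e5".encode("utf-8")) == 3      # a CJK character, U+65E5
assert len("\u65e5".encode("utf-16-le")) == 2
```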

UTF-16 is a workaround for those who wanted Unicode and started off with
UCS-2 but then realized there's more than just BMP.
Or, UTF-16 is the devil's work:
https://robert.ocallahan.org/2008/01/string-theory_08.html

Eike

--
OpenPGP/GnuPG encrypted mail preferred in all private communication.
GPG key 0x6A6CD5B765632D3A - 2265 D7F3 A7B0 95CC 3918 630B 6A6C D5B7 6563 2D3A
Use LibreOffice! https://www.libreoffice.org/

Tony Mechelynck

Aug 2, 2021, 7:59:26 AM8/2/21
to vim_use
As some have said above, UTF-8 is a variable-length encoding, which
encodes 7-bit ASCII characters exactly like us-ascii, and characters
(codepoints) above U+007F in two or more bytes, each of them with the
high bit set. Originally Unicode was foreseen to be able to go as far
up as U+3FFFFFFF, but when UTF-16 was crafted and surrogate codepoints
were assigned it was decided that codepoints higher than U+10FFFF
would never mean anything (and U+F0000 to U+10FFFF are "for private
use" anyway, i.e. transmitter and receiver have to agree on the
values, which are not defined by Unicode). The Wikipedia page about it
is well-written and I recommend reading it.
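The "high bit set" property can be checked directly (a Python sketch, for illustration only):

```python
# In UTF-8, ASCII bytes keep the high bit clear; every byte of a
# multi-byte sequence has it set, so the two can never be confused.
for b in "e".encode("utf-8"):        # plain ASCII
    assert b & 0x80 == 0             # high bit clear
for b in "\u00e9".encode("utf-8"):   # 'é' -> 0xC3 0xA9
    assert b & 0x80 != 0             # high bit set on every byte
```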

The so-called "byte order mark" U+FEFF ZERO WIDTH NO-BREAK SPACE
should more appropriately be called an "encoding mark": it can
discriminate most Unicode encodings and endiannesses from each other,
including UTF-8, which has no byte-order ambiguity. At the head of a
UTF-8 file (e.g. an HTML file or CSS script, whose syntaxes expressly
support it), it means "This is UTF-8". However some programs which
expect only US-ASCII will choke if they get a file headed by a BOM:
for instance a #! "executable script" header will not be recognized if
it is preceded by a BOM, so if you want to start your first line by
#!/bin/bash or #!/bin/env python the file may be in UTF-8 (which
encodes the 128 ASCII characters just like us-ascii) but without BOM.
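The shebang failure mode is purely mechanical, as this sketch shows (hypothetical file content, for illustration only): the loader checks the first two bytes, and a BOM pushes "#!" out of that position.

```python
import codecs

# With a UTF-8 BOM in front, the file no longer begins with "#!", so the
# kernel's script loader will not recognize it as an executable script.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
assert not script.startswith(b"#!")
assert script[:3] == b"\xef\xbb\xbf"   # the BOM occupies the first bytes
```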

See:
https://en.wikipedia.org/wiki/Unicode
https://en.wikipedia.org/wiki/UTF-8
and beware that the Microsoft Windows documentation usually says
"Unicode" when what it means is "UTF-16" which represents each
codepoint in one, or sometimes two, 16-bit words.

Best regards,
Tony.

Johannes Koehler

Aug 5, 2021, 10:45:03 AM8/5/21
to vim_use
THX to EIKE and TONY for the TIME and EFFORT spent on the REPLIES!

I was confused from reading the Unicode documentation, where utf-32 code points are locally expandable with 1) blocks in planes OR 2) whole planes... And intuitively I had in mind that utf-8 is there for downward compatibility with the us-ascii codespace.
The "use case" with bash scripts and us-ascii suggests the same to me. Q: Does bash read script text files similarly to binary (since I am not allowed to use a BOM)? I mean, is no charset decoding applied by Linux?

Then partition tables, which should be readable on different systems, are encoded with utf-16/ucs-2.

This implied to me that UCS-2 is a new standard for independent, decentralized 2-byte charsets, and UTF is the local interpreting process...

Finally, it doesn't matter - because the Linux decoder seems to be very rich in decision possibilities (e.g. it creates a 1-byte utf-8 file like us-ascii until I use a code point beyond ascii), and therefore my files should remain readable with 1-byte utf-8 for my lifetime.

But attention! With the modern "android smartphone" philosophy I got brainwashed: at all costs, stay up-to-date with your software and hardware systems, or you are no longer with us (community, life, etc.). Then I get _paranoid_ when I know there has been a new charset encoding for years and my system falls back to the deprecated one... *take it for fun*

sincerely
-kefko
