Vim on OS X, (no)macatsui problem

Kenneth Beesley

Oct 4, 2007, 6:26:24 PM

to vim_mu...@googlegroups.com

Background: running gvim 7.1.135 on OS X version 10.4.10

Using font DejaVuAgainSansMono.ttf, which is DejaVuSansMono.ttf
expanded with Deseret Alphabet (supplementary area) glyphs.

In .gvimrc, if I specify

set nomacatsui anti guifont=DejaVuAgain\ Sans\ Mono:h14

then gvim renders Roman glyphs, from the Basic Multilingual Plane, well,
but the Deseret glyphs (from the supplementary area) are rendered as
sequences of Roman glyphs and spaces, completely garbled.

If I change .gvimrc to

set macatsui anti guifont=DejaVuAgain\ Sans\ Mono:h14

(i.e. if I specify macatsui rather than nomacatsui, and this is the
only change)
then I see Roman and Deseret glyphs rendered as expected, but all the
glyphs look scraggly on the screen.

Can anyone explain to me what is happening here and how I might get
sharp renderings of both BMP and supplementary glyphs?

Thanks,

Ken

Nico Weber

Oct 5, 2007, 5:59:14 AM

to vim_mu...@googlegroups.com

Hi Ken,

I don't expect this to work at all without 'macatsui'. My experience
is that Vim does not assign enough horizontal space to supplementary-area
glyphs (is that what "scraggly" means?). This is because Vim needs a
monospaced font for correct display, and the supplementary-area glyphs are
too wide for the monospaced width of the current font (this can happen
because the font is not monospaced for all glyphs or because some
glyphs are substituted from other fonts, because they are missing in
the current font). This also happens for some BMP glyphs (U+0E5B
๛ for example, and many others).

One way that _might_ work is to get MacVim ( http://code.google.com/p/macvim/ ),
set its MMCellWidthMultiplier user default to something a bit larger than 1
(do `defaults write org.vim.MacVim MMCellWidthMultiplier 1.3`, see
http://code.google.com/p/macvim/wiki/UserDefaults for more information)
and use that. This widens up all glyphs, but perhaps it's good enough.

I don't know if it's possible at all to have a monospaced font that
works for all writing systems. The Right Thing is probably to make
vim work with variable width fonts, but I guess that's very very
complicated and won't happen :-\

Nico

Nico Weber

Oct 5, 2007, 6:57:53 AM

to Nico Weber, vim_mu...@googlegroups.com

> One way that _might_ work is to get MacVim ( http://code.google.com/p/macvim/ ),
> set its MMCellWidthMultiplier user default to something a bit larger than 1
> (do `defaults write org.vim.MacVim MMCellWidthMultiplier 1.3`, see
> http://code.google.com/p/macvim/wiki/UserDefaults for more information)
> and use that. This widens up all glyphs, but perhaps it's good enough.

Doing `:set ambiwidth=double` might help as well.

Nico

Joseph Retzer

Oct 5, 2007, 12:21:30 PM

to vim_mu...@googlegroups.com

MacVim looks great, but as a newbie I'm not sure how to get it to work from the command window. I have the mvim.htm file, but I'm not sure where to save it or what to name it. Any advice would be much appreciated.
Thanks,
Joe

Kenneth Beesley

Oct 5, 2007, 12:35:27 PM

to vim_mu...@googlegroups.com

On 5 Oct 2007, at 03:59, Nico Weber wrote:

> Hi Ken,
>
>> In .gvimrc, if I specify
>>
>> set nomacatsui anti guifont=DejaVuAgain\ Sans\ Mono:h14
>>
>> then gvim renders Roman glyphs, from the Basic Multilingual Plane,
>> well,
>> but the Deseret glyphs (from the supplementary area) are rendered as
>> sequences of Roman glyphs and spaces. Completely garbled
>>
>>
>> If I change .gvimrc to
>>
>> set macatsui anti guifont=DejaVuAgain\ Sans\ Mono:h14
>>
>> (i.e. if I specify macatsui rather than nomacatsui, and this is the
>> only change)
>> then I see Roman and Deseret glyphs rendered as expected, but all the
>> glyphs
>> look scraggly on the screen.
>>
>> Can anyone explain to me what is happening here and how I might get
>> sharp renderings of both BMP and supplementary glyphs?
>
> I don't expect this to work at all without 'macatsui'. My experience
> is that vim assigns not enough horizontal space to supparea glyphs
> (is that what "scraggly" means).

Hello Nico,

Thanks for the message. With 'nomacatsui' I see sharp, legible glyphs.
By "scraggly" I mean thin glyphs, with thin, shaky lines. These
"scraggly"
glyphs are legible, but they look bad.

> This is because vim needs a
> monospaced font for correct display, and the supparea glyphs are too
> wide for the monospaced width of the current font (this can happen
> because the font is not monospaced for all glyphs or because some
> glyphs are subsitituted from other fonts, because they are missing in
> the current font). This also does happen for some BMP glyphs (U+0E5B
> ๛ for example, and many others).

The Deseret glyphs (Unicode block starting U+10400) are alphabetic and
fit into the same width as the other glyphs. As far as I can tell,
the font I'm
using is monowidth. When merging the Deseret glyphs, I first reset
their
width to the width of the characters in the existing font (DejaVu
Sans Mono).
So whatever my problem is, it is not that the new glyphs I've added
are too
wide, or wider than the original glyphs.

>
> One way that _might_ work is to get MacVim ( http://code.google.com/p/macvim/ ),
> set its MMCellWidthMultiplier user default to something a bit larger than 1
> (do `defaults write org.vim.MacVim MMCellWidthMultiplier 1.3`, see
> http://code.google.com/p/macvim/wiki/UserDefaults for more information)
> and use that. This widens up all glyphs, but perhaps it's good enough.

As just explained, too-wide glyphs are not the problem, as far as I
can tell.

>
> I don't know if it's possible at all to have a monospaced font that
> works for all writing systems. The Right Thing is probably to make
> vim work with variable width fonts, but I guess that's very very
> complicated and won't happen :-\

For my current work, I just need a few alphabets (Roman, Shavian,
Deseret) that can all fit in a reasonable width. I don't need "all
writing
systems". I am (as far as I can tell) using a monowidth font. The
complication is that Shavian and Deseret are in the supplementary
area, and Vim 7.1 just recently added patch 116 that is supposed to
allow
glyphs from the supplementary area to be rendered, for the first time.
Previous to 116, you could edit supplementary characters, but even
with a proper font, vim couldn't display the glyphs from the
supplementary
area.

Thanks again for your message,

Ken

>
> Nico
> >

Kenneth Beesley

Oct 5, 2007, 1:23:05 PM

to vim_mu...@googlegroups.com

Joseph,

I assume you're working in OS X.

You should put MacVim.app in /Applications/

You should have an executable file named mvim, not mvim.htm
See http://code.google.com/p/macvim/downloads/list

and click on the mvim link. That should download mvim, not mvim.htm

Typically, you would move this mvim file to your ~/bin directory or
to some other directory on
your path.

To check your path, from a command line terminal, enter

echo $PATH

Once you have installed MacVim.app and mvim in the proper places,
enter (again from a terminal; rehash is a csh/tcsh command, and in bash
you can use hash -r instead)

rehash

Then your system should be able to find mvim. Make sure by entering

which mvim

It should respond ~/bin/mvim (if you put mvim in ~/bin)
and you should be able to launch MacVim from the command line by just
entering

mvim

Ken

Joseph Retzer

Oct 5, 2007, 3:51:15 PM

to vim_mu...@googlegroups.com

Kenneth,
THANKS VERY MUCH!!!!

Got it to work! This solves a lot of problems I've been dealing with for some time.

Take care,
Joe

Kenneth Beesley <krbe...@gmail.com> wrote:

Nico Weber

Oct 8, 2007, 3:11:19 PM

to vim...@googlegroups.com, vim_mu...@googlegroups.com

Forwarding this to vim_mac as Bjorn is not subscribed to
vim_multibyte as far as I know. Kenneth, I guess it would help if you
could post links to screenshots of the text as it's supposed to look
and of the garbled look, as well as the font you're using so we can
reproduce this.

Nico

Begin forwarded message:

Nico Weber

Oct 8, 2007, 4:56:44 PM

to vim...@googlegroups.com, vim_mu...@googlegroups.com

> Ugh. I tried sifting through the forwarded posts, but it was kind of
> hard to understand them. I will try to read the posts on google
> groups instead, unless somebody can summarize the problem(s) for me?

Would have been easier if you'd "Reply all"d. Here's what I think
Kenneth's problems are:

> I just installed the latest MacVim and tried it with a version of
> DejaVuSansMono.ttf, augmented
> with (monowidth) glyphs, the same width as the original
> DejaVuSansMono.ttf glyphs,
> for the Deseret Alphabet block (U+10400). It doesn't seem to work
> for me. When I select my
> Deseret Alphabet keymap and try to type Deseret Alphabet, I see
> pseudo glyphs in boxes
> rendered on the screen.

You can enter Deseret characters by opening the Character Palette,
putting "deseret" in the search box at the bottom and ... well, you
know the rest. MacVim displays a "character not found" sign which is
probably the Right Thing as the default DejaVu font seems not to
include these characters, but Kenneth uses a font that _does_ have
them. Having access to Kenneth's font would help...

He also reports that mapping numbers `:map 3 ...` doesn't work. I
can't reproduce this.

Nico

Nico Weber

Oct 8, 2007, 5:01:48 PM

to vim...@googlegroups.com, vim_mu...@googlegroups.com

> He also reports that mapping numbers `:map 3 ...` doesn't work. I
> can't reproduce this.

I got this one wrong. See the other thread for Kenneth's
clarification. Sorry.

Nico

björn

Oct 13, 2007, 2:45:33 PM

to vim_multibyte

Hi Ken,

I have looked into why MacVim fails to render the deseret glyphs and I
now have an answer, but unfortunately no solution.

The problem is that one Deseret character for some reason takes up
_two_ characters when put in the text storage (I guess this has
something to do with Unicode?). Specifically, calling "length" on an
NSString containing one Deseret character returns 2 instead of the 1
I would expect.

Now, I do know how to fix this problem, but since Jiang is working on
moving his drawing code to MacVim I don't really want to spend any
time doing this, since the problem will disappear as soon as he is
finished. I'm sorry about that.

/Björn

Tony Mechelynck

Oct 13, 2007, 8:30:53 PM

to vim_mu...@googlegroups.com

UTF-8 uses:
1 byte for each codepoint in the range U+0000 - U+007F
2 bytes for each codepoint in the range U+0080 - U+07FF
3 bytes for each codepoint in the range U+0800 - U+FFFF
4 bytes for each codepoint in the range U+10000 - U+1FFFFF
Actually, current standards mandate that no codepoints higher than U+10FFFD
will "ever" be used. (Vim supports up to U+3FFFFFFF, with up to 6 bytes per
codepoint, following an earlier draft of the standard.)
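
The byte counts in the table above can be checked mechanically. The following is a minimal Python 3 sketch, not anything taken from Vim; the helper name `utf8_len` is made up for illustration, and it only covers the modern range up to U+10FFFF (not the older 5- and 6-byte forms).

```python
def utf8_len(cp: int) -> int:
    """Bytes needed to encode code point cp in UTF-8 (modern range only)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


# Check the first and last code point of each range against Python's encoder.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")
```

The cross-check against `str.encode` is only there to show that the range table and a real encoder agree on the boundaries.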

Unicode also has the notion of "composing characters", which are characters
which are "superimposed" on the preceding character, possibly changing its
shape. These are usually diacritics: most of the accents of Latin can be
either precomposed or spacing-non-accented + composing-accent, but the
optional vowel marks of Hebrew and Arabic exist only as composing characters.

Since your Deseret characters are outside the BMP, each of them requires 4
bytes in UTF-8 (also two 16-bit words in UTF-16 and one 32-bit doubleword in
UTF-32); but maybe that's not what your measured "length" means? Does your
NSString include a final null (as C strings do) or an initial bytecount (as
Pascal strings do)? Or do your Deseret characters include "composing" elements?

Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
55. You ask your doctor to implant a gig in your brain.

björn

Oct 14, 2007, 7:01:10 AM

to vim_mu...@googlegroups.com, vim...@googlegroups.com

> > The problem is that one deseret character for some reason takes up
> > _two_ characters when put in the text storage (I guess this have
> > something to do with Unicode?). Specifically, calling "length" on an
> > NSString containing one deseret character returns 2 instead of 1, as I
> > would expect.
> >

> UTF-8 uses:
> 1 byte for each codepoint in the range U+0000 - U+007F
> 2 bytes for each codepoint in the range U+0080 - U+07FF
> 3 bytes for each codepoint in the range U+0800 - U+FFFF
> 4 bytes for each codepoint in the range U+10000 - U+1FFFFF
> Actually, current standards mandate that no codepoints higher than U+10FFFD
> will "ever" be used. (Vim supports up to U+3FFFFFFF, with up to 6 bytes per
> codepoint, following an earlier draft of the standard.)
>
> Unicode also has the notion of "composing characters", which are characters
> which are "superimposed" on the preceding character, possibly changing its
> shape. These are usually diacritics: most of the accents of Latin can be
> either precomposed or spacing-non-accented + composing-accent, but the
> optional vowel marks of Hebrew and Arabic exist only as composing characters.
>
> Since your Deseret characters are outside the BMP, each of them requires 4
> bytes in UTF-8 (also two 16-bit words in UTF-16 and one 32-bit doubleword in
> UTF-32); but maybe that's not what your measured "length" means? Does your
> NSString include a final null (as C strings do) or an initial bytecount (as
> Pascal strings do)? Or do your Deseret characters include "composing" elements?

I'm sorry about the confusion with posting this thread separately on
vim_multibyte and vim_mac...I'll try to bring the diverging threads
together by posting this reply to both groups.

Tim Allen replied to the vim_mac thread saying that NSString uses
utf-16 internally and this is indeed why it says one deseret char has
length 2 (since it needs two 16 bit chars to store one deseret char,
as has been pointed out already).

I was under the mistaken impression that NSString always returned
length 1 for one character (not counting composing characters), which
is why I thought MacVim would work in all situations except when
composing characters were used. Again, this can be fixed by getting
rid of the assumption that each line in the text storage has the same
length (as returned by NSString), but this is a rather big code
change.
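
To make the "length 2" observation concrete, here is a small Python 3 sketch (Python 3.3 or later assumed). It does not call NSString; it only counts 16-bit code units, and the helper name `utf16_units` is hypothetical.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    # "utf-16-le" writes no byte-order mark, so bytes / 2 == code units.
    return len(text.encode("utf-16-le")) // 2


deseret = "\U00010400"  # DESERET CAPITAL LETTER LONG I

print(len(deseret))          # code points in the string
print(utf16_units(deseret))  # 16-bit code units needed for the same string
print(utf16_units("a"))      # a BMP character needs a single code unit
```

A length that counts UTF-16 code units, as described above, corresponds to the second number, not the first.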

Thanks to Tony and Tim for educating me on the finer points of Unicode... :-)

/Björn

Tony Mechelynck

Oct 14, 2007, 5:32:47 PM

to vim_mu...@googlegroups.com, vim...@googlegroups.com

björn wrote:
[...]

> I'm sorry about the confusion with posting this thread separately on
> vim_multibyte and vim_mac...I'll try to bring the diverging threads
> together by posting this reply to both groups.
>
> Tim Allen replied to the vim_mac thread saying that NSString uses
> utf-16 internally and this is indeed why it says one deseret char has
> length 2 (since it needs two 16 bit chars to store one deseret char,
> as has been pointed out already).

Yes, obviously (if one thinks about it) one UTF-16 16-bit word cannot
represent anything above U+FFFF. For codepoints U+10000 to U+10FFFF (including
Deseret, among others), two "surrogate characters" are used -- two 16-bit
words, one in the range 0xD800-0xDBFF and the other in the range
0xDC00-0xDFFF: see
http://en.wikipedia.org/wiki/UTF-16#Encoding_of_characters_outside_the_BMP
for details. Unlike UTF-8 and UTF-32, UTF-16 inherently cannot, even with
surrogates, represent anything above U+10FFFF, and (I suppose) that's (one of
the reasons) why it was decided to bring the "upper range" of Unicode down
from U+7FFFFFFF to U+10FFFF (and even U+10FFFD since for other reasons, the
last two codepoints of every plane -- U+xxFFFE and U+xxFFFF -- are "invalid").
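
The surrogate arithmetic can be sketched in a few lines of Python 3. This is an illustration of the standard UTF-16 rule, with a made-up helper name (`to_surrogates`), cross-checked against Python's own UTF-16 encoder.

```python
import struct


def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogates")
    v = cp - 0x10000             # 20-bit value
    high = 0xD800 + (v >> 10)    # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


cp = 0x10400  # first code point of the Deseret block
high, low = to_surrogates(cp)
# Cross-check against the two 16-bit words Python's encoder produces.
assert (high, low) == struct.unpack("<2H", chr(cp).encode("utf-16-le"))
print(f"U+{cp:04X} -> 0x{high:04X} 0x{low:04X}")
```

Since 20 bits is all the two surrogates can carry, the largest value reachable this way is 0x10000 + 0xFFFFF = U+10FFFF, which is the limit mentioned above.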

>
> I was under the mistaken impression that NSString always returned
> length 1 for one character (not counting composing characters), which
> is why I thought MacVim would work in all situations except when
> composing characters were used. Again, this can be fixed by getting
> rid of the assumption that each line in the text storage has the same
> length (as returned by NSString), but this is a rather big code
> change.
>
> Thanks to Tony and Tim for educating me on the finer points of Unicode... :-)

My pleasure. :-)

>
>
> /Björn

Best regards,
Tony.
--
Court, n.:
A place where they dispense with justice.
-- Arthur Train

Kenneth Beesley

Oct 15, 2007, 12:46:36 PM

to vim_mu...@googlegroups.com

Hi Björn,

Many thanks for the message.

Yeah, the term Character is a technical term in Unicode, and each
Unicode character has a code point value that ranges from 0x0 to
0x10FFFF.

In the original vision of Unicode, code point values ranged from 0x0
to 0xFFFF, allowing just 64k distinct characters. This old limited range
is now known as the Basic Multilingual Plane (BMP). The current
vision of Unicode, now 10 years old, allows about a million characters,
and the characters with code point values beyond 0xFFFF are known
as supplementary characters.

Many software applications still haven't caught up with supplementary
characters. They're still stuck in the BMP.

In Java, there is a type called "char" that has 16 bits and so can
represent any code point value in the BMP, 0x0 to 0xFFFF. It is important
not to confuse "char" with the Unicode notion of Character. In Java,
to store a supplementary Unicode character, two "chars" are used, in a
coding system known as UTF-16. It sounds like MacVim has a similar
storage system, and that the length-in-chars is being confused with
the length-in-Unicode-characters.

Best wishes,

Ken

Kenneth Beesley

Oct 15, 2007, 2:16:15 PM

to vim_mu...@googlegroups.com

Tony,

Great message, as usual.
I insert some friendly comments below.

On 13 Oct 2007, at 18:30, Tony Mechelynck wrote:

>
> björn wrote:
>>>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
>>>> can't reproduce this.
>>> I got this one wrong. See the other thread for Kenneth's
>>> clarification. Sorry.
>>
>> Hi Ken,
>>
>> I have looked into why MacVim fails to render the deseret glyphs
>> and I
>> now have an answer, but unfortunately no solution.
>>
>> The problem is that one deseret character for some reason takes up
>> _two_ characters when put in the text storage (I guess this have
>> something to do with Unicode?). Specifically, calling "length" on an
>> NSString containing one deseret character returns 2 instead of 1,
>> as I
>> would expect.
>>
>> Now, I do know how to fix this problem, but since Jiang is working on
>> moving his drawing code to MacVim I don't really want to spend any
>> time doing this, since the problem will disappear as soon as he is
>> finished. I'm sorry about that.
>>
>>
>> /Björn
>

Tony responds:

> UTF-8 uses:
> 1 byte for each codepoint in the range U+0000 - U+007F
> 2 bytes for each codepoint in the range U+0080 - U+07FF
> 3 bytes for each codepoint in the range U+0800 - U+FFFF
> 4 bytes for each codepoint in the range U+10000 - U+1FFFFF

KRB: The current modern Unicode character set has code point
values ranging from U+0 to U+10FFFF, allowing about a million
distinct "characters". These Unicode Characters are slightly abstract
and need to be distinguished carefully from how they are "encoded"
in a file or in a programming language. In UTF-8 encoding, the "code unit"
is one byte, and each Unicode character (each code point value) is stored
in one to four bytes as you describe above. The conversion between code point
values (integers) and the bit/byte representations requires some trivial
bit extraction and shifting.

What Björn describes sounds more like UTF-16, where each Unicode
character (code point value) is stored in either one 16-bit "code unit"
or in two 16-bit code units. Characters from the Basic Multilingual Plane,
U+0 to U+FFFF, are stored in a single 16-bit code unit. Supplementary
characters, those beyond the Basic Multilingual Plane, are stored in two
16-bit code units. (Again there is some trivial bit manipulation involved
in conversion between code point values and the bit representations
in the 16-bit code units.)

Perl stores Unicode strings internally as UTF-8, but you should hardly ever
have to know that. If you ask for the length of a Perl Unicode string, Perl
gives you the length in Unicode Characters. If you loop through the
characters in a Perl Unicode string, it loops through Unicode Characters,
taking care of the underlying UTF-8 encoding in the background. The
underlying encoding in UTF-8 is effectively hidden from the programmer. At
the programming level, you can always think of a Perl Unicode string as a
sequence of Unicode Characters (including supplementary characters).

Java from the very beginning took Unicode very seriously. But Java emerged
in the olden days of Unicode, when code point values ranged only from U+0 to
U+FFFF, so every original Unicode character could be stored in a single
16-bit "char". The length of a Unicode string was simply the number of chars.
Easy and clean.
The introduction of supplementary Unicode characters 10 years ago created
quite a challenge for Java and other programming languages that wanted
to take Unicode seriously. Instead of accommodating the New Unicode by
making char 32 bits (which would allow each New Unicode character to be
stored straightforwardly in a single 32-bit char) the Java gurus opted to
keep "char" at 16 bits and use UTF-16 to store Unicode strings. If you ask
for the "length" of a Unicode string in Java, it still returns the length in
chars rather than the length in Unicode Characters. This is (arguably) quite
a mess, and you have to be very aware of it as a programmer if you want to
handle Supplementary Unicode Characters.
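
What "being very aware of it" amounts to is walking the 16-bit units and treating a high surrogate followed by a low surrogate as one character. Java 5 and later offer `String.codePointCount` for this; the sketch below shows the same idea in Python 3, with hypothetical helper names (`utf16_code_units`, `count_code_points`) and no claim to match any particular library's implementation.

```python
import struct


def utf16_code_units(text: str) -> list:
    """The 16-bit code units a UTF-16 string type would hold for text."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def count_code_points(units: list) -> int:
    """Count Unicode characters in a sequence of UTF-16 code units."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1  # a valid pair is one character
        count += 1
    return count


units = utf16_code_units("a\U00010400b")
print(len(units))                # length in 16-bit "chars"
print(count_code_points(units))  # length in Unicode characters
```

An unpaired surrogate is counted as one character here, which is a simplifying assumption of this sketch.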

The way that Python handles Unicode strings internally depends on how
it is configured/built. If configured for "ucs2", Python stores Unicode
strings as UTF-16, returns the "length" of strings as the number of 16-bit
code units, and if you try to loop through the elements of a string, it loops
through 16-bit values, which creates a mess if your string contains
supplementary characters. This is comparable to the situation in Java.

If you configure Python for "ucs4", then each Unicode string is stored
internally as a string of 32-bit code units, "length" is returned as the
number of Unicode characters, and if you loop through the characters in a
string, you get one Unicode character (code point value) at a time, even for
supplementary characters. This "ucs4" option is now formally termed UTF-32
in Unicode circles.
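
Which behaviour a given interpreter has can be read from `sys.maxunicode`. The sketch below assumes Python 3.3 or later, where the "ucs2"/"ucs4" build distinction no longer exists and strings always behave as described for "ucs4"; on an old narrow build the same check would report 0xFFFF. The helper name `is_wide_build` is made up.

```python
import sys


def is_wide_build() -> bool:
    """True if one string element can hold any code point up to U+10FFFF."""
    return sys.maxunicode > 0xFFFF


s = "\U00010400\U00010401"  # two Deseret letters

print(is_wide_build())
print(len(s))                            # counted in code points
print([f"U+{ord(c):04X}" for c in s])    # iteration yields whole code points
```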

> Actually, current standards mandate that no codepoints higher than
> U+10FFFD will "ever" be used. (Vim supports up to U+3FFFFFFF, with up to
> 6 bytes per codepoint, following an earlier draft of the standard.)
>
> Unicode also has the notion of "composing characters", which are
> characters
> which are "superimposed" on the preceding character, possibly
> changing its
> shape. These are usually diacritics: most of the accents of Latin
> can be
> either precomposed or spacing-non-accented + composing-accent, but the
> optional vowel marks of Hebrew and Arabic exist only as composing
> characters.

Quite right. "Character" is a technical term in Unicode, and includes
spaces, punctuation and these Combining Diacritical Marks (block starting
U+0300) that might not fall under the everyday notion of character. An
acute-accented é, for example, can be represented in Unicode either as a
single character,

U+00E9

which has the name LATIN SMALL LETTER E WITH ACUTE

You can alternatively represent é as a sequence of two Unicode
characters

U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT

The Unicode gods have explicitly decreed that these two
representations are
equivalent, which means that any proper Unicode-capable editor should
handle and display them equivalently.
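
The equivalence of the two spellings of é can be tested with a normalization routine. This is a minimal Python 3 sketch using the standard `unicodedata` module; the helper name `canonically_equal` is made up for illustration.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

print(precomposed == decomposed)                   # raw comparison of code points
print(canonically_equal(precomposed, decomposed))  # comparison after normalizing
```

A plain `==` compares code point sequences, so software that wants to treat the two forms as the same text has to normalize first.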

In Hopi (spoken in Arizona) orthography (as defined at the University of
Arizona), you have some double-accented graphemes like o with both
diaeresis and an acute, grave or circumflex accent. In Unicode you
can represent o with diaeresis and acute (the acute accent is rendered
above the diaeresis) as either the three-character sequence

U+006F LATIN SMALL LETTER O
U+0308 COMBINING DIAERESIS
U+0301 COMBINING ACUTE ACCENT

or as the two-character sequence

U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
U+0301 COMBINING ACUTE ACCENT

But there is no single "pre-composed" Unicode character for this
purpose.

This whole issue of Combining Diacritical Marks is separate from the issue
of encoding (UTF-8, UTF-16 or UTF-32). Some conversion between
"pre-composed" and "decomposed" representations can be done using
"Normalization" routines available in Perl, Python, Java, ICU, etc.
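
As an illustration of such a routine, the Hopi example can be run through Python 3's `unicodedata.normalize`. This is a sketch, with a made-up helper `describe` that just lists code points and their Unicode names.

```python
import unicodedata


def describe(text: str) -> list:
    """List each code point of text with its Unicode character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]


three = "o\u0308\u0301"  # o + COMBINING DIAERESIS + COMBINING ACUTE ACCENT

composed = unicodedata.normalize("NFC", three)       # compose as far as possible
decomposed = unicodedata.normalize("NFD", composed)  # fully decompose again

for line in describe(composed):
    print(line)
print(decomposed == three)
```

NFC stops at the two-character form because no single pre-composed character exists for this combination, and NFD recovers the three-character form.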

These Combining Diacritical Marks need to be rendered above or below,
or attached in particular places, as appropriate, to any letter character.
For that to work properly, you need a font (e.g. Doulos SIL or Charis SIL)
that contains the diacritic-positioning information, and you need a
sophisticated rendering engine (as in XeTeX) that reads and uses that
diacritic-positioning information.

Most software, including text editors, still does a poor job of handling
Combining Diacritical Marks and supplementary characters in general.

>
> Since your Deseret characters are outside the BMP, each of them
> requires 4
> bytes in UTF-8 (also two 16-bit words in UTF-16 and one 32-bit
> doubleword in
> UTF-32); but maybe that's not what your measured "length" means?
> Does your
> NSString include a final null (as C strings do) or an initial
> bytecount (as
> Pascal strings do)? Or do your Deseret characters include
> "composing" elements?

Because the "length" of each Deseret Character is being returned as 2 rather
than 1, it sounds like the MacVim code is using a Java-like UTF-16 internal
representation for storing Unicode characters (including supplementary
characters).

There are no Combining Diacritical Marks required in the traditional Deseret
Alphabet, per se, although proper rendering software _should_ allow you to
associate one or more Combining Diacritical Marks with any letter character
and have it rendered acceptably. (Handling combining diacritical marks with
Deseret Alphabet is very low priority.)

Each Deseret Alphabet letter is a single Unicode character, with a single code
point value in the supplementary area (block starting U+10400). The Shavian
alphabet is much the same (in the block starting U+10450). The glyphs are
straightforward, rendered left-to-right, requiring no ligatures, and could be
forced into a fixed-pitch (mono) font about as easily as Roman glyphs.

Ken

Tony Mechelynck

Oct 16, 2007, 8:49:02 AM

to vim_mu...@googlegroups.com

Vim doesn't use UTF-16 internally, because the many intervening nulls would
wreak havoc with the C requirement of null-terminated strings. If you set
'encoding' to UCS-4, UTF-16 or UTF-32 (of any endianness), Vim will actually
use UTF-8 internally, because 0x00 in UTF-8 is the NULL character (codepoint
U+0000), nothing else, and Vim already knows how to handle that.

When you set 'fileencoding' to UTF-16, the internal UTF-8 representation of
the text will be converted to and from UTF-16 when writing or reading
(respectively), using surrogate pairs for any codepoint above U+FFFF, so that,
_on disk_, they take two UTF-16 words rather than one.
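
The shape of that conversion can be sketched in Python 3. This is not Vim's code and says nothing about how Vim performs the conversion internally; the function names `to_disk` and `from_disk` are hypothetical, and big-endian UTF-16 without a byte-order mark is an arbitrary choice for the example.

```python
def to_disk(internal_utf8: bytes) -> bytes:
    """Convert an internal UTF-8 buffer to UTF-16 (big-endian, no BOM)."""
    return internal_utf8.decode("utf-8").encode("utf-16-be")


def from_disk(on_disk_utf16: bytes) -> bytes:
    """Convert UTF-16 file contents back to the internal UTF-8 form."""
    return on_disk_utf16.decode("utf-16-be").encode("utf-8")


internal = "a\U00010400".encode("utf-8")  # 1 byte + 4 bytes
on_disk = to_disk(internal)               # one 16-bit word + a surrogate pair

print(internal.hex())
print(on_disk.hex())
assert from_disk(on_disk) == internal     # the round trip loses nothing
```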

I don't know what function you used to count characters, but the Vim
string-length function, strlen(), gives a string's length in _bytes_ in the
current internal representation: for Unicode, "a" (U+0061) is one, "é"
(e-acute, U+00E9) is two, "†" (dagger, U+2020) is three and any Deseret
character is four. (Under ":help strlen()" you can see how to count
"characters" in a string, as opposed to "bytes".)

>
>
>
> On 13 Oct 2007, at 12:45, björn wrote:
>
>>>> He also reports that mapping numbers `:map 3 ...` doesn't work. I
>>>> can't reproduce this.
>>> I got this one wrong. See the other thread for Kenneth's
>>> clarification. Sorry.
>> Hi Ken,
>>
>> I have looked into why MacVim fails to render the deseret glyphs and I
>> now have an answer, but unfortunately no solution.
>>
>> The problem is that one deseret character for some reason takes up
>> _two_ characters when put in the text storage (I guess this have
>> something to do with Unicode?). Specifically, calling "length" on an
>> NSString containing one deseret character returns 2 instead of 1, as I
>> would expect.
>>
>> Now, I do know how to fix this problem, but since Jiang is working on
>> moving his drawing code to MacVim I don't really want to spend any
>> time doing this, since the problem will disappear as soon as he is
>> finished. I'm sorry about that.
>>
>>
>> /Björn

Best regards,
Tony.
--
During a grouse hunt in North Carolina two intrepid sportsmen
were blasting away at a clump of trees near a stone wall. Suddenly a
red-faced country squire popped his head over the wall and shouted,
"Hey, you almost hit my wife."
"Did I?" cried the hunter, aghast. "Terribly sorry. Have a
shot at mine, over there."

Tony Mechelynck

Oct 16, 2007, 9:54:36 AM

to vim_mu...@googlegroups.com

Hm. I guess I'll stay with Vim and vim-script, where I know what to expect.

> Quite right. "Character" is a technical term in Unicode, and includes
> spaces, punctuation and [...]
> that might not fall under the everyday notion of character.

also control characters (carriage return, line feed, form feed, horizontal
tab, soft hyphen, byte-order mark, zero-width joiner, etc.), which also might
not all fall under the everyday notion of "character".

> [...] Some conversion between "pre-composed" and "decomposed"
> representations can be done using "Normalization" routines
> available in Perl, Python, Java, ICU, etc.

but not in Vim. AFAIK, the only "normalization" routines afforded by Vim
(other than not using a separate screen cell for composing characters) are: (a)
the 'delcombine' option, which, if set, allows <BS> to erase one combining
character at a time, while when clear (default) it will erase one spacing
character together with any number of combining characters in the same screen
cell; and (b) the \Z pattern atom, which will ignore combining characters
anywhere in the text while matching. But AFAIK Vim will always treat "é"
(U+00E9 LATIN SMALL LETTER E WITH ACUTE) and "é" (U+0065 LATIN SMALL LETTER E
+ U+0301 COMBINING ACUTE ACCENT) as different even if it displays them the same.

>
> These Combining Diacritical Marks need to be rendered above or below,
> or attached in particular places, as appropriate, to any letter
> character. For
> that to work properly, you need a font (e.g. Doulos SIL or Charis
> SIL) that
> contains the diacritic-positioning information, and you need a
> sophisticated rendering
> engine (as in XeTeX) that reads and uses that diacritic-positioning
> information.
>
> Most software, including text editors, still do a poor job of handling
> Combining Diacritical Marks and supplementary characters in general.

In Arabic, Vim handles combining vowels etc. ("harakaat" as Arabic grammarians
call them) quite well, including several per character as e.g. in (spacing)
seen (Arabic S) + combining shadda (geminated-consonant sign) + combining
fatha (Arabic short vowel a), a combination which appears in the fully
vocalized form of "as-salaam" (Peace). Starting recently (7.1.116), Vim can
now display (not only edit) any codepoint in the current 'guifont', not only
those in the BMP. From what you say above, it looks like Vim is ahead of "most
software including text editors", but I don't doubt that the situation will
get better as time goes on.

>
>> Since your Deseret characters are outside the BMP, each of them
>> requires 4
>> bytes in UTF-8 (also two 16-bit words in UTF-16 and one 32-bit
>> doubleword in
>> UTF-32); but maybe that's not what your measured "length" means?
>> Does your
>> NSString include a final null (as C strings do) or an initial
>> bytecount (as
>> Pascal strings do)? Or do your Deseret characters include
>> "composing" elements?
>
> Because the "length" of each Deseret Character is being returned as 2
> rather
> than 1, it sounds like the MacVim code is using a Java-like UTF-16
> internal representation
> for storing Unicode characters (including supplementary characters).

How do you compute that length? The strlen() function should return 4 for each
Deseret character, and the function (similar to that mentioned under ":help
strlen()")

strlen(substitute(string, '.', '-', 'g'))

should return 1.

>
> There are no Combining Diacritical Marks required in the traditional
> Deseret Alphabet, per se,
> although proper rendering software _should_ allow you to associate
> one or more Combining
> Diacritics Marks with any letter character and have it rendered
> acceptably. (Handling
> combining diacritical marks with Deseret Alphabet is very low priority.)
>
> Each Deseret Alphabet letter is a single Unicode character, with a
> single code
> point value in the supplementary area (block starting U+10400). The
> Shavian alphabet is
> much the same (in the block starting U+10450). The glyphs are
> straightforward, rendered
> left-to-right, requiring no ligatures, and could be forced into a
> fixed-pitch (mono) font about
> as easily as Roman glyphs.
>
> Ken

and a lot more easily than Arabic, where a single letter (with a single code
point) may have to be shown in up to 4 different ways (not counting combining
characters), depending on its position in the word and on which letter (if
any) precedes it. Happily Vim (with +arabic) knows how to fetch the required
"presentation forms" from the Arabic fonts. Anyway, the beautiful cursive
shapes of Arabic still look ugly when rendered in any monospace font, but
that's because Arabic calligraphy, with its long flourishes at the end of
almost every word, was invented for the calame (i.e., the reed pen), not the
typewriter.

Best regards,
Tony.
--
Try to be the best of whatever you are, even if what you are is no
good.
