Suggestion: Redefine \Uxxxxx in double-quoted strings

Tony Mechelynck

Apr 6, 2009, 2:22:41 PM
to Vim Developers, Bram Moolenaar, vim_mu...@googlegroups.com
Vim is now capable of displaying any Unicode codepoint for which the
installed 'guifont' has a glyph, even outside the BMP (i.e., even above
U+FFFF), but there's no easy way to represent those "high" codepoints by
Unicode value in strings: I mean, "\uxxxx" and "\Uxxxx" still accept no
more than four hex digits.

I propose to keep "\uxxxx" at its present meaning, but extend
"\Uxxxxxxxx" to allow additional hex digits (either up to a total of 8
hex digits, in line with ^VUxxxxxxxx as opposed to ^Vuxxxx in Insert
mode, or at least up to the value \U10FFFF, above which the Unicode
Consortium has decided that "there never shall be a valid Unicode
codepoint at any future time").

I'm aware that this is an "incompatible" change, but I believe the risk
is low compared with the advantages (as a sidenote, many rare CJK
characters lie in plane 2, in the "CJK Unified Extension B" range
U+20000-U+2A6DF).

The notation "\<Char-0x20000>" or "\<Char-131072>" doesn't work: here
(in my GTK2/Gnome2 gvim with 'encoding' set to UTF-8), ":echo"ing such a
string displays <f0><a0><80><fe>X<80><fe>X instead of just the one CJK
character 𠀀 (and, yes, I've set my mailer to send this post as UTF-8 so
if yours is "well-behaved" it should display that character properly).
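
For comparison, here is a small sketch of the byte sequence one would expect for that character. Python is used purely for illustration (it is not part of the original post), and the variable names are made up for the example. The first three bytes agree with what ":echo" shows above; the difference is that each 0x80 in the displayed string is followed by an extra <fe>X pair.

```python
# UTF-8 encoding of U+20000 (decimal 131072), the character used in the example.
codepoint = 0x20000
encoded = chr(codepoint).encode("utf-8")

# Render the bytes in the same <xx> style that ":echo" uses for raw bytes.
as_vim_style = "".join(f"<{b:02x}>" for b in encoded)
print(codepoint, as_vim_style)  # 131072 <f0><a0><80><80>
```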


Best regards,
Tony.
--
Although the moon is smaller than the earth, it is farther away.

Bram Moolenaar

Apr 6, 2009, 4:15:25 PM
to Tony Mechelynck, Vim Developers, vim_mu...@googlegroups.com

Tony Mechelynck wrote:

It does cause problems for something like "\U12345" which would now be
the character 0x1234 followed by the character 5. After the change it
would become one character 0x12345.
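
The difference can be sketched with a toy parser. This is only an illustration in Python, not Vim's actual string-parsing code: expand_upper_u and its max_digits parameter are hypothetical names, and the sketch assumes the escape simply consumes as many hex digits as it is allowed to.

```python
import re

def expand_upper_u(text, max_digits):
    """Expand \\U escapes in text, consuming at most max_digits hex digits each.

    Hypothetical helper for illustration; values above 0x10FFFF would make
    chr() raise ValueError, which this sketch does not handle.
    """
    pattern = re.compile(r"\\U([0-9A-Fa-f]{1,%d})" % max_digits)
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

source = r"\U12345"
old = expand_upper_u(source, 4)  # present rule: at most four hex digits
new = expand_upper_u(source, 8)  # proposed rule: up to eight hex digits

print(len(old), [hex(ord(c)) for c in old])  # 2 ['0x1234', '0x35']
print(len(new), [hex(ord(c)) for c in new])  # 1 ['0x12345']
```

So only strings that already place a hex digit right after a four-digit \U escape would change meaning.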

I don't see a convenient alternative though. Anyone?

--
Even got a Datapoint 3600(?) with a DD50 connector instead of the
usual DB25... what a nightmare trying to figure out the pinout
for *that* with no spex...

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Kenneth Reid Beesley

Apr 6, 2009, 4:18:46 PM
to vim_mu...@googlegroups.com, Vim Developers, Bram Moolenaar

On 6 Apr 2009, at 12:22, Tony Mechelynck wrote:

>
> Vim is now capable of displaying any Unicode codepoint for which the
> installed 'guifont' has a glyph, even outside the BMP (i.e., even
> above
> U+FFFF),

Tony,

Good news.

Many may not know that MacVim has been doing this rather well for
quite a while.
I routinely edit texts in Deseret Alphabet and Shaw (Shavian)
Alphabet, which lie in the
supplementary area.


> but there's no easy way to represent those "high" codepoints by
> Unicode value in strings: I mean, "\uxxxx" and \Uxxxx" still accept no
> more than four hex digits.
>
> I propose to keep "\uxxxx" at its present meaning, but extend
> "\Uxxxxxxxx" to allow additional hex digits (either up to a total of 8
> hex digits, in line with ^VUxxxxxxxx as opposed to ^Vuxxxx in Insert
> mode, or at least up to the value \U10FFFF,

Sounds good.

\Uxxxxxxxx is also the Python convention for representing
supplementary characters in strings.
I think it requires exactly 8 hex digits, just as \uxxxx requires
exactly four, but I'm willing to be
corrected.
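
A quick way to check this, assuming a Python 3 interpreter (the variable names below are just for the example): the literal is parsed at run time with ast.literal_eval so that the rejected form can be caught instead of stopping the script.

```python
import ast

# Eight hex digits: accepted, yields the single character U+20000.
ok = ast.literal_eval(r'"\U00020000"')
print(len(ok), hex(ord(ok)))  # 1 0x20000

# Five hex digits: rejected as a truncated escape rather than padded.
try:
    ast.literal_eval(r'"\U20000"')
    short_form_accepted = True
except SyntaxError:
    short_form_accepted = False
print(short_form_accepted)  # False
```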

The other reasonable convention is the Perl-like \x{x...}, (the prefix
\x is literally backslash,
small X) which, being delimited with curly braces, can contain any
number of hex digits
without confusing the tokenization. But your proposal is more in line
with what Vim has
already.

>
>
> I'm aware that this is an "incompatible" change, but I believe the
> risk
> is low compared with the advantages

For what it's worth, I agree.

> The notation "\<Char-0x20000>" or "\<Char-131072>" doesn't work: here
> (in my GTK2/Gnome2 gvim with 'encoding' set to UTF-8), ":echo"ing
> such a
> string displays <f0><a0><80><fe>X<80><fe>X instead of just the one CJK
> character 𠀀 (and, yes, I've set my mailer to send this post as
> UTF-8 so
> if yours is "well-behaved" it should display that character properly).

In MacVim, at least, supplementary code point values can appear
usefully in <Char- > in keymap files.
Entries like the following appear in my deseret-sampa_utf-8.vim keymap
file. It all works great.

"in out comment
i <Char-0x10428> DESERET SMALL LETTER LONG I (e.g. i in machine)
e <Char-0x10429> DESERET SMALL LETTER LONG E (e.g. a in make)
A <Char-0x1042A> DESERET SMALL LETTER LONG A (e.g. a in father)
O <Char-0x1042B> DESERET SMALL LETTER LONG AH (e.g. a in call, au in caught, British/USEastCoastCity pronunciation)
o <Char-0x1042C> DESERET SMALL LETTER LONG O (e.g. oa in boat)
u <Char-0x1042D> DESERET SMALL LETTER LONG OO (e.g. oo in boot)
Thanks to all those developers who have toiled to handle Unicode in Vim.

Ken

******************************
Kenneth R. Beesley, D.Phil.
P.O. Box 540475
North Salt Lake, UT
84054 USA

Tony Mechelynck

Apr 6, 2009, 5:19:32 PM
to vim_mu...@googlegroups.com, Vim Developers, Bram Moolenaar
On 06/04/09 22:18, Kenneth Reid Beesley wrote:
>
>
> On 6 Apr 2009, at 12:22, Tony Mechelynck wrote:
>
>>
>> Vim is now capable of displaying any Unicode codepoint for which the
>> installed 'guifont' has a glyph, even outside the BMP (i.e., even
>> above
>> U+FFFF),
>
> Tony,
>
> Good news.
>
> Many may not know that MacVim has been doing this rather well for
> quite a while.
> I routinely edit texts in Deseret Alphabet and Shaw (Shavian)
> Alphabet, which lie in the
> supplementary area.
[...]

It's actually patch 7.1.116 (30-Nov-2007). So no news-breaking scoop
anymore, but as long as Vim's support of Unicode outside the BMP was
less than optimal, the problem I'm raising in this thread might have
made itself felt less acutely.


Best regards,
Tony.
--
Joe's sister puts spaghetti in her shoes!

Tony Mechelynck

Apr 6, 2009, 5:38:29 PM
to vim_mu...@googlegroups.com, Vim Developers, Bram Moolenaar
On 06/04/09 22:18, Kenneth Reid Beesley wrote:
[...]

> In MacVim, at least, supplementary code point values can appear
> usefully in<Char-> in keymap files.
> Entries like the following appear in my deseret-sampa_utf-8.vim keymap
> file. It all works great.
[...]

In keymap files, it seems to work on Linux too (I use it in my owncoded
"phonetic" keymaps for Arabic and Russian); but I was talking of
double-quoted strings.

These Arabic and Russian keymaps don't go above U+FFFF, but anywhere above
0x7F the <Char- > notation gives me problems inside double-quoted
strings. I believe this is related to the documented fact that "\xnn"
doesn't give valid UTF-8 values above 0x7F -- use "\u00nn" instead.


Best regards,
Tony.
--
If God is perfect, why did He create discontinuous functions?

John (Eljay) Love-Jensen

Apr 7, 2009, 7:30:41 AM
to vim_mu...@googlegroups.com, Tony Mechelynck, Vim Developers
Hi Tony,


> I don't see a convenient alternative though.  Anyone?

/Uxxxx
/uxxxx
/U{x}
/U{xx}
/U{xxx}
/U{xxxx}
/U{xxxxx}
/U{xxxxxx}
/U{xxxxxxx}
/U{xxxxxxxx}
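
A rough sketch of how the braced form could be told apart from the existing one. It is written in Python for illustration only and assumes the leading character is meant to be a backslash, as in the rest of the thread; BRACED and expand_braced are hypothetical names, not anything in Vim.

```python
import re

# Hypothetical brace form: \U{...} with one to eight hex digits.
BRACED = re.compile(r"\\U\{([0-9A-Fa-f]{1,8})\}")

def expand_braced(text):
    """Expand \\U{...} escapes; anything else is left untouched."""
    return BRACED.sub(lambda m: chr(int(m.group(1), 16)), text)

# The braces mark where the number ends, so a following digit is unambiguous.
print(expand_braced(r"\U{12345}") == chr(0x12345))       # True
print(expand_braced(r"\U{1234}5") == chr(0x1234) + "5")  # True

# The unbraced form is not matched, so its present meaning is unaffected.
print(expand_braced(r"\U12345"))
```

Because the existing \Uxxxx form is not touched, strings such as "\U12345" would keep their present meaning.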

--Eljay