Suggestion: Redefine \Uxxxxx in double-quoted strings


Tony Mechelynck

Apr 6, 2009, 2:22:41 PM
to Vim Developers, Bram Moolenaar, vim_mu...@googlegroups.com
Vim is now capable of displaying any Unicode codepoint for which the
installed 'guifont' has a glyph, even outside the BMP (i.e., even above
U+FFFF), but there's no easy way to represent those "high" codepoints by
Unicode value in strings: I mean, "\uxxxx" and "\Uxxxx" still accept no
more than four hex digits.

I propose to keep "\uxxxx" at its present meaning, but extend
"\Uxxxxxxxx" to allow additional hex digits (either up to a total of 8
hex digits, in line with ^VUxxxxxxxx as opposed to ^Vuxxxx in Insert
mode, or at least up to the value \U10FFFF, above which the Unicode
Consortium has decided that "there never shall be a valid Unicode
codepoint at any future time").

I'm aware that this is an "incompatible" change, but I believe the risk
is low compared with the advantages (as a side note, many rare CJK
characters lie in Plane 2, in the "CJK Unified Ideographs Extension B"
range U+20000-U+2A6DF).

The notation "\<Char-0x20000>" or "\<Char-131072>" doesn't work: here
(in my GTK2/Gnome2 gvim with 'encoding' set to UTF-8), ":echo"ing such a
string displays <f0><a0><80><fe>X<80><fe>X instead of just the one CJK
character 𠀀 (and, yes, I've set my mailer to send this post as UTF-8 so
if yours is "well-behaved" it should display that character properly).
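[Editor's note: the expected byte sequence can be checked outside Vim. A minimal Python sketch, used here purely as a neutral illustration of what a correct UTF-8 encoder produces for U+20000; this is not Vim code:]

```python
# U+20000 is the first codepoint of CJK Unified Ideographs Extension B.
# A correct UTF-8 encoder produces the four bytes f0 a0 80 80 for it --
# not the <f0><a0><80><fe>X... byte soup reported above.
cp = 0x20000
encoded = chr(cp).encode("utf-8")
print(encoded.hex(" "))  # f0 a0 80 80
```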


Best regards,
Tony.
--
Although the moon is smaller than the earth, it is farther away.

Bram Moolenaar

Apr 6, 2009, 4:15:25 PM
to Tony Mechelynck, Vim Developers, vim_mu...@googlegroups.com

Tony Mechelynck wrote:

It does cause problems for something like "\U12345" which would now be
the character 0x1234 followed by the character 5. After the change it
would become one character 0x12345.

I don't see a convenient alternative though. Anyone?
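[Editor's note: the ambiguity is easy to demonstrate with a toy parser. A Python sketch with a hypothetical `unescape` helper; Vim's real parser is C code, so this only models the greedy-digit behaviour being discussed:]

```python
import re

def unescape(s, max_digits):
    # Toy model of the escape parser: replace \U followed by up to
    # max_digits hex digits (matched greedily) with that character.
    pat = re.compile(r"\\U([0-9a-fA-F]{1,%d})" % max_digits)
    return pat.sub(lambda m: chr(int(m.group(1), 16)), s)

# Current rule (at most 4 digits): "\U12345" is U+1234 followed by "5".
old = unescape(r"\U12345", 4)
# Extended rule (up to 8 digits): the same text is the single char U+12345.
new = unescape(r"\U12345", 8)
print(len(old), len(new))  # 2 1
```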

--
Even got a Datapoint 3600(?) with a DD50 connector instead of the
usual DB25... what a nightmare trying to figure out the pinout
for *that* with no spex...

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Matt Wozniski

Apr 6, 2009, 4:52:18 PM
to vim...@googlegroups.com
Bram Moolenaar wrote:
>
> Tony Mechelynck wrote:
>
>> Vim is now capable of displaying any Unicode codepoint for which the
>> installed 'guifont' has a glyph, even outside the BMP (i.e., even above
>> U+FFFF), but there's no easy way to represent those "high" codepoints by
>> Unicode value in strings: I mean, "\uxxxx" and "\Uxxxx" still accept no
>> more than four hex digits.
>>
>> I propose to keep "\uxxxx" at its present meaning, but extend
>> "\Uxxxxxxxx" to allow additional hex digits (either up to a total of 8
>> hex digits, in line with ^VUxxxxxxxx as opposed to ^Vuxxxx in Insert
>> mode, or at least up to the value \U10FFFF, above which the Unicode
>> Consortium has decided that "there never shall be a valid Unicode
>> codepoint at any future time".
>
> It does cause problems for something like "\U12345" which would now be
> the character 0x1234 followed by the character 5.  After the change it
> would become one character 0x12345.
>
> I don't see a convenient alternative though.  Anyone?

Well, I don't know about *convenient*, but one option would be to
continue allowing \u to use 1-to-4 hex digits, and require that \U use
exactly 8 (or exactly 6, if we only support up to \U10FFFF) hex
digits. On the one hand, it will break just about every existing
place where someone used \U instead of \u. On the other hand, the fix
is trivial, and it gives an actual reason for supporting both \u and
\U. I think it's better than the alternative you propose, since
changing the definition from "1-to-4 hex digits" to "1-to-8 hex
digits" will cause things to fail in non-obvious ways, and changing
the definition to "exactly 8 hex digits" should usually cause a more
obvious failure that we could assign a helpful error number to.
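[Editor's note: a Python sketch of the fixed-width idea, with hypothetical names; the point is that a too-short escape is rejected loudly instead of silently changing meaning:]

```python
import re

HEX8 = re.compile(r"\\U([0-9a-fA-F]{8})")
SHORT = re.compile(r"\\U(?![0-9a-fA-F]{8})")  # a \U with fewer than 8 digits

def unescape_fixed(s):
    # Reject any \U not followed by exactly 8 hex digits, so a stale
    # "\U1234" fails with an error rather than parsing differently.
    if SHORT.search(s):
        raise ValueError(r"\U requires exactly 8 hex digits")
    return HEX8.sub(lambda m: chr(int(m.group(1), 16)), s)

print(unescape_fixed(r"\U00012345") == chr(0x12345))  # True
```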

~Matt

Tony Mechelynck

Apr 6, 2009, 5:19:32 PM
to vim_mu...@googlegroups.com, Vim Developers, Bram Moolenaar
On 06/04/09 22:18, Kenneth Reid Beesley wrote:

>
>
> On 6 Apr 2009, at 12:22, Tony Mechelynck wrote:
>
>>
>> Vim is now capable of displaying any Unicode codepoint for which the
>> installed 'guifont' has a glyph, even outside the BMP (i.e., even
>> above
>> U+FFFF),
>
> Tony,
>
> Good news.
>
> Many may not know that MacVim has been doing this rather well for
> quite a while.
> I routinely edit texts in Deseret Alphabet and Shaw (Shavian)
> Alphabet, which lie in the
> supplementary area.
[...]

It's actually patch 7.1.116 (30-Nov-2007), so it's no longer a
news-breaking scoop; but as long as Vim's support of Unicode outside the
BMP was less than optimal, the problem I'm raising in this thread may
have made itself felt less acutely.


Best regards,
Tony.
--
Joe's sister puts spaghetti in her shoes!

Tony Mechelynck

Apr 6, 2009, 5:38:29 PM
to vim_mu...@googlegroups.com, Vim Developers, Bram Moolenaar
On 06/04/09 22:18, Kenneth Reid Beesley wrote:
[...]
> In MacVim, at least, supplementary code point values can appear
> usefully in<Char-> in keymap files.
> Entries like the following appear in my deseret-sampa_utf-8.vim keymap
> file. It all works great.
[...]

In keymap files it seems to work on Linux too (I use it in my own
hand-coded "phonetic" keymaps for Arabic and Russian); but I was talking
about double-quoted strings.

These Arabic and Russian keymaps don't go above U+FFFF, but anywhere
above 0x7F the <Char-> notation gives me problems inside double-quoted
strings. I believe this is related to the documented fact that "\xnn"
doesn't give valid UTF-8 values above 0x7F -- use "\u00nn" instead.
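[Editor's note: the distinction here is between a raw byte and a codepoint. A Python sketch, purely to illustrate the byte-level difference being described; Vim stores strings as bytes:]

```python
# 0xE9 as a lone raw byte is not valid UTF-8: it is a lead byte that
# promises two continuation bytes after it.
try:
    b"\xe9".decode("utf-8")
except UnicodeDecodeError as e:
    print("0xE9 alone is not valid UTF-8:", e.reason)

# The *codepoint* U+00E9, by contrast, encodes as the two bytes c3 a9.
print("\u00e9".encode("utf-8").hex(" "))  # c3 a9
```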


Best regards,
Tony.
--
If God is perfect, why did He create discontinuous functions?

Bram Moolenaar

Apr 7, 2009, 6:42:08 AM
to Matt Wozniski, vim...@googlegroups.com

Matt Wozniski wrote:

> Bram Moolenaar wrote:
> >
> > Tony Mechelynck wrote:
> >
> >> Vim is now capable of displaying any Unicode codepoint for which the
> >> installed 'guifont' has a glyph, even outside the BMP (i.e., even above
> >> U+FFFF), but there's no easy way to represent those "high" codepoints by
> >> Unicode value in strings: I mean, "\uxxxx" and "\Uxxxx" still accept no
> >> more than four hex digits.
> >>
> >> I propose to keep "\uxxxx" at its present meaning, but extend
> >> "\Uxxxxxxxx" to allow additional hex digits (either up to a total of 8
> >> hex digits, in line with ^VUxxxxxxxx as opposed to ^Vuxxxx in Insert
> >> mode, or at least up to the value \U10FFFF, above which the Unicode
> >> Consortium has decided that "there never shall be a valid Unicode
> >> codepoint at any future time".
> >
> > It does cause problems for something like "\U12345" which would now be
> > the character 0x1234 followed by the character 5.  After the change it
> > would become one character 0x12345.
> >
> > I don't see a convenient alternative though.  Anyone?
>
> Well, I don't know about *convenient*, but one option would be to
> continue allowing \u to use 1-to-4 hex digits, and require that \U use
> exactly 8 (or exactly 6, if we only support up to \U10FFFF) hex
> digits. On the one hand, it will break just about every existing
> place where someone used \U instead of \u. On the other hand, the fix
> is trivial, and it gives an actual reason for supporting both \u and
> \U. I think it's better than the alternative you propose, since
> changing the definition from "1-to-4 hex digits" to "1-to-8 hex
> digits" will cause things to fail in non-obvious ways, and changing
> the definition to "exactly 8 hex digits" should usually cause a more
> obvious failure that we could assign a helpful error number to.

Requiring exactly 8 hex digits helps with the incompatibility. However,
any Unicode character needs at most 6 digits, so one has to type two
more. And it's easy to type the wrong number of digits in such a long
sequence.

The other suggestion, about Perl, gives me this idea: "\x(123456)".
This has two advantages:
1. It's backwards compatible.
2. Avoids accidentally typing the wrong number of hex digits.
3. Allows typing a hex digit next as a separate character.

Eh, _three_ advantages.

I think Perl uses "\x{123456}", but () is easier to type than {},
especially on some keyboards. I don't see a reason to use {}.
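[Editor's note: a delimited form is also trivial to parse. A Python sketch of the "\x(...)" idea with a hypothetical `unescape_delim` helper; the closing ")" makes the extent of the escape explicit, so digit count no longer matters and a hex digit can follow directly:]

```python
import re

DELIM = re.compile(r"\\x\(([0-9a-fA-F]+)\)")

def unescape_delim(s):
    # Replace \x(...) with the character for the enclosed hex value.
    return DELIM.sub(lambda m: chr(int(m.group(1), 16)), s)

# The "a" after the closing paren is never swallowed into the number.
print(len(unescape_delim(r"\x(20000)a")))  # 2
```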

--
Not too long ago, unzipping in public was illegal...

Rhialto

Apr 10, 2009, 6:34:08 AM
to vim...@googlegroups.com
On Tue 07 Apr 2009 at 12:42:08 +0200, Bram Moolenaar wrote:
> This has two advantages:
> 1. It's backwards compatible.
> 2. Avoids accidentally typing the wrong number of hex digits.
> 3. Allows typing a hex digit next as a separate character.
>
> Eh, _three_ advantages.

Nobody expects the Spanish Inquisition!

-Olaf.
--
___ Olaf 'Rhialto' Seibert -- You author it, and I'll reader it.
\X/ rhialto/at/xs4all.nl -- Cetero censeo "authored" delendum esse.
