I propose to keep "\uxxxx" at its present meaning, but extend
"\Uxxxxxxxx" to allow additional hex digits (either up to a total of 8
hex digits, in line with ^VUxxxxxxxx as opposed to ^Vuxxxx in Insert
mode, or at least up to the value \U10FFFF, above which the Unicode
Consortium has decided that "there never shall be a valid Unicode
codepoint at any future time").
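For illustration, here is the behaviour I have in mind (a sketch only;
the 8-digit form is exactly what current Vim does _not_ accept yet):

    :echo "\u20ac"          " works today: up to 4 hex digits, € (U+20AC)
    :echo "\U00020000"      " proposed: up to 8 hex digits, one char U+20000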
I'm aware that this is an "incompatible" change, but I believe the risk
is low compared with the advantages (as a side note, many rare CJK
characters lie in plane 2, in the "CJK Unified Ideographs Extension B"
range
U+20000-U+2A6DF).
The notation "\<Char-0x20000>" or "\<Char-131072>" doesn't work: here
(in my GTK2/Gnome2 gvim with 'encoding' set to UTF-8), ":echo"ing such a
string displays <f0><a0><80><fe>X<80><fe>X instead of just the one CJK
character 𠀀 (and, yes, I've set my mailer to send this post as UTF-8 so
if yours is "well-behaved" it should display that character properly).
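For the record, this is all it takes to reproduce here (a
reconstruction of my session, with 'encoding' already set to UTF-8):

    :echo "\<Char-0x20000>"     " hex form: byte soup instead of 𠀀
    :echo "\<Char-131072>"      " decimal form: same byte soup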
Best regards,
Tony.
--
Although the moon is smaller than the earth, it is farther away.
It does cause problems for something like "\U12345", which currently
means the character 0x1234 followed by the character 5. After the
change it would become the single character 0x12345.
I don't see a convenient alternative though. Anyone?
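To make the incompatibility concrete (expected behaviour as described
above, not a verified transcript):

    :echo "\U12345"     " today: U+1234 followed by a literal '5'
                        " after the change: one character U+12345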
--
Even got a Datapoint 3600(?) with a DD50 connector instead of the
usual DB25... what a nightmare trying to figure out the pinout
for *that* with no spex...
/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///
Well, I don't know about *convenient*, but one option would be to
continue allowing \u to use 1-to-4 hex digits, and require that \U use
exactly 8 (or exactly 6, if we only support up to \U10FFFF) hex
digits. On the one hand, it will break just about every existing
place where someone used \U instead of \u. On the other hand, the fix
is trivial, and it gives an actual reason for supporting both \u and
\U. I think it's better than the alternative you propose, since
changing the definition from "1-to-4 hex digits" to "1-to-8 hex
digits" will cause things to fail in non-obvious ways, and changing
the definition to "exactly 8 hex digits" should usually cause a more
obvious failure that we could assign a helpful error number to.
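Concretely, my sketch of the fixed-width rule (not an existing
feature, and the exact error message is up for grabs):

    :echo "\u00e9"          " \u unchanged: 1 to 4 hex digits
    :echo "\U00020000"      " \U valid: exactly 8 hex digits
    :echo "\U20000"         " \U invalid: too few digits, clear error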
~Matt
It's actually patch 7.1.116 (30-Nov-2007), so this is no longer
breaking news; but as long as Vim's support for Unicode outside the
BMP was less than optimal, the problem I'm raising in this thread may
have made itself felt less acutely.
Best regards,
Tony.
--
Joe's sister puts spaghetti in her shoes!
In keymap files, it seems to work on Linux too (I use it in my own
hand-coded "phonetic" keymaps for Arabic and Russian); but I was
talking about double-quoted strings.
These Arabic and Russian keymaps don't go above U+FFFF, but anywhere above
0x7F the <Char- > notation gives me problems inside double-quoted
strings. I believe this is related to the documented fact that "\xnn"
doesn't give valid UTF-8 values above 0x7F -- use "\u00nn" instead.
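A quick way to see the difference (with 'encoding' set to utf-8; what
Vim displays for the stray byte may vary):

    :echo "\xe9"        " one raw byte 0xE9, not valid UTF-8 by itself
    :echo "\u00e9"      " the proper two-byte UTF-8 sequence for é (U+00E9)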
Best regards,
Tony.
--
If God is perfect, why did He create discontinuous functions?
> Bram Moolenaar wrote:
> >
> > Tony Mechelynck wrote:
> >
> >> Vim is now capable of displaying any Unicode codepoint for which the
> >> installed 'guifont' has a glyph, even outside the BMP (i.e., even above
> >> U+FFFF), but there's no easy way to represent those "high" codepoints by
> >> Unicode value in strings: I mean, "\uxxxx" and "\Uxxxx" still accept no
> >> more than four hex digits.
> >>
> >> I propose to keep "\uxxxx" at its present meaning, but extend
> >> "\Uxxxxxxxx" to allow additional hex digits (either up to a total of 8
> >> hex digits, in line with ^VUxxxxxxxx as opposed to ^Vuxxxx in Insert
> >> mode, or at least up to the value \U10FFFF, above which the Unicode
> >> Consortium has decided that "there never shall be a valid Unicode
> >> codepoint at any future time").
> >
> > It does cause problems for something like "\U12345", which currently
> > means the character 0x1234 followed by the character 5. After the
> > change it would become the single character 0x12345.
> >
> > I don't see a convenient alternative though. Anyone?
>
> Well, I don't know about *convenient*, but one option would be to
> continue allowing \u to use 1-to-4 hex digits, and require that \U use
> exactly 8 (or exactly 6, if we only support up to \U10FFFF) hex
> digits. On the one hand, it will break just about every existing
> place where someone used \U instead of \u. On the other hand, the fix
> is trivial, and it gives an actual reason for supporting both \u and
> \U. I think it's better than the alternative you propose, since
> changing the definition from "1-to-4 hex digits" to "1-to-8 hex
> digits" will cause things to fail in non-obvious ways, and changing
> the definition to "exactly 8 hex digits" should usually cause a more
> obvious failure that we could assign a helpful error number to.
Requiring exactly 8 hex digits helps with the incompatibility. However,
most Unicode characters need at most 6 digits, so one has to type two
more. And it's easy to type the wrong number of digits in such a long
sequence.
The other suggestion, about Perl, gives me this idea: "\x(123456)".
This has two advantages:
1. It's backwards compatible.
2. Avoids accidentally typing the wrong number of hex digits.
3. Allows a literal hex digit right after it as a separate character.
Eh, _three_ advantages.
I think Perl uses "\x{123456}", but () is easier to type than {},
especially on some keyboards. I don't see a reason to use {}.
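Side by side (the "\x(...)" form is only the suggestion above, nothing
any Vim implements yet):

    Perl:           "\x{20000}"
    suggested Vim:  "\x(20000)"

Both delimit the digits, so "\x(41)1" would unambiguously mean "A1".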
--
Not too long ago, unzipping in public was illegal...
Nobody expects the Spanish Inquisition!
-Olaf.
--
___ Olaf 'Rhialto' Seibert -- You author it, and I'll reader it.
\X/ rhialto/at/xs4all.nl -- Cetero censeo "authored" delendum esse.