Word boundary does not work when using some weird Unicode chars with the 'contained' syntax

Jacky Liu

Feb 7, 2015, 4:16:21 PM

to vim...@googlegroups.com

Here is the VimL code I wrote:

" Use some weird Unicode chars to mark the region; '+' is included here for contrast.
syntax region myCmdLine matchgroup=myCmdLine_ start=/[⣱+]/ end=/[⡇⡗⡧+]/
hi link myCmdLine _LightGreen_233b5a
hi link myCmdLine_ Normal

syntax keyword myCmdName man bind less containedin=myCmdLine contained
hi link myCmdName _Green_233b5a

And here is its effect on some simple demonstration text (see the attached image file).

With '+' as the marker, all three syntax keywords are correctly recognized, but not with the unusual Unicode chars.

Another thing: using '*' to do a quick search works normally, as does the following search command:

/\<man\|bind\|less\>

Neither the 'iskeyword' nor the 'regexpengine' option seems to have any effect here.

Should this be considered a bug?

Screenshot from 2015-02-08 05:12:01.png

Ben Fritz

Feb 7, 2015, 10:31:48 PM

to vim...@googlegroups.com

Try it again, with an appropriate scriptencoding command in the file, to tell Vim how to interpret the bytes in the file.

Jacky Liu

Feb 8, 2015, 2:05:02 PM

to vim...@googlegroups.com

OK, I added a modeline to my text file to tell Vim the fileencoding explicitly:

/* Vim: set fileencoding=utf-8: */
/* Vim: set tabstop=4: */

⣱man bind | less⡇

+man bind | less+

The 'tabstop' option is there to make sure the modeline works. The result shows no difference.

For your reference: the fileencoding in my daily use of Vim is almost always utf-8.

All the best ~

Ben Fritz

Feb 8, 2015, 11:27:32 PM

to vim...@googlegroups.com

On Sunday, February 8, 2015 at 1:05:02 PM UTC-6, Jacky Liu wrote:
>
> OK, I added a modeline to my text file to tell Vim the fileencoding explicitly:
>
> /* Vim: set fileencoding=utf-8: */
> /* Vim: set tabstop=4: */
>
> ⣱man bind | less⡇
>
> +man bind | less+
>
> The 'tabstop' option is there to make sure the modeline works. The result shows no difference.
>
> For your reference: the fileencoding in my daily use of Vim is almost always utf-8.
>

I was not talking about fileencoding; I was talking about the :scriptencoding command, which you embed in a script to tell Vim how to interpret the bytes that follow it.

But I tried it, and I see the problem you describe anyway. So encoding is not the issue.
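
For reference, a minimal sketch of what embedding :scriptencoding looks like, assuming the script file itself is saved as UTF-8. As noted above, it does not change the outcome in this case, but it tells Vim how to read the multibyte literals in the pattern:

```vim
" Declare the encoding of this script before any multibyte literals appear.
" Assumption: the file is saved as UTF-8.
scriptencoding utf-8

" Region definition from the original post, unchanged.
syntax region myCmdLine matchgroup=myCmdLine_ start=/[⣱+]/ end=/[⡇⡗⡧+]/
```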

I did note that putting any other text between your marker characters and your keywords lets the highlighting work as expected. E.g. this line highlights properly:

⣱ man bind | less ⡇

But not this one:

⣱man bind | less⡇

I think this is because of the way :syn-keyword is built to also allow multibyte
characters. I'm not sure whether it should be considered a bug or not but it
certainly is not expected. However I don't know of an elegant solution for it,
other than using "match" instead of "keyword".

:help :syn-define says:
> 1. Keyword
> It can only contain keyword characters, according to the 'iskeyword'
> option. It cannot contain other syntax items. It will only match with a
> complete word (there are no keyword characters before or after the match).
> The keyword "if" would match in "if(a=b)", but not in "ifdef x", because
> "(" is not a keyword character and "d" is.

In other words, a keyword is matched only when there are no "keyword"
characters directly before or after the match.

:help E789 (within :help :syn-keyword)
> Don't forget that a keyword can only be recognized if all the
> characters are included in the 'iskeyword' option. If one character
> isn't, the keyword will never be recognized.
> Multi-byte characters can also be used. These do not have to be in
> 'iskeyword'.

But here, we find that keyword characters for the sake of :syn-keyword include
both 'iskeyword' and also *any multibyte character*.

So in your case, the special multibyte text counts as a keyword character for
syntax purposes. Therefore the actual keywords must be separated from the
special marker text.
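
The "match" workaround mentioned above could look roughly like the sketch below. It reuses the group names from the original post and relies on the observation, reported earlier in the thread, that a search using \< and \> does find the words next to the marker characters. It is an untested sketch, not a verified fix:

```vim
scriptencoding utf-8

" Region definition from the original post, unchanged.
syntax region myCmdLine matchgroup=myCmdLine_ start=/[⣱+]/ end=/[⡇⡗⡧+]/
hi link myCmdLine _LightGreen_233b5a
hi link myCmdLine_ Normal

" Replace ':syntax keyword' with ':syntax match'.
" \%( ... \) groups the alternatives so that \< and \> apply to every word.
syntax match myCmdName /\<\%(man\|bind\|less\)\>/ contained containedin=myCmdLine
hi link myCmdName _Green_233b5a
```

The grouping matters: without it, \< binds only to the first alternative and \> only to the last.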

Jacky Liu

Feb 9, 2015, 12:11:32 PM

to vim...@googlegroups.com

I see: any multi-byte character is treated as a keyword character, so that quick search for Unicode strings works as expected,

such that '字符' will be highlighted by quick search in '<<<字符>>>'.

And there is an implicit rule that a word boundary applies where regular keyword characters and multi-byte characters meet,

such that '字符' will also be highlighted by quick search in 'abc字符xyz'.

But that implicit rule does not apply to syntax keywords.

I shall use "syntax match" in place of my "syntax keyword" definitions, and my scripts should work again.

Many thanks ~
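
One way to check the search-side behaviour described above is with matchstr(), which uses the same regexp engine as '/' and '*'. This is a small sketch, assuming 'encoding' is utf-8; the test strings are taken from this thread. If the description holds, both calls should return a non-empty match:

```vim
scriptencoding utf-8

" Boundary between ASCII word characters and CJK characters.
echo matchstr('abc字符xyz', '\<字符\>')

" Boundary between the marker characters and the command names.
echo matchstr('⣱man bind | less⡇', '\<\%(man\|bind\|less\)\>')
```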

Jacky Liu

Feb 24, 2015, 12:08:57 PM

to vim...@googlegroups.com

Update:

I've found a solution. Although it involves a slight modification to the Vim source, it solves the problem without any apparent side effects.

The method is to change the classification of certain characters as desired, by modifying this file: vim74/src/mbyte.c:

/*
 * Get class of a Unicode character.
 * 0: white space
 * 1: punctuation
 * 2 or bigger: some class of word character.
 */
    int
utf_class(c)
    int c;
{
    /* sorted list of non-overlapping intervals */
    static struct clinterval
    {
        unsigned short first;
        unsigned short last;
        unsigned short class;
    } classes[] =
    {
        {0x037e, 0x037e, 1}, /* Greek question mark */
        {0x0387, 0x0387, 1}, /* Greek ano teleia */
        {0x055a, 0x055f, 1}, /* Armenian punctuation */
        {0x0589, 0x0589, 1}, /* Armenian full stop */
        {0x05be, 0x05be, 1},
        {0x05c0, 0x05c0, 1},
        ... ...

The list above in mbyte.c defines ranges of characters within the Unicode table and how each range is classified. Changing the last value of an entry to 1 makes that range count as punctuation; after recompiling and installing, word boundaries apply where those characters appear.

There's another data structure in the same file which specifies the display width of characters:

/*
 * For UTF-8 character "c" return 2 for a double-width character, 1 for others.
 * Returns 4 or 6 for an unprintable character.
 * Is only correct for characters >= 0x80.
 * When p_ambw is "double", return 2 for a character with East Asian Width
 * class 'A'(mbiguous).
 */
    int
utf_char2cells(c)
    int c;
{
    /* Sorted list of non-overlapping intervals of East Asian double width
     * characters, generated with ../runtime/tools/unicode.vim. */
    static struct interval doublewidth[] =
    {
        {0x1100, 0x115f},
        {0x11a3, 0x11a7},
        {0x11fa, 0x11ff},
        {0x2329, 0x232a},
        {0x2e80, 0x2e99},
        {0x2e9b, 0x2ef3},
        ... ...

Characters in this list are drawn double width. As the function comment above says, characters of East Asian Width class 'A' (ambiguous) are additionally drawn double width when the 'ambiwidth' option is set to "double".

The Unicode table is so immense that no single classification of characters can suit everybody, so I think local modifications like the above are sometimes unavoidable.

Thank you ~
