[gvim] Multi-item is not correctly applied

hebar...@googlemail.com

Dec 29, 2020, 6:43:31 AM

to vim_dev

[[:upper:]]\{2,} is not correctly applied, resulting in not finding what is searched for...

Please refer to the below text fragment:
--------------------------------------------------------------------------
" Version: GVim 8.2.2148
" OS:      Windows 7, 64-bit

" Test pattern
05. ПЕСНЯ О ГЕРОЯХ муз. А. Давиденко, М. Коваля и Б. Шехтера ...
05. PJESNJA O GJEROJAKH mus. A. Davidjenko, M. Kovalja i B. Shjekhtjera ...

" Use these as search expressions
/\<[[:upper:]]\+\>           " Finds all uppercase letters
/\<[[:upper:]]\{2,}\>       " Not finding what is searched for(!)
/\<[А-Я]\{2,}\>                " Finds the specified range of cyrillic letters
--------------------------------------------------------------------------

Christian Brabandt

Dec 29, 2020, 11:01:43 AM

to vim_dev

On Tue, 29 Dec 2020, 'hebar...@googlemail.com' via vim_dev wrote:

> [[:upper:]]\{2,} is not correctly applied, resulting in not finding what is searched for...

I suppose the problem is that the second and fourth words in the input
aren't matched?

> 05. ПЕСНЯ О ГЕРОЯХ муз. А. Давиденко, М. Коваля и Б. Шехтера ...
      ^^^^^   ^^^^^^

That is an interesting case. There are 2 peculiarities here:

By default, Vim comes with two different regexp engines, which you can
switch using the 'regexpengine' option. (See :h 'regexpengine' and
:h two-engines)

By default, it uses automatic mode, which usually selects the NFA
engine; only for some costly patterns does it fall back to the old
backtracking engine.
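
As a rough illustration of that selection logic, here is a minimal C sketch. It is not Vim's code: the enum and function names are made up for this example, and `nfa_compile_ok` is a hypothetical input standing for "the NFA engine accepted the pattern". Only the numeric values follow :h 'regexpengine' (0 automatic, 1 backtracking, 2 NFA).

```c
#include <stdbool.h>

/* Values of the 'regexpengine' option (see :h 'regexpengine').
 * The identifiers themselves are hypothetical. */
enum sketch_engine {
    SKETCH_AUTO = 0,         /* automatic selection */
    SKETCH_BACKTRACKING = 1, /* old backtracking engine */
    SKETCH_NFA = 2           /* NFA engine */
};

/* Return the engine that ends up handling a pattern.
 * nfa_compile_ok: assumed flag telling whether the NFA engine
 * was able (or willing) to compile the pattern. */
enum sketch_engine sketch_engine_used(enum sketch_engine mode,
                                      bool nfa_compile_ok)
{
    if (mode == SKETCH_BACKTRACKING)
        return SKETCH_BACKTRACKING;
    if (mode == SKETCH_NFA)
        return SKETCH_NFA;       /* forced: no fallback in this sketch */
    /* Automatic mode: try NFA first, fall back if it gives up. */
    return nfa_compile_ok ? SKETCH_NFA : SKETCH_BACKTRACKING;
}
```

With this model, `sketch_engine_used(SKETCH_AUTO, false)` yields `SKETCH_BACKTRACKING`, which is the situation that matters for the pattern in this thread.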

For some reason, the NFA engine, when used in automatic mode, fails to
compile this regex (however it doesn't mention that it switches the
engines :/). I see this in the logfile:

,----
| >>> NFA engine failed...
| Regexp: "\<[[:upper:]]\{2,}\>"
| Postfix notation (char): "NFA_BOW , NFA_START_COLL, NFA_CLASS_UPPER, NFA_CONCAT , NFA_END_COLL, "
| Postfix notation (int): -1006 -1021 -831 -1014 -1020
`----

Vim then falls back to the backtracking engine (I am not sure why,
because it doesn't call `report_re_switch()`). That engine handles a
POSIX character class by adding every character between 1 and 255 that
is upper case to one big or-branch. I believe a character range can
contain at most 256 characters, and I suppose it stops at 256 because
of the old 8-bit encodings. That's why upper-case characters beyond
that range are not found.
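
A minimal C sketch of the expansion as described above may make the effect concrete. It rests on assumptions: the function names and the `SKETCH_MAX_CLASS_CHAR` constant are hypothetical, and `isupper()` in the default "C" locale is only a stand-in for whatever upper-case test Vim applies. The point is that a loop bounded at 255 can never add a Cyrillic code point such as U+041F (П).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Upper bound of the loop, per the description in the thread. */
#define SKETCH_MAX_CLASS_CHAR 255

/* Collect every code point in 1..255 that counts as upper case.
 * Returns the number of entries written to out (at most cap). */
size_t sketch_expand_upper(unsigned int *out, size_t cap)
{
    size_t n = 0;

    for (unsigned int c = 1; c <= SKETCH_MAX_CLASS_CHAR; c++)
        if (isupper((int)c) && n < cap)
            out[n++] = c;
    return n;
}

/* Check whether a code point is a member of the expanded class. */
bool sketch_class_contains(unsigned int cp)
{
    unsigned int set[SKETCH_MAX_CLASS_CHAR];
    size_t n = sketch_expand_upper(set, SKETCH_MAX_CLASS_CHAR);

    for (size_t i = 0; i < n; i++)
        if (set[i] == cp)
            return true;
    return false;
}
```

Here `sketch_class_contains('A')` is true, while `sketch_class_contains(0x041F)` is false no matter what the upper-case test says, because the loop never reaches that code point.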

However, if you manually switch to the NFA regexp engine, it starts to
work again. I am a bit puzzled why compiling the regex works this
time.

I think an alternative (and faster) way would be to use the \u atom
instead of `[[:upper:]]`.

Best,
Christian

Christian Brabandt

Dec 29, 2020, 11:26:36 AM

to vim_dev

That is because of this part in the code:

https://github.com/vim/vim/blob/89015a675990bd7d70e041c5d890edb803b5c6b7/src/regexp_nfa.c#L2138-L2143

    // The engine is very inefficient (uses too many states) when the
    // maximum is much larger than the minimum and when the maximum is
    // large. Bail out if we can use the other engine.
    if ((nfa_re_flags & RE_AUTO)
            && (maxval > 500 || maxval > minval + 200))
        return FAIL;

So it only fails when automatic engine selection is active. If you
manually force the NFA engine (:set regexpengine=2), it will go ahead
and create that many states.
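
The condition quoted above can be tried in isolation with a small C sketch. The assumptions are loud ones: `SKETCH_RE_AUTO` and `SKETCH_NO_LIMIT` are hypothetical stand-ins, not Vim's definitions, and the sketch assumes that an open-ended multi such as `\{2,}` is stored with a very large maximum. Only the comparison itself is taken from the snippet.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real definitions live in Vim's sources. */
#define SKETCH_RE_AUTO  0x01     /* assumed bit for "automatic engine" */
#define SKETCH_NO_LIMIT LONG_MAX /* assumed "no upper bound" sentinel */

/* Mirror of the quoted check: true when the NFA compile step would
 * give up so that the other engine can take over. */
bool sketch_nfa_bails_out(int re_flags, long minval, long maxval)
{
    return (re_flags & SKETCH_RE_AUTO)
        && (maxval > 500 || maxval > minval + 200);
}
```

Under these assumptions, `sketch_nfa_bails_out(SKETCH_RE_AUTO, 2, SKETCH_NO_LIMIT)` is true (the `\{2,}` case in automatic mode), whereas the same limits with flags of 0 (engine forced) give false, and a bounded `\{2,5}` in automatic mode also gives false.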

Best,
Christian

hebar...@googlemail.com

Dec 29, 2020, 2:37:00 PM

to vim_dev

Thanks for looking into this!
Although this is a work-around, it helps...

Christian Brabandt

Dec 29, 2020, 2:39:37 PM

to vim_dev

On Tue, 29 Dec 2020, 'hebar...@googlemail.com' via vim_dev wrote:

> Thanks for looking into this!
> Although this is a work-around, it helps...

I proposed some changes here: https://github.com/vim/vim/pull/7572

Best,
Christian
