regexp .\{1,9000} is much slower with new regexp engine than with old regexp engine

77 views
Skip to first unread message

Dominique Pellé

unread,
Jan 21, 2015, 6:28:40 PM1/21/15
to vim_dev
Hi

The regexp .\{1,9000} is very slow with the new regexp engine
compared to the old engine. Compare the timing of those 2
commands where the only difference is set re=1 vs set re=2:

$ time vim -X -u NONE \
-c 'set re=1' \
-c 'call feedkeys("ihello\<esc>/.\\{1,9000}\<cr>:q!\<cr>")'

real 0m0.004s
user 0m0.000s
sys 0m0.004s


$ time vim -X -u NONE \
-c 'set re=2' \
-c 'call feedkeys("ihello\<esc>/.\\{1,9000}\<cr>:q!\<cr>")'

real 0m6.499s
user 0m6.432s
sys 0m0.020s

=> new regexp engine is 1624x times times slower here.

Is this a known limitation of the new NFA regexp engine?

Regards
Dominique

Dominique Pellé

unread,
Jan 21, 2015, 6:56:06 PM1/21/15
to vim_dev
Changing the value of n in the regexp .\{1,n}
I see that the time it takes to match quadruples
when doubling n with the new regexp engine.

=> This means that complexity is O(n^2):

With the old regexp engine, the time is constant.

regexp re=1 re=2
------------------------------
.\{1,1000} 0.004s 0.064s
.\{1,2000} 0.005s 0.305s
.\{1,4000} 0.005s 1.230s
.\{1,8000} 0.005s 4.799s
.\{1,16000} 0.005s 20.902s

This was measured using for n=16000 for example using:

$ time vim -X -u NONE -c 'set re=2' \
-c 'call feedkeys("ihello\<esc>/.\\{1,16000}\<cr>:q!\<cr>")'

Regards
Dominique

Bram Moolenaar

unread,
Jan 22, 2015, 1:02:57 PM1/22/15
to Dominique Pellé, vim_dev
Is this specific for matching with "."?

I would expect the old engine to be slower when what's being matched may
fail.

The problem is that the engine tries to find the longest match, which
may start at any character. Thus for every character there is another
state to be kept. It's not so easy to think of a way that works better
and works in general. There are two special cases here:
- Matching dot, which matches everywhere. If it would be more specific,
such as "[abc]" there wouldn't so many states.
- Nothing follows, thus the match that started first will always win.
If something follows the old engine is probably much slower.


--
A meeting is an event at which the minutes are kept and the hours are lost.

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Dominique Pellé

unread,
Jan 22, 2015, 3:40:58 PM1/22/15
to Bram Moolenaar, vim_dev
It's also slow with at least:

re=1 re=2
----------------------------------
[a-z]\{1,8000} 0.004s 5.890s
\h\{1,8000} 0.004s 5.824s

I see that not only the complexity to match is O(n^2)
with the new regexp engine for the regexp .\{1,n},
but the complexity is also linear with the length N
of the line being matched. So the complexity seems
to be: O(N*n^2).

Time to match .\{1,8000} with line length of 1, 2, 4, 8, 16, 32:

line to match re=1 re=2
----------------------------------------------------
1 0.004s 0.016s
12 0.004s 1.527s
1234 0.004s 4.391s
12345678 0.004s 10.167s
1234567890123456 0.004s 21.466s
12345678901234567890123456789012 0.004s 44.849s

This was measured using for example:

$ time vim -X -u NONE -c 'set re=2' \
-c 'call feedkeys("i12345678\<esc>/.\\{1,8000}\<cr>:q!\<cr>")'

real 0m10.167s
user 0m10.085s
sys 0m0.024s

The non-greedy version is fast:

$ time vim -X -u NONE -c 'set re=2' \
-c 'call feedkeys("i12345678\<esc>/.\\{-1,8000}\<cr>:q!\<cr>")'

real 0m0.056s
ser 0m0.020s
sys 0m0.028s

> I would expect the old engine to be slower when what's being matched may
> fail.

I cannot find such an example. The old regexp engine
seems always much faster with things like \{1,8000}

Dominique

Ben Fritz

unread,
Jan 22, 2015, 3:54:40 PM1/22/15
to vim...@googlegroups.com, Br...@moolenaar.net
Trying on Windows 7 64-bit with 64-bit HUGE Vim 7.4.522 shows it taking a REALLY long time just to match "1234567890" (one line in the buffer) with pattern ".\{1,8000}".

Furthermore, I can't seem to interrupt the search with <C-C>, and I get a "search hit BOTTOM, continuing at TOP" message before it finally matches.

Anchoring the match by changing the pattern to ".\{1,8000}0" or ".\{1,8000}$" does not seem to help. Even anchoring at both ends, "^.\{1,8000}$", doesn't help very noticeably.

Setting re=1 makes the search pretty much instantaneous.

Omitting the "8000" makes the search just as fast with re=2. I.e. this pattern is fast: ".\{1,}"

Bram Moolenaar

unread,
Jan 23, 2015, 6:37:55 AM1/23/15
to Ben Fritz, vim...@googlegroups.com
Yeah, if you leave out the limit then the states are all the same. With
the limit the state differs depending on where the match starts.

It's going to be difficult to find an efficient way to implement this
with NFA. We might need to detect this part of the RE and fall back to
the old engine. I would actually think that the limit that Christian
added would work for this as well, could be useful to figure out why it
doesn't run into the limit on number of states. Perhaps because the
number of states is not going above the upper limit, but it's slow
because every state changes at every character.

--
hundred-and-one symptoms of being an internet addict:
111. You and your friends get together regularly on IRC, even though
all of you live in the same city.
Reply all
Reply to author
Forward
0 new messages