[vim/vim] perf: syntax highlighting tries patterns that cannot match a line (PR #20371)


h_east

May 29, 2026, 9:09:58 AM
to vim/vim, Subscribed
Problem:  Syntax highlighting spends time matching patterns on lines where
          they cannot possibly match, which is noticeable in large files
          with many syntax items.
Solution: Before running a pattern's regexp, skip it when the bytes it
          requires are absent from the line (a per-pattern lead-byte
          prefilter derived from the pattern at definition time).  The
          resulting highlighting is unchanged.

You can view, comment on, or merge this pull request online at:

  https://github.com/vim/vim/pull/20371

Commit Summary

  • ebe4087 syntax highlighting tries patterns that cannot match a line

File Changes

(2 files)

Reply to this email directly, view it on GitHub, or unsubscribe.
Triage notifications, keep track of coding agent tasks and review pull requests on the go with GitHub Mobile for iOS and Android. Download it today!
You are receiving this because you are subscribed to this thread.Message ID: <vim/vim/pull/20371@github.com>

h_east

May 29, 2026, 9:12:21 AM
to vim/vim, Subscribed
h-east left a comment (vim/vim#20371)

Measurement results

Build: ./configure default CFLAGS (-g -O2), no profiling.

Method: force full-buffer syntax highlighting by calling synID() on every
line/column, measure wall time with reltime(), median of several runs.
Baseline = the same tree without the prefilter.

Full-buffer highlight:

file (filetype)            lines    baseline   with prefilter   change
big.c, C (concatenated)    99,192   ~6.70 s    ~3.00 s          ~55% faster (~2.2x)
src/evalfunc.c, C          12,919   ~0.92 s    ~0.44 s          ~52% faster (~2.1x)
netrw.vim, Vim script       9,717   ~5.8 s     ~5.2 s           ~11% faster

Mechanism: on a typical C buffer about 40% of regexp executions are patterns
that never match on that line (for example character/string-constant and
preprocessor patterns that are tried on every line). The prefilter removes
most of them; the share of regexp time spent on never-matching patterns drops
from ~40% to ~15% (measured with :syntime).
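
The skip test described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the names byteset_T, line_bytes() and may_match() are hypothetical, and the "required" string stands in for whatever the definition-time analysis actually produces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 256-bit set, one bit per byte value (hypothetical helper type). */
typedef struct { uint32_t bits[8]; } byteset_T;

static void byteset_add(byteset_T *s, unsigned char c)
{
    s->bits[c >> 5] |= (uint32_t)1 << (c & 31);
}

static bool byteset_has(const byteset_T *s, unsigned char c)
{
    return ((s->bits[c >> 5] >> (c & 31)) & 1u) != 0;
}

/* Collect the bytes that occur in "line"; computed once per line. */
static byteset_T line_bytes(const char *line)
{
    byteset_T s;

    assert(line != NULL);
    memset(&s, 0, sizeof(s));
    for (const unsigned char *p = (const unsigned char *)line; *p != 0; ++p)
        byteset_add(&s, *p);
    return s;
}

/*
 * "required" lists bytes that every match must contain (assumed output of
 * the analysis).  NULL means "unknown", so the pattern is never skipped.
 * Returns false only when a required byte is absent from the line.
 */
static bool may_match(const char *required, const byteset_T *present)
{
    if (required == NULL)
        return true;
    for (const unsigned char *p = (const unsigned char *)required; *p != 0; ++p)
        if (!byteset_has(present, *p))
            return false;
    return true;
}
```

With the set built once per line, each pattern costs a few bit tests before the regexp engine is involved; a line such as "int x = 0;" would let a pattern requiring '#' or '"' be skipped outright.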

The gain depends on the filetype. For regexp-pattern-heavy syntaxes (C/C++)
with many never-matching patterns it is large. For keyword-heavy syntaxes
(Vim script), where much of the work is keyword lookup rather than regexp
matching, the overall gain is smaller, but regexp executions are still cut
substantially with identical highlighting (netrw.vim: 1,680,030 -> 1,024,966
regexp calls, about 39% fewer).

Correctness: byte-for-byte identical synID() output vs. the baseline across
C, C++, Vim script, Python, Ruby, Lua, JavaScript, shell, HTML and CSS
(millions of cells), including multibyte content, very long lines and a
reduced 'synmaxcol'.

Tests: test_syntax (including a new Test_syntax_lead_byte_prefilter regression
test), test_highlight, test_spell and test_textprop all pass.

Notes

  • Observable change: patterns skipped by the prefilter are no longer counted
    by :syntime (they genuinely do not run). Test_syntime was updated to
    highlight the whole buffer so the profiled patterns still appear.
  • The analyzer bails out (pattern is always tried, behaviour unchanged) on
    constructs it does not model: top-level "\&", look-around "\@", multi-line
    "\_x", the case/magic flags "\c \C \v \V \m \M", "\Z", multibyte literals
    and groups nested deeper than the recursion cap.
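
A rough sketch of such a bail-out check is shown below. It is only an assumption about the shape of the logic: must_bail() is a hypothetical name, it rejects the listed escapes anywhere in the pattern rather than only at top level, and it does not model the nesting-depth cap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true when "pat" contains a construct from the bail-out list, in
 * which case no prefilter is built and the pattern is always tried.
 */
static bool must_bail(const char *pat)
{
    assert(pat != NULL);
    for (const unsigned char *p = (const unsigned char *)pat; *p != 0; ++p)
    {
        if (*p >= 0x80)
            return true;        /* non-ASCII byte: multibyte literal */
        if (*p != '\\')
            continue;
        ++p;                    /* look at the escaped character */
        if (*p == 0)
            return true;        /* trailing backslash: be conservative */
        if (strchr("&@_cCvVmMZ", *p) != NULL)
            return true;        /* \& \@ \_x \c \C \v \V \m \M \Z */
    }
    return false;
}
```

Returning true here only costs the speed-up for that pattern, which is why erring on the side of bailing is the cheap direction to be wrong in.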


Message ID: <vim/vim/pull/20371/c4575445100@github.com>

h_east

May 29, 2026, 11:20:58 AM
to vim/vim, Subscribed
h-east left a comment (vim/vim#20371)

Self-review: maintainability of the prefilter analyzer, and how I plan to address it

I want to flag the main maintenance concern with this change myself.

Concern. The lead-byte prefilter derives its first-byte / required-byte sets
by re-implementing a subset of Vim's regexp grammar in syn_compute_first_bytes()
and its helpers: magic-mode parsing, character classes, anchors, quantifiers,
groups and alternation. That means:

  • It has to be kept in sync if Vim's regexp syntax ever grows new constructs.
  • A missed or mis-modelled construct does not crash — it silently produces
    wrong highlighting by skipping a pattern that could actually match. That is
    the worst kind of failure mode for a maintainer.
  • This is not hypothetical: while developing this I hit four separate soundness
    pitfalls before the output matched byte-for-byte — top-level alternation
    (a\|b), look-around (\@<= / \@!), ignore-case (\c and :syn case
    semantics for the required-byte set), and inline flags (\c \v ...).
  • The scariest future case is a new \<char> escape — especially a zero-width
    or whole-pattern flag — being misclassified as a literal/consuming atom.

Mitigation I plan to add (commit on this branch). A differential test that
makes the optimization self-checking:

  • Add test_override('syn_prefilter', 1) to disable the prefilter at runtime
    (same mechanism as nfa_fail etc.).
  • Add a test that highlights several buffers (C, Vim script, Python, shell, and
    the tricky-construct cases) and asserts that the full-buffer synID() output
    is identical with the prefilter on (default) and off.

With that test in place, any future edit — or any new regexp construct the
analyzer fails to model — surfaces as a CI failure instead of silently wrong
highlighting. The optimization stays safe to maintain regardless of the
analyzer's complexity. The analyzer is already written to bail out (keep
trying the pattern, behaviour unchanged) on anything it does not model, so the
intended degradation is "no speed-up", never "wrong result"; the differential
test is what guarantees that property keeps holding.

Alternatives considered (not in this PR).

  • Tighten the analyzer so that any unrecognised \<char> escape bails instead
    of being treated as a literal, making it safe-by-default against future
    additions (small follow-up).
  • Longer term, derive the first-byte set from the compiled NFA start states
    instead of re-parsing the pattern text. That removes the grammar-tracking
    burden entirely (single source of truth = the regexp engine), at the cost of
    coupling to engine internals and handling the BT fallback. This would be a
    separate, larger change.


Message ID: <vim/vim/pull/20371/c4576849112@github.com>

h_east

May 29, 2026, 11:41:37 AM
to vim/vim, Subscribed
h-east left a comment (vim/vim#20371)

> A differential test that makes the optimization self-checking:

Done.


Message ID: <vim/vim/pull/20371/c4577083271@github.com>

h_east

May 30, 2026, 8:53:38 AM
to vim/vim, Subscribed
h-east left a comment (vim/vim#20371)

The "syntax highlighting performance trilogy" is now complete. This PR (#20371) is the first part; two more are lined up in my fork, to be submitted upstream after this one lands:

  • Part 1 — #20371 (this PR): lead-byte prefilter. Before running a pattern's regexp, skip it when the bytes it requires are absent from the line (a per-pattern lead-byte prefilter derived at definition time).
  • Part 2 — h-east#29: in_id_list() cache. Deciding whether a group is in a contains/cluster list scans the list and expands clusters on every check. Resolve each list once into a sorted, cluster-expanded set of group IDs and use a binary search, cached per syntax block and dropped when syntax definitions change.
  • Part 3 — h-east#30: saved-state search hint. Looking up the saved syntax state for a line scans the state list from the start every time. Remember the last found entry and start the search there; the list is sorted by line number, so the result cannot be earlier.

All three are pure speedups — the resulting highlighting is unchanged, and each one ships with a test_override() flag plus a test that asserts identical synID() output with the optimization on and off.
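
The saved-state hint from Part 3 can be sketched in the same spirit. The types and the find_state() name are hypothetical, and the lookup semantics ("last entry at or before the requested line") are an assumption made for this illustration; a real implementation would also have to reset the hint whenever the list changes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct state_S {
    long lnum;                  /* line number this saved state belongs to */
    struct state_S *next;       /* next entry; list is sorted by lnum */
} state_T;

typedef struct {
    state_T *first;
    state_T *hint;              /* entry returned by the previous lookup */
    long steps;                 /* entries visited, for illustration only */
} statelist_T;

/* Return the last entry whose lnum is <= "lnum", or NULL if there is none. */
static state_T *find_state(statelist_T *sl, long lnum)
{
    assert(sl != NULL);
    state_T *p = sl->first;

    /* The hint is usable only when it does not lie past the target line. */
    if (sl->hint != NULL && sl->hint->lnum <= lnum)
        p = sl->hint;

    state_T *found = NULL;
    for (; p != NULL && p->lnum <= lnum; p = p->next)
    {
        found = p;
        ++sl->steps;
    }
    if (found != NULL)
        sl->hint = found;
    return found;
}
```

Lookups for increasing line numbers then resume where the previous one stopped, while a lookup for an earlier line falls back to scanning from the head.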

Benchmark

Measured on a single binary that contains all three, toggling each optimization via its test_override() flag, so "all off" reproduces the current (master) algorithm. The workload is a full-buffer synID() sweep (every line, up to end-of-line); median of 5 runs.

C source (src/eval.c, 8209 lines)

Configuration                    Time     Speedup
baseline (all off ≈ master)      0.570s   1.00×
+ lead-byte prefilter (#20371)   0.253s   2.25×
+ in_id_list cache (#29)         0.551s   1.03×
+ saved-state hint (#30)         0.526s   1.08×
all three on                     0.200s   2.86×

Vim script (netrw.vim, 9717 lines)

Configuration                    Time     Speedup
baseline (all off ≈ master)      7.92s    1.00×
+ lead-byte prefilter (#20371)   7.16s    1.11×
+ in_id_list cache (#29)         5.34s    1.48×
+ saved-state hint (#30)         7.44s    1.06×
all three on                     4.28s    1.85×

The three target different bottlenecks and are complementary: the prefilter dominates for C (many patterns with distinct lead bytes), the in_id_list() cache dominates for Vim script (large contains/cluster lists, e.g. netrw), and the saved-state hint gives a small but filetype-independent gain. Combined, that's roughly 2.9× on C-heavy code and 1.85× on Vim-script-heavy code, with highlighting output unchanged.


Message ID: <vim/vim/pull/20371/c4582876954@github.com>

h_east

May 31, 2026, 3:20:42 PM
to vim/vim, Subscribed
h-east left a comment (vim/vim#20371)

@chrisbra Do you have any concerns?


Message ID: <vim/vim/pull/20371/c4587819421@github.com>

Christian Brabandt

May 31, 2026, 4:27:27 PM
to vim/vim, Subscribed
chrisbra left a comment (vim/vim#20371)

no, this sounds like a very nice performance optimization. I'll have a closer look later, this is all a bit over my head right now :)


Message ID: <vim/vim/pull/20371/c4587981805@github.com>

h_east

May 31, 2026, 10:26:18 PM
to vim/vim, Subscribed
h-east left a comment (vim/vim#20371)

Update: I've folded in the safe-by-default tightening I listed as a follow-up
in the self-review above. Unknown alphanumeric regexp escapes now bail (the
pattern is always tried) rather than being treated as ordinary atoms, so the
analyzer degrades to "no speed-up", never "wrong highlighting", even for regexp
constructs added in the future. Covered by the new
Test_syntax_prefilter_classes() (identical synID() with the prefilter on and
off).
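
The allowlist idea behind this tightening can be illustrated as follows. This is a sketch under stated assumptions, not the analyzer's code: classify_escape() and the esc_T values are hypothetical, and the modelled set "dswa" is an illustrative subset chosen for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum {
    ESC_NOT_ALNUM,  /* punctuation escape such as \( or \|: handled elsewhere */
    ESC_CLASS,      /* alphanumeric escape the analyzer explicitly models */
    ESC_BAIL        /* unknown alphanumeric escape: do not build a prefilter */
} esc_T;

/* Classify the character following a backslash in a pattern. */
static esc_T classify_escape(unsigned char c)
{
    assert(c != 0);
    if (!isalnum(c))
        return ESC_NOT_ALNUM;
    /* Illustrative subset of modelled escapes: \d \s \w \a. */
    if (strchr("dswa", c) != NULL)
        return ESC_CLASS;
    /* Anything else, including escapes added to Vim later, bails. */
    return ESC_BAIL;
}
```

Because the default branch bails, an escape introduced after the analyzer was written lands in ESC_BAIL without any code change, so the pattern is simply always tried.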


Message ID: <vim/vim/pull/20371/c4589093412@github.com>
