Simplify "syn region" regex for reStructuredText literal text

Michael Henry

May 9, 2021, 2:42:50 PM
to vim...@googlegroups.com
All,

Vim's reStructuredText syntax highlighting currently has a hard-to-fix
problem, described below.  A fairly complex solution exists, but it has
proven hard to simplify.  Can anyone suggest a simpler solution that's
being overlooked here?

Currently, the following snippet of reStructuredText is highlighted
incorrectly:

  - A bullet with literal following::

      Correctly highlighted literal

    Incorrectly highlighted as literal

  Correctly unhighlighted.

A literal block is introduced via a line ending in ``::``; it continues
for all lines that are indented further than this introductory line.
For bulleted or numbered lines, there is an additional complexity: the
reference point for indentation is measured not from the start of the
line, but from the first non-whitespace character after the bullet or number.

In the example above, the indentation reference point is the column with
the ``A`` in ``A bullet with literal following`` (two columns after the
initial bullet, ``-``).

Currently, highlighting is incorrectly measured from the bullet itself,
leading to the line ``Incorrectly highlighted as literal`` being
highlighted as literal text, even though it is not indented further than
the reference ``A``.  Numbered lines have a similar problem, e.g.:

  1000. One thousand::

          Indented

        Not indented, yet highlighted

The current rst syntax file uses the following simple ``syn region`` to
detect lines ending in ``::`` in order to activate the
``rstLiteralBlock`` region:

  syn region  rstLiteralBlock         matchgroup=rstDelimiter
        \ start='\(^\z(\s*\).*\)\@<=::\n\s*\n'
        \ skip='^\s*$'
        \ end='^\(\z1\s\+\)\@!'
        \ contains=@NoSpell

It uses the regular expression ``^\z(\s*\)`` in ``start=`` to capture
any whitespace at the start of the line (before any optional bullet or
number), storing this captured whitespace in ``\z1``; it then uses the
absence of more than this amount of leading whitespace to detect the end
of the literal block in the ``end=`` expression.
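
As a rough illustration, the effect of this ``end=`` test on the first
example can be reproduced in plain Vim script by substituting the
captured whitespace for ``\z1`` by hand.  ``RstDemoStaysLiteral()`` is a
made-up helper name, and an ordinary ``!~#`` string match stands in for
the syntax engine, so this is a sketch rather than what Vim does
internally:

```vim
" Hypothetical helper (not part of the rst syntax file): returns 1 if
" a:line would stay inside the literal block, given the whitespace that
" start= captured into \z1.
function! RstDemoStaysLiteral(z1, line) abort
    " The real pattern is end='^\(\z1\s\+\)\@!'.  The region ends when the
    " line does NOT begin with \z1 plus at least one more whitespace char.
    let l:end_pat = '^\(' . a:z1 . '\s\+\)\@!'
    return a:line !~# l:end_pat
endfunction

" For '  - A bullet with literal following::', \z1 is two spaces.
" Six leading spaces: echoes 1 (literal, as intended).
echo RstDemoStaysLiteral('  ', '      Correctly highlighted literal')
" Four leading spaces: echoes 1 (still literal, which is the bug).
echo RstDemoStaysLiteral('  ', '    Incorrectly highlighted as literal')
" Two leading spaces: echoes 0 (the block ends).
echo RstDemoStaysLiteral('  ', '  Correctly unhighlighted.')
```

The second call is the problem case: the line is indented only as far as
the ``A``, yet it still passes the test because only the two spaces
before the bullet were captured.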

The problem is that the columns consumed by a bullet or number aren't
figuring into the calculation for minimum required indentation.  There
doesn't appear to be a simple way to capture the bullet or number within
the ``start=`` regular expression, and then convert that captured text
into a requirement to match the same number of characters of whitespace
in the ``end=`` line.

For example, an initial line of ``  1000. One thousand::`` has two
leading spaces, four columns for ``1000`` and two columns for the period
and the following space; literal text would then need to be indented at
least one more space than this (a minimum of 2 + 4 + 2 + 1 = 9 spaces in
this example).
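
The arithmetic can be checked with a few lines of ordinary Vim script.
This is an illustrative sketch only: ``RstDemoThreshold()`` is a made-up
name, its pattern handles just bullets and arabic-numeral enumerators
with a trailing period (not the full enumerator syntax), and it assumes
the indentation consists of single-byte space characters:

```vim
" Hypothetical helper: minimum number of leading spaces a line needs in
" order to count as literal text below a:line (a line ending in '::').
function! RstDemoThreshold(line) abort
    " Leading whitespace, then an optional bullet or 'digits.' marker
    " together with the whitespace that follows the marker.
    let l:prefix = matchstr(a:line, '^\s*\%(\%([-+*]\|\d\+\.\)\s\+\)\?')
    " Every column of the prefix counts as indentation; literal text must
    " be indented by at least one more column than that.
    return strlen(l:prefix) + 1
endfunction

" 2 + 4 + 2 + 1: echoes 9.
echo RstDemoThreshold('  1000. One thousand::')
" 2 + 1 + 1 + 1: echoes 5.
echo RstDemoThreshold('  - A bullet with literal following::')
" No marker and no indentation: echoes 1.
echo RstDemoThreshold('Plain paragraph::')
```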

The leading whitespace and bullet-or-number could be captured into a
group and passed to the ``end=`` regular expression, but there doesn't
appear to be any way to construct a pattern of that many spaces.  If
``\z1`` were the captured text ``  1000. ``, then conceptually
``len(\z1) + 1`` would be the number of spaces to match (if that were
legal syntax).
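
For contrast, outside of a syntax region the desired pattern is easy to
construct, because ordinary Vim script has ``strlen()`` and ``repeat()``.
The sketch below shows the ``end=`` pattern that a ``len(\z1)`` feature
would make possible; ``RstDemoEndPattern()`` and ``g:rst_demo_end`` are
hypothetical names, and the captured text is passed in by hand because a
``\z1`` capture is only visible to the syntax engine itself:

```vim
" Hypothetical helper: build the end pattern from the text captured
" before the first word of the '::' line.
function! RstDemoEndPattern(captured) abort
    " One space per captured character, then \s\+ for 'at least one more'.
    return '^\(' . repeat(' ', strlen(a:captured)) . '\s\+\)\@!'
endfunction

let g:rst_demo_end = RstDemoEndPattern('  1000. ')
" Ten leading spaces: echoes 0 (no end match, the line stays literal).
echo '          Indented' =~# g:rst_demo_end
" Eight leading spaces: echoes 1 (end match, the block would close).
echo '        Not indented, yet highlighted' =~# g:rst_demo_end
```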

The complex solution contains a large amount of explanatory
documentation and a lot of code that carefully constructs a complicated
regular expression to work around the lack of a ``len(\z1)`` feature.

The basic idea is to split the possible matches from ``start=`` into
different groups, where all members of a given group consume the same
number of characters.

For example, a bullet character such as ``-`` or ``*`` consumes a single
column; ``1.`` and ``9.`` consume two columns; ``(1)``, ``99.``, etc.,
consume three columns; and so on.  Based on which group has matched,
it's possible to know the length of the captured text.  Given the wide
variety of bullet types and enumeration syntaxes supported by
reStructuredText, this leads to a very large regular expression.
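
A cut-down version of the grouping idea, limited to one-character
bullets and two-character enumerators, can be tried in plain Vim script.
In this sketch ``matchlist()`` and ordinary ``\(...\)`` groups stand in
for the region's ``\z(...\)`` captures, and ``RstDemoThresholdText()`` is
a made-up name.  The point is that the threshold whitespace is produced
purely by repeating captured groups, with no length calculation:

```vim
" Hypothetical helper: threshold whitespace for a '::' line, built only
" by repeating captured groups.
function! RstDemoThresholdText(line) abort
    " \1: leading whitespace.  \2: the space after a 1-character bullet.
    " \3: the space after a 2-character enumerator.  \4: any extra spaces.
    let l:pat = '^\(\s*\)\%(\%([-+*]\( \)\|[0-9a-zA-Z#][.)]\( \)\)\( *\)\)\?'
    let l:m = matchlist(a:line, l:pat)
    " A group that took no part in the match is an empty string, so only
    " the repetitions belonging to the matching length group contribute.
    return l:m[1] . l:m[2] . l:m[2] . l:m[3] . l:m[3] . l:m[3] . l:m[4]
endfunction

" Bullet plus one space: echoes 2.
echo strlen(RstDemoThresholdText('- A bullet::'))
" Two spaces, '1.', one space: echoes 5.
echo strlen(RstDemoThresholdText('  1. Some numbered line::'))
" Bullet plus three spaces: echoes 4.
echo strlen(RstDemoThresholdText('-   Some bullet::'))
" No marker: echoes 0.
echo strlen(RstDemoThresholdText('No marker::'))
```

Each additional marker length needs its own alternative and its own
captured space, which is why the full pattern grows so large.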

There are two main questions:

- Is there a simpler way to solve this?

- Will the complex regular expression cause performance problems?
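
For the second question, one possible starting point is Vim's built-in
``:syntime`` command, which reports how much time each syntax pattern
consumes.  The commands below are only a suggested procedure (no
measurements have been taken with it here); they would be run once with
each version of the syntax file, in a buffer holding a large
reStructuredText document:

```vim
" Start collecting timing data for syntax patterns.
syntime on
" ... scroll through the buffer, or force a redraw with CTRL-L, so that
" the syntax patterns are actually exercised ...
" Show the per-pattern totals, then stop collecting.
syntime report
syntime off
```

Comparing the ``rstLiteralBlock`` entries in the two reports should give
a first impression of the cost of the larger pattern.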

Any suggestions or comments are welcome.

Michael Henry

Here's the pull request containing the complex solution shown below:
https://github.com/marshallward/vim-restructuredtext/pull/63

The commit is here:
https://github.com/marshallward/vim-restructuredtext/pull/63/commits/102d8f4bd1c906846012a90c2069440fd6cb92e0

To ensure the mailing list contains the full idea for archival purposes,
the commit is copied below this line as well.
##############################################################################


""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Literal blocks typically look like this:
"
"   Some non-literal text followed by double colons::
"
"     This part is literal.
"     And so is this.
"
"   Back to non-literal text here (since it lines up below "Some
"   non-literal").
"
" Literal mode turns off when the level of indentation decreases back to
" the original amount of indentation on the line with the double colons.
" Put another way, literal mode remains as long as there is more
" leading whitespace than was found at the start of the line with the double
" colons.
"
" When the double colons are on a bulleted or numbered line, however,
" literal mode stops once the indentation decreases to match the first
" non-whitespace position *after* the bullet, e.g.::
"
"   - A bullet followed by double colons::
"
"       This part is literal.
"       And so is this.
"
"     Back to non-literal text here (since it lines up below "A bullet").
"
" For this case, the amount of leading whitespace is calculated by treating
" the bullet character or the line number as if they were composed of
" spaces.  Consider this example:
"
"   - This is a bullet.
"
"     - Here is a sub-bullet with double colons::
"
"         This part is literal.
"         And so is this.
"
"       Back to non-literal text here (since it lines up below "Here is").
"       This line is indented four spaces from "- This is".
"
" In the above example, the leading whitespace threshold is four spaces: two
" spaces before "- Here is", plus the bullet character ("-") converted to a
" space, plus the space following the bullet.  Literal mode remains as
" long as there is more than the threshold four spaces of indentation.
"
" A valid bulleted line begins with zero or more whitespace characters,
" a single bullet character ("-", "+", or "*"), and whitespace.
"
" As explained in
"
" https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#enumerated-lists,
" a valid numbered line begins with zero or more whitespace characters
" followed by an enumerator.  An enumerator consists of an enumeration
" sequence member and formatting, followed by whitespace.  The following
" enumeration sequences are recognized:
"
"   arabic numerals: 1, 2, 3, ... (no upper limit).
"   uppercase alphabet characters: A, B, C, ..., Z.
"   lower-case alphabet characters: a, b, c, ..., z.
"   uppercase Roman numerals: I, II, III, IV, ..., MMMMCMXCIX (4999).
"   lowercase Roman numerals: i, ii, iii, iv, ..., mmmmcmxcix (4999).
"
" In addition, the auto-enumerator, "#", may be used to automatically
" enumerate a list.  The following formatting types are recognized:
"
"   suffixed with a period: "1.", "A.", "a.", "I.", "i.".
"   surrounded by parentheses: "(1)", "(A)", "(a)", "(I)", "(i)".
"   suffixed with a right-parenthesis: "1)", "A)", "a)", "I)", "i)".
"
" Therefore, the threshold leading whitespace, w, could be described by:
"
"   let w = whitespace at the start of the line with double colons.
"   If this line starts with a bullet and whitespace, extend w by one
"   space for each character in the sequence (bullet + following
"   whitespace).
"   If this line starts with an enumeration, extend w by one space for each
"   character in the enumeration (note that this includes the whitespace at
"   the end of the enumeration).
"
" The difficulty lies in trying to count the number of characters in
" either a (bullet + following whitespace) sequence or an enumeration.  In a
" syntax region, there doesn't appear to be any way to calculate the length
" of a captured group (such as '\z1') in the 'start=' section and pass that
" to the 'end=' section in such a way that the length can be used to require
" a match for a run of that many space characters.
"
" Consider this example:
"
"   -   Some bullet::
"
"         This part is literal.
"         And so is this.
"
"       Back to non-literal (since it lines up with "Some bullet").
"
" There is no indentation before "-   Some bullet::".  The equivalent
" amount of indentation before "Some bullet" is four spaces (one for the "-"
" and three for the three following spaces).  This bullet could be matched
" with a capturing pattern such as '\z(-\s\+\)' (a hyphen bullet and one or
" more whitespace characters); but there seems to be no direct way to
" convert this captured text into a requirement to match four whitespace
" characters in the 'end=' pattern.
"
" To work around this limitation, consider that if a bullet is present,
" it must be followed by at least one space; if the bullet is not present,
" then the space won't be present, either.  Consider the following pattern
" with three capturing groups, where the entire pattern is optional::
"
"   \(\z(-\)\z( \)\z(\s*\)\)\?
"
" If no bullet is present, the above pattern fails to match, and all three
" captured groups are empty.  If the bullet is present with following
" whitespace, then \z1 holds the bullet, \z2 holds a single space, and
" \z3 holds any additional trailing whitespace.  Consider using the
" following pattern in 'end='::
"
"   \z2\z2\z3
"
" The above pattern will match a run of whitespace with the length we
" desire.  When no bullet is present, it matches the empty string.  When a
" bullet is present, we know \z2 will be a single space, in which case the
" pattern matches a space for the bullet (the first \z2), the following
" space (the second \z2), and the remaining spaces (the \z3).
"
" A similar technique allows matching an enumeration.  In the above
" technique, we were matching a single bullet character, and we used \z2 in
" the bullet's position to effectively convert the bullet to a space.  For
" enumerations, we can match a fixed number, n, of non-whitespace
" characters, and convert them into n space characters by repeating \z2 that
" many times.  For example::
"
"   1. Some numbered line::
"
"        This part is literal.
"        And so is this.
"
"      Back to non-literal (since it lines up with "Some numbered").
"
" The following pattern matches optional single-digit enumerations like
" the one above::
"
"   \(\z(\d\.\)\z( \)\z(\s*\)\)\?
"
" If the above matches, we know that \z1 will be two characters (a digit
" and a period).  To replace that two-character sequence with two spaces in
" the 'end=' pattern, we use two copies of \z2 for these non-whitespace
" characters and an additional \z2 for the immediately following space, plus
" \z3 to match any additional whitespace::
"
"   \z2\z2\z2\z3
"
" By carefully matching runs of known-length non-whitespace characters,
" we can convert these runs into runs of equal numbers of space
" characters by repeating the group which captured a single space the
" desired number of times.
"
" Bullets and enumeration patterns can then be grouped by the number of
" non-whitespace characters before the single following space.  Consider the
" samples below:
"
"   1-character sequences:
"
"   - Bullets::
"
"       -
"       +
"       *
"
"   2-character sequences::
"
"     #.
"     #)
"     1.
"     1)
"     a)
"
"   3-character sequences::
"
"     12.
"     ab)
"     (a)
"     (#)
"
"   4-character sequences::
"
"     123.
"     abc)
"     (ab)
"
" Overall, then, a pattern to match the start of a literal block with an
" optional bullet or enumeration could be structured as the following
" captured groups::
"
"   \z(any leading whitespace)
"   (
"     (
"       (one-character sequence)   \z(one space) |
"       (two-character sequence)   \z(one space) |
"       (three-character sequence) \z(one space) |
"       (four-character sequence)  \z(one space) |
"       (five-character sequence)  \z(one space) |
"       (six-character sequence)   \z(one space) |
"       (seven-character sequence) \z(one space)
"     )
"     \z(any extra whitespace)
"   )\?
"   any characters
"   ::
"   end-of-line
"
" This uses all nine capture groups, \z1 through \z9, to support the largest
" number of digits in enumerations (up to six digits and a period, for
" example).
"
" In the above, \z1 holds the leading whitespace, and \z9 holds any extra
" whitespace.  Each of the remaining capture groups will hold a single
" space if its corresponding n-character sequence matched, and will be empty
" otherwise.  Repeating that group n + 1 times converts the n-character
" sequence and its one following space into n + 1 spaces.  For example,
" \z3\z3\z3 will be three spaces if a two-character sequence is matched, and
" empty otherwise.  Further, \z2\z2\z3\z3\z3 will be two spaces for a
" one-character match, three spaces for a two-character match, and empty
" otherwise.
"
" Therefore, the threshold whitespace pattern is given by the
" concatenation of all of the below patterns::
"
"   \z1
"   \z2\z2
"   \z3\z3\z3
"   \z4\z4\z4\z4
"   \z5\z5\z5\z5\z5
"   \z6\z6\z6\z6\z6\z6
"   \z7\z7\z7\z7\z7\z7\z7
"   \z8\z8\z8\z8\z8\z8\z8\z8
"   \z9

" Return a pattern for the given number of digits (in enumerations).
function! s:digits_pat(num_digits)
    if a:num_digits == 1
        " A lone "digit" includes numbers, letters and '#'.
        return '[0-9a-zA-Z#]'
    endif
    " Multiple "digits" comprise numbers and letters (but not '#').
    return '[0-9a-zA-Z]\{' . a:num_digits . '}'
endfunction

" Build up s:pat, a pattern to detect the start of rstLiteral.

" Begin group for everything before double-colon:
let s:pat = '\('

" \z1 captures optional leading whitespace:
let s:pat .= '^\z(\s*\)'

" Begin optional capture of bullet or enumeration:
let s:pat .= '\('

" n=1: A bullet character and \z2 capturing a space:
let s:pat .= '[-+*]\z( \)'

" s:n is the number of non-whitespace characters in the enumeration.
" Iterate from 2 through 7 (capturing into \z3 through \z8).
let s:n = 2
while s:n <= 7
    let s:pat .= '\|\('

    " Handle 'dddd.' and 'dddd)':
    let s:pat .= s:digits_pat(s:n - 1) . '[).]'

    " If enough characters, handle the '(ddd)' case:
    if s:n >= 3
        let s:pat .= '\|(' . s:digits_pat(s:n - 2) . ')'
    endif

    let s:pat .= '\)'

    " \z(n+1) captures a space:
    let s:pat .= '\z( \)'

    let s:n += 1
endwhile

" End optional capture of bullet or enumeration:
let s:pat .= '\)\?'

" Trailing whitespace in \z9
let s:pat .= '\z( *\)'

" Finish with arbitrary characters, then close the group:
let s:pat .= '.*\)'

" Use a zero-width look-behind for everything before double-colon:
let s:pat .= '\@<='

" End with double-colon at the end of the line followed by a blank line:
let s:pat .= '::\n\s*\n'

" Now build up the corresponding end pattern, s:end_pat.

" Match leading whitespace:
let s:end_pat = '^\(\z1'

" Then match the remaining characters of threshold whitespace for each
" capture group:
let s:end_pat .= '\z2\z2'
let s:end_pat .= '\z3\z3\z3'
let s:end_pat .= '\z4\z4\z4\z4'
let s:end_pat .= '\z5\z5\z5\z5\z5'
let s:end_pat .= '\z6\z6\z6\z6\z6\z6'
let s:end_pat .= '\z7\z7\z7\z7\z7\z7\z7'
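let s:end_pat .= '\z8\z8\z8\z8\z8\z8\z8\z8'

" Finally, match any extra whitespace that followed the bullet or
" enumeration (\z9), as listed in the threshold pattern documented above:
let s:end_pat .= '\z9'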

" Require at least one more space to stay in literal block; then negate the
" match to exit literal mode:
let s:end_pat .= '\s\+\)\@!'

" Build up 'syn' command:
let s:syn = "syn region  rstLiteralBlock         matchgroup=rstDelimiter"
let s:syn .= " start='" . s:pat . "'"
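" Double the backslash here: inside a double-quoted string, "\s" would
" otherwise be reduced to a plain "s".
let s:syn .= " skip='^\\s*$'"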
let s:syn .= " end='" . s:end_pat . "'"
let s:syn .= " contains=@NoSpell"

" Change to 0 to compare with original:
if 1
    execute s:syn
else
    syn region  rstLiteralBlock         matchgroup=rstDelimiter
          \ start='\(^\z(\s*\).*\)\@<=::\n\s*\n'
          \ skip='^\s*$'
          \ end='^\(\z1\s\+\)\@!'
          \ contains=@NoSpell
endif
