Fixing regex-related bugs in markdown & moin colorizers

Edward K. Ream

Jul 29, 2014, 10:04:26 AM
to leo-e...@googlegroups.com
I just wrote a script that searches for 'regexp="\\' in all leo\modes files.
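
For illustration, a search like this could be sketched as follows. This is not the original script: the function name `find_backslash_leadins` is hypothetical, and the sketch assumes the text to find is `regexp="` followed by two literal backslashes, as such a string would appear in Python source.

```python
import os

# Assumption: the literal text to find is regexp=" followed by two backslashes.
NEEDLE = r'regexp="\\'

def find_backslash_leadins(modes_dir):
    """Yield (file name, line number, line) for each line containing NEEDLE."""
    for name in sorted(os.listdir(modes_dir)):
        if not name.endswith('.py'):
            continue
        path = os.path.join(modes_dir, name)
        with open(path, encoding='utf-8', errors='replace') as f:
            for line_number, line in enumerate(f, 1):
                if NEEDLE in line:
                    yield name, line_number, line.rstrip()
```

Calling `list(find_backslash_leadins(path_to_modes))` would give the candidate rules to inspect by hand; the search only finds candidates and cannot tell whether a given backslash is a "real" one.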

In most cases, the backslash that starts the regex is a "real" backslash, but there are problems in two colorizer files: modes/md.py (markdown) and  modes/moin.py.

Here is the list of rules that use a backslash as the leadin character when another character would (probably!) be correct:

===== md.py
    [ \t] leadins: rules: 8,20,24,25(?),50,51
    [ ] leadins: rules: 23,46,49
    [=-] leadins: rules: 21,47
    [\\_] leadins: rules: 54,55
    # leadins: rule: 22
    other leadins (possibly many): rules: 7,12,19,53

===== moin.py
    [ \t] leadin: rule 3.
    single-quote leadin: rule 6.

Does anyone care that these rules aren't firing?

Edward

P.S.  I'll probably attempt a fix anyway.  Probably none of these rules is firing at present, so futzing by hand isn't likely to be dangerous, provided that there are no syntax errors, and pylint -m will check for that.

EKR

Edward K. Ream

Jul 29, 2014, 10:57:44 AM
to leo-e...@googlegroups.com

On Tuesday, July 29, 2014 9:04:26 AM UTC-5, Edward K. Ream wrote:

> Here is the list rules with a backslash as a leadin character when another character would (probably!) be correct:

Oops, I think I was looking at markdown.py (now deleted) rather than md.py.  md.py is simpler, and has only a few dubious rules.  I'll fix them today.

EKR

Edward K. Ream

Jul 29, 2014, 3:02:49 PM
to leo-e...@googlegroups.com
On Tuesday, July 29, 2014 9:04:26 AM UTC-5, Edward K. Ream wrote:
> I just wrote a script that searches for 'regexp="\\' in all leo\modes files.
>
> In most cases, the backslash that starts the regex is a "real" backslash, but there are problems in two colorizer files: modes/md.py (markdown) and modes/moin.py.

Rev 8bbf226... attempts to fix the "lead-in" characters for all regexes in modes/md.py.

As discussed below, this was an error-prone process.  Please report any problems immediately.  Otoh, simple tests still pass.

===== Background

Leo calls a pattern matcher only if the matcher's lead-in character matches the character being scanned.  This is a crucial speed optimization.  Improper lead-in characters effectively disable patterns.
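
A minimal sketch of this dispatch, to show why a wrong lead-in silently disables a rule. It is not Leo's actual colorizer code: the names `match_at`, `heading_rule`, `good_rules` and `bad_rules`, and the matcher signature, are assumptions made for illustration.

```python
import re

HEADING = re.compile(r'#+ .*')

def heading_rule(s, i):
    """Hypothetical matcher: return the length of a heading at s[i], or 0."""
    m = HEADING.match(s, i)
    return len(m.group(0)) if m else 0

def match_at(s, i, rules_dict):
    """Call only the matchers registered under the character s[i]."""
    for matcher in rules_dict.get(s[i], ()):
        n = matcher(s, i)
        if n > 0:
            return n
    return 0

good_rules = {'#': [heading_rule]}   # correct lead-in
bad_rules = {'\\': [heading_rule]}   # wrong lead-in: never consulted for '#'

print(match_at('# Title', 0, good_rules))  # the rule fires
print(match_at('# Title', 0, bad_rules))   # the rule is never called
```

With `bad_rules` the regex itself is fine, but the matcher is filed under a character that never starts a heading, so it is never tried.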

To help me understand the regex patterns, I have been using this excellent online resource:
http://regex101.com/

By playing with examples, I have been able to discover the valid leadin characters for each regex.

I documented the leadin-characters for each rule, and then altered the various rulesDicts *by hand*.  Of course this is error prone.

===== Computing lead-in characters

Theoretically, jEdit2py could compute leadin characters automatically.   This is similar to computing the so-called First & Follow sets in compiler construction.  Here are two links:

http://stackoverflow.com/questions/787134/can-i-determine-the-set-of-first-chars-matched-by-regex-pattern
http://www.cs.uky.edu/~lewis/texts/theory/automata/reg-sets.pdf

Indeed, in *most* cases, computing the leadin character or characters is easy:

- If the regex starts with a non-special character, that character is the only lead-in character.

- If the regex starts with an escaped special character, that special character is the only leadin character.

- If the regex starts with \s, \S, \d, etc., the characters corresponding to those patterns are the leadin characters.
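
The three cases above could be sketched roughly as follows. This is an illustrative sketch, not part of jEdit2py: the function name `trivial_leadins` is hypothetical, the escape classes are limited to ASCII, and the function deliberately returns None (meaning "not a trivial case") whenever the regex contains `|` or the first item is followed by `*`, `?` or `{`, because then more than one lead-in character may be possible.

```python
import string

# ASCII-only approximations of a few escape classes (an assumption).
ESCAPE_CLASSES = {
    'd': set(string.digits),
    's': set(' \t\n\r\f\v'),
    'w': set(string.ascii_letters + string.digits + '_'),
}
ASCII = {chr(i) for i in range(128)}
SPECIAL = set('.^$*+?{}[]()|\\')

def trivial_leadins(regex):
    """Return the lead-in set for the trivial cases, or None otherwise."""
    if not regex or '|' in regex:
        return None  # empty or (possibly) alternation: not handled here
    ch = regex[0]
    if ch == '\\' and len(regex) > 1:
        nxt = regex[1]
        if nxt in ESCAPE_CLASSES:               # \d, \s, \w
            result = set(ESCAPE_CLASSES[nxt])
        elif nxt.lower() in ESCAPE_CLASSES:     # \D, \S, \W: the complement
            result = ASCII - ESCAPE_CLASSES[nxt.lower()]
        elif nxt in SPECIAL:                    # escaped special character
            result = {nxt}
        else:
            return None                         # \b, \A, \1, ...: not handled
        rest = regex[2:]
    elif ch not in SPECIAL:                     # plain literal character
        result = {ch}
        rest = regex[1:]
    else:
        return None                             # (...), [...], ., ^, ...
    if rest[:1] in ('*', '?', '{'):
        return None                             # first item may be optional
    return result
```

For example, `trivial_leadins(r'\*\*')` gives `{'*'}`, while `trivial_leadins('a*b')` gives None because both `a` and `b` can start a match.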

Those are the trivial cases.  Now for slightly harder cases:

- if the regex starts with something of the form (...), [...], {...} we must do the following:

A: compute the leadin characters of ...

B: depending on what follows, the (...), [...], {...} can denote either an optional pattern or a required pattern.  If required, just return the computed leadin characters.  Otherwise, if anything follows the pattern, we must *add* the leadin characters of the following pattern to the computed leadin characters.

In short, this is a complex, recursive process.  Clearly doable, but non-trivial.
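
Step B is the usual rule for the First set of a concatenation. Here is a minimal sketch, assuming the regex has already been split into top-level items, each described by its own lead-in set and a flag saying whether it can match the empty string. That representation and the name `first_of_sequence` are hypothetical; producing the items from a real regex is the hard, recursive part.

```python
def first_of_sequence(items):
    """items: list of (leadin_chars, optional) pairs, in regex order.

    Return (leadins, optional) for the whole sequence: keep adding the
    lead-in characters of each item until a required item is reached.
    """
    leadins = set()
    for chars, optional in items:
        leadins |= chars
        if not optional:
            return leadins, False
    return leadins, True  # every item was optional (or there were none)

# Example: [ \t]*# starts with an optional class followed by a required '#'.
items = [({' ', '\t'}, True), ({'#'}, False)]
print(first_of_sequence(items))
```

Because the result has the same shape as one item, the function can be applied again to the contents of a nested group, which is where the recursion comes from.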

Alas, things get even worse.  There is another set of expressions, of the form (?...), and these can be very complex.

So for now, I'll leave this project as an exercise for the interested reader :-)  There doesn't seem to be any (Python) tool for computing the desired First (leadin) sets.  If anyone knows of such a thing, please let me know.

Edward