Rev 8bbf226 attempts to fix the "lead-in" characters for all regexes in modes/md.py.
As discussed below, this was an error-prone process. Please report any problems immediately. Otoh, simple tests still pass.
===== Background
Leo calls a pattern matcher only if the matcher's lead-in character matches the character being scanned. This is a crucial speed optimization. Improper lead-in characters effectively disable patterns.
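To make the optimization concrete, here is a minimal sketch of lead-in dispatch. It is not Leo's actual code: the names `rules_dict` and `match_at` and the two patterns are hypothetical, chosen only to illustrate the idea.

```python
import re

# Hypothetical rules table: lead-in character -> patterns that can start with it.
rules_dict = {
    '#': [re.compile(r'#+\s.*')],          # heading-like pattern
    '*': [re.compile(r'\*\*[^*]+\*\*')],   # bold-like pattern
}

def match_at(s, i):
    """Try only the patterns filed under the character at s[i]."""
    for pattern in rules_dict.get(s[i], []):
        m = pattern.match(s, i)
        if m:
            return m
    return None  # No pattern is filed under this character.
```

If the bold-like pattern were filed under the wrong key, say '_', `match_at` would never try it at a '*'. That is how an improper lead-in character effectively disables a pattern.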
To help me understand the regex patterns, I have been using this excellent online resource:
http://regex101.com/
By playing with examples, I have been able to discover the valid lead-in characters for each regex.
I documented the lead-in characters for each rule, and then altered the various rulesDicts *by hand*. Of course, this is error-prone.
===== Computing lead-in characters
Theoretically, jEdit2py could compute leadin characters automatically. This is similar to computing the so-called First & Follow sets in compiler construction. Here are two links:
http://stackoverflow.com/questions/787134/can-i-determine-the-set-of-first-chars-matched-by-regex-pattern
http://www.cs.uky.edu/~lewis/texts/theory/automata/reg-sets.pdf
Indeed, in *most* cases, computing the lead-in character or characters is easy:
- If the regex starts with a non-special character, that character is the only lead-in character.
- If the regex starts with an escaped special character, that special character is the only lead-in character.
- If the regex starts with \s, \S, \d, etc., the characters corresponding to those patterns are the lead-in characters.
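The three rules above might be sketched as follows. This is only an illustration, not part of jEdit2py: the name `trivial_leadins` is hypothetical, the character classes are limited to ASCII, and the function conservatively returns None when the regex contains '|' or when a quantifier such as ? or * could make the first element optional.

```python
import string

# ASCII-only character sets for the shorthand escapes (an assumption).
CLASS_CHARS = {
    'd': set(string.digits),
    's': set(' \t\n\r\f\v'),
    'w': set(string.ascii_letters + string.digits + '_'),
}
ASCII = {chr(i) for i in range(128)}
SPECIAL = set('.^$*+?{}[]\\|()')

def trivial_leadins(regex):
    """Return the lead-in set for the trivial cases, or None otherwise."""
    if not regex or '|' in regex:
        return None  # Empty regex or alternation: not a trivial case.
    first = regex[0]
    atom_len = 2 if first == '\\' else 1
    if len(regex) < atom_len:
        return None  # A lone trailing backslash.
    # A following ?, * or {m,n} may make the first element optional.
    if regex[atom_len:atom_len + 1] in ('?', '*', '{'):
        return None
    if first != '\\':
        # Rule 1: a non-special character is its own lead-in.
        return {first} if first not in SPECIAL else None
    nxt = regex[1]
    if nxt in SPECIAL:
        return {nxt}  # Rule 2: an escaped special character.
    if nxt in CLASS_CHARS:
        return set(CLASS_CHARS[nxt])  # Rule 3: \d, \s, \w.
    if nxt.isupper() and nxt.lower() in CLASS_CHARS:
        return ASCII - CLASS_CHARS[nxt.lower()]  # Rule 3: \D, \S, \W.
    return None  # Other escapes (\b, \A, ...) are not handled here.
```

A result of None means "not trivial", so the caller would fall back to documenting the lead-in characters by hand.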
Those are the trivial cases. Now for slightly harder cases:
- If the regex starts with something of the form (...), [...], {...} we must do the following:
A: compute the lead-in characters of ...
B: depending on what follows, the (...), [...], {...} can either denote an optional pattern or a required pattern. If required, just return the computed lead-in characters. Otherwise, if anything follows the pattern, we must *add* the lead-in characters of the following pattern to the computed lead-in characters.
In short, this is a complex, recursive process. Clearly doable, but non-trivial.
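Here is a rough sketch of that recursion for a deliberately restricted dialect: literals, escaped characters, \d \s \w, simple [...] classes, (...) groups, alternation, and the ? * + quantifiers. It is not an existing tool. The name `first_set` is hypothetical, the shorthand classes are ASCII-only, and anything else, including (?...) and {m,n}, raises NotImplementedError. Each piece yields a pair (chars, nullable), where nullable means the piece can match the empty string.

```python
import string

SHORTHAND = {
    'd': set(string.digits),
    's': set(' \t\n\r\f\v'),
    'w': set(string.ascii_letters + string.digits + '_'),
}

def first_set(regex):
    """Return (lead-in chars, nullable) for a restricted regex dialect."""
    pos = 0

    def parse_alt():
        # alt := seq ('|' seq)*  -- union of the branches.
        nonlocal pos
        chars, nullable = parse_seq()
        while pos < len(regex) and regex[pos] == '|':
            pos += 1
            c2, n2 = parse_seq()
            chars |= c2
            nullable = nullable or n2
        return chars, nullable

    def parse_seq():
        # seq := (atom quantifier?)*
        nonlocal pos
        chars, nullable = set(), True
        while pos < len(regex) and regex[pos] not in '|)':
            c, n = parse_atom()
            if pos < len(regex) and regex[pos] in '?*+':
                if regex[pos] != '+':
                    n = True  # ? and * make the atom optional (step B).
                pos += 1
            if nullable:
                chars |= c  # Everything so far was optional, so add these.
            nullable = nullable and n
        return chars, nullable

    def parse_atom():
        nonlocal pos
        ch = regex[pos]
        if ch == '(':
            pos += 1
            if regex.startswith('?', pos):
                raise NotImplementedError('(?...) is not handled')
            result = parse_alt()  # Step A: recurse into the group.
            if pos >= len(regex) or regex[pos] != ')':
                raise ValueError('missing )')
            pos += 1
            return result
        if ch == '[':
            end = regex.index(']', pos + 1)
            body = regex[pos + 1:end]
            if body.startswith('^') or '-' in body or '\\' in body:
                raise NotImplementedError('only simple [...] classes')
            pos = end + 1
            return set(body), False
        if ch == '\\':
            if pos + 1 >= len(regex):
                raise ValueError('trailing backslash')
            nxt = regex[pos + 1]
            pos += 2
            if nxt in SHORTHAND:
                return set(SHORTHAND[nxt]), False
            if nxt.isalnum():
                raise NotImplementedError('escape \\' + nxt)
            return {nxt}, False  # Escaped special character.
        if ch in '.^$*+?{}':
            raise NotImplementedError('unsupported construct ' + ch)
        pos += 1
        return {ch}, False  # Plain literal.

    chars, nullable = parse_alt()
    if pos != len(regex):
        raise ValueError('unbalanced )')
    return chars, nullable
```

For example, `first_set('(ab)?c')` yields the characters a and c, because the optional group lets the match begin with either. If the whole regex is nullable, as with 'a*', the regex can match the empty string, so its lead-in set alone does not say where the pattern can match.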
Alas, things get even worse. There is another set of expressions of the form (?...), and these can be very complex.
So for now, I'll leave this project as an exercise for the interested reader :-) There doesn't seem to be any (Python) tool for computing the desired First (leadin) sets. If anyone knows of such a thing, please let me know.
Edward