Support keyword wildcard matches

Jason Oster

Oct 27, 2010, 3:18:34 PM

to scintilla...@googlegroups.com

Greetings list,

I've attached a patch to allow wildcard matches in lexer keywords. This
is required to support the HTML5 data-* attribute type[1], but could
have some other uses that I'm unaware of, such as supporting
vendor-specific CSS properties: -moz-*, -ms-*, -o-*, -webkit-*, ...
(Support for vendor-specific CSS properties et al. was added way back
in version 1.77.)

So, I'm taking this to the list to get some additional feedback.
Mostly, I am interested in what other lexers might like to use this
functionality. The patch contains a modification to LexHTML to classify
attributes with wildcard matching, for the specific use-case of data-*
attributes. Also, as discussed on the Geany mailing list[2], it could
be useful for aria-*[3].

One note on the implementation: I decided to use strchr() to check for
the wildcard-maskable characters, instead of Scintilla's built-in
CharacterSet class. This is only because WordList.cxx already includes
string.h, but not CharacterSet.h. The unfortunate downside to this is
that calls to WordList::InListWildcard() require a character pointer
(string) to be passed, which could potentially get very messy.

See the single-line patch in LexHTML.cxx for an example. Making
charList an optional argument might be nice, defaulting to an
alphanumeric string, plus underscore. That should handle the common
cases well.

[1]
http://www.whatwg.org/specs/web-apps/current-work/#embedding-custom-non-visible-data-with-the-data-*-attributes
[2] http://lists.uvena.de/pipermail/geany/2010-October/006087.html
[3]
http://www.whatwg.org/specs/web-apps/current-work/#annotations-for-assistive-technology-products-%28aria%29

scintilla-keyword_wildcards.patch
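
The attached patch is not reproduced in this thread. As a rough illustration of the idea described above (a trailing-* keyword plus a charList checked with strchr()), here is a minimal sketch. The function name MatchesWildcard, its signature, and the default charList are assumptions made for illustration; this is not the code from the patch, and it is a free function rather than a WordList member.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, NOT the code from the attached patch.
// A pattern ending in '*' matches any word that starts with the text before
// the '*' and continues with one or more characters drawn from charList.
// A pattern without a trailing '*' must match the word exactly.
static bool MatchesWildcard(const char *pattern, const char *word,
        const char *charList =
            "abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "0123456789_") {
    const std::size_t patternLen = std::strlen(pattern);
    if (patternLen == 0 || pattern[patternLen - 1] != '*')
        return std::strcmp(pattern, word) == 0;

    // strncmp stops at the NUL of a too-short word, so this is safe.
    const std::size_t prefixLen = patternLen - 1;
    if (std::strncmp(pattern, word, prefixLen) != 0)
        return false;

    const char *rest = word + prefixLen;
    if (*rest == '\0')
        return false;  // require at least one character after the prefix
    for (; *rest; ++rest) {
        if (!std::strchr(charList, *rest))
            return false;  // character is not wildcard-maskable
    }
    return true;
}

// Small self-check of the behaviour described in the comments above.
static void DemoWildcard() {
    assert(MatchesWildcard("data-*", "data-test"));
    assert(!MatchesWildcard("data-*", "data-"));
    assert(!MatchesWildcard("data-*", "data"));
    // '-' is not in the default charList, so it must be passed explicitly.
    assert(!MatchesWildcard("data-*", "data-a-b"));
    assert(MatchesWildcard("data-*", "data-a-b", "abcdefghijklmnopqrstuvwxyz-"));
}
```

Giving charList a default value, as suggested above, keeps the common call short while still letting a lexer pass its own set of allowed characters.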

Neil Hodgson

Oct 27, 2010, 5:27:57 PM

to scintilla...@googlegroups.com

Jason Oster:

> I've attached a patch to allow wildcard matches in lexer keywords. This is
> required to support the HTML5 data-* attribute type[1], but could have some
> other uses that I'm unaware of. Perhaps supporting vendor-specific CSS
> properties, for example: -moz-*, -ms-*, -o-*, -webkit-*, ... just as an
> example.

There is currently support for prefix keywords which start with ^.
For example, the C++ lexer allows ^GTK_ to match all the identifiers
that start with GTK_.

Neil

Jason Oster

Oct 27, 2010, 5:51:43 PM

to scintilla...@googlegroups.com

On 10/27/2010 02:27 PM, Neil Hodgson wrote:
> There is currently support for prefix keywords which start with ^.
> For example, the C++ lexer allows ^GTK_ to match all the identifiers
> that start with GTK_.
>
> Neil

Thank you, Neil. Unfortunately, it doesn't quite work the same way. A
few examples to illustrate:

<div data-test="good" data="bad" data-="bad"/>

When using ^data- as a keyword, we would expect the "data-test"
attribute to be styled as a valid attribute, and "data" and "data-"
attributes to be styled as unknown attributes. I'm getting all three
styled as valid attributes. Is this behavior a bug, or by design?

Neil Hodgson

Oct 27, 2010, 6:10:22 PM

to scintilla...@googlegroups.com

Jason Oster:

> Thank you, Neil. Unfortunately, it doesn't quite work the same way. A few
> examples to illustrate:
>
> <div data-test="good" data="bad" data-="bad"/>
>
> When using ^data- as a keyword, we would expect the "data-test" attribute to
> be styled as a valid attribute, and "data" and "data-" attributes to be
> styled as unknown attributes. I'm getting all three styled as valid
> attributes. Is this behavior a bug, or by design?

It is supposed to be a prefix so "^data-" should match "data-test"
and "data-" but not "data". It's a bug if "^data-" matches "data".

Neil
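
The prefix rule described above can be sketched as follows. This is a minimal illustration, not Scintilla's WordList::InList: the function name InListWithPrefixes and the use of std::vector and std::string are assumptions chosen to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a keyword list lookup; NOT Scintilla's WordList.
// Elements starting with '^' are prefixes, all other elements are exact matches.
static bool InListWithPrefixes(const std::vector<std::string> &list,
        const std::string &word) {
    for (const std::string &element : list) {
        if (!element.empty() && element[0] == '^') {
            // Prefix element: compare the text after '^' with the start of word.
            const std::string prefix = element.substr(1);
            if (word.compare(0, prefix.size(), prefix) == 0)
                return true;
        } else if (element == word) {
            return true;
        }
    }
    return false;
}

// Small self-check using the attribute names from this thread.
static void DemoPrefix() {
    const std::vector<std::string> keywords = {"^data-", "class"};
    assert(InListWithPrefixes(keywords, "data-test"));
    assert(InListWithPrefixes(keywords, "data-"));   // a lone prefix matches
    assert(!InListWithPrefixes(keywords, "data"));   // shorter than the prefix
    assert(InListWithPrefixes(keywords, "class"));
}
```

Note that a word equal to the bare prefix still matches, and that "data" would only match if it were also listed as an ordinary keyword.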

Jason Oster

Oct 27, 2010, 6:12:29 PM

to scintilla...@googlegroups.com

On 10/27/2010 03:10 PM, Neil Hodgson wrote:
> It is supposed to be a prefix so "^data-" should match "data-test"
> and "data-" but not "data". It's a bug if "^data-" matches "data".
>
> Neil

You're right: "data" was already defined as a keyword, which is why it
was getting styled. No bug there. Highlighting "data-" by itself might
be OK, but it's certainly against the spec:

"A custom data attribute is an attribute in no namespace whose name
starts with the string "data-", *has at least one character after the
hyphen*, is XML-compatible, and contains no characters in the range
U+0041 to U+005A (LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z)."

Emphasis my own.

I suppose it's usable. (Aside: Styling "GTK_" and other lone prefixes
also looks odd.)

Neil Hodgson

Oct 27, 2010, 6:30:51 PM

to scintilla...@googlegroups.com

Jason Oster:

> I suppose it's usable. (Aside: Styling "GTK_" and other lone prefixes also
> looks odd.)

Yes, but I don't want to change this in case it is used. There
could be an optional parameter to not match unless there are more
characters in the word but it doesn't seem that important to me. If
you want to apply other rules like no A..Z then this looks too
specialized to go into WordList so belongs in LexHTML.

Neil
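
As a sketch of what such a lexer-side check might look like, the helper below applies only the rules quoted from the spec earlier in this thread: the name starts with "data-", has at least one character after the hyphen, and contains no A..Z. The name IsCustomDataAttribute and its placement are assumptions; it is not part of LexHTML, and it does not check XML-compatibility.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper for an HTML lexer; NOT existing LexHTML code.
// Checks: starts with "data-", at least one character follows the hyphen,
// and no character is in the range 'A'..'Z'.
// XML-compatibility of the name is deliberately not checked here.
static bool IsCustomDataAttribute(const char *name) {
    static const char prefix[] = "data-";
    const std::size_t prefixLen = sizeof(prefix) - 1;
    if (std::strncmp(name, prefix, prefixLen) != 0)
        return false;

    const char *rest = name + prefixLen;
    if (*rest == '\0')
        return false;  // "data-" alone is not a valid custom data attribute
    for (; *rest; ++rest) {
        if (*rest >= 'A' && *rest <= 'Z')
            return false;  // upper-case ASCII is disallowed
    }
    return true;
}

// Small self-check using the attribute names from this thread.
static void DemoDataAttribute() {
    assert(IsCustomDataAttribute("data-test"));
    assert(!IsCustomDataAttribute("data-"));
    assert(!IsCustomDataAttribute("data"));
    assert(!IsCustomDataAttribute("data-Test"));
}
```

Keeping this in the lexer leaves the generic ^prefix behaviour of the keyword list unchanged for other users.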

Neil Hodgson

Oct 27, 2010, 6:46:02 PM

to scintilla...@googlegroups.com

There is a bit of a clash between generic and specific here. You
are presumably using keywords because they can be added easily by the
application or user but then you are saying that there is an
additional check that upper-case is not allowed because of the specifics
of data-*. Other uses of attribute wildcards may allow upper-case.

Neil

Lex Trotman

Oct 27, 2010, 8:37:40 PM

to scintilla...@googlegroups.com

As I suggested on the Geany thread, the recognition of data-* where *
has no upper-case ASCII should be in LexHTML since it is HTML-specific.
It should not be in the keywords list since you can't list * :-)

The general keyword ^prefix already exists in Scintilla and should
stay as it is. BTW Neil, there does not appear to be any
documentation of this in the Scintilla docs, only a mention in the
SciTE prefs?

Perhaps in the longer term Scintilla could support separate keyword
regex lists as well as keyword lists for lexers that need them?
Separate lists means they won't get confused if keywords happen to
contain punctuation chars.

Cheers
Lex


Jason Oster

Oct 28, 2010, 12:12:14 AM

to scintilla...@googlegroups.com

Yes, the spec wants only lower-case letters in attribute names (*all* attributes, by the way! Go, go, HTML5) and in fact, the only characters disallowed, according to the "data-*" and "XML compatible" sections, are upper-case letters and colons. Somehow I don't think that means to imply null and whitespace characters are valid in attribute names, but I digress. The patch I proposed worked with upper-case letters, even though they were not included in the charList; DATA-TEST worked just as well as data-test. I'm not certain why, but it wasn't a bother to me.

I'd be happy with using ^data-, if Geany drivers would consider it. At least it's *something* that is usable.

Neil Hodgson

Oct 28, 2010, 6:23:40 PM

to scintilla...@googlegroups.com

Lex Trotman:

> The general keyword ^prefix already exists in Scintilla and should
> stay as it is. BTW Neil, there does not appear to be any
> documentation of this in the Scintilla docs only a mention in Scite
> prefs?

I can't recall the details but expect this feature was added for
SciTE. There is nowhere that really defines the APIs available to
lexers, so I added this to the WordList::InList implementation.

/** Check whether a string is in the list.
 * List elements are either exact matches or prefixes.
 * Prefix elements start with '^' and match all strings that start
 * with the rest of the element,
 * so '^GTK_' matches 'GTK_X', 'GTK_MAJOR_VERSION', and 'GTK_'.
 */

> Perhaps in the longer term Scintilla could support separate keyword
> regex lists as well as keyword lists for lexers that need them?

Full regex keywords would be much slower than literal keywords and
efforts to ameliorate this would add significant complexity.

Neil

Lex Trotman

Oct 28, 2010, 8:18:08 PM

to scintilla...@googlegroups.com

On 29 October 2010 09:23, Neil Hodgson <nyama...@gmail.com> wrote:
> Lex Trotman:
>
>> The general keyword ^prefix already exists in Scintilla and should
>> stay as it is. BTW Neil, there does not appear to be any
>> documentation of this in the Scintilla docs only a mention in Scite
>> prefs?
>
> I can't recall the details but expect this feature was added for
> SciTE. There is nowhere that really defines the APIs available to
> lexers, so I added this to the WordList::InList implementation.
>
> /** Check whether a string is in the list.
>  * List elements are either exact matches or prefixes.
>  * Prefix elements start with '^' and match all strings that start
>  * with the rest of the element,
>  * so '^GTK_' matches 'GTK_X', 'GTK_MAJOR_VERSION', and 'GTK_'.
>  */

Great.

>
>> Perhaps in the longer term Scintilla could support separate keyword
>> regex lists as well as keyword lists for lexers that need them?
>
> Full regex keywords would be much slower than literal keywords and
> efforts to ameliorate this would add significant complexity.

Yes, although it may still be "fast enough"(tm) and allows recognition
of some languages which it could not otherwise support. I am thinking
of a recent contract where the customer's DSL was edited in Emacs
because operators/keywords matched regexes. Even done in elisp it was
fast enough.

OTOH such languages are rare and I can understand that it might not be
worth Scintilla's trouble to do it.

Cheers
Lex

>
> Neil

Neil Hodgson

Oct 29, 2010, 7:38:29 PM

to scintilla...@googlegroups.com

Lex Trotman:

> Yes, although it may still be "fast enough"(tm) and allows recognition
> of some languages which it could not otherwise support. I am thinking
> of a recent contract where the customer's DSL was edited in Emacs
> because operators/keywords matched regexes. Even done in elisp it was
> fast enough.

Emacs has a quite competent regex engine. I expect Scintilla's
would be much slower in this case.

Neil
