I've attached a patch to allow wildcard matches in lexer keywords. This
is required to support the HTML5 data-* attribute type[1], but could
have some other uses that I'm unaware of. One possibility is
vendor-specific CSS properties such as -moz-*, -ms-*, -o-*, and
-webkit-*. (Support for vendor-specific CSS properties et al. was
added way back in version 1.77.)
So, I'm taking this to the list to get some additional feedback.
Mostly, I am interested in what other lexers might like to use this
functionality. The patch contains a modification to LexHTML to classify
attributes with wildcard matching, for the specific use-case of data-*
attributes. Also, as discussed on the Geany mailing list[2], it could
be useful for aria-*[3].
One note on the implementation: I decided to use strchr() to check for
the wildcard-maskable characters, instead of Scintilla's built-in
CharacterSet class. This is only because WordList.cxx already includes
string.h, but not CharacterSet.h. The unfortunate downside to this is
that calls to WordList::InListWildcard() require a character pointer
(string) to be passed, which could potentially get very messy.
See the single-line patch in LexHTML.cxx for an example. Making
charList an optional argument might be nice, defaulting to an
alphanumeric string, plus underscore. That should handle the common
cases well.
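
To make the idea concrete, here is a minimal standalone sketch of the kind of check described above. It is not the attached patch: the function name MatchesWildcard, the rule that '*' needs at least one character, and the default character list are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper, NOT the code from the attached patch.
// Returns true when `word` matches `pattern`. A trailing '*' in `pattern`
// stands for one or more characters, each of which must appear in `charList`.
// The default charList (alphanumerics plus underscore) is the optional
// default suggested above; it is an assumption, not existing Scintilla API.
bool MatchesWildcard(const char *pattern, const char *word,
                     const char *charList =
                         "abcdefghijklmnopqrstuvwxyz"
                         "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                         "0123456789_") {
    const std::size_t patLen = std::strlen(pattern);
    // No trailing wildcard: fall back to an exact comparison.
    if (patLen == 0 || pattern[patLen - 1] != '*') {
        return std::strcmp(pattern, word) == 0;
    }
    // The literal part before '*' must match the start of the word.
    const std::size_t prefixLen = patLen - 1;
    if (std::strncmp(pattern, word, prefixLen) != 0) {
        return false;
    }
    const char *rest = word + prefixLen;
    // Assumption: the wildcard must cover at least one character.
    if (*rest == '\0') {
        return false;
    }
    // Every remaining character must be in the maskable set (strchr lookup).
    for (; *rest != '\0'; ++rest) {
        if (std::strchr(charList, *rest) == nullptr) {
            return false;
        }
    }
    return true;
}
```

With this sketch, MatchesWildcard("data-*", "data-test") is true, while "data-" and "data" are rejected. A caller that wants hyphens or only lower-case letters would have to pass its own charList, which is the messy part mentioned above.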
[1] http://www.whatwg.org/specs/web-apps/current-work/#embedding-custom-non-visible-data-with-the-data-*-attributes
[2] http://lists.uvena.de/pipermail/geany/2010-October/006087.html
[3] http://www.whatwg.org/specs/web-apps/current-work/#annotations-for-assistive-technology-products-%28aria%29
> I've attached a patch to allow wildcard matches in lexer keywords. This is
> required to support the HTML5 data-* attribute type[1], but could have some
> other uses that I'm unaware of. Perhaps supporting vendor-specific CSS
> properties, for example: -moz-*, -ms-*, -o-*, -webkit-*, ... just as an
> example.
There is currently support for prefix keywords which start with ^.
For example, the C++ lexer allows ^GTK_ to match all the identifiers
that start with GTK_.
Neil
Thank you, Neil. Unfortunately, it doesn't quite work the same way. A
few examples to illustrate:
<div data-test="good" data="bad" data-="bad"/>
When using ^data- as a keyword, we would expect the "data-test"
attribute to be styled as a valid attribute, and "data" and "data-"
attributes to be styled as unknown attributes. I'm getting all three
styled as valid attributes. Is this behavior a bug, or by design?
> Thank you, Neil. Unfortunately, it doesn't quite work the same way. A few
> examples to illustrate:
>
> <div data-test="good" data="bad" data-="bad"/>
>
> When using ^data- as a keyword, we would expect the "data-test" attribute to
> be styled as a valid attribute, and "data" and "data-" attributes to be
> styled as unknown attributes. I'm getting all three styled as valid
> attributes. Is this behavior a bug, or by design?
It is supposed to be a prefix, so "^data-" should match "data-test"
and "data-" but not "data". It's a bug if "^data-" matches "data".
Neil
You're right: "data" was already defined as a keyword, which is why it
was getting styled. No bug there. Highlighting "data-" by itself might
be OK, but it's certainly against the spec:
"A custom data attribute is an attribute in no namespace whose name
starts with the string "data-", *has at least one character after the
hyphen*, is XML-compatible, and contains no characters in the range
U+0041 to U+005A (LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z)."
Emphasis my own.
I suppose it's usable. (Aside: Styling "GTK_" and other lone prefixes
also looks odd.)
> I suppose it's usable. (Aside: Styling "GTK_" and other lone prefixes also
> looks odd.)
Yes, but I don't want to change this in case it is used. There
could be an optional parameter to not match unless there are more
characters in the word but it doesn't seem that important to me. If
you want to apply other rules like no A..Z then this looks too
specialized to go into WordList so belongs in LexHTML.
Neil
Neil
As I suggested on the Geany thread, the recognition of data-* where *
has no upper-case ASCII should be in LexHTML since it is HTML-specific. It
should not be in the keywords list since you can't list * :-)
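
A rough sketch of what such an HTML-specific check might look like, based only on the spec wording quoted earlier in the thread. The function name IsCustomDataAttribute is hypothetical and is not part of LexHTML.cxx, and the "XML-compatible" requirement is reduced here to rejecting colons, which is a simplification.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper, not existing LexHTML code.
// Checks the rules quoted from the HTML5 draft: the name starts with
// "data-", has at least one character after the hyphen, and contains
// no characters in the range A..Z.
bool IsCustomDataAttribute(const char *name) {
    static const char prefix[] = "data-";
    const std::size_t prefixLen = sizeof(prefix) - 1;
    if (std::strncmp(name, prefix, prefixLen) != 0) {
        return false;
    }
    const char *rest = name + prefixLen;
    // A lone "data-" is not a valid custom data attribute.
    if (*rest == '\0') {
        return false;
    }
    for (; *rest != '\0'; ++rest) {
        // No upper-case ASCII letters anywhere after the prefix.
        if (*rest >= 'A' && *rest <= 'Z') {
            return false;
        }
        // Simplified stand-in for "XML-compatible": reject colons only.
        if (*rest == ':') {
            return false;
        }
    }
    return true;
}
```

Here "data-test" passes, while "data", "data-" and "data-Test" are rejected. Whether the lexer sees the attribute name before or after any case folding is not something this sketch addresses.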
The general keyword ^prefix already exists in Scintilla and should
stay as it is. BTW Neil, there does not appear to be any
documentation of this in the Scintilla docs, only a mention in the
SciTE prefs?
Perhaps in the longer term Scintilla could support separate keyword
regex lists as well as keyword lists for lexers that need them?
Separate lists means they won't get confused if keywords happen to
contain punctuation chars.
Cheers
Lex
Yes, the spec wants only lower-case letters in attribute names (*all* attributes, by the way! Go, go, HTML5) and in fact, the only characters disallowed, according to the "data-*" and "XML compatible" sections, are upper-case letters and colons. Somehow I don't think that means to imply null and whitespace characters are valid in attribute names, but I digress. The patch I proposed worked with upper-case letters, even though they were not included in the charList; DATA-TEST worked just as well as data-test. I'm not certain why, but it wasn't a bother to me.
I'd be happy with using ^data-, if Geany drivers would consider it. At least it's *something* that is usable.
> The general keyword ^prefix already exists in Scintilla and should
> stay as it is. BTW Neil, there does not appear to be any
> documentation of this in the Scintilla docs only a mention in Scite
> prefs?
I can't recall the details but expect this feature was added for
SciTE. There is nowhere that really defines the APIs available to
lexers, so I added this to the WordList::InList implementation.
/** Check whether a string is in the list.
 * List elements are either exact matches or prefixes.
 * Prefix elements start with '^' and match all strings that start
 * with the rest of the element,
 * so '^GTK_' matches 'GTK_X', 'GTK_MAJOR_VERSION', and 'GTK_'.
 */
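
The rule in that comment can be sketched for a single list element as below. This is only an illustration of the documented behaviour, not the actual WordList::InList code, which looks words up in a list rather than comparing one element at a time; the name ElementMatches is made up for the example.

```cpp
#include <cstring>

// Illustrative only: compares one keyword list element against a word
// using the rule from the comment above. Not the real WordList code.
bool ElementMatches(const char *element, const char *word) {
    if (element[0] == '^') {
        // Prefix element: the word must start with everything after '^'.
        const char *prefix = element + 1;
        return std::strncmp(word, prefix, std::strlen(prefix)) == 0;
    }
    // Ordinary element: exact match required.
    return std::strcmp(element, word) == 0;
}
```

So "^GTK_" matches "GTK_X" and the lone "GTK_", and "^data-" matches "data-test" and "data-" but not "data".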
> Perhaps in the longer term Scintilla could support separate keyword
> regex lists as well as keyword lists for lexers that need them?
Full regex keywords would be much slower than literal keywords and
efforts to ameliorate this would add significant complexity.
Neil
Great.
>
>> Perhaps in the longer term Scintilla could support separate keyword
>> regex lists as well as keyword lists for lexers that need them?
>
> Full regex keywords would be much slower than literal keywords and
> efforts to ameliorate this would add significant complexity.
Yes, although it may still be "fast enough"(tm) and allows recognition
of some languages which it could not otherwise support. I am thinking
of a recent contract where the customer's DSL was edited in Emacs
because operators/keywords matched regexes. Even done in elisp it was
fast enough.
OTOH such languages are rare and I can understand that it might not be
worth Scintilla's trouble to do it.
Cheers
Lex
>
> Neil
> Yes, although it may still be "fast enough"(tm) and allows recognition
> of some languages which it could not otherwise support. I am thinking
> of a recent contract where the customer's DSL was edited in Emacs
> because operators/keywords matched regexes. Even done in elisp it was
> fast enough.
Emacs has a quite competent regex engine. I expect Scintilla's
would be much slower in this case.
Neil