Lexer metadata changes

Neil Hodgson

Jun 13, 2017, 6:12:32 PM
to Scintilla mailing list
[Starting a new thread for visibility]

In a previous email I proposed some changes to lexers to allow applications to query at runtime for a lexer how many styles there were and to retrieve 3 string values for each style: name, tags, and description.

Implementing this will require some initial work to enable but also ongoing effort to ensure lexers provide this metadata and update it as the lexer changes.

This is only worthwhile if applications want metadata in this form at runtime.

There are other possible approaches. One is to make this data available as text files for inclusion in applications at build or install time. The current SciTE *.properties files could be the basis of this, possibly with some cleaning up, additions and formalization. A script could be provided to extract this metadata from the *.properties files. Application developers could extend the script to write the metadata into their preferred format. An alternative is to build a canonical text representation of metadata and then use this to populate application metadata including SciTE’s *.properties.

One difficulty with using *.properties is that they currently contain different degrees of cleverness and indirection. For example, conf.properties simply assigns a list of keywords to keywords.$(file.patterns.conf) whereas lua.properties has a hierarchy of keywords based on language version with it being possible to switch from Lua 5 to Lua 4 by setting controlling variables.
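To illustrate the indirection problem, here is a minimal Python sketch of expanding $(name) references in property names and values. The property names and values in `props` are made up for illustration and are not copied from any real *.properties file, and the sketch ignores the rest of the properties syntax (imports, conditionals, wildcard matching).

```python
import re

VAR = re.compile(r"\$\(([^()]+)\)")

def expand(text, props, depth=0):
    """Replace each $(name) with props[name], recursively; unknown names become ''."""
    if depth > 20:  # guard against self-referential definitions
        return text

    def substitute(match):
        return expand(props.get(match.group(1), ""), props, depth + 1)

    return VAR.sub(substitute, text)

# Hypothetical property set, for illustration only.
props = {
    "file.patterns.conf": "*.conf",
    "keywordclass.lua5": "and break do",
    "keywordclass.lua": "$(keywordclass.lua5)",
}

key = expand("keywords.$(file.patterns.conf)", props)
lua_keywords = expand("$(keywordclass.lua)", props)
```

An extraction script would need at least this much evaluation before it could tell which keyword list belongs to which lexer, which is the extra cleverness that makes lua.properties harder to mine than conf.properties.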

Neil

Matthew Brush

Jun 13, 2017, 7:51:45 PM
to scintilla...@googlegroups.com
On 2017-06-13 03:12 PM, Neil Hodgson wrote:
> [Starting a new thread for visibility]
>
> In a previous email I proposed some changes to lexers to allow applications to query at runtime for a lexer how many styles there were and to retrieve 3 string values for each style: name, tags, and description.
>

FWIW, I saw the other message and didn't respond since it sounded good
to me.

The only thing I would like to see, which is slightly off-topic and
could probably be added later based on those changes, would be to
include other non-lexer styles in this API as well. For example editors
usually use the same GUI for changing styles such as the line numbers,
visual whitespace, current line highlight, comments, literal strings, etc.

> Implementing this will require some initial work to enable but also ongoing effort to ensure lexers provide this metadata and update it as the lexer changes.
>

It seems like it would not be too much work once all the initial data is
collected. Perhaps having an "Adding a new Lexer" guide/checklist (if it
doesn't exist) would assist in making sure the needed metadata is
provided/updated, along with the other stuff.

> This is only worthwhile if applications want metadata in this form at runtime.
>

For an editor like Geany, having the data at compile-time would suffice,
but adding support in the future for dynamically loaded lexers or such
would make having it available at runtime more useful. Also, if an
application wants the data at compile-time, it could write a simple
helper application that uses Scintilla API to spit out the data in their
preferred format.

> There are other possible approaches. One is to make this data available as text files for inclusion in applications at build or install time. The current SciTE *.properties files could be the basis of this, possibly with some cleaning up, additions and formalization. A script could be provided to extract this metadata from the *.properties files. Application developers could extend the script to write the metadata into their preferred format. An alternative is to build a canonical text representation of metadata and then use this to populate application metadata including SciTE’s *.properties.
>
> One difficulty with using *.properties is that they currently contain different degrees of cleverness and indirection. For example, conf.properties simply assigns a list of keywords to keywords.$(file.patterns.conf) whereas lua.properties has a hierarchy of keywords based on language version with it being possible to switch from Lua 5 to Lua 4 by setting controlling variables.
>

IMO, DSLs for such stuff are a pain. It would be better to provide the
data in a common format that can be read in many languages, often
directly into a native data structure (ex. JSON, XML, etc). Having to
write a special parser or hack apart a big script written by someone
else just to get the data in memory in a suitable format is sub-optimal,
IMO.

Better is to just allow stuff like `theData = json.loads(specFile.read())`.
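A slightly fuller sketch of that idea, assuming a JSON layout that nobody has agreed on: the field names "lexer", "styles", "name", "tags" and "description" are hypothetical, and the text is embedded in a string here rather than read from a file so the example is self-contained.

```python
import json

# Hypothetical spec format; the field names are assumptions, not a proposal
# that anyone in this thread has signed off on.
spec_text = """
{
  "lexer": "cpp",
  "styles": [
    {"name": "SCE_C_COMMENT", "tags": "comment", "description": "Comment"},
    {"name": "SCE_C_COMMENTLINE", "tags": "comment line", "description": "Line Comment"}
  ]
}
"""

the_data = json.loads(spec_text)

# The parsed data is ordinary dicts and lists, ready to use.
tags_by_name = {style["name"]: style["tags"].split() for style in the_data["styles"]}
```

No custom parser is involved: the application only decides how to index the result.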

Regards,
Matthew Brush

KHMan

Jun 14, 2017, 1:32:38 PM
to scintilla...@googlegroups.com
On 6/14/2017 7:51 AM, Matthew Brush wrote:
> On 2017-06-13 03:12 PM, Neil Hodgson wrote:
>> [Starting a new thread for visibility]
>>
>> In a previous email I proposed some changes to lexers to
>> allow applications to query at runtime for a lexer how many
>> styles there were and to retrieve 3 string values for each
>> style: name, tags, and description.

[snip snip]

>> One difficulty with using *.properties is that they
>> currently contain different degrees of cleverness and
>> indirection. For example, conf.properties simply assigns a list
>> of keywords to keywords.$(file.patterns.conf) whereas
>> lua.properties has a hierarchy of keywords based on language
>> version with it being possible to switch from Lua 5 to Lua 4 by
>> setting controlling variables.

> IMO, DSLs for such stuff are a pain. It would be better to provide
> the data in a common format that can be read in many languages,
> often directly into a native data structure (ex. JSON, XML, etc).
> Having to write a special parser or hack apart a big script
> written by someone else just to get the data in memory in a
> suitable format is sub-optimal, IMO.

Having updated the Lua lexer the last few times, I have no opinion
on this. I just followed what was laid out by the previous author(s).

My guess is that Lua coders would think the ability to adjust the
keywords easily according to Lua version is a useful feature but
most would be using the default setting anyway and never have to
change anything, a bit like the serial or parallel I/O pin headers
on your PC motherboard...

[snip]

--
Cheers,
Kein-Hong Man (esq.)
Selangor, Malaysia

Neil Hodgson

Jun 19, 2017, 7:37:54 AM
to scintilla...@googlegroups.com
Attached is some more implementation for lexer metadata. It adds metadata to 3 lexers: cpp, lua, and python. These show different ways of providing the metadata. cpp and python are object lexers, with cpp implementing its own metadata response - currently just based on an array, but which could be expanded with dynamically added lexical states due to substyles. The python lexer provides a constant array of metadata to a default lexer implementation called DefaultLexer. lua is an old-style functional lexer, and the LexerModule constructor was extended to take an array of metadata, so the Lua lexer initializes that.

Defining the tags for a good classification will be difficult. The similarities between programming languages help define a reasonable set of tags although many programming languages have some unique concepts. However, the programming tags don’t really map well to markup and data languages.

Here is my current common scheme for programming and assembler languages: an optional status; a base type; a set of type modifiers:

status? base-type modifiers*

The status may be (error | unused). The error status is used for lexical statuses that indicate errors in the source code such as unterminated quoted strings. The unused status may indicate a gap in the lexical states, possibly because an old lexical class is no longer used or an upcoming lexical class may fill that position.

The basic types are (default | operator | keyword | identifier | literal | comment | preprocessor | instruction | label | register). The default type is commonly used for spaces and tabs between tokens although it may cover other characters in some languages.

The comment base type may have (documentation | line | taskmarker) modifiers.

The literal base type structures its modifiers into an optional data type followed by additional attributes with the data types from (numeric | boolean | string | regex | date | time | uuid | nil | compound). Additional attributes include (integer | real) for numeric and (heredoc | character | escapesequence | interpolated | multiline | raw) for strings.

The lexical classes definition for cpp in the patch is:

"SCE_C_DEFAULT", "default whitespace", "White space",
"SCE_C_COMMENT", "comment", "Comment: /* */.",
"SCE_C_COMMENTLINE", "comment line", "Line Comment: //.",
"SCE_C_COMMENTDOC", "comment documentation", "Doc comment: block comments beginning with /** or /*!",
"SCE_C_NUMBER", "literal numeric", "Number",
"SCE_C_WORD", "keyword", "Keyword",
"SCE_C_STRING", "literal string", "Double quoted string",
"SCE_C_CHARACTER", "literal string character", "Single quoted string",
"SCE_C_UUID", "literal uuid", "UUIDs (only in IDL)",
"SCE_C_PREPROCESSOR", "preprocessor", "Preprocessor",
"SCE_C_OPERATOR", "operator", "Operators",
"SCE_C_IDENTIFIER", "identifier", "Identifiers",
"SCE_C_STRINGEOL", "error literal line string", "End of line where string is not closed",
"SCE_C_VERBATIM", "literal string multiline raw", "Verbatim strings for C#",
"SCE_C_REGEX", "literal regex", "Regular expressions for JavaScript",
"SCE_C_COMMENTLINEDOC", "comment documentation line", "Doc Comment Line: line comments beginning with /// or //!.",
"SCE_C_WORD2", "identifier", "Keywords2",
"SCE_C_COMMENTDOCKEYWORD", "comment documentation keyword", "Comment keyword",
"SCE_C_COMMENTDOCKEYWORDERROR", "error comment documentation keyword", "Comment keyword error",
"SCE_C_GLOBALCLASS", "identifier", "Global class",
"SCE_C_STRINGRAW", "literal string multiline raw", "Raw strings for C++0x",
"SCE_C_TRIPLEVERBATIM", "literal string multiline raw", "Triple-quoted strings for Vala",
"SCE_C_HASHQUOTEDSTRING", "literal string", "Hash-quoted strings for Pike",
"SCE_C_PREPROCESSORCOMMENT", "comment preprocessor", "Preprocessor stream comment",
"SCE_C_PREPROCESSORCOMMENTDOC", "comment preprocessor documentation", "Preprocessor stream doc comment",
"SCE_C_USERLITERAL", "literal", "User defined literals",
"SCE_C_TASKMARKER", "comment taskmarker", "Task Marker",
"SCE_C_ESCAPESEQUENCE", "literal string escapesequence", "Escape sequence",

The sets of tags and ordering here represent common patterns but are nowhere near exhaustive. The tags used are likely to expand greatly and the relationships between tags become more complex. The set of tags should remain open-ended so that new languages can represent their own concepts but contributors should strive to reuse existing examples as much as possible.
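As a rough illustration of how an application might consume a listing like the cpp one above, here is a Python sketch that groups a flat sequence of strings into (name, tags, description) records. The `LexicalClass` record and `to_classes` helper are hypothetical names for this example, and treating the list position as the style number is an assumption.

```python
from typing import NamedTuple

class LexicalClass(NamedTuple):
    name: str
    tags: str
    description: str

# First few entries of the cpp listing, kept as a flat sequence of strings.
flat = [
    "SCE_C_DEFAULT", "default whitespace", "White space",
    "SCE_C_COMMENT", "comment", "Comment: /* */.",
    "SCE_C_COMMENTLINE", "comment line", "Line Comment: //.",
    "SCE_C_COMMENTDOC", "comment documentation",
    "Doc comment: block comments beginning with /** or /*!",
]

def to_classes(strings):
    """Group a flat sequence into name, tags, description triples."""
    if len(strings) % 3:
        raise ValueError("expected name, tags, description triples")
    return [LexicalClass(*strings[i:i + 3]) for i in range(0, len(strings), 3)]

classes = to_classes(flat)

# Assumption: the position in the list is the style number.
comment_styles = [i for i, c in enumerate(classes) if "comment" in c.tags.split()]
```

Selecting by tag rather than by name is what lets an application treat all comment-like styles alike without knowing the lexer in advance.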

It is reasonable for a lexer to refine the set of tags for a style in a new version and applications should try to handle this flexibly.

Applications could provide user interfaces that are based on the tags or that just use the tags for initial style assignments and then show a list of all lexical classes.

It may be complex to define a set of rules to produce a visual style from a set of tags as it should allow top level choices (such as changing the colour of comments) to have wide effect while still ensuring modifiers also produce visible results to distinguish between, say, comments and documentation comments.

Neil
LexicalClass.patch

Neil Hodgson

Jun 26, 2017, 9:36:51 PM
to scintilla-interest
   Have done some more work on lexer metadata and written some documentation. Will try to implement an alternate styling system in SciTE to see if this is reasonably useful.

   Attached are some patches to current lexers with some tags and descriptions.

   Including style identifiers in lexers like this allows automatic regeneration of Scintilla.iface if new lexical states are added to a lexer so lexers become the primary source of these states. The attached LexerToIface.py script is an early version of this: it doesn’t yet handle all possible cases.

   Metadata documentation:

Language Types

Scintilla contains lexers for various types of languages:

  • Programming languages like C++, Java, and Python.
  • Assembler languages are low-level programming languages which may additionally include instructions and registers.
  • Markup languages like HTML, TeX, and Markdown.
  • Data languages like EDIFACT and YAML.

Some languages can be used in different ways. JavaScript is a programming language but also the basis of JSON data files. Similarly, Lisp s-expressions can be used for both source code and data.

Each language type has common elements such as identifiers in programming languages. These common elements should be identified so that languages can be displayed with common styles for these elements. Style tags are used for this purpose in Scintilla.

Style Tags

Every style has a list of tags where a tag is a lower-case word containing only the common ASCII letters 'a'-'z' such as "comment" or "operator".

Tags are ordered from most important to least important.

While applications may assign visual attributes for tag lists in many different ways, one reasonable technique is to apply tag-specific attributes in reverse order so that earlier and more important tags override less important tags. For example, the tag list "error comment documentation keyword" with a set of tag attributes 
{ comment=fore:green,back:very-light-green,font:Serif documentation=fore:light-green error=strikethrough keyword=bold }
could be rendered as 
bold,fore:light-green,back:very-light-green,font:Serif,strikethrough.

Alternative renderings could check for multi-tag combinations like { comment.documentation=fore:light-green comment.line=dark-green comment=green }.

Commonly, a tag list will contain an optional status; a base type; and a set of type modifiers:
status? base-type modifiers*

Status

The status may be (error | unused | predefined | inactive).


The error status is used for lexical statuses that indicate errors in the source code such as unterminated quoted strings.
The unused status may indicate a gap in the lexical states, possibly because an old lexical class is no longer used or an upcoming lexical class may fill that position.

The predefined status indicates a style in the range 32..39 that is used for non-lexical purposes in Scintilla.
The inactive status is used for text that is not currently interpreted such as C++ code that is contained within a '#if 0' preprocessor block.

Basic Types

The basic types for programming languages are (default | operator | keyword | identifier | literal | comment | preprocessor | label).


The default type is commonly used for spaces and tabs between tokens although it may cover other characters in some languages.

Assembler languages add (instruction | register) to the basic types from programming languages.

The basic types for markup languages are (default | tag | attribute | comment | preprocessor).

The basic types for data languages are (default | key | data | comment).

Comments

Programming languages may differentiate between line and stream comments and treat documentation comments as distinct from other comments. Documentation comments may be marked up with documentation keywords.
The additional attributes commonly used are (line | documentation | keyword | taskmarker).

Literals

Programming and assembler languages contain a rich set of literals including numbers like 7 and 3.89e23; strings like "string\n"; and nullptr. Differentiating between these is often wanted.
The common literal types are (numeric | boolean | string | regex | date | time | uuid | nil | compound).
Numeric literal types are subdivided into (integer | real).
String literal types may add (perhaps multiple) further attributes from (heredoc | character | escapesequence | interpolated | multiline | raw).

An escape sequence within an interpolated heredoc may thus be literal string heredoc escapesequence.
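The "status? base-type modifiers*" shape can be sketched in a few lines of Python. This is only one possible reading: `split_tags` is a hypothetical helper, it assumes at most one status tag and that the status comes first, and it does no validation of the base type or modifiers since the tag set is meant to stay open-ended.

```python
STATUSES = {"error", "unused", "predefined", "inactive"}

def split_tags(tag_list):
    """Split 'status? base-type modifiers*' into (status, base_type, modifiers)."""
    tags = tag_list.split()
    status = tags.pop(0) if tags and tags[0] in STATUSES else None
    base_type = tags[0] if tags else None
    return status, base_type, tags[1:]

escape = split_tags("literal string heredoc escapesequence")
unterminated = split_tags("error literal line string")
gap = split_tags("unused")
```

Here `escape` has no status, a base type of "literal" and three modifiers, while `unterminated` carries the "error" status in front of the same base type.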

List of known tags

attribute       Markup attribute
boolean         True or false literal
character       Single character literal as opposed to a string literal
comment         The standard comment type in a language: may be stream or line
compound        Literal containing multiple subliterals such as a tuple or complex number
data            A value in a data file
date            Literal representing a date such as '19/November/1975'
default         Starting state commonly also used for white space
documentation   Comment that can be extracted into documentation
error           State indicating an invalid or erroneous element
escapesequence  Parts of a string that are not literal such as '\t' for tab in C
heredoc         Lengthy text literal marked by a word at both ends
identifier      Name that identifies an object or class of object
inactive        Code that is not currently interpreted
instruction     Mnemonic in assembler languages like 'addc'
integer         Numeric literal with no fraction or exponent like '738'
interpolated    String that can contain expressions
key             Element which allows finding associated data
keyword         Reserved word with special meaning like 'while'
label           Destination for jumps in programming and assembler languages
line            Differentiates between stream comments and line comments in languages that have both
literal         Fixed value in source code
multiline       Differentiates between single line and multiline elements, commonly strings
nil             Literal for the null pointer such as nullptr in C++ or NULL in C
numeric         Literal number like '16'
operator        Punctuation character such as '&' or '['
predefined      Style in the range 32..39 that is used for non-lexical purposes
preprocessor    Element that is recognized in an early stage of translation
raw             String type that avoids interpretation: may be used for regular expressions in languages without a specific regex type
real            Numeric literal which may have a fraction or exponent like '3.84e-15'
regex           Regular expression literal like '^[a-z]+'
register        CPU register in assembler languages
string          Sequence of characters
tag             Markup tag like '<br />'
taskmarker      Word in comment that marks future work like 'FIXME'
time            Literal representing a time such as '9:34:31'
unused          Style that is not currently used
uuid            Universally unique identifier often used in interface definition files which may look like '{098f2470-bae0-11cd-b579-08002b30bfeb}'

Extension

Each element in this scheme may be extended in the future. This may be done by revising this document to provide a common approach to new features. Individual lexers may also choose to expose unique language features through new tags.

Translation

Tags could be exposed directly in user interfaces or configuration languages. However, an application may also translate these to match its naming schema. Capitalization and punctuation could be different (like Here-Doc instead of heredoc), terminology changed ("constant" instead of "literal"), or human language changed from English to Chinese or Spanish.

Starting from a common set of tags makes these modifications tractable.

Open issues

HTML contains embedded sub-languages like JavaScript and PHP. How should these sublanguages be marked?
Possibly numbered: sublanguage(1) comment 
or named: sublanguage(javascript) comment.
If they are named, is there a known list of languages? Can server-side JavaScript be differentiated from client-side?

The C++ lexer (for example) has inactive states and dynamically allocated substyles. These should be exposed through the metadata mechanism but are not currently.


   Neil
LexersWithMetaData.patch
LexerToIface.py

dail8859

Jun 27, 2017, 8:58:53 AM
to scintilla-interest, nyama...@me.com
I'll put in my two cents and at least state that adding lexer metadata will be a huge advantage, especially for editors that support numerous languages. I've dealt with Notepad++ a good bit, and it can get quite complex and burdensome to try to configure a new lexer for proper support for all themes (this normally means hard coding a hundred or more color values (which is not fun)).

Instead of ordering the tags from most important to least important, have you given any thought to using some type of hierarchical approach? This is similar to what you have already (and would complement your approach of specifying styles for multi-tag combinations) but would enforce a bit more structure on the tags.  TextMate uses this kind of format. This would allow the lexer to be as specific as possible for each of its styles, but allow the user of Scintilla to only be as specific as they want to be. I'm not saying use the exact structure from TextMate, as it might not map over well to the current lexers.

Thanks,
Justin

Neil Hodgson

Jun 28, 2017, 4:13:04 AM
to scintilla-interest
Justin:

> Instead of ordering the tags from most important to least important, have you given any thought to using some type of hierarchical approach?

A hierarchy would be simpler and was my original thought. There were some cases that seemed to not work well with a hierarchy.

The more important cases are the status tags which ‘cross-cut’ other tags. ‘inactive’, for example, gets applied to every style. In concrete style allocation, this doubles the number of styles. This could lead to doubling the number of UI elements or settings in an options language. However, the visual effect used in default SciTE is to make ‘inactive’ styles the same as their base styles but duller. A rule like ‘inactive=fore:colour(50%)’ would be a good way of capturing this but it means that ‘inactive’ is effectively independent of the main hierarchy. It's a bit like multiple inheritance in programming: an inactive line comment inherits from both ‘comment line’ and ‘inactive’.

Another area was the set of string literal attributes where there are quite a few possibilities: heredoc, character, escapesequence, interpolated, multiline, and raw. There are also element type distinctions (wide/Unicode versus narrow, possibly UTF-8) which may be wanted. An ordering can be constructed but, it appears to me, that any particular order may not work well for all languages as languages emphasize different types of string handling. Allowing language implementers to choose an order may allow a better fit.

Each language could choose a different hierarchy (by changing tag order) but then commonality is decreased. Stating up-front that parts of the tag set may be ordered differently between languages may allow applications to adapt.

Neil

thomas_li...@hotmail.com

Jun 28, 2017, 6:39:26 AM
to scintilla-interest, nyama...@me.com
The idea resembles CSS classes a lot. Each lexer state corresponds to a list of "classes" (put on a corresponding <span> element).

In CSS there are clear (or at least strict) rules defining how styles combine, etc.  (The order of classes on an element has no effect on the styling in CSS).

This suggestion is more free/suggestive when it comes to the handling of the tags, and thus what the prioritization of the tags means.  I wonder if that can make it unclear to lexer developers how to prioritize their tags.  If you develop and test your lexer with a certain set of styles and a certain styling machinery you may think that the tag order should be something, but with a different setup you think it should be different.

In that respect the hierarchical approach seems clearer/simpler.  Maybe it is possible to combine the two approaches: a hierarchical syntax class categorization plus a number of additional orthogonal tags (like error, inactive, etc).
That may also make it easier to deal with languages embedded in other languages, because the embedded property seems to be completely orthogonal to the syntax class of a construct, and should perhaps be used to select a completely different "theme".

Thomas Linder Puls

Neil Hodgson

Jul 10, 2017, 2:36:57 AM
to scintilla-interest
dail8859:

> TextMate uses this kind of format. This would allow the lexer to be as specific as possible for each of its styles, but allow the user of Scintilla to only be as specific as they want to be. I'm not saying use the exact structure from TextMate, as it might not map over well the current lexers.

TextMate is interesting. It allows arbitrary nested scopes at any point which is a greater degree of freedom than Scintilla which assigns a single style number to each character. These styles could in some cases be grouped (html, script javascript, server php) to produce something with some similarities to that aspect of TextMate but it wouldn’t allow as much flexibility as TextMate.

TextMate's matching of multiple selectors suits a text configuration language more than a GUI. Perhaps GUI styling dialogues would need another description to specify which combinations of selectors need exposure.

The set of elements used by TextMate appears irregular, as if the elements were defined as issues arose. The markup elements, in particular, refer to concrete visual styles instead of semantic structure.

TextMate’s list of selector elements is similar to those listed in my earlier mail but uses some different terms. There could be some value in deferring to TextMate here with these changes from the earlier mail: literal→constant; error→invalid and “markup” becomes a basic type for markup languages with “tag” and “attribute” then refinements of “markup”.

Since the language composition problem is difficult it may be worth constraining this for now to just handling the most common case: HTML with scripts. While HTML can be sliced in various ways, there are scripts that either run on the server or the client and these scripts can be in various languages such as JavaScript, PHP, Python, Basic, and others not supported by Scintilla’s lexer. CSS could also be considered a client language although this is not currently supported by the lexer.

An optional embedded language sequence could be added to the set of tags. It would be introduced by (server | client) followed by a language name tag (javascript | php | python | basic). The known language tags may be extended over time. Thus the tags for a client-side javascript line comment may be:
client javascript comment line
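
A small Python sketch of peeling such a sequence off a tag list, under the assumption that the sequence always sits at the front of the list and that the language set is the four names mentioned above. `split_embedded` is a hypothetical helper, not part of any patch.

```python
LOCATIONS = {"server", "client"}
LANGUAGES = {"javascript", "php", "python", "basic"}  # expected to grow over time

def split_embedded(tag_list):
    """Return (location, language, remaining_tags) for a style's tag list."""
    tags = tag_list.split()
    if len(tags) >= 2 and tags[0] in LOCATIONS and tags[1] in LANGUAGES:
        return tags[0], tags[1], tags[2:]
    return None, None, tags

embedded = split_embedded("client javascript comment line")
plain = split_embedded("comment line")
```

The remaining tags are the same ones a standalone lexer would report, so an application could pick a theme from the first two values and then style the rest as usual.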

Other terms may be better - “script" instead of “client” may work better for languages that aren’t client/server but that wouldn’t work well for “css”. “host” may be preferred to “server”.

Neil

Neil Hodgson

Jul 17, 2017, 2:19:43 AM
to scintilla...@googlegroups.com
Committed (change sets 6345 to 6354) an initial implementation of lexer metadata, with the cpp, lua, python, hypertext, and xml lexers providing metadata. The cpp lexer provides metadata for inactive styles and substyles but the others don’t. Of the others, only python uses substyles. The metadata provided for inactive styles and substyles by the cpp lexer is limited to tags and it's not clear that providing names or descriptions is useful.

SciTE was changed to understand style metadata and use this for styling. It works quite well although it has limitations, and it's unlikely this will be committed to SciTE.

The main part of the SciTE change is to have a list of “tag.style” properties which are used to interpret the tags returned for each style into a visual appearance. Here is an example:

tag.style.base=back:$(scheme.back),$(font.base)
tag.style.comment=fore:#007000,font:Georgia,size:10.1
tag.style.comment.line=fore:#10A000,italics
tag.style.comment.documentation.keyword=back:#F0C0C0
tag.style.keyword=fore:#404080,bold
tag.style.literal=fore:#009090
tag.style.literal.string=fore:#900090
tag.style.operator=fore:$(scheme.fore),bold
tag.style.preprocessor=fore:#A0A000
tag.style.tag=fore:#303090
tag.style.client=back:#D7D7FF,eolfilled
tag.style.server=back:#D7FFD7,eolfilled
tag.style.server.php=back:#FFF8F8,eolfilled
tag.style.server.php.identifier=italics
tag.style.identifier=fore:$(scheme.fore)
tag.style.inactive=fade:50
tag.style.error=back:#FF0000

The prefix “tag.style.” is removed from each property; ‘.’ is replaced with ‘ ‘ so it can be treated as a list of tags; each tag list is searched for in each style’s tag list and a match causes the value to be incorporated into the visual appearance. The list is sorted which means that “comment line” is applied after “comment” so can override “comment”. Cleverer techniques may be needed but this set actually worked quite well.
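The matching steps just described can be sketched in Python roughly as follows. This is not the SciTE code: `style_for` and `contains_run` are hypothetical helpers, "searched for" is interpreted here as a contiguous run of tags (a subset match would be another reasonable reading), and the special "base" entry is left out.

```python
PREFIX = "tag.style."

def contains_run(haystack, needle):
    """True if the list needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

def style_for(style_tags, properties):
    """Merge every matching tag.style.* value, visiting keys in sorted order."""
    tags = style_tags.split()
    merged = {}
    for key in sorted(properties):
        if not key.startswith(PREFIX):
            continue
        selector = key[len(PREFIX):].replace(".", " ").split()
        if contains_run(tags, selector):
            for item in properties[key].split(","):
                name, _, value = item.partition(":")
                merged[name] = value  # later keys in sorted order override earlier ones
    return merged

props = {
    "tag.style.comment": "fore:#007000,font:Georgia,size:10.1",
    "tag.style.comment.line": "fore:#10A000,italics",
    "tag.style.keyword": "fore:#404080,bold",
}

line_comment = style_for("comment line", props)
```

Because "tag.style.comment" sorts before "tag.style.comment.line", the line comment keeps the font and size from "comment" but takes its foreground from the more specific entry.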

One problem with this implementation is that many styles may have the same set of tags so have the same appearance. For multiple sets of identifiers (or keywords), each set could be differentiated by choosing a different hue either from a list or by adding a constant to the hue of the previous set (and wrapping around if needed).
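The hue-stepping idea could look something like this in Python, using the standard colorsys module. The step size of 0.15 and the `rotate_hues` name are arbitrary choices for illustration.

```python
import colorsys

def rotate_hues(start_rgb, count, step=0.15):
    """Return count '#RRGGBB' colours, each shifted in hue by step from the previous."""
    r, g, b = (int(start_rgb[i:i + 2], 16) / 255 for i in (1, 3, 5))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    colours = []
    for n in range(count):
        # Wrap the hue around so it stays in the range 0.0 to 1.0.
        rgb = colorsys.hls_to_rgb((h + n * step) % 1.0, l, s)
        colours.append("#%02X%02X%02X" % tuple(round(c * 255) for c in rgb))
    return colours

# One colour per keyword or identifier set, starting from the keyword colour.
keyword_sets = rotate_hues("#404080", 4)
```

Lightness and saturation are left alone, so the sets stay visually related while still being distinguishable.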

The tag styles use some global settings of scheme.fore/scheme.back to easily flip from common black on white to inverted white on black. It depends on a new setting ‘fade’ which combines the foreground colour with the background colour to fade inactive styles with fade:50 being half faded and fade:10 being almost invisible. Selection properties should also be changed to suit dark versus light schemes:

# Normal black on white
scheme.fore=#000000
scheme.back=#FFFFFF
selection.back=#000000
selection.alpha=32

# Inverted white on black
#~ scheme.fore=#FFFFFF
#~ scheme.back=#000000
#~ selection.back=#FFFFFF
#~ selection.alpha=64
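
For the ‘fade’ setting, a plain linear blend is one way to get the described behaviour. The Python sketch below assumes that the number is the percentage of the foreground that is kept, which fits fade:50 being half faded and fade:10 being almost invisible; the actual SciTE patch may compute it differently.

```python
def fade(fore, back, percent):
    """Blend fore toward back; percent is the share of the foreground colour kept."""
    def channels(colour):
        return [int(colour[i:i + 2], 16) for i in (1, 3, 5)]

    mixed = [(f * percent + b * (100 - percent) + 50) // 100  # rounded integer blend
             for f, b in zip(channels(fore), channels(back))]
    return "#%02X%02X%02X" % tuple(mixed)

half = fade("#000000", "#FFFFFF", 50)
faint = fade("#000000", "#FFFFFF", 10)
```

Since the blend uses the scheme background, the same fade:50 rule works for both the normal and the inverted scheme.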

A tag.styles.metadata property switches from SciTE’s normal styling to tag-based styling:
tag.styles.metadata.python=1
tag.styles.metadata.cpp=1
tag.styles.metadata.hypertext=1

The SciTE changes are attached with the most important being the SciTEBase::SetStyleBlock addition.

Neil
TagStyle.patch