Rainbow identifiers

Neil Hodgson

Jun 19, 2012, 1:32:32 AM

to scintilla...@googlegroups.com

A recurring request pattern for Scintilla has been to increase the number of keywords for particular lexers in order to highlight sets of identifiers in different colours. The keyword feature was initially designed for true language keywords and has been stretched beyond that design for sets of identifiers. There are sometimes multiple lexeme types that could benefit from multiple styles: for C++, identifiers and documentation comment keywords are candidates with preprocessor macro names a future possibility. For HTML there are lists of tags and attributes as well as identifiers for JavaScript, PHP and other server- and client-side scripts.

This produces conflicting needs since the valid range of style values has to be split up between these different lexeme types and different users will have different needs. I have rejected patches that add more identifier styles in order to preserve freedom to change here.

To allow more identifier styles, a pool of unallocated styles could be maintained with allocations performed on demand from the application. An allocation would extend an existing style with a set of new styles. Only existing styles that are coded to be extensible would be valid. An API extension to ILexer could look like this and be exposed as SCI_ALLOCATEIDENTIFIERSTYLES, ...

// Returns start of new allocation, -1 on failure
int AllocateIdentifierStyles(int styleBase, int numberStyles);
void SetIdentifiers(int style, const char *identifiers);
void FreeIdentifierStyles();

From the application, this may look like:

// Release any earlier allocations before reallocating
Call(SCI_FREEIDENTIFIERSTYLES);
int idents = 8;
int identStyleBase = Call(SCI_ALLOCATEIDENTIFIERSTYLES,
    SCE_C_IDENTIFIER, idents);
if (identStyleBase >= 0) {
    for (int i = 0; i < idents; i++) {
        Call(SCI_STYLESETFORE, identStyleBase + i, colourList[i]);
        // SetIdentifiers takes the style number as well as the word list
        Call(SCI_SETIDENTIFIERS, identStyleBase + i, identList[i]);
    }
}
int dcStyleBase = Call(SCI_ALLOCATEIDENTIFIERSTYLES,
    SCE_C_COMMENTDOCKEYWORD, 3);
// …

Over time, more styles are defined for each lexer, reducing the pool of styles available for identifiers. Applications should be prepared to handle failure of an SCI_ALLOCATEIDENTIFIERSTYLES call, possibly by merging less important sets of identifiers. The pool of styles may not be contiguous due to the fixed styles 32..39 and other factors. Another possibility is to define a fixed range of identifier style numbers per lexer, although it's likely this will just cause the current problem to recur with requests to expand the range.
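
As a rough illustration of the pool idea, here is a minimal sketch of an allocator that hands out contiguous runs of free style numbers and skips the fixed styles 32..39. The class name StylePool, its Reserve method and the style count passed to the constructor are assumptions made for this example; none of this is code from Scintilla.

```cpp
#include <vector>

// Hypothetical pool of style numbers, for illustration only.
class StylePool {
public:
    // styleCount is an assumption; the fixed styles 32..39 are never handed out.
    explicit StylePool(int styleCount) : used_(styleCount, false) {
        for (int s = 32; s <= 39 && s < styleCount; ++s) used_[s] = true;
    }

    // Marks a style that the lexer already defines as unavailable.
    void Reserve(int style) { used_.at(style) = true; }

    // Returns the start of a contiguous run of free styles, -1 on failure.
    int Allocate(int numberStyles) {
        if (numberStyles <= 0) return -1;
        int runStart = 0;
        int runLength = 0;
        for (int s = 0; s < static_cast<int>(used_.size()); ++s) {
            if (used_[s]) {
                // A used style breaks the current run of free styles.
                runLength = 0;
                runStart = s + 1;
            } else if (++runLength == numberStyles) {
                for (int i = runStart; i <= s; ++i) used_[i] = true;
                return runStart;
            }
        }
        return -1;
    }

private:
    std::vector<bool> used_;
};
```

With 64 styles and styles 0..19 reserved for the lexer, a request for 8 styles fits below the fixed range, a second request for 8 has to jump past 39, and a request for 20 fails. That failure is the case where an application would fall back to merging less important identifier sets.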

The C++ lexer duplicates each style to allow different styling of active code and code that is inactive due to preprocessor directives. The inactive style is defined by adding 64 to the active style. Adding a new identifier style for C++ will require allocating both an active and an inactive style and, for simplicity, these should be 64 apart. A single call to AllocateIdentifierStyles will allocate both active and inactive styles, and the set of identifiers used for an active identifier style will also be used for the corresponding inactive identifier style. Since an application may not know that a lexer supports active/inactive (or other similar features), another API should be provided to return the distance, or -1 if there are no secondary styles.

int DistanceToSecondaryStyles();
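
An application could use the returned distance to style the secondary (inactive) copy of each allocated style alongside the active one. The sketch below assumes a stand-in FakeEditor that only records foreground colours, and a Dull helper that halves each colour channel. Both are hypothetical and exist only to make the example self-contained.

```cpp
#include <map>

// Hypothetical stand-in for the editor: records the foreground colour per style.
struct FakeEditor {
    int distanceToSecondary;   // what the proposed DistanceToSecondaryStyles() would return
    std::map<int, int> fore;   // style number -> colour, 8 bits per channel
};

// Halves each 8-bit colour channel to get a duller colour for inactive code.
int Dull(int colour) { return (colour >> 1) & 0x7F7F7F; }

// Sets the colour of a style and, if the lexer has secondary styles,
// a duller colour for the matching secondary style.
void SetIdentifierColour(FakeEditor &ed, int style, int colour) {
    ed.fore[style] = colour;
    if (ed.distanceToSecondary >= 0) {
        ed.fore[style + ed.distanceToSecondary] = Dull(colour);
    }
}
```

For a lexer reporting a distance of 64, setting style 40 also sets style 104. For a lexer reporting -1, only the requested style is touched.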

From the point of view of lexers, there will be new support class(es) to allocate identifier styles and map identifiers to style numbers. Something like
sc.ChangeState(classifier->classify(SCE_C_IDENTIFIER, ident)|activitySet);
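
One possible shape for such a support class is sketched below. The name IdentifierClassifier, the explicit baseStyle argument to SetIdentifiers and the whitespace-separated word list format are assumptions for this example, not a description of any existing Scintilla class.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical classifier mapping identifiers to allocated style numbers.
class IdentifierClassifier {
public:
    // Associates every whitespace-separated word in identifiers with style,
    // where style was allocated as an extension of baseStyle.
    void SetIdentifiers(int baseStyle, int style, const std::string &identifiers) {
        std::istringstream in(identifiers);
        std::string word;
        while (in >> word) words_[baseStyle][word] = style;
    }

    // Returns the allocated style for ident, or baseStyle when ident is not listed.
    int Classify(int baseStyle, const std::string &ident) const {
        const auto base = words_.find(baseStyle);
        if (base == words_.end()) return baseStyle;
        const auto word = base->second.find(ident);
        return word == base->second.end() ? baseStyle : word->second;
    }

private:
    std::map<int, std::map<std::string, int>> words_;
};
```

The result can be combined with an activity bit in the way the line above suggests: if "map" is classified as style 40 and the inactive offset is 64, then `Classify(11, "map") | 0x40` gives 104.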

Currently this is just planning - I haven't written any code although it appears quite easy. Since it will require adding APIs to the externally visible ILexer interface and I don't want to have many different versions of this interface in use, it may be delayed until other changes to ILexer are finished. Unicode line end support may also require additions to ILexer.

Neil

Mike Lischke

Jun 19, 2012, 2:59:20 AM

to scintilla...@googlegroups.com

Neil,

> From the application, this may look like:
>
> // Release any earlier allocations before reallocating
> Call(SCI_FREEIDENTIFIERSTYLES);
> int idents = 8;
> int identStyleBase = Call(SCI_ALLOCATEIDENTIFIERSTYLES,
>     SCE_C_IDENTIFIER, idents);
> if (identStyleBase >= 0) {
>     for (int i = 0; i < idents; i++) {
>         Call(SCI_STYLESETFORE, identStyleBase + i, colourList[i]);
>         // SetIdentifiers takes the style number as well as the word list
>         Call(SCI_SETIDENTIFIERS, identStyleBase + i, identList[i]);
>     }
> }
> int dcStyleBase = Call(SCI_ALLOCATEIDENTIFIERSTYLES,
>     SCE_C_COMMENTDOCKEYWORD, 3);
> // …
>

How would the lexers know about these new styles? I have a similar need in MySQL where part of the text can be wrapped with a "conditional multi line comment". So I also duplicated most styles to allow coloring the text properly. However, what I was rather looking for was kinda style inheritance or style stack. Say, you start with a multi line comment which changes the fore- and background. Then the lexer enters the special state for a conditional comment and from now on part of the styles of the text overrides the already set style (usually the foreground color and font styles). Once the current token is done the lexer returns to the previous style (the multi comment style) automatically. This way you wouldn't need additional styles, but just "combine" two existing styles somehow.
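
A style stack of this kind could be sketched roughly as follows, assuming each layer either sets an attribute or inherits it from the layer beneath. The StyleLayer struct, the kInherit marker and the Resolve function are hypothetical names for this illustration, and nothing like this exists in Scintilla.

```cpp
#include <vector>

// Marker meaning "take this attribute from the layer beneath".
const int kInherit = -1;

// Hypothetical style layer: each field holds a value or kInherit.
struct StyleLayer {
    int fore;   // foreground colour
    int back;   // background colour
    int bold;   // 0 or 1
};

// Resolves a stack of layers, bottom first, on top of fully specified defaults.
StyleLayer Resolve(const std::vector<StyleLayer> &stack, StyleLayer result) {
    for (const StyleLayer &layer : stack) {
        if (layer.fore != kInherit) result.fore = layer.fore;
        if (layer.back != kInherit) result.back = layer.back;
        if (layer.bold != kInherit) result.bold = layer.bold;
    }
    return result;
}
```

Pushing a comment layer that sets both colours and then a token layer that sets only foreground and bold yields the token's foreground over the comment's background. Popping the token layer restores the comment style without needing a separate duplicated style.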

Mike
--
www.soft-gems.net

Lex Trotman

Jun 19, 2012, 3:05:21 AM

to scintilla...@googlegroups.com

Hi Neil,

On 19 June 2012 15:32, Neil Hodgson <nyama...@me.com> wrote:
> A recurring request pattern for Scintilla has been to increase the number of keywords for particular lexers in order to highlight sets of identifiers in different colours. The keyword feature was initially designed for true language keywords and has been stretched beyond that design for sets of identifiers. There are sometimes multiple lexeme types that could benefit from multiple styles: for C++, identifiers and documentation comment keywords are candidates with preprocessor macro names a future possibility. For HTML there are lists of tags and attributes as well as identifiers for JavaScript, PHP and other server- and client-side scripts.
>
> This produces conflicting needs since the valid range of style values has to be split up between these different lexeme types and different users will have different needs. I have rejected patches that add more identifier styles in order to preserve freedom to change here.
>
> To allow more identifier styles, a pool of unallocated styles could be maintained with allocations performed on demand from the application. An allocation would extend an existing style with a set of new styles. Only existing styles that are coded to be extensible would be valid. An API extension to ILexer could look like this and be exposed as SCI_ALLOCATEIDENTIFIERSTYLES, ...

Since this is a general feature, why use a name that implies only one
of the uses? You already note documentation comment keywords below,
subsets of operators also come to mind. Perhaps something more
general like SCI_ALLOCATESUBSTYLES since that is what they are really
doing, allocating styles for subsets of the specified style (where the
style is being used as an alias for a token class I guess).

>
> // Returns start of new allocation, -1 on failure
> int AllocateIdentifierStyles(int styleBase, int numberStyles);
> void SetIdentifiers(int style, const char *identifiers);
> void FreeIdentifierStyles();
>

Clearly a lexer could do anything it wanted (was coded to do) with
these new styles, not just a new "keyword" list. Perhaps
AllocateSubStyles, SetValues (which is what it is doing) and
FreeSubStyles.

> From the application, this may look like:
>
> // Release any earlier allocations before reallocating
> Call(SCI_FREEIDENTIFIERSTYLES);
> int idents = 8;
> int identStyleBase = Call(SCI_ALLOCATEIDENTIFIERSTYLES,
>     SCE_C_IDENTIFIER, idents);
> if (identStyleBase >= 0) {
>     for (int i = 0; i < idents; i++) {
>         Call(SCI_STYLESETFORE, identStyleBase + i, colourList[i]);
>         // SetIdentifiers takes the style number as well as the word list
>         Call(SCI_SETIDENTIFIERS, identStyleBase + i, identList[i]);
>     }
> }
> int dcStyleBase = Call(SCI_ALLOCATEIDENTIFIERSTYLES,
>     SCE_C_COMMENTDOCKEYWORD, 3);
> // …
>
> Over time, more styles are defined for each lexer, reducing the pool of styles available for identifiers. Applications should be prepared to handle failure of an SCI_ALLOCATEIDENTIFIERSTYLES call, possibly by merging less important sets of identifiers. The pool of styles may not be contiguous due to the fixed styles 32..39 and other factors.

Although your example above implies that a contiguous range is
allocated, is that the intention? Perhaps a contiguous set of as many
as possible could be allocated each call and the number be the return
value. Then the application can have another go until it has as many
as it wants.

> Another possibility is to define a fixed range of identifier style numbers per lexer, although it's likely this will just cause the current problem to recur with requests to expand the range.

As you say, it's not future-proof.

>
> The C++ lexer duplicates each style to allow different styling of active code and code that is inactive due to preprocessor directives. The inactive style is defined by adding 64 to the active style. Adding a new identifier style for C++ will require allocating both an active and an inactive style and, for simplicity, these should be 64 apart. A single call to AllocateIdentifierStyles will allocate both active and inactive styles, and the set of identifiers used for an active identifier style will also be used for the corresponding inactive identifier style. Since an application may not know that a lexer supports active/inactive (or other similar features), another API should be provided to return the distance, or -1 if there are no secondary styles.
>
> int DistanceToSecondaryStyles();
>
> From the point of view of lexers, there will be new support class(es) to allocate identifier styles and map identifiers to style numbers. Something like
> sc.ChangeState(classifier->classify(SCE_C_IDENTIFIER, ident)|activitySet);
>
> Currently this is just planning - I haven't written any code although it appears quite easy. Since it will require adding APIs to the externally visible ILexer interface and I don't want to have many different versions of this interface in use, it may be delayed until other changes to ILexer are finished. Unicode line end support may also require additions to ILexer.
>
> Neil
>

Cheers
Lex

Neil Hodgson

Jun 19, 2012, 10:23:14 PM

to scintilla...@googlegroups.com

Lex Trotman:

> Since this is a general feature, why use a name that implies only one
> of the uses? You already note documentation comment keywords below,
> subsets of operators also come to mind. Perhaps something more
> general like SCI_ALLOCATESUBSTYLES since that is what they are really
> doing, allocating styles for subsets of the specified style (where the
> style is being used as an alias for a token class I guess).

I wanted the name to attract the attention of people wanting to implement coloured identifiers since that will be the main use. It doesn't matter much to me - substyles may be OK too.

> Although your example above implies that a contiguous range is
> allocated, is that the intention?

Yes, since that makes it simpler for the application.

Neil

Neil Hodgson

Jun 19, 2012, 10:23:39 PM

to scintilla...@googlegroups.com

Mike Lischke:

> How would the lexers know about these new styles?

The lexer implements ILexer and receives the AllocateIdentifierStyles call. It doesn't know about the visual representation of the styles.

> I have a similar need in MySQL where part of the text can be wrapped with a "conditional multi line comment". So I also duplicated most styles to allow coloring the text properly. However, what I was rather looking for was kinda style inheritance or style stack.

SinkWorld examined these sorts of issues: having a tree of lexers with a corresponding tree of styles. Full generality becomes complex but it is a worthwhile area to explore. I expect that someone will eventually write a good library implementing this and that it will be a better choice than Scintilla.

> Say, you start with a multi line comment which changes the fore- and background. Then the lexer enters the special state for a conditional comment and from now on part of the styles of the text overrides the already set style (usually the foreground color and font styles). Once the current token is done the lexer returns to the previous style (the multi comment style) automatically. This way you wouldn't need additional styles, but just "combine" two existing styles somehow.

The style tree ends up containing modifiers as well as setters: make 20% less bold and dull the foreground colours 50% might be how you define inactive text. With SinkWorld, I couldn't find a good way to implement this at the code level, even less for setting by users although CSS may be a good starting point.

Neil

Mike Lischke

Jun 20, 2012, 2:57:01 AM

to scintilla...@googlegroups.com

>> Say, you start with a multi line comment which changes the fore- and background. Then the lexer enters the special state for a conditional comment and from now on part of the styles of the text overrides the already set style (usually the foreground color and font styles). Once the current token is done the lexer returns to the previous style (the multi comment style) automatically. This way you wouldn't need additional styles, but just "combine" two existing styles somehow.
>
> The style tree ends up containing modifiers as well as setters: make 20% less bold and dull the foreground colours 50% might be how you define inactive text. With SinkWorld, I couldn't find a good way to implement this at the code level, even less for setting by users although CSS may be a good starting point.

I wouldn't go that far. A simple style stack plus a clever style merge definition (as simple as possible) would do the job. But others may have different requirements.

Mike
--
www.soft-gems.net

Neil Hodgson

Aug 2, 2012, 9:33:31 AM

to scintilla...@googlegroups.com

An initial version of sub styles has been implemented and a patch is attached to this message. It is fairly similar to earlier discussion. There is a corresponding patch to SciTE to illustrate one way to use the feature. Some extra calls were added so that the application doesn't have to remember how it has configured sub styles.

Only 'object' lexers can have sub styles and they have to implement several new methods to do so. In the patch, the C++ lexer is extended to allow sub styles to be added for identifiers (SCE_C_IDENTIFIER=11) and doc-comment keywords (SCE_C_COMMENTDOCKEYWORD=17). Classes SubStyles and WordClassifier were added to help lexers implement sub styles.

The changes to SciTE are at an early stage so may or may not be published. They start by specifying the number of sub styles to attach to a base style with substyles.<lexer>.<baseStyle>=<numberOfSubStyles> - for example: substyles.cpp.11=2 adds 2 sub styles to identifiers. Then the set of identifiers and the visual style for each sub style are defined. Since the style numbers allocated may change from run to run, they are referred to by using the base style number and the sub style within that, separated by '.', so the first sub style of C++ identifiers is 11.1. To show some Scintilla class names in purple, some C++ standard library identifiers in pink and to add a doc comment keyword @random displayed in cyan, the settings look like this:

substyles.cpp.11=2
substylewords.11.1.$(file.patterns.cpp)=CharacterSet LexAccessor SString WordList
substylewords.11.2.$(file.patterns.cpp)=std map string vector
style.cpp.11.1=fore:#AA00EE
style.cpp.11.2=fore:#EE00AA
style.cpp.75.1=$(style.cpp.75),fore:#663388
style.cpp.75.2=$(style.cpp.75),fore:#883366

substyles.cpp.17=1
style.cpp.17.1=$(style.cpp.17),fore:#00AAEE
substylewords.17.1.$(file.patterns.cpp)=random
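
To illustrate how a reference such as 11.1 might be turned into a concrete style number once allocation has happened, here is a minimal sketch. The Allocation struct, the ResolveSubStyle function and the example start values are assumptions for this illustration and are not taken from the SciTE patch.

```cpp
#include <exception>
#include <map>
#include <string>

// Hypothetical record of one allocation: first style number and count.
struct Allocation {
    int start;
    int length;
};

// Resolves a "base.sub" reference such as "11.1" to a style number.
// Returns -1 when the reference is malformed or out of range.
int ResolveSubStyle(const std::string &ref,
                    const std::map<int, Allocation> &allocations) {
    const std::string::size_type dot = ref.find('.');
    if (dot == std::string::npos) return -1;
    int base = 0;
    int sub = 0;
    try {
        base = std::stoi(ref.substr(0, dot));
        sub = std::stoi(ref.substr(dot + 1));
    } catch (const std::exception &) {
        return -1;   // not a pair of numbers
    }
    const std::map<int, Allocation>::const_iterator it = allocations.find(base);
    if (it == allocations.end()) return -1;
    if (sub < 1 || sub > it->second.length) return -1;
    return it->second.start + sub - 1;   // sub styles are numbered from 1
}
```

If the two identifier sub styles happened to be allocated starting at 40, then 11.1 would resolve to 40 and 11.2 to 41. The 75.x entries in the settings above presumably refer to the inactive copies (75 = 11 + 64), which this sketch does not handle.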

To avoid changing the ILexer interface twice, this feature will not be committed until any changes to ILexer for Unicode line ends are also complete.