Is it possible to 'extend' an existing lexer

Bostjan Mihoric

Mar 2, 2017, 11:58:41 AM
to scintilla-interest
Hello,

I'm writing a plugin for Notepad++, which uses Scintilla. I'm assuming they use Scintilla's lexers for various languages, including C.

What I would like is to add support for dynamically decided keywords. More precisely, words which are type identifiers, so that they would be displayed with a different style. Of course, they cannot be known in advance, and can change between file saves (it will require lexing every time).

I do C parsing in my plugin and have all identifier info available. What I'd like is something like lexer asking me about whether some word is a type keyword, and I would give it an answer. Other than that, it would do all other lexing just as usual.

If possible, I'd like to avoid writing my own lexer, as I'm unsure how complicated that is (and for other reasons, like Notepad++ having its own settings for lexers). I can communicate with Scintilla directly, and can also subclass it if necessary, intercepting all message communication (Windows mechanism).

Is it possible to extend an existing lexer in such a way?

Thanks!

Bostjan Mihoric

Mar 2, 2017, 4:45:23 PM
to scintilla-interest
Figured it out.

What I did was set the Scintilla lexer to container mode.
Then, when processing the SCN_STYLENEEDED notification, I switch back to the original lexer, perform lexing, and switch back to container mode.
Then I get the styled text, do my own lexing, override the styles on identifiers, and push the changes.
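
A minimal sketch of the override step only (take the base lexer's style bytes, restyle identifiers that the parser knows to be types). It is plain C++ with no Scintilla calls; `OverrideTypeStyles`, the `isTypeName` callback and the style number 40 are hypothetical names and values chosen for illustration, and 11 is assumed to match `SCE_C_IDENTIFIER` in SciLexer.h.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Assumed to match SCE_C_IDENTIFIER in SciLexer.h; check your header.
constexpr unsigned char kIdentifierStyle = 11;
// Hypothetical style number picked for type names; any unused style would do.
constexpr unsigned char kTypeStyle = 40;

// `styles` holds one style byte per byte of `text`, as produced by the base
// lexer. Every run styled as an identifier is looked up through `isTypeName`
// (the plugin's parser) and restyled when it turns out to be a type.
void OverrideTypeStyles(const std::string &text,
                        std::vector<unsigned char> &styles,
                        const std::function<bool(const std::string &)> &isTypeName) {
    assert(text.size() == styles.size());
    std::size_t i = 0;
    while (i < text.size()) {
        if (styles[i] != kIdentifierStyle) {
            ++i;
            continue;
        }
        // Find the end of this identifier-styled run.
        std::size_t end = i;
        while (end < text.size() && styles[end] == kIdentifierStyle) {
            ++end;
        }
        if (isTypeName(text.substr(i, end - i))) {
            for (std::size_t k = i; k < end; ++k) {
                styles[k] = kTypeStyle;
            }
        }
        i = end;
    }
}
```

The modified bytes would then be pushed back to the editor, for instance with SCI_STARTSTYLING followed by SCI_SETSTYLINGEX over the same range.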

The only thing I don't understand is, why does SCI_CLEARDOCUMENTSTYLE clear styles but generate no new lexing requests?
The text remains unstyled afterwards...?

Neil Hodgson

Mar 2, 2017, 6:17:39 PM
to Scintilla mailing list
Bostjan Mihoric:

> I do C parsing in my plugin and have all identifier info available. What I'd like is something like lexer asking me about whether some word is a type keyword, and I would give it an answer. Other than that, it would do all other lexing just as usual.

If the plugin can fully parse C then it should be able to take over the role of providing lexical data over the file.

> If possible, I'd like to avoid writing my own lexer, as I'm unsure how complicated that is (and for other reasons, like Notepad++ having its own settings for lexers). I can communicate with Scintilla directly, and can also subclass it if necessary, intercepting all message communication (Windows mechanism).
>
> Is it possible to extend an existing lexer in such a way?

It is difficult to extend lexers in this way and there can be unexpected patterns of reentrancy that can upset implementations. For example, fold discovery can lead to nested lexing to find the end of a fold structure.

This is layering lexers where later lexers process the result of earlier lexers. Other examples include recognising URLs in the text or spelling mistakes. Scintilla does not have good support for this. Scintilla could be extended to change the current single ‘endStyled’ field into a list of positions, one for each lexer in a stack. Then, when styled text is needed, each of the lexers is called to advance its ‘endStyled’ up to the limit from earlier lexers.
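
The stack of 'endStyled' positions described above can be sketched as follows. This is not an existing Scintilla API; `LexerStack`, `Layer`, `EnsureStyled` and `Invalidate` are hypothetical names, and each layer is modelled as a callback that styles a range and reports how far it got.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// A layer styles the half-open range [from, to) and returns the position it
// actually reached (a slow checker may stop early).
using StyleFn = std::function<std::size_t(std::size_t from, std::size_t to)>;

class LexerStack {
public:
    void Add(StyleFn fn) {
        Layer layer;
        layer.style = std::move(fn);
        layer.endStyled = 0;
        layers_.push_back(std::move(layer));
    }

    // Ask every layer to style up to `needed`. A later layer may only advance
    // as far as all earlier layers have already reached.
    void EnsureStyled(std::size_t needed) {
        std::size_t limit = needed;
        for (Layer &layer : layers_) {
            if (layer.endStyled < limit) {
                const std::size_t reached = layer.style(layer.endStyled, limit);
                // Clamp so a misbehaving layer cannot move backwards or overshoot.
                layer.endStyled = std::min(std::max(reached, layer.endStyled), limit);
            }
            limit = std::min(limit, layer.endStyled);
        }
    }

    // A text change at `pos` makes all styling from `pos` onwards stale.
    void Invalidate(std::size_t pos) {
        for (Layer &layer : layers_) {
            layer.endStyled = std::min(layer.endStyled, pos);
        }
    }

    std::size_t EndStyled(std::size_t index) const {
        return layers_.at(index).endStyled;
    }

private:
    struct Layer {
        StyleFn style;
        std::size_t endStyled;
    };
    std::vector<Layer> layers_;
};
```

In this model the base lexer would be added first and an identifier or URL pass second, so the second pass never reads style bytes that the base lexer has not yet produced.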

While Scintilla does not provide support for this, it can be implemented in a container by watching for modification and style modification events.

An issue is how the new styling is added to the style state since the base lexer will not understand any style values introduced by the later lexers. Using indicators can avoid changing style values and is a good approach for URLs and similar. Indicators can also change the foreground colour of text but they cannot change font, bold, or italics.
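
A sketch of the indicator approach, assuming a Scintilla build that has the INDIC_TEXTFORE indicator style. The message numbers are copied from Scintilla.h as best recalled and should be checked against the real header; `SendFn`, `Span`, `SetupTypeIndicator` and `MarkTypes` are hypothetical names, with `SendFn` standing in for whatever the plugin uses to message the editor (for example a wrapper around SendMessage).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Values assumed from Scintilla.h; a real plugin should include that header.
constexpr unsigned int kIndicSetStyle = 2080;        // SCI_INDICSETSTYLE
constexpr unsigned int kIndicSetFore = 2082;         // SCI_INDICSETFORE
constexpr unsigned int kSetIndicatorCurrent = 2500;  // SCI_SETINDICATORCURRENT
constexpr unsigned int kIndicatorFillRange = 2504;   // SCI_INDICATORFILLRANGE
constexpr unsigned int kIndicatorClearRange = 2505;  // SCI_INDICATORCLEARRANGE
constexpr int kIndicTextFore = 17;                   // INDIC_TEXTFORE

// Hypothetical stand-in for sending a message to the Scintilla window.
using SendFn = std::function<std::intptr_t(unsigned int msg,
                                           std::uintptr_t wParam,
                                           std::intptr_t lParam)>;

struct Span {
    std::intptr_t start;
    std::intptr_t length;
};

// Configure one indicator to recolour text. Colours are 0xBBGGRR.
void SetupTypeIndicator(const SendFn &send, int indicator, int bgrColour) {
    send(kIndicSetStyle, static_cast<std::uintptr_t>(indicator), kIndicTextFore);
    send(kIndicSetFore, static_cast<std::uintptr_t>(indicator), bgrColour);
}

// Clear the indicator over a range, then mark every span found by the parser.
void MarkTypes(const SendFn &send, int indicator,
               std::intptr_t rangeStart, std::intptr_t rangeLength,
               const std::vector<Span> &typeSpans) {
    send(kSetIndicatorCurrent, static_cast<std::uintptr_t>(indicator), 0);
    send(kIndicatorClearRange, static_cast<std::uintptr_t>(rangeStart), rangeLength);
    for (const Span &span : typeSpans) {
        send(kIndicatorFillRange, static_cast<std::uintptr_t>(span.start), span.length);
    }
}
```

Because style bytes are never touched, the base lexer keeps working unchanged. Picking an indicator number from the range reserved for containers (8 and up) should avoid clashing with indicators used by lexers.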

> The only thing I don't understand is, why does SCI_CLEARDOCUMENTSTYLE clear styles but generate no new lexing requests?
> Text remains unstyled afterwards…?

It zeroes all the style bytes. The application may want to use a non-standard approach to styling at this point. If you want to start styling from some position, call SCI_STARTSTYLING.

Neil

Bostjan Mihoric

Mar 3, 2017, 2:52:53 AM
to scintilla-interest, nyama...@me.com
Neil Hodgson wrote:

> An issue is how the new styling is added to the style state since the base lexer will not understand any style values introduced by the later lexers. Using indicators can avoid changing style values and is a good approach for URLs and similar. Indicators can also change the foreground colour of text but they cannot change font, bold, or italics.

Thanks for the idea, I might switch to indicators instead.
 

> The only thing I don't understand is, why does SCI_CLEARDOCUMENTSTYLE clear styles but generate no new lexing requests?
> Text remains unstyled afterwards…?

> It zeroes all the style bytes. The application may want to use a non-standard approach to styling at this point. If you want to start styling from some position, call SCI_STARTSTYLING.


My use case: when new identifiers are parsed, I wanted to refresh the styles. For example, a word that was not recognized as a type before is now a type, and we want to display it with its assigned style.

I like that Scintilla only asks for restyling of the visible part of the screen, not the whole buffer. This is what I wanted to leverage by resetting styles: returning to the state where the whole buffer is considered unstyled and Scintilla asks again for styling of the visible parts.

I don't quite understand why I would call SCI_CLEARDOCUMENTSTYLE and then SCI_STARTSTYLING, because SCI_STARTSTYLING is what I use when handling SCN_STYLENEEDED, and SCN_STYLENEEDED is what I'd like Scintilla to start sending again, so I know what (minimal) range needs to be styled.

Bostjan Mihoric

Mar 6, 2017, 12:23:08 PM
to scintilla-interest, nyama...@me.com


> I do C parsing in my plugin and have all identifier info available. What I'd like is something like lexer asking me about whether some word is a type keyword, and I would give it an answer. Other than that, it would do all other lexing just as usual.

> If the plugin can fully parse C then it should be able to take over the role of providing lexical data over the file.


Only if we are talking about the tokenizing step. I was referring to style-classifying identifier types, which requires parsing. Parsing, however, cannot perform the job of Scintilla's lexers I believe.

Compiler parsing is done on a preprocessed source, so one cannot utilize it for styling the original source. At least not without superhuman effort, much more than writing one's own lexers.

My plugin uses a different approach with which it works on non-preprocessed source. However, during parsing it cannot recognize identifiers that don't have a declaration/definition in current file. So this approach cannot be used for styling the source either.

Then, there's the speed issue. A lexer must give the user an instantaneous experience. A parser is much slower, and it isn't easy to write a parser so as to be able to begin parsing at any line. It's usually whole file or nothing.

So Scintilla is very correct to offer lexers, as nothing can really replace them.

Fortunately, I was able to utilize indicators instead, although they have limitations compared to styles.

Regards,
B

Neil Hodgson

Mar 6, 2017, 6:44:03 PM
to scintilla-interest
Bostjan:

> Then, there's the speed issue. A lexer must give user instantaneous experience. A parser is much slower, and it isn't easy to write a parser so as to be able to begin parsing at any line. It's usually whole file or nothing.

Speed is a crucial element of layered lexing as checkers are often slow, external, whole file operations. It should minimize distracting flashing and temporarily incorrect highlights. Abandoning the tool’s run after the portion of the file visible on screen may be worthwhile. With spell-checking, which can normally be restarted at line starts, the checking may be optimised by prioritizing the visible section and nearby text (for quick reaction to scrolls) before checking the rest of the file.
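
One way to order the work as described (visible section first, then nearby text, then the rest) is sketched below. It only computes line ranges; `CheckOrder`, `LineRange` and the `margin` parameter are hypothetical, and the line numbers are assumed to be document lines already converted from display lines.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Half-open range of document lines [begin, end).
struct LineRange {
    int begin;
    int end;
};

// Returns the ranges in the order they should be checked: the visible lines,
// then `margin` lines below and above them, then everything else.
std::vector<LineRange> CheckOrder(int totalLines, int firstVisible,
                                  int linesOnScreen, int margin) {
    assert(totalLines >= 0 && firstVisible >= 0 && linesOnScreen >= 0 && margin >= 0);
    const int visBegin = std::min(firstVisible, totalLines);
    const int visEnd = std::min(visBegin + linesOnScreen, totalLines);
    const int nearBegin = std::max(visBegin - margin, 0);
    const int nearEnd = std::min(visEnd + margin, totalLines);

    std::vector<LineRange> order;
    auto add = [&order](int begin, int end) {
        if (begin < end) {  // skip empty ranges
            order.push_back(LineRange{begin, end});
        }
    };
    add(visBegin, visEnd);     // what the user is looking at
    add(visEnd, nearEnd);      // just below, for scrolling down
    add(nearBegin, visBegin);  // just above, for scrolling up
    add(nearEnd, totalLines);  // remainder below
    add(0, nearBegin);         // remainder above
    return order;
}
```

A checker that can be abandoned part-way would process these ranges in order and stop when the document changes or the view scrolls.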

There may be additional Scintilla APIs that can help with these tasks but different tools have sufficiently different performance characteristics that a single approach is unlikely to be universally applicable.

> Fortunately, I was able to utilize indicators instead, although they have limitations compared to styles.

The limitations of indicators are actually a help in some ways. If indicators were able to change font and thus positions, changes to indicators would require much more effort. With wrap enabled, wider or thinner fonts can even cause more or fewer screen lines to be needed leading to text moving up and down.

Perhaps there could be two classes of indicator: layout preserving and layout changing and some applications may choose to accept the costs of layout changing indicators.

Neil

Paul K

Mar 6, 2017, 11:29:53 PM
to scintilla-interest
Hi Bostjan,

> If possible, I'd like to avoid writing my own lexer, as I'm unsure how complicated that is (and for other reasons, like Notepad++ having its own settings for lexers). I can communicate with Scintilla directly, and can also subclass it if necessary, intercepting all message communication (Windows mechanism).
> Is it possible to extend an existing lexer in such a way?

In addition to those already suggested options, you may want to check scintillua (https://foicica.com/scintillua/) as it's a module that can be installed as a scintilla lexer, but allows writing its own lexers in Lua (and comes with 90+ lexers already defined). It also supports embedded lexers and handles keywords, folding, and some of the other features that Scintilla lexers support. I think its only dependence is lua/lpeg and they can all be compiled into one dll (I'm using it in the environment where lua and lpeg are already available, so not sure about single dll configuration).

Paul.

Bostjan Mihoric

Mar 7, 2017, 11:07:46 AM
to scintilla-interest, nyama...@me.com


> Perhaps there could be two classes of indicator: layout preserving and layout changing and some applications may choose to accept the costs of layout changing indicators.



An interesting idea, but I have to admit that you are right about indicators: it would be bad if indicators actually changed font, made text bold, etc. Because their purpose is to point out something without moving text around. For my plugin that would also be unacceptable, because I apply indicators with a slight delay and indeed it would be bad if this moved anything.

Bostjan Mihoric

Mar 7, 2017, 11:36:42 AM
to scintilla-interest


> In addition to those already suggested options, you may want to check scintillua (https://foicica.com/scintillua/) as it's a module that can be installed as a scintilla lexer, but allows writing its own lexers in Lua (and comes with 90+ lexers already defined). It also supports embedded lexers and handles keywords, folding, and some of the other features that Scintilla lexers support. I think its only dependence is lua/lpeg and they can all be compiled into one dll (I'm using it in the environment where lua and lpeg are already available, so not sure about single dll configuration).


Thanks Paul, I'll take a look. However I need to mention that the problem in reality isn't me writing a lexer. That would actually be somewhat simple, on the level of copypasting and modifying my plugin's tokenizer plus looking up identifiers in my parsing outputs.

The actual problem is that Notepad++ is using lexers and there's no way for me to just add a little extra one at the end, or maybe override C++ lexer's group by telling it I'll provide keywords dynamically.

I'd have to look up Notepad++ code (everything it does with lexers, when it does it, how it does it), probably parse Notepad++ configs for its keywords and styles, etc. Like, reimplement its whole lexing stack and then fully override it in the plugin. That's what I am really happy to avoid, so indicators are good enough for now.

Especially since I then went overboard and added 12 indicator styles (for every identifier type), and I'm not even sure you can have that many groups in a lexer. Notepad++ allows 8 keyword groups in a user-defined language, so... yeah.

Finally, I'm not even sure a lexer would be a better solution. Every time identifiers are refreshed, all styles should be wiped. I'm working with C files over 250KB, which already exhibited human-noticeable delays when lexed fully with every word looked up among the identifiers. Such delays are ultimately unacceptable. What I do currently seems better (refreshing the current view on a timer of about 100ms; the user will be able to change this interval).

Regards,
B

Neil Hodgson

Mar 7, 2017, 10:34:05 PM
to scintilla-interest
Bostjan:

> Especially since I then went overboard and added 12 indicator styles (for every identifier type), and I'm not even sure you can have that many groups in a lexer. Notepad++ allows 8 keyword groups in a user-defined language, so... yeah.

There is a newer keyword mechanism called substyles which is supported by the C++ lexer. This allows the application to decide how many keyword groups (maximum 64 for C++) to define for classifying identifiers and comment keywords. SciTE supports substyles for Python identifiers but it's currently commented out for C++ in its properties file. I don’t know whether Notepad++ enables substyles or plans to.

http://www.scintilla.org/ScintillaDoc.html#Substyles
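
A sketch of how a container might allocate one substyle of the C++ identifier style and give it a dynamic word list, using the SCI_ALLOCATESUBSTYLES and SCI_SETIDENTIFIERS messages from the documentation linked above. The numeric values are assumed from Scintilla.h and SciLexer.h and should be verified; `SendFn` and `DefineTypeSubStyle` are hypothetical names, and treating a negative return as failure is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Values assumed from Scintilla.h / SciLexer.h; include the real headers in a plugin.
constexpr unsigned int kStyleSetFore = 2051;       // SCI_STYLESETFORE
constexpr unsigned int kAllocateSubStyles = 4020;  // SCI_ALLOCATESUBSTYLES
constexpr unsigned int kSetIdentifiers = 4024;     // SCI_SETIDENTIFIERS
constexpr int kCppIdentifier = 11;                 // SCE_C_IDENTIFIER

// Hypothetical stand-in for sending a message to the Scintilla window.
using SendFn = std::function<std::intptr_t(unsigned int msg,
                                           std::uintptr_t wParam,
                                           std::intptr_t lParam)>;

// Allocates one substyle of the identifier style, assigns it the given type
// names and a foreground colour (0xBBGGRR). Returns the substyle number, or
// -1 if allocation appears to have failed.
int DefineTypeSubStyle(const SendFn &send,
                       const std::vector<std::string> &typeNames,
                       int bgrColour) {
    const int subStyle = static_cast<int>(
        send(kAllocateSubStyles, static_cast<std::uintptr_t>(kCppIdentifier), 1));
    if (subStyle < 0) {
        return -1;
    }
    // The identifier list is passed as one space-separated string.
    std::string joined;
    for (const std::string &name : typeNames) {
        if (!joined.empty()) {
            joined += ' ';
        }
        joined += name;
    }
    send(kSetIdentifiers, static_cast<std::uintptr_t>(subStyle),
         reinterpret_cast<std::intptr_t>(joined.c_str()));
    send(kStyleSetFore, static_cast<std::uintptr_t>(subStyle), bgrColour);
    return subStyle;
}
```

When the parser finds new type names, only the SCI_SETIDENTIFIERS call would need repeating for the already allocated substyle; the document would probably also have to be re-lexed for the change to show. Whether this works inside Notepad++ depends on whether its C++ lexer has substyles enabled, which is left open above.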

Neil
