Writing lexers in Lua

Neil Hodgson

Feb 6, 2010, 2:13:07 AM
to scite-interest
There is now some experimental support for writing lexers in Lua.
The API is similar to the StyleContext class used in LexCPP although
the low level calls from Accessor are also available. It is very
likely that the API will change and it may still be 'experimental'
when included in a release of SciTE so that the API can be fixed after
more experience.

The documentation is at
http://www.scintilla.org/ScriptLexer.html

Iteration is by character rather than byte and styler.Current(),
styler.Next(), and styler.Previous() return strings containing all the
bytes in multi-byte characters. If the document is in UTF-8 with the
value "«" then the initial value of styler.Current() is "«" which is
the same as "\xc2\xab". If the document is in Latin-1 then "«" is
"\xab". This makes it easy to write lexers for a particular encoding
in that encoding as code can be written naturally like
styler.Match("«"). Lexers for languages that depend on characters
outside ASCII for syntax and that have to deal with multiple encodings
will be more complex. The API still uses byte positions for Position()
and other calls since it is costly to convert byte positions to
character positions and vice versa.
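
To make the byte/character distinction above concrete, here is a small plain-Lua sketch that runs outside SciTE and does not touch the styler at all. It spells the two encodings of "«" with decimal escapes, because "\x" hex escapes in string literals are only accepted from Lua 5.2 onwards. The names guillemet_utf8, guillemet_latin1 and hex_bytes are made up for this example.

```lua
-- "«" (U+00AB) written with decimal escapes, which every Lua version accepts.
local guillemet_utf8   = "\194\171"  -- bytes 0xC2 0xAB
local guillemet_latin1 = "\171"      -- byte 0xAB

-- Lua strings are byte strings, so # counts bytes, not characters.
print(#guillemet_utf8, #guillemet_latin1)

-- Hypothetical helper: hex dump of a string, two digits per byte.
local function hex_bytes(s)
  local out = {}
  for i = 1, #s do
    out[#out + 1] = string.format("%02x", s:byte(i))
  end
  return table.concat(out, " ")
end

print(hex_bytes(guillemet_utf8))    -- c2 ab
print(hex_bytes(guillemet_latin1))  -- ab
```

A lexer file saved in the same encoding as the documents it styles can simply contain the literal character, which is the point made above.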

Another change from previous lexers is that there is an imaginary
extra NUL ('\0') character at the end of the document when using
styler.More() .. styler.Forward(). This makes it easier to treat the
end of the document as if it was the end of a line which means the
normal code for determining that an identifier is a keyword will
trigger. This avoids the common lexer problem of keywords at the end
of the document not highlighting.
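
The effect of the imaginary NUL can be sketched without SciTE. The MockStyler below is a hypothetical stand-in written only for this illustration; it is not the real styler, it works byte by byte, and it implements just enough (More, Current, Forward, State, Token, ChangeState, SetState) to show a keyword at the very end of the text being recognised. It uses the colon call style that the thread settles on later.

```lua
-- Hypothetical stand-in for the styler object; NOT the real SciTE API.
-- It walks a string byte by byte and reports one extra "\0" at the end.
local function MockStyler(text)
  local pos, state, tokenStart = 1, 0, 1
  local styler = { styled = {} }     -- styled: list of {token, state} pairs
  function styler:More() return pos <= #text + 1 end   -- +1: imaginary NUL
  function styler:Current()
    if pos > #text then return "\0" end
    return text:sub(pos, pos)
  end
  function styler:Forward() pos = pos + 1 end
  function styler:State() return state end
  function styler:Token() return text:sub(tokenStart, pos - 1) end
  function styler:ChangeState(s) state = s end
  function styler:SetState(s)
    -- Record the finished token with its final state, then start a new one.
    if pos > tokenStart then
      self.styled[#self.styled + 1] = { self:Token(), state }
    end
    tokenStart, state = pos, s
  end
  return styler
end

local S_DEFAULT, S_IDENTIFIER, S_KEYWORD = 0, 1, 2
local keywords = { ["end"] = true }

local function Lex(text)
  local styler = MockStyler(text)
  while styler:More() do
    local isLetter = styler:Current():find("^%a$") ~= nil
    if styler:State() == S_IDENTIFIER and not isLetter then
      -- The NUL ends the last identifier just like a space or newline would.
      if keywords[styler:Token()] then styler:ChangeState(S_KEYWORD) end
      styler:SetState(S_DEFAULT)
    end
    if styler:State() == S_DEFAULT and isLetter then
      styler:SetState(S_IDENTIFIER)
    end
    styler:Forward()
  end
  return styler.styled
end

local result = Lex("x end")   -- no newline after the final keyword
for _, item in ipairs(result) do print(item[1], item[2]) end
```

Without the extra iteration for the NUL, the loop would stop while still inside the identifier state and "end" would never be checked against the keyword table.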

A script lexer is selected by using a lexer name that starts with
"script_". There can be multiple script lexer languages mentioned in
the properties files at once although there is only one OnStyle. All
script languages use the same numeric lexer ID SCLEX_CONTAINER. The
lexer name is available as a member of the styler object.

The implementation is not great due to my not understanding
Chapter 28 of "Programming in Lua" http://www.lua.org/pil/28.html.
Rather than producing a new Lua type, I made the styler a table and
then had the 'methods' use closures to access a C struct containing
the state. This means that the call is styler.More() rather than
styler:More() which would be more expected. I hope someone understands
this aspect of Lua better than me and can fix the code.

The current code has *not* been committed to CVS. It is available from

http://www.scintilla.org/scite.zip Source
http://www.scintilla.org/wscite.zip Windows executable

Neil

missdeer

Feb 6, 2010, 6:57:19 AM
to scite-interest
Sounds good.
Well, that's only syntactic sugar which pushes the styler table itself
as the first argument to the method; there is no other difference. You
just need to get one more argument from the Lua stack first in those C
closures.
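
The same point can be seen from the Lua side alone. This is a minimal sketch with made-up names (NewCounter, Bump, BumpMethod); it only illustrates the calling convention and says nothing about how the C side of SciTE is written.

```lua
-- A table whose "methods" are closures over private state, loosely modelled
-- on the approach described in the thread. All names here are hypothetical.
local function NewCounter()
  local count = 0                      -- state captured by the closures
  local counter = {}
  -- Dot style: the closure takes no arguments at all.
  counter.Bump = function()
    count = count + 1
    return count
  end
  -- Colon style: the table arrives as an extra first argument, which is
  -- simply ignored here because the state still lives in the closure.
  counter.BumpMethod = function(self)
    count = count + 1
    return count
  end
  return counter
end

local c = NewCounter()
print(c.Bump())          -- 1
print(c:BumpMethod())    -- 2
print(c.BumpMethod(c))   -- 3, exactly what the colon call expands to
```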


On Feb 6, 3:13 pm, Neil Hodgson <nyamaton...@gmail.com> wrote:
>    The implementation is not great due to my not understanding section
> Chapter 28 of "Programming in Lua" http://www.lua.org/pil/28.html.
> Rather than producing a new Lua type, I made the styler a table and
> then had the 'methods' use closures to access a C struct containing
> the state. This means that the call is styler.More() rather than
> styler:More() which would be more expected. I hope someone understands
> this aspect of Lua better than me and can fix the code.
>
>

>    Neil

Neil Hodgson

Feb 7, 2010, 4:40:16 PM
to scite-i...@googlegroups.com
missdeer:

> Well, that's only a syntactic sugar which will push styler table itself
> as the first argument for the method, no more differences. You just need
> getting one more argument first from the Lua stack in those C closures.

OK, uploaded a version that works like that along with updated
documentation. It does seem like added complexity and also a bit
wasteful to have both the closure pointing to the C struct as well as
the table.

Neil

Philippe Lhoste

Feb 10, 2010, 5:22:53 AM
to scite-i...@googlegroups.com
On 06/02/2010 08:13, Neil Hodgson wrote:
> There is now some experimental support for writing lexers in Lua.

Cool! That's something I have wished for for years and never found the time (or courage) to do.
Apologies to Mitchell, as I never found the time, either, to test his Scintillua, which uses an
interesting option too (LPeg).

> The API is similar to the StyleContext class used in LexCPP although
> the low level calls from Accessor are also available.

A low level approach. Perhaps harder than using LPeg (I wonder if it can be plugged in
this new framework) although learning Peg syntax isn't simple either. At least it will be
familiar to those writing C++ lexers...

Nicely done with good examples.

> The current code has *not* been committed to CVS. It is available from
>
> http://www.scintilla.org/scite.zip Source
> http://www.scintilla.org/wscite.zip Windows executable

Will play with that and report.

This is really great for making quick prototypes (if performance is poor, rewriting
in C++ might help, and the similar logic should ease the porting).
Another of the main use cases I see is lexing DSLs or special config files, or just
highlighting some parts of semi-structured files like log files, stack traces, etc.

--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --

mozers

Feb 10, 2010, 8:45:36 AM
to Philippe Lhoste
On 06/02/2010 08:13, Neil Hodgson wrote:
> There is now some experimental support for writing lexers in Lua.

Scintillua by Mitchell Foral is a very interesting project. I would be glad if its API were added to the official release, but this is unlikely to happen.
It requires creating a huge number of lexers, and those that exist now are only toys.
Very few people are engaged in its development.

ScriptLexer is simpler and poorer, but there is a chance that it will be in the official SciTE.
Until good Lua lexers are written, the C lexers can still be used. This is a good compromise and a smooth transition.
I hope that a lot of people will get involved in improving ScriptLexer.

Questions to Neil:
Do you plan to use LPeg?
Will a separate branch be created in CVS?
Where can the latest zip version be downloaded now?

--
mozers
<http://scite.net.ru>

Philippe Lhoste

Feb 10, 2010, 11:15:32 AM
to scite-i...@googlegroups.com
On 10/02/2010 11:22, Philippe Lhoste wrote:
>> The documentation is at
>> http://www.scintilla.org/ScriptLexer.html
>
> Nicely done with good examples.

Although I have some preliminary remarks... :-)

You should indicate where users should put the lexers.
I strongly suspect SciTEStartup.lua (and indeed, it has just been confirmed by the new
build I just made with this version, complaining about a vestigial OnStyle function that had
been lying in this file for a long time, waiting to wake up... :-)).

I suggest that lexer writers (or users) put them in a separate .lua file and load it
with dofile(), to avoid cluttering this file.
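
A minimal sketch of that layout, under some assumptions: the file name zog_lexer.lua and the helper LoadLexerFile are made up, and the lexer file is assumed to live in the directory named by the SciteUserHome property. Outside SciTE there is no props table, so a stand-in is used to keep the sketch runnable on its own.

```lua
-- Stand-in so the sketch also runs outside SciTE; inside SciTE the global
-- `props` table is provided by the editor.
local props = props or { SciteUserHome = "." }

-- Hypothetical helper: load one lexer file from the user's SciTE directory.
-- pcall keeps a missing or broken lexer file from aborting the startup script.
local function LoadLexerFile(name)
  local path = props["SciteUserHome"] .. "/" .. name
  local ok, err = pcall(dofile, path)
  if not ok then
    print("Could not load " .. path .. ": " .. tostring(err))
  end
  return ok
end

-- In SciTEStartup.lua, one line per lexer file (file name is an assumption):
LoadLexerFile("zog_lexer.lua")
```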

Another remark: you use 'function xOnStyle(styler)' in the first example.
Is the x significant or a typo? At first I thought it was a way to distinguish lexers...

Ah, looking more closely, I just noticed: "language : string | Name of the language. Allows
implementation of multiple languages with one OnStyle function."
I wonder if SciTE couldn't dynamically call the lexer depending on the language name.
E.g. calling OnStyle_zog for the zog language, OnStyle_supercalli for the SuperCalli language, etc.

On the other hand, it isn't hard to make OnStyle just a dispatcher to specialized lexer
functions (to avoid having a giant function). The performance hit is probably minimal.
Will go this route.

Seems to be well thought out, and the size increase is negligible. Good job.

Philippe Lhoste

Feb 10, 2010, 1:29:59 PM
to scite-i...@googlegroups.com
On 10/02/2010 17:15, Philippe Lhoste wrote:
> On the other hand, it isn't hard to make OnStyle just a dispatcher to
> specialized lexer functions (to avoid having a giant function). The
> performance hit is probably minimal.
> Will go this route.

OK, nothing like taking an existing lexer as base for a new (or improved) one!
So I beefed up our good Zog language a bit:

@@ Commenter
proc clip(int @a)
    « Clip into the positive zone »
    if (a > 0.1E-14)
        b = +3.14159 + .5
    end
end

with more styles:

lexer.*.zog=script_zog
# DEFAULT
style.script_zog.0=fore:#000000
# IDENTIFIER
style.script_zog.1=fore:#7F007F
# KEYWORD
style.script_zog.2=fore:#000080,font:Andale Mono,bold
# COMMENT
style.script_zog.3=fore:#008000,font:Georgia,size:10
# UNICODECOMMENT
style.script_zog.4=fore:#008000,font:Georgia,italics,size:9
# NUMBER
style.script_zog.5=fore:#008000,font:Andale Mono,bold,size:9
# OPERATOR
style.script_zog.6=fore:#A02000,bold,size:12

and handled them:

function Log4jLexer(styler)
    -- Empty placeholder for now
end

function ZogLexer(styler)
    local S_DEFAULT = 0
    local S_IDENTIFIER = 1
    local S_KEYWORD = 2
    local S_COMMENT = 3
    local S_UNICODECOMMENT = 4
    local S_NUMBER = 5
    local S_OPERATOR = 6
    local keywords = { ["if"] = 1, ["end"] = 1, ["proc"] = 1, ["int"] = 1 }

    -- True while the current character is a letter
    local IsIdentifier = function ()
        local c = styler:Current()
        return c:find('^%a+$') ~= nil
    end

    -- With 'initial' set, tests whether a number can start here;
    -- otherwise tests whether the current character can continue one
    local IsNumber = function (initial)
        local IsDigit = function (c) return c >= '0' and c <= '9' or c == '.' end
        local c = styler:Current()
        if initial ~= nil then
            return IsDigit(c) or ((c == '-' or c == '+') and IsDigit(styler:Next()))
        end
        return IsDigit(c) or c == 'e' or c == 'E' or c == '-' or c == '+'
    end

    -- Plain (non-pattern) search of the current character in the operator list
    local IsOperator = function ()
        return string.find("+-/*%()<>=@", styler:Current(), 1, true) ~= nil
    end

--~ print("Styling: ", styler.startPos, styler.lengthDoc, styler.initStyle)
    styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)

    while styler:More() do
        local stst = styler:State()

        -- Exit state if needed
        if stst == S_IDENTIFIER then
            if not IsIdentifier() then -- End of identifier
                local identifier = styler:Token()
                if keywords[identifier] == 1 then -- Is it a keyword?
                    styler:ChangeState(S_KEYWORD)
                end
                styler:SetState(S_DEFAULT)
            end
        elseif stst == S_COMMENT and styler:AtLineEnd() then
            styler:SetState(S_DEFAULT)
        elseif stst == S_UNICODECOMMENT then
            if styler:Match("»") then
                styler:ForwardSetState(S_DEFAULT)
            end
        elseif stst == S_NUMBER and not IsNumber() then
            styler:SetState(S_DEFAULT)
        elseif stst == S_OPERATOR then
            styler:SetState(S_DEFAULT)
        end

        -- Enter state if needed
        if styler:State() == S_DEFAULT then
            if styler:Match("«") then
                styler:SetState(S_UNICODECOMMENT)
            elseif styler:Match("@@") then
                styler:SetState(S_COMMENT)
            elseif IsIdentifier() then
                styler:SetState(S_IDENTIFIER)
            elseif IsNumber(true) then
                styler:SetState(S_NUMBER)
            elseif IsOperator() then
                styler:SetState(S_OPERATOR)
            end
        end

        styler:Forward()
    end
    styler:EndStyling()
end

-- Maps each script lexer name to the function that handles it
luaLexersLanguages =
{
    ["script_zog"] = ZogLexer,
    ["script_log4j"] = Log4jLexer,
}

-- Single entry point called by SciTE; dispatches on the language name
function OnStyle(styler)
    luaLexersLanguages[styler.language](styler)
end

It is working wonderfully! :-)

Neil Hodgson

Feb 10, 2010, 8:57:55 PM
to scite-i...@googlegroups.com
mozers:

> Questions to Neil:
> Do you plan to use LPeg?

No. Mitchell's code has a larger scope than this as it is useful in
other Scintilla-based applications. I wouldn't mind LPeg support in
SciTE (rather than Scintilla) if it was a simple addition.

> Will a separate branch be created in CVS?

The script lexer code is now in CVS.

> Where can the latest zip version be downloaded now?

It will be in the 2.03 downloads but will be 'secret'. That is it
won't be mentioned in the documentation or release notes.

Neil

Neil Hodgson

Feb 10, 2010, 9:18:21 PM
to scite-i...@googlegroups.com
Philippe Lhoste:

> You should indicate where users should put the lexers.
> I strongly suspect SciTEStartup.lua (and indeed, it has just been confirmed
> by the new build I just made with this version, complaining about a vestigial
> OnStyle function that had been lying in this file for a long time, waiting to
> wake up... :-)).

Yes, they can be put in SciTEStartup.lua although I expect users
will structure their files more sensibly.

> Other remark: you use 'function xOnStyle(styler)' in the first example.
> Is the x significant or a typo? I first thought it was a way to distinguish
> lexers...

I have 3 lexers in SciTEStartup.lua and comment out 2 of them by
renaming. When exporting the function to go on the web, I forgot to
enable the main example.

> Ah, looking better, I just notice: "language : string | Name of the
> language. Allows implementation of multiple languages with one OnStyle
> function."
> I wonder if SciTE couldn't call dynamically the lexer depending on the
> language name.
> Eg. calling OnStyle_zog for the zog language, OnStyle_supercalli for
> SuperCalli language, etc.

I've gone back and forth on that one. Just having a single entry
point allows the cleverness to be done in Lua as shown by SciteExtMan.

Allowing styling of the output pane probably requires a new entry
point in the Extension interface as adding parameters to OnStyle will
be incompatible with current users. This means an OnOutputStyle
function in Lua.

> OK, nothing like taking an existing lexer as base for a new (or improved)
> one!
> So I beefed up a bit our good Zog language:

Links to more complex lexers will be included in the documentation.
I wanted to have very simple examples on the main page to avoid
scaring people.

> local keywords = { ["if"] = 1, ["end"] = 1, ["proc"] = 1, ["int"] = 1 }

I tried this but it looked too ugly so went to:
keywords = {}
for k,v in pairs({"end", "if", "int", "proc"}) do keywords[v] = 1 end
but it would probably be better to do whatever the Lua equivalent to Python's
set("end if int proc".split())
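
One possible Lua counterpart, as a sketch: Lua has no built-in set type, so a table mapping each word to true is used, and string.gmatch with "%S+" plays the role of split(). The helper name Set is made up for this example.

```lua
-- Hypothetical helper: build a set (word -> true) from a space-separated string.
local function Set(words)
  local set = {}
  for word in words:gmatch("%S+") do   -- each run of non-space characters
    set[word] = true
  end
  return set
end

local keywords = Set("end if int proc")
print(keywords["if"], keywords["while"])   -- true and nil
```

Membership is then tested with a plain index, as in `if keywords[identifier] then`.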

Neil

Philippe Lhoste

Feb 11, 2010, 7:50:40 AM
to scite-i...@googlegroups.com
On 11/02/2010 03:18, Neil Hodgson wrote:
> Yes, they can be put in SciTEStartup.lua although I expect users
> will structure their files more sensibly.

Using dofile, I suppose? (or require, perhaps.) Unless I missed the capability to have
several Lua script files in SciTE.

> I've gone back and forth on that one. Just having a single entry
> point allows the cleverness to be done in Lua as shown by SciteExtMan.

Yes, and it is consistent with the way OnUserListSelection currently works, for example.

> Links to more complex lexers will be included in the documentation.
> I wanted to have very simple examples on the main page to avoid
> scaring people.

Sure, and you did a good job selecting features illustrating the various available functions.

>> local keywords = { ["if"] = 1, ["end"] = 1, ["proc"] = 1, ["int"] = 1 }
>
> I tried this but it looked too ugly so went to:
> keywords = {}
> for k,v in pairs({"end", "if", "int", "proc"}) do keywords[v] = 1 end
> but it would probably be better to do whatever the Lua equivalent to Python's
> set("end if int proc".split())

Indeed, I dislike the notation I used, I was just trying to explore the various Lua
capabilities we can use. Like the regex to replace the list of identifier chars: on one
hand, it is probably just slower (not even sure), on the other hand, it is nice and easy
(but less readable!) and the speed difference is probably undetectable.

I just rewrote your suggestion as:

function KeywordList(list)
    local kl = {}
    for i, v in ipairs(list) do
        kl[v] = i
    end
    return kl
end

used as:

local keywords = KeywordList{ 'if', 'end', 'proc', 'int' }

(and of course used as: if keywords[identifier] ~= nil then)

which makes a bearable syntax.
Maybe such a function can be added as a helper to styler so that each user doesn't have to re-create it?


This functionality not only allows non-C++ programmers to have their own lexer, but it
eases maintenance. For example, I still drag around my LexAHK1.cxx file which was rejected
due to clumsy coding (I agree, but the language is clumsy itself... :-) - never had time
to improve that and it works well enough for me) and it is annoying to re-inject it in
each new version (I automated that with... AutoHotkey itself).
So this Lua lexer is excellent for those with private lexers (for esoteric or proprietary
languages) as one doesn't have to hack SciLexer.h, KeyWords.cxx and make files on each
release.

Another good use case is to parse logs: we use Log4j which is quite configurable in output
format. For example, we have logs like:

20081010 173050 CEST [INFO] <Thread-8> Creating data: Stuff

We can highlight date and time, severity, thread info, etc.
But such a lexer needs to be quickly modifiable if we change the output format. Lua lexers
are wonderful for such a task.
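
To give an idea of how little code the field splitting takes, here is a plain-Lua sketch that picks the sample line apart with a single pattern. It is not a lexer: in a script lexer the same field boundaries would drive the state changes. ParseLogLine and the field names are made up, and the pattern assumes exactly the layout of the sample line above.

```lua
-- Hypothetical helper: split one log line of the sample format into fields.
-- Returns nil when the line does not match the assumed layout.
local function ParseLogLine(line)
  local date, time, zone, level, thread, message =
    line:match("^(%d+) (%d+) (%a+) %[(%a+)%] <([^>]+)> (.*)$")
  if not date then
    return nil
  end
  return {
    date = date, time = time, zone = zone,
    level = level, thread = thread, message = message,
  }
end

local sample = "20081010 173050 CEST [INFO] <Thread-8> Creating data: Stuff"
local fields = ParseLogLine(sample)
print(fields.date, fields.level, fields.thread, fields.message)
```

When the log format changes, only the pattern string needs editing, which is the kind of quick modification described above.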


Last note: I think I found a bug, or an unexpected behavior... :-)
If I load a file with an extension SciTE doesn't know, eg. .duff, OnStyle is called
repeatedly with an empty string in styler.language.
I can easily guard against this case but it should be either corrected or documented... :-)
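
For reference, the guard can be as small as this sketch. The lexer table, the Dispatch helper and the placeholder lexer function are stand-ins for illustration; only styler.language and the OnStyle entry point come from the thread.

```lua
-- Hypothetical lexer table; the function here is a placeholder that only
-- records that it was called.
local calls = {}
local lexers = {
  ["script_zog"] = function(styler) calls[#calls + 1] = styler.language end,
}

-- Look up the lexer for this language and run it if there is one.
-- Returns true when a lexer ran, false for an unknown or empty language.
local function Dispatch(styler)
  local lexer = lexers[styler.language]
  if lexer == nil then
    return false
  end
  lexer(styler)
  return true
end

function OnStyle(styler)
  Dispatch(styler)
end

-- Simulated calls with minimal fake styler tables:
OnStyle({ language = "" })             -- silently ignored
OnStyle({ language = "script_zog" })   -- handled
print(#calls)                          -- 1
```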

mozers

Feb 11, 2010, 2:00:53 PM
to Philippe Lhoste
Thursday, February 11, 2010, 3:50:40 PM, Philippe wrote:
> local keywords = KeywordList{ 'if', 'end', 'proc', 'int' }

IMHO it would be more convenient to store the keywords in the language.properties:

-- Returns the words of the given property as a list (array) of strings
function GetWordList(prop)
    local t = {}
    local s = props[prop]
    for v in string.gmatch(s, "(%w+)") do
        t[#t+1] = v
    end
    return t
end

And the styles as well.

--
mozers
<http://scite.net.ru>

mozers

Feb 11, 2010, 2:04:53 PM
to Neil Hodgson
Thursday, February 11, 2010, 4:57:55 AM, Neil wrote:
> The script lexer code is now in CVS.

This is great news!

--
mozers
<http://scite.net.ru>

Neil Hodgson

Feb 11, 2010, 6:43:53 PM
to scite-i...@googlegroups.com
Philippe Lhoste:

> Using dofile, I suppose? (or require, perhaps.) Unless I missed the
> capability to have several Lua script files in SciTE.

I'd expect most to use dofile.

> This functionality not only allows non-C++ programmers to have their own
> lexer, but it eases maintenance.

It's also a stepping stone. If you write a script lexer then it
should be convertible to a C++ lexer in a fairly obvious way.

The things you lose from writing a script lexer are speed and
portability to the many other applications that use Scintilla.

> Another good use case is to parse logs: we use Log4j which is quite
> configurable in output format. For example, we have logs like:
>
> 20081010 173050 CEST [INFO] <Thread-8> Creating data: Stuff
>
> We can highlight date and time, severity, thread info, etc.
> But such a lexer needs to be quickly modifiable if we change the output
> format. Lua lexers are wonderful for such a task.

An easy to use log viewer with great filtering abilities would be a
good project.

> Last note: I think I found a bug, or an unexpected behavior... :-)
> If I load a file with an extension SciTE doesn't know, eg. .duff, OnStyle is
> called repeatedly with an empty string in styler.language.
> I can easily guard against this case but it should be either corrected or
> documented... :-)

I'm unsure. It may be frustrating to add an OnStyle and then not
have it called because you messed up a property. OTOH it would be more
efficient to set the lexer to SCLEX_NULL.

Neil

Philippe Lhoste

Feb 12, 2010, 4:02:03 AM
to scite-i...@googlegroups.com
On 12/02/2010 00:43, Neil Hodgson wrote:
> The things you lose from writing a script lexer are speed and
> portability to the many other applications that use Scintilla.

Well, some years ago, I tried SciLexer with an additional lexer in it with Notepad2 and
Notepad++.
I found out that the lexer list was hard-coded, so the new lexer was ignored. Or Scintilla
itself was statically linked, so no simple substitution was possible.
It would need hacking the code and recompiling to take the new lexer into account.
Now, that is still easier than shoehorning Lua into an application to make a lexer work! :-)

Note: perhaps these editors are more flexible today (or not), I haven't checked in a while.

> I'm unsure. It may be frustrating to add an OnStyle and then not
> have it called because you messed up a property. OTOH it would be more
> efficient to set the lexer to SCLEX_NULL.

Well, that's part of debugging... After all, if you mess up properties (like I did,
duplicating a style without updating the number, for example), you won't see the expected
results.
I would go for the efficient way... :-)

Neil Hodgson

Feb 13, 2010, 9:31:33 PM
to scite-i...@googlegroups.com
Philippe Lhoste:

> Well, some years ago, I tried SciLexer with an additional lexer in it with
> Notepad2 and Notepad++.
> I found out that the lexer list was hard-coded, so the new lexer was
> ignored.

They'll often include new lexers in new releases.

> Well, that's part of debugging... After all, if you mess up properties (like
> I did, duplicating a style without updating the number, for example), you
> won't see the expected results.
> I would go for the efficient way... :-)

OK, changed in CVS (post 2.03 release) to SCLEX_NULL for unknown extensions.

Neil
