How to write a lexer (scanner)


Thomas Tempelmann

Jan 30, 2006, 3:23:33 AM
to REALs...@googlegroups.com
OK, I'm planning to write a language scanner & lexer for the REALbasic
language -- unless someone has already written one and can offer it?

I want to make it efficient, either using a very smart (and hence
hopefully fast) RegEx or scanning the source byte-for-byte, probably
from a MemoryBlock, for optimal speed.

So, has anyone done any work or developed any opinions on this already
that he/she/it would like to share?

--
Thomas Tempelmann - exaggerating over a million times a day!
http://www.tempel.org/rb/ -- The primary source of outdated REALbasic
plugins and examples
Skype: tempel.org AIM: superTempel

Thomas Tempelmann

Jan 30, 2006, 4:07:43 AM
to REALs...@googlegroups.com
Some thoughts:

I could use something like the "flex" ("lex") tool to generate a
scanner in C and then make a plugin from it.
(http://en.wikipedia.org/wiki/Lex_programming_tool)

Another way I learned once was to do it like this, which combines the
scanner and the tokenizer nicely:

You build a binary tree, with all predefined symbols already in it
(e.g. =, >, >=, return, if, etc.).
Now, when you scan the source, you take each char from the source and
walk the binary tree with it or extend the tree with the new chars.
You end up with a known or new symbol right away in the tree that way.

Of course, the scanner still has to know the basic rules of which
chars are in a symbol, and what is whitespace, and how to detect
literals (strings, in particular). Walking the binary tree along with
the scanner might make it a bit faster compared to first getting the
symbol and then looking it up in a Dictionary, I could imagine
(although the dict might still be faster because it's written in
"optimized" C while the binary tree would be handled in plain "slow"
RB).
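
Roughly, I picture the tree part like this -- just a sketch, using nested
Dictionaries as the tree nodes so it can be written out as plain code (a
real version would probably use a small node class, and all the names
below are only illustrative):

Sub AddSymbol(root As Dictionary, symbol As String)
  // Insert one predefined symbol (e.g. ">=" or "return") character by character.
  Dim node As Dictionary
  Dim c As String
  Dim i As Integer
  node = root
  For i = 1 To Len(symbol)
    c = Mid(symbol, i, 1)
    If Not node.HasKey(c) Then
      node.Value(c) = New Dictionary // grow the tree with a new branch
    End If
    node = node.Value(c)
  Next
  node.Value("") = True // the empty key marks "this path is a complete symbol"
End Sub

Function IsKnownSymbol(root As Dictionary, word As String) As Boolean
  // Walk the same structure the way the scanner would, one character at a time.
  Dim node As Dictionary
  Dim c As String
  Dim i As Integer
  node = root
  For i = 1 To Len(word)
    c = Mid(word, i, 1)
    If node.HasKey(c) Then
      node = node.Value(c)
    Else
      Return False // fell off the tree: not a predefined symbol
    End If
  Next
  Return node.HasKey("")
End Function

The scanner would do the walking step as it consumes each character
rather than collecting the whole word first, but the data structure is
the same either way.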

Separating the scanner from the tokenizer might, OTOH, make the lexer
code more readable. Perhaps that's more important if we make this open
source, right?

Any opinions on this?

Ed Kleban

Jan 30, 2006, 5:36:56 PM
to REALs...@googlegroups.com


On 1/30/06 2:23 AM, "Thomas Tempelmann" <tempe...@gmail.com> wrote:

>
> OK, I'm planning to write a language scanner & lexer for the REALbasic
> language -- unless someone has already written one and can offer it?

Your timing, as usual, Thomas, is impeccable. Scott and I were just
discussing that Jon appears to be sufficiently busy that we probably won't
be seeing a grammar from him for morphe anytime soon. If it were up to me,
I don't know that I'd use morphe for this function, but I sure would stare
at it enough to fully understand it, hopefully in the context of some small
example that uses it, before I went and wrote something new from scratch.

Specifically, when talking with Jon, he mentioned developing an "AST"
(Abstract Syntax Tree). I'd love to ask him for an example of the code that
does this for a much simpler grammar.

I happen to like this approach because it nicely partitions the code that
generates the AST from a grammar, the specification of the grammar itself,
and the various ways we might like to use the AST for our own projects. I
believe that having the grammar as a data set for community review and
discussion, independent of the code that uses it, is a big plus if that's
possible, since it will allow us to communally discuss grammar nuances as we
learn more and to tweak it for future RB changes without getting into the
guts of the code.




> I want to make it efficient, either using a very smart (and hence
> hopefully fast) RegEx or scanning the source byte-for-byte, probably
> from a MemoryBlock, for optimal speed.

"Hopefully fast" and "RegEx" are a contradiction in terms. I never understood
why you pursued the use of regular expressions instead of using a simple
state-based parser when you tackled the indentation problem. Regular
expressions are inherently slow. They're a great general purpose tool that
can save you a massive amount of time for developing simple scanners for
one-off purposes. But from my vantage point you kept revisiting them over
and over again to tweak them for different odd-ball cases you kept finding
that didn't fit your quick-and-dirty first pass implementation. I offered
you a proposal then for a fast, byte-at-a-time single-pass in-context
tokenizer approach. A full lexer would require a more heavy-duty version
but it would likely be the general approach at the tokenizer level if I were
going to roll my own.
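
To make the distinction concrete, here's a bare-bones sketch (an
illustration only, not the proposal I sent you) of what I mean by a
single-pass, character-at-a-time scanner: one walk over the text,
branching on the class of the current character. A real lexer would add
states for strings, comments, keywords and so on.

Sub ScanTokens(src As String, tokens() As String)
  Dim i As Integer
  Dim n As Integer
  Dim start As Integer
  Dim code As Integer
  Dim c As String
  n = Len(src)
  i = 1
  While i <= n
    c = Mid(src, i, 1)
    code = Asc(c)
    If c = " " Or c = Chr(9) Or c = Chr(13) Or c = Chr(10) Then
      i = i + 1 // whitespace: just skip it
    ElseIf code >= 48 And code <= 57 Then
      start = i // a digit: consume the whole run as one number token
      While i <= n And Asc(Mid(src, i, 1)) >= 48 And Asc(Mid(src, i, 1)) <= 57
        i = i + 1
      Wend
      tokens.Append Mid(src, start, i - start)
    Else
      tokens.Append c // placeholder: everything else becomes a one-character token
      i = i + 1
    End If
  Wend
End Sub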



> So, has anyone done any work or developed any opinions on this already
> that he/she/it would like to share?
>

Good luck!

--Ed


Here's some of the relevant communication history from Jon regarding his
former plans for morphe:

On 12/5/05, Ed Kleban <E...@kleban.com> wrote:
> >Nope. We have never released a grammar for REALbasic. However, you're
> >now the third person that has asked for one, so I might try to hammer
> >out a grammar file and a corresponding AST tonight to help out with
> >the cause.
>
> "AST"?

Abstract Syntax Tree. Basically, it's a tree that represents the code
it parsed, so instead of implementing the parser in terms of functions
that interact directly during the parsing phase, you're given a tree
structure back that will allow you to analyze it in any way without
requiring separate grammar files for each. Specific to this: when done,
you won't actually need to use the grammar file itself, but rather
just drag a set of classes in, call Parse() on the parser, and it will
return a tree for you representing the program.
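
Purely as a hypothetical illustration -- the class and property names
below are made up for the example, not Morphe's actual API -- walking
such a tree could look something like this:

Sub DumpTree(node As ASTNode, depth As Integer)
  // ASTNode is a stand-in: assume a Kind string and a Children() array of nodes.
  Dim indent As String
  Dim i As Integer
  For i = 1 To depth
    indent = indent + "  "
  Next
  System.DebugLog(indent + node.Kind) // one line per node
  For i = 0 To Ubound(node.Children)
    DumpTree(node.Children(i), depth + 1) // recurse into each child
  Next
End Sub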

> That would be kind of you, and very much appreciated. It is a reasonably
> good cause after all, namely for better power tools for the RB community.
> Doh! I just saw that my reply yesterday to Scott Steinman got directed
> through NUG in the "Got Parser?" topic as opposed to going to him directly
> as intended. Oh well, I guess you can read about the "cause" in black and
> white if you haven't already. I can't promise nobody will eventually try
> to make a buck on tools such as these, but I can certainly tell you that any
> tools I churn out for the community will offer ample credit to the
> contributors. I'd be delighted to list you among them.

It's not a big deal. I'm just saving up for Christmas, so I don't want
to dedicate too much time that I could spend making money on this ;)
It's an expensive Christmas this year -- engagement ring :)
> > I guess I hide my identity pretty well on the site :) I'll definitely be
> > there.
>
> You mean the fact that you're a RS employee? Naw I picked that up, and
> thought it likely you would be attending.


On 12/5/05, Dr. Scott Steinman <stei...@sco.edu> wrote:

> a. I was hoping to use Morphe as the parser, but still need to
> figure out how to use Morphe on a real programming language and how
> to feed its output to a contextual analyzer/syntax tree.

Sorry for not getting back sooner, but as interest is mounting for
various other projects as well, and even though my time hasn't
increased in quantity, I'll help out by creating a REALbasic grammar
for Morphe that creates an abstract syntax tree based on a chunk of
statements you give it. I'll try to get it done within the next few
evenings.

-Jon

Thomas Tempelmann

Jan 30, 2006, 5:44:12 PM
to REALs...@googlegroups.com
On 1/30/06, Ed Kleban <E...@kleban.com> wrote:
> "Hopefully fast" and "RegEx" are a contradiction in terms. I never understood
> why you pursued the use of regular expressions instead of using a simple
> state-based parser when you tackled the indentation problem. Regular
> expressions are inherently slow.

Where do you have this knowledge from? From my theoretical understanding
of RegEx, it builds an internal state machine table from the search string
and then should be pretty fast scanning the target string once the tables
have been built. And since the building of the table should only happen once,
while the parsing of target strings happens often, I'd think it should be
faster than doing it all by hand in RB.
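
For example (just a sketch -- the pattern below is a stand-in, not a
complete token pattern for RB source), the RegEx class lets you set the
pattern once and then keep calling Search, which continues from the end
of the previous match:

Dim re As New RegEx
Dim match As RegExMatch
Dim tokens() As String
re.SearchPattern = "[A-Za-z_][A-Za-z0-9_]*|\d+|\S" // identifiers, integers, or any other single mark
match = re.Search(sourceText) // sourceText is assumed to hold the source to scan
While match <> Nil
  tokens.Append match.SubExpressionString(0) // the text of the current match
  match = re.Search() // continue scanning where the last match ended
Wend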

> I offered
> you a proposal then for a fast, byte-at-a-time single-pass in-context
> tokenizer approach.

Well, what you offered was more like the idea of a tokenizer. I still need
to code it. And that's harder done than talked of.

Ed Kleban

Jan 30, 2006, 6:10:27 PM
to REALs...@googlegroups.com


On 1/30/06 4:44 PM, "Thomas Tempelmann" <tempe...@gmail.com> wrote:

>
> On 1/30/06, Ed Kleban <E...@kleban.com> wrote:
>> "Hopefully fast" and "RegEx" are a contradiction in terms. I never understood
>> why you pursued the use of regular expressions instead of using a simple
>> state-based parser when you tackled the indentation problem. Regular
>> expressions are inherently slow.
>
> Where do you have this knowledge from?

From the uninformed bias of making a snap judgment based on intuition in the
context of total ignorance regarding the actual facts of course.

And from the inference, based on your previous emails about using it for
indentation, that you weren't using a single regex to accomplish this but a
whole battery of them applied in succession. This is really the heart of my
concern. If you've got a single RegEx that determines indentation or can
parse RB code in a single pass, I'll eat my bits. But if you're lining up a
brigade of regular expressions to successively parse the same text over and
over again with conditional logic, then I do believe that, if efficiency is
what matters most, a state-based scanner that streams over the text a single
time is going to be much faster.

> From my theoretical understanding
> of RegEx, it builds an internal state machine table from the search string
> and then should be pretty fast scanning the target string once the tables
> have been built.

That's my understanding as well.

> And since the building of the table should only happen once, while the
> parsing of target strings happens often, I'd think it should be faster than
> doing it all by hand in RB.

That may very well be. Don't know, but I'd like to.

>
>> I offered
>> you a proposal then for a fast, byte-at-a-time single-pass in-context
>> tokenizer approach.
>
> Well, what you offered was more like the idea of a tokenizer. I still need
> to code it. And that's harder done than talked of.

Yes. What I offered was a trivial example of a simple preparser that would
allow higher-level logic to work at the token level but have positional
context it could immediately look up rather than having to repeatedly
recalculate it in a far less efficient manner. But that's a far cry from a
full-blown implementation.
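
In sketch form, the output is little more than parallel arrays (all the
names here are invented for the illustration), so later passes can ask
where a token sits instead of re-scanning the text to find out:

Sub RecordToken(texts() As String, starts() As Integer, tokenText As String, startPos As Integer)
  // The preparser appends each token's text and character offset as it scans.
  texts.Append tokenText
  starts.Append startPos
End Sub

Function TokenStart(starts() As Integer, k As Integer) As Integer
  // Later passes get a token's position by index instead of recalculating it.
  Return starts(k)
End Function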

You asked for ideas. That's what I'm offering. I'd still check out morphe
pretty thoroughly first.

Thomas Tempelmann

Jan 30, 2006, 6:17:29 PM
to REALs...@googlegroups.com
On 1/31/06, Ed Kleban <E...@kleban.com> wrote:
> I'd still check out morphe
> pretty thoroughly first.

Ahem, you do know that Morphe does not provide a lexer, don't you?
That's why I need a lexer: because there is none yet.

Ed Kleban

Jan 30, 2006, 7:00:39 PM
to REALs...@googlegroups.com
Nope, haven't taken the time to look at it. All I know is that it exists
and is not sufficiently high on my infinite list of interesting things to
investigate.

Scott Steinman

Jan 30, 2006, 8:19:56 PM
to REALs...@googlegroups.com
Thomas is correct. Morphe supplies only the parser; the lexer,
supplied by the user, must be plugged into it through one of Morphe's
methods.

In my own inspection of Morphe so far, I've been unable to figure out
how to take its output and build an abstract syntax tree. I also
find Morphe's syntax -- plugging in what looks like RBScript code
rather than pure grammar rules -- a little harder to understand than
other parsers. I can more easily write a full EBNF grammar for
REALbasic than a grammar suited for Morphe.

My compiler books use one of two approaches to build lexers/parsers:
1. Hand-built to fit the grammar.
2. Using available lexer and parser generators, like 'lex' and 'yacc' (or
their equivalents in Java, such as 'ANTLR').
Such tools don't exist in REALbasic -- Morphe is the closest we have, and it
only generates the parser.

I can do the first approach by modifying the example code of my
compiler books, if somebody else will help me with the EBNF grammar
(to ensure I don't make mistakes). However, the flaw of this
approach is the need to recode when the language syntax changes. I
don't think I can easily handle creating my own generic lexer and
parser generators. I'd have to dissect the existing generator tool's
code and code a REALbasic equivalent, and I'm not confident that I'm
talented enough to do it.

Dr. Scott Steinman
Brought to you by a grant from the Steinman Foundation (Thanks, Mom
and Dad!)
Recommended by Major University Studies Over the Leading Brand
steinman at midsouth dot rr dot com

I hope I die peacefully in my sleep like my grandfather. . .not
screaming in terror like his passengers. -- "Deep Thoughts", Jack Handy

Thomas Tempelmann

Jan 31, 2006, 3:33:10 AM
to REALs...@googlegroups.com
I have my scanner almost working. It does basic tokenizing of RB
source without knowing reserved words made of letters (if, return,
...), but already identifies strings, numbers, symbols, and comments.

Such tokens, when they are symbols, can then be matched up against a
dictionary of reserved words.
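
Something along these lines, say -- just a sketch with only a handful of
keywords filled in (the full list is exactly what I'm asking about
below), and keys lowercased since RB keywords aren't case-sensitive:

Dim reserved As New Dictionary
reserved.Value("if") = True
reserved.Value("then") = True
reserved.Value("return") = True
reserved.Value("dim") = True
// ... and so on for the rest of the reserved words

Dim tokenText As String // assumed to be the symbol the scanner just produced
If reserved.HasKey(Lowercase(tokenText)) Then
  // it's a reserved word
Else
  // it's an ordinary identifier
End If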

Where do I get a _complete_ list of reserved words from, now? Mars,
don't you have one internally, at least?


About using Morphe or writing your own parser:

I would strongly try to support Morphe instead of making your own,
just because Morphe is apparently supported by the compiler team
(well, it's been written by one of them), meaning that it is not
unlikely to get updated when RB's syntax changes.

For building the AST (abstract syntax tree) -- I have no idea what it
looks like, but Jonathan J had indicated he'd like to write one on top
of Morphe, so I think we should follow that path. I just don't know
how yet.

Anyone here know something about how to build an AST from a parser output?

(see: http://morphe.nilobject.com)

Ed Kleban

Jan 31, 2006, 5:22:47 PM
to REALs...@googlegroups.com, j...@nilobject.com


On 1/31/06 2:33 AM, "Thomas Tempelmann" <tempe...@gmail.com> wrote:

>
> I have my scanner almost working. It does basic tokenizing of RB
> source without knowing reserved words made of letters (if, return,
> ...), but already identifies strings, numbers, symbols, and comments.
>
> Such tokens, when they are symbols, can then be matched up against a
> dictionary of reserved words.
>
> Where do I get a _complete_ list of reserved words from, now? Mars,
> don't you have one internally, at least?
>
>
> About using Morphe or writing your own parser:
>
> I would strongly try to support Morphe instead of making your own,

I agree with that. I'd sure like to use Morphe if we can, especially if we
can get a jump start with a little help from Jon. Unfortunately, rumor has
it that Jon has been ill recently, just got married, and has no lack of
things to do at work, I'm sure. I just don't know whether, despite his
previously stated desire, he'll have time to help in a timely manner.

> just because Morphe is apparently supported by the compiler team
> (well, it's been written by one of them), meaning that it is not
> unlikely to get updated when RB's syntax changes.

That, however, I disagree with completely. Just the fact that there is no RB
grammar for morphe is evidence that morphe isn't used at all in RB and is
highly UNLIKELY to get updated. I believe it's simply a fun personal
project Jon engaged in some time ago. The fact that he both wrote Morphe
and may work on languages at RB is a result of his interest in languages, I
presume, and not an endorsement of any official status regarding morphe --
and certainly not an offer or implication to maintain it.

But I like the fact that it's open source and written in RB, and I'd very
much hope we'll take some approach that is data-driven, based on a grammar
spec that we can tweak over time as needed.

> For building the AST (abstract syntax tree) -- I have no idea what it
> looks like, but Jonathan J had indicated he'd like to write one on top
> of Morphe, so I think we should follow that path. I just don't know
> how yet.

That's why I suggested that we ask Jon if he has an example he can share
of this for some other very simple grammar. As he's noted previously,
developing a grammar for RB is non-trivial and could itself take several
days. But understanding the AST is, I believe, the key clue to figuring out
how to use morphe for this purpose. I'll include Jon in the recipient list
of this email. Perhaps he'll be able to take a moment to send us a reply.

Mars Saxman

Jan 31, 2006, 7:13:22 PM
to REALsource
> Where do I get a _complete_ list of reserved words from, now? Mars,
> don't you have one internally, at least?

Yes. I don't know whether it's documented anywhere, and even if it is,
I'd be surprised if it's fully up to date. Even the code editor gets it
wrong sometimes. There certainly is an official internal list, though;
I'll check to see if I can post it.

> I believe it's simply a fun personal
> project Jon engaged in some time ago.

This is correct. If there is a REALbasic grammar for Morphe, it is just
something someone maintains on their own, and not related to the actual
Rb parser.

Mars Saxman
REAL Software

Mars Saxman

Jan 31, 2006, 7:21:20 PM
to REALsource
Here's the complete list of reserved words in REALbasic 2006r1:

If, Sub, Function, Until, As, End, Mod, Dim, Then, Else, While, Wend,
Raise, For, Exception, Array, Of, Not, GoTo, And, Or, Nil, Rem, Next,
ElseIf, To, True, False, ByRef, New, Select, Case, Return, Self, Do,
Loop, Exit, Redim, DownTo, Step, IsA, Me, #pragma, #bad, #tag, ByVal,
Optional, Extends, Assigns, Const, #if, #else, #endif, Declare, Class,
Object, Interface, Implements, Inherits, Property, Public, Private,
Protected, Shared, Namespace, Module, Static, Event, Handles, Each, In,
Catch, Try, Finally, Super, Lib, Inline68k, Is, Call, AddressOf,
Delegate, ParamArray, #ElseIf, Soft, Continue, With, Structure, Enum,
RaiseEvent

Mars Saxman
REAL Software

Thomas Tempelmann

Jan 31, 2006, 7:26:44 PM
to REALs...@googlegroups.com
Much appreciated. The scanner works; the parser created by Morphe
works -- well, it ends up in an endless loop and kills itself trying...
but I'll get it tamed :)

And after that I can do the AST. Easy.

Ed Kleban

Jan 31, 2006, 11:06:38 PM
to REALs...@googlegroups.com
Super!

More fodder for the wiki.

Thanks Mars.

Scott Steinman

Jan 31, 2006, 11:42:04 PM
to REALs...@googlegroups.com
Thomas, Ed-

Knowing the keywords is good, but a lot more than that needs to be
handled in order to construct the abstract syntax tree and symbol
table. We also need to be able to classify identifiers, handle method
calls, their parameters and return values, tell apart intrinsics like
Ubounds from other methods, and handle framework objects, variables
and methods.

In other words, once a parsed "word" is not recognized as a keyword,
it must be classified as an intrinsic or identifier (data or code).
The identifier must then be classified in terms of type, scope, name,
and value. This must be compared to the symbol table to see if it
already exists or not. If it does not, it is added to the symbol
table at the current scope; if it does, it is linked to the existing
value.
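
As a rough sketch of just that lookup-or-add step, with a stack of
Dictionaries standing in for the scope chain (the real entries will need
type, value and declaration info rather than the bare string used here):

Function LookupOrAdd(scopes() As Dictionary, name As String, kind As String) As String
  Dim i As Integer
  Dim key As String
  key = Lowercase(name) // REALbasic identifiers are not case-sensitive
  // Search from the innermost (current) scope outward.
  For i = Ubound(scopes) DownTo 0
    If scopes(i).HasKey(key) Then
      Return scopes(i).Value(key) // found: link to the existing entry
    End If
  Next
  // Not found anywhere: add it to the current scope.
  scopes(Ubound(scopes)).Value(key) = kind
  Return kind
End Function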

This information is also used to "decorate" the abstract syntax tree
with contextual information. This abstract syntax tree must be
retained if refactoring is to be done because refactoring will move
entities around in the abstract syntax tree.

I've started the process of handling the frameworks and intrinsics by
building an XML file containing all framework definitions -- objects,
modules, interfaces, methods, constants and variables. This can be
read into the symbol table before parsing the source code and
references in the source code compared to those already existing in
the frameworks. Similarly, return values from methods (whether
framework methods or user-defined methods) can be compared to these
as well. Shall I post this to my iDisk for others to examine and
shred, er, I mean critique?
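
In rough terms, reading it back in might look something like this (the
element and attribute names are placeholders here, not the actual format
of my file):

Dim xml As New XmlDocument
Dim root As XmlNode
Dim node As XmlNode
Dim symbolTable As New Dictionary
Dim i As Integer
xml.LoadXml(frameworkXml) // frameworkXml is assumed to hold the file's contents
root = xml.DocumentElement
For i = 0 To root.ChildCount - 1
  node = root.Child(i)
  If node.Name = "class" Then // e.g. <class name="FolderItem"> ... (placeholder format)
    symbolTable.Value(Lowercase(node.GetAttribute("name"))) = node
  End If
Next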

-Scott

Ed Kleban

Feb 1, 2006, 12:12:53 AM
to REALs...@googlegroups.com


On 1/31/06 10:42 PM, "Scott Steinman" <stei...@midsouth.rr.com> wrote:

>
> Thomas, Ed-
>
> Knowing the keywords is good, but a lot more than that needs to be
> handled in order to construct the abstract syntax tree and symbol
> table. We also need to be able to classify identifiers, handle method
> calls, their parameters and return values, tell apart intrinsics like
> Ubounds from other methods, and handle framework objects, variables
> and methods.


Yes, you're quite right.



> In other words, once a parsed "word" is not recognized as a keyword,
> it must be classified as an intrinsic or identifier (data or code).
> The identifier must then be classified in terms of type, scope, name,
> and value. This must be compared to the symbol table to see if it
> already exists or not. If it does not, it is added to the symbol
> table at the current scope; if it does, it is linked to the existing
> value.
>
> This information is also used to "decorate" the abstract syntax tree
> with contextual information. This abstract syntax tree must be
> retained if refactoring is to be done because refactoring will move
> entities around in the abstract syntax tree.

Yep, yep. I'm going to need access to all of this sooner or later for RI.
At some point I also want to make a comprehensive list of all the methods,
properties, etc for all of the intrinsic classes. I was planning on writing
some automation that pulled a great deal of the info out of the LR .html
pages. But of course, there's always the couple hundred additional minor
details that you'd only know about from reading release notes -- if you're
lucky.



> I've started the process of handling the frameworks and intrinsics by
> building an XML file containing all framework definitions -- objects,
> modules, interfaces, methods, constants and variables. This can be
> read into the symbol table before parsing the source code and
> references in the source code compared to those already existing in
> the frameworks. Similarly, return values from methods (whether
> framework methods or user-defined methods) can be compared to these
> as well. Shall I post this to my iDisk for others to examine and
> shred, er, I mean critique?

Sounds like we're about at the point where we need some shared file space on
REALopen to exchange and update stuff like this, if it's not simple textual
info that could be easily maintained, read, and updated by all on the
rbwiki.

But the simple answer is, "Sure, post away!" Come next week I might even
have time to look at it.
