Has anyone hand-written a scanner/parser module?

tuxisthe...@gmail.com

unread,

Nov 15, 2008, 12:49:38 PM11/15/08

to

I know most people anymore use lex/yacc or some derivative of these
tools to create scanner/parser modules for their compiler projects. I
was wondering if anyone has developed a scanner or parser that they
personally hand-wrote? If so, I would like to know what language you
used and what type of grammar you parsed. If you feel divulgent,
please tell me a little bit about you're implementation or what you
did for intermediate representation. I'm curious and just a bit nosy
and would like to know how you're experience went doing things this
way. Thanks.
[I hand-wrote a scanner for Fortran 77 about 30 years ago. It worked but
it was pretty gross because Fortran's lexical structure is so strange. -John]

Eric Christopher

unread,

Nov 15, 2008, 6:35:32 PM11/15/08

to

On Sat, Nov 15, 2008 at 9:49 AM, tuxisthe...@gmail.com
<tuxisthe...@gmail.com> wrote:
> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they

> personally hand-wrote? ...

The clang project (clang.llvm.org) uses a handwritten lexer and parser
for C based languages.

-eric

Felipe Angriman

unread,

Nov 15, 2008, 7:12:08 PM11/15/08

to

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote?

Wrote an interpreter for a very simple imperative language at university.
Back then I was given the choice of using lex (or some other variant)
or make my custom lexer. I choose to write the lexer by hand because I din't
want to learn lex back. if you're doing some complex pattern matching try using
lex, it is more maintainable the handwritten code

> If so, I would like to know what language you
> used and what type of grammar you parsed.

I used plain old ANSI C 90,
The grammar was LL(1), -> this allows you to compute
First(1) And Follow(1) Sets very easily by hand.
The Parser was a Recursive descent parser

>If you feel divulgent, please tell me a little bit about you're
> implementation or what you did for intermediate representation.

The program generated an AST for intermediate representation.
Then I used a Series of recursive functions to interpret the AST.

> I'm curious and just a bit nosy and would like to know how you're
> experience went doing things this way

If I had to do something serious, I would do some things differently
now, I would choose C#, Java o some other Language which has better
Type Checking and Libraries than C. (I messed things up plenty times
during the construction of the interpreter)

Hope it helps,

-Felipe

Dmitry A. Kazakov

unread,

Nov 16, 2008, 4:20:08 AM11/16/08

to

On Sat, 15 Nov 2008 09:49:38 -0800 (PST), tuxisthe...@gmail.com
wrote:

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote?

I did, a table-driven parser.

> If so, I would like to know what language you
> used and what type of grammar you parsed.

Do you mean the language in which the parser was written, or the
language[s] being parsed?

Anyway, it is not based on any grammar. Of course, one could try to figure
out a more or less exact class of grammars it can handle. But I am too lazy
to do it.

> If you feel divulgent,
> please tell me a little bit about you're implementation or what you
> did for intermediate representation. I'm curious and just a bit nosy
> and would like to know how you're experience went doing things this
> way.

The source is free:

http://www.dmitry-kazakov.de/ada/components.htm#Parsers_etc

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Jeremy Bennett

unread,

Nov 16, 2008, 5:06:33 AM11/16/08

to

On Sat, 15 Nov 2008 09:49:38 -0800, tuxisthe...@gmail.com wrote:

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they

> personally hand-wrote? ...

It's old, but take a look at "BCPL: The Language and its Compiler" by
Martin Richards and Colin Whitby-Strevens. It includes the listing and
description for the BCPL compiler front end (in BCPL of course). It's an
example of a hand-written recursive descent compiler with precedence
parsing for expressions.

I believe that makes it a parser for LL(1) language with LR(1)
expressions.

HTH,

Jeremy

Barry Kelly

unread,

Nov 16, 2008, 8:30:24 AM11/16/08

to

tuxisthe...@gmail.com wrote:

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote?

In my experience, most commercial compilers use hand-written scanners
and parsers. Certainly, the Embarcadero (CodeGear, ex-Borland) Delphi
and C++ compilers use hand-written scanners and parsers, and the
publicly visible C# compiler in MS shared source CLI (Rotor) is also
hand-written.

The CodeGear IDE support for editing, outline views, refactoring etc. on
the other hand, does use compiler compilers - specifically, an internal
tool that creates recursive descent LL recognizers.

> If so, I would like to know what language you
> used and what type of grammar you parsed.

The compiler I work on (Delphi, a Pascal variant) has its scanner and
parser written in C.

The main reasons for a hand-written scanner are speed and flexibility.
The scanner needs to take buffered data from multiple sources, whether
it's data from files on disk (which may need to be transcoded into
UTF-8) or data in memory from the IDE (i.e. direct from the editor's
buffers). Furthermore, the grammar has special cases where things that
would otherwise be keywords need to get parsed as identifiers, and
preprocessor expressions may recurse back into the parser to evaluate
constant expressions.

For the parser: good error handling may use a fair amount of context to
decide what the programmer's intention was. Good editor support (code
insight / intellisense) works with cooperation from the compiler, where
available symbols at the editor cursor's location in the file gets
handed back to the editor. A long history of adding new syntax
constructs while maintaining backward compatibility means the grammar is
no longer simple or clean in any real way. Furthermore, the parser
itself doesn't use a single strategy, it uses a mix of recursive descent
for declarations and operator precedence for expressions.

-- Barry

--
http://barrkel.blogspot.com/

Egdares Futch

unread,

Nov 16, 2008, 10:45:26 AM11/16/08

to

Hi, I hand-wrote a lexer and parser for Aho's Pascal-like language on
the appendix of the Dragon Book. This was written in Turbo Pascal
with Objects, and had scoping, stack machine IL. I could share the
source if you'd like (if I find it...).

Regards,

Egdares Futch H.

On Sat, Nov 15, 2008 at 11:49 AM, tuxisthe...@gmail.com
<tuxisthe...@gmail.com> wrote:
> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they

> personally hand-wrote? ...

j...@cix.compulink.co.uk

unread,

Nov 16, 2008, 11:57:40 AM11/16/08

to

tuxisthe...@gmail.com () wrote:

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote? If so, I would like to know what language you
> used and what type of grammar you parsed. If you feel divulgent,
> please tell me a little bit about you're implementation or what you
> did for intermediate representation.

It wasn't a general-purpose parser, but I have hand-written Perl code to
take C headers for a C API apart and generate an (unsafe) C# binding for
the API. However, the task was somewhat different from normal parsing
for compilation:

* I could assume that the headers were valid C code, although I checked
this by running them through the C compiler before starting parsing.

* I could not preprocess my C before parsing, because the names of
large numbers of macros defining constants needed to survive into
the C#. This was the point that set me to hand-writing code.

* I could take advantage of the conventions of the API I was dealing
with, since this parser could be considered as an extra convention-
enforcer. The conventions are quite constricting, which was helpful.

* Perl regexps are pretty useful.

* C# does not require declaration before use, which made outputting
the C# code much simpler.

* I'm not even an amateur when it comes to writing compilers, so there
were probably better ways to do the job. However, we could not find
anything ready-made, and it works pretty well.

The intermediate representation, such as it was, is a bunch of Perl
associative arrays. These were relatively simple to dump out and examine
for problems.

John

Ira Baxter

unread,

Nov 16, 2008, 4:54:15 PM11/16/08

to

<tuxisthe...@gmail.com> wrote in message

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote? If so, I would like to know what language you
> used and what type of grammar you parsed. If you feel divulgent,
> please tell me a little bit about you're implementation or what you
> did for intermediate representation. I'm curious and just a bit nosy
> and would like to know how you're experience went doing things this
> way. Thanks.

Building lexers by hand is generally pretty easy. The easiest way is
to draw a finite state machine covering the set of lexemes you want to
handle, and then implement that directly. That's all that FLEX does
for you anyway :-} For older languages (PASCAL, MSBasic, several
metalangauges) this is pretty straightforward. New, uglier langauges
like PHP make this harder only because they contain messy things like
HERE strings (which are strings whose quotes are identifiers, ick).
It is useful to have a character buffering scheme that allows you to
put an arbitrary number of characters back in the input stream, but in
practice putting back just one is generally sufficient.

One trick I regularly use(d) is a character-classification lookup
table. This consists of a table of indicator bits indexed by the
ASCII character code to indicate the type of characters. Useful
character classes are "illegal" (e.g, most the ascii control
characters), "whitespace", "identifier start", "identifier continue",
"decimal digit", "hexadecimal digit", etc. At each point where the
lexer has acquired a character, it may inspect it directly (check for
the "E" in exponent, if you are collecting the exponent), or look it
up to avoid a set of tests (e.g., "is it a digit").

It is useful for the start state, at least, to do an indexed branch on the
1st character, and it seems useful to hand code the state that notices
the first whitespace to collect as many as it can before rebranching
on the next one. Remarkably, I'm still working on a compiler
that has such a hand code lexer, and it regularly compiles 500K
SLOC in a few minutes.

Similarly, building parsers for many langauges is pretty easy. Write
out a non-left-recursive grammar with rules of the form of
A = B C D;
For each nonterminal A, code a subroutine that returns
a boolean flag:
boolean subroutine A()
{ if (!B()) return false;
if (!C) syntax_error();
if (!D) syntax_error();
}

You need similar boolean-returning subroutines for terminals produced
by the lexer. This doesn't work directly for langauges that you can't
define LL rules for, but its amazing what you can hack in to step
around such things. Of course, this is just parsing; you have to
interleave actions to do semantic checking, code generation or tree
generation as you see fit, and that's really where the bulk of
compiler work is.

You can find writeups for this kind parsing in most compiler texts.
But the background for this comes from a paper that I think everybody
in the compiler business should have memorized :-} This is Val
Schorre's 1964 (not a typo) paper on Meta II, a syntax directed
metacompiler.

In ten pages, he shows you how to build compilers using metacompiler
methods, by defining a meta compiler machine, showing you how to build
it, and defines two languages (including the metacompiler language)
based on it. If you haven't seen this paper, I *strongly* encourage
you to get and read it. This for me is computer science at its best.

http://portal.acm.org/citation.cfm?id=808896&dl=ACM&coll=

Having said all this, my favorite method for building parsers these
days is to use a really powerful tool for converting sets of regular
expressions for token definitions into the FSA to detect such tokens,
and to use a GLR parser generator to convert a grammar directly into
an engine to parse a context-free language including the automatic
building of syntax trees. There's just so many other compiler-things
to do in life, that hand coding lexers/parsers just isn't worth it
anymore.
[You still have to do ugly things with HERE docs as they aren't context-free
:-{]
If you want to learn more about that, check out our web site.

--
Ira Baxter, CTO
www.semanticdesigns.com

..

Raja Mukherji

unread,

Nov 16, 2008, 5:21:07 PM11/16/08

to

On Nov 15, 5:49 pm, "tuxisthebirdfo...@gmail.com"

<tuxisthebirdfo...@gmail.com> wrote:
> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they

> personally hand-wrote? ...

I've hand written the scanner/parser for my language Wrapl for a
while. I originally wrote them in Icon, then Modula-2, Modula-3, C and
currently in C++. However they all had the same basic structure, which
you find in many scanner/parsers:

The scanner has 3 main functions:

a method/function "next" which advances the scanner to the next token

a method/function "parse" which checks that the current token is a
particular token or in a set of tokens. If the current token is
"none", call next(). If the current token matches, the current token
is set to "none". Returns true/false if the current token matched or
not. The current token values (e.g. line/column number, integer/string
value) is available from the scanner.

a method/function "accept" which behaves similarly to "parse" but
throws an error if the match fails.

The parser consists of many "parse_???" and "accept_???" functions,
built up from the other parse/accept functions. The trickiest part is
the infix operation parser, since it has to deal with different levels
of precedence. At first I had a different function for each level,
like "parse_expr_NN" where NN was the precedence level. But that
became tedious to modify, so I now use a function "parse_expr(level)"
where level is a parameter and the function contains a switch
statement that switches on the level, each case falling through to
higher precedence expressions.

If you're brave you can find the source for my scanner/parser at
http://wrapl.sf.net/download.html#source; the scanner/parser is in the
dev/src/Lib/Wrapl/Loader/scanner.c and dev/src/Lib/Wrapl/Loader/
parser.c files, but they lack commenting.

Raja

Bill Cox

unread,

Nov 16, 2008, 11:35:33 PM11/16/08

to

Raja Mukherji wrote:
> On Nov 15, 5:49 pm, "tuxisthebirdfo...@gmail.com"
> <tuxisthebirdfo...@gmail.com> wrote:
>> I know most people anymore use lex/yacc or some derivative of these
>> tools to create scanner/parser modules for their compiler projects. I
>> was wondering if anyone has developed a scanner or parser that they
>> personally hand-wrote? ...

The current GNU C++ compiler has had a hand-written lexer and parser for
since version 3.x. The compiler is now at 4.3.2, Source code is
available from gcc.gnu.org through many mirrors.

The lexer is interesting because it has the C preprocessor as *its*
front end.

Marco van de Voort

unread,

Nov 17, 2008, 3:47:38 AM11/17/08

to

On 2008-11-15, tuxisthe...@gmail.com <tuxisthe...@gmail.com> wrote:
> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote?

Afaik gcc (per 3.4?) nowadays also employs a handwritten RD parser. Afaik
for the same reasons that Barry names.

The parser/lexer for the largely Delphi compatible Open Source Free Pascal
is also handwritten, though not as integrated with (lazarus) intellisense as
Delphi is rumoured to be.

Since the Borland originating Pascal dialects have a limited preprocessor
integrated, effectively it is lexer-preprocessor-parser.

The parser is written in the dialect it parses.

David Z Maze

unread,

Nov 17, 2008, 10:26:23 AM11/17/08

to

"tuxisthe...@gmail.com" <tuxisthe...@gmail.com> writes:

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote? If so, I would like to know what language you
> used and what type of grammar you parsed.

In my copious free time, I hand-wrote (sort of) a parser for XML 1.0
(non-validating, ignores character-set issues, rejects DTDs, does do
namespaces) a month or two ago. I wrote this in Haskell using the
Parsec support library, and generated a straightforward tree
representation of the XML. I say "sort of hand-wrote" in that Parsec
isn't really a parser generator in the same sense that yacc is; also, a
lot of its functionality could be better expressed in modern Haskell
extensions like arrows and the Control.Applicative module that post-date
Parsec.

At any rate, this is an LL(0) implementation, with appropriate context
checking for duplicate attributes and tag matching. Since Haskell
supports functions as first-class objects, I can turn a grammar fragment
like

document ::= xml-declaration?
(whitespace | processing-instruction | comment)*
element
(whitespace | processing-instruction | comment)*

into (syntax approximate, ignoring many issues)

document :: Parser XMLDocument
document = do optional xmldeclaration
pre <- many (whitespace <|> pi <|> comment)
elt <- element
post <- many (whitespace <|> pi <|> comment)
return $ XMLDocument (pre ++ [elt] ++ post)

xmldeclaration :: Parser ()
xmldeclaration = do string "<?xml"
-- stuff
string "?>"
-- etc.

where all of the above is *code*, not a description that needs to be
preprocessed. The only tricky thing is refactoring the grammar into
LL(0) form since otherwise Parsec will pick up the '<' character for the
obvious construction of processing instructions (<?name ... ?>) and then
complain when it doesn't see the '?' for comments () or
elements (<name>).

(Also there is some amount of wrapping your head around Haskell, of
course; a lot of deep magic is hidden in that "do".)

HTH,

--dzm

Gene

unread,

Nov 17, 2008, 8:28:31 PM11/17/08

to

On Nov 15, 12:49 pm, "tuxisthebirdfo...@gmail.com"

I've done both. My sense is that if you need a quick tool where the
input language is likely to change and error recovery is not a big
deal, the lex/yacc approach works well.

On the other hand if you are working on a stable syntax or where
detailed error reporting is needed, you're better off rolling your
own.

Your claim that lex/yacc (or other similar tools) are used by most
people is probably too general. Last time I looked, the C and c++
front ends of gcc used a hand-written lexer with bison parser, and the
preprocessor (which includes a lexer) was entirely hand-written. I've
read that the gcc team has in the past regretted bison because error
messages have been hard to get right. The Ada front end of gcc is
entirely hand-written, recursive descent. Perl uses yacc (at least
through v5) but with a hand-written tokenizer. Python seems to
generate its own parser from a grammar during startup. Yada yada. At
least for production compilers, it's more of a mixed bag than you
infer.

Good luck!

Aharon Robbins

unread,

Nov 18, 2008, 12:20:21 AM11/18/08

to

"tuxisthe...@gmail.com" <tuxisthe...@gmail.com> writes:

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote? If so, I would like to know what language you
> used and what type of grammar you parsed.

The scanner in GNU Awk (gawk) is hand-written, for a yacc parser. Gawk
builds a parse tree and then recursively evaluates it for each input
record. The lexer builds identifiers and then binary searches a table
indicating what kind of identifier it is (keyword, builtin function),
and if the identifier isn't found in the table, then it returns IDENTIFIER
to the parser.

Unix Awk ("the one true awk") is available from Brian Kernighan's home
pages at Bell Labs and Princeton. Some years ago he rewrote the lexer
from lex into hand-written C. A few bugs were introduced along the way
although there was definitely a portability gain, since the original
awk was pretty intimate with the lex internals. His CHANGES file makes
interesting reading.

Hope this is of interest,
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Stephen Horne

unread,

Nov 18, 2008, 12:36:07 AM11/18/08

to

On Sat, 15 Nov 2008 09:49:38 -0800 (PST),
"tuxisthe...@gmail.com" <tuxisthe...@gmail.com> wrote:

>I know most people anymore use lex/yacc or some derivative of these
>tools to create scanner/parser modules for their compiler projects. I
>was wondering if anyone has developed a scanner or parser that they
>personally hand-wrote? If so, I would like to know what language you
>used and what type of grammar you parsed.

I have written code for deriving FSMs from regular grammars, and I
have written code to derive an LR(1) PDA model from a grammar. The
former gets a little use. The latter is really just a
prototype/learning exercise thing.

Neither has been developed into a full lex/yacc-like tool. The FSMs
from regular grammars stuff is used in an incomplete DSL, but one
which doesn't do lexical analysis - at this point all I've really got
from it are a bunch of GraphViz dot files.

These are all part of a C++ digraph handling library.

For the FSMs from regular grammars, I use expressions to build complex
FSMs from simpler FSMs, exploiting methods for eliminating epsilon
transitions and minimizing nondeterminism, and for state model
minimisation. The intermediate representation is an AST, at least for
the current application. I use a home-grown tool similar to treecc to
generate AST node structs and multiple-dispatch operation functions.
However, the AST execution is external to the FSM handling class. The
FSM class provides methods for combining and manipulating FSMs, but
that's driven directly by the AST traversal code.

For the LR(1) stuff, I never actually got around to adding any kind of
front end, or any back end either. Just the internal representations
of the grammar and PDA model, and the code to generate the latter from
the former. The grammar uses a digraph - actually a tree - to
effectively define the set of BNF rules for each nonterminal. Both the
grammar and the parser models are based on the same digraph class.

The code for both is a bit of a mess because of the way my digraph
library evolved over time, and because the inheritence layering led to
some data-unhiding hassles as new functionality was added, especially
for the regular grammar stuff.

From this experience, I'd say that building an inheritance hierarchy
for digraphs is something to avoid, at least on agile programming
grounds. In fact writing your own digraph code may be a mistake in
itself. Beyond that, there's little to say.

I once tried to write an LR(1) parser generator in Python, but I never
could get it to work properly. Not Pythons fault - more a case of
premature optimisation.

Charles E. Bortle, Jr.

unread,

Nov 18, 2008, 12:24:36 PM11/18/08

to

> [I hand-wrote a scanner for Fortran 77 about 30 years ago. It worked but
> it was pretty gross because Fortran's lexical structure is so
> strange. -John]

Note: I am not a "compiler professional" nor expert. (At present, I'm
a Sound & Computer Tech at my church, and the Web Servant (read
Webmaster) of the church web site, and our IT and Network
Administrator...as well as a few other hats ;-)

Yes, back in the 80's. Similarly to John, I hand-wrote a scanner for
Fortran IV. It was written in Pascal on a Z80-based S100 computer
with 64k (yes, k) running the CP/M operating system on 8" floppies :-0

The core of the scanner was a Pascal Case statement (al la the bastard
algol modle in Gries). The Pascal compiler I was using did not
support files of Char but only block reads. I had to write the char
by char read routines, reading a sector sized block at a time from the
floppy.

The scanner was written as a seperate Pass...its output was a token
stream file and a symbol table in memory. The hand-written recersive
descent parser read the token stream file and produced an RPN
intermediate file, which was passed to the Code generator, which
produced assembly code.

I had eventually ported the scanner to an OSBORNE I (yes, the Adaquit
(sp?) machine with the tiny screen ;-), and then, eventually to an IBM
PC and Turbo Pascal (eventually the OOP version 7). The code
generator produced assembly code for 80x80 under PC/MS DOS and used
the 8087 math coprocessor instructions (this was in the days before
the math coprocessors were standard features...not all machines had
them)

My compiler even handled Complex types. I had to hand-write the RTP
in assembly code. I never completely finished the compiler, but it
did produce non-trivial working programs. I never got time to
impliment Subroutines, though, but did have Formatted output.

All in all, I think the hand-coded scanner worked well.

Since then, I wrote an LL(1) parse table generator and the parser.
The parser now calles the scanner for each token, instead of making
the scanner a seperate pass. I started to use it to write a Pascal
compiler, but have not had time to finish it. I also started using it
to write a Scanner generator, with the intent of bootstrapping its own
scanner into a table driven one, but alas, time fails me.

Nowadays, I would probably write the scanner, parser, etc as seperate
*.dll's under Windows, so that different source code files could be
compiled at the same time with only one copy of the compiler routines
resident.

As to implementation language, I would still pick a Pascal-like
language. (I know...That was then, This is Now ;-) (I exclude ADA,
which just gutted and emasculated Pascal) and personally I don't care
for C (or even C++).

Any way, that's my story, and I'm sticking to it ;-)

Happy Holidays,
Charles E. Bortle, Jr.

Jeremy J Starcher

unread,

Nov 18, 2008, 10:20:59 PM11/18/08

to

On Sat, 15 Nov 2008 09:49:38 -0800, tuxisthe...@gmail.com wrote:

> I'm curious and just a bit nosy and would
> like to know how you're experience went doing things this way. Thanks.
> [I hand-wrote a scanner for Fortran 77 about 30 years ago. It worked
> but it was pretty gross because Fortran's lexical structure is so
> strange. -John]

I wrote a byte-code compiler and VM in Javascript.

It handles a Visual-Basic type syntax with strong type-checking.

< http://mopedepot.com/jjs/mote/ >

Armel

unread,

Nov 19, 2008, 4:23:44 AM11/19/08

to

<tuxisthe...@gmail.com> a icrit dans le message de news:

>I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote?

At my previous job, I wrote a javascript interpreter, the lexer is based on
tables (generated by a soft of mine called fdc), and the parser itself is a
state based hand coded LL recursive descent parser. It was pretty easy to
code a correct javascript parser this way and in fact certainely much easier
than with a generator.
as I like that, I rewrote a new javascript interpreter for my new job
(compatible with ECMAScript 1 3rd edition) , with keywords matched against
perfect hasing function rather than binary searches, similar parser.

Armel

Robert A Duff

unread,

Nov 23, 2008, 5:30:21 PM11/23/08

to

Gene <gene.r...@gmail.com> writes:

>... The Ada front end of gcc is
> entirely hand-written, recursive descent.

Yes, and the GNAT Ada front end has just about the best syntactic
error recovery and syntax error messages I have ever seen in any
compiler.

>... Perl uses yacc (at least through v5) but with a hand-written

> tokenizer. Python seems to generate its own parser from a grammar
> during startup. Yada yada. At least for production compilers, it's
> more of a mixed bag than you infer.

Yes, it's mixed bag. There are advantages and disadvantages to
either approach.

- Bob

Robert A Duff

unread,

Nov 23, 2008, 5:32:57 PM11/23/08

to

"Charles E. Bortle, Jr." <charle...@mpowercom.net> writes:

> As to implementation language, I would still pick a Pascal-like
> language. (I know...That was then, This is Now ;-) (I exclude ADA,
> which just gutted and emasculated Pascal) and personally I don't care
> for C (or even C++).

The above comment about Ada seems strange to me. Ada is far more
powerful than Pascal (some would say "far more bloated"). But certainly
not "gutted and emasculated" compared to Pascal.

- Bob

Charles E. Bortle, Jr.

unread,

Nov 24, 2008, 5:25:56 PM11/24/08

to

Howdy Bob,

Well...it really is a matter of personal preference...
didn't mean to ruffle anyones feathers ;-)

In relation to compiler writting one problem I see with Ada is that
they (Ichbiah et al ?) took away Sets when they designed Ada. When I
ported my LL(1) parse table generator from Turbo Pascal (OOP ver 7) to
an old GNAT Ada version, I had to write my own Set routines package.
Of course, I did it, so Set is now included, for me anyway, but is not
a built-in feature. I also had to write my own Turbo Pascal style
string routines, but then Turbo style strings are not really standard
Pascal, so that might be nit picking on my part.

[Of course, I could write my own "Rational" pre processor for Ada to
have my cake and eat it too ;-) ]

Charles

Charles E. Bortle, Jr.

unread,

Nov 24, 2008, 5:51:03 PM11/24/08

to

Another example for the OP:

Back in the 80's I was a computer tech with MAI Basic Four, Inc.
(originally Basic Four Corp.) and in some spare time I wrote a PILOT
interpreter in Basic Four Business Basic.

The scanner, parser, and semantics were rather blurred together. (I
have old listings somewhere, but I don' really remember any of the
code details) The gist of the scanner used the string indexing
features of Busisness Basic. (Read a line, then scan along the string
with the indexes)

As John mentioned about the odd syntax of FORTRAN, PILOT syntax is
different. PILOT is supposedly a language that teachers who are not
programmers can use to write instructional and quiz type "programs".
The commands are one or two char long and end with a colon. My
scanner/parser would look for the colon and then backtrack to build
the command symbol.

At the time it was a fun exercise, but I soon got bored with it, as
PILOT is not a very useful language aside from presenting material to
students and testing their knowledge with limited logic.

Back in the 90's I wrote a completely different PILOT interpreter in
Visual Basic 3.0 for Windows, using the same find colon/backtrack
approach. I added my own PILOT extentions that would allow test
results to be logged for the instructor with the intent of producing
an automated testing system, but I never got it finished. (Our church
was just then setting up a computer lab for our school, and I had
thought to provide it to them).

Charles E. Bortle, Jr.

Marcel van der Veer

unread,

Nov 25, 2008, 5:17:20 PM11/25/08

to

The scanner/parser of Algol 68 Genie was hand-written by me personally.

See http://www.xs4all.nl/~jmvdveer/algol.html

It is written in C. Difficulties are Algol 68's context-dependency,
bracket-overloading and the possibility to apply symbols before they are
declared. The parser uses a multi-pass approach already outlined by B.J.
Mailloux.

Marcel van der Veer

xianwei

unread,

Nov 27, 2008, 8:46:28 PM11/27/08

to

On Nov 16, 1:49 am, "tuxisthebirdfo...@gmail.com"

<tuxisthebirdfo...@gmail.com> wrote:
> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote?

I write a regular expression engine (base on DFA) to scan the source
code.
the interface like this:
AddMoreType( scanner, L"ID", L"[_a-zA-Z][a-zA-Z0-9_]*" );
AddMoreType( scanner, L"ID.WHILE.discard",
L"while" );
AddMoreType( scanner, L"ID.IF.discard",
L"if" );
AddMoreType( scanner, L"ID.THEN.discard",
L"then" );
AddMoreType( scanner, L"ID.ELSE.discard",
L"else" );
AddMoreType( scanner, L"ID.ENDIF.discard",
L"endif" );
AddMoreType( scanner, L"ID.DO.discard",
L"do" );
AddMoreType( scanner, L"ID.BREAK.discard",
L"break" );
AddMoreType( scanner, L"ID.FOR.discard",
L"for" );
AddMoreType( scanner, L"ID.TO.discard",
L"to" );
AddMoreType( scanner, L"ID.LET.discard",
L"let" );
AddMoreType( scanner, L"ID.IN.discard",
L"in" );
AddMoreType( scanner, L"ID.END.discard",
L"end" );
AddMoreType( scanner, L"ID.TYPE.discard",
L"type" );
AddMoreType( scanner, L"ID.ARRAY.discard",
L"array" );
AddMoreType( scanner, L"ID.OF.discard",
L"of" );
AddMoreType( scanner, L"ID.VAR.discard",
L"var" );
AddMoreType( scanner, L"ID.FUNCTION.discard",
L"function" );
AddMoreType( scanner, L"ID.NIL.discard",
L"nil" );
AddMoreType( scanner, L"STRING", L"\"(\\\\.|[^\"])*
\"" );
AddMoreType( scanner, L"NUM", L"[1-9][0-9]*|
0" );
....

but a little bug is already exists, the next version, i try to fix it.

I spent one month write a LR(1) Grammar Analysis, used to parsing
Tiger language the language defined in Modern Compiler Implementation
in C. about 85 rules. only generate trans-DFA, not support EBNF.

Chris F Clark

unread,

Nov 28, 2008, 1:59:11 PM11/28/08

to

Ah, Pilot. That was my first "compiler" writing experience. The long
defunct PCC newpaper distributed to various schools (including my HS)
carried an interpreter for a dialect called Pilot-72, which I
re-implemented in the Univac variant of BASIC available to me and then
started extending.

The key goal I had in mind was the ability to keep the program and
data in external files and thus not be limited to the computer memory,
as I had written a Monopoly program that exceeded the capacity of the
BASIC compiler. However, rewriting in pilot and using the interpreter
I was not limited on program size.

There was a lot of good in pilot, especially the "match" command,
which was kind of a poor imitation of Snobol--and I did not know
Snobol at that time, just BASIC and a smattering of FORTRAN IV.

Fun reminiscing....
-Chris

******************************************************************************
Chris Clark Internet: christoph...@compiler-resources.com
Compiler Resources, Inc. or: com...@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)

Dustin Voss

unread,

Nov 29, 2008, 5:35:56 PM11/29/08

to

"tuxisthe...@gmail.com" <tuxisthe...@gmail.com> wrote:

> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they

> personally hand-wrote? If so, I would like to know what language you

> used and what type of grammar you parsed. If you feel divulgent,
> please tell me a little bit about you're implementation or what you

> did for intermediate representation. I'm curious and just a bit nosy

> and would like to know how you're experience went doing things this
> way. Thanks.
> [I hand-wrote a scanner for Fortran 77 about 30 years ago. It worked but
> it was pretty gross because Fortran's lexical structure is so strange. -John]

I wrote a PEG parser. I'm using it in a documentation tool, to parse
parts of the Dylan language and (separately) to parse markup.

The Dylan language grammar is hand-written but based on the BNF grammar.
So far, my intermediate representation is the AST, one per file. I then
examine the ASTs and convert them to the intermediate representation
that I'm using for the documentation (which is basically DITA).

The parser itself is written as a Dylan library, simply because that is
my preferred language. It works with grammars expressed as Dylan code
rather than in a separate file.

A top-down grammar can be converted into a recursive-descent parser
fairly directly. If you are doing RD parsing, I don't see much need for
tools like lex or yacc or bison except if you want to check the grammar
for errors.

Granted, it would have been nice to have had some sort of error- and
consistency-checker for the markup grammar. I've had trouble pinning
down some parsing mistakes.

At the time I started this project, I really didn't know much about
parsing at all. I chose to use a recursive descent grammar because
that's what made sense to me; it would have taken longer than I wanted
to understand bottom-up parsing.