The method I am currently using is a kind of BNF form for keywords
(all statements begin with a keyword, and each statement contains only
one keyword [except for IF..THEN statements], so the parsing should be
quite easy) and a rule reducer for expressions (# + # becomes #, etc.)
- I hope I'm making myself clear here; this is a new project for me.
The problem that I have is one of speed. When the user presses
return, the text is parsed and if it is correct, it's added to the
program listing. If not, the point at which the mistake was made is
returned and highlighted. This doesn't have to be fast. However, I
would like to create a simple syntax highlighter which displays a
syntax template (such as FOR <variable> = <numexpr> TO <numexpr>) and
highlights errors in red as you type. Unfortunately, it slows down to
the point of unusability if you type over about 150 chars in the input
string.
The parser performs the following steps to parse:
1) tokenises the input string to Keywords (and function names), and
types (numerics, strings, symbols) into a stack.
2) runs the resulting stack through the rule reducer to reduce all the
expressions to their base types (numexpr, stringexpr etc)
3) runs the resulting (smaller) stack through the BNF parser using the
rule for that keyword. It is this part that performs error checking.
As I said, this is very slow. I code in Delphi 5 (Object Pascal). Can
anyone point out:
1) How to parse a simple BASIC language (like Sinclair Basic) quickly,
2) Any better method than the one I'm using here. I used BNF for everything,
but that was even slower than the method I am using.
Thanks,
Paul.
> The problem that I have is one of speed.
> I would like to create a simple syntax highlighter which displays a
> syntax template (such as FOR <variable> = <Numexpr> TO <numexpr> ) and
> highlights errors in red as you type. Unfortunately, it slows down to
> the point of unusability if you type over about 150 chars in the input
> string.
> 1) How to parse a simple BASIC language (like Sinclair Basic) quickly,
> 2) Any better method than the one I'm using here. I used BNF for
> everything, but that was even slower than the method I am using.
I'm not familiar with Sinclair Basic, but I have written many parsers
in Object Pascal (Delphi). Might I suggest that you use a simple
recursive descent parser (and change your grammar to LL(1), with
context if necessary), and begin highlighting in red at the beginning
of the first invalid token found by the parser?
For speed, the method I have found to work efficiently with Object
Pascal for lexical analysis looks like this. It uses a little hack,
similar to Classes.pas's TParser, which defines TToken as a character;
this makes it easy to use for ':', ';' etc.:
---8<---
// class outline edited for clarity
type
  TToken = #0..#255;

  // use a hash lookup at the end of NextToken to check if
  // a tokIdentifier is actually a tokKeyword
  TKeyword = (keyNone, keyBlah, keyFor, keyEtc);

  TLexer = class
  private
    FSource: string;
    FCurrentPosition: Integer;
    FToken: TToken;
    FTokenStart: Integer;
    FTokenLength: Integer;
    FTokenAsString: string;
    FTokenAsInteger: Integer;
    FTokenAsFloat: TFloat;
  public
    constructor Create(ASource: string);
    procedure NextToken;
    property Token: TToken read FToken;
    property TokenStart: Integer read FTokenStart;
    property TokenLength: Integer read FTokenLength;
    property TokenAsString: string read FTokenAsString;
    property TokenAsInteger: Integer read FTokenAsInteger;
    property TokenAsFloat: TFloat read FTokenAsFloat;
    property TokenAsWhatever: TWhatever;   // placeholder
    property TokenAsKeyword: TKeyword;     // see hash lookup note above
  end;

const
  tokEOF = TToken(0);
  tokInteger = TToken(1);
  tokFloat = TToken(2);
  tokWhatever = TToken(3);
  tokKeyword = TToken(4);
  tokIdentifier = TToken(5);
--->8---
The job of the NextToken procedure is, beginning at the CurrentPosition:
1) skip whitespace
2) determine the token type from its first character
3) read in that token type fully
4) leave the CurrentPosition one after the end of the token just read
5) set TokenAsString, TokenAsInteger, Token, etc. as appropriate.
I'm sure you're familiar with writing native code regexp matchers, but I
have seen many people using Object Pascal do it inefficiently, so I'm going
to be fairly verbose, edited a little for clarity:
---8<---
procedure TLexer.NextToken;
const
  WhiteSpace = [#1..#32];
  Alpha = ['a'..'z', 'A'..'Z', '_'];
  Numeric = ['0'..'9'];
  AlphaNumeric = Alpha + Numeric;
var
  currPos: Integer; // allows register variable
  // another optimization, if Delphi doesn't do automatic
  // loop variable induction, is to use a PChar for currPos
  // instead; it'll mean slightly more pointer arithmetic
  // to figure out TokenStart, and you should use the
  // SetString function instead of the Copy function.
begin
  currPos := FCurrentPosition;
  // skip whitespace
  while FSource[currPos] in WhiteSpace do
    Inc(currPos);
  // determine the token type from the first character
  // (note: set constants like Alpha can't be used as case
  // labels, so the ranges are spelled out below)
  case FSource[currPos] of
    #0:
      begin
        // a #0 (null character) means end of source
        FToken := tokEOF;
        // we don't touch the current position,
        // so repeated NextToken keeps setting
        // the token to tokEOF
        FTokenStart := currPos;
        FTokenLength := 1;
      end;
    'a'..'z', 'A'..'Z', '_':  // Alpha
      begin
        // an identifier
        FTokenStart := currPos;
        while FSource[currPos] in AlphaNumeric do
          Inc(currPos);
        // the current position is now at the first
        // character that ISN'T part of the identifier
        // extract token (from start, length currPos - start)
        FTokenAsString := Copy(FSource, FTokenStart,
          currPos - FTokenStart);
        // set token type
        FToken := tokIdentifier;
        FTokenLength := currPos - FTokenStart;
      end;
    '0'..'9':  // Numeric
      begin
        // a number
        FTokenStart := currPos;
        while FSource[currPos] in Numeric do
          Inc(currPos);
        // extract token (from start, length currPos - start)
        FTokenAsString := Copy(FSource, FTokenStart,
          currPos - FTokenStart);
        FTokenAsInteger := StrToInt(FTokenAsString);
        FTokenAsFloat := FTokenAsInteger;
        // set token type
        FToken := tokInteger;
        FTokenLength := currPos - FTokenStart;
      end;
  else
    // some operator like + or whatever
    // an easy cheat is to store the token as
    // a character type and use characters below
    // 32 (normally whitespace/control characters)
    // to represent token types like integer,
    // float, identifier, etc.
    FToken := FSource[currPos];
    FTokenStart := currPos;
    FTokenLength := 1;
    // VERY IMPORTANT to leave current position
    // AT ONE AFTER current token
    Inc(currPos);
  end;
  FCurrentPosition := currPos;
end;
--->8---
Make sure you initialize FCurrentPosition to 1 (not the default 0) in the
constructor, and keep a copy of ASource in a field such as FSource (or
FBuffer if you use PChars instead of indices), otherwise the refcount of
the string might go to zero when the constructor returns.
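Something like this (a sketch only):
---8<---
constructor TLexer.Create(ASource: string);
begin
  inherited Create;
  FSource := ASource;     // keeps the string alive (refcount bumped)
  FCurrentPosition := 1;  // Delphi strings are 1-based
end;
--->8---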
Recursive descent parsers are trivial to write; essentially, you start off
writing a class that takes a TLexer parameter in the constructor; the
constructor should probably call NextToken on the lexer so the first token
is ready before any rule runs. The class should have a public method that
corresponds to a starting grammar rule, like ReadStatement, or whatever.
Each grammar rule should eat anything it expects in the token stream,
collect information about identifiers, etc, and at the end do a case
statement which decides which rule to continue with. For rules that are
left-associative, like subtraction or division (they need to be bracketed
from the left), use a while or repeat loop instead of recursing into the
same function (there's a sketch of that after the FOR example below).
An example (I'm just checking for consistency):
---8<---
procedure TParser.ReadForLoop;
begin
  EatKeyword(keyFor);      // should be a helper method
  EatToken(tokIdentifier); // just checking for consistency
                           // easy enough to add more support if necessary
  EatToken('=');           // this is possible with the 'hack'
  ReadExpression;          // change to return type of
                           // expression if you like
  EatKeyword(keyTo);
  ReadExpression;
  if Lexer.TokenAsKeyword = keyStep then
  begin
    Lexer.NextToken;
    ReadExpression;
  end;
end;
--->8---
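For the left-associative rules I mentioned (subtraction, division and so
on), the loop looks something like this; ReadTerm just stands in for
whatever parses a single operand:
---8<---
procedure TParser.ReadAdditiveExpression;
begin
  ReadTerm;  // parse the first operand
  // keep absorbing '+'/'-' at the same level instead of recursing,
  // so a - b - c groups as (a - b) - c
  while (Lexer.Token = '+') or (Lexer.Token = '-') do
  begin
    Lexer.NextToken;
    ReadTerm;
  end;
end;
--->8---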
Of course, the Eat* methods should raise exceptions if the token in the
lexer doesn't correspond to the expected token.
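For example (EParseError is just your own exception class; the messages
are up to you):
---8<---
uses SysUtils;

type
  EParseError = class(Exception);

procedure TParser.EatToken(AToken: TToken);
begin
  if Lexer.Token <> AToken then
    raise EParseError.CreateFmt('Unexpected token at %d', [Lexer.TokenStart]);
  Lexer.NextToken;
end;

procedure TParser.EatKeyword(AKeyword: TKeyword);
begin
  if Lexer.TokenAsKeyword <> AKeyword then
    raise EParseError.CreateFmt('Keyword expected at %d', [Lexer.TokenStart]);
  Lexer.NextToken;
end;
--->8---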
If you add methods to the lexer and parser to reset, they should easily be
fast enough to handle thousands of characters without any noticeable delay;
you should catch any exceptions thrown by the Eat* (or Expected* - similar
to Eat* but they don't call Lexer.NextToken) methods and begin your
highlighting with the information given in the lexer state. You shouldn't
need to parse the input again until the user presses return or else
modifies something before the error position.
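In other words, something like this at the top level (ReadStatement is
the start rule; HighlightFrom stands in for whatever your editor provides
for colouring from a given column onwards):
---8<---
procedure CheckLine(const Line: string);
var
  Lexer: TLexer;
  Parser: TParser;
begin
  Lexer := TLexer.Create(Line);
  Parser := TParser.Create(Lexer);
  try
    try
      Parser.ReadStatement;
    except
      on EParseError do
        HighlightFrom(Lexer.TokenStart);   // red from the offending token
    end;
  finally
    Parser.Free;
    Lexer.Free;
  end;
end;
--->8---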
This recursive descent parser technique was inspired by Jack Crenshaw.
One final note for speed; make sure you're using a hash table for lookup.
I've got a fairly fast hashtable on
http://codecentral.borland.com/codecentral/ccweb.exe/listing?id=15171.
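If you'd rather not pull in a separate unit, even a tiny open-addressing
table does the job; the size, hash function and names below are only
illustrative:
---8<---
uses SysUtils;

const
  HashSize = 64;   // power of two, comfortably above the keyword count

type
  THashEntry = record
    Name: string;       // '' = unused slot
    Keyword: TKeyword;
  end;

var
  HashTable: array[0..HashSize - 1] of THashEntry;

function HashOf(const S: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    Result := (Result * 31 + Ord(UpCase(S[i]))) and (HashSize - 1);
end;

// called once per keyword at startup, e.g. AddKeyword('FOR', keyFor);
procedure AddKeyword(const Name: string; Keyword: TKeyword);
var
  h: Integer;
begin
  h := HashOf(Name);
  while HashTable[h].Name <> '' do          // linear probing
    h := (h + 1) and (HashSize - 1);
  HashTable[h].Name := UpperCase(Name);
  HashTable[h].Keyword := Keyword;
end;

// called at the end of NextToken when the token is a tokIdentifier
function LookupKeyword(const Ident: string): TKeyword;
var
  h: Integer;
  Key: string;
begin
  Key := UpperCase(Ident);
  h := HashOf(Key);
  while HashTable[h].Name <> '' do
  begin
    if HashTable[h].Name = Key then
    begin
      Result := HashTable[h].Keyword;
      Exit;
    end;
    h := (h + 1) and (HashSize - 1);
  end;
  Result := keyNone;   // plain identifier, not a keyword
end;
--->8---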
Much of this goes without saying, but I want to make sure.
-- Barry
barry_...@hotmail.com
> The parser performs the following steps to parse:
>
> 1) tokenises the input string to Keywords (and function names), and
>types (numerics, strings, symbols) into a stack.
A stack might not be the most appropriate data type, since it seems
that you do not access the data Last In, First Out. Consider using
some other data structure, e.g. an array.
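For instance, something along these lines (the types and field names are
only an illustration):
---8<---
type
  TTokenKind = (tkKeyword, tkNumber, tkString, tkSymbol, tkIdentifier);
  TTokenRec = record
    Kind: TTokenKind;
    Text: string;
    Code: Integer;   // e.g. the keyword number
  end;

var
  Tokens: array of TTokenRec;   // dynamic array (Delphi 4 and later)
  Count: Integer;

procedure AddToken(const T: TTokenRec);
begin
  if Count = Length(Tokens) then
    SetLength(Tokens, Count * 2 + 16);   // grow geometrically
  Tokens[Count] := T;
  Inc(Count);
end;
// the parser then walks Tokens[0..Count-1] with a simple index,
// reading each token exactly once.
--->8---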
Of course, another question is how efficient your tokenizing
code is.
> 2) runs the resulting stack through the rule reducer to reduce
> all the expressions to their base types (numexpr, stringexpr etc)
You do this before parsing? This is a strange approach, which I do not
fully understand.
> 3) runs the resulting (smaller) stack through the BNF parser using the
>rule for that keyword. It is this part that performs error checking.
Is this a hand-coded parser? Perhaps you wrote a parser that
backtracks when confronted with something it cannot handle? Such a
parser can be very slow, if it has to do lots of backtracking.
Consider using a parser generator tool, or try to rewrite your parser
so that it doesn't do any backtracking.
>1) How to parse a simple BASIC language (like Sinclair Basic) quickly,
I think there is a Lex/Yacc for Object Pascal.
>2) Any better method than the one I'm using here. I used BNF for everything,
>but that was even slower than the method I am using.
I think your parser is not efficient. You can ditch the second step
once you have an efficient parser.
Another idea is to only do the syntax checking in the "idle" loop of
your application. In this way, when the user starts to type a line, it
will not be syntax checked until the computer has time to catch up
with the user.
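In Delphi that can be as simple as an Application.OnIdle handler; a
sketch, assuming a FLineDirty flag that the editor's OnChange sets and a
CheckCurrentLine routine of your own:
---8<---
// hooked up once, e.g. in FormCreate:
//   Application.OnIdle := AppIdle;

procedure TForm1.AppIdle(Sender: TObject; var Done: Boolean);
begin
  if FLineDirty then
  begin
    FLineDirty := False;
    CheckCurrentLine;   // parse and highlight the current edit line
  end;
  Done := True;         // don't keep calling us while nothing changes
end;
--->8---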
Stephan
--
ir. Stephan H.M.J. Houben
tel. +31-40-2474358 / +31-40-2743497
e-mail: step...@win.tue.nl
I'm also writing a Scanner and Parser in Delphi, and want to contribute
my experience. It's my first serious attempt at a parser, so I use a
recursive descent one to parse a small toy language (basically a mixture of
C and Pascal). But I chose not to use an LL(1) grammar because
I have to look ahead two tokens in some productions. I use a small
fixed-size token stack to buffer the token stream, so it's as fast as it
can get. Of course I could do an LL(1) grammar, but why make the grammar
harder to read when this way works and is fast enough?
>For speed, the method I have found to work efficiently with Object
>Pascal, for lexical analysis, looks like this; It uses a little hack,
>similar to Classes.pas's TParser, which defines TToken as a character;
>this makes it easy to use for ':', ';' etc:
Yes, it is faster than my approach, but mine is definitely cleaner.
I use something like the record below for my tokens, which greatly
helps in producing decent error reporting - something I miss with some
compilers out there.
type
  TToken = record
    Code: Word;
    Line: Integer;
    Col: Integer;
    Text: string;
  end;
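The fixed-size lookahead buffer is then basically just this (NextRawToken
stands in for whatever the scanner's "read one token" routine is; the
names are made up for the example):
---8<---
type
  TTokenBuffer = record
    Slot: array[0..1] of TToken;   // Slot[0] = current, Slot[1] = lookahead
  end;

procedure Advance(var Buf: TTokenBuffer);
begin
  Buf.Slot[0] := Buf.Slot[1];      // shift the lookahead down
  Buf.Slot[1] := NextRawToken;     // refill from the scanner
end;
// a production that needs to see two tokens ahead just inspects
// Buf.Slot[0] and Buf.Slot[1] - no push-back required.
--->8---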
>One final note for speed; make sure you're using a hash table for lookup.
>I've got a fairly fast hashtable on
>http://codecentral.borland.com/codecentral/ccweb.exe/listing?id=15171.
I'm going to give it a try, since my own hash table implementation sucks
pretty much, as it was written in about half an hour. Btw, is there
something like gperf for pascal?
--
Timon Christl <chr...@fmi.uni-passau.de>
Done. The stack still stays, but is now an array :) You're right - it's
quicker.
> Of course, another question is how efficient your tokenizing
> code is.
>
Very - or at least as fast as I can process it. It only reads each
character in the input stream once, and can tokenise a 500-char line in
less than a millisecond (according to Windows - so I don't trust it).
> You do this before parsing? This is a strange approach, which I do not
> fully understand.
It's so I don't have to bother with checking for a valid type (like a
decimal number, a variable name [which can contain spaces and
numerals], a reserved word (stored in the tokenised array as an
integer value - you'll see why later) or a symbol such as +, -, /, *
etc) during parsing. The parser now only checks whether numeric/string
expressions are syntactically correct. No keyword parsing at all.
> > 3) runs the resulting (smaller) stack through the BNF parser using the
> >rule for that keyword. It is this part that performs error checking.
>
> Is this a hand-coded parser? Perhaps you wrote a parser that
> backtracks when confronted with something it cannot handle? Such a
> parser can be very slow, if it has to do lots of backtracking.
>
> Consider using a parser generator tool, or try to rewrite your parser
> so that it doesn't do any backtracking.
Used a parser generator. Overkill, no good. Tried Lex/Yacc for TP -
bloated code; it generates far too much to do a simple job (I may be
utilising it badly though). My parser is about 150-200 lines of code,
and works like a dream. No backtracking, once I had figured out
exactly where I was backtracking. The slowest part of the parser is
getting items from the array, so I eliminated backtracking in order to
prevent a token being read more than once. A nice trick I learnt from
a C program was to pass the token in question to the expression
parser, and advance the pointer one step. Then process the token.
> I think your parser is not efficient. You can ditch the second step
> once you have an efficient parser.
I have. I ditched the second step, and built generic routines to check
different argument types. I.e., Sin(A) = B, STR$(A)=S$, Val(D$)=D etc.
all have similar arguments and results - one procedure to check them
does the job. The trick was to get it to behave like a real Spectrum
:) So now I have about 10 routines to check, and only four of them are
specialised routines. They all work recursively with the expression
parser, and build the syntax string as they go. So when stepping
through the token list (stack as was), if I find a value (everything
is stored as ints) greater than 1000, I use a lookup table to jump to
the appropriate routine.
It was an absolute nightmare, but it works now. Phew!
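Roughly the shape of it is this (the codes, table size and names below
are made up for the example, not my actual code):
---8<---
type
  TCheckProc = procedure(var Pos: Integer);

const
  FirstFuncCode = 1000;   // token values below this are types/symbols

var
  Tokens: array of Integer;                  // the tokenised line
  CheckTable: array[0..99] of TCheckProc;    // filled in at startup,
                                             // one slot per function/keyword

procedure CheckExpression(var Pos: Integer);
begin
  // ordinary numeric/string expression checking goes here
end;

procedure CheckStatement(var Pos: Integer);
begin
  if Tokens[Pos] >= FirstFuncCode then
    CheckTable[Tokens[Pos] - FirstFuncCode](Pos)   // SIN, STR$, VAL, ...
  else
    CheckExpression(Pos);
end;
--->8---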
> Another idea is to only do the syntax checking in the "idle" loop of
> your application. In this way, when the user starts to type a line, it
> will not be syntax checked until the computer has time to catch up
> with the user.
Thanks for your suggestions - you obviously know more about it than I
do, but I hate to use a utility to write code for me when I could do
it myself. I won't be using a parser generator in future, unless I can
write one. Someone once asked me why I wanted to re-invent the wheel -
and I replied "Because it's *MY* wheel."
Your suggestion about the idle checking is a good one though, but the
parser is fast enough now to not be a worry. Hope no-one runs it on
slower hardware though :)
As I said, the hardest part was getting it to behave like a *real* Spectrum.
CHR$ 2+2 doesn't work, but CHR$(2+2) does. PRINT a$>b$ does not work on a
128K Spectrum, but PRINT (a$)>(b$) does...
Thanks for your interest,
Paul.
I'd use context if I were you - if you look ahead two tokens, you'll
need some kind of 'push back', and that can make coding more
awkward. But, it's an implementation detail...
> I use a small fixed-size token stack to buffer the token stream, so
> it's as fast as it can get. Of course I could do an LL(1) grammar, but
> why make the grammar harder to read when this way works and is
> fast enough?
All you usually need is 'magic' identifier tokens.
> Yes, it is faster than my approach,
Unfortunately, speed is a big concern of mine; I'm writing a
dynamically compiling expression evaluator that essentially produces
a function pointer from a string containing an expression, designed to
be applicable to graphics (matrix maths) and spreadsheet
applications. While multiple iterations will amortize the cost of
up-front expression generation, to compete with simpler interpreted
systems (like parser10, you'll find it on most Delphi archives) I'm
trying to get it as fast as possible.
Delphi, being an amazingly fast compiler, inspires one to try hard, too.
> but mine is definitely cleaner.
>
> I use something like the record below for my tokens which greatly
> aides to make a decent error reporting,
My lexer has this information up front, and I only look ahead one
token, so I don't need it. The main reason I used my 'TToken =
#0..#255' hack is because Object Pascal didn't support overloaded
functions when I first wrote it. IOW, historical reasons.
> type
>   TToken = record
>     Code: Word;
>     Line: Integer;
>     Col: Integer;
>     Text: string;
>   end;
> >One final note for speed; make sure you're using a hash table for lookup.
> >I've got a fairly fast hashtable on
> >http://codecentral.borland.com/codecentral/ccweb.exe/listing?id=15171.
>
> I'm going to give it a try, since my own hash table implementation sucks
> pretty much, as it was written in about half an hour.
> Btw, is there
> something like gperf for pascal?
You could translate the output of gperf; AFAICS it is a driver for a
table, so you could write a quick text extractor in Perl to remap it
for a Pascal unit.
I would like to point out, though, that I can't see the point in using
it. The sparser your keyword hash table is, the faster it will be at
rejecting non-matches (since every look-up that lands on an occupied slot
has to be followed up by a byte-for-byte comparison), and the slots will
also almost certainly be unique (since the table is larger). I think it
would be faster to merge your keyword data with your global symbol table,
on the basis that a sparse table will return non-matches much faster.
-- Barry
If you like Parser10, have a look at
http://www.stack.nl/~marcov/symbolic.zip
It is in FPC; I fixed it for Delphi in 2 minutes, but forgot to IFDEF it
in the code :-)
(If "type xx=^simpletype; var yy:xx;" then FPC in native Object Pascal
mode wants yy[5], while Delphi wants yy^[5].)
The speed is comparable with parser10.
P.S. Is Object Pascal LL(1)?
Or store only the tokenized text. You need to invent a junk token,
though, whose attribute stores arbitrary text, so you can load/edit
syntactically invalid programs. You'd be surprised how many old
programs contain syntax errors in rarely or never used code paths.
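In Pascal terms, something along these lines would do (the names are
only an illustration):
---8<---
type
  TStoredTokenKind = (stKeyword, stNumber, stString, stIdentifier, stJunk);
  TStoredToken = record
    Kind: TStoredTokenKind;
    Code: Integer;   // keyword code, when Kind = stKeyword
    Text: string;    // literal text; for stJunk, the raw unparsed source
  end;
--->8---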
> > 2) runs the resulting stack through the rule reducer to reduce
> > all the expressions to their base types (numexpr, stringexpr etc)
>
> You do this before parsing? This is a strange approach, which I do not
> fully understand.
It sounds like semantic analysis, in particular the type check.
> > 3) runs the resulting (smaller) stack through the BNF parser using the
> >rule for that keyword. It is this part that performs error checking.
>
> Is this a hand-coded parser? Perhaps you wrote a parser that
> backtracks when confronted with something it cannot handle? Such a
> parser can be very slow, if it has to do lots of backtracking.
I wrote my own BASIC interpreter (http://www.moria.de/~michael/bas),
but benchmarked mostly execution speed, which is dominated by
expression evaluation. I tried recursive descent vs. LR(0) with
hand-resolved conflicts for that, and on PCs both run at about the same
speed, whereas an old Sun 4c does much better with LR parsing. But
either way, parsing during execution is not going to yield a very
efficient result.
My interpreter stores programs tokenised. Before running a program,
it makes a first pass to build the symbol table and a second to resolve
symbol references and check types.
A line-by-line check only works for one-line statements, but BASIC is not
that simple at all. It only looks that way. ;)
Michael
I don't particularly <g>; it has a horrible parsing speed (about 50
times slower than simple recursive descent), but a fairly efficient
evaluation method (a three-address code variant), while Marcov's code
uses an RPN stack machine.
> P.s. is Object Pascal LL(1)?
Strictly, no. With context, you can manage it.
-- Barry
Michael Haardt <mic...@moria.de> wrote in message
> Or store only the tokenized text. You need to invent a junk token,
> though, which attribute stores arbitrary text so you can load/edit
> syntactically invalid programs. You'd be surprised how many old
> programs contain syntax errors in rarely or never used code paths.
I now store the tokens in a TList of pointers, thus gaining an
appreciable speed increase. Whereas my test line (256 chars of string
expression and functions) took 3 milliseconds to parse 100 times, it
now takes 2 milliseconds to parse 500 times :) - but I have optimised
the parser somewhat since I used the old array.
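The arrangement is roughly this (the record layout here is only an
illustration, not my actual token type):
---8<---
uses Classes;

type
  PTokenRec = ^TTokenRec;
  TTokenRec = record
    Code: Integer;   // keyword/function code, or type marker
    Text: string;    // literal text where needed
  end;

procedure AddToken(List: TList; ACode: Integer; const AText: string);
var
  P: PTokenRec;
begin
  New(P);
  P^.Code := ACode;
  P^.Text := AText;
  List.Add(P);
end;

procedure ClearTokens(List: TList);
var
  i: Integer;
begin
  for i := 0 to List.Count - 1 do
    Dispose(PTokenRec(List[i]));   // Dispose finalises the string field too
  List.Clear;
end;
--->8---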
> parsing during execution is not going to yield a very efficient result.
>
That is exactly why parsing should be carried out only once the return
key is pressed. My problem was that I wanted to perform a parse of the
current edit line every time the user pressed a key, and when you
reached about 200 chars, you had to wait for the parser to finish
before pressing another key :( I really needed help speeding it up.
> My interpreter stores programs tokenised. Before running a program,
> it makes a first pass to build the symbol table and a second to resolve
> symbol references and check types.
This is exactly what I intend doing. However, I have just spent about
6 months writing a parser. To be fair, I knew absolutely nothing about
parsing before that.
> A line-by-line check only works for one-line statements, but BASIC is not
> that simple at all. It only looks that way. ;)
How do you mean, one-line statements? I thought that statements could
only occupy one line at a time anyway? You *do* get big problems when
there is more than one statement on a single line, however... I had
to consider whether the overhead of:
1) parsing the line.
2) storing which statements (if there was more than one) were correct.
3) checking where the line had changed, and which statement was affected,
4) subsequently reparsing from that point.
was worth it, or just to parse the whole line over from scratch.
Anyone know how to efficiently suspend parsing on an incomplete line and
then carry on at a later time (i.e. resuming the parse when more has been
added to the line by the user)?
Thanks
Paul.