If you haven't done this before, I suggest that the easiest/fastest approach
is to create a function that builds a node for each BNF clause in the
language, with a pointer to an associated real Delphi function that actually
implements the clause. Use the Delphi stack as your temporary parse tree,
populating a few helper structures as you go (compiler directives table,
function pointer table, constants tables, string tables and variable
tables), then collapse out of the functions, sewing the nodes into an
interpretable btree. Then simply traverse the btree, executing the function
referenced by each node as you go, using the Delphi stack as your
interpreter's stack.
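
A minimal sketch of that idea, written in Python rather than Delphi to keep
it short. The grammar (expr/term/factor), the Node class and all function
names are assumptions made up for illustration; they are not part of any
existing interpreter. Each grammar clause has a parse method that returns a
node, each node carries a reference to the function that implements it, and
the host language's call stack holds the partial tree while parsing and the
intermediate values while executing.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    run: Callable  # stands in for the pointer to the implementing function
    children: List["Node"] = field(default_factory=list)
    value: object = None


# One implementing function per kind of clause (hypothetical names).
def run_number(node, env):
    return node.value


def run_variable(node, env):
    return env[node.value]


def run_add(node, env):
    left, right = node.children
    return left.run(left, env) + right.run(right, env)


def run_mul(node, env):
    left, right = node.children
    return left.run(left, env) * right.run(right, env)


class Parser:
    def __init__(self, text):
        self.tokens = re.findall(r"\d+|[A-Za-z_]\w*|[+*()]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):  # expr = term { "+" term }
        node = self.term()
        while self.peek() == "+":
            self.take()
            node = Node(run_add, [node, self.term()])
        return node

    def term(self):  # term = factor { "*" factor }
        node = self.factor()
        while self.peek() == "*":
            self.take()
            node = Node(run_mul, [node, self.factor()])
        return node

    def factor(self):  # factor = number | identifier | "(" expr ")"
        tok = self.take()
        if tok is None or tok in ("+", "*", ")"):
            raise SyntaxError("expected a number, a name or '('")
        if tok == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok.isdigit():
            return Node(run_number, value=int(tok))
        return Node(run_variable, value=tok)


def parse(text):
    parser = Parser(text)
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return tree


def execute(node, env):
    # Traverse the tree by calling the function each node refers to.
    return node.run(node, env)
```

For example, execute(parse("x + 2 * 3"), {"x": 4}) walks the tree built for
the whole expression. Note that parse() rejects the entire input on any
syntax error before anything runs.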
The disadvantage of this approach is that the entire app must be
syntactically correct before it will execute, and start-up is slightly
slower than with a lazy evaluator. An interpreter that tokenizes and
interprets as it goes doesn't care about the bits that it doesn't actually
have to execute, and therefore starts faster. But it also has slower total
execution, and it lets bugs through that only show up in production when
obscure code paths are executed.
Class polymorphism and inheritance are going to be a little trickier, but the
process is made a lot easier with a clear language definition that includes
inheritance and interfacing rules.
But I guess it isn't the end of the world to jump in feet first... You will
need a tokeniser first. In my experience the tokeniser is best designed to
break a token on every character that is not alphanumeric or '_'. This makes
the design of the layered interpreter functions that follow the language
grammar more straightforward.
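
A rough sketch of such a tokeniser, in Python for brevity. The function name
and the decision to discard whitespace are assumptions for illustration
only. Runs of letters, digits and '_' become one token, and every other
non-space character becomes a token of its own.

```python
import re

# A run of alphanumerics/underscore, or any single other non-space character.
TOKEN = re.compile(r"[A-Za-z0-9_]+|[^A-Za-z0-9_\s]")


def tokenize(source):
    """Split source into tokens, breaking on every non-alphanumeric/'_' char."""
    return TOKEN.findall(source)


tokens = tokenize("Total := Count_1 + 42;")
# tokens == ['Total', ':', '=', 'Count_1', '+', '42', ';']
```

Note that ':=' comes out as two tokens. Recognising it as one operator is
then the job of the grammar layer above the tokeniser.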
One consideration is how strings should be handled. If the pure tokeniser
is adopted, it means strings are tokenized to the individual components,
which also means that you need to store the spacing with each token so you
can reassemble a string exactly. Alternatively the tokeniser can know about
strings and present them as a single token - but then you need to have the
string rules as part of the tokeniser. The latter is more powerful as a
generalized string handler, but the former is more robust and flexible in
that it maintains true language independence between the tokeniser and the
language.
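
To illustrate the first option, here is a small Python sketch of a pure
tokeniser that records the spacing in front of each token, so the text of a
string can be put back together exactly. The Token type and both function
names are hypothetical, and whitespace after the last token is dropped in
this simplified version.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    spacing: str  # whitespace that preceded the token in the source
    text: str     # the token itself


# Optional leading whitespace, then a word-like run or a single symbol.
PATTERN = re.compile(r"(\s*)([A-Za-z0-9_]+|[^A-Za-z0-9_\s])")


def tokenize(source: str) -> List[Token]:
    return [Token(m.group(1), m.group(2)) for m in PATTERN.finditer(source)]


def reassemble(tokens: List[Token]) -> str:
    """Rebuild the original text from the tokens and their stored spacing."""
    return "".join(t.spacing + t.text for t in tokens)


source = "s := 'Hello,  big   world';"
tokens = tokenize(source)
round_trip = reassemble(tokens)  # identical to source
```

The language layer would call something like reassemble() on the tokens
between the opening and closing quote to recover the string's contents.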
In spite of this I suggest the tokeniser should know about strings and the
string parsing rules. Now, this is not as automatic as it sounds, as we have
to decide how line breaks in strings are to be parsed. As an embedded
scripting language you might decide to allow the string to spread over
multiple lines, because this can be an important simplification when
embedding blocks of HTML or JavaScript into the Pascal script, but as a
Delphi mirror you would break the string at the line break.
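
As a sketch of the second option (Python again, names made up for
illustration), the tokeniser below treats a single-quoted string as one
token, using the Delphi convention that a doubled quote ('') inside a string
stands for one quote character. A flag chooses between the two line-break
policies. In strict mode a string that runs into a line break is not
recognised, and this sketch simply falls back to emitting the lone quote as
a token where a real tokeniser would report an error.

```python
import re


def make_tokenizer(allow_multiline_strings):
    # Characters allowed inside a string: anything but a quote, and in
    # strict (Delphi-like) mode anything but a quote or a line break.
    body = r"[^']" if allow_multiline_strings else r"[^'\n]"
    string = r"'(?:" + body + r"|'')*'"
    pattern = re.compile(string + r"|[A-Za-z0-9_]+|[^A-Za-z0-9_\s]")

    def tokenize(source):
        return pattern.findall(source)

    return tokenize


strict = make_tokenizer(allow_multiline_strings=False)
relaxed = make_tokenizer(allow_multiline_strings=True)

one_line = strict("s := 'It''s';")
# one_line == ['s', ':', '=', "'It''s'", ';']

html = relaxed("h := '<p>\nhi</p>';")
# html == ['h', ':', '=', "'<p>\nhi</p>'", ';']
```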
This gets messier when you think about quoting rules.
It is admirable to want to stick with the Delphi syntax, but what
object-centric Delphi is good at is not necessarily what a text-centric web
app requires to minimize coding and maximize clarity.
This is one of the reasons why I suggest a clear written definition of the
language with sample scripts is an extremely wise first step.
Your call, of course.
Regards
Jonathan Bishop
Managing Director
Bishop Phillips Consulting | Melbourne, Australia – Vancouver, Canada
Should mention of White Space be explicitly included in syntax
definitions?
It seems that "No" is the favoured answer in other people's definitions,
including Dragonkiller's.
The idea seems to be that you can fit white space in almost anywhere, and
the reader understands the exceptions. That has always bothered me a
bit. What about when you DON'T want to allow the script writer to put in
white space between the tokens? For example you can't have arbitrary
white space inside a quoted string (in Delphi or Pascali), or inside a
guid (in Pascali).
For example here is the ebnf for a literal constant guid value in
Pascali.
EBNF:
  digit     = "0" .. "9";
  guid char = digit | ("A" .. "F");
  guid      = "[*" , 8 * guid char , "-" ,
              4 * guid char , "-" ,
              4 * guid char , "-" ,
              4 * guid char , "-" ,
              12 * guid char , "*]";
And here is what it looks like ....
Example Pascali:
  [*1FB62321-44A7-11D0-9E93-0020AF3D82DA*] // No white spaces allowed!
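
One way to see the "no white space" reading of that rule is to write it as a
single lexical pattern, so the whole guid is one token and nothing can be
slipped in between its parts. The Python sketch below is only an
illustration of the EBNF above; the function name is made up, and it assumes
upper-case hex digits only, as the EBNF states.

```python
import re

# Direct transcription of the guid rule: 8-4-4-4-12 hex digits in [* ... *].
GUID = re.compile(
    r"\[\*[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}\*\]"
)


def is_guid_literal(text):
    """True only if text is exactly one guid literal, with no white space."""
    return GUID.fullmatch(text) is not None


ok = is_guid_literal("[*1FB62321-44A7-11D0-9E93-0020AF3D82DA*]")
spaced = is_guid_literal("[*1FB62321 - 44A7-11D0-9E93-0020AF3D82DA*]")
# ok is True, spaced is False
```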
I don't know what the best thing to do is here. Should Optional White
Space be included in the syntax definition? Or left implicit?
Faithfully,
Sean.