[Apologies for the several week reply delay. Got started on a reply,
then got interrupted, then didn't get back to it until now...]
<kint...@gmail.com> wrote:
+---------------
| Thanks for your great post, Rob.
+---------------
You're welcome.
+---------------
| Some questions related to the process described by the code you included.
| Under that scheme, do the grouping constructs, for example ``(`` and
| ``)`` and ``[`` and ``]`` and ``{`` and ``}``, each get parsed
| as an operator?
+---------------
Yes, though the opening brackets are given a *very high* priority,
whereas the closing brackets are given a *very low* priority, as
are things like "," and ";" and so on. Such low-priority operators
are sometimes collectively called "stoppers", since they end
[or "stop"] the parsing of an expression [or subexpression].
[Note: EOF is also considered a "stopper" operator, of the
lowest possible priority.]
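To make the priority scheme concrete, here is a small sketch in Python
(the table entries and the numbers are my own illustration, *not* the
actual BLISS values):

```python
# Hypothetical priority table: opening brackets bind very high,
# closing brackets and other "stoppers" very low, EOF lowest of all.
PRIORITY = {
    "(": 100, "[": 100, "{": 100,    # openers: very high priority
    "*": 50, "/": 50,                # ordinary infix operators
    "+": 40, "-": 40,
    ")": 0, "]": 0, "}": 0,          # closers: "stoppers"
    ",": 0, ";": 0,                  # likewise "," and ";"
    "EOF": -1,                       # EOF: lowest possible priority
}

def is_stopper(op):
    # A "stopper" ends the parsing of an expression or subexpression.
    return PRIORITY[op] <= 0
```

Note that any operator at or below priority 0 ends the current
(sub)expression, which is all a "stopper" really is.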
+---------------
| In my opinion, btw, putting operator precedence activity too early in
| the parse chain is one of the biggest reasons for mediocre parser
| implementations with lots of problems.
+---------------
In this case there's no "early" or "late" to it: the simple operator
precedence parsing is *completely* interwoven with the recursive
descent parsing -- both are done during the same phase. This hybrid
approach permits the recursive descent parser to be *much* simpler,
since you don't have to have the traditional layers and layers of
intermediate parsing routines whose only purpose is to resolve the
operator precedence -- which is one of the reasons hand-coding "pure"
recursive descent parsers is usually such a pain!! ;-}
+---------------
| You also mentioned something like "a token might be a parse tree",
| isn't that putting the cart before the horse? How can a tokenizer
| receive a tree representation of prior parsing?
+---------------
Actually, I believe that what I said [in a footnote] was:
Note: In the BLISS parsers, a lexeme can be either a terminal
value/token [literal, identifier, delimiter, etc.] or the
result of a reduce operation representing a value -- in which
case it is called a "graph table lexeme", and points to a node
in the parse tree being built.
The "tokenizer" [the LEX phase, in BLISS-11] doesn't "receive a tree
representation of prior parsing". Rather, the tokenizer delivers
[via the RUND() function, see previous article] "lexemes" *to*
the parser [the SYN phase, in BLISS-11]:
Lexemes are the smallest units of program used by the rest
of the compiler, and consist of the internal representations
of identifiers, reserved words, special characters, and
literal values.
-- From Section IV.1.1 of "The Design of an Optimizing Compiler",
Wulf et al. (1975, Elsevier).
Identifier & literal lexemes are kinds of "values", as opposed
to "operators". For values resulting from the evaluation of
compile-time expressions, the result might be the same as
(or similar to) lexemes from literal values read in from
the source. But in the process of performing a "reduction"
operation, the syntax analyzer *also* creates "value" lexemes
[graph table lexemes a.k.a. IR nodes] to represent the results
of an operator that can only be executed at run-time. For example,
if the RUND buffer contains the following:
sym del futsym futdel
FOO * BAR +
then because the precedence of the operator "*" is higher than
the operator "+", the syntax parser can construct a "graph table
lexeme" representing the subexpression "FOO * BAR", stick that
into FUTSYM and then call RUND again, resulting in (perhaps):
sym del futsym futdel
[*] + BAZ ;
|
\---->+--------------+
| <op:arith:3> |
+--------------+
| * |
+--------------+
| FOO |
+--------------+
| BAR |
+--------------+
In other words, the syntax analyzer has (briefly) acted like a
lexer by forming an IR code node and overwriting FUTSYM with it.
So when the syntax analyser calls RUND [the lexer] to read the
next lexeme or two [in this case, "BAZ" & ";"], it automatically
shifts the result of the syntax reduction into SYM where it is
needed to continue the parse.
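As a sketch of just that reduction step, in Python, with a nested tuple
standing in for a graph table lexeme (the names and numeric priorities
are mine, not BLISS's):

```python
PRIO = {"*": 50, "/": 50, "+": 40, "-": 40, ";": 0}

def reduce_to_lexeme(sym, delim, futsym, futdel):
    # DEL binds tighter than FUTDEL, so fold {SYM, DEL, FUTSYM} into a
    # single "graph table lexeme" (here just a tuple); the caller then
    # overwrites FUTSYM with it, and the next call to RUND shifts it
    # into SYM, exactly as described above.
    assert PRIO[delim] > PRIO[futdel]
    return (delim, sym, futsym)

node = reduce_to_lexeme("FOO", "*", "BAR", "+")
# node == ("*", "FOO", "BAR")
```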
But what would happen if the current operator in DEL were of *lower*
precedence than the operator in FUTDEL? Then the current syntax parsing
routine would save SYM & DEL in local variables ["on the stack",
effectively] and loop calling RUND and the appropriate delimiter-
processing routine(s) until the original saved operator *was* of
higher precedence than the current DEL, then return a graph table
lexeme for {saved_DEL, saved_SYM, SYM}. That's where the "recursive
descent" part actually happens, by the way. ;-}
See the code for the EXPR-INFIX function in my previous message.
This single routine handles *all* binary infix operators, regardless
of the operator's precedence.
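I can't reproduce the BLISS code here, but the same idea -- one routine
handling *every* binary infix operator by comparing priorities and
recursing for the tighter-binding side -- can be sketched in Python like
so (the tokenizer and the priority table are my own stand-ins, and this
sketch ignores brackets entirely):

```python
import re

PRIO = {"*": 50, "/": 50, "+": 40, "-": 40, ";": 0, "EOF": -1}

def tokenize(text):
    return re.findall(r"[A-Za-z_]\w*|\d+|[*/+\-;]", text) + ["EOF"]

def expr_infix(toks, pos, min_prio):
    # One routine for *all* binary infix operators: save the current
    # value, then loop while the next operator binds tighter than the
    # saved one, recursing to build the right-hand subexpression.
    lhs = toks[pos]; pos += 1
    while PRIO.get(toks[pos], -1) > min_prio:
        op = toks[pos]; pos += 1
        rhs, pos = expr_infix(toks, pos, PRIO[op])
        lhs = (op, lhs, rhs)     # a "graph table lexeme" (IR node)
    return lhs, pos

tree, _ = expr_infix(tokenize("FOO * BAR + BAZ ;"), 0, 0)
# tree == ("+", ("*", "FOO", "BAR"), "BAZ")
```

Because "*" outranks "+", the FOO * BAR reduction happens inside the
recursive call, just as in the buffer example above.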
+---------------
| In a general way, where would you put the code you gave about shifting
| around symbols and operators in the chain "the text through the scanner
| through the tokenizer through the lexer through the parser".
+---------------
I'm not *quite* sure what you're asking here, but I think I've
already shown you in detail in my previous reply & above.
But to recap, there's a function [subroutine, really, if you want to
be pedantic about "functions"] named RUND -- Read Until Next Delimiter
(a.k.a. operator) -- that is the primary interface to the lexer
from the syntax analyzer. RUND reads in the next one or two lexemes
[depending on whether the first is an "operator" lexeme or not], shifts
FUTSYM into SYM & FUTDEL into DEL, and then fills FUTSYM with the first
new lexeme [or a dummy NOVALUE, if the first lexeme was an operator],
and fills FUTDEL with the second new lexeme [or the first, if the first
lexeme was an operator]. This must *always* leave SYM & FUTSYM containing
"value" lexemes or graph table lexemes or NOVALUE, and must *always*
leave DEL & FUTDEL containing operator lexemes. [If not, there is a
syntax error. But that's another story...]
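That buffer discipline might be sketched in Python as follows (the
attribute names follow the text above; the lexeme representation --
("val"/"op", text) pairs -- is my own, and the sketch assumes the input
is syntactically well-formed):

```python
NOVALUE = "<novalue>"

class Rund:
    # SYM & FUTSYM always end up holding "value" lexemes (or NOVALUE);
    # DEL & FUTDEL always end up holding operator lexemes.
    def __init__(self, lexemes):
        self._lexemes = iter(lexemes)
        self.sym = self.delim = self.futsym = self.futdel = None
        self.rund(); self.rund()          # prime the two-lexeme window

    def _next(self):
        # EOF is itself an operator lexeme, of lowest priority.
        return next(self._lexemes, ("op", "EOF"))

    def rund(self):
        # Shift FUTSYM into SYM and FUTDEL into DEL...
        self.sym, self.delim = self.futsym, self.futdel
        kind, text = self._next()
        if kind == "op":        # first new lexeme is an operator:
            self.futsym, self.futdel = NOVALUE, text
        else:                   # a value: read the operator after it
            self.futsym, self.futdel = text, self._next()[1]
```

Feeding it the lexemes of "FOO * BAR + ..." reproduces the buffer state
shown earlier: SYM=FOO, DEL=*, FUTSYM=BAR, FUTDEL=+.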
+---------------
| Do you have notions about the functionality that could be
| isolated and specified as happening in the lexer?
+---------------
Basically, the *only* thing that happens in the lexer proper is that
one or more characters of input are read until a complete lexical token
is found, and then that token is analyzed (just a little) into "values"
[identifiers or literals such as numbers & strings] or "delimiters"
["operators" such as "IF,THEN,WHILE" or "+-*/({[=" or "stoppers" such
as "]}),;"] and then packaged up into a tagged "lexeme" object [one of
the tags being delimiter/operator or not].
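A minimal sketch of such a lexer in Python (the token pattern and the
category sets are my own illustration, not any particular language's):

```python
import re

STOPPERS = set("]}),;")
DELIMITERS = set("+-*/({[=") | {"IF", "THEN", "WHILE"} | STOPPERS

def lex(text):
    # Read characters until a complete token is found, then tag it as
    # a "delimiter"/operator lexeme or a "value" lexeme.
    for tok in re.findall(r'[A-Za-z_]\w*|\d+|"[^"]*"|\S', text):
        yield ("op" if tok in DELIMITERS else "val", tok)

list(lex("IF x + 1 ;"))
# [("op", "IF"), ("val", "x"), ("op", "+"), ("val", "1"), ("op", ";")]
```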
+---------------
| I see all kinds of madness in the code I review, if actual practice
| is the criterion of evaluation then it is bullshit to pretend that for
| example scanning/tokenizing/lexing/parsing has all been figured out...
+---------------
I think you're being a little too skeptical. With a little more experience
in different languages/compilers you might learn that indeed scanning/
tokenizing/lexing/parsing *HAS* been pretty much all figured out, at least
for languages with a relatively straightforward syntax.
+---------------
| and bullshit too to insist that the solution is to do it the
| correct way according to the instructions of the old masters.
+---------------
Well, I'm not trying to push any particular design style as any kind
of "correct way" -- far from it!! I'm just sharing one particular
little parsing "trick" [that I learned from the BLISS-10 & BLISS-11
compilers decades ago] that I have personally found quite useful when
implementing small (*non*-Lisp-like) languages that need to be embedded
in other software.
The biggest thing I ever personally used the above for was writing
an infix dialect of Scheme [called "Parentheses-Light Scheme", or
shorter, "P'Lite Scheme", or just "plite"]. It was successfully
used by a networking hardware development group at SGI to code
user-mode hardware debugging programs.
+---------------
| Now that being said if I wasn't being so lazy I'd be reading the
| dragon book and answering these questions for myself instead of
| wasting your time with something that is so off-topic.
+---------------
Well, I didn't learn about the above in the Dragon Book myself,
either, but from the BLISS-10 compiler source and the BLISS-11
compiler design book [mentioned above].
And it's not *too* far off-topic, since Lispers *do* occasionally
have to be able to parse non-sexpr code or config file formats, and
the above parsing style is easy to implement in Lisp. So there. ;-}
-Rob