Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Would this qualify as a "recursive descent parser"?

1 view
Skip to first unread message

tinyurl.com/uh3t

unread,
Oct 3, 2004, 1:40:22 AM10/3/04
to
http://computing-dictionary.thefreedictionary.com/recursive%20descent%20parser
(grammar) recursive descent parser - A "top-down" parser built from a
set of mutually-recursive procedures or a non-recursive equivalent
where each such procedure usually implements one of the productions of
the grammar. Thus the structure of the resulting program closely
mirrors that of the grammar it recognises.
I.e. you need to actually write a separate procedure (function,
method, whatever) for each production rule in the grammar. If you have
a general-purpose table-driven parser, where the production rules are
expressed in nested data structure instead of in the actual program
code, and the program code simply implements each production-rule
format (1), then it doesn't qualify per that definition. Is that
correct? Or is the above definition too narrow, and a table-driven
parser could also qualify?

(1) The parser I wrote a few days ago uses data structures
(specifically multi-level linked-lists) to represent the BNF-like
production rules. The program code implementes these specific kinds of
right sides of production rules:
(:EXACT <LiteralString>) ;String must match exactly
(:1 <CharClassString>) ;Exactly one char of any listed char class
(:1+ <CharClassString>) ;One or more chars, each of any l.c.c.
(:ANY1 . <ListOfKeywordsOrRightSides>) ;Exactly one rule must match
(:01 <ListOfRightSides>) ;The whole sequence matches

There's a table of BNF-like mappings, each from a keyword to a list of
individual production rules like above. There's a list of toplevel
production rules for matching the target type of syntax, in effect a
gigantic (:ANY1 <rule1> <rule2> ... <rule99>), except that level is
handled outside the parser, at the next higher level of program
control, because failure of (:ANY1 ...) causes backtracking whereas
failure of toplevel parse simply records how much success in case
nothing succeeded and more training is needed to get a successful
parse.

Character classes are:
A = any alphabetic character
9 = any numeric character
w = any whitespace character
<AnyOtherCharacter> = literally itself (exact match)

Here are a few examples of keyword definitions (first element in list
is left side of BNF, rest is right side, in different syntax from
actual BNF of course):
(:IPNUM4 :9 (:EXACT ".") :9 (:EXACT ".") :9 (:EXACT ".") :9)
(:9 (:1+ "9"))
(:W (:1+ "w"))
(:W+MN? (:01 (:W :MonthName)))
(:DATE9 :DOW+W? :DayOfMonth :W+MN? :W :Year)

Here is an example of a toplevel rule (BNF right-side only) for
matching a Received: line in an RFC822 header:
((:EXACT "from") :W :EMAIL1 :W (:EXACT "(HELO") :W (:1+ "A9-.")
(:EXACT ")") :W (:EXACT "(") :JW? (:EXACT "[") :IPNUM4 (:EXACT "]")
(:EXACT ")") :W (:EXACT "(envelope-sender") :W :EMAIL2 (:EXACT ")")
:W (:EXACT "by") :W :EMAIL1 :W+<J>? :W+WITH3? :W+ID? :W+FOR?
(:EXACT ";") :W :DATIME*)
Note that the Received: and any immediately following whitespace is
skipped before starting to apply any of the toplevel rules (in case
you wondered why the rule doesn't start with (:EXACT "Received:") :W).

The algorithm is "greedy" in the sense that each rule tries to gobble
the maximum amount of text consistent with the character class(es) or
sub-production(s) it's looking for.

So would my program qualify as a table-driven recursive-descent parser
or not? If not, what should I call this kind of parser?

Omri Barel

unread,
Oct 4, 2004, 4:30:53 AM10/4/04
to
tinyurl.com/uh3t wrote:

It seems to me that this expression is regular, so there's no need for
any recursion. I don't exactly see how you can use a table directly
(without a stack or a call-stack) to parse a recursive expression.

The question I use to distinguish a recursive expression from a regular
expression is "can I match parentheses?". In your case, there's no need
- you know exactly where each parenthesis belongs. But can you parse,
using your table method (with a different table, of course), the expression:

1 * (2 + 3 * (4 + 5 * ( 6 + 7 * ( 8 + 9 * ( 10 + 11 ) ) ) ) )

And tell that it's a well-formed expression (i.e. the parentheses
match)? A parser that can do that must use a stack (either an explicit
stack or a call stack in the case of recursive descent).

Also, I think that normally a recursive descent parser is only that
which is implemented by a set of procedures, and not table-driven
(table-driven parsers can have different "powers" of parsing, whereas a
recursive-descent is normally an LL(1) parser).

Programmer Dude

unread,
Oct 14, 2004, 1:18:17 PM10/14/04
to
Omri Barel writes:

> The question I use to distinguish a recursive expression from a
> regular expression is "can I match parentheses?".

I'm sure you mean that more generally, yes? Consider:

<Envelope>
<Document>
<Contents>
Hello, World!
</Contents>
</Document>
</Envelope>

Just as XML (ignoring schema), isn't the above a recursive expression?

> But can you parse, using your table method [...], the expression:


>
> 1 * (2 + 3 * (4 + 5 * ( 6 + 7 * ( 8 + 9 * ( 10 + 11 ) ) ) ) )
>
> And tell that it's a well-formed expression (i.e. the parentheses
> match)? A parser that can do that must use a stack (either an explicit
> stack or a call stack in the case of recursive descent).

If the only goal is determining form, isn't counting () enough?

Programmer Dude

unread,
Oct 14, 2004, 1:14:23 PM10/14/04
to
tinyurl.com/uh3t writes:

> (grammar) recursive descent parser - A "top-down" parser built from a
> set of mutually-recursive procedures or a non-recursive equivalent

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> If you have a general-purpose table-driven parser, where the production
> rules are expressed in nested data structure instead of in the actual
> program code, and the program code simply implements each production-

> rule format (1), then it doesn't qualify per that definition.

Perhaps it does under "or a non-recursive equivalent."

> [big snip]


> So would my program qualify as a table-driven recursive-descent parser
> or not? If not, what should I call this kind of parser?

I didn't look to closely at your grammar, but if the analysis and
resulting rules define a "recursive" structure, then I'd say you have
a recursive descent parser implemented non-recursively.


A. Bolmarcich

unread,
Oct 14, 2004, 6:46:30 PM10/14/04
to
On 2004-10-03, tinyurl.com/uh3t <Rober...@YahooGroups.Com> wrote:
> http://computing-dictionary.thefreedictionary.com/recursive%20descent%20parser
> (grammar) recursive descent parser - A "top-down" parser built from a
> set of mutually-recursive procedures or a non-recursive equivalent
> where each such procedure usually implements one of the productions of
> the grammar. Thus the structure of the resulting program closely
> mirrors that of the grammar it recognises.
> I.e. you need to actually write a separate procedure (function,
> method, whatever) for each production rule in the grammar. If you have
> a general-purpose table-driven parser, where the production rules are
> expressed in nested data structure instead of in the actual program
> code, and the program code simply implements each production-rule
> format (1), then it doesn't qualify per that definition. Is that
> correct? Or is the above definition too narrow, and a table-driven
> parser could also qualify?

For what it is worth, I use the narrow definition. A recursive
descent parser has a procedure for each production of the grammar.

Non-recursive equivalents are due to replacing calls to a procedure

- with the body of the procedure, such as when a production occurs
on the right hand side of only one grammar rule

- to itself with a loop

A table-driven parser may be top-down but it is not recursive descent.

Programmer Dude

unread,
Oct 15, 2004, 5:11:36 PM10/15/04
to
Niklas Borson writes:

>> I didn't look to closely at your grammar, but if the analysis and
>> resulting rules define a "recursive" structure, then I'd say you have
>> a recursive descent parser implemented non-recursively.
>

> Support for nested ("recursive") structures is what distinguishes type 2
> (context free) grammers from type 3 (regular) grammers. There are many
> parsing techniques that work on context free grammers, some with various
> restructions. Recursive descent is only one technique, and not the most
> general. Some of the most popular techniques are "bottom up" algorithms
> whereas recursive descent is "top down"; they are radically different
> conceptually so it's not a case of recursive descent parsers implemented
> non-recursively.

Right. In the end, it depends on exactly how the OP's table of rules
ends up analyzing the input. From his description of them as keying
off BNF, it sounded to me that the analysis was likely R/D. And if so,
I'd call that a R/D parser.

> The OP's definition of recursive descent was pretty good, I thought.

Likewise.

> The comp.compilers FAQ is a good resource if you want more info.

Thanks. Not necessary.

tinyurl.com/uh3t

unread,
Oct 20, 2004, 1:32:57 PM10/20/04
to
> From: Programmer Dude <Ch...@Sonnack.com>

> If the only goal is determining form, isn't counting () enough?

It seems to me that for any given grammar there's a basic difference
between a recognizer and a parser for that grammar: A recognizer merely
says yes or no whether a given expression complies with that grammar,
whereas a parser actually breaks the expression into pieces according
to that grammar. (A user-friendly parser would provide both functions,
throwing a nice clean exception if the expression doesn't comply with
the grammar, and otherwise doing the parse. But if a program goes into
an infinite loop or produces incorrect output when confronted with a
non-grammatical expression, but still works correctly on expressions
that comply with the grammar, that would probably qualify under the
strict minimal definition of what it's supposed to do.)

It also seems there's a range of what qualifies for "breaks the
expression into pieces". It might do a complete down-to-the-bottom
decomposition, re-building a parse tree that expresses the complete
parse that was achieved. Or it might just break the expression at the
top level, returning a CONTINUATION as to how to proceed for each
non-leaf piece at that level. "Break" might mean actually building
sub-strings, or just making start-end index pairs into the original
expression without any copying of any of the original data.

Accordingly, merely counting parentheses wouldn't qualify as a parser,
only as a recognizer.

By the way, for some grammars, such as those which permit strings as
atomic expressions, where those strings can contain embedded
parentheses which don't need to be matched, counting parentheses must
be done in combination with some sort of lexical mode analysis which
keeps track of whether the parser is "inside a string" or not at each
position within the string to be parsed. Counting parentheses in the
raw input string doesn't give the correct result.

Programmer Dude

unread,
Oct 21, 2004, 1:19:39 PM10/21/04
to
tinyurl.com/uh3t writes:

>> If the only goal is determining form, isn't counting () enough?
>

> Accordingly, merely counting parentheses wouldn't qualify as a parser,
> only as a recognizer.

Agreed. I got the impression the OP was interested in recognition.

tinyurl.com/uh3t

unread,
Oct 24, 2004, 1:51:35 PM10/24/04
to
> From: nik...@microsoft.com (Niklas Borson)

> Support for nested ("recursive") structures is what distinguishes
> type 2 (context free) grammers from type 3 (regular) grammers.

I'm not sure I am clear on the definitions. For "recursive" structure,
do you really mean that a single production can be applied inside
another application of itself, so there's a recursive loop in applying
the given set of productions (rules)? For example, if an expression is
defined as either a single lower-case letter or as an upper-case letter
followed by exactly two expressions, then:
CApqr
---
The underlined expression is an application of the second rule, but the
whole expression is likewise an application of the second rule, so
that's a recursive appliction of the second rule, so this is a
recursive grammar?

In that case, the parser I wrote is quite capable of handling recursive
rulesets that define recursive grammars, but the particular grammar I
was using to parse RFC822 Received: lines was in frat not a recursive
grammar. So I was not using my parser to the fullest extent of its
capabilities, because I had no need to.

> There are many parsing techniques that work on context free grammers,
> some with various restructions. Recursive descent is only one
> technique, and not the most general. Some of the most popular
> techniques are "bottom up" algorithms whereas recursive descent is
> "top down";

I'm not clear on the meaning of "top down" in this context. To my mind,
a strictly "top down" parser would find absolutely definitive tags to
delimit the outermost expression, and make firm decisions as to the top
level of parse based on those tags, and then descend into each of the
pieces for deeper parsing without any chance of changing its mind from
the top level. By comparison any normal kind of parser, presumably
a "recursive descent" parser, would generate paths to explore from the
top down, but not finalize any such path-decision until after checking
all the lower deatils and finding they match correctly and returning
back to the top levle. So it's not really top-down, it's really
top-down-then-back-up-to-top.

The bottom-up parsers I have in mind, by comparison, really are bottom
up in a strict sense. (But I'm not sure this is what you're referring
to.) They find tiny patterns that match productions (rules) in the
grammar, and tag them with names of grammar types, then try to combine
these into larger expressions mathing other productions, building up
larger and larger structures, maintaining a forest of all such
bottom-up expressions which have been built so-far, discarding any that
aren't the whole expression but can't be fitted into anything larger,
duplicating references to any that can be fitted into more than one
larger structure. Eventually either all expressions (structures) in the
forest have been discarded, or at least one such equals the whole
expression that was being parsed. Is this what you mean by a bottom-up
parser?

tinyurl.com/uh3t

unread,
Oct 25, 2004, 1:17:11 PM10/25/04
to
> From: Programmer Dude <Ch...@Sonnack.com>

> In the end, it depends on exactly how the OP's table of rules ends up
> analyzing the input. From his description of them as keying off BNF

Maybe I'm nitpicking, but no they don't key off BNF (Backus-Naur Form),
rather they key off internal pointy structures which are loaded not
from BNF but from s-expressions which semantically are similar in
effect to BNF. But BNF is a particular notation involving lines like:
L ::= R1 R2 | R1A
or something like that, quite a different *form* of representing the
production rules from BNF. So perhaps I can amend your statement so
that it applies to *any* form of representation of such grammatical
rules, not just BNF?

> it sounded to me that the analysis was likely R/D. And if so,
> I'd call that a R/D parser.

For the moment it seems the concensus is that I wrote a R/D parser
although used it with a set of productions (grammar rules) which in
fact were not recursive. I.e. my program is overkill for the necessary
task, but that's common in computer programming where it's easier to
write a more general program and then just use it to a limited extent,
than to write an inherently restricted program. (Often the easiest way
to write a restricted program is to write a general program then add an
artificial constraint to cripple it.)

By the way, the way my program developed was that initially I had only
a list of toplevel rules, each of which expressed the entire syntax of
one type of RFC822 Received: line. But as I tested the program on more
and more Received: lines and had to add more and more toplevel rules, I
got tired of copy&paste of common sub-expressions, and so I implemented
a list of abbreviations for such common sub-expressions, at which point
my program suddenly became capable of parsing recursive expressions if
I were ever to have a sub-expression abbreviation that included itself
as a possible smaller sub-expression. Eventually I was getting multiple
parses for the same input, due to optional parts that were different
between different toplevel rules where the given expression didn't
use any optional part hence satisfied both toplevel rules. So then
I began merging rules to avoid the duplications, for example:
Req1 Req2 Opt3 Opt5 Req6 Opt7
Req1 Req2 Opt4 Req6
could be merged into a single rule:
Req1 Req2 Opt3 Opt4 Opt5 Req6 Opt7
where Req and Opt mean required and optional subexpressions
respectively.

By the way, my parser ran directly off the raw text input, not off a
stream of tokens produced by some tokenizer. So for exmple the
whitespace that separates significant parts is explicitly included in
my production rules, whereas whitespace would be already gone if using
a tokenizer. For the relatively free form of some parts of RFC822
Received: lines, which contain a mixmash of white and non-white text
not easily characterized by any formal grammar, there's probably no
easy way to write an appropriate tokenizer, so my way was probably
necessary. I suppose that absence or presence of tokenizer before the
parser-proper makes no difference as to whether the parser-proper is
R/D or not.

Programmer Dude

unread,
Oct 27, 2004, 12:55:35 PM10/27/04
to
tinyurl.com/uh3t writes:

> [snip]... I suppose that absence or presence of tokenizer before the


> parser-proper makes no difference as to whether the parser-proper is
> R/D or not.

Don't see any reason why it would.

Just for fun, when I recently wrote an HTML/XML parser, I used an R/D
design in the *tokenizer*!

tinyurl.com/uh3t

unread,
Oct 27, 2004, 1:23:57 PM10/27/04
to
> From: Programmer Dude <Ch...@Sonnack.com>

> I got the impression the OP was interested in recognition.

Nope, you got the wrong impression. I'm the OP, and I definitely wanted
to destructure the Received: line to recover the SMTP-client IP number
and the date&time in a proper manner. In the past I had a hack whereby
I looked for any sub-string that looked like an IP number regardless of
where it was located in the Received: line. I was getting worried that
spammers might forge that number in the HELO which also appears in the
Received: line, and I might be picking up the wrong IP number, the
forged one rather than the correct one. So I wrote the parser to
analyze the structure of Received: lines to be sure my software knew
exactly where the correct IP number was supposed to be located, to
expect it to actually be there, and to then extract it from that
location and ignore any IP-number-syntax elsewhere in the Received:
line.

Just recognizing a Received: line, without destructuring it, is
trivial. Just see if the first word is exactly "Received" followed by
colon and whitespace. Of course I did that to pick out which lines of
header to send to my parser, but that by itself didn't give me the info
I needed.

Using the parser as a tool, the next higher-up task was working
backwards from the very last (topmost) Received: line to see which
internal forwarding occurred after my ISP/MSP already received the
e-mail, until I found the particular Received: line which showed the
message being injected from some outside source. Parsing those
internal-forwarding Received: lines was needed to check whether they
were internal or not, so my program would know whether to stop there or
look further back, and of course parsing the injection-point Received:
line was needed to find the IP number of the spammer's relay or source
so I'd know which ISP was responsible for sending me the spam, and I
could then either complain there or simply move the spam to a public
archive classified per responsible ISP.

0 new messages