The algorithms that generate parsers from grammars reach "conflict"
points when they need to disambiguate the grammar, and at those points
one can essentially iterate to find k (and then take the maximum k over
all such points to determine k for the grammar as a whole). In some
cases, you can easily establish that no such k exists.
However, there are other cases where the existence of k cannot be
easily proven or disproven. There are ways of constructing grammars
such that if you can prove they are unambiguous (and have a k such
that they are LR(k)), you can show that certain arbitrary Turing
machines halt. Since the latter problem cannot be solved
algorithmically, neither can the former.
Thus, if you start by iterating over increasing k's, your program may
never halt, so simply increasing k at each conflict point is not a
panacea. The best answer such an algorithm can produce is that your
grammar isn't LL(k) or LR(k) for any value of k smaller than the
largest one tried so far. And, by the way, the memory requirements for
even performing those experiments grow exponentially (see note below),
so it isn't even useful in a practical sense.
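In pseudo-code, the brute-force iteration looks something like the
sketch below. Note that build_lr_tables() is only a stand-in for a
real LR(k) table construction (it is not a function from Yacc++ or any
other tool), and the cut-off is arbitrary:

    # Brute-force sketch only: raise k until the LR(k) construction stops
    # reporting conflicts.  build_lr_tables() is a placeholder assumed to
    # return the list of remaining conflict points; it is not defined here.
    # As noted above, there may be no such k, so the loop needs a cut-off,
    # and the tables themselves grow exponentially with k.
    def find_k(grammar, k_limit):
        for k in range(1, k_limit + 1):
            conflicts = build_lr_tables(grammar, k)   # placeholder call
            if not conflicts:
                return k      # grammar shown to be LR(k)
        return None           # only know it isn't LR(k) for k <= k_limit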
Note that the above description applies to grammars (not languages),
which, although seemingly identical notions, are not. You can have a
language without having a grammar (in BNF or other equivalent
notation) for that language. Moreover, certain properties can be
proven to hold for some languages even without exhibiting a grammar.
For example, if you know that a language is LR(5), you then know that
the language is also LR(1). That does not mean you can find an LR(1)
grammar for the language, just that one exists.
note from above: One of Terence Parr's innovations was discovering the
"linear approximate lookahead" algorithm, which grows only linearly
but computes an approximate solution to the problem. However, it does
make LL(k) parsing practical for many cases.
On a related note, I once investigated how one could compute the
closure of LR(k) grammars, and have an algorithm to do so (other than
simply iterating over k) implemented in Yacc++. Unfortunately, even
that algorithm does not necessarily terminate on any given grammar
(and cannot, due to the Turing limitation), and it still has
exponential memory bounds. I am currently working with a researcher
who I'm hoping will take the work further.
Hope this helps,
-Chris
******************************************************************************
Chris Clark Internet: christoph...@compiler-resources.com
Compiler Resources, Inc. or: com...@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
For LL, I don't know much.
For LR, *almost* all context-free grammars are LR(k < infinity), so if
you can prove that the grammar is context-free you are almost there.
You can always construct an LR(1) - or even LR(0) - state model to
parse a context-free grammar, but if the grammar is not LR(1) that
state model will have shift/reduce or reduce/reduce conflicts.
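A textbook example of such a conflict is the dangling else:

    S -> if E then S
    S -> if E then S else S
    S -> other

With "if E then S" already recognised and "else" as the next input
token, the LR(1) table cannot choose between shifting the "else" and
reducing by the first rule, so you get a shift/reduce conflict there.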
Almost all such conflicts can be resolved finitely. This is the logic
behind Tomita and other generalised LR parsing algorithms. The basic
principle is that conflicts are resolved by checking all possibilities
- the Tomita approach being to check them concurrently using stack
duplication, whereas another common approach is to use backtracking.
The basic problem is that sometimes, there are infinite possibilities
to check. This can occur when the grammar contains empty rules - when
there are reductions that pop zero tokens from the stack. However, it
doesn't happen for *all* cases where empty rules are used - except in
the case of LR(0), I think - and it is not reasonable to ban such
rules.
In such cases, Tomita will give an infinite number of stack
duplicates, whereas the backtracking approach will simply give an
infinite loop.
Longer lookaheads can bypass most cases where these loops can happen,
and even LR(1) can handle most practical grammars with empty rules.
However, in general, it is not decidable for a particular grammar
whether such loops can be avoided using longer lookaheads. It is only
decidable if you also know the specific input that will be parsed.
I have designed a variant of GLR (though I have never implemented it)
which should theoretically give constant-time parsing of any
context-free grammar that can be parsed, but which (unlike Tomita or
backtracking) can detect these infinite loops at run-time. I say "I
have designed", but it's not my invention - I was beaten to it by
about a decade. I was inspired by the packrat approach to LL parsers,
and designed a memoized/dynamic/tabular LR parsing algorithm, but once
I'd got the basic idea, it was also easy to find certain papers by
Mark-Jan Nederhof et al, published in the mid-90s.
Detecting the infinite loops in this tabular approach is not trivial,
but is constant-time. It involves a digraph cycle check for one column
of the table. This is constant-time because there is a maximum digraph
size, which is decided by the state model and thus the grammar. There
is no rule to say that constant-time must be fast ;-)
Basically, that's it - *almost* any context-free grammar is LR(k <
infinity) but there are a few that are not, and there are some
grammars which cannot be proven either LR(k) or otherwise except by
iterating through the infinite set of possible inputs.
For those grammars that are LR(k), k can be found by starting with an
LR(1) state model and identifying the states at which conflicts exist.
At each of these states, extend the model by adding "lookahead" states
until all the conflicts are resolved. That is, the actions for the
transitions are neither "shift" nor "reduce" but a kind of lookahead
pseudo-shift that adds nothing to the stack. In a sense, this is
building the stack duplication/backtracking into the state model
itself. The maximum number of lookahead steps needed to resolve all
conflicts determines k, and a genuine LR(k) model can then be derived.
This always has the possibility of an infinite loop growing an
infinite state model due to those undecided non-LR(k) cases, however.
My understanding is that this limitation cannot be overcome since
doing so would provide a solution to the halting problem, which is
proven non-decidable.
Could you provide references for this? Preferably how to construct
example grammars, or the example grammars themselves.
Could there exist an algorithm that can construct a parser for any
LR(k) grammar for any finite k without determining k? Note that,
since such an algorithm may also construct parsers for some grammars
which are not LR(k) for any k, such an algorithm could not tell you if
the grammar is LR(k) for any k.
> Thus, if you start by iterating over increasing k's, your program may
> never halt, so simply increasing k at each conflict point is not a
> panacea. The best answer such an algorithm can produce is that your
> grammar isn't LL(k) or LR(k) for any value of k smaller than the
> largest one tried so far. And, by the way, the memory requirements for
> even performing those experiments grow exponentially (see note below),
> so it isn't even useful in a practical sense.
>
> Note that the above description applies to grammars (not languages),
> which, although seemingly identical notions, are not. You can have a
> language without having a grammar (in BNF or other equivalent
> notation) for that language. Moreover, certain properties can be
> proven to hold for some languages even without exhibiting a grammar.
> For example, if you know that a language is LR(5), you then know that
> the language is also LR(1). That does not mean you can find an LR(1)
> grammar for the language, just that one exists.
Is there not an algorithm to convert any LR(k) grammar into an LR(1)
or even LR(0) grammar?
Regards,
Ralph Boland
Ralph Boland <rpbo...@gmail.com> writes:
> Could you provide references for this? Preferably how to construct
> example grammars, or the example grammars themselves.
The rub comes with grammars which may be ambiguous. There is no LR(k)
parser for an ambiguous language--all LR(k) languages are unambiguous.
From Wikipedia:
The general question of whether a grammar is not ambiguous is
undecidable. No algorithm can exist to determine the ambiguity of a
grammar because the undecidable Post correspondence problem can be
encoded as an ambiguity problem.
And TMs can mechanically be transformed into Post Correspondence
Problems. Thus, if you have an algorithm that can tell you whether an
arbitrary grammar is ambiguous, you can use it to construct a machine
that determines whether a TM halts: convert the TM into the
corresponding PCP instance, encode that as a grammar ambiguity
problem, and then run your ambiguity algorithm on it. Voila, a
solution to the halting problem - which we know cannot exist.
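If memory serves (the construction is in Hopcroft and Ullman, among
other places), the encoding itself is short enough to sketch. Given a
PCP instance with pairs (u1,v1), ..., (un,vn) and n fresh marker
symbols 1..n, build the grammar:

    S -> A
    S -> B
    A -> u_i A i     (two rules for each pair i = 1..n)
    A -> u_i i
    B -> v_i B i
    B -> v_i i

A derivation from A spells out a sequence of u's followed by the list
of indices used, in reverse, and similarly for B with the v's. A
string therefore has two distinct derivations - one via A, one via B -
exactly when some index sequence makes the u's and v's spell the same
string, i.e. exactly when the PCP instance has a solution. So deciding
ambiguity for grammars of this shape would decide PCP.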
> Could there exist an algorithm that can construct a parser for any
> LR(k) grammar for any finite k without determining k? Note that,
> since such an algorithm may also construct parsers for some grammars
> which are not LR(k) for any k, such an algorithm could not tell you if
> the grammar is LR(k) for any k.
If I understand what you are asking, the GLR algorithm does
essentially that. If your grammar is LR(k) for some k, eventually the
parse forest will resolve down to a (single) parse tree and you have
the LR(k) parse for said input. However, if your language is
ambiguous, the GLR method won't always tell you that it's ambiguous,
and for some input there will be a parse forest that doesn't resolve
into a single tree, because it is ambiguous. And, unless you stumble
across an ambiguous sentence in your input, you may never know whether
your language is ambiguous or not.
> Is there not an algorithm to convert any LR(k) grammar into an LR(1)
> or even LR(0) grammar?
Not that I am aware of. I'm pretty certain I saw somewhere a proof
that there can't be. However, my paper archives are still in MA while
I am in AZ (and too extensive to rummage through in any event), and
Google doesn't show anything obvious. As I recall, even the proof
that any LR(k) language is an LR(1) language was non-constructive,
which it would have to be if there is no algorithm--otherwise the
construction would be the algorithm.
> For LR, *almost* all context-free grammars are LR(k < infinity), so if
> you can prove that the grammar is context-free you are almost there.
That's not very useful, and not even true. There are many useful
grammars that are not LR(k) for any k, and even useful languages for
which there do not exist LR(k) grammars for any k.
> You can always construct an LR(1) - or even LR(0) - state model to
> parse a context-free grammar, but if the grammar is not LR(1) that
> state model will have shift/reduce or reduce/reduce conflicts.
>
> Almost all such conflicts can be resolved finitely. This is the logic
> behind Tomita and other generalised LR parsing algorithms. The basic
> principle is that conflicts are resolved by checking all possibilities
> - the Tomita approach being to check them concurrently using stack
> duplication, whereas another common approach is to use backtracking.
This does, however, not give you any k, since the amount you have to
read ahead before backtracking may depend on the input.
> The basic problem is that sometimes, there are infinite possibilities
> to check. This can occur when the grammar contains empty rules - when
> there are reductions that pop zero tokens from the stack. However, it
> doesn't happen for *all* cases where empty rules are used - except in
> the case of LR(0), I think - and it is not reasonable to ban such
> rules.
>
> In such cases, Tomita will give an infinite number of stack
> duplicates, whereas the backtracking approach will simply give an
> infinite loop.
Empty productions can be eliminated automatically using a fairly
simple procedure (which, however, may increase the grammar size
quadratically), so these are not the problem. The problem is that you
often have to look to the end of the input before backtracking, which
effectively gives you unbounded lookahead.
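For what it's worth, a rough sketch of that elimination procedure is
below. The grammar representation and names are purely illustrative
(a dict mapping each nonterminal to a list of right-hand-side tuples),
this is the naive subset-expansion version, and it ignores the special
case where the start symbol must keep an empty rule because the
language contains the empty string:

    from itertools import combinations

    def eliminate_empty(grammar):
        # 1. Find the nullable nonterminals (those that can derive the
        #    empty string) by fixpoint iteration.
        nullable = set()
        changed = True
        while changed:
            changed = False
            for lhs, rhss in grammar.items():
                if lhs not in nullable and any(
                        all(s in nullable for s in rhs) for rhs in rhss):
                    nullable.add(lhs)
                    changed = True
        # 2. For every production, add each variant obtained by dropping
        #    some subset of its nullable symbols, then discard empty rules.
        new_grammar = {}
        for lhs, rhss in grammar.items():
            variants = set()
            for rhs in rhss:
                null_pos = [i for i, s in enumerate(rhs) if s in nullable]
                for r in range(len(null_pos) + 1):
                    for drop in combinations(null_pos, r):
                        v = tuple(s for i, s in enumerate(rhs)
                                  if i not in drop)
                        if v:                  # drop empty productions
                            variants.add(v)
            new_grammar[lhs] = sorted(variants)
        return new_grammar

    # e.g. eliminate_empty({"P": [("a","P","a"), ("b","P","b"), ()]})
    # keeps P -> aPa and P -> bPb and adds P -> aa and P -> bb.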
If you just want to parse a text with a grammar, there are plenty of
algorithms around (such as Earley's universal parser) that will give
you a lazy list of all possible parse trees. But that doesn't help
you decide if the language is LR(k).
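For the record, the core of such a universal parser is small. Below
is a rough recognizer-only sketch of Earley's algorithm (no parse
trees, and the representation is again just for illustration - the
same dict-of-tuples format as above):

    def earley_recognize(grammar, start, tokens):
        # chart[i] holds items (lhs, rhs, dot, origin); a symbol counts
        # as a nonterminal iff it is a key of `grammar`.
        chart = [set() for _ in range(len(tokens) + 1)]
        for rhs in grammar[start]:
            chart[0].add((start, rhs, 0, 0))
        for i in range(len(tokens) + 1):
            changed = True
            while changed:              # fixpoint over column i
                changed = False
                for (lhs, rhs, dot, origin) in list(chart[i]):
                    if dot < len(rhs):
                        sym = rhs[dot]
                        if sym in grammar:               # predict
                            for alt in grammar[sym]:
                                item = (sym, alt, 0, i)
                                if item not in chart[i]:
                                    chart[i].add(item)
                                    changed = True
                        elif i < len(tokens) and tokens[i] == sym:
                            chart[i + 1].add((lhs, rhs, dot + 1, origin))
                    else:                                # complete
                        for (l2, r2, d2, o2) in list(chart[origin]):
                            if d2 < len(r2) and r2[d2] == lhs:
                                item = (l2, r2, d2 + 1, o2)
                                if item not in chart[i]:
                                    chart[i].add(item)
                                    changed = True
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for (lhs, rhs, dot, origin) in chart[-1])

This only answers yes/no; producing the parse trees (or a shared
forest) on top of the chart is where most of the real work is.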
> I have designed a variant of GLR (though I have never implemented it)
> which should theoretically give constant-time parsing of any
> context-free grammar
I assume you mean "linear-time", as constant-time parsing sounds
implausible. But even linear-time parsing for any context-free
grammar sounds suspect and would be a very big result (if you can
prove it).
> that can be parsed,
Any context-free grammar can be parsed, so this restriction is
meaningless.
Torben
Cheers,
Ulf
>Stephen Horne <sh006...@blueyonder.co.uk> writes:
>
>> For LR, *almost* all context-free grammars are LR(k < infinity), so if
>> you can prove that the grammar is context-free you are almost there.
>
>That's not very useful, and not even true. There are many useful
>grammars that are not LR(k) for any k, and even useful languages for
>which there do not exist LR(k) grammars for any k.
Context-free grammars? And for any finite k?
My understanding is that you'd need an infinite input. While there are
certainly applications where unbounded inputs occur, a genuinely
infinite input can never be completely parsed anyway.
I'm not saying I'm right here, as I'm certain you know more than me,
but I'd like to see an example.
>> You can always construct an LR(1) - or even LR(0) - state model to
>> parse a context-free grammar, but if the grammar is not LR(1) that
>> state model will have shift/reduce or reduce/reduce conflicts.
>>
>> Almost all such conflicts can be resolved finitely. This is the logic
>> behind Tomita and other generalised LR parsing algorithms. The basic
>> principle is that conflicts are resolved by checking all possibilities
>> - the Tomita approach being to check them concurrently using stack
>> duplication, whereas another common approach is to use backtracking.
>
>This does, however, not give you any k, since the amount you have to
>read ahead before backtracking may depend on the input.
My claim is that almost all conflicts can be resolved finitely. I
never claimed that this tells me the value of k. A method for
determining k was vaguely described later in the post.
Of course formally I'm wrong, because I'm counting in terms of what is
useful rather than the far more numerous grammars that aren't.
Anyway, the main limitation on finite resolution is that the input
itself must be finite. That's no cheat since any input that can be
parsed completely must be finite. Obviously, formally, it's a
limitation - but if you're doing state-so-far parsing of infinite
input streams, large lookaheads (often any lookaheads) are
inappropriate anyway, so it didn't seem relevant.
I probably should have mentioned this, of course.
>> I have designed a variant of GLR (though I have never implemented it)
>> which should theoretically give constant-time parsing of any
>> context-free grammar
>
>I assume you mean "linear-time", as constant-time parsing sounds
>implausible.
Yes - sorry - I meant constant time per token of input. For some
variants (depending on how the table is managed and evaluated) that's
amortised constant time per token of input.
> But even linear-time parsing for any context-free
>grammar sounds suspect and would be a very big result (if you can
>prove it).
1. As I said, there's a pretty big proviso in there, related to the
halting problem. The only improvement in the power of this
relative to backtracking/Tomita is that it can detect problems for
specific inputs at run-time.
Actually, there is a proviso to that proviso. My method doesn't
need reduce sizes to be encoded in the state model - they are
determined at run-time by the table. This makes it easy to support
EBNF definitions for nonterminals. There's no extra power in terms
of the set-of-strings definition of a grammar, but it's a little
more flexible in how you get to define the grammar.
That's not all that interesting, though, as EBNF rules could
already be handled by grammar transformations.
Earley is certainly still more powerful. Where backtracking LR goes
into an infinite loop and tabular LR can give a run-time error, my
understanding is that Earley would just work.
2. As I said, at best I re-invented it ten years after the fact.
Formal papers have already been published for tabular LR parsing.
I also suggest some tricks such as a kind of sliding window for
the table for handling unbounded inputs, but that leads to further
provisos. Truth told, even ignoring the cycle issues, you can
only ensure linear time parsing of any CF grammar using my method
if you know the full input up-front, so that you can use a
preallocated table.
Actually, I'm not entirely confident that Nederhof's tabular LR is
the same as mine - I haven't read his papers properly, and can't
even get hold of all of them. But given what I have read, it seems
to be the same thing.
That said, I believe that a proof that general CF parsing is
linear time was described here some years back. I started a thread
in 2002 relating to linear time parsing (my idea back then was
broken), and one of the replies referred to Kolmogorov complexity
theory. 6 years later, and I still haven't done the reading to
understand that.
Search for "Have I discovered something new?" and "Stephen Horne"
in Google Groups.
3. A lot of the costs have big constant factors. The cycle checks for
a start. *Very* unlikely to be mainstream. Potentially useful for
niche applications, I suppose, but even then I have my doubts.
My notes are in a 12-page approx. 340K OpenDocument Text file. The
real explanation of the method takes about 3 pages, the rest is on
variations and other general notes.
I can e-mail them if you want.
[Yes, there are lots of context free grammars that are not LR(k). See, for
example http://compilers.iecc.com/comparch/article/99-10-178 when this
came up nine years ago. -John]
> On Fri, 07 Nov 2008 14:27:18 +0100, tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:
>
>>Stephen Horne <sh006...@blueyonder.co.uk> writes:
>>
>>> For LR, *almost* all context-free grammars are LR(k < infinity), so if
>>> you can prove that the grammar is context-free you are almost there.
>>
>>That's not very useful, and not even true. There are many useful
>>grammars that are not LR(k) for any k, and even useful languages for
>>which there do not exist LR(k) grammars for any k.
>
> Context-free grammars? And for any finite k?
Indeed. The language of all even-length palindromes over the
alphabet {a,b} is not LR(k) for any k. Note that it has an
unambiguous grammar:
P -> aPa
P -> bPb
P ->
> My understanding is that you'd need an infinite input. While there are
> certainly applications where unbounded inputs occur, a genuinely
> infinite input can never be completely parsed anyway.
It is true that for any particular input, you need only finite
lookahead, but for a grammar to be LR(k), the same k must work for
all inputs.
> I'm not saying I'm right here, as I'm certain you know more than me,
> but I'd like to see an example.
>
>>> You can always construct an LR(1) - or even LR(0) - state model to
>>> parse a context-free grammar, but if the grammar is not LR(1) that
>>> state model will have shift/reduce or reduce/reduce conflicts.
>>>
>>> Almost all such conflicts can be resolved finitely. This is the logic
>>> behind Tomita and other generalised LR parsing algorithms. The basic
>>> principle is that conflicts are resolved by checking all possibilities
>>> - the Tomita approach being to check them concurrently using stack
>>> duplication, whereas another common approach is to use backtracking.
>>
>>This does, however, not give you any k, since the amount you have to
>>read ahead before backtracking may depend on the input.
>
> My claim is that almost all conflicts can be resolved finitely.
If you have an ambiguous grammar, this is certainly not true. But if
the grammar is unambiguous, there is by definition only one possible
parse tree for a string, and since there exist algorithms for finding
this, you can say that you resolve the conflict by finite lookahead.
But, again, this lookahead is input-dependent.
> I never claimed that this tells me the value of k. A method for
> determining k was vaguely described later in the post.
Vaguely indeed, as in wrong.
> Of course formally I'm wrong, because I'm counting in terms of what is
> useful rather than the far more numerous grammars that aren't.
How do you determine if a grammar is useful?
> Anyway, the main limitation on finite resolution is that the input
> itself must be finite. That's no cheat since any input that can be
> parsed completely must be finite. Obviously, formally, it's a
> limitation - but if you're doing state-so-far parsing of infinite
> input streams, large lookaheads (often any lookaheads) are
> inappropriate anyway, so it didn't seem relevant.
>
> I probably should have mentioned this, of course.
I never talked about infinite input, so this is irrelevant.
>>> I have designed a variant of GLR (though I have never implemented it)
>>> which should theoretically give constant-time parsing of any
>>> context-free grammar
>>
>>I assume you mean "linear-time", as constant-time parsing sounds
>>implausible.
>
> Yes - sorry - I meant constant time per token of input. For some
> variants (depending on how the table is managed and evaluated) that's
> amortised constant time per token of input.
>
>> But even linear-time parsing for any context-free
>>grammar sounds suspect and would be a very big result (if you can
>>prove it).
>
> 1. As I said, there's a pretty big proviso in there, related to the
> halting problem. The only improvement in the power of this
> relative to backtracking/Tomita is that it can detect problems for
> specific inputs at run-time.
>
> Actually, there is a proviso to that proviso. My method doesn't
> need reduce sizes to be encoded in the state model - they are
> determined at run-time by the table. This makes it easy to support
> EBNF definitions for nonterminals. There's no extra power in terms
> of the set-of-strings definition of a grammar, but it's a little
> more flexible in how you get to define the grammar.
I just don't understand this description. Can you write it as
pseudo-code?
> 2. As I said, at best I re-invented it ten years after the fact.
> Formal papers have already been published for tabular LR parsing.
Indeed. But while they claim linear-time parsing for tables without
conflicts, there is no claim for tables with conflicts.
> I also suggest some tricks such as a kind of sliding window for
> the table for handling unbounded inputs, but that leads to further
> provisos. Truth told, even ignoring the cycle issues, you can
> only ensure linear time parsing of any CF grammar using my method
> if you know the full input up-front, so that you can use a
> preallocated table.
Knowing the full input up-front is no problem, but if you make a table
specifically to this input, you must show that this can be constructed
in linear time and that subsequent parsing using this table is also in
linear time.
> That said, I believe that a proof that general CF parsing is
> linear time was described here some years back. I started a thread
> in 2002 relating to linear time parsing (my idea back then was
> broken), and one of the replies referred to Kolmogorov complexity
> theory. 6 years later, and I still haven't done the reading to
> understand that.
I believe linear-time CF parsing is still an open problem.
> 3. A lot of the costs have big constant factors. The cycle checks for
> a start. *Very* unlikely to be mainstream. Potentially useful for
> niche applications, I suppose, but even then I have my doubts.
For just proving linear-time, big constant factors don't matter - as
long as they are constants independent of the input size.
> My notes are in a 12-page approx. 340K OpenDocument Text file. The
> real explanation of the method takes about 3 pages, the rest is on
> variations and other general notes.
If you really have an algorithm for linear-time CF parsing, send a
paper to a journal to get it reviewed. If you get it past the
reviewers, I might be persuaded to read your article. There are just
too many "I have this wonderful solution to (open problem), which I
have described in (long and impenetrable text)" claims out there for
me to want to read one more.
Torben
Nice paper, thanks for the pointer. It's pretty amazing what some
bachelor's students can do, even if they are leveraging other work as
a primary basis.
-Chris
>Stephen Horne <sh006...@blueyonder.co.uk> writes:
>
>> On Fri, 07 Nov 2008 14:27:18 +0100, tor...@pc-003.diku.dk (Torben AEgidius Mogensen) wrote:
>>
>>>Stephen Horne <sh006...@blueyonder.co.uk> writes:
>>>
>>>> For LR, *almost* all context-free grammars are LR(k < infinity), so if
>>>> you can prove that the grammar is context-free you are almost there.
>>>
>>>That's not very useful, and not even true. There are many useful
>>>grammars that are not LR(k) for any k, and even useful languages for
>>>which there do not exist LR(k) grammars for any k.
>>
>> Context-free grammars? And for any finite k?
>
>Indeed. The language of all even palindromic sequences over the
>alphabet {a,b} is not LR(k) for any k. Note that this has an
>unambiguous grammar:
>
> P -> aPa
> P -> bPb
> P ->
This is annoying because, formally, I'm sure you are absolutely 100%
correct, and I cannot even appeal to usefulness, since palindrome-like
expressions are part of many programming language grammars (especially
with respect to different styles of parentheses).
Sort of.
Yet despite this, I still think your criticism is wrong.
Basically, palindrome-like expressions are common, as I said, in many
programming languages - most of which parse perfectly happily with
LR(1). I specifically mention problems with empty rules later in the
original post. Therefore, the only problem you *can* have WRT this is
in the subjective meaning of the term "almost all" in my original
post.
My impression of this discussion is that a lot of this has happened. I
can't fault you on your theory - for that matter, I have to thank you,
because you have made me take a serious look at some of my sloppy
thinking. Even so, the original post wasn't meant to be formal, the
bits that needed qualifiers were (mostly, at least) qualified, and
what has led you to see me as simply wrong seems to be mostly a
matter of interpretation of subjective terms such as "almost all".
In these situations, I have learned to stop and rethink what the real
issue is, and I suspect that the real issue is the tone in my original
post. Very often, my writing seems to have an unintentional tone of
authority or even of arrogance.
All I can say here is that I have Asperger's. I have developed a
pedantic and formal communication style as a desperate and misguided
strategy to avoid being misunderstood, due to my dodgy nonverbals etc.
By now, it is a deeply ingrained habit. Mixing a pedantic and formal
communication style with an informal approach to a subject is of
course bound to cause misunderstandings, especially given that I
mostly see the tone I intend when I write - the tone that others see
is of course obvious, but only once someone else has pointed it out to
me :-(
Anyway, at this point, I think the best thing to do is try to draw
this to an end before I cause any more upset.
>> 1. As I said, there's a pretty big proviso in there, related to the
>> halting problem. The only improvement in the power of this
>> relative to backtracking/Tomita is that it can detect problems for
>> specific inputs at run-time.
>>
>> Actually, there is a proviso to that proviso. My method doesn't
>> need reduce sizes to be encoded in the state model - they are
>> determined at run-time by the table. This makes it easy to support
>> EBNF definitions for nonterminals. There's no extra power in terms
>> of the set-of-strings definition of a grammar, but it's a little
>> more flexible in how you get to define the grammar.
>
>I just don't understand this description. Can you write it as
>pseudo-code?
By "proviso to the proviso" I am not describing a feature related to
the detection of problems at run-time, only a
maybe-there-is-a-slight-improvement-in-power-after-all.
In an LR(0) state model, you might describe a state with a set of
dotted rules. For LR(1) etc, each dotted rule also has a lookahead,
but for simplicity I'll deal with LR(0).
If one of those dotted rules for the state has the following form...
NONTERM : a b C D e f.
then that indicates that the top of the stack contains those 6 tokens,
and that a reduction is possible which replaces those 6 tokens with
the single NONTERM token. The reduction size - 6 - is encoded in the
state model.
In normal LR, with a BNF form grammar, you always have a fixed reduce
length - a fixed number of tokens to pop from the stack, which is
encoded in the state model, and easy to determine using a dotted rules
based representation of the state model.
If you want to use EBNF rules to define your grammar, this basically
means that some right-hand sides are defined using regular expressions
over grammar symbols. The normal approach to handling this is IIRC to
translate the EBNF form into a BNF form. However, tabular LR offers an
alternative.
Basically, you can in principle have dotted rules such as...
NONTERM : A (b C D e)* f.
A standard LR parser cannot cope with this - it doesn't know how many
tokens to pop off of the stack in order to do the reduction - but a
tabular LR parser can derive that information from the table at
run-time.
Obviously, this approach has both benefits (state models should be a
little smaller) and costs (attribute evaluation may be problematic).
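For comparison, the conventional grammar-transformation route
mentioned above would rewrite the starred group using a fresh helper
nonterminal (REP is my name for it), something like:

    NONTERM : A REP f
    REP     : REP b C D e
    REP     :

which is plain BNF again, so every reduction has a fixed length, at
the cost of an extra nonterminal and an extra reduction per
repetition.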
>> 2. As I said, at best I re-invented it ten years after the fact.
>> Formal papers have already been published for tabular LR parsing.
>
>Indeed. But while they claim linear-time parsing for tables without
>conflicts, there is no claim for tables with conflicts.
*Tables* with conflicts? If this refers to the cycles within a column
of the table, then perhaps I am making a new claim, though it seems a
fairly obvious idea to me.
In the simplest form of the algorithm, you evaluate the input and
table backwards using a standard LR state model. It's a kind of
dynamic programming approach. Since it is evaluated one column of the
table (one token of the input) at a time, with all cells being fully
evaluated in each column before progressing, forward references are
always trivially resolved - the cells being referenced have already
been evaluated.
The only problem is the evaluation of the cells in a single column,
which may reference each other. The only "conflict" I can see,
therefore, is the possibility of cycles. Since the number of cells is
constant for a particular grammar, order of evaluation can be
determined (and cycles detected) in constant time.
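As a generic sketch (nothing specific to my table layout), that
per-column step is just a depth-first topological ordering with cycle
detection over a dependency graph whose size is fixed by the grammar:

    def order_column(deps):
        # deps maps each cell of the column to the cells (in the same
        # column) it depends on; cells outside the map are treated as
        # already evaluated.  Raises if the column contains a cycle.
        WHITE, GREY, BLACK = 0, 1, 2
        colour = {cell: WHITE for cell in deps}
        order = []
        def visit(cell):
            if colour.get(cell, BLACK) == BLACK:
                return                       # done, or already resolved
            if colour[cell] == GREY:
                raise ValueError("cycle detected in this column")
            colour[cell] = GREY
            for d in deps.get(cell, ()):
                visit(d)
            colour[cell] = BLACK
            order.append(cell)               # dependencies come first
        for cell in deps:
            visit(cell)
        return order

Since the number of cells per column is fixed by the state model, the
whole check is a (possibly large) constant amount of work per input
token.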
Perhaps you're thinking of conflicts in the grammar/state model. But
then the whole point of tabular LR (like backtracking LR and Tomita)
is to finitely resolve those conflicts/ambiguities that can be
finitely resolved. The only real differences between these, from this
perspective, are that tabular LR can detect cycles at run-time rather
than going into a loop, and that tabular LR can always give the
first/best parse in linear time (though enumerating alternative parses
obviously breaks the linear time claim).
>Knowing the full input up-front is no problem, but if you make a table
>specifically to this input, you must show that this can be constructed
>in linear time and that subsequent parsing using this table is also in
>linear time.
Absolutely true, and an area in which I fully acknowledge that some
variants of my algorithm "cheat a bit".
With input known fully in advance, the table is just a big array -
constant size per column, one column per input token. Trouble is, it's
a *big* array. Therefore, some variants look at ways to reduce the
memory requirement at any point in time, and the memory management and
lookup issues for that have not been properly addressed except to note
that it would be hard (perhaps impossible) to ensure strictly linear
time, though "practically" linear time shouldn't be too hard.
n log m (where m is the maximum number of table cells in memory at
once) should be easy. Unfortunately, you cannot decide m based on the
grammar alone. Therefore, for example, tree-based data structures are
not suitable for table-cell lookup. Hash-tables *might* work, but
while hash tables can give amortized constant time inserts, that's
only under certain constraints - when inserts and deletes are mixed in
such a way that the table grows substantially at times, and shrinks
substantially at other times, the analysis is a lot more complex.
>For just proving linear-time, big constant factors don't matter - as
>long as they are constants independent of the input size.
Yes - but big constant factors do affect whether something is
practical.
There's a common perception that linear time implies something faster
than, for example, n log n. That is only true in general for
sufficiently long inputs, and "sufficiently long" can be very very
long indeed: a linear method with a constant factor of 500 only
overtakes an n log2 n method once log2 n exceeds 500, i.e. for inputs
of more than 2^500 tokens. Furthermore, in practice, there is some
correlation
between length of input and grammar complexity. That is, if the
grammar is very simple, you are unlikely to be dealing with very long
inputs, and if the inputs are all very short, it's unlikely that they
have a complex grammar.
No-one is going to switch to using a linear parsing algorithm for
their parser when - for typical source files - it takes many times
longer than their existing non-linear parsing algorithm. This includes
me - and it's the major reason why I haven't developed the idea
further.
Comparing tabular LR with basic LR, my guess would be that tabular LR
would run between 100 and 1000 times slower for the same fairly
typical domain-specific-language scale grammar. As the grammar gets
more complex, the table gets many more rows to evaluate per token
parsed, so I could easily be underestimating by orders of magnitude.
Sure, tabular LR can cope with some things that basic LR can't, but
just how much am I willing to pay for those things?
Basically, tabular LR has scaling constants based on the number of
nonterminal tokens and the number of LR states, both of which tend to
increase the number of rows (cells per column) in the table. Where LR
parsing speed is only weakly dependent on the complexity of the
grammar (depending on the state model representation in the parser),
and backtracking LR is usually not much slower, tabular LR speed is
extremely sensitive to grammar complexity due to the evaluation of the
table.
I've not done any formal analysis (or even looked back at how the
table is structured), but my intuition suggests at least quadratic
time for each token parsed as the grammar complexity increases (and
thus both the number of nonterminals and the number of states),
basically due to a quadratically increasing number of rows in the
table. It could be much worse due to order-of-evaluation and
cycle-detection issues as each column is evaluated, since this also
has to be decided at run-time, and also because the number of states
probably increases polynomially with "grammar complexity", however
that is defined. So my best guess is "probably polynomial rather than
exponential, but pretty seriously bad". And I'm certainly not ruling
out exponential, either.
Anyway, I'm sure you can see how this can get very impractical, both
in terms of time and memory, even for quite modest grammars. Niche
applications with very large inputs and very simple grammars might
work, but I suspect that there are very very few applications where
the grammars are simple enough and the inputs long enough for it to be
worthwhile - as I said, there is some correlation between grammar
complexity and typical input size, and having simple grammars but
large inputs seems very niche indeed.
For the record, I mostly use Kelbt - a backtracking parser generator
that's a bit beta, really. Sometimes, simple mistakes cause core
dumps, for example. What's particularly strange is I don't even use
the backtracking features - LALR(1) works fine for what I need, and I
don't even miss precedence and associativity stuff that much. The big
gain for me is a lot of flexibility in how the generated code is
structured, since you basically insert generated code fragments into a
template that you write for yourself.
Since backtracking LR is overkill for what I need, it's fairly obvious
that designing and implementing a tabular LR parser generator just for
my few DSLs would be overkill on an epic scale.
>If you really have an algorithm for linear-time CF parsing, send a
>paper to a journal to get it reviewed. If you get it past the
>reviewers, I might be persuaded to read your article. There are just
>too many "I have this wonderful solution to (open problem), which I
>have described in (long and impenetrable text)" claims out there for
>me to want to read one more.
Well, it's not that long and impenetrable. Several posts in this
thread were probably longer and more impenetrable, but I'm not sure
how well I can present the table format in ASCII.
Anyway, considering that I can't really be bothered with it myself, I
certainly don't blame you.
Besides, as one of those who proposed a "wonderful solution to
linear-time parsing" back in 2002, with a long, vague, impenetrable
description of a broken algorithm - well, I don't exactly get to
complain ;-)
I have a lot of these ideas that get scribbled down, partly so that I
can forget them and move on. I have some old notes for "texture
mapping" that I wrote when I was maybe 17 - something in the order of
20 years ago anyway. I had never seen real texture mapping done at the
time, though I had seen some stuff where the graphics are only
projected for a fixed number of orientations. My idea then wasn't
texture mapping, as it turns out, but a limited form of ray tracing
done the hard way (simultaneous equations rather than vectors etc).
If anyone discovers all this junk after I'm dead, I'll probably be
seen as a kind of tenth-rate wannabe Da Vinci.