I believe it should be the latter. The language oriented-editor designers
should provide a technology base that may be taken as given by the language
designer who may then design languages without ugly lexical or syntactic
structures that make it impossible for the compiler to give sensible error
messages.
(For example, in the language I'm learning now, C++, if you forget to
declare a class, it is a *syntax* error. Not misuse of identifier or
undefined class, *syntax* error.)
(Define language-oriented editor to by the hypertext object-oriented
WYSIWYG formating editor or equivalent that you wish *you'd* written.)
Bill Smith
wsm...@cs.uiuc.edu
uiucdcs!wsmith
Don't blame C++. Blame C.
C is not context-free parsable [actually I believe it is, but not
with any sensible grammar].
--
Gordon V. Cormack CS Dept, University of Waterloo, Canada N2L 3G1
gvco...@waterloo.EDU gvco...@uwaterloo.CA gvco...@water.BITNET
Well, the *grammar* defining C programs is certainly context-free! Just
look in the back of K&R. This doesn't mean that any grammatically correct
program is semantically correct. I don't know of any non-trivial
language with a context-free grammar for semantically correct programs. (I
may be wrong here, but certainly this is correct for most languages.)
Actually, I've always thought parts of the grammar (e.g. type declarations)
were hacked to facilitate easy parsing.
Many compilers, however, do consider some semantic errors syntax errors,
because it's easier that way. As for the C++ syntax error described above,
that's compiler-dependent. MPW C++ tells me that the identifier is not
a valid type name.
Bob Hearn
he...@claris.com
I'm looking in the back of K&R (first edition), and if I'm not mistaken I'm
seeing context-sensitive stuff in the grammar.
Suppose you're looking at a compound-statement and you've just seen the `{'
opening it. Now suppose the next three tokens are `extern', "x", and `;'.
This is produced by a declaration-list, expanding to declaration, which expands
to decl-specifiers init-declarator-list[opt] ;. `extern' is an sc-specifier,
and decl-specifiers -> sc-specifier decl_specifiers[opt]. Now is "x" a
declarator (init-declarator-list -> init-declarator -> declarator ->
identifier) or a typedef-name (decl-specifiers -> type-specifier ->
typedef-name -> identifier)?
It's ambiguous unless you use non-context-free information to tell you whether
or not "x" was previously declared via typedef.
ajr
--
"The goto statement has been the focus of much of this controversy."
-- Aho & Ullman, Principles of Compiler Design, A-W 1977, page 54.
Enough of this bullshit! How many times do I have to say it??? The grammar
in K&R is context-free. Do you know what a context-free grammar *is*? You're
confusing the *class* of the grammar (ever heard of the Chomsky hierarchy?)
with what categories of parsers will work on it. You're also confusing
*syntactic correctness* with *semantic correctness*. This is a little more
acceptable, since, as I was saying, most compilers do to make parsing
easier.
Most C compilers use LALR(1) parsers (stands for "look-ahead left to right,
order 1), which are a subcategory of shift-reduce parsers. The 1 means they
can look at the next 1 token on the input stream when deciding whether to
shift or reduce. And you are quite correct in saying that, in this case,
that one token is not enough; you *do* need symbol table information. But
(1) No one says all context-free grammars have to be LALR(1) parsable
without context information, (2) no one says that the grammar in the back of
K&R is necessarily the one that's used in practice (grammars are often
tweaked (without modifying the language generated!) no make them easier to
parse), and (3) just because your compiler gives you a syntax error when
you do something like that, that doesn't necessarily mean your program
is syntactically incorrect. It just means that treating that kind of
semantic error as a syntax error makes the language easier to parse.
Now, before anyone else wants to tell me that K&R's grammar for C is not
context-free, will you *please* go and look up the definition of context-free
grammar??? Hint: anything that can be expressed in BNF is context-free!!
Bob Hearn
he...@claris.com
Uh, that's 'grammar'. :-)
Bob Hearn
he...@claris.com
Don't blame C either. The real cause of this error is a conspiracy
between the compiler writer and YACC. YACC produces this message
for every syntax error unless you specifically design your grammer
to catch a specific error and output a different message. The writer
of your compiler probably used YACC, and then didn't design his
grammer to do this in this case. He is probably not too much to blame
though. To design a languge to catch every possible generic grammerical
error would make the compiler so slow you might as well code in assembly.
enough from this mooncalf - Steven
----------------------------------------------------------------------------
To flame someone for bad spelling or grammer is a discouragement to net
discussion in a timely fashion.
----------------------------------------------------------------------------
These opinions aren't necessarily Motorola's or Remora's - but I'd like to
think we share some common views.
----------------------------------------------------------------------------
Steven R Weintraub cs.utexas.edu!oakhill!devsys!steve
Motorola Inc. Austin, Texas
(512) 440-3023 (office) (512) 453-6953 (home)
----------------------------------------------------------------------------
Bob, Bob, Bob.
Chill out, Dude. You're going to blow a brain vain if you're not
careful.
Now then.
The term "context-free", as Chomsky defined it is not very useful
to us computer types, because no one ever, ever writes grammars any
other way. To him, "context-free" meant that a rule could only
have one symbol on the left-hand side. "Context-sensitive" meant that
production rules could have strings of symbols on the left:
a X b <= a x y b
In the above, the a and b form a "context" for the reduction
X <= xy.
I haven't been following the discussion, but I suspect that the
correspondents were not alluding to the unuseful Chomsky definitions,
but instead were using the terms quite literally: "context-free" meaning
"context-free". I think that is a perfectly reasonable thing to
do, especially if one is unfamiliar with the confusing Chomsky
definition. (I never did like the practice of "defining" something
that already had a meaning in normal English.)
The grammar as published in K&R (both additions) requires
that the lexical analyzer must use left-context to determine whether
or not a token is recognized as a type-name identifier or a regular
identifier. In that sense, it is not "context-free".
I seem to remember that I once figured out that it is possible to
modify the grammar so that the scanner and parser do not have to know
about left-context. But I think I remember that such a parser would
require a two token-lookahead, making it impossible to use with most
compiler-compilers.
From article <890317222...@explorer.dgp.toronto.edu>, by fl...@dgp.toronto.edu (Alan J Rosenthal):
>
> In article <90...@claris.com> he...@claris.com (Bob Hearn) writes:
>>Well, the *grammar* defining C programs is certainly context-free! Just
>>look in the back of K&R.
>
> I'm looking in the back of K&R (first edition), and if I'm not mistaken I'm
> seeing context-sensitive stuff in the grammar.
>
> [Stuff deleted about ambiguous derivation sequence.]
>
> It's ambiguous unless you use non-context-free information to tell you whether
> or not "x" was previously declared via typedef.
I have to agree with Bob on this one, but for a different reason.
The grammar in the back of K&R certainly is context-free... but it
is also ambiguous. To paraphrase Aho & Ullman's "The Theory of
Parsing, Translation, and Compiling, Volume I: Parsing", page 91:
A grammar is Context-free if every production has a single
nonterminal on the left and a sequence of 0 or more terminals
or nonterminals on the right.
It is obvious that the grammars in both editions of K&R fit this
description. Thus they are context-free.
Now on to ambiguity. According to Aho & Ullman (same as above), pagee 143:
We say that a CFG G is ambiguous if there is at least one sentence
w in L(G) for which there is more than one distinct derivation tree
with frontier w. This is equivalent to saying that G is ambiguous
if there is a sentence w in L(G) with two or more distinct leftmost
(or rightmost) derivations.
The classic example of ambiguity is the "dangling else" problem, as
illustrated by the grammars in both editions of K&R.
So we see that K&R's grammars are ambigous context-free grammars. That's
no big deal--the ambiguity doesn't affect the language generated by the
grammar. The problem comes in when compilers base the semantic meaning
of a sentence on the derivation sequence the grammar uses to generate
that sentence. For some reason :-) compiler writers like to choose
deterministically between competing derivation sequences. Thus they
either statically choose the proper derivation sequence (eg. the "else"
always matches the last else-less "if") or they use run-time information
to help them choose the "proper" derivation sequence.
--
Norman Graham Oklahoma State University
Internet: nor...@a.cs.okstate.edu Computing and Information Sciences
UUCP: {cbosgd, ihnp4, 219 Mathematical Sciences Building
rutgers}!okstate!norman Stillwater, OK 74078-0599
Precisely! But, gee... I've considered myself a computer type for, oh, 12
years or so, and it's useful to me. How you write grammars depends on
what you're going to use the grammars for. I don't think anyone could
write a compiler, for example, without being at least familiar with
terms of formal language theory like "context-free" and "context-sensitive."
>
>I haven't been following the discussion, but I suspect that the
>correspondents were not alluding to the unuseful Chomsky definitions,
>but instead were using the terms quite literally: "context-free" meaning
>"context-free". I think that is a perfectly reasonable thing to
>do, especially if one is unfamiliar with the confusing Chomsky
>definition. (I never did like the practice of "defining" something
>that already had a meaning in normal English.)
If you *had* been following the discussion, you'd have seen that my original
posting was a followup to someone complaining about how screwed-up C's
grammar was, that it wasn't context-free. When someone uses the words
"context-free" and "grammar" in the same sentence, I assume he knows what
he's talking about, and what he said was incorrect... K&R's grammar for C
*is* CF, and not particularly screwed-up. I've received no end of mail
and flames for that simple correction! As for "context-free" already
having a meaning in normal English, I disagree. One could say that
you need context to parse C, but if you use the phrase "context-free" in
the realm of computer science, it's because you've heard it before, and
you should find out what it means before using it.
Bob Hearn
he...@claris.com
P.S. I'll chill when I stop getting flamed.
Oh. That's nice to know. From Scott, who is looking over my shoulder:
"Why are even bothering to argue with these idiots?"
Good question.
Bob Hearn
he...@claris.com
>
> If you *had* been following the discussion, you'd have seen that my
> original posting was a followup to someone complaining about how
> screwed-up C's grammar was, that it wasn't context-free. When someone
> uses the words "context-free" and "grammar" in the same sentence, I
> assume he knows what he's talking about, ...
What a silly assumption. I almost never assume that people who post
to the net know what they are talking about. But what you really
mean, I think, is that you want to use the term "context-free" exactly
the way Chomsky did. Okay, that's your right. But don't be surprised
or offended if someone doesn't know it. Just explain.
I taught two terms of formal language theory during a two year stint
as Visiting Associate Prof at Ohio University. I'm sure I brought the
subject up, but only to try to avoid the kind of discussion that is
going on here. We never talked about (Chomsky) context-sensitive
grammars again. I suspect that for most students, that small piece of
information got lost along the way.
Context-sensitive grammars, (in the Chomsky sense), are just not useful.
I really don't know why people continue to use his definitions.
I just now pulled a new compiler book off the shelf, completely at random.
It does pay lip-service to the Chomsky definitions. But of its 811 pages,
it devotes less than one quarter of page 97 to context-sensitive
grammars! First it gives a very brief definition, then the following:
While context-sensitive grammars and type-0 grammars are more
powerful than context-free grammars, they are also far less
useful. The problem is that efficient parsers for these extended
grammar classes do not exist, and without a parser there is no
way to use a grammar definition to drive a compiler.
[ Two sentences about parsers for context-free grammars omitted... ]
Whenever we mention a grammar (without saying which kind), the
grammar will be assumed to be context-free.
> [...]
> I've received no end of mail and flames for that simple correction!
Take my advice. Quit worrying about the flames you receive. Flamers
are small minded bigots, pedants, and idiots. Just ignore them.
(Or better yet, *apologize* sincerely. God, that irritates them!)
Dave J.
P.S. Spring has sprung. I probably won't be around the net much
in the next few weeks.
Yeah, here is my regular C grammar (a regular expression, really):
c :
| letter c
| digit c
| /* etc. */
;
Okay, it matches a lot more than C, but all legal C programs are acceptable.
So are all programs in all other programming languages. Useful!
The grammar in K&R is in the same category. It accepts all legal C programs,
plus a lot more. Because of current compiler technology we use an artificial
separation between syntax and static semantics. Static semantics can be
expressed in more powerful grammar types (2-level van Wijngaarden grammars etc.)
Your note is correct, but so is the observation that some context-sensitive
features are suggested by using different names for the same terminal, nl.
identifier. K&R use non-terminal names judiciously to suggest more than they
are really telling.
C can not be described exactly by a CFG, so C is not a context-free language.
Usually this type of confusion starts because one party is talking about
a grammar, whilst the other party is really talking about (an exact grammar of)
the language.
Ge' Weijers, KUN Nijmegen, the Netherlands
g...@cs.kun.nl
In my modest opinion, Bob Hearn is doing a fine job of entertaining us
with his emotional outbursts, but a considerable disservice with his
claims. Likely, Cormack is correct and Hearn is wrong. Likely
C's grammar is not Context Free.
I believe Cormack (in his academic way) has proven Hearn wrong. I believe
C's grammar has the appearance of a context free grammar but that it
violates the mathematical def of context free.
This question (is C's grammar context free) is
important and we should get a clear answer to it rather than outbursts.
Could someone please settle this definitely?
Ric Holt
I haven't programmed in C, but I do have a degree in Math, and have read
the source of some compilers, as well as portions of Aho and whozits.
So please take this as a real request for information.
As I read your message, and ponder, it seems to me that the implication is
that, in C, the user can define new Terminals at whim. (or alternately it
is infinite to begin with :-) ).
If this is, in fact, the correct implication, would you send a trivial
example?
--
--------------------------------------------------------------
Whatcha call a boomerang that doesn't come back? A stick.
1-(512)-346-5781 (v) using Modula-2. enquiries welcome.
Austin, TX 78759 ...!cs.utexas.edu![oakhill|kvue]!val!aubrey
OK, OK, I'm sorry. I'm new to the net, and I wasn't expecting to get flamed
*quite* so much for one simple correction... it just became very annoying.
I'll try to restrict my responses to computer science from now on.
>I believe Cormack (in his academic way) has proven Hearn wrong. I believe
>C's grammar has the appearance of a context free grammar but that it
>violates the mathematical def of context free.
>
>This question (is C's grammar context free) is
>important and we should get a clear answer to it rather than outbursts.
>Could someone please settle this definitely?
> Ric Holt
Good idea. Although I think this whole argument is founded on
misunderstandings. It all depends on what you mean by "context-free
grammar." If you follow the technical definition, then there is no
question that C's grammar (as specified in K&R) is context-free. As
for Cormack's claim that C does not have a finite set of terminals, I
can't say that I really understand what he means. Perhaps he is
referring to the fact that there is an infinite number of identifiers?
Well, if we restrict ourselves to K&R, then the terminal in question is
simply (identifier), and it does not decompose further. If we want
to make this grammar more concrete and have it generate the actual
language that a compiler accepts, then we would have identifiers producing
strings over the char. set. In this case, the char. set is the set
of terminals, and it is certainly finite.
BTW, the real argument here is not whether C's grammar satisfies some
mathematical definition. It does, but it is not a particularly relevant
piece of information. The only reason I claimed it *did* was that
someone, a long time ago, made some comment on how screwed up C's grammar
was, and that it wasn't context-free. Well, as I don't think C's grammar
is particularly screwed up relative toother languages' grammars, I
responded. I happen to like C, and I don't like C-bashing. Guess I
learned my lesson, huh?
Bob Hearn
he...@claris.com
I detect that we are picking nits here. I think that we all know what is
meant by the claim that the K&R grammar is not context-free. It has been
noted that the K&R grammar is *ambiguous*; hence, its specification of
how programs should be parsed is *ill-defined*. Additional
"context-sensitive" specifications are added to the K&R grammar in the form
of English prose (in the second edition, anyway; the first edition
seems to ignore the ambiguity problem).
Effectively, the grammar in K&R is "BNF + English prose". Taking this
view, the K&R grammar is not context-free.
Steve Vegdahl
Computer Research Lab
Tektronix Labs
Beaverton, Oregon
A context-free grammar G is a 4-tuple G = <N,T,P,S> where N is a finite
set of nonterminal symbols, T (the alphabet) is a finite set of terminal
symbols, P is a finite set of production rules of the form A -> X1 X2 X3 ...
where A is in N, and X1, X2, X3 are in Union(N,T). S in N is the Start symbol.
The K&R grammar does not have a finite alphabet.
I agree, that this is founded on misunderstandings, but I think you've
got the wrong misunderstanding. I believe that C's grammar is _not_,
in fact context free, though the grammar in K&R is. How can this be?
I'm glad you asked. The grammar given in K&R isn't technically
correct. Don't take offense, I'm not making a judgement on this
practice, just pointing it out.
One inaccuracy is in the production
typedef-name : identifier
I don't know if there are others. "typedef-name" occurs in the production
type-specifier: ... | typedef-name
and a mechanical reading of this implies that _any_ identifier can
occur as a type-specifier. This is not the case. In fact, an
identifier can only be a type-specifier if it has previously been
declared with a typedef. I can't prove it, but I believe you need a
context-sensitive grammar to specify this.
So both claims made in this argument are correct:
(1) the grammar of C is not context free
and
(2) the grammar given in K&R is context free
Yes, this is another big misunderstanding in this discussion... I would
say that the *grammar* stipulates that a type-specifier can be an
identifier, but the *semantics* specifies that this results in a valid
program only when the identifier is a valid type name. This is really
the same problem as
{
.
.
x = 1;
.
.
}
which is correct only if x has been previously declared. I think here
one would always call this an undeclared identifier error, a type of
semantic error, rather than a syntax error. And you're quite right that
it takes a context-sensitive grammar to generate the set of VALID C programs
in this sense! My point is that this is true for just about any language
you can name, and there is nothing particularly odd about this aspect
of C.
In any case, this discussion is getting old... I vote we talk about something
new. This is comp.lang.misc, not comp.lang.C.grammar, right? :-)
So, anyone designed any good languages lately?
Bob Hearn
he...@claris.com
It sounds as though you are defining the semantics of a language as
any part that cannot be described with a context free grammar. Of
course this makes you trivially correct, but does not affect the
opinions of people who criticize C based on a different definition of
syntax and semantics. Also, you cannot treat the C grammar as a
formal grammar that is only meant to accept strings from a formal
language. The grammar is intended to show syntactic categories with
semantic attachments.
The problem of undeclared identifiers is common to many programming
languages, but the typedef problem is more serious. An identifier is
always parsed into category "identifier" regardless of whether it has
been declared. However, typedefs are ambigous. For example:
typedef int foo;
int i;
void f()
{ foo *i;
...
}
The K&R grammar is ambigous on how to parse this, while any C
programmer would expect "i" to be a local pointer to an int. Once an
identifier has been declared as a type, it should no longer be an
identifier, but rather have a different syntactic class. No, this
does not affect the language accepted by the grammar. But it does
affect the syntactic category of the whole string "foo *i;".
Because of the differences in syntactic categories, most people would
say that
static foo x;
should be a syntax error if foo has not been typedefed. You can argue
that, but like I said before, you are only defining terms to fit your
views.
>In any case, this discussion is getting old... I vote we talk about something
>new. This is comp.lang.misc, not comp.lang.C.grammar, right? :-)
The following should not be taken as a flame, I'm just making an
observation: If you want to stop the discussion, you don't have to
reply. It doesn't work to write an article and then suggest that no
one respond to it after you have gotten in the last word.
Clever thought, but wrong. The *grammar* in K&R only talks of terminals
which are taken from a finite set of lexical types, not of the inifinite set
of possible lexemes whose structure is defined within the surrounding text
but not in the grammar.
In as much as it is given, the grammar for C in K&R IS context free; some of
the stuff defined in the text around it might not be (e.g., lexical
analysis), but the whole issue isn't worth this network bandwidth. C is
easily parsed using any of LL(1), LALR(1), etc., with only the usual lexical
trickery (AKA the usual symbol table mechanism).
This is an interesting point (ie. one I hadn't thought of before :-).
In theory, a grammar is a generator for strings in a language, not
a recognizer for a language (eg. various forms of automata). So
the grammar in K&R is context-free, but it doesn't generate the C
language--in addition to valid C programs it also generates invalid
C programs. Of course, when this grammar is used as the basis of
a compiler, the semantic analyzer should detect these invalid
programs. This makes sense to me... how about anyone else out there?
I'm sure that Gordon Cormack has something in mind here because I know
he's a smart guy, but I'm darned if I can figure out what it is. In
any case, I'm sure that the K&R grammar only uses a finite subset of its
alphabet, because otherwise it wouldn't fit in my desk.
So Gordon, if your still there after being insulted in such an uncalled for
way, please enlighten us.
As I recall, the grammar in K&R [1978] has a single terminal "identifier"
which includes all identifiers, a nonterminal called "typedef_name"
defined by the production
typedef_name --> identifier
This leads to an ambiguity because phrases like
t * i ;
can be understood to be either declarations or statements. But ambiguous
grammars have been with us a long time. People even use parser generators
that accept ambiguous grammars! In fact there are other ambiguities
in the grammar (e.g. the dangling else). Ambiguities don't change the
set of strings admitted by a grammar they merely admit some strings in
more than one way.
The real question is what do you want to call syntax and what do you
want to call static semantics. In the absence of any definitive definition,
this is a matter of taste and convenience to the implementor. Is
the following (complete) compilation unit syntactically correct?
f() { t * i ; int a; }
The grammar says yes; most compliers say "syntax error". If we
change the grammar to make typedef_name a terminal and remove its
only production, then the grammar agrees with the compilers.
The grammar is of course still context free, but the ambiguity is
gone.
Ric Holt raises the separate question of whether the C language is
context free. The syntax of C certainly is. The problem with
C is that it is not "feedback free" in the following sense. The
parsing of C depends on the lexing of C (this is only natural),
but the lexing depends on the parsing. Of course, with the original
K&R grammar, C _is_ feedback free which is no doubt why Ritchie
wrote the grammar that way.
Summary:
-The K&R [1978] grammar is context free, but describes the wrong language.
-The C language syntax (abstracted from lexing and static semantics)
is also context free.
-The usual implementations and most useful definitions of C language
are not "feedback free".
-We're still in the dark about infinite alphabets.
Theo Norvell
(The term "feedback free" is used in a completely different sense by Holt.
My apologies for stealing it. By the time I realized I was using it
differently, I liked it too much to give it back. In Holt's sense, C is
feedback free. See the Turing Programming Language Design and Definition
for his definition. In fact, see it anyway, if you're interested in
design or definition.)
I think the crux of the argument of the non-context-free crowd is that
the grammar in K&R does NOT actually specify the C language by itself, not
as a grammar. It relies on the trick of telling the lexer about typedefs,
which is a form of back-door context.
Consider: if you wrote a YACC specification of C according to the context-free
grammar, but omitted any kind of action clauses (perhaps you want to turn on
debugging and watch the state machine play, a fine entertainment for a rainy
day), it would NOT be able to correctly parse a C program with typedefs (unless,
of course, you hid a "real" C grammar (with actions and the ever present
back-door gimmick) in the lexer itself).
When the pro-context-free crowd's argument is presented as "So what; it is
close enough, and it works", the argument in convincing. When it is presented
as "Yeah, well, your mother wears army boots!", it isn't.
--
John Woods, Charles River Data Systems, Framingham MA, (508) 626-1101
...!decvax!frog!john, jo...@frog.UUCP, ...!mit-eddie!jfw, j...@eddie.mit.edu
Remainder Khomeini!
First, our definition:
In article <12...@watdragon.waterloo.edu> gvco...@watdragon.waterloo.edu
(Gordon V. Cormack) writes:
>A context-free grammar G is a 4-tuple G = <N,T,P,S> where N is a finite
>set of nonterminal symbols, T (the alphabet) is a finite set of terminal
>symbols, P is a finite set of production rules of the form A -> X1 X2 X3 ...
>where A is in N, and X1, X2, X3 are in Union(N,T). S in N is the Start symbol.
In article <91...@claris.com> he...@claris.com (Bob Hearn) writes:
>... I think this whole argument is founded on misunderstandings.
It is.
>It all depends on what you mean by "context-free grammar."
Most people mean the definition above.
>If you follow the technical definition, then there is no
>question that C's grammar (as specified in K&R) is context-free.
Aha! But the grammar in K&R 1st ed. is incorrect.
In article <98...@megaron.arizona.edu> gud...@arizona.edu (David Gudeman)
writes:
>One inaccuracy is in the production
> typedef-name : identifier
>I don't know if there are others.
(I recall a correction or two elsewhere, but my K&R is in my office and
I am not.)
>So both claims made in this argument are correct:
>(1) the grammar of C is not context free
>and
>(2) the grammar given in K&R is context free
Returning to <91...@claris.com> (Bob Hearn):
>... As for Cormack's claim that C does not have a finite set of terminals,
(this refers to the line `The K&R grammar does not have a finite alphabet.'
in <12...@watdragon.waterloo.edu>)
>I can't say that I really understand what he means. Perhaps he is
>referring to the fact that there is an infinite number of identifiers?
Rather, that there are an unbounded (not infinite, although it comes
out the same) number of type declarators, via the `typedef' mechanism.
>Well, if we restrict ourselves to K&R, then the terminal in question is
>simply (identifier), and it does not decompose further.
(He then suggests splitting it into letters, etc., a simple mechanical
task.)
Unfortunately, the terminal in question is the `identifier' in
`typedef-name : identifier'. This is, as has been noted, inaccurate.
The key is that only those identifiers which (a) have been previously
defined via `typedef' and (b) have not been hidden by new local
declarations are in fact typedef names. This information can only be
found by context (oops, there is that word again), and if the size of
the input string (C code) is unbounded, there is no way to capture this
in a CFG (using our definition above). If we limit the length of the
language to be recognised---to any fixed number of symbols, as long as
it is fixed---we can construct a CFG for this subset of C.
Fortunately, none of this really matters.
As someone else pointed out (in an article I forgot to save), we
generally do not write grammars that accept only strictly correct
inputs. Instead, we take some shortcuts and patch things up via
semantics. This approach works quite well for C, although a number of
compilers do it wrong (including PCC) and patch the grammar by having
the lexer return `type' instead of `identifier' when the identifier has
been defined via typedef. In particular, this prevents
typedef int temperature;
f() { auto double temperature; ... }
from compiling correctly. (PCC does handle
f(temperature) double temp; { ... }
correctly, by turning off typedef name recognition in the lexer during
argument parsing. It could do the same after type keywords in local
variable declarations, and I am not sure why it does not---it may have
done a lookahead already, perhaps. But that is not the only place
PCC has been seen to fall short....)
>someone, a long time ago, made some comment on how screwed up C's grammar
>was, and that it wasn't context-free. Well, as I don't think C's grammar
>is particularly screwed up relative to other languages' grammars, I
>responded. ...
Strictly speaking (as long as you stick with `typedef produces a new
type specifier'), C's grammar is not context free; but it is indeed
not seriously `screwed up'. Perfect generators are not necessary, and
a language `close enough to C' exists that can be parsed simply.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: ch...@mimsy.umd.edu Path: uunet!mimsy!chris
Our `impure' CFG includes (among others) the following productions
(rearranged and simplified to make the problem obvious):
statement : '{' optional-declarators statement-list '}'
statement : expression
declarator : typedef-name '*' identifier // i.e., a pointer
typedef-name : identifier
expression : identifier '*' identifier // i.e., multiply
Thus, there are two correct parses for the input string
{ t * x; }
One declares the variable `x' as a pointer to the typedef'd type `t';
the other multiplies the value of `t' by the value of `x'.
It seems our impure C CFG is ambiguous.
Well, we never said otherwise....
As a practical matter, we want the compiler to choose the declarator
parse if and only if `t' is in fact an active typedef name. Fortunately,
there are a number of ways to gimmick a parser to do this.
The difference I see is that a C grammar that accepts undeclared typedef
names destroys more of the structure of the grammar than a Pascal
grammar that accepts undeclared variable names. Consider the following
C expression:
x = (a)(b);
If "a" is a typedef name, then the right side is a cast. If "a" is
not a typedef name, then the right side is a function call.
I dont believe similar situations exist for Pascal (corrections welcome).
As G.V. Cormack notes, any language can be parsed by some context free
grammar; we are interested in grammars which represent the semantic
structure of the language as closely as possible.
About "C's grammar [not] being screwed up": with a few simple kludges
such as recognizing typedef names in the lex, it is fairly
easy to build a grammar which has sensible structure for C.
Or is it? Some points were raised about where PCC does/doesnt turn off
typedef recognition, and this seems ill-defined.
(Anybody know what the ANSI draft says?)
If having parts of your language ill-defined does not bother you,
consider trying to build a suffix or reverse parser for C. Such
a parser might be used in a syntax directed editor, or for syntax
error recovery. Such parsers cannot preserve the structure of
a C grammar because they do not have available the information required
to tell if an identifier is a typedef or not.
Andrew K. Wright akwr...@watmath.waterloo.edu
CS Dept., University of Waterloo, Ont., Canada.
In article <12...@watdragon.waterloo.edu> akwr...@watdragon.waterloo.edu
(Andrew K. Wright) writes:
>The difference I see is that a C grammar that accepts undeclared typedef
>names destroys more of the structure of the grammar than a Pascal
>grammar that accepts undeclared variable names. Consider the following
>C expression:
>
> x = (a)(b);
>
>If "a" is a typedef name, then the right side is a cast. If "a" is
>not a typedef name, then the right side is a function call.
>I dont believe similar situations exist for Pascal (corrections welcome).
I am not sure what you mean by `destroys more of the structure of the
grammar'. It is ambiguous, and in particular it is an ambiguity of an
uncomfortable sort, quite unlike the shift/reduce ambiguity for if/then/else
which everyone resolves by taking the shift. Yacc permits this reduce/
reduce conflict, but resolves it by taking the first grammar rule; we
want the parser to take the *appropriate* grammar rule, which is sometimes
the first and sometimes not. So yacc unaided will not produce a
deterministic parser that does what we want.
BUT all this applies to constructing a deterministic LALR(1) parser,
which is not at all the same thing as having a context free grammar!
(These comments are based on the July 1988 draft.) The standard allows
typedef names to be redeclared within an inner scope:
typedef int x;
{
char *x;
}
The only restriction is that the type must be explicily given. That is
you cannot redeclare it like:
static x;
One solution is to define the identifier in the declarator to
be any identifier:
any_ident: Identifier | Typedef_name;
rather than turn off typedef recognition in the lexical analyzer. This is
also needed for labels. However, this simple change to the grammar
presented in the standard does not result in an LALR1 grammar. I found
the transformations required to get a grammar that yacc would accept
quite difficult to find and the result was not very elegant. So my
answer to the question "Or is it?" is NO!
Ken Walker / Computer Science Dept / Univ of Arizona / Tucson, AZ 85721
+1 602 621 2858 kwa...@Arizona.EDU {uunet|allegra|noao}!arizona!kwalker
For a new approach to this, look at:
"The Revised Report on the Algorithmic Language Algol 68".
- van Wijngaarden et al
This approached this problem by defining the grammar using another (context-free)
grammar (hence the term 'two-level grammar'). This meant that things like not using
a variable before it was defined were specified in the *grammar* - very nifty.
To do this, the second level grammar had a non-terminal called NEST which expanded
to sequences like "int a char b" and so on. Productions of the first level grammar
had NESTs scattered in them (translating to C):
TYPE NEST variable : TYPE NEST IDENTIFIER, where TYPE IDENTIFIER declared in NEST .
(comma means "and"). The production
where TYPE IDENTIFIER declared in NEST
expands to "empty" if the variable was declared, and to a nonterminal with no
production rule if it isn't - thus the grammar can't generate a program with an
undeclared variable. (Algol 68 had user-definable types, like C).
They did all kinds of things with this - type checking of assignments and operator
arguments, and context sensitive automatic type conversion come to mind.
Pity it was so hard to parse :-)
--
Clive D.W. Feather cl...@ixi.uucp
IXI Limited ...!mcvax!ukc!acorn!ixi!clive (untested)
+44 223 462 131
Of course these two-level grammars are *substantially* more powerful
than context-free grammars. Two-level grammars are capable of generating
recursively-enumerable sets (ie. those things that can be recognized only
by something with the power of a Turing machine). This class of languages
is much larger than both the Context-Free and Context-Sensitive languages.
However, Two-level grammars can also be used to generate the Context-Free
and Context-Sensitive languages that we use for programming. Two-level
grammars provide a compact and elegant (some would say 'cryptic' :-)
notation for the description of programming language syntax without
the need to escape to a natural language. Personally, I would love
to see more language designers take advantage of this tool.
--
Norman Graham Oklahoma State University
Internet: nor...@a.cs.okstate.edu Computing and Information Sciences
UUCP: {cbosgd, rutgers} 219 Mathematical Sciences Building
!okstate!norman Stillwater, OK 74078-0599
For practical parsing (using, e.g., yacc), yes; for CF-ness, no. The
*real* crux of the argument is that while you can write a decent, simple,
no-feedback-through-lexer CFG that accepts all C programs, you can *not*
write a decent, simple, no-feedback LALR(1) parser to do this, because
the simple CFG is ambiguous.
Those who say `C does not have a CFG' are really saying `C does not
have a trivially yacc-able grammar', which is true (but, as I keep
saying, really means only that we need a better parser generator than
yacc).
(And no, it does not take a nondeterministic automaton to parse C. The
little back-door trick would work wonderfully well if only you could
know, each time a yacc parser examines a token, which rule yacc was
attempting. When it tries `typedef-name: identifier', the lexer should
answer `identifier' if and only if the identifier happens to be a
typedef. Then make sure that yacc tries this rule before trying
`primary: identifier' [this part is easy] and voila! you have a
parser. Of course, yacc actually only calls the parser once for both
rules; you would need a way to change the token on the fly. A better
gimmick---less expensive in compute power, at least---would be to
associate a function (predicate) with each rule, and allow the
predicate to abort the rule. You then attach an `identifier-really-is-
typedef' predicate to the `typedef: identifier' rule, making the parser
pass over this resolution and choose instead the `primary: identifier'
rule. Indeed, the same trick would allow one to reject undeclared
identifiers, making `program foo; begin bar := 1 end.' a syntax error
instead of a semantic one---not that this is a good thing to do!)
gvcormack points out (as I conveniently forgot to mention) that this
is actually an affix grammar, and affix grammars are not context free.
I think this is a good point for a summary.
- There is no context free grammar that recognises only valid
C programs (but this is true of many languages).
- There is a trivial context free grammar that recognises all
valid C programs, along with a number of invalid ones, but it
is ambiguous. There is probably no unambiguous CFG for C
(consider arbitrarily many parentheses around an id: is it a
cast or a value?).
- There is a trivial affix grammar (gimmick B above) that
recognises all valid C programs, along with a number of
invalid ones, which is not ambiguous and can be handled with
a modified yacc (you would have to augment yacc's output
tables and parser to retain both reductions and to know what
to call to select which is to occur).
and of course,
- There is always the wrong way. :-)
No, I think there is a real difference in "context-freeness" between
Pascal and C. In C compilers that are widely considered to conform to
K&R, the syntax accepted by the parser is actually modified when a typedef
is encountered and, furthermore, there is no better way to handle the
typedefs. To elaborate:
In Pascal, there is only one way to interpret a given statement or declaration,
regardless of any previous declarations.
Consider the following in C, however:
. . . { T (x); . . . }
How should the construct be parsed? Is it the declaration of
a variable x of type T (where T is a previously typedef'd identifier)?
Or is it a call to the function T with argument x?
We can't even distinguish between an executable statement and a declaration
in C without resorting to context. This makes C fundamentally
less context-free than Pascal (or FORTRAN or Ada, for that matter). The
difference can be put in practical terms by noting that it is impossible to
write a processor for C that accepts exactly those programs that would not
be considered a "syntax error", unless you put a symbol table into the
program, to remember the context of typedefs. No such requirement holds for
Pascal. For example, a program that counts executable statements in C
programs must have a symbol table for typedefs, with all the extra complexity
that implies. The corresponding program for counting statements in Pascal
is much simpler. The same would hold for a cross-reference program and any
number of other program analyzers.
To summarize, the only way you can parse C with a context-free parser is if
you don't care whether a declaration is confused with an executable statement.
This utterly destroys the usefullness of such a parse, as far as I'm concerned.
--
Michael Condict {att|allegra}!m10ux!mnc
AT&T Bell Labs (201)582-5911 MH 3B-416
Murray Hill, NJ
>Consider the following in C, however:
> . . . { T (x); . . . }
>How should the construct be parsed? Is it the declaration of
>a variable x of type T (where T is a previously typedef'd identifier)?
>Or is it a call to the function T with argument x?
...
>To summarize, the only way you can parse C with a context-free parser is if
>you don't care whether a declaration is confused with an executable statement.
>This utterly destroys the usefullness of such a parse, as far as I'm concerned.
That may destroy the usefulness in your mind, but that does not make it
non-context-free. To make a language recognizer (whose sole purpose
in life is to say "yes" this is a valid program, or "no" it is not)
you only need say yes or no for the whole program, not say what piece
is which. C may very well be not context free, or not easily context
free, but this is based on things like making sure variables are declared,
the right type, etc. (i.e. only allow p->q->w if q is a pointer member
of a structure type p is a pointer to, and the type q points to has a
member w...)
The suggestion you make, of a rule like:
decl_or_fcall : identifier '(' identifier ')' ';' ;
to spot the case you mention works well. In fact it is a useful tool
for doing things like allowing forward-referenced typedefs.
>Michael Condict {att|allegra}!m10ux!mnc
--
Marc Mengel mme...@cuuxb.att.com
attmail!mmengel
...!{lll-crg|att}!cuuxb!mmengel
The parsing is always the same, from the point of view of a
context-free grammer. It's the semantics that are different. That you
assign a different meaning to the language construct based on what has
come before has nothing to do with its context-free grammar.
Context-free grammars don't care about the declaration of type names.
If they could, they wouldn't be context-free.
The confusion here arises because real languages like Pascal and C
simply *cannot* be completely defined using context-free grammars. But
they are--because the definition we use is incomplete, and we fill in
the missing parts with a set of additional rules that are not part of
the context-free grammar. The size of this set of rules may or may not
be greater for C than for Pascal.
--
Rahul Dhesi UUCP: <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi
ARPA: dh...@bsu-cs.bsu.edu
x y z;
--
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Business: uunet.uu.net!ficc!peter, pe...@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, pe...@sugar.hackercorp.com.
It depends on which context-free grammar you are using. Here's a
simple one:
program ::= char charlist
charlist ::= .null. | charlist char
char ::= 'a' | 'b' | ... '0' | '1' | ... '#' | '(' | .... blank | tab ... etc.
(.null. means empty string.)
My lexical analyzer always returns a single character or EOF. My
parser is pretty simple. There is only one possible error during
parsing: "premature EOF".
Semantic analysis is a nightmare :-).
Everyone seems to have missed my point on this, because I used a bad example.
Consider instead the one that someone else posted:
{ T * x; . . . }
In this case (which could be the declaration of a var x of type (T *) or
could be the expression "T multiplied by x"), the parsing most certainly should
not be the same for either case, at least not if you intend to do anything
useful with the parse. One of the following two different parse trees
should be produced:
* decl
/ \ / \
T x T *
|
x
What you are ignoring is that very few parsers stand alone by themselves. They
are used in a program that wants to do something with the language. The
fundamental difference between C and other languages is that because of the
above bogosity, there is much less you can do with the language without
including a symbol table for typedefs. I can't believe this is even open
for argument. Try writing programs for C and Pascal that delete all executable
statements, printing only the declarations. The C version will have to have
a symbol table -- the Pascal version will not. Isn't that perfectly clear?
Of course we all know that no useful programming language, including C,
Pascal, FORTRAN, Ada or LISP, is context-free when we consider the exact
set of programs that are accepted by the compiler, but that was not my point.
--
Michael Condict {att|allegra}!m10ux!mnc
We are not ignoring this. The topic was `context free grammars', not
`compilers and parsers'. If you want to argue that one cannot build
a useful compiler using a CFG approximation for C, that is fine (since
we have seen that this is true). Just do not do it by saying that
there is no context free grammar that recognises all valid C programs.
There is no context free grammar that recognises all valid C programs.
Several people have pointed out that your artificial definition of
syntax (anything that can be described with a context free grammar)
begs the issue. By my definition of syntax, and aparently that of
most other posters, C's syntax is not context free.
"Context free grammar" and "recognize" have precise technical definitions;
look them up. By those definitions, there are context free grammars which
will recongnize any valid C program. The fact that the grammar may
not produce a parse which matches your definition of the syntax of the
program does not invalidate the truth of that statement. As you point
out, one should not confuse grammar and syntax.
The statement you want is: there is no unambiguous context free grammar
which will generate a parse which matches your syntax for all C programs.
That may be rather long winded, but at least it is accurate.
As far as I know, "context free syntax" does not have a widely recognized
meaning (can anyone come up with a reference to disprove this?). Clearly
you can define it in such a way that it would be correct to say that
C does not have a context free syntax. However, people will confuse the term
with "context free grammar", so I wouldn't suggest using it. It is better
to make a long winded explaination of the problem with C's syntax.