
Is Python context-sensitive?


Michal Nazarewicz

Oct 6, 2009, 5:45:09 AM
Hello everyone,

I've recently participated in a discussion of whether the Python
programming language is context-sensitive or context-free. For those not
familiar with Python: it uses indentation as a way of separating blocks
rather than structural characters. For instance, this is how a piece of
Python code might look:

#v+
statement
if expression :
    statement
    statement
    if expression :
        statement
    else :
        statement
    statement
statement
statement
#v-

If I'm not mistaken, if we were to construct a grammar with the following
list of terminal symbols: { if, else, statement, expression, ":", " ",
"\n" } which would accept fragments such as the one presented above, it
would have to be a context-sensitive grammar, wouldn't it?

Now, the argument of those who claimed Python is context-free is that
the most widely used Python implementation uses a context-free
grammar[1]; in particular, the grammar for a (simplified) if statement
looks like this:

#v+
if-stmt -> "if" expression ":" suite
         | "if" expression ":" suite "else" ":" suite
suite   -> INDENT statement+ DEDENT
#v-

As can be seen, the lexer used with that parser generates INDENT and
DEDENT symbols which serve the purpose of "{" and "}", "begin" and "end",
or similar constructs in other languages.
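
As an aside, CPython's own tokenize module exposes exactly these INDENT
and DEDENT tokens. A minimal sketch (it uses a small piece of real
Python in place of the schematic fragment above, since tokenize needs
actual keywords):

#v+
import io
import tokenize

# A small piece of real Python standing in for the schematic example.
src = (
    "x = 1\n"
    "if x:\n"
    "    y = 2\n"
    "    if y:\n"
    "        z = 3\n"
    "    else:\n"
    "        z = 4\n"
)

# generate_tokens() takes a readline callable over the source text and
# yields tokens; INDENT and DEDENT show up as ordinary tokens.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
#v-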

* * *

So here comes my first question: does the fact that a language's
implementation uses a context-free parser mean the language is
context-free, or do we have to take the lexer into consideration as well?

In my opinion, if we say that the grammar of a language is the same thing
as the grammar used in the parser of the language's implementation, we
face two facts:
1) two different implementations of the very same language may have
   different parsers, and therefore we would have to say that the very
   same language has two different grammars (which obviously is not
   true), and
2) two different languages may use the very same parser grammar with
   different lexers, and therefore we would have to say that two
   different languages have the same grammar (which again is obviously
   not true).

* * *

On the other hand, I postulate that the implementation's lexer is not
context-free, and thus the language as a whole cannot be considered
context-free, because the whole process (i.e. lexer + parser) is what
makes up the "language's grammar", not just the grammar used in the
parser.

But then they got me thinking by claiming that a lexer which converts an
indentation-based language into one with INDENT and DEDENT symbols can
be implemented with an FSA with a stack, and thus Python's lexer is
context-free.

Clearly, the following pseudo code implements such a simplified lexer
(some special cases not relevant to the discussion were ignored):

#v+
stack <- ( 0 )
lexem <- " "

WHILE lexem != "END"
    lexem <- next lexem

    depth <- 0
    WHILE lexem = " "
        depth <- depth + 1
        lexem <- next lexem

    IF depth > peek stack
        generate "INDENT"
        push depth on stack
    ELSE
        WHILE depth < peek stack
            generate "DEDENT"
            pop stack
        IF depth != peek stack
            error

    WHILE lexem != "\n" AND lexem != "END"
        IF lexem != " "
            generate lexem
        lexem <- next lexem

WHILE pop stack != 0
    generate "DEDENT"
#v-
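
For concreteness, here is a rough Python rendering of that pseudo code
(a sketch only: the helper name indent_lexer is just for illustration, it
works line by line rather than lexem by lexem, treats every
whitespace-separated word as a lexem, and ignores the same special cases,
e.g. blank lines, tabs and comments):

#v+
def indent_lexer(text):
    # Stack of indentation depths, with 0 at the bottom, as in the
    # pseudo code above.
    stack = [0]
    for line in text.splitlines():
        body = line.lstrip(" ")
        depth = len(line) - len(body)
        if depth > stack[-1]:
            yield "INDENT"
            stack.append(depth)
        else:
            while depth < stack[-1]:
                yield "DEDENT"
                stack.pop()
            if depth != stack[-1]:
                raise SyntaxError("inconsistent dedent")
        # One lexem per whitespace-separated word on the line.
        for lexem in body.split():
            yield lexem
    # Close whatever is still open at the end of the input.
    while stack.pop() != 0:
        yield "DEDENT"

print(list(indent_lexer("statement\nif expression :\n    statement")))
# -> ['statement', 'if', 'expression', ':', 'INDENT', 'statement', 'DEDENT']
#v-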

An even simpler approach could use no stack at all but instead remember
the indentation of the previous line:

#v+
indent <- 0
lexem <- " "

WHILE lexem != "END"
    lexem <- next lexem

    depth <- 0
    WHILE lexem = " "
        depth <- depth + 1
        lexem <- next lexem

    WHILE indent < depth
        generate "INDENT"
        indent <- indent + 1
    WHILE indent > depth
        generate "DEDENT"
        indent <- indent - 1

    WHILE lexem != "\n" AND lexem != "END"
        IF lexem != " "
            generate lexem
        lexem <- next lexem

WHILE indent > 0
    generate "DEDENT"
    indent <- indent - 1
#v-

The first one may look a bit like an FSA with a stack, and the second may
even look like a plain FSA (without a stack), but are they really? It
got me thinking for a while, but then I realised that since the second
one uses an unbounded counter it cannot be an FSA, can it?

* * *

And that's my second question: can a translation of indentation-based
syntax into (let's call it) parenthesis-based syntax in fact be done
using an FSA with a stack? Or is a more powerful automaton required,
i.e. a Turing-complete automaton?


________________________________________________________________________

I've used the terms "context-free lexer" and "context-sensitive lexer"
which may be a bit confusing for some, so (as a bonus) let me explain
what I understand by them.

First of all, I came to the conclusion that a lexer is something I know
under the name of "translation schemes" or "translators", which is an
(N, T, P, S, Act) tuple such that:
a) N is a non-empty finite set of non-terminal symbols,
b) T is a non-empty finite set of terminal symbols,
c) P is a set of productions of the form
     alpha -> beta
   where
   * alpha is an element of (N u T)^+ and
   * beta is an element of (N u T u Act)^*,
d) S is the starting non-terminal symbol, S e N, and
e) Act is a set of actions.

For instance, if we have the following grammar for simple expressions:

#v+
T = { "+", "*", "(", ")", NUM }
N = { expr, term, fact }
S = expr

P = {
  expr -> expr "+" term | term
  term -> term "*" fact | fact
  fact -> NUM | "(" expr ")"
}
#v-

we can use the following translator to translate simple expressions to
RPN:

#v+
T = { "+", "*", "(", ")", NUM }
N = { expr, term, fact }
S = expr
Act = { <<print "+">>, <<print "*">>, <<print NUM>> }

P = {
  expr -> expr "+" term <<print "+">> | term
  term -> term "*" fact <<print "*">> | fact
  fact -> NUM <<print NUM>> | "(" expr ")"
}
#v-

So if we wanted to use the grammar for simple expressions in a parser, we
would need a lexer which accepts a stream of characters and, among other
things, generates NUM symbols. Such a lexer could be seen as the
following translator:

#v+
T = Unicode characters
N = { stream, something, number, digit }
S = stream
Act = { <<print "+">>, <<print "*">>, <<print "(">>, <<print ")">>,
        <<print NUM>> }

P = {
  stream    -> epsilon | stream something
  something -> "+" <<print "+">> | "*" <<print "*">> |
               "(" <<print "(">> | ")" <<print ")">> |
               number <<print NUM>> |
               " " | "\n"
  number    -> digit | number digit
  digit     -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
}
#v-
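
Putting the two translators together, here is a rough Python sketch (the
names tokenize_expr and to_rpn are mine, and the left-recursive rules are
rewritten as loops, as one usually does for recursive descent):

#v+
import re

def tokenize_expr(text):
    # The "lexer" translator: characters in, (kind, text) tokens out;
    # runs of digits become NUM tokens, whitespace is skipped.
    for match in re.finditer(r"\d+|[+*()]", text):
        tok = match.group()
        yield ("NUM", tok) if tok.isdigit() else (tok, tok)
    yield ("END", "")

def to_rpn(text):
    # The "parser" translator: tokens in, RPN out via the <<print>> actions.
    tokens = tokenize_expr(text)
    current = next(tokens)
    out = []

    def advance():
        nonlocal current
        current = next(tokens)

    def expr():                  # expr -> term ( "+" term <<print "+">> )*
        term()
        while current[0] == "+":
            advance(); term(); out.append("+")

    def term():                  # term -> fact ( "*" fact <<print "*">> )*
        fact()
        while current[0] == "*":
            advance(); fact(); out.append("*")

    def fact():                  # fact -> NUM <<print NUM>> | "(" expr ")"
        kind, value = current
        if kind == "NUM":
            advance(); out.append(value)
        elif kind == "(":
            advance(); expr()
            if current[0] != ")":
                raise SyntaxError("expected )")
            advance()
        else:
            raise SyntaxError("unexpected " + repr(value))

    expr()
    if current[0] != "END":
        raise SyntaxError("trailing input")
    return " ".join(out)

print(to_rpn("1 + 2 * (3 + 4)"))   # -> 1 2 3 4 + * +
#v-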


________________________________________________________________________
[1] http://docs.python.org/reference/grammar.html

--
Best regards, _ _
.o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +--<mina86*tlen.pl>--<jid:mina86*jabber.org>--ooO--(_)--Ooo--

Torben Ægidius Mogensen

Oct 7, 2009, 6:35:00 AM
Michal Nazarewicz <min...@tlen.pl> writes:

When you say that a programming language is context-free or
context-sensitive, you don't normally talk about run-time semantics of
the language, but solely about whether the syntax of the language can be
described by a context-free or context-sensitive grammar.

But what is the syntax? Is this the set of texts that can successfully
pass the compiler without error messages? If so, very few programming
languages are context-free, as such questions of whether a variable is
used without definition or whether a GOTO statement has a defined destination
can not be solved by context-free parsing. Some dynamically-typed
languages will compile any program that passes a context-free parser by
postponing the remaining checks until runtime, though.

If you don't require all compile-time checks to be handled by the
grammar, you really muddle the question, as you can always define a
context-free grammar that accepts a superset of the language and then
let subsequent tests reject invalid programs. In the simplest case, a
parser can just "parse" the input to a list of characters and
arbitrarily complex tests can then inspect this list to reject illegal
programs (and compile legal programs).
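
For concreteness, a deliberately trivial Python sketch of that degenerate
setup (a toy illustration, not anything from a real compiler): the
"parser" accepts every string, and the real rejection happens in a later,
arbitrarily powerful pass.

# "Context-free parser" for the superset of all strings
# (grammar: S -> epsilon | any-character S).
def parse(source):
    return list(source)        # the "parse tree": a flat list of characters

# Stand-in for the checks a context-free parser cannot do, e.g. "no
# variable used before it is defined".  Toy rule: "def x" defines x,
# "use x" uses it.
def check(tree):
    defined = set()
    for line in "".join(tree).splitlines():
        words = line.split()
        if words[:1] == ["def"]:
            defined.update(words[1:])
        elif words[:1] == ["use"] and not set(words[1:]) <= defined:
            return False
    return True

check(parse("def x\nuse x"))   # True
check(parse("use y"))          # False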

> And that's my second question: can a translation of indentation-based
> syntax into (let's call it) parenthesis-based syntax in fact be done
> using an FSA with a stack? Or is a more powerful automaton required,
> i.e. a Turing-complete automaton?

For simplicity, let us assume that the grammar for the bracketed
language is

A -> aB
A -> (A)B
B ->
B -> A

and that the indented language uses the rule that the indentation on a
line is equivalent to the number of open parentheses and that each line
contains just indentation and an "a" symbol. Any string consisting of
lines where each line is a sequence of spaces followed by an a is a
legal string and has a unique equivalent bracketed string. Since the
language of such lines is CF (regular, even), parsing it is not an
issue. Translation may be, though.

A stack automaton that translates an indented string into a bracketed
string should output an "a" when it reads an "a", output a "(" for each
extra space of indentation a new line has compared to the previous one,
and a ")" for each space of indentation fewer.
For this reason, it needs to know what the indentation of the previous
line was. Since indentation is unbounded, the only place that can be
stored is on the stack. By popping from the stack as you read a new
line, you can find the difference in indentation, but as you do so, the
information about what the actual indentation was is lost. I can not
offhand see a way of regaining that without reading backwards in the
input stream. In fact, a two-way deterministic stack automaton that
does the translation is trivial to make. You don't even need a stack:
two two-way pointers in an FSA suffice (so you are in the logspace
language class).

Similarly, I can't see how a one-way stack automaton can translate a
bracketed string into an indented string. While it can use the stack to
store the current number of open brackets, it can not read this
information without destroying it. Again, a two-pointer two-way FSA can
do the trick, though.

Note that the translation can be done in linear time by using Cook's
linear-time simulation of a two-way stack automaton on a RAM.
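
Just to make the translation itself concrete (side-stepping, rather than
answering, the one-way restriction above), a Python sketch that simply
keeps the previous line's indentation in a variable:

def indented_to_bracketed(text):
    out, previous = [], 0
    for line in text.splitlines():
        depth = len(line) - len(line.lstrip(" "))
        # Emit "(" for each extra space, ")" for each space fewer.
        out.append("(" * (depth - previous) if depth > previous
                   else ")" * (previous - depth))
        out.append(line.strip())
        previous = depth
    out.append(")" * previous)      # close whatever is still open
    return "".join(out)

print(indented_to_bracketed("a\n a\n  a\na"))   # -> a(a(a))a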

Torben

Michal Nazarewicz

Oct 7, 2009, 4:18:31 PM
> Michal Nazarewicz <min...@tlen.pl> writes:
>> I've recently participated in a discussion of whether Python
>> programming language is context-sensitive or context-free. [...]
>>
>> [...]

>>
>> So here comes my first question: Does that fact that language's
>> implementation uses a context-free parser mean language is
>> context-free or do we have to take lexer into consideration as well?

tor...@pc-003.diku.dk (Torben Ægidius Mogensen) writes:
> When you say that a programming language is context-free or
> context-sensitive, you don't normally talk about run-time semantics of
> the language, but solely about whether the syntax of the language can be
> be described by a context-free or context-sensitive grammar.
>
> But what is the syntax? Is this the set of texts that can
> successfully pass the compiler without error messages? If so, very
> few programming languages are context-free [...]

I just realised this a few hours ago, so indeed it seems my question is
not as well defined as I had hoped. Especially since one may even wonder
whether a program that compiles but produces a run-time error is valid or
not.[1]

Guess there's not much point in trying to clarify things more as it
probably will lead us nowhere. :)

>> And that's my second question: can a translation of indentation-based
>> syntax into (let's call it) parenthesis-based syntax in fact be done
>> using an FSA with a stack? Or is a more powerful automaton required,
>> i.e. a Turing-complete automaton?

> [...] Since indentation is unbounded, the only place that can be
> stored is on the stack. By popping from the stack as you read a new
> line, you can find the difference in indentation, but as you do so,
> the information about what the actual indentation was is lost. I can
> not offhand see a way of regaining that without reading backwards in
> the input stream. In fact, a two-way deterministic stack automaton
> that does the translation is trivial to make.

So yes, that's what I was thinking as well. Thank you for the
confirmation -- at one moment I really thought I was wrong about this. :P

Anyhow, thanks for your input. All in all I think I gained some
knowledge and understanding of how formal (as in theoretical) grammars
apply to programming languages used in practice. :)

____________________
[1] For instance, "int main(void) { return *((int*)0); }" is a "valid"
C program which compiles but generates a serious run-time error (or
unleashes demons, as some folks down at comp.lang.c like to say) when
executed.
