Feedback for GSoC 2012 idea


Bharath M R

Mar 8, 2012, 11:55:10 AM
to sy...@googlegroups.com
Hi,
I am Bharath M R, a student of Electrical Engineering at IIT Madras. 
I would like to apply to build the gamma.sympy.org site. I would like to
implement the following things as part of the project.

1) Basic parsing
Input such as "solve x**2==1" or "integrate x**2" will be parsed and interpreted. Doing something at the level of Wolfram Alpha would require a lot of ideas from natural language processing, which I am not familiar with.

2) An incremental search for functions in SymPy
It is very important for a user to be able to find the particular function they want to use. This will be implemented using AJAX calls to the Sphinx documentation database, similar to the search in Scilab / Mathematica.

3) A LyX/Mathematica-style input
This greedily converts the expressions into LaTeX symbols, so that the user knows what they are actually computing. We can also add a bottom bar with a set of symbols for summation, integrals and differentiation. The input will look like LaTeX to the user, while we convert the expressions into the required SymPy functions on the backend.

4) Matplotlib for plotting
By default we plot the expression whenever it is plottable, just as Wolfram Alpha does.

5) IPython-like notebook
I think it is possible to port the IPython notebook according to this discussion (https://groups.google.com/forum/?fromgroups#!topic/sage-notebook/re2bUt4vCxA), but the time it would take to port is hard to estimate. I want to know whether I am overreaching by including it.

6) Make the interface look beautiful with Twitter Bootstrap.

Aaron Meurer

Mar 8, 2012, 12:25:34 PM
to sy...@googlegroups.com
This looks like a good start. Did you think about ways that some of
these ideas can be used in SymPy Live too?

On Thu, Mar 8, 2012 at 9:55 AM, Bharath M R <catchmr...@gmail.com> wrote:
> Hi,
> I am Bharath M R, a student of Electrical Engineering at IIT Madras.
> I would like to apply for the building the gamma.sympy.org site. I would
> like to
> implement the following things as part of the project.
>
> 1)Basic parsing
> Something like solve x**2==1, integrate x**2 will be parsed and interpreted.
> Doing something like wolphram alpha would require a lot of ideas from
> natural language processing which I am not familiar with.

This is something that can be improved upon over time. The important
thing here is to start with a good framework in SymPy to build upon,
so we can easily extend it with new rules.

And no matter how much you implement, it will always be a heuristic.
We just want it to catch the common case. We can do things like
provide a feedback button for mis-interpreted input, so we can get an
idea of what doesn't work and where things need to be improved the
most.

>
> 2) An incremental search for functions in symPy
> Its very important for a person to get to know the particular function he
> wants to use. This will be implemented using ajax calls to the sphinx
> documentation database.This will be similar to search in scilab /
> mathematica.

Additionally, there should be links throughout the interface to the
relevant Sphinx documentation for the various functions used (similar
to what WolframAlpha does).

>
> 3) A lyx/ Mathematica styled input.
> This greedily converts the expressions into latex symbols, so that the user
> knows what he is actually computing. We can also add a set of symbols for
> summation, integrals, differentiation as a bottom bar. The input will look
> like latex input to the user, while we convert the expressions into required
> symPy functions in the backend.

How did you plan to implement this? For now, the way that I know to
do it is to first convert the whole expression to SymPy, then call
latex() on it, and render the LaTeX with MathJax. That is what happens
at SymPy Live. It sounds like you want something that lets you type
LaTeX-style math on the fly, which basically amounts to an equation
editor.
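
A minimal sketch of that convert-then-render round trip, assuming the computation is done with plain SymPy calls (the MathJax step happens client-side and is not shown):

    from sympy import Symbol, integrate, latex, sympify

    x = Symbol("x")
    expr = integrate(sympify("x**2"), x)   # compute the result in SymPy first
    print(latex(expr))                     # \frac{x^{3}}{3} -- this string goes to MathJax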

>
> 4)Matplotlib for plotting
> We plot the expressions, if it is plottable just as Wolphram Alpha by
> default.

The App Engine now supports NumPy, so this should be doable (I hope).
Our matplotlib plotting engine is still in its infancy (for now,
it only lives at https://github.com/sympy/sympy/pull/673), but it
should grow. Hopefully we will get another project that will improve
it. You could also spend some time of this project working on it,
though obviously it would be better if most of the time were spent
doing the other things.
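
As an illustration only, here is roughly what a plotting step could look like using lambdify and matplotlib directly; this is a sketch, not the plotting module from the pull request above:

    import matplotlib
    matplotlib.use("Agg")                # headless backend, as a web server would need
    import matplotlib.pyplot as plt
    import numpy as np
    from sympy import Symbol, lambdify, sin

    x = Symbol("x")
    expr = sin(x)                        # the expression the user asked to plot

    f = lambdify(x, expr, "numpy")       # turn the SymPy expression into a NumPy function
    xs = np.linspace(-10, 10, 400)
    plt.plot(xs, f(xs))
    plt.title(str(expr))
    plt.savefig("plot.png")              # on App Engine this would be written to a buffer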

>
> 5)Ipython like notebook.
> I think it is possible to port the Ipython notebook according to this
> discussion(https://groups.google.com/forum/?fromgroups#!topic/sage-notebook/re2bUt4vCxA).
> But the time it takes to port is not very fixed. I want to know whether I am
> overreaching by including it.

Even without the notebook, we should use IPython itself in SymPy Live
(http://code.google.com/p/sympy/issues/detail?id=2645), if that's
possible. That will give us things like better tab completion and
access to IPython magic. My idea is that you should be able to click
on a SymPy Gamma calculation and just get a little SymPy Live session,
so the two projects are related.

>
> 5) Make the interface look beautiful with twitter bootstrap.

+1. I don't know anything about this specific library, but a
beautiful, highly functional interface is almost as important as good
functionality.

Aaron Meurer

krastano...@gmail.com

Mar 8, 2012, 12:33:20 PM
to sy...@googlegroups.com
I am not certain, but I think that matplotlib includes some C code and
so it won't work on GAE. But the new module can easily be extended
with other backends, like the Google Chart API.

Also there are some less-used matplotlib backends that do not need
compilation (pure Python):
http://old.nabble.com/Pure-python-matplotlib-for-Google-App-Engine-td32721389.html


>
>>
>> 5)Ipython like notebook.
>> I think it is possible to port the Ipython notebook according to this
>> discussion(https://groups.google.com/forum/?fromgroups#!topic/sage-notebook/re2bUt4vCxA).
>> But the time it takes to port is not very fixed. I want to know whether I am
>> overreaching by including it.
>
> Even without the notebook, we should use IPython itself in SymPy Live
> (http://code.google.com/p/sympy/issues/detail?id=2645), if that's
> possible.  That will give us things like better tab completion and
> access to IPython magic. My idea is that you should be able to click
> on a SymPy Gamma calculation and just get a little SymPy Live session,
> so the two projects are related.
>
>>
>> 5) Make the interface look beautiful with twitter bootstrap.
>
> +1.  I don't know anything about this specific library, but a
> beautiful, highly functional interface is almost as important as good
> functionality.
>
> Aaron Meurer
>


Aaron Meurer

Mar 8, 2012, 12:39:45 PM
to sy...@googlegroups.com

Oh, OK. Here is the app engine issue to add matplotlib:
http://code.google.com/p/googleappengine/issues/detail?id=482.

Someone there suggests using http://code.google.com/p/svgfig/ as a
pure Python alternative to Google Charts. It's GPL, though, so I
don't know if that would be a problem. Regardless, it looks like a
neat library to add a backend for. It should be easy (see
http://code.google.com/p/svgfig/wiki/PlottingTutorial).

Aaron Meurer

Bharath M R

Mar 9, 2012, 8:13:51 AM
to sy...@googlegroups.com


On Thursday, March 8, 2012 10:55:34 PM UTC+5:30, Aaron Meurer wrote:
This looks like a good start.  Did you think about ways that some of
these ideas can be used in SymPy Live too?

On Thu, Mar 8, 2012 at 9:55 AM, Bharath M R <catchmr...@gmail.com> wrote:
> Hi,
> I am Bharath M R, a student of Electrical Engineering at IIT Madras.
> I would like to apply for the building the gamma.sympy.org site. I would
> like to
> implement the following things as part of the project.
>
> 1)Basic parsing
> Something like solve x**2==1, integrate x**2 will be parsed and interpreted.
> Doing something like wolphram alpha would require a lot of ideas from
> natural language processing which I am not familiar with.

This is something that can be improved upon over time.  The important
thing here is to start with a good framework in SymPy to build upon,
so we can easily extend it with new rules.

And no matter how much you implement, it will always be a heuristic.
We just want it to catch the common case.  We can do things like
provide a feedback button for mis-interpreted input, so we can get an
idea of what doesn't work and where things need to be improved the
most.

Yeah. I am going through the match() functions in the codebase, and I should be able to figure out a way.

>
> 2) An incremental search for functions in symPy
> Its very important for a person to get to know the particular function he
> wants to use. This will be implemented using ajax calls to the sphinx
> documentation database.This will be similar to search in scilab /
> mathematica.

Additionally, there should be links throughout the interface to the
relevant Sphinx documentation for the various functions used (similar
to in WolframAlpha).

>
> 3) A lyx/ Mathematica styled input.
> This greedily converts the expressions into latex symbols, so that the user
> knows what he is actually computing. We can also add a set of symbols for
> summation, integrals, differentiation as a bottom bar. The input will look
> like latex input to the user, while we convert the expressions into required
> symPy functions in the backend.

How did you plan to implement this?  For now, the way that I know to
do it is to first compute the whole expression to SymPy, then call
latex() on it, and parse the latex with MathJax.  That is what happens
at SymPy Live.  It sounds like you want something that lets you type
LaTeX styled math on the fly, which basically amounts to an equation
editor.

I was thinking something like http://www.texrendr.com/.    

>
> 4)Matplotlib for plotting
> We plot the expressions, if it is plottable just as Wolphram Alpha by
> default.

The app engine now supports numpy, so this should be doable (I hope).
Our matplotlib plotting engine is still in its infant stages (for now,
it only lives at https://github.com/sympy/sympy/pull/673), but it
should grow.  Hopefully we will get another project that will improve
it.  You could also spend some time of this project working on it,
though obviously it would be better if most of the time were spent
doing the other things.


Looks like I have to go with svgfig for this. The discussion at http://old.nabble.com/Pure-python-matplotlib-for-Google-App-Engine-td32721389.html says that a lot of C++ code would have to be rewritten in Python. I can write a backend for svgfig.

>
> 5)Ipython like notebook.
> I think it is possible to port the Ipython notebook according to this
> discussion(https://groups.google.com/forum/?fromgroups#!topic/sage-notebook/re2bUt4vCxA).
> But the time it takes to port is not very fixed. I want to know whether I am
> overreaching by including it.

Even without the notebook, we should use IPython itself in SymPy Live
(http://code.google.com/p/sympy/issues/detail?id=2645), if that's
possible.  That will give us things like better tab completion and
access to IPython magic. My idea is that you should be able to click
on a SymPy Gamma calculation and just get a little SymPy Live session,
so the two projects are related.

I will figure out whether this is possible.  

Aaron Meurer

Mar 9, 2012, 6:41:08 PM
to sy...@googlegroups.com

What we have now is very limited, and the way that it's done will not
work. We have two kinds of parsers in SymPy right now. There's
sympify(), which parses an expression in Python grammar and converts
unknown variables into Symbols, literals into corresponding SymPy
types (like 1 to Integer(1)) and operations into corresponding SymPy
operations (like 1/2 to Rational(1, 2)). This only works with
expressions in Python notation (we allow ^ for exponentiation, but
that only works by replacing all instances of ^ with ** before doing
any parsing).
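
A quick illustration of that sympify() behaviour (a sketch; only equality with known SymPy objects is checked, since the exact classes returned are internals):

    from sympy import Integer, Rational, Symbol, sympify

    # Unknown names become Symbols, integer literals become Integer,
    # and 1/2 becomes Rational(1, 2) rather than a Python float:
    assert sympify("x") == Symbol("x")
    assert sympify("2") == Integer(2)
    assert sympify("1/2") == Rational(1, 2)

    # ^ is accepted for exponentiation only because it is replaced
    # with ** before the string reaches the Python parser:
    assert sympify("x^3") == sympify("x**3")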

Then we have some very limited parsers for things like Mathematica and
Maxima. These are almost completely useless. Aside from not really
implementing very much, the problem is that they use regular
expressions. But regular expressions do not work for parsing, because
the expressions that we are parsing belong to a higher grammar class.
Namely, it's impossible to do things like parenthesis matching with
regular expressions.

What is really needed is a parser that goes through an expression
and builds a syntax tree out of it. What you need to do is build
something that is modular enough that you can easily plug in new rules
to it (like "integrate expr dx" => integrate(expr, x), or "x y" ->
x*y).
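
Just to make the "pluggable rules" idea concrete, here is a deliberately naive sketch; the rule names and the whitespace/string splitting are placeholders, and a real implementation would dispatch on a token stream or syntax tree instead:

    from sympy import Symbol, integrate, solve, sympify

    def rule_integrate(rest):
        # "integrate <expr> dx"  ->  integrate(expr, x)
        expr, var = rest.rsplit(" d", 1)          # naive split, for illustration only
        return integrate(sympify(expr), Symbol(var))

    def rule_solve(rest):
        # "solve <lhs> == <rhs>"  ->  solve(lhs - rhs)
        lhs, rhs = rest.split("==", 1)
        return solve(sympify(lhs) - sympify(rhs))

    RULES = {"integrate": rule_integrate, "solve": rule_solve}

    def interpret(text):
        head, _, rest = text.strip().partition(" ")
        if head in RULES:
            return RULES[head](rest)
        return sympify(text)                      # fall back to plain SymPy syntax

    print(interpret("integrate x**2 dx"))         # x**3/3
    print(interpret("solve x**2 == 1"))           # [-1, 1]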

By the way, on a slightly related note, here are some other things to
think about doing as far as heuristics go. First, automatic spelling
correction, so if I type "intergate x^2 dx", it automatically gives me
"integrate(x**2, x) == x**3/3", with a message saying 'did you mean
"integrate x^2 dx"?' with "integrate" underlined (like what happens when
you do a Google search). Second, automatic correction of mismatched
parentheses, so if I type "sin(x*(x + 1)", it automatically guesses
that I meant "sin(x*(x + 1))" (WolframAlpha does this). Obviously,
these would both be limited in functionality, and are not required for
the project, but they would be neat and useful to have.
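
Rough first approximations of both heuristics fit in a few lines of standard-library Python; the command list below is made up, and real input would need a proper tokenizer rather than whole-string counting:

    import difflib

    KNOWN_COMMANDS = ["integrate", "diff", "solve", "limit", "simplify", "factor"]

    def correct_spelling(word):
        # Closest known command, e.g. "intergate" -> "integrate".
        matches = difflib.get_close_matches(word, KNOWN_COMMANDS, n=1, cutoff=0.7)
        return matches[0] if matches else word

    def balance_parens(text):
        # Append missing closing parentheses, e.g. "sin(x*(x + 1)" -> "sin(x*(x + 1))".
        missing = text.count("(") - text.count(")")
        return text + ")" * missing if missing > 0 else text

    print(correct_spelling("intergate"))       # integrate
    print(balance_parens("sin(x*(x + 1)"))     # sin(x*(x + 1))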

You could just do what WolframAlpha does. There, you can't click on
parts of the expression, but there are little links at the bottom for
each of the relevant parts to their docs.

Aaron Meurer

>>
>> >
>> > 5) Make the interface look beautiful with twitter bootstrap.
>>
>> +1.  I don't know anything about this specific library, but a
>> beautiful, highly functional interface is almost as important as good
>> functionality.
>>
>> Aaron Meurer
>


Bharath M R

Mar 10, 2012, 1:11:45 PM
to sy...@googlegroups.com

What we have now is very limited, and the way that it's done will not
work.  We have two kinds of parsers in SymPy right now.  There's
sympify(), which parses an expression in Python grammar and converts
unknown variables into Symbols, literals into corresponding SymPy
types (like 1 to Integer(1)) and operations into corresponding SymPy
operations (like 1/2 to Rational(1, 2)).  This only works with
expressions in Python notation (we allow ^ for exponentiation, but
that only works by replacing all instances of ^ with ** before doing
any parsing).

Then we have some very limited parsers for things like mathematica and
maxima.  These are almost completely useless.  Aside from not really
implementing very much, the problem is that they use regular
expressions.  But regular expressions do not work for parsing, because
the expressions that we are parsing are in a higher grammar.  Namely,
it's impossible to do things like parenthesis matching with regular
expressions.

 

What is really needed is a parser, that goes through an expression,
and builds a syntax tree out of it.  What you need to do is build
something that is modular enough that you can easily plug in new rules
to it (like "integrate expr dx" => integrate(expr, x), or "x y" ->
x*y).

This looks interesting. I recently did a course on finite automata, and I think
I can use the things I learnt. I need to read up on parsing, though.

Also, I think that we can only support basic invocations of functions with this
method: the functions would have to have a maximum of 2 to 3 arguments. Otherwise it
will be very awkward to type the arguments one after another separated by spaces. This would mean
that the new parser will only be helpful for implementing the web interface and not for SymPy
in general.

By the way, on a slightly related note, here are some other things to
think about doing as far as heuristics go. First, automatic spelling
correction, so if I type "intergate x^2 dx", it gives me automatically
"integrate(x**2, x) == x**3/3", with a message saying "did you mean
"integrate x^2 dx" with "integrate" underlined (like happens when you
do a Google search).  Second, automatic correction of mismatched
parentheses, so if I type "sin(x*(x + 1)", it automatically guesses
that I meant "sin(x*(x + 1))" (WolframAlpha does this).  Obviously,
these would both be limited functionality, and are not required for
the project, but they would be neat and useful to have.

I think it will be easy to implement the above features if the parser builds a syntax tree.


Sergiu Ivanov

Mar 10, 2012, 1:30:52 PM
to sy...@googlegroups.com
Hello,

On Sat, Mar 10, 2012 at 8:11 PM, Bharath M R <catchmr...@gmail.com> wrote:
>
>> What is really needed is a parser, that goes through an expression,
>> and builds a syntax tree out of it.  What you need to do is build
>> something that is modular enough that you can easily plug in new rules
>> to it (like "integrate expr dx" => integrate(expr, x), or "x y" ->
>> x*y).
>
> This looks interesting. I recently did a course on finite automata and I
> think
> I can use the things I learnt. I need to read up on parsing though.

My area of research is very tightly connected with formal languages;
I had a course on parsing techniques last semester. If you have
any questions, I'll be glad to try to help :-)

Sergiu

Bharath M R

Mar 10, 2012, 11:27:32 PM
to sy...@googlegroups.com
@Sergiu I was looking at top-down operator precedence parsing for the task
(http://javascript.crockford.com/tdop/tdop.html). One of the important tasks is to
come up with the grammar. Is it possible to meet you on IRC sometime to discuss this? My time zone
is GMT+5:30.

Joachim Durchholz

Mar 11, 2012, 9:30:15 AM
to sy...@googlegroups.com
On 11.03.2012 05:27, Bharath M R wrote:
> @Sergiu I was looking at top down operator precedence parsing for the task

That's the technique where the grammar requires the least contortions.
It also composes well: you can define sublanguages (such as arithmetic
and boolean logic), and as long as their operator sets do not overlap,
you can combine them without any problems.

It is a bit restricted in what classes of languages it can handle.
Parsing a*-b is not possible but can be enabled using a hack.
I don't know how to parse http://en.wikipedia.org/wiki/Bra-ket_notation
using operator precedence.

> (http://javascript.crockford.com/tdop/tdop.html) .

Crockford is very wrong that operator-precedence parsing is in any way
connected to dynamic languages.

In general, he's not exposing principles but presenting a recipe. I
don't think that's easy to transfer to Python code.

> One of the important task is to
> come up with the grammar.

Not for operator-precedence grammars!
You need:
- a list of precedence levels,
- for each level:
  - the list of operators,
  - whether it is left-, right-, or non-associative,
- a list of left and right parentheses.

The problem with unary minus is that it has a different precedence than
binary minus. The standard approach is to let the tokenizer detect
multiple operator symbols in a row, and emit a different pseudo-minus
token whenever it isn't the first.
I'm pretty sure I could whip up a systematic approach that would cover
more operators and hold up in the presence of postfix operators, but I
don't know whether that's even relevant in SymPy. (The only postfix
operator in widespread use I know about is !, and that isn't used as an
infix operator.)
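
To make the technique concrete, here is a bare-bones precedence-climbing (Pratt-style) sketch for +, -, *, / and a right-associative **, plus parentheses; the unary-minus handling and keyword rules discussed in this thread are deliberately left out, and it builds plain tuples rather than SymPy objects:

    import re

    TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|\*\*|[()+\-*/])")

    def tokenize(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = TOKEN.match(text, pos)
            if not m:
                raise SyntaxError("unexpected character at position %d" % pos)
            tokens.append(m.group(1))
            pos = m.end()
        return tokens

    # (left, right) binding powers; ** binds tighter on the left,
    # which makes it right-associative.
    BINDING = {"+": (10, 11), "-": (10, 11), "*": (20, 21), "/": (20, 21), "**": (31, 30)}

    def parse(tokens, min_bp=0):
        token = tokens.pop(0)
        if token == "(":
            left = parse(tokens, 0)
            if tokens.pop(0) != ")":
                raise SyntaxError("missing closing parenthesis")
        else:
            left = token                      # a number or a name
        while tokens and tokens[0] in BINDING:
            left_bp, right_bp = BINDING[tokens[0]]
            if left_bp < min_bp:
                break
            op = tokens.pop(0)
            left = (op, left, parse(tokens, right_bp))   # grow the syntax tree
        return left

    print(parse(tokenize("1 + 2 * x ** 3")))  # ('+', '1', ('*', '2', ('**', 'x', '3')))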

Bharath M R

Mar 11, 2012, 12:58:59 PM
to sy...@googlegroups.com
Thanks for the reply. The parser is for building the web interface for SymPy,
similar to Wolfram Alpha. Basically I want to parse things like
integrate x**2 dx
roots x**2==1

Won't operator precedence parsing be sufficient for this? If it is not, can
you point me to some reading material on a parsing technique which can
accomplish this?

Thanks,
Bharath M R 

Joachim Durchholz

Mar 11, 2012, 1:32:14 PM
to sy...@googlegroups.com
On 11.03.2012 17:58, Bharath M R wrote:
> Thanks for the reply. The parser is for building the web interface similar
> to
> Wolfram Alpha for SymPy. Basically I want to parse things like
> integrate x**2 dx
> roots x**2==1
>
> Won't the operator precedence parsing be sufficient for this?

These two are easy to parse with operator precedence.

The "interesting" things happen with unary minus and bra-ket notation.

> If it is not, can
> you help me with some reading material on some parsing technique which can
> acomplish this.

There are more general parsing techniques. The problem is that each
technique places constraints on the grammars, and it quickly becomes
specialist work to write a grammar that (a) expresses what you want and
(b) obeys the restrictions of the parsing technique.

Among parsing techniques, operator precedence places the tightest
restrictions, but it is also the most complicated one that can be
mastered in a weekend (for grammar writers) or in a week (for implementors).

For a quick rundown, take a look at
http://epaperpress.com/oper/index.html .

A Pythonic operator-precedence parser seems to be at
http://effbot.org/zone/simple-top-down-parsing.htm .
"Seems to be" because I didn't thoroughly read it. What I liked about
that article is that it explains how to best do it in Python.

Sergiu Ivanov

Mar 11, 2012, 4:40:22 PM
to sy...@googlegroups.com
On Sun, Mar 11, 2012 at 6:58 PM, Bharath M R <catchmr...@gmail.com> wrote:
>
> On Sunday, March 11, 2012 7:00:15 PM UTC+5:30, Joachim Durchholz wrote:
>>
>> On 11.03.2012 05:27, Bharath M R wrote:
>> > @Sergiu I was looking at top down operator precedence parsing for the
>> > task
>>
>> That's the technique where the grammar requires the least contortions.
>> It also composes well: you can define sublanguages (such as arithmetic
>> and boolean logic), and as long as their operator sets do not overlap,
>> you can combine them without any problems.
>>
>
> Thanks for the reply. The parser is for building the web interface similar
> to
> Wolfram Alpha for SymPy.  Basically I want to parse things like
> integrate x**2 dx
> roots x**2==1

I've never had any serious experience with operator precedence
parsing, but I have the intuition that this technique is going to be
quite unwieldy if we would like to go beyond simple expressions like
the ones you have shown.

I'd advocate a more general approach to this problem, namely
something like LL(1), SLR or LALR(1). When writing from scratch, this
is obviously going to require more effort than operator precedence, but
in return we will have a much wider class of languages covered.
However, I'd recommend using a parser generator instead of writing a
parser from scratch yourself. It will introduce some slowdown, but
will in return make the whole thing a lot easier to write, maintain and,
more importantly, to extend. I'm not sure about the state of parser
generators for Python, but this page
http://wiki.python.org/moin/LanguageParsing may provide some
information.

I can see a different problem here though: choosing a parsing method
and producing the grammar of the language (or what information the
parser would require) may not be enough to get the desired
functionality. What we are trying to get at should incorporate
something similar to natural language processing. Therefore, before
actually starting the parsing process, I think the input should be
brought to some kind of canonical form. I am strongly inclined to
believe that Wolfram|Alpha takes exactly this way of doing things (but
obviously with a lot more details).

One simple thing I can think of is detecting synonyms. For example,
"roots x**2 == 1" and "solutions x**2 == 1" and "solve x**2 == 1" are
one and the same thing. Therefore, it may be useful to start with
substituting each of these synonyms with some canonical word ("solve",
for example). It is possible to argue that this can be solved by
introducing additional rules into the grammar. However, I am inclined
to believe that this will make the grammar unnecessarily large and,
even worse, slow down the parser.

Another simple thing is detecting spelling errors. Suppose I type
"intgrate" or any of the other (numerous) wrong possibilities. I
think it would be very nice to have such mistakes detected and
corrected automatically. This http://packages.python.org/pyenchant/
seems on topic.

I'd also recommend showing the substitutions and the resulting
canonical form. In this way the user will be able to see how their
input was interpreted and, maybe, change their formulation to get
nearer to what they wanted.
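
A sketch of what the synonym substitution plus the "interpreted as" echo could look like; the synonym table is a placeholder, and a real preprocessor would operate on the lexer's token stream rather than on whitespace-split words:

    SYNONYMS = {"roots": "solve", "solutions": "solve", "integral": "integrate"}

    def canonicalize(text):
        words = text.split()
        return " ".join(SYNONYMS.get(w, w) for w in words)

    query = "roots x**2 == 1"
    canonical = canonicalize(query)
    if canonical != query:
        # Echo the canonical form so the user sees how the input was interpreted.
        print("Interpreting %r as %r" % (query, canonical))
    # -> Interpreting 'roots x**2 == 1' as 'solve x**2 == 1'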

The list of actions the preprocessor should perform can be extended
arbitrarily, I guess. For example, it could try to fix wrongly
balanced parentheses. It might also try to cope with a different
order of keywords in the string, like in "sin(x) integral". It would
be nice to parse single-argument functions written without parentheses
("sin x" instead of "sin(x)"). The preprocessor could also drop
incomprehensible (and thus supposedly meaningless) words, like in
"find the integral of x^2".

Apparently, the elements in this list should also be given priorities,
because some of them are essential (synonyms, for example, as I see
it), others are less critical.

Sergiu

Sergiu Ivanov

Mar 11, 2012, 4:48:24 PM
to sy...@googlegroups.com
On Sun, Mar 11, 2012 at 6:27 AM, Bharath M R <catchmr...@gmail.com> wrote:
>
> @Sergiu I was looking at top down operator precedence parsing for the task
> (http://javascript.crockford.com/tdop/tdop.html) . One of the important task
> is to
> come up with the grammar. Is it possible to meet you on IRC sometime to
> discuss about this? My time zone
> is GMT + 5:30.

I'm in GMT+2 which means that you can generally find me on-line during
your day. However, I'd prefer staying on E-mail for as long as
possible because it allows for a much better structured conversation
and many more people can participate.

Whenever we come up with a certain concrete question which requires
live discussion, don't hesitate to point that out to me (since I may
be sufficiently inattentive to miss that moment :-) )

Sergiu

Bharath M R

Mar 13, 2012, 12:16:50 AM
to sy...@googlegroups.com

I've never had any serious experience with operator precedence
parsing, but I have the intuition that this technique is going to be
quite unwieldy if we would like to go beyond simple expressions like
the ones you have shown.

I'd advocate a more general approach to this problem, and namely
something like LL(1), SLR or LALR(1).  When writing from scratch, this
is obviously going to require more effort than operator precedence but
instead we will have a much wider class of languages covered.
However, I'd recommend using a parser generator instead of writing a
parser from scratch yourself.  It will introduce some slowdown, but
will instead make the whole thing a lot easier to write, maintain and,
more importantly, to extend.  I'm not sure about the state of parser
generators for Python, but this page
http://wiki.python.org/moin/LanguageParsing may provide some
information.

Yeah, I think I should use a parser generator. But doesn't that introduce a
dependency on the parser generator? Is it OK to have this dependency?
 

I can see a different problem here though: choosing a parsing method
and producing the grammar of the language (or what information the
parser would require) may not be enough to get the desired
functionality.  What we are trying to get at should incorporate
something similar to natural language processing.  Therefore, before
actually starting the parsing process, I think the input should be
brought to some kind of canonical form.  I am strongly inclined to
believe that Wolfram|Alpha takes exactly this way of doing things (but
obviously with a lot more details). 

One simple thing I can think of is detecting synonyms.  For example,
"roots x**2 == 1" and "solutions x**2 == 1" and "solve x**2 == 1" are
one and the same thing.  Therefore, it may be useful to start with
substituting each of these synonyms with some canonical word ("solve",
for example).  It is possible to argue that this can be solved by
introducing additional rules into the grammar.  However, I am inclined
to believe that this will make the grammar unnecessarily large and,
even worse, slow down the parser 

I was thinking the same thing: we have to substitute the synonyms before we
start parsing.

Another simple thing is detecting spelling errors.  Suppose I type
"intgrate" or any of the other (numerous) wrong possibilities.  I
think it would be very nice to have such mistakes detected and
corrected automatically.  This http://packages.python.org/pyenchant/
seems on topic.

I'd also recommend showing the substitutions and the resulting
canonical form.  In this way the user will be able to see how their
input was interpreted and, maybe, change their formulation to get
nearer to what they wanted.

This will be implemented just like in Wolfram Alpha: we will show the user how
the input was interpreted.
 

The list of actions the preprocessor should can be extended
arbitrarily, I guess.  For example, it could try to fix wrongly
balanced parentheses.  It might also try to cope with a different
order of keywords in the string, like in "sin(x) integral".  It would
be nice to parse single-argument functions written without parentheses
("sin x" instead of "sin(x)").  The preprocessor could also drop
incomprehensible (and thus supposedly meaningless) words, like in
"find the integral of x^2".

Apparently, the elements in this list should also be given priorities,
because some of them are essential (synonyms, for example, as I see
it), others are less critical.

Thanks for your help. I think I have to look into all these ideas and prioritize them and
come up with a plan to implement them. 

Sergiu

Joachim Durchholz

Mar 13, 2012, 3:42:57 AM
to sy...@googlegroups.com
On 11.03.2012 21:40, Sergiu Ivanov wrote:
> I've never had any serious experience with operator precedence
> parsing, but I have the intuition that this technique is going to be
> quite unwieldy if we would like to go beyond simple expressions like
> the ones you have shown.

It's a trade-off. The decision should be driven by the project goals,
since no parsing technique can cover everything.

Here's a run-down of available techniques:

Operator-precedence:
Somewhat limited (but not as badly as many people think).
Extending the grammar requires the ability to grasp the concepts of
precedence and associativity.
The parser can easily generate useful error messages.
Extremely easy to implement.

LL(k):
Surprisingly limited, it cannot parse any input that starts with k or
more left parentheses; without parentheses, it cannot parse any grammar
that has more than k precedence levels with infix or postfix operators.
Extending the grammar requires the ability to compute the sets of
prefixes, which is moderately difficult and mostly doable in one's head.
The parser has a hard time generating useful error messages.
Implementation is quite complex.

SLR:
Only marginally less limited than operator-precedence.
Otherwise, the same advantages and disadvantages as LR(k) hold.

LALR(k):
Almost as powerful as LR(k+1).
Generating error messages is somewhat more difficult than with LR(k).
Otherwise, the same advantages and disadvantages as LR(k) hold.

LR(k):
This is the most powerful widely-known algorithm for which ambiguities
can be detected by inspecting the grammar.
Extending the grammar requires intimate knowledge of the parsing
algorithm; otherwise, you'll be unable to interpret the "shift-reduce"
and "reduce-reduce" conflict error messages.
Generating useful error messages is moderately difficult.
Implementation is complex.

Earley:
Can handle arbitrary context-free grammars. The downside is that there
is no way to check whether a grammar is ambiguous. However, the
algorithm will detect if a given input is ambiguous.
Extending the grammar is trivial, checking it for ambiguity requires an
arsenal of techniques (in practice, most teams just hope that there is
no problem there and tweak the grammar until practical usage does not
throw ambiguity errors anymore).
I don't know how hard generating useful error messages is. I suspect
it's hard or impossible, but I'd have to ask a search engine to verify that.
Implementation is almost as complex as for LR(k) (the parsing
algorithm is more complex, but you do not need to preprocess the grammar).
The algorithm is worst-case polynomial in input length: quadratic if the
input is unambiguous, cubic if it is ambiguous. This is not a serious
issue for small inputs. There are variants that have better worst-case
complexities, but these are more complex.

> However, I'd recommend using a parser generator instead of writing a
> parser from scratch yourself. It will introduce some slowdown, but
> will instead make the whole thing a lot easier to write, maintain and,
> more importantly, to extend.

A parser generator is required for LL and LR variants.
The problem with these is that they usually come with their own syntax,
so using them requires learning new skills.

> I'm not sure about the state of parser
> generators for Python, but this page
> http://wiki.python.org/moin/LanguageParsing may provide some
> information.

A parser that's embedded into Python as a DSL would be an option.

> I can see a different problem here though: choosing a parsing method
> and producing the grammar of the language (or what information the
> parser would require) may not be enough to get the desired
> functionality. What we are trying to get at should incorporate
> something similar to natural language processing. Therefore, before
> actually starting the parsing process, I think the input should be
> brought to some kind of canonical form. I am strongly inclined to
> believe that Wolfram|Alpha takes exactly this way of doing things (but
> obviously with a lot more details).

The first step to transformation would be a parse, so I'm not really
sure what point you're making here.

> One simple thing I can think of is detecting synonyms. For example,
> "roots x**2 == 1" and "solutions x**2 == 1" and "solve x**2 == 1" are
> one and the same thing. Therefore, it may be useful to start with
> substituting each of these synonyms with some canonical word ("solve",
> for example).

This is stuff that is easiest to do *after* a parse.

> It is possible to argue that this can be solved by
> introducing additional rules into the grammar.

It's easier to build alternate definitions into the semantic processing
that happens after the parse.

> However, I am inclined
> to believe that this will make the grammar unnecessarily large and,
> even worse, slow down the parser.

The parsing algorithms enumerated above do not care much about whether
each function and operator name gets a separate grammar rule, or
whether there is one rule for function calls and one rule for all operators.
Initializing the tables requires time roughly linear in the number
of grammar symbols, and that's all.

Having a larger grammar isn't so much a maintenance problem either if
each rule follows a pattern from a small set of standard patterns.
What's making trouble is having many different patterns. Mostly because
rules interact, and the interactions become more complex if there are
many rule patterns - you get "conflicts", and these are generally hard
to resolve, and adding a new rule can create a conflict in a faraway
rule (unless you're using operator precedence, where the number of
patterns is restricted, or Earley, where there is no checking on the
grammar).

> Another simple thing is detecting spelling errors. Suppose I type
> "intgrate" or any of the other (numerous) wrong possibilities. I
> think it would be very nice to have such mistakes detected and
> corrected automatically. This http://packages.python.org/pyenchant/
> seems on topic.

Spell checking, again, is something that is easily done on the parse
tree after it has been created.
Just make sure that the parse tree contains all the information needed
to generate a good message: Each symbol should carry not just the text
but also originating line and column number.

> I'd also recommend showing the substitutions and the resulting
> canonical form. In this way the user will be able to see how their
> input was interpreted and, maybe, change their formulation to get
> nearer to what they wanted.

+1.

> The list of actions the preprocessor should can be extended
> arbitrarily, I guess.

Yes.

> For example, it could try to fix wrongly
> balanced parentheses.

That one would be very hard to do well. There is too little redundancy
in the input to guess the proper place for a parenthesis.
For an example of how it doesn't work, try entering unbalanced
parentheses in Excel.

> It might also try to cope with a different
> order of keywords in the string, like in "sin(x) integral". It would
> be nice to parse single-argument functions written without parentheses
> ("sin x" instead of "sin(x)").

That one can be done on the grammar level: just make a function name
bind tighter than any operator (this is typically how functional
programming languages have their grammars defined).
Then,
sin x * y
will parse as
(sin x) * y

Of course, this is not Python's grammar anymore. (It does follow
mathematical convention more closely though.)

> The preprocessor could also drop
> incomprehensible (and thus supposedly meaningless) words, like in
> "find the integral of x^2".

Experience with this kind of approach is that it tends to reduce
predictability. In this case, the user might consider a word meaningless,
but the program has some obscure definition for it and suddenly spits
out error messages that refer to stuff that the user doesn't know about
- say, if there is an "of" function somewhere in a category-theoretic
module.

Aaron Meurer

Mar 13, 2012, 4:22:44 AM
to sy...@googlegroups.com
Thanks for the helpful advice. So which method would you recommend
using for this? Expression size isn't an issue for SymPy Gamma, by
the way, as we are limited by what the App Engine can handle anyway
(you'll find that WolframAlpha can't handle very large or even
moderately large expressions either).

I suspect that tools that do this only handle very trivial cases. Doing
this well is not a high priority, as the user should really give
balanced parentheses for unambiguous input. But it could be useful,
e.g., if there are too many opening parentheses, to try adding enough
closing parentheses to the end and seeing if that works.

Maybe you could play around with Wolfram Alpha and see to what extent
it does this.

>
>
>> It might also try to cope with a different
>> order of keywords in the string, like in "sin(x) integral".  It would
>> be nice to parse single-argument functions written without parentheses
>> ("sin x" instead of "sin(x)").
>
>
> That one can be done on the grammar level: just make a function name bind
> tighter than any operator (this is the typical of how functional programming
> languages have their grammar defined).
> Then,
>  sin x * y
> will parse as
>  (sin x) * y
>
> Of course, this is not Python's grammar anymore. (It does follow
> mathematical convention more closely though.)
>
>
>> The preprocessor could also drop
>> incomprehensible (and thus supposedly meaningless) words, like in
>> "find the integral of x^2".
>
>
> Experience with this kind of approach was that it tends to reduce
> predictability. In this case, the user might consider a word meaningless but
> the program has some obscure definition for it and suddenly spits out error
> messages that refer to stuff that the user doesn't know about - say, if
> there is an "of" function somewhere in a category-theoretic module.

So I think to do natural language processing correctly, you'd have to
implement some AI techniques. We can probably put this on a lower
priority than the rest of the things discussed here. Considering that
what we are suggesting would go far beyond only accepting exactly
correct SymPy/Python syntax, I think it's a good start.

Aaron Meurer

Sergiu Ivanov

Mar 13, 2012, 10:51:17 AM
to sy...@googlegroups.com
On Tue, Mar 13, 2012 at 6:16 AM, Bharath M R <catchmr...@gmail.com> wrote:
>
> Yeah I think I should use some parser generators. But doesn't it introduce
> a
> dependency on the parser generator. Is it ok to have this dependency?

Usually, you only use the parser generator to produce the code of the
parser. The resulting code is technically independent of the parser
generator. I'd think such a loose dependency should be OK.

Sergiu

Sergiu Ivanov

Mar 13, 2012, 12:07:10 PM
to sy...@googlegroups.com

Thank you for the very good overview; I wish I had been given
something like this in my course on compilers.

> A parser generator is required for LL and LR variants.
> The problem with these is that they usually come with their own syntax, so
> using them requires learning new skills.

I think the effort required to acquire these new skills is still less
than that of mastering a certain parsing technique and then
implementing it. (Just for reference, I remember once learning
the basics of JavaCC in a day.)

On the other hand, you can switch parsing techniques much more easily
if you are using parser generators.

>>  I'm not sure about the state of parser
>> generators for Python, but this page
>> http://wiki.python.org/moin/LanguageParsing may provide some
>> information.
>
>
> A parser that's embedded into Python as a DSL would be an option.

I'm not sure I can properly understand your idea; why do we need to
*embed* the DSL into Python?

I've checked the link I provided one more time; it lists a number of
implementations that look quite appealing, including
http://www.acooke.org/lepl/ and http://pypi.python.org/pypi/modgrammar
.

>> I can see a different problem here though: choosing a parsing method
>> and producing the grammar of the language (or what information the
>> parser would require) may not be enough to get the desired
>> functionality.  What we are trying to get at should incorporate
>> something similar to natural language processing.  Therefore, before
>> actually starting the parsing process, I think the input should be
>> brought to some kind of canonical form.  I am strongly inclined to
>> believe that Wolfram|Alpha takes exactly this way of doing things (but
>> obviously with a lot more details).
>
>
> The first step to transformation would be a parse, so I'm not really sure
> what point you're making here.

Hm, well, in my imagination the transformations I listed were to be
solved by doing find-and-replace, so no, the first step in the
transformations I suggested would not be parsing. Maybe lexing, but
not parsing.

>> One simple thing I can think of is detecting synonyms.  For example,
>> "roots x**2 == 1" and "solutions x**2 == 1" and "solve x**2 == 1" are
>> one and the same thing.  Therefore, it may be useful to start with
>> substituting each of these synonyms with some canonical word ("solve",
>> for example).
>
>
> This is stuff that is easiest to do *after* a parse.

How so? I can't see how it is hard to replace ("roots" or
"solutions") with "solve" before parsing.

>> It is possible to argue that this can be solved by
>> introducing additional rules into the grammar.
>
>
> It's easier to build alternate definitions into the semantic processing that
> happens after the parse.

Hm, I think I can see your point now. It is indeed more correct to
handle alternate definitions in the after-parse semantic
processing. However, I'm not sure the resulting code is going to be
simpler.

That said, I am not going to fight for this point of mine; your idea
looks more correct.

>> However, I am inclined
>> to believe that this will make the grammar unnecessarily large and,
>> even worse, slow down the parser.
>
>
> The parsing algorithms enumerated above do not care much about whether each
> function and operator name is done in a separate grammar rule, or if there
> is one rule for function calls and one rule for all operators.
> Initialization the tables requires time roughly linear with the number of
> grammar symbols, and that's all.

Yes, sure.

> Having a larger grammar isn't so much a maintenance problem either if each
> rule follows a pattern from a small set of standard patterns.

I think I don't understand this. Could you please give more details?

By the way, I can see the reason for the difference between your
suggestion and mine. I am too used to (theoretical) rewriting rules,
which made me forget about the lexing phase completely.

>> Another simple thing is detecting spelling errors.  Suppose I type
>> "intgrate" or any of the other (numerous) wrong possibilities.  I
>> think it would be very nice to have such mistakes detected and
>> corrected automatically.  This http://packages.python.org/pyenchant/
>> seems on topic.
>
>
> Spell checking, again, is something that is easily done on the parse tree
> after it has been created.

I'm not sure I agree. Consider the (supposedly valid) sentences
"integrate sin(x) by x" and "limit sin(x) when x goes to zero". I
don't think I'd recommend parsing these two sentences with one
(general) rule, which means that the words "integrate" and "limit"
actually determine which of the rules to use. If spell checking
doesn't happen before lexing, the necessary difference between
"integrate" and "limit" may not be detected.

>> For example, it could try to fix wrongly
>> balanced parentheses.
>
> That one would be very hard to do well. There is too little redundancy in
> the input to guess the proper place for a parenthese.

Yes, of course. That was just a wild suggestion with brainstorming
mode on.

> For an example of how it doesn't work, try entering unbalanced parentheses
> in Excel.

LibreOffice Spreadsheet seems to cope with the simplest cases, but I
guess it only works for those.

>> It might also try to cope with a different
>> order of keywords in the string, like in "sin(x) integral".  It would
>> be nice to parse single-argument functions written without parentheses
>> ("sin x" instead of "sin(x)").
>
> That one can be done on the grammar level: just make a function name bind
> tighter than any operator (this is the typical of how functional programming
> languages have their grammar defined).
> Then,
>  sin x * y
> will parse as
>  (sin x) * y

Indeed.

>> The preprocessor could also drop
>> incomprehensible (and thus supposedly meaningless) words, like in
>> "find the integral of x^2".
>
>
> Experience with this kind of approach was that it tends to reduce
> predictability. In this case, the user might consider a word meaningless but
> the program has some obscure definition for it and suddenly spits out error
> messages that refer to stuff that the user doesn't know about

I don't think this is a problem because the user is not supposed to
purposefully input meaningless words in normal scenarios.

> - say, if
> there is an "of" function somewhere in a category-theoretic module.

:-D

Sergiu

Sergiu Ivanov

Mar 13, 2012, 12:24:34 PM
to sy...@googlegroups.com
On Tue, Mar 13, 2012 at 10:22 AM, Aaron Meurer <asme...@gmail.com> wrote:
>
> Thanks for the helpful advice.  So which method would you recommend
> using for this?  Expression size isn't an issue for SymPy Gamma, by
> the way, as we are limited by what the App Engine can handle anyway
> (You'll find that WolframAlpha can't handle very large or even
> moderately expressions either).

Apparently, operator precedence parsing together with some additional
processing of the string may be enough. This may be a good start.

For more complicated expressions (like "limit sin(x)/x when x goes to
0"), I think that operator precedence will not be enough. In that
case I'd go for a generator of one of the more powerful parsers (LR
for example).

Sergiu

Joachim Durchholz

Mar 13, 2012, 2:55:02 PM
to sy...@googlegroups.com

Hm... let me phrase it differently:
Those parsing methods that are so complicated that you need a parser
generator require a different skill set. Which means that we're raising
the entry barrier for contributors.

Hence, -1.

Jorge Cardona

Mar 13, 2012, 3:05:47 PM
to sy...@googlegroups.com
Maybe this will be useful for the project: http://www.nltk.org/


--
Jorge Eduardo Cardona
jorgee...@gmail.com
jorgeecardona.blogspot.com
github.com/jorgeecardona
------------------------------------------------
Linux registered user  #391186
Registered machine    #291871
------------------------------------------------

Joachim Durchholz

Mar 13, 2012, 3:16:10 PM
to sy...@googlegroups.com
On 13.03.2012 17:24, Sergiu Ivanov wrote:
> Apparently, operator precedence parsing together with some additional
> processing of the string may be enough. This may be a good start.

For the umpteenth time, NO string processing. You'll be going straight to
ambiguity hell if you do that - you'll end up with a parser that misparses
strings just because you weren't careful with the string substitutions.

If you want to substitute, do it on the AST.

> For more complicated expressions (like "limit sin(x)/x when x goes to
> 0"), I think that operator precedence will not be enough.

In fact this kind of stuff is easy. Just make "limit", "when", and "goes
to" bind less tightly than operators.

What's hard is unary minus. Or anything where you want the same symbol
on different precedence levels depending on syntactic context.

> In that
> case I'd go for a generator of one of the more powerful parsers (LR
> for example).

Did you ever write a nontrivial LR grammar?
Fixing the inevitable shift-reduce conflicts is hard. Really hard.
Fixing the conflicts also means changing the grammar. Which means that
the grammar will produce parse trees that do not reflect the original
grammar anymore.
Adding another rule to an LR grammar tends to break it. All over the
place. Basically, you need to restart the conflict-fixing process.

We'd need an LR grammar expert, or live with the suffering.

Joachim Durchholz

Mar 13, 2012, 3:17:55 PM
to sy...@googlegroups.com
On 13.03.2012 20:05, Jorge Cardona wrote:
> Maybe this will be useful for the project: http://www.nltk.org/

I have only passing knowledge of natural language processing.
I fear that using it would mean devoting a lot of effort to stemming and
similar activities that bear no fruit in a SymPy context.

Joachim Durchholz

Mar 13, 2012, 3:46:16 PM
to sy...@googlegroups.com
On 13.03.2012 09:22, Aaron Meurer wrote:
> Thanks for the helpful advice. So which method would you recommend
> using for this?

Hard to tell right now. We need to see what grammars we want to support.

So the first step would probably be to draw up all the grammar rules for
features present (and future, as far as we can fathom them).
Make that as complete as possible - it's sometimes the innocent rule
that will make or break a parsing technique.

Unless we're extremely lucky, that initial grammar will not work (except
for Earley, which will accept anything).
The next round will then be to see what modifications to the grammar are
needed to make it acceptable for the various parsing techniques.
For the ambiguity checking, any parser generator that uses a specific
parsing technique will do. Prefer one with a record of good error
messages, we'll need them.

Then it's time for the judgement call: Which techniques require what
adaptations of the grammar, and which of the adaptations are acceptable
and which aren't.

Sergiu Ivanov

Mar 13, 2012, 3:48:33 PM
to sy...@googlegroups.com
On Tue, Mar 13, 2012 at 9:16 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> On 13.03.2012 17:24, Sergiu Ivanov wrote:
>
>> Apparently, operator precedence parsing together with some additional
>> processing of the string may be enough.  This may be a good start.
>
>
> For the umteenth time, NO string processing. You'll be going straight to
> ambiguity hell if you do that - you'll end with a parser that misparses
> strings just because you weren't careful with the string substitutions.
>
> If you want to substitute, do it on the AST.

Yes, sure. Sorry, didn't think before hitting Send :-(

>> For more complicated expressions (like "limit sin(x)/x when x goes to
>> 0"), I think that operator precedence will not be enough.
>
>
> In fact this kind of stuff is easy. Just make "limit", "when", and "goes to"
> bind less tightly than operators.
>
> What's hard is unary minus. Or anything where you want the same symbol on
> different precedence levels depending on syntactic context.

I see; I did a really poor job thinking over the previous message,
sorry.

>> In that
>>
>> case I'd go for a generator of one of the more powerful parsers (LR
>> for example).
>
>
> Did you ever write a nontrivial LR grammar?
> Fixing the inevitable shift-reduce conflicts is hard. Really hard.
> Fixing the conflicts also means changing the grammar. Which means that the
> grammar will produce parse trees that do not reflect the original grammar
> anymore.
> Adding another rule to an LR grammar tends to break it. All over the place.
> Basically, you need to restart the conflict-fixing process.
>
> We'd need an LR grammar expert, or live with the suffering.

I see. My course on compilers was seriously biased toward simple
explanations of simple algorithms; my practical experience is the
weakest part of my knowledge in this area. When I know that
conflicts can be fixed, I rarely bother to see how hard it can be to fix
them in practice.

Sergiu

Aaron Meurer

Mar 13, 2012, 4:17:43 PM
to sy...@googlegroups.com
On Tue, Mar 13, 2012 at 1:46 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> On 13.03.2012 09:22, Aaron Meurer wrote:
>
>> Thanks for the helpful advice.  So which method would you recommend
>> using for this?
>
>
> Hard to tell right now. We need to see what grammars we want to support.
>
> So the first step would probably be to draw up all the grammar rules for
> features present (and future, as far as we can fathom them).
> Make that as complete as possible - it's sometimes the innocent rule that
> will make or break a parsing technique.

So would it help to start a wiki page where we list all the things we
want to support, in the order of importance?

Here's a beginning of that list (in order):

- SymPy syntax: This is probably obvious, but correct SymPy/Python
syntax should always be parsed exactly as it is given. If the
heuristic parser has ambiguity problems that would prevent this, we
can just preparse with a call to sympify(), and only use heuristics if
that fails (see the sketch after this list).

- Mathematica, Maple, Maxima, etc. syntax. Where they conflict, we
should pick the more popular variant, or if that's nontrivial, we can
just let it be decided as a side effect of the implementation (i.e., leave
the behavior undefined in that case).

- LaTeX. The ability to parse LaTeX math into something that can be
computed would be very useful. WolframAlpha has support for this.

- Text based math. What I mean here is, it should support parsing
things as you would type them in plain text without trying to follow
any kind of set grammar. Stuff like 3x^2 + 2, d/dx x^2.

- Special symbols: Support stuff like √x or ∫x^2 dx. Allow, to some
degree, pasting in stuff from the SymPy pretty printer (in particular,
things that are not printed on more than one line, like 3⋅x₃).

- Text based functions: Stuff like "integrate x^2 dx", "limit x^2
with respect to x as x approaches infinity".

- Natural language processing: The line between this and the last
bullet point is blurry. What I mean here is that it tries to guess
what is meant from a plain text description without using a set
grammar. This could even support stuff like "the integral of x squared
with respect to x".

I think we should support at least the first three bullet points.
Supporting the last one or two bullet points will be the most
difficult, and can be left out, at least in the initial
implementation. We should also consider if a scheme will be
extendable to such things in the long term.

Shall I start a wiki page? I know there have been other things
discussed here, like unary minus and bra-ket, that can be problems
that are important to consider.

Aaron Meurer

>
> Unless we're extremely lucky, that initial grammar will not work (except for
> Earley, which will accept anything).
> The next round will then be to see what modifications to the grammar are
> needed to make it acceptable for the various parsing techniques.
> For the ambiguity checking, any parser generator that uses a specific
> parsing technique will do. Prefer one with a record of good error messages,
> we'll need them.
>
> Then it's time for the judgement call: Which techniques require what
> adaptations of the grammar, and which of the adaptations are acceptable and
> which aren't.
>
>


Joachim Durchholz

unread,
Mar 13, 2012, 4:18:32 PM3/13/12
to sy...@googlegroups.com
Am 13.03.2012 17:07, schrieb Sergiu Ivanov:
> Thank you for the very good overview; I wish I had been given
> something like this in my course on compilers.

You're welcome :-)

Some of that knowledge came with bruises.

>> A parser generator is required for LL and LR variants.
>> The problem with these is that they usually come with their own syntax, so
>> using them requires learning new skills.
>
> I think the effort required to acquire these new skills is still less
> than that of mastering a certain parsing technique and then
> implementing it. (Just for reference, I think I can remember I once
> learnt the basics of JavaCC in a day.)

You learnt the syntax of the tool.
The hard part is dealing with conflicts.
And to deal with conflicts, you need to have mastered the parsing
technique. In the sense of "what does it actually do internally".
At which point you're already fully equipped to roll your own parser.

Of course, writing a parser engine is still more work than using a
prewritten one. But it's not THAT much; 500-1000 lines of code could be
enough. Besides, there are tons of recipes on the internet - the real
challenge would be to decide which one is best for our purposes.

> On the other hand, you can switch the parsing techniques much easier
> if you are using parser generators.

No, not at all.
All parser generators have a different input syntax, different
mechanisms to attach semantic actions, etc.
That's a decision that cannot be reversed easily.

>>> I'm not sure about the state of parser
>>> generators for Python, but this page
>>> http://wiki.python.org/moin/LanguageParsing may provide some
>>> information.
>>
>>
>> A parser that's embedded into Python as a DSL would be an option.
>
> I'm not sure I can properly understand your idea; why do we need to
> *embed* the DSL into Python?

Wikipedia:
"embedded (or internal) domain-specific languages, implemented as
libraries which exploit the syntax of their host general purpose
language or a subset thereof, while adding domain-specific language
elements (data types, routines, methods, macros etc.)."

That is, have a few functions, and the grammar becomes a set of function
calls that will contain the grammar, or generate the data to drive the
parser engine.
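
Roughly like this toy sketch (just an illustration of the idea, not any
particular library):

import re

def token(pattern):
    regex = re.compile(pattern)
    def parse(text, pos):
        m = regex.match(text, pos)
        return (m.group(), m.end()) if m else None
    return parse

def seq(*parsers):
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def alt(*parsers):
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

# The "grammar" is now just Python:
number = token(r"\d+")
symbol = token(r"[a-zA-Z]\w*")
atom   = alt(number, symbol)
power  = seq(atom, token(r"\^"), atom)

print(power("x^2", 0))   # (['x', '^', '2'], 3)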

> I've checked the link I provided one more time; it lists a number of
> implementations that look quite appealing, including
> http://www.acooke.org/lepl/

- Recursive descent cannot (usually) handle infix operators.
+ Parsers are Python code, defined in Python itself.
(i.e. it is an embedded DSL.)

> and http://pypi.python.org/pypi/modgrammar

- Cannot handle left recursion (BLOCKER)

Left-associative operators need left recursion.

> How so? I can't see how it is hard to replace ("roots" or
> "solutions") with "solve" before parsing.

Oh. So you'd replace "rootset" with "solveet".

>> It's easier to build alternate definitions into the semantic processing that
>> happens after the parse.
>
> Hm, I think I can see your point now. It is indeed more correct to
> process alternate definitions into the after-parse semantic
> processing. However, I'm not sure the resulting code is going to be
> simpler.

If you parse things, you need a place to link up a subtree with some
semantics.
Providing alternate syntaxes for the same semantics just means you have
to assign the same semantics to some different kinds of subtrees. You
just reuse the existing semantic object (callable, class, whatever).
Maybe with a small adapter if the semantic object assumes a specific
structure in the tree (most parser toolkits abstract away even from that).

>> Having a larger grammar isn't so much a maintenance problem either if each
>> rule follows a pattern from a small set of standard patterns.
>
> I think I don't understand this. Could you please give more details?

It's no problem if you have a+b, a-b, a*b, a^b, a&&b, etc. etc.
It *is* a problem if you have a-b, a*b, and -a, and want -a to bind
tighter than a*b.

It is not a problem to have a gazillion of control structures like
while...end, switch, etc.
It *is* a problem if you have if...then...else... (without a closing
endif). Google for "dangling else".

>> Spell checking, again, is something that is easily done on the parse tree
>> after it has been created.
>
> I'm not sure I agree. Consider the (supposedly valid) sentences
> "integrate sin(x) by x" and "limit sin(x) when x goes to zero". I
> don't think I'd recommend parsing these two sentences with one
> (general) rule, which means that the words "integrate" and "limit"
> actually determine which of the rules to use. If spell checking
> doesn't happen before lexing, the necessary difference between
> "integrate" and "limit" may not be detected.

Where does spell checking factor in here?

>>> The preprocessor could also drop
>>> incomprehensible (and thus supposedly meaningless) words, like in
>>> "find the integral of x^2".
>>
>> Experience with this kind of approach was that it tends to reduce
>> predictability. In this case, the user might consider a word meaningless but
>> the program has some obscure definition for it and suddenly spits out error
>> messages that refer to stuff that the user doesn't know about
>
> I don't think this is a problem because the user is not supposed to
> purposefully input meaningless words in normal scenarios.

Then I don't understand what purpose the mechanism serves.

Joachim Durchholz

unread,
Mar 13, 2012, 4:58:12 PM3/13/12
to sy...@googlegroups.com
Am 13.03.2012 21:17, schrieb Aaron Meurer:
> So would it help to start a wiki page where we list all the things we
> want to support, in the order of importance?
> Here's a beginning of that list (in order):
>
> - SymPy syntax: This is probably obvious, but correct SymPy/Python
> syntax should always be parsed exactly as it is given. If the
> heuristic parser has ambiguities problems that would prevent this, we
> can just preparse with a call to sympify(), and only use heuristics if
> that fails.
>
> - Mathematica, Maple, Maxima, etc. syntax. Where they conflict, we
> should pick the more popular variant, or if that's nontrivial, we can
> just let it decide as a side-effect of the implementation (i.e., leave
> the behavior undefined in that case).
>
> - LaTeX. The ability to parse LaTeX math into something that can be
> computed would be very useful. WolframAlpha has support for this.

It's almost guaranteed that combining syntaxes from different sources
gives an ambiguous grammar. The only techniques that can deal with that
would be those in the succession of the Earley parser.

I see that http://en.wikipedia.org/wiki/Earley_parser lists four
different Python implementations, one of them just 150 lines.

> - Text based math. What I mean here is, it should support parsing
> things as you would type them in plain text without trying to follow
> any kind of set grammar. Stuff like 3x^2 + 2, d/dx x^2.

That's really hard to do well. Most of the time, the user's guess of
the parser's guess will be quite different from what the parser
actually guesses.

> - Special symbols: Support stuff like √x or ∫x^2 dx. Allow, to some
> degree, pasting in stuff from the SymPy pretty printer (in particular,
> things that are not printed on more than one line, like 3⋅x₃).

That's simple. Just plop in the appropriate grammar rules. Make √ a
prefix operator, ∫...dx a "circumfix" one.
₃ would probably have to be lexed as <sub>3<endsub>, where <sub> and
<endsub> are synthetic lexer symbols.
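
For instance, a lexer could emit those synthetic symbols along these lines
(my own sketch; single characters and single-digit subscripts only, just to
show the idea):

SUBSCRIPT_DIGITS = {"₀": "0", "₁": "1", "₂": "2", "₃": "3"}  # and so on

def lex_subscripts(text):
    tokens = []
    for ch in text:
        if ch in SUBSCRIPT_DIGITS:
            tokens += ["<sub>", SUBSCRIPT_DIGITS[ch], "<endsub>"]
        else:
            tokens.append(ch)
    return tokens

print(lex_subscripts("3⋅x₃"))
# ['3', '⋅', 'x', '<sub>', '3', '<endsub>']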

> - Text based functions: Stuff like "integrate x^2 dx", "limit x^2
> with respect to x as x approaches infinity".
>
> - Natural language processing: There is a vagary between this and the
> last bullet point. What I mean here is that it tries to guess what is
> meant from a plain text description without using a set grammar. This
> could even support stuff like "the integral of x squared with respect
> to x".

The same caveats as with "text-based math" apply.

> Shall I start a wiki page? I know there have been other things
> discussed here, like unary minus and bra-ket, that can be problems
> that are important to consider.

I see two things that need a decision:

1) Whether supporting such a wide array of syntaxes is such an important
goal.
If yes, Earley parsing it is.
If no, it would be defining our own syntax, possibly similar to existing
symbolic math languages, but still a separate syntax.

From the user's perspective, the consequence of an Earley parser would
be an additional error mode: the input text might have multiple valid
parses.
(How best to present that to the user might be one or more GSoC projects.)

The consequence of a non-Earley parser, regardless of technology, would
be that we'd have to drastically cut down on the allowed syntax.
Essentially, we'd have to resolve all potential syntactic ambiguities
when writing the grammar.

I think this decision does not benefit from a Wiki.

2) Which grammar rules to support.

This is a bit tedious: look up what the various syntaxes have, write it
down.

A wiki page would be useful for that.

Aaron Meurer

unread,
Mar 13, 2012, 6:40:42 PM3/13/12
to sy...@googlegroups.com
On Tue, Mar 13, 2012 at 2:58 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> Am 13.03.2012 21:17, schrieb Aaron Meurer:
>
>> So would it help to start a wiki page where we list all the things we
>> want to support, in the order of importance?
>
>> Here's a beginning of that list (in order):
>>
>>
>> - SymPy syntax:  This is probably obvious, but correct SymPy/Python
>> syntax should always be parsed exactly as it is given.  If the
>> heuristic parser has ambiguities problems that would prevent this, we
>> can just preparse with a call to sympify(), and only use heuristics if
>> that fails.
>>
>> - Mathematica, Maple, Maxima, etc. syntax. Where they conflict, we
>> should pick the more popular variant, or if that's nontrivial, we can
>> just let it decide as a side-effect of the implementation (i.e., leave
>> the behavior undefined in that case).
>>
>> - LaTeX.  The ability to parse LaTeX math into something that can be
>> computed would be very useful.  WolframAlpha has support for this.
>
>
> It's almost guaranteed that combining syntaxes from different sources gives
> an ambiguous grammar. The only technique that can deal with that would those
> in the succession of the Earley parser.
>
> I see that http://en.wikipedia.org/wiki/Earley_parser lists four different
> Python implementations, one of them just 150 lines.

Just about all of them are relatively short. I suppose it wouldn't be
hard, then, to just implement this from scratch.
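
For instance, a bare-bones recognizer (no parse-forest construction, no
optimizations; just my sketch to get a feel for the size) fits in a couple
dozen lines:

def earley_recognize(grammar, start, text):
    # grammar: list of (lhs, rhs-tuple) rules; terminals are single characters.
    nonterminals = {lhs for lhs, _ in grammar}
    # chart[i] holds states (lhs, rhs, dot, origin)
    chart = [set() for _ in range(len(text) + 1)]
    chart[0] = {(lhs, rhs, 0, 0) for lhs, rhs in grammar if lhs == start}
    for i in range(len(text) + 1):
        changed = True
        while changed:                    # run predict/complete to a fixpoint
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in nonterminals:   # predict
                    new = {(l, r, 0, i) for l, r in grammar if l == rhs[dot]}
                elif dot == len(rhs):                             # complete
                    new = {(l, r, d + 1, o) for l, r, d, o in chart[origin]
                           if d < len(r) and r[d] == lhs}
                else:
                    new = set()
                if not new <= chart[i]:
                    chart[i] |= new
                    changed = True
        if i < len(text):                                         # scan
            chart[i + 1] |= {(lhs, rhs, dot + 1, origin)
                             for lhs, rhs, dot, origin in chart[i]
                             if dot < len(rhs) and rhs[dot] == text[i]}
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[-1])

# A deliberately ambiguous toy grammar: E -> E "+" E | "a"
grammar = [("E", ("E", "+", "E")), ("E", ("a",))]
print(earley_recognize(grammar, "E", "a+a+a"))   # True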

>
>
>> - Text based math.  What I mean here is, it should support parsing
>> things as you would type them in plain text without trying to follow
>> any kind of set grammar.  Stuff like 3x^2 + 2, d/dx x^2.
>
>
> That's really hard to do well. Most of the time, the users's guess of the
> parser's guess will be quite different than the actuall guess of the parser.
>
>
>> - Special symbols: Support stuff like √x or ∫x^2 dx.  Allow, to some
>> degree, pasting in stuff from the SymPy pretty printer (in particular,
>> things that are not printed on more than one line, like 3⋅x₃).
>
>
> That's simple. Just plop in the appropriate grammar rules. Make √ a prefix
> operator, ∫...dx a "circumfix" one.
> ₃ would probably have to be lexed as <sub>3<endsub>, where <sub> and
> <endsub> are synthetic lexer symbols.

Or preparse and replace ∫ with "integrate" and so on.

Subscripts have no syntactical meaning, so those should actually just
be considered part of the Symbol name (maybe translated from "₃" to
"_3").

If you put the information from the earlier post together on a
wiki, and format it so it's nice and readable, it will be easier
to go through than trying to read through this thread again.

>
> 2) Which grammar rules to support.
>
> This is a bit tedious: look up what the various syntaxes have, write it
> down.
>
> A wiki page would be useful for that.

OK, I started https://github.com/sympy/sympy/wiki/parsing. I included
what I said above, and also your bit about the different parsers.
Feel free to edit that page however you want.

Aaron Meurer

Joachim Durchholz

unread,
Mar 14, 2012, 2:53:12 AM3/14/12
to sy...@googlegroups.com
Am 13.03.2012 23:40, schrieb Aaron Meurer:
>> I see that http://en.wikipedia.org/wiki/Earley_parser lists four different
>> Python implementations, one of them just 150 lines.
>
> Just about all of them are relatively short.

Ah, good to know, I checked only the last one because it was announced
as short in WP.

> I suppose it wouldn't be
> hard, then, to just implement this from scratch.

Agreed.

> Or preparse and replace ∫ with "integrate" and so on.

Not necessary. Just define ∫ as an alias of "integrate" at the semantic
level.
The advantage is that you avoid any questions of how this affects lexing
and parsing, and it's really easy to implement.

> Subscripts have no syntactical meaning, so those should actually just
> be considered part of the Symbol name

Agreed.

> (maybe translated from "₃" to "_3").

Just leave the symbol as it is.
Somebody might have both x₃ and x_3 in the same formula.

The general rule with syntax is: don't mess with the text unless you
really need to.
It introduces all kinds of problems, from introducing subtle ambiguities
that weren't in the original input to making it hard to produce error
messages that are written in terms of the original user input.

> OK, I started https://github.com/sympy/sympy/wiki/parsing. I included
> what I said above, and also your bit about the different parsers.
> Feel free to edit that page however you want.

Okay, done.
The relevant part is that if we want to allow an ambiguous grammar, we
need Earley.
The only serious problem that I see is that we might find that whatever
people enter will be ambiguous. I guess standard expressions
like a+b will work fine, but [a,b] might be interpreted as a list, a
parameter list, an interval, and whatnot, so the system would spit out
ambiguities for each and every nontrivial input.
I can see various ways to deal with that kind of problem, but I guess we
can wait and see - it might be less of a problem after all. If it turns
out to be a showstopper, we can always remove grammar rules, losing a
bit of compatibility with the syntax it was taken from and gaining a
reduction in ambiguity.

Somebody needs to go out and collect the grammar rules for all the
grammars that we want to parse.

Bharath M R

unread,
Mar 14, 2012, 4:43:19 AM3/14/12
to sy...@googlegroups.com
I am a complete noob here. But if we are trying to implement a wide array of
syntaxes, can't we write separate parsers for each of them and combine them
with try/except statements?
Say we have a LaTeX input.
Then
try:
    # try strict SymPy syntax first
    sympify(input)
except InterpretationError:
    try:
        # fall back to Mathematica syntax
        mathematica(input)
    except InterpretationError:
        try:
            # finally, try LaTeX
            latex(input)
        except InterpretationError:
            raise Exception("Not interpreted")

Something like the above. This might help us to divide the task of writing a
parser, and it avoids the need to make one universal parser, which should
reduce the complexity of each parser.

Sergiu Ivanov

unread,
Mar 14, 2012, 9:15:15 AM3/14/12
to sy...@googlegroups.com
On Tue, Mar 13, 2012 at 10:18 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> Am 13.03.2012 17:07, schrieb Sergiu Ivanov:
>
>>> A parser generator is required for LL and LR variants.
>>> The problem with these is that they usually come with their own syntax,
>>> so
>>> using them requires learning new skills.
>>
>>
>> I think the effort required to acquire these new skills is still less
>> than that of mastering a certain parsing technique and then
>> implementing it.  (Just for reference, I think I can remember I once
>> learnt the basics of JavaCC in a day.)
>
>
> You learnt the syntax of the tool.
> The hard part is dealing with conflicts.
> And to deal with conflicts, you need to have mastered the parsing technique.
> In the sense of "what does it actually do internally".
> At which point you're already fully equipped to roll your own parser.

Yes, you're right, I didn't initially take that into consideration.

>> On the other hand, you can switch the parsing techniques much easier
>> if you are using parser generators.
>
>
> No, not at all.
> All parser generators have a different input syntax, different mechanisms to
> attach semantic actions, etc.
> That's a decision that cannot be reversed easily.

I see. I thought the differences in syntax were not that influential.

>>>>  I'm not sure about the state of parser
>>>> generators for Python, but this page
>>>> http://wiki.python.org/moin/LanguageParsing may provide some
>>>> information.
>>>
>>>
>>>
>>> A parser that's embedded into Python as a DSL would be an option.
>>
>>
>> I'm not sure I can properly understand your idea; why do we need to
>> *embed* the DSL into Python?
>
>
> Wikipedia:
> "embedded (or internal) domain-specific languages, implemented as libraries
> which exploit the syntax of their host general purpose language or a subset
> thereof, while adding domain-specific language elements (data types,
> routines, methods, macros etc.)."
>
> That is, have a few functions, and the grammar becomes a set of function
> calls that will contain the grammar, or generate the data to drive the
> parser engine.

Hm, I see. I vaguely had a similar idea in my mind, but didn't know
the term; thank you :-)

>> and http://pypi.python.org/pypi/modgrammar
>
> - Cannot handle left recursion (BLOCKER)

Hm, thanks for pointing that out. I tried to check that yesterday and I
still cannot find the information in their docs (at least directly
searching for "recursion" doesn't work for me).

>>> It's easier to build alternate definitions into the semantic processing
>>> that
>>> happens after the parse.
>>
>>
>> Hm, I think I can see your point now.  It is indeed more correct to
>> process alternate definitions into the after-parse semantic
>> processing.  However, I'm not sure the resulting code is going to be
>> simpler.
>
>
> If you parse things, you need a place to link up a subtree with some
> semantics.
> Providing alternate syntaxes for the same semantics just means you have to
> assign the same semantics to some different kinds of subtrees. You just
> reuse the existing semantic object (callable, class, whatever). Maybe with a
> small adapter if the semantic object assumes a specific structure in the
> tree (most parser toolkits abstract away even from that).

I see; agreed.

>>> Having a larger grammar isn't so much a maintenance problem either if
>>> each
>>> rule follows a pattern from a small set of standard patterns.
>>
>>
>> I think I don't understand this.  Could you please give more details?
>
>
> It's no problem if you have a+b, a-b, a*b, a^b, a&&b, etc. etc.
> It *is* a problem if you have a-b, a*b, and -a, and want -a to bind tighter
> than a*b.
>
> It is not a problem to have a gazillion of control structures like
> while...end, switch, etc.
> It *is* a problem if you have if...then...else... (without a closing endif).
> Google for "dangling else".

Ah, I see what you mean now.

>>> Spell checking, again, is something that is easily done on the parse tree
>>> after it has been created.
>>
>>
>> I'm not sure I agree.  Consider the (supposedly valid) sentences
>> "integrate sin(x) by x" and "limit sin(x) when x goes to zero".  I
>> don't think I'd recommend parsing these two sentences with one
>> (general) rule, which means that the words "integrate" and "limit"
>> actually determine which of the rules to use.  If spell checking
>> doesn't happen before lexing, the necessary difference between
>> "integrate" and "limit" may not be detected.
>
>
> Where does spell checking factor in here?

I wanted to say that if "integrate" and "limit" belong to different
classes of symbols, then, should the lexer encounter "intgrate", it
wouldn't be able to correctly classify it as "integrate".
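
Something as simple as this is what I had in mind (standard library only;
the keyword list is obviously made up):

import difflib

KEYWORDS = ["integrate", "limit", "solve", "differentiate"]

def correct_keyword(word):
    matches = difflib.get_close_matches(word, KEYWORDS, n=1, cutoff=0.8)
    return matches[0] if matches else word

print(correct_keyword("intgrate"))   # integrate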

>>>> The preprocessor could also drop
>>>> incomprehensible (and thus supposedly meaningless) words, like in
>>>> "find the integral of x^2".
>>>
>>>
>>> Experience with this kind of approach was that it tends to reduce
>>> predictability. In this case, the user might consider a word meaningless
>>> but
>>> the program has some obscure definition for it and suddenly spits out
>>> error
>>> messages that refer to stuff that the user doesn't know about
>>
>>
>> I don't think this is a problem because the user is not supposed to
>> purposefully input meaningless words in normal scenarios.
>
>
> Then I don't understand what purpose the mechanism serves.

I was thinking about the situation when the user intuitively enters
some text which includes elements of a natural language (like "find"
and "the" in "find the limit sin(x)/x when x goes to 0"). In this
case the user thinks that all words are meaningless, but, for the
application, "find" and "the" bear no meaning.

Sergiu

Sergiu Ivanov

unread,
Mar 14, 2012, 9:17:54 AM3/14/12
to sy...@googlegroups.com
On Wed, Mar 14, 2012 at 3:15 PM, Sergiu Ivanov
<unlimite...@gmail.com> wrote:
>
>In this
> case the user thinks that all words are meaningless

Oops, I wanted to say "the user thinks all words are meaningful".

Sorry :-(

Sergiu

Joachim Durchholz

unread,
Mar 15, 2012, 6:19:29 PM3/15/12
to sy...@googlegroups.com
Am 14.03.2012 09:43, schrieb Bharath M R:
> I am a complete noob here. But if we are trying to implement wide array of
> syntaxes,
> can't we write separate parsers for each of them and combine them with a
> try exception statements
> Say we have a latex input .
> Then
> try:
> sympify(input)
> except InterpretationError:
> try:
> mathematica(input)
> except InterpretationError:
> try:
> latex(input)
> except InterpretationError:
> throw Exception("Not interpreted")
>
> Something like the above. This might help us to divide the task of writing
> a parser and also the need make the parser universal and hence can
> reduce the complexity of the parser.

Technical detail: Parsers don't necessarily throw an exception if they
fail to parse; they might return Null instead of a parse tree if there
is an error, or return a data structure with a status flag set.

But, yes, it would be possible to try each syntax and see which of them
gives a useful result.

Maybe it's even pointless to assume that a LaTeX formula may contain a
Maple-syntax subformula. What are the use cases that we want to support here?
(We could be in for serious trouble anyway if we allow mixing syntaxes
within the same formula. For example, square brackets might have
different interpretations depending on whether it's interpreted as
LaTeX, SymPy, or Mathematica.)

However, I would not stick with the first syntax that gives a match.
It's entirely possible that two different syntaxes give a valid parse.
So: try all syntaxes. Check if they give the same abstract syntax
(logical structure, if you will). If they give different
interpretations, alert the user to the fact and ask which
interpretation he meant.
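
In rough Python (the parser functions are placeholders, of course):

def parse_all(text, parsers):
    # parsers: dict mapping a syntax name to a parse function
    results = {}
    for name, parse in parsers.items():
        try:
            results[name] = parse(text)
        except Exception:
            pass                      # this syntax simply does not apply
    interpretations = set(map(str, results.values()))
    if len(interpretations) > 1:
        print("Ambiguous input, candidate readings:", results)
    return results

# Stand-in "parsers" just to show the mechanics:
print(parse_all("3", {"int": int, "float": float}))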

Just food for thought.

Joachim Durchholz

unread,
Mar 15, 2012, 6:58:50 PM3/15/12
to sy...@googlegroups.com
Am 14.03.2012 14:15, schrieb Sergiu Ivanov:
>>> On the other hand, you can switch the parsing techniques much easier
>>> if you are using parser generators.
>>
>> No, not at all.
>> All parser generators have a different input syntax, different mechanisms to
>> attach semantic actions, etc.
>> That's a decision that cannot be reversed easily.
>
> I see. I thought the differences in syntax were not that influential.

Well, it's not just syntax. The way you express semantic actions can
have a deep impact.
It's like the similarities and differences between C and Pascal: both
are statically-typed, imperative languages with little thought for
modularity. Still, you wouldn't want to rewrite programs from one
language to the other, not even ones as small as 500 lines.

>>> and http://pypi.python.org/pypi/modgrammar
>>
>> - Cannot handle left recursion (BLOCKER)
>
> Hm, thanks for pointing out. I tried to check that yesterday and I
> still cannot find the information in their docs (at least direct
> searching for "recursion" doesn't work for me).

http://packages.python.org/modgrammar/tutorial.html#left-recursion

Last paragraph of the section.

The point here is that while a left-recursive and a right-recursive
grammar are equivalent in the set of inputs that they accept, they
generate different parse trees.

>>>> Spell checking, again, is something that is easily done on the parse tree
>>>> after it has been created.
>>>
>>>
>>> I'm not sure I agree. Consider the (supposedly valid) sentences
>>> "integrate sin(x) by x" and "limit sin(x) when x goes to zero". I
>>> don't think I'd recommend parsing these two sentences with one
>>> (general) rule, which means that the words "integrate" and "limit"
>>> actually determine which of the rules to use. If spell checking
>>> doesn't happen before lexing, the necessary difference between
>>> "integrate" and "limit" may not be detected.
>>
>>
>> Where does spell checking factor in here?
>
> I wanted to say that if "integrate" and "limit" belong to different
> classes of symbols, then, should the lexer encounter "intgrate", it
> wouldn't be able to correctly classify it as "integrate".

Hmmm... that usually doesn't end well. You never know whether a
particular misspelling was intentional or not.

E.g. is "limits" a misspelled keyword, or a free variable because he's
doing some first-order logic on interval arithmetic (where he might be
manipulating sets of limits, i.e. upper/lower bounds).

The standard advice to programming language designers is that this is
One Of Those Well-Intentioned Bad Ideas From The '50s.
Other such ideas were "make a programming language look like
English" (that gave us Cobol), or "put all good ideas into one language
to give us the One True Language That Has It All" (that gave us PL/I).

That doesn't mean that this kind of advice is always valid, but you need
to know what you're doing and why those approaches failed, to avoid
falling into the same traps that have been known for over half a century now.

>>>>> The preprocessor could also drop
>>>>> incomprehensible (and thus supposedly meaningless) words, like in
>>>>> "find the integral of x^2".
>>>>
>>>> Experience with this kind of approach was that it tends to reduce
>>>> predictability. In this case, the user might consider a word meaningless
>>>> but
>>>> the program has some obscure definition for it and suddenly spits out
>>>> error
>>>> messages that refer to stuff that the user doesn't know about
>>>
>>> I don't think this is a problem because the user is not supposed to
>>> purposefully input meaningless words in normal scenarios.
>>
>> Then I don't understand what purpose the mechanism serves.
>
> I was thinking about the situation when the user intuitively enters
> some text which includes elements of a natural language (like "find"
> and "the" in "find the limit sin(x)/x when x goes to 0"). In this
> case the user thinks that all words are meaningless, but, for the
> application, "find" and "the" bear no meaning.

Well, it's really hard to write a parser that does this kind of stuff
well enough to be worth it.
I do not know of a single project that successfully implemented such a
thing, but I know of several that failed abysmally.

You're trying to get a DWIM thing implemented.
See http://catb.org/jargon/html/D/DWIM.html .
The first rule for this kind of thing is: Don't guess what the problems
are, go out and watch what problems bite the users in reality. Then add
some (cautious) spell checking and other DWIM mechanisms, and
iteratively correct them until they do *far* more good than harm.
It's nothing that can be designed. You need to deal with what people
actually do (and, hence, it is probably too much work for the current
size of the SymPy community).

We could start that by adding some statistics code to SymPy: which
functions are used, what kinds of errors happen, where do they
originate. Define a mail address to send them to, write the tools to
extract information from the raw data. Think about how to ask the user
about whether collecting usage data is okay for them (some might
disagree, an NSA mathematician working with SymPy on some secret
research would most certainly disagree).

Bharath M R

unread,
Mar 17, 2012, 2:35:32 AM3/17/12
to sy...@googlegroups.com
I think writing the parser would be a GSoC project in itself. Also, I don't think I have the expertise to write the parser.
Though I could read up and try to implement it, I don't think I would be able to do a good job of it.

I would like to take up the task of implementing the gamma.sympy.org site by including the following:

1) Back end for svgFig, so that we can plot figures on GAE. As svgFig is
written in python, GAE won't have a problem with it. 

2) Sphinx documentation integration into the site. Suppose we call integrate; the
site should show links to related documentation such as Sum, the definite integral how-to, etc.
 
3) Integration of a Python terminal as in SymPy Live. This will be hidden by default and
can be accessed whenever it is required.

4) Make the site render on both hand held devices and computers using twitter bootstrap.

5) If time permits, I would like to write the backend for ASCII plots. This looks like a really fun
project, and I think it is necessary on the command line/IPython when matplotlib is not
installed. This will allow us to plot without any dependencies.

Thanks,
Bharath M R

Aaron Meurer

unread,
Mar 17, 2012, 4:05:45 AM3/17/12
to sy...@googlegroups.com
Sure, it can be a separate project if you want, though in that case,
you might also look into ways that you could improve SymPy Live in
parallel.

Aaron Meurer


Sergiu Ivanov

unread,
Mar 17, 2012, 8:03:38 AM3/17/12
to sy...@googlegroups.com
On Fri, Mar 16, 2012 at 12:19 AM, Joachim Durchholz <j...@durchholz.org> wrote:
>
> However, I would not stick with the first syntax that gives a match. It's
> entirely possible that two different syntaxes give a valid parse.
> So: try all syntaxes. Check if they give the same abstract syntax (logical
> structure, if you will). If they give different interpretations, alert the
> user to the fact and ask which interpretation is what he meant.

I think it might be possible to merge the two approaches and, if more
than one grammar gives a match, show one valid parse and allow the user
to see the others. In this case we might collect statistics to
see which of the grammars is usually chosen in ambiguous situations, so
that we will be able to adjust the defaults accordingly.

Sergiu

Sergiu Ivanov

unread,
Mar 17, 2012, 8:25:45 AM3/17/12
to sy...@googlegroups.com
On Fri, Mar 16, 2012 at 12:58 AM, Joachim Durchholz <j...@durchholz.org> wrote:
> Am 14.03.2012 14:15, schrieb Sergiu Ivanov:
>
>> I see.  I thought the differences in syntax were not that influential.
>
>
> Well, it's not just syntax. The way you express semantic actions can have a
> deep impact.
> It's alike the similarities and differences between C and Pascal: both are
> statically-types, imperative language with little thought for modularity.
> Still, you wouldn't want to rewrite programs from one language to the other,
> not even as small ones as 500 lines.

Hm, the difference is actually much more fundamental than my
imagination would paint it to me :-(

>>>> and http://pypi.python.org/pypi/modgrammar
>>>
>>>
>>> - Cannot handle left recursion (BLOCKER)
>>
>>
>> Hm, thanks for pointing out.  I tried to check that yesterday and I
>> still cannot find the information in their docs (at least direct
>> searching for "recursion" doesn't work for me).
>
>
> http://packages.python.org/modgrammar/tutorial.html#left-recursion
>
> Last paragraph of the section.

Oh, thank you.

> The point here is that while a left-recursive and a right-recursive grammar
> are equivalent in the set of inputs that they accept, they generate
> different parse trees.

Yes, sure. Avoiding left-recursion is usually not that hard with
proper experience (at least I've always found it easy), but I see your
point.

>> I wanted to say that if "integrate" and "limit" belong to different
>> classes of symbols, then, should the lexer encounter "intgrate", it
>> wouldn't be able to correctly classify it as "integrate".
>
>
> Hmmm... that usually doesn't end well. You never know whether a particular
> misspelling was intentional or not.
>
> E.g. is "limits" a misspelled keyword, or a free variable because he's doing
> some first-order logic on interval arithmetic (where he might be
> manipulating sets of limits, i.e. upper/lower bounds).

Yeah, I see.

> The standard advice to programming language designers is that this is One Of
> Those Well-Intentioned Bad Ideas From The 50ies.
> Other ideas like what were "make a programming language look like English"
> (that gave us Cobol), or "put all good ideas into one language to give use
> the One True Language That Has It All" (that gave us PL/I).

Hm, a very nice historical perspective. Although I'm trying hard to
avoid such Well-Intentioned Bad Ideas, I'm still producing them with
regularity.

> That doesn't mean that this kind of advice is always valid, but you need to
> know what you're doing and why those approaches failed, to avoid falling to
> the same traps that have been known for over half a century now.

Yes, sure, I see.

>> I was thinking about the situation when the user intuitively enters
>> some text which includes elements of a natural language (like "find"
>> and "the" in "find the limit sin(x)/x when x goes to 0").  In this
>> case the user thinks that all words are meaningless, but, for the
>> application, "find" and "the" bear no meaning.
>
> Well, it's really hard to write a parser that does this kind of stuff well
> enough to be worth it.

I see.

> I do not know of a single project that successfully implemented such a
> thing, but about several that failed abysmally.
>
> You're trying to get a DWIM thing implemented.
> See http://catb.org/jargon/html/D/DWIM.html .

Oh, what a vivid example; thank you :-)

> The first rule for this kind of thing is: Don't guess what the problems are,
> go out and watch what problems bite the users in reality. Then add some
> (cautious) spell checking and other DWIM mechanisms, and iteratively correct
> them until they do *far* more good than harm.
> It's nothing that can be designed. You need to deal with what people
> actually do (and, hence, it is probably too much work for the current size
> of the SymPy community).
>
> We could start that by adding some statistics code to SymPy: which functions
> are used, what kinds of errors happen, where do they originate. Define a
> mail address to send them to, write the tools to extract information from
> the raw data.

Very very cool. Collecting anonymous user data should be very useful
in this regard.

> Think about how to ask the user about whether collecting usage
> data is okay for them (some might disagree, an NSA mathematician working
> with SymPy on some secret research would most certainly disagree).

This could be gently offered as an opt-in feature, I think.

Sergiu

Joachim Durchholz

unread,
Mar 17, 2012, 5:57:02 PM3/17/12
to sy...@googlegroups.com
Am 17.03.2012 13:25, schrieb Sergiu Ivanov:
> On Fri, Mar 16, 2012 at 12:58 AM, Joachim Durchholz<j...@durchholz.org> wrote:
>> Am 14.03.2012 14:15, schrieb Sergiu Ivanov:
>>
>>> I see. I thought the differences in syntax were not that influential.
>>
>>
>> Well, it's not just syntax. The way you express semantic actions can have a
>> deep impact.
>> It's alike the similarities and differences between C and Pascal: both are
>> statically-types, imperative language with little thought for modularity.
>> Still, you wouldn't want to rewrite programs from one language to the other,
>> not even as small ones as 500 lines.
>
> Hm, the difference is actually much more fundamental than my
> imagination would paint it to me :-(

Oh, but the difference between C and Pascal is far from fundamental!
If you want something fundamentally different, try Lisp. Or Haskell. Or
even APL :-)

>> The point here is that while a left-recursive and a right-recursive grammar
>> are equivalent in the set of inputs that they accept, they generate
>> different parse trees.
>
> Yes, sure. Avoiding left-recursion is usually not that hard with
> proper experience (at least I've always found it easy), but I see your
> point.

Eliminating left recursion is indeed not difficult :-)
It's just that you'll have to postprocess the parse tree.
Which distributes the structural description of the parse tree over two
places, the syntax and the postprocessing code. I think that's a real
disadvantage that would need to be offset by some advantage elsewhere.

>> The standard advice to programming language designers is that this is One Of
>> Those Well-Intentioned Bad Ideas From The 50ies.
>> Other ideas like what were "make a programming language look like English"
>> (that gave us Cobol), or "put all good ideas into one language to give use
>> the One True Language That Has It All" (that gave us PL/I).
>
> Hm, a very nice historical perspective. Although I'm trying hard to
> avoid such Well-Intentioned Bad Ideas, I'm still producing them with
> regularity.

"Experience is the sum of errors made."
;-)

>> The first rule for this kind of thing is: Don't guess what the problems are,
>> go out and watch what problems bite the users in reality. Then add some
>> (cautious) spell checking and other DWIM mechanisms, and iteratively correct
>> them until they do *far* more good than harm.
>> It's nothing that can be designed. You need to deal with what people
>> actually do (and, hence, it is probably too much work for the current size
>> of the SymPy community).
>>
>> We could start that by adding some statistics code to SymPy: which functions
>> are used, what kinds of errors happen, where do they originate. Define a
>> mail address to send them to, write the tools to extract information from
>> the raw data.
>
> Very very cool. Collecting anonymous user data should be very useful
> in this regard.

Thanks :-)

>> Think about how to ask the user about whether collecting usage
>> data is okay for them (some might disagree, an NSA mathematician working
>> with SymPy on some secret research would most certainly disagree).
>
> This could be gently offered as and opt-in feature, I think.

Just my thinking.

Don't expect an easy miracle though.
User input is multi-channel high-noise input, extracting useful signals
from that ain't easy.

Joachim Durchholz

unread,
Mar 17, 2012, 5:57:26 PM3/17/12
to sy...@googlegroups.com
Am 17.03.2012 13:03, schrieb Sergiu Ivanov:
> On Fri, Mar 16, 2012 at 12:19 AM, Joachim Durchholz<j...@durchholz.org> wrote:
>>
>> However, I would not stick with the first syntax that gives a match. It's
>> entirely possible that two different syntaxes give a valid parse.
>> So: try all syntaxes. Check if they give the same abstract syntax (logical
>> structure, if you will). If they give different interpretations, alert the
>> user to the fact and ask which interpretation is what he meant.
>
> I think it might be possible to merge the two approaches and, if more
> than one grammar gives match, show one valid parse and allow the user
> to see the others.

Sounds good.

> In this case we might collect the statistics to
> see which of the grammars is usually chosen in ambiguous situations so
> that we will be able to alter the defaults correspondingly.

How about letting the user specify the syntax, the default option being
"auto-detect"?

I'm a bit sceptical about trying to get correlations out of user input.
Repeating myself: User input is multi-channel high-noise input,
extracting useful signals from that is hard.

Sergiu Ivanov

unread,
Mar 18, 2012, 6:45:03 AM3/18/12
to sy...@googlegroups.com
On Sat, Mar 17, 2012 at 11:57 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> Am 17.03.2012 13:25, schrieb Sergiu Ivanov:
>
>> Hm, the difference is actually much more fundamental than my
>> imagination would paint it to me :-(
>
>
> Oh, but the difference between C and Pascal is far from fundamental!
> If you want something fundamentally different, try Lisp. Or Haskell. Or even
> APL :-)

Yes, I should have said "influential" instead of fundamental.

APL looks fun :-) I wish I had the time to actually try it :-)

>> Yes, sure.  Avoiding left-recursion is usually not that hard with
>> proper experience (at least I've always found it easy), but I see your
>> point.
>
> Eliminating left recursion is indeed not difficult :-)
> It's just that you'll have to postprocess the parse tree.
> Which distributes the structural description of the parse tree over two
> places, the syntax and the postprocessing code. I think that's a real
> disadvantage that would need to be offset by some advantage elsewhere.

Yeah, I see.

> Don't expect an easy miracle though.
> User input is multi-channel high-noise input, extracting useful signals from
> that ain't easy.

Yes, right. I wanted to say that gamma.sympy.org won't have many
visitors in the beginning, which may even allow a human eye to
extract some information. However, this assumption may become invalid
very quickly, so yes, collecting and analysing data automatically is
the way to go, although it's not immediately straightforward how to do
it.

Sergiu

Sergiu Ivanov

unread,
Mar 18, 2012, 7:02:52 AM3/18/12
to sy...@googlegroups.com
On Sat, Mar 17, 2012 at 11:57 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> Am 17.03.2012 13:03, schrieb Sergiu Ivanov:
>
>> In this case we might collect the statistics to
>>
>> see which of the grammars is usually chosen in ambiguous situations so
>> that we will be able to alter the defaults correspondingly.
>
>
> How about letting the user specify the syntax, the default option being
> "auto-detect"?

Yeah, this sounds cool. What I like best about this approach is that
things are expected to work nicely (or fail gracefully) both when the
user specifies the syntax and when they don't.

In this case we may even disregard the idea of collecting information
as to which grammars are usually preferred.

> I'm a bit sceptical about trying to get correlations out of user input.
> Repeating myself: User input is multi-channel high-noise input, extracting
> useful signals from that is hard.

Does your scepticism mean that you don't expect much useful
information without considerable effort, or that you don't expect
anything useful to come of collecting the data at all?

Sergiu

Joachim Durchholz

unread,
Mar 18, 2012, 4:52:14 PM3/18/12
to sy...@googlegroups.com
Am 18.03.2012 12:02, schrieb Sergiu Ivanov:
>> I'm a bit sceptical about trying to get correlations out of user input.
>> Repeating myself: User input is multi-channel high-noise input, extracting
>> useful signals from that is hard.
>
> Does your scepticism mean that you don't expect much useful
> information without considerable effort or that you don't expect
> anything useful of collecting the data at all?

It's the former: the information is in the data, but it's hard to
extract it.

Also, the statistics involved is nontrivial. You'd need a working body
of statistical expertise to interpret the correlations correctly.
There's the simple case of applying the rules incorrectly, and the
somewhat more advanced case that if you have 100 correlations, each with
a significance of 99%, you can expect one of them to be bogus (and if
you're unlucky, several are, and if you made some error, all of them are
except those that happened to be real).

Arie Lakeman

unread,
Nov 26, 2014, 9:15:55 PM11/26/14
to sy...@googlegroups.com
Hi, has anything more happened on this? In particular I'm interested in any LaTeX parsing that can be done. I notice the GitHub page https://github.com/sympy/sympy/wiki/Parsing hasn't been updated in 2 years.

Aaron Meurer

unread,
Nov 27, 2014, 1:01:32 AM11/27/14
to sy...@googlegroups.com
No projects on parsing were ever accepted for GSoC. It is still
something that we want to improve, but no major work has been done on
it.

Aaron Meurer