SymPy project ideas?


David Li

unread,
Sep 3, 2012, 2:31:35 PM
to sy...@googlegroups.com
Hello all,

As a high school student, I am encouraged to conduct a science fair experiment each year. I became interested in contributing to SymPy through the 2011 Google Code-In project, and for this year, I am interested in somehow working on SymPy for science fair. I reviewed the GSoC 2012 Ideas and believe I could work on a few of those ideas, in particular, implementing by-hand differentiation/integration in order to show steps or working on some sort of natural-language input for SymPy Gamma/sympify. My question is, are these projects desirable for SymPy, and are there other project ideas (that you think would be approachable)?

I saw the discussion on SymPy Gamma at https://groups.google.com/forum/?fromgroups=#!topic/sympy/YJNc_MoccYg; however, there seems to have been little development since then. Is this still a project SymPy would like to pursue? For a project, I could investigate natural-language input, perhaps by integrating NLTK. After playing with NLTK, I think some areas of research could involve improving the tokenizer to handle math expressions (for instance, currently 'tan(x)' gets parsed as ['tan', '(', 'x', ')']), and of course, actually interpreting the input. A different project would involve investigating/implementing by-hand differentiation/integration methods so that SymPy could show steps.

To give some background about my learning, I am currently taking Multivariable Calculus/Differential Equations. I have completed AP Calculus BC; I have basic knowledge of logic and set theory, but that is the extent of my mathematical knowledge.

Thank you,
David Li

Aaron Meurer

unread,
Sep 3, 2012, 3:37:34 PM
to sy...@googlegroups.com
On Mon, Sep 3, 2012 at 12:31 PM, David Li <li.da...@gmail.com> wrote:
> Hello all,
>
> As a high school student, I am encouraged to conduct a science fair
> experiment each year. I became interested in contributing to SymPy through
> the 2011 Google Code-In project, and for this year, I am interested in
> somehow working on SymPy for science fair. I reviewed the GSoC 2012 Ideas
> and believe I could work on a few of those ideas, in particular,
> implementing by-hand differentiation/integration in order to show steps or
> working on some sort of natural-language input for SymPy Gamma/sympify. My
> question is, are these projects desirable for SymPy, and are there other
> project ideas (that you think would be approachable)?

Yes, definitely. It wouldn't be on that page if it weren't desirable.

I can't think of any others offhand. There are definitely several
directions you could take some of the projects, though, such as the
SymPy Gamma one.

>
> I saw the discussion on SymPy Gamma at
> https://groups.google.com/forum/?fromgroups=#!topic/sympy/YJNc_MoccYg;
> however, there seems to have been little development since then. Is this
> still a project SymPy would like to pursue? For a project, I could
> investigate natural-language input, perhaps by integrating NLTK. After
> playing with NLTK, I think some areas of research could involve improving
> the tokenizer to handle math expressions (for instance, currently 'tan(x)'
> gets parsed as ['tan', '(', 'x', ')']), and of course, actually interpreting
> the input. A different project would involve investigating/implementing
> by-hand differentiation/integration methods so that SymPy could show steps.

There was actually quite a bit of discussion about SymPy Gamma with a
potential GSoC student; I believe it was
https://groups.google.com/d/topic/sympy/rGQ8L5Z26Y0/discussion. It's
worth reading through.

And you might look at the standard library tokenize module, which does
exactly what you say for valid Python code. The goal with SymPy Gamma
would be to extend that somehow to be able to parse things that people
might enter but that aren't valid Python, like 2x or x^2.
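For reference, here is a minimal sketch of what the stdlib tokenize module produces (standard library only; the `toks` helper is just for illustration):

```python
# The stdlib tokenize module splits valid Python source into typed
# tokens. Inputs like "2x" or "x^2" tokenize or parse differently than
# a math user intends, which is where the extensions come in.
import io
import token
import tokenize

def toks(source):
    # Collect (token-type-name, text) pairs, skipping empty/whitespace
    # bookkeeping tokens like NEWLINE and ENDMARKER.
    return [(token.tok_name[t.type], t.string)
            for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.string.strip()]

print(toks("tan(x)"))  # [('NAME', 'tan'), ('OP', '('), ('NAME', 'x'), ('OP', ')')]
print(toks("2 x y"))   # [('NUMBER', '2'), ('NAME', 'x'), ('NAME', 'y')]
```

Note that "2 x y" comes back as a NUMBER followed by two NAMEs with no operator between them; deciding that this means multiplication is exactly the post-tokenization work discussed in this thread.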

>
> To give some background about my learning, I am currently taking
> Multivariable Calculus/Differential Equations. I have completed AP Calculus
> BC; I have basic knowledge of logic and set theory, but that is the extent
> of my mathematical knowledge.

That's better than most high school students. The projects you talk
about should be doable with that background. The SymPy Gamma one
requires more computer science than math, but you can learn that on
your own (no matter what you do, you will need to learn things on your
own, which is a good thing and probably a requirement of the science
fair anyway).

Aaron Meurer

>
> Thank you,
> David Li
>
> --
> You received this message because you are subscribed to the Google Groups
> "sympy" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/sympy/-/Ww56mnNfXdgJ.
> To post to this group, send email to sy...@googlegroups.com.
> To unsubscribe from this group, send email to
> sympy+un...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/sympy?hl=en.

Joachim Durchholz

unread,
Sep 3, 2012, 4:21:02 PM
to sy...@googlegroups.com
We would not want to add an entire natural language processing toolkit;
SymPy has a rather strict "no external dependencies" policy because it
needs to be installable in installer-unfriendly environments
(non-administrator accounts, mobile devices).

However, it would be extremely useful if we could "steal" the parser
engine from such a toolkit.
Most if not all of these engines accept arbitrary context-free grammars.
With such an engine, we could just write down the BNF of some grammar
(Mathematica, Latex, natural language, whatever), and experiment with it
until it works satisfactorily.

A general remark: natural language is notoriously hard to parse. Either
it's intuitive, in which case it's too ambiguous to be useful in a
context like that of symbolic math; or it's precise, in which case it
isn't natural language anymore. Finding the right trade-off for such
things is an ongoing research topic. And after that, you get into the
*really* "interesting" problems...
My advice would be to avoid natural language if it's just a means to an
end; natural language processing just isn't well enough explored for
that, and you'll likely run into more problems than the approach can solve.
If, on the other hand, natural language processing is your primary
interest, by all means continue with it, there's a lot of PhD material
in there :-)

David Li

unread,
Sep 3, 2012, 6:11:03 PM
to sy...@googlegroups.com
Alright, thanks for that other thread. I'll review this and discuss with my teacher to come up with a more specific plan.

The tokenize module is quite interesting - I guess Gamma would eventually try to process non-Python syntax while still accepting Python expressions? Or perhaps some sort of relaxed/non-strict Python grammar could be implemented (with some processing to allow integrate, integral of, etc.) so that tan x, 3x, and e^x are accepted as well as tan(x), 3*x, and exp(x).

On Monday, September 3, 2012 1:21:27 PM UTC-7, Joachim Durchholz wrote:
We would not want to add an entire natural language processing toolkit;
SymPy has a rather strict "no external dependencies" policy because it
needs to be installable in installer-unfriendly environments
(non-administrator accounts, mobile devices).

Alright, I wasn't aware of that requirement. Thanks for pointing that out. NLTK would have been too onerous a dependency in any case, as it requires a total of 800 MB of corpus data to run its various algorithms.
 

However, it would be extremely useful if we could "steal" the parser
engine from such a toolkit.
Most if not all of these engines accept arbitrary context-free grammars.
With such an engine, we could just write down the BNF of some grammar
(Mathematica, Latex, natural language, whatever), and experiment with it
until it works satisfactorily.

I think this was another of the ideas listed on the wiki - I guess it falls under a similar category too. So perhaps some heuristic for differentiating between various input languages (Python, TeX, "English-like", etc.) and then interpreting them could also be an interesting task.
 

A general remark: Natural language is notoriously hard to parse. Either
it's intuitive, then it's too ambiguous to be useful in a context like
that of symbolic math; or it's precise, in which case it isn't natural
language anymore. Finding the right trade-off for such things is an
ongoing research topic. And after that, you get into the *really*
"interesting" problems...
My advice would be to avoid natural language if it's just a means to an
end; natural language processing just isn't explored well enough for
that, and you'll likely get more problems than the approach can solve.
If, on the other hand, natural language processing is your primary
interest, by all means continue with it, there's a lot of PhD material
in there :-)

Since Gamma only deals with mathematical expressions (a more limited domain than Wolfram|Alpha's), I believe at least some basic English-like queries can be interpreted. I should've been more specific about that. I thought natural language processing could help somewhat with the task, or at least point me towards algorithms and ideas, which is why I mentioned it. Given how difficult it is, though, I guess just being able to interpret 2x, sin x, and integral of x^2 would be a nice step up in functionality.

Thanks for all your help and suggestions!

David Li

Joachim Durchholz

unread,
Sep 4, 2012, 5:29:48 AM
to sy...@googlegroups.com
Am 04.09.2012 00:11, schrieb David Li:
> So perhaps some heuristic for differentiating
> between various input languages and then interpreting them as Python
> (Python, TeX, "English-like", etc.) could also be an interesting task.

Heh. That's simple:
- Have a grammar for each syntax that we have,
- run the input through all grammars,
- use the grammar that doesn't return an error.

The fun begins when considering the following cases:
1) No grammar matches.
2) More than one grammar matches.

For (1), you'd want to somehow rank the grammars according to how close
the input is to each grammar, and assume the user really meant the
closest one.

For (2), you'd want to check if the different grammars all really mean
the same. E.g. "1*1" should parse the same for all math grammars. Just
continue processing.
Otherwise, you'll have to ask the user. Or randomly guess one and let
the user explicitly select grammars.

There's also a slight complication for case (2): you may get different
parse trees that boil down to the same operations. For example,
grammars with different numbers of precedence levels tend to end up that
way; 1*2 could parse as

op: *
  int: 1
  int: 2

or as

op: *
  literal
    int: 1
  literal
    int: 2

where the second grammar would for some reason differentiate between
literals, names, and other representations, while the first does not.

You'll either need a pass that normalizes the resulting parse trees, or
require that commonalities between grammars are handled by identical rules.
The first approach probably requires less work because SymPy already has
routines for simplifying expressions; however, it makes error
reporting more difficult because those transformations aren't built to
keep track of input line/column numbers.

You see, there's enough to do :-)

Not all aspects need to be addressed on the first round though. Just
choose how much of this all you want to deal with, and code in a way
that the rest can be added later without rewriting everything.
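The dispatch scheme described above can be sketched in a few lines; the two toy parsers here are hypothetical stand-ins (each returns a parse result or raises ValueError), not real SymPy or LaTeX grammars:

```python
# Sketch: run the input through all candidate grammars and see which
# ones accept it. The two "grammars" below are deliberately trivial.
def parse_python_like(s):
    if "\\" in s:
        raise ValueError("not Python-like")
    return ("python", s)

def parse_latex_like(s):
    if "\\" not in s:
        raise ValueError("not LaTeX-like")
    return ("latex", s)

GRAMMARS = [parse_python_like, parse_latex_like]

def dispatch(s):
    results = []
    for grammar in GRAMMARS:
        try:
            results.append(grammar(s))
        except ValueError:
            pass
    if not results:
        # case (1): no grammar matched; a real system would rank
        # grammars by closeness and pick the best guess
        raise ValueError("no grammar matched")
    if len(results) > 1:
        # case (2): check whether the parses agree, otherwise ask the user
        raise ValueError("ambiguous input")
    return results[0]

print(dispatch("1+2"))           # ('python', '1+2')
print(dispatch(r"\frac{1}{2}"))  # ('latex', '\\frac{1}{2}')
```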

> Since Gamma only deals with mathematical expressions (which is more limited
> than Wolfram|Alpha) I believe at least some basic English-like queries can
> be interpreted.
> ...
> Given how
> difficult it is, though, I guess just being able to interpret 2x, sin
> x, and integral of x^2 would be a nice step up in functionality.

Indeed, that's easy enough. You can always write a grammar that accepts
a subset of English.
Main points:
- Do not require parentheses for function parameters; a function call is
just: name {expr}
- Make name {expr} bind weaker than all operators, so sin x+y is
equivalent to sin (x+y).
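The two points above can be sketched as a tiny recursive-descent parser (entirely hypothetical; this is not SymPy's parser, and FUNCS is an assumed list of known function names):

```python
# Sketch: function application needs no parentheses and binds weaker
# than '+', so "sin x + y" parses as sin(x + y).
import re

FUNCS = {"sin", "cos", "tan"}  # assumed predefined function names

def tokenize(s):
    return re.findall(r"[A-Za-z_]\w*|\d+|[+*()]", s)

def parse(tokens):
    def expr(i):
        # expr := FUNC expr | sum  -- application is tried first, so it
        # swallows the whole rest of the expression as its argument
        if i < len(tokens) and tokens[i] in FUNCS:
            arg, j = expr(i + 1)
            return ("call", tokens[i], arg), j
        return add(i)

    def add(i):
        left, i = atom(i)
        while i < len(tokens) and tokens[i] == "+":
            right, i = atom(i + 1)
            left = ("+", left, right)
        return left, i

    def atom(i):
        return tokens[i], i + 1

    tree, _ = expr(0)
    return tree

print(parse(tokenize("sin x + y")))  # ('call', 'sin', ('+', 'x', 'y'))
```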

> I should've been more specific about that. I thought that
> natural language could help somewhat with the task, or at least point me
> towards algorithms and ideas, which is why I mentioned it.

That wouldn't have worked. Parsing natural language is really hard. And
the algorithms beyond parsing aren't related much to natural language.

Still, the parser engines from natural-language toolkits should be suitable.

Aaron Meurer

unread,
Sep 4, 2012, 2:18:48 PM
to sy...@googlegroups.com
Another thing you could look at is what should be done at the parsing
stage and what should be done after parsing. For example, "2 x",
"x y", and "tan x" all have the same syntax as far as the parser is
concerned (unless you want to put all predefined names in the grammar
itself), but the first two are implicit multiplication and the third
is implicit calling. So maybe those should be parsed to the same
object and then differentiated in software somehow. Then come
questions of how to interpret things like "tan x y" (tan(x)*y or
tan(x*y), or fail).

Another interesting example that I thought of is something like
sin^2(x) for sin(x)**2 (the former is common notation for this, and
indeed SymPy even pretty prints it that way). To parse the one like
the other would require changing the precedence order, as it normally
would be parsed as sin^(2(x)). So you might think of ways to make
that work, and whether those ways work at the parsing stage, the
post-parsing stage, or both.
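One hypothetical way to handle sin^2(x) at the post-tokenization stage is a token rewrite; this sketch uses a toy tokenizer and an assumed function list, not SymPy's actual machinery:

```python
# Sketch: rewrite the token pattern  NAME ^ NUM ( ... )
# into                                NAME ( ... ) ** NUM
# so that sin^2(x) ends up as sin(x)**2.
import re

def tokenize(s):
    return re.findall(r"[A-Za-z_]\w*|\d+|\*\*|[+^*()]", s)

FUNCS = {"sin", "cos", "tan"}  # assumed known function names

def rewrite_func_power(tokens):
    out, i = [], 0
    while i < len(tokens):
        if (tokens[i] in FUNCS and i + 3 < len(tokens)
                and tokens[i + 1] == "^" and tokens[i + 2].isdigit()
                and tokens[i + 3] == "("):
            # scan forward to the matching close paren
            depth, j = 0, i + 3
            while j < len(tokens):
                depth += tokens[j] == "("
                depth -= tokens[j] == ")"
                j += 1
                if depth == 0:
                    break
            # emit: NAME ( ... ) ** NUM
            out += [tokens[i]] + tokens[i + 3:j] + ["**", tokens[i + 2]]
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out

print(rewrite_func_power(tokenize("sin^2(x)")))  # ['sin', '(', 'x', ')', '**', '2']
```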

So what I would do is try things in order of easiest to hardest (and
natural language heuristics are one of the hardest), and stop working
when you either run out of time or feel that you've done enough. You
almost certainly won't get to do it all, but it's not clear just how
far you will get, so set yourself up to do as much as you can.

By the way, the standard library tokenize module is exactly the same
as the parser in SymPy, except we've extended ours to do some other
stuff (e.g., parse "x!" as factorial(x), wrap all undefined names in
Symbol, wrap all number literals in Integer or Float, etc.). So for
the parts that are just extending tokenize, you should put it there.
For the rest, it should go in the parsing module (another good thing
to think about by the way is a good way of organizing the parsing
code; that was discussed a little bit on that other thread).
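As a sketch of the kind of extension mentioned above, here is a hypothetical token-level version of the "x!" to factorial(x) rewrite (SymPy's actual implementation differs; the toy tokenizer and one-token-operand assumption are mine):

```python
# Sketch: postfix '!' becomes a factorial(...) call at the token level.
import re

def tokenize(s):
    return re.findall(r"[A-Za-z_]\w*|\d+|[!+*()]", s)

def rewrite_factorial(tokens):
    out = []
    for tok in tokens:
        if tok == "!":
            operand = out.pop()  # simplifying assumption: one-token operand
            out += ["factorial", "(", operand, ")"]
        else:
            out.append(tok)
    return out

print(rewrite_factorial(tokenize("x! + 3")))  # ['factorial', '(', 'x', ')', '+', '3']
```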

Aaron Meurer
> --
> You received this message because you are subscribed to the Google Groups
> "sympy" group.

David Li

unread,
Sep 4, 2012, 8:14:08 PM
to sy...@googlegroups.com
Okay, some bad news - this might not qualify as a science fair project since it doesn't really have an "experiment". My teacher will double-check, but he wasn't too sure. However, I would still like to pursue this project as it interests me.

On Tuesday, September 4, 2012 11:19:11 AM UTC-7, Aaron Meurer wrote:
Another thing you could look at is what should be done at the parsing
stage and what should be done after parsing.  For example, "2 x",
"x y", and "tan x" all have the same syntax as far as the parser is
concerned (unless you want to put all predefined names in the grammar
itself), but the first two are implicit multiplication and the third
is implicit calling.  So maybe those should be parsed to the same
object and then differentiated in software somehow. Then come
questions of how to interpret things like "tan x y" (tan(x)*y or
tan(x*y), or fail).

Yes, what I was thinking is that there would be a "whitespace expansion" step (probably after tokenization) that would convert statements like 2xy into "2 x y", tokenize again, and then differentiate between those syntaxes when constructing some sort of AST.

Another interesting example that I thought of is something like
sin^2(x) for sin(x)**2 (the former is common notation for this, and
indeed SymPy even pretty prints it that way).  To parse the one like
the other would require changing the precedence order, as it normally
would be parsed as sin^(2(x)).  So you might think of ways to make
that work, and whether those ways work at the parsing stage, the
post-parsing stage, or both.

Okay, so that's another thing to keep in mind - I'll have to compile a list of allowed syntactical elements sometime.

So what I would do is try things in order of easiest to hardest (and 
natural language heuristics are one of the hardest), and stop working
when you either run out of time or feel that you've done enough. You
almost certainly won't get to do it all, but it's not clear just how
far you will get, so set yourself up to do as much as you can.

By the way, the standard library tokenize module is exactly the same
as the parser in SymPy, except we've extended ours to do some other
stuff (e.g., parse "x!" as factorial(x), wrap all undefined names in
Symbol, wrap all number literals in Integer or Float, etc.).  So for
the parts that are just extending tokenize, you should put it there.
For the rest, it should go in the parsing module (another good thing
to think about by the way is a good way of organizing the parsing
code; that was discussed a little bit on that other thread).

Alright, I'll keep this in mind as I work on an API.

David Li

Chris Smith

unread,
Sep 4, 2012, 9:13:54 PM
to sy...@googlegroups.com
On Wed, Sep 5, 2012 at 5:59 AM, David Li <li.da...@gmail.com> wrote:
> Okay, some bad news - this might not qualify as a science fair project since
> it doesn't really have an "experiment". My teacher will double-check, but he
> wasn't too sure. However, I would still like to pursue this project as it
> interests me.

But one thing you can point out is that the scientific method (of a
sort) is applied: you have to try to identify the salient variables,
test to see that you've controlled them well, revise your theory when
you find logical errors, and you have record keeping (in the annotated
code and docstrings), etc.

Joachim Durchholz

unread,
Sep 4, 2012, 9:52:13 PM
to sy...@googlegroups.com
Am 05.09.2012 02:14, schrieb David Li:
> Yes, what I was thinking is that there would be a "whitespace expansion"
> step (probably after tokenization) that would convert statements like 2xy
> into "2 x y" and then tokenize again

Multiple tokenization steps are usually not worth it.
Make it so that there's a token boundary between 2 and xy.

Splitting xy would be one of those things that need to be syntax-dependent.
SymPy allows defining variable names, so if there's an "x" and a "y",
you can split, and if there's an "xy", you wouldn't want to split.
If there are all three of "x", "y", and "xy", you have an ambiguous
parse; report that as something that the user needs to decide (ambiguous
parses are a fact of life for ad-hoc grammars, and in fact we need to
deal with these for other reasons anyway).

If SymPy provides you with a variable named "xy", add a temporary
grammar rule
xy ::= x y
and make each letter a separate token. (Temporary grammar rules and
ambiguities are anathema in the parser generators used for programming
languages, but they are no problem in the parser generators used for
natural languages. Different parsing technology.)
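That syntax-dependent split might be sketched like this (split_name is a hypothetical helper; a real version would work against SymPy's actual symbol table and consider multi-way splits):

```python
# Sketch: split a name like "xy" only when its parts are known symbols
# and "xy" itself is not defined; report ambiguity instead of guessing.
def split_name(name, known):
    if name in known:
        return [name]  # "xy" is itself defined: keep it whole
    splits = []
    for i in range(1, len(name)):
        a, b = name[:i], name[i:]
        if a in known and b in known:
            splits.append([a, b])
    if len(splits) > 1:
        raise ValueError("ambiguous parse: ask the user")
    return splits[0] if splits else [name]

print(split_name("xy", {"x", "y"}))        # ['x', 'y']
print(split_name("xy", {"x", "y", "xy"}))  # ['xy']
```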

David Li

unread,
Sep 4, 2012, 10:08:05 PM
to sy...@googlegroups.com
Alright, that seems like a good approach. Actually, playing around with the parser, it already seems to parse (but won't evaluate) expressions like 2x: if I add a print statement to show the final list of tokens,

>>> sympy.parsing.sympy_parser.parse_expr("2 x y")
[(1, 'Integer'), (51, '('), (2, '2'), (51, ')'), (1, 'Symbol'), (51, '('), (1, "'x'"), (51, ')'), (1, 'Symbol'), (51, '('), (1, "'y'"), (51, ')'), (0, '')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sympy/parsing/sympy_parser.py", line 182, in parse_expr
    expr = eval(code, global_dict, local_dict) # take local objects in preference
  File "<string>", line 1
    Integer (2 )Symbol ('x' )Symbol ('y' )
                     ^
SyntaxError: invalid syntax

So it's already parsing it almost correctly; it just needs to recognize that the two are being multiplied (so a '*' needs to be inserted in there).
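That '*' insertion can be sketched with the stdlib tokenizer; insert_implicit_mul is a hypothetical helper, not SymPy's implementation:

```python
# Sketch: whenever two operand tokens (NAME or NUMBER) are adjacent,
# insert an explicit '*' between them.
import io
import token
import tokenize

OPERANDS = (token.NAME, token.NUMBER)

def insert_implicit_mul(source):
    toks = [t for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.type in (token.NAME, token.NUMBER, token.OP)]
    out, prev = [], None
    for t in toks:
        if prev is not None and prev.type in OPERANDS and t.type in OPERANDS:
            out.append("*")
        out.append(t.string)
        prev = t
    return " ".join(out)

print(insert_implicit_mul("2 x y"))  # 2 * x * y
```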

David

David Li

unread,
Sep 8, 2012, 7:43:52 PM
to sy...@googlegroups.com
Okay, so I've worked a bit on implicit multiplication and implicit function application for sympify. A demo of SymPy Gamma with the changes is at http://sympy-gamma-li.appspot.com/ (+ a visual overhaul, update to Python 2.7 runtime, new Django version). Expressions like '2x', 'ln x', and '5exp(x^2)' should work now.

The SymPy branch is at https://github.com/lidavidm/sympy/tree/sympify_implicit_mul_and_apply. I am still working on making sure the implicit application doesn't apply to None, True, False, and other constants, making sure I haven't broken anything/missed an edge case, and cleaning up the code. Also, I would like to add tests for the Python parser. In fact, I found a bug as I was writing this - (x+2)(x+3) doesn't get correctly parsed.

Implementation: in sympy_parser.py I simply loop over the tokens several times and apply a variety of transformations. I haven't benchmarked this to see how much of a performance impact the loops have. I also check for NAME tokens and split them up if they don't turn out to be in the global scope or something like that, so 'xy' gets parsed as 'x y'.
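The multi-pass design described above can be sketched like this (the transformation names and the hard-coded symbol table are hypothetical; the real code in sympy_parser.py differs in detail):

```python
# Sketch: each transformation is a function from a token list to a
# token list, applied in sequence over several passes.
def apply_transformations(tokens, transformations):
    for transform in transformations:
        tokens = transform(tokens)
    return tokens

def split_unknown_names(tokens, known=("x", "y")):
    # 'xy' -> 'x', 'y' when 'xy' is not a known name but its letters are
    out = []
    for tok in tokens:
        if tok.isalpha() and tok not in known and all(c in known for c in tok):
            out.extend(tok)
        else:
            out.append(tok)
    return out

def insert_mul(tokens):
    # put '*' between adjacent alphanumeric tokens
    out = []
    for tok in tokens:
        if out and out[-1][-1].isalnum() and tok[0].isalnum():
            out.append("*")
        out.append(tok)
    return out

print(apply_transformations(["2", "xy"], [split_unknown_names, insert_mul]))
# ['2', '*', 'x', '*', 'y']
```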

David Li

Aaron Meurer

unread,
Sep 9, 2012, 6:08:04 AM
to sy...@googlegroups.com
This is great.

Yes, so far it is quite buggy. sin(x) gives a NameError, and x + y
gives a pretty nasty error. Also you should think about good error
messages, because even if you fix these bugs, the parser will still be
heuristic, and so there will still be things that won't be recognized
as the user wants, either because it isn't implemented, or because it
is too ambiguous to attempt a guess.

Perhaps you could split out the new interface commits into a separate
branch and submit that as a pull request, because I think that much is
ready to go.

Aaron Meurer
> --
> You received this message because you are subscribed to the Google Groups
> "sympy" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/sympy/-/SBll6sBsNuoJ.

Aaron Meurer

unread,
Sep 9, 2012, 6:09:47 AM
to sy...@googlegroups.com
On Sun, Sep 9, 2012 at 4:08 AM, Aaron Meurer <asme...@gmail.com> wrote:
> This is great.
>
> Yes, so far it is quite buggy. sin(x) gives a NameError, and x + y
> gives a pretty nasty error. Also you should think about good error
> messages, because even if you fix these bugs, the parser will still be
> heuristic, and so there will still be things that won't be recognized
> as the user wants, either because it isn't implemented, or because it
> is too ambiguous to attempt a guess.
>
> Perhaps you could split out the new interface commits into a separate
> branch and submit that as a pull request, because I think that much is
> ready to go.

Ah, I guess they are already separate, because the parsing stuff is in
SymPy and the interface is in SymPy Gamma. Is there a branch with the
Gamma improvements?

Aaron Meurer

Joachim Durchholz

unread,
Sep 9, 2012, 6:58:16 AM
to sy...@googlegroups.com
Am 09.09.2012 01:43, schrieb David Li:
> Okay, so I've worked a bit on implicit multiplication and implicit function
> application for sympify. A demo of SymPy Gamma with the changes is at
> http://sympy-gamma-li.appspot.com/ (+ a visual overhaul, update to Python
> 2.7 runtime, new Django version). Expressions like '2x', 'ln x', and
> '5exp(x^2)' should work now.
>
> The SymPy branch is at
> https://github.com/lidavidm/sympy/tree/sympify_implicit_mul_and_apply. I am
> still working on making sure the implicit application doesn't apply to
> None, True, False, and other constants, making sure I haven't broken
> anything/missed an edge case, and cleaning up the code.

I looked at the code, and found I had no idea what patterns are being
recognized, and what parts of the recognized structures are transformed
to what new structures.
Sure, I could trace through the individual statements and find out what
each function does, but it's really hard to get a bird's-eye view to
work from; you have to build understanding from the details.

So I'd recommend documenting
- the patterns
- the transformations
Alternatively, I'd find it desirable to write the code in a form that
directly expresses the patterns and transformations. This may be far
outside the scope of this work though, so YMMV.
Still, on my personal priority list, having explicit mention of patterns
and transformations would be ahead of adding new transformations, just
so that more people can follow what's happening and give feedback.

Just my 3c :-)

David Li

unread,
Sep 9, 2012, 11:47:35 AM
to sy...@googlegroups.com
Alright, so for the SymPy changes, I'll document the code better. Also, I was planning on using namedtuples instead of the plain tuples I'm currently using, just to make it clearer what the data represents.

For Gamma, should I remove the notebook? It doesn't work anymore (I don't think it did when I first checked out the code, either) and I think it overlaps with SymPy Live.

As for the error with sin(x), I didn't import that function - that's why. I think the error with x+y is being caused by Gamma since it parses just fine in SymPy. I'll look into those and make a pull request.

David Li


Joachim Durchholz

unread,
Sep 9, 2012, 12:27:15 PM
to sy...@googlegroups.com
Am 09.09.2012 17:47, schrieb David Li:
> Also, I
> was planning on using namedtuples instead of the plain tuples that I
> currently am using just to make it clearer what the data represents.

+1