args invariant


Aaron Meurer

May 20, 2013, 4:25:16 PM
to sy...@googlegroups.com
I think Matthew wanted to delay this discussion until the end of the
summer, but I'd like to start it now while it is fresh on my mind. I
am currently writing the expression manipulation part of the new
tutorial (http://docs.sympy.org/tutorial/tutorial/manipulation.html),
and this issue is central to my discussion.

There has been some discussion from time to time about the invariants
that SymPy objects are supposed to follow. The invariants are
something like "all elements of an object's .args should be instances
of Basic" and "all Basic objects should be rebuildable from their
args, like obj.func(*obj.args) == obj".

The main discussion has been about the first one. Should we allow
non-Basic args? The common example is Symbol. Currently, Symbol('x')
has empty args. The proposal would make Symbol('x').args be ('x',).
The same for Integer. Integer(2).args is just (), but the proposal
would make it (int(2),).

But notice that the two invariants as I stated them above are
inconsistent, because Symbol('x') is not rebuildable from its args
unless its args are ('x',).

I have started a wiki page to gather my thoughts on this at
https://github.com/sympy/sympy/wiki/Args-Invariant. Basically, I
think there are two ways that we could go, which are called
option 1 and option 3 on the wiki page (option 2 is something I think
we should throw out immediately). The question boils down to what a
leaf in an expression tree is. The rest follows from that. The
options are

1. Any non-Basic
3. Any object with empty args

If we choose option 1, then the invariant becomes `obj.func(*obj.args)
== obj`. If we choose option 3, then the invariant becomes `obj.args
== () or obj.func(*obj.args) == obj`.
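The two candidate invariants can be made concrete with a small sketch. These are illustrative toy classes, not SymPy's real implementation; the point is only how the name-outside-args design interacts with rebuilding:

```python
# Toy stand-ins for SymPy classes, to illustrate the two invariants.
class Basic:
    def __init__(self, *args):
        self.args = args
    @property
    def func(self):
        return type(self)
    def __eq__(self, other):
        return type(other) is type(self) and self.args == other.args
    def __hash__(self):
        return hash((type(self), self.args))

class Add(Basic):
    pass

class Symbol(Basic):
    # Current (option 3 style) design: the name lives OUTSIDE .args.
    def __init__(self, name):
        self.name = name
        self.args = ()
    def __eq__(self, other):
        return type(other) is type(self) and self.name == other.name
    def __hash__(self):
        return hash((type(self), self.name))

x, y = Symbol('x'), Symbol('y')
expr = Add(x, y)

# Option 1 invariant: rebuildable from args. Holds for Add...
assert expr.func(*expr.args) == expr
# ...but not for this Symbol: Symbol() with no args would fail,
# which is exactly the inconsistency noted above. Option 3 therefore
# weakens the invariant to exempt empty-args leaves:
assert expr.args == () or expr.func(*expr.args) == expr
assert x.args == () or x.func(*x.args) == x
```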

Now, I don't think it's any secret that I prefer option 3, and that
some people in the community (namely Matthew and Stefan) have been
arguing for option 1, but I think we should work out all the
differences, and pick the best one.

So without making this email twice as long as it already is, I refer
you to the wiki page, where I've started an extensive comparison of
the two options. Feel free to edit it to add more pros/cons,
examples, or to fix the formatting.

I've also created a section of the wiki page on the bottom for
opinions. Please keep the main part to facts, and put your opinions at
the bottom. We can also discuss opinions here on the mailing list.

Aaron Meurer

Matthew Rocklin

May 21, 2013, 3:06:31 AM
to sy...@googlegroups.com
I've added my thoughts to the wiki.  I'm double posting them here.  These follow Aaron's opinion section which is definitely worth a read.

### Matthew

I mostly prefer 1

- Small point, but while it's fresh in your mind: `MatrixSymbol('x', 1, 2).args` can't be () easily.  This is because the expression could be `MatrixSymbol('x', n, k+m)`.  The last two arguments are the shape and in general are `Expr`s.  We'll need to traverse down these, so they'll either need to be args or we'll need to do a lot of special casing within `MatrixSymbol`.  A pull request exists to replace `'x'` with `Symbol('x')`.  This seems wrong to me because `'x'` here is a name, not a mathematical scalar.  The right way to do this under 3 would be to create a `String(Basic)` class.  This also feels wrong to me because we're being forced to reinvent a class with no new functionality.

- In general, I think that having all identifying information within args will simplify the code in the long term.  It also opens various performance improvements.  See the following timings from my laptop

```
In [2]: timeit Add(x, y, z)
100000 loops, best of 3: 2.15 us per loop

In [3]: timeit Add(x, y, z, evaluate=False)  # no idea why this is slower
100000 loops, best of 3: 4.14 us per loop

In [4]: timeit Basic.__new__(Add, x, y, z)
1000000 loops, best of 3: 616 ns per loop

In [5]: timeit (Add, x, y, z)
10000000 loops, best of 3: 96.7 ns per loop
```

Having all information in args turns the SymPy Basic into something like an s-expression (a glorified tuple).  Certainly we don't want to go this far (we like lots of things that the Basic object gives to us), but having that simplicity enables more advanced manipulations of SymPy expressions with greater trust in their accuracy.  If someone wrote code that required really intense expression manipulation they could switch into tuple mode (or list mode for mutability), do lots of manipulations quickly, then switch back and re-evaluate.  If all information is in args then they could perform these transformations with high fidelity.  I'm not proposing that this is the reason we should switch but rather that this is an example of something that is easy with an invariant like "All identifying information is in `.args`"
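The "tuple mode" round trip Matthew describes can be sketched with toy classes (illustrative only, not SymPy's actual classes; `to_tuple`/`from_tuple` are hypothetical helper names). It only works losslessly if all identifying information is in `.args`:

```python
# Hypothetical "tuple mode": under the option 1 invariant, every
# expression round-trips through plain tuples with full fidelity.
class Basic:
    def __init__(self, *args):
        self.args = args
    def __eq__(self, other):
        return type(other) is type(self) and self.args == other.args
    def __hash__(self):
        return hash((type(self), self.args))

class Symbol(Basic): pass   # option 1 style: Symbol('x').args == ('x',)
class Add(Basic): pass
class Mul(Basic): pass

def to_tuple(expr):
    """Basic -> (head, child, child, ...); non-Basic leaves pass through."""
    if not isinstance(expr, Basic):
        return expr
    return (type(expr),) + tuple(to_tuple(a) for a in expr.args)

def from_tuple(t):
    """Inverse of to_tuple: rebuild the expression from the head and args."""
    if not isinstance(t, tuple):
        return t
    head, args = t[0], t[1:]
    return head(*(from_tuple(a) for a in args))

expr = Add(Symbol('x'), Mul(Symbol('y'), Symbol('z')))
assert to_tuple(Symbol('x')) == (Symbol, 'x')
assert from_tuple(to_tuple(expr)) == expr   # lossless round trip
```

If the name were stored outside `.args` (as Symbol does today), `from_tuple` would need a special case per such class, which is Matthew's complaint.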

- Personally I manipulate SymPy expressions quite a bit.  I often do this from outside SymPy (Theano, LogPy, Strategies).  Having the "all identifying information is in `.args`" invariant makes SymPy significantly more interoperable if you don't have access to the codebase (or don't want to fuss with the codebase).  Classes like Symbol, Rational, AppliedPredicate all end up requiring special attention in each of these endeavors.  I'm mostly willing to put up with it because I'm familiar with them but potential collaborators are often turned off.  I believe that interoperability is more important than we realize for long term growth.

I believe that interoperability is the most important direction for SymPy right now (personal opinion).  I believe that the "all identifying information is in `.args`" invariant is essential for that goal.



Aaron Meurer

--
You received this message because you are subscribed to the Google Groups "sympy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
To post to this group, send email to sy...@googlegroups.com.
Visit this group at http://groups.google.com/group/sympy?hl=en-US.
For more options, visit https://groups.google.com/groups/opt_out.



Stefan Krastanov

May 21, 2013, 4:41:38 AM
to sy...@googlegroups.com
I will add this to the wiki later. My only issue with option 3 is when dealing with objects that need both an id/name/handle and some sympifiable information (like an unspecified number of dimensions). MatrixSymbol is a good example.

One way around this is to implement BasicString, but this will end in reimplementing a badly defined type system instead of using the one provided by Python. Or we can say that Symbol is our string type (which some time ago Aaron was in favor of, but Ronan was against).

Somewhat related, why should we use Integer(1) and not int(1)?

Chris Smith

May 21, 2013, 5:01:01 AM
to sympy
Somewhat related, why should we use Integer(1) and not int(1)?

These create little SymPy landmines if you have to remember which things give ints and which give Integers, e.g. Rational(2,3).q and multiplicity(2,12) are ints, so you have to be careful how you use them. If all integers were stored as ints then the consistency would help; but if sometimes a Rational can be produced then you can't do things like `if i.is_Integer`; you have to do `if type(i) is int` or `if isinstance(i, int)`.  SymPy can't anticipate what you are going to do with the integer-like properties/return values, but it can make your life a little easier by making them consistent.

Stefan Krastanov

May 21, 2013, 5:30:18 AM
to sy...@googlegroups.com
This might have been so before the abstract base classes and the numeric tower were implemented in 2.5 or 2.6, but now Python itself has a definite interface for checking whether any object is a complex number, real, rational, or integer.

I think the correct™ solution would be to use the built-in numeric tower and not create our own. A few months ago mpmath added basic support for that same numeric tower.

This is mostly the same issue mentioned above - we are implementing a type system (sometimes really poorly defined) instead of using what Python provides.

http://docs.python.org/2/library/numbers.html
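The numeric tower Stefan is pointing at can be used directly from the standard library; a quick demo of the interface he means (this uses only stdlib `numbers` and `fractions`):

```python
# The stdlib numeric tower: one isinstance check against an abstract
# base class covers int, Fraction, float, complex, and any third-party
# type (e.g. mpmath's) that registers with it.
import numbers
from fractions import Fraction

assert isinstance(3, numbers.Integral)
assert isinstance(Fraction(2, 3), numbers.Rational)
assert isinstance(0.5, numbers.Real)
assert isinstance(1 + 2j, numbers.Complex)

# Every Rational exposes .numerator/.denominator, so code written
# against the ABC works for int and Fraction alike:
assert (3).denominator == 1
assert Fraction(2, 3).denominator == 3
```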

PS: It was implemented in 2.6, so there are historical reasons behind our current class structure. So when we drop 2.5 we will have no reason to use Integer(1) instead of int(1), except the need to have only Basic in args.

I think this is an argument in favor of option 3 - the ability to reuse what python already provides, whether strings for names or base classes for numbers.


On 21 May 2013 11:01, Chris Smith <smi...@gmail.com> wrote:

Somewhat related, why should we use Integer(1) and not int(1)?

These create little SymPy landmines if you have to remember which things give ints and which give Integers, e.g. Rational(2,3).q and multiplicity(2,12) are ints, so you have to be careful how you use them. If all integers were stored as ints then the consistency would help; but if sometimes a Rational can be produced then you can't do things like `if i.is_Integer`; you have to do `if type(i) is int` or `if isinstance(i, int)`.  SymPy can't anticipate what you are going to do with the integer-like properties/return values, but it can make your life a little easier by making them consistent.

Chris Smith

May 21, 2013, 5:57:09 AM
to sympy
This might have been so before the abstract base classes and the numeric tower were implemented in 2.5 or 2.6, but now Python itself has a definite interface for checking whether any object is a complex number, real, rational, or integer.

But we support unlimited precision reals, too. If we went to only Python types then that would mean using something like Fraction or Decimal instead.  But when doing computations with those we might/would need to recast them as mpmath numbers.

Stefan Krastanov

May 21, 2013, 6:05:59 AM
to sy...@googlegroups.com
mpmath also supports abstract base classes. So we need just to test
`isinstance(obj, AppropriateAbstractBaseClass)`.

The idea is not to go for only python types, rather just use them where appropriate (int instead of Integer for instance, as Integer does not bring anything besides performance degradation).

There can be issues when multiplying python float and mpmath float, but these issues exist now as well. And it is mpmath's job to take care of them, not ours. For instance, the horrible mess for precision tracking in evalf has no place in sympy. It is something that can be done in mpmath.




On 21 May 2013 11:57, Chris Smith <smi...@gmail.com> wrote:
This might have been so before the abstract base classes and the numeric tower were implemented in 2.5 or 2.6, but now Python itself has a definite interface for checking whether any object is a complex number, real, rational, or integer.

But we support unlimited precision reals, too. If we went to only Python types then that would mean using something like Fraction or Decimal instead.  But when doing computations with those we might/would need to recast them as mpmath numbers.


Chris Smith

May 21, 2013, 6:12:33 AM
to sympy
There can be issues when multiplying python float and mpmath float, but these issues exist now as well. And it is mpmath's job to take care of them, not ours. For instance, the horrible mess for precision tracking in evalf has no place in sympy. It is something that can be done in mpmath. 

That has been my impression as well. I wonder if someone familiar with the evalf routine can comment on how it grew to be somewhat independent of mpmath, e.g. `add_terms`.

Ronan Lamy

May 21, 2013, 6:13:31 AM
to sy...@googlegroups.com
2013/5/21 Stefan Krastanov <krastano...@gmail.com>

mpmath also supports abstract base classes. So we need just to test
`isinstance(obj, AppropriateAbstractBaseClass)`.

The idea is not to go for only python types, rather just use them where appropriate (int instead of Integer for instance, as Integer does not bring anything besides performance degradation).

Integer brings something very important: the ability for 1/2 to return a Rational, instead of 0.5 or 0.
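Ronan's point can be illustrated with the stdlib alone (a sketch: `Fraction` stands in here for what `sympy.Integer`'s division gives you):

```python
# Plain int division loses exactness (0 in Python 2, a float in
# Python 3), while a rational type keeps the result exact.
from fractions import Fraction

assert 1 / 2 == 0.5                       # Python 3: a binary float
assert Fraction(1) / 2 == Fraction(1, 2)  # stays an exact rational
assert 0.1 + 0.2 != 0.3                   # binary floats are inexact...
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)  # ...rationals aren't
```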


There can be issues when multiplying python float and mpmath float, but these issues exist now as well. And it is mpmath's job to take care of them, not ours. For instance, the horrible mess for precision tracking in evalf has no place in sympy. It is something that can be done in mpmath.

mpmath is a bit dead, in no small part because of SymPy's decision to use a static copy of it instead of treating it like a proper upstream.

Matthew Rocklin

May 21, 2013, 6:29:10 AM
to sy...@googlegroups.com

There can be issues when multiplying python float and mpmath float, but these issues exist now as well. And it is mpmath's job to take care of them, not ours. For instance, the horrible mess for precision tracking in evalf has no place in sympy. It is something that can be done in mpmath.

mpmath is a bit dead, in no small part because of SymPy's decision to use a static copy of it instead of treating it like a proper upstream.

<tangent>
This is a different conversation but I'd be in favor of reopening the dependency question.  Probably this should happen after the current conversation reaches some sort of conclusion.  It should also likely happen on a different thread.  If anyone agrees and starts such a thread at the appropriate time I'd love to participate.
</tangent>

And now back to our regularly scheduled discussion:
"Go 1!"

Stefan Krastanov

May 21, 2013, 6:45:20 AM
to sy...@googlegroups.com
Correction to one of my previous mails:


> I think this is an argument in favor of option 3 - the ability to reuse what python already provides, whether strings for names or base classes for numbers.

This should have been "option 1".



Ondřej Čertík

May 21, 2013, 11:49:35 AM
to sympy
I think the cause and effect is the other way round: even if mpmath
were very much alive, the way we use it in sympy would have pretty
much zero effect on it. So I don't think we hurt its development,
especially since we try to report bugs to it etc.

Ondrej

Aaron Meurer

May 21, 2013, 12:15:32 PM
to sy...@googlegroups.com
Thanks for responding. I think most of what I want to say is in
response to Matthew, but let me just mention that I agree with Ronan.
Integer is very different from int. Aside from the obvious difference
with division, every attribute and method of the classes it inherits
from is available on it, giving it a standard interface. This is
fundamental to this discussion. Basic subclasses are Liskov
substitutable. I can write expr.args or expr.atoms or expr.xreplace
without worrying if expr is an Add or a Symbol or an Integer.
Division is a special case of this. This simply is not true for
strings and ints.

By the way, Ronan, which option do you prefer? Or is there another
even better option that only you can see?

On Tue, May 21, 2013 at 1:06 AM, Matthew Rocklin <mroc...@gmail.com> wrote:
> I've added my thoughts to the wiki. I'm double posting them here. These
> follow Aaron's opinion section which is definitely worth a read.
>
> ### Matthew
>
> I mostly prefer 1
>
> - Small point but while it's fresh in your mind, `MatrixSymbol('x', 1,
> 2).args` can't be () easily. This is because the expression could be
> `MatrixSymbol('x', n, k+m)`. The second two arguments are the shape and in
> general are `Expr`s. We'll need to traverse down these so they'll either
> need to be args or we'll need to do a lot of special casing within
> `MatrixSymbol`. A pull request exists to replace `'x'` with `Symbol('x')`.
> This seems wrong to me because `'x'` here is a name, not a mathematical
> scalar. The right way to do this under 3 would be to create a
> `String(Basic)`. This also feels wrong to me because we're being forced to
> reinvent a class with no new functionality.

OK, you are right about this. MatrixSymbol clearly cannot be a leaf.

Ronan is right when he says that Symbol is not a replacement for str.
The two are unrelated. You can't do Symbol('xyz')[:1] or Symbol('
').join(['a', 'b', 'c']), and Symbol('x') + Symbol('y') != 'x' + 'y'.
But I think it doesn't matter. We don't need a full-fledged string
anywhere that we use Symbol (for those rare cases where we do some
string processing, we just use Symbol.name). The only aspect of
strings that we use in Symbols
(aside from the fact that they are names) is equality, and for that,
they agree: Symbol(a) == Symbol(b) iff a == b.

So my recommendation here goes back to using Symbol as the first
argument of MatrixSymbol. Also, the class should clearly not subclass
from Symbol in this case (I don't remember if it already does).

Perhaps a mathematician's point of view is worth considering here
(this is my point of view, when I am in my graduate student shoes).
When we have a mathematical object like MatrixSymbol that is indexed
by another expression, in this case, the rows and columns, it is often
convenient to consider that object as a function. In this case, a
matrix symbol is a function from NxN -> M, where M is the set of all
matrices over whatever (complex numbers).

This functorial relationship for us means that those are children in
the expression tree. Now, I'm no logician, but to me, it makes sense
to consider a matrix symbol also as a function on the alphabet
(strings). The name we give to a matrix symbol is inconsequential to
what it is, at least mathematically.

Anyway, that analogy may or may not be completely off base, so take it
or leave it.

>
> - In general, I think that having all identifying information within args
> will simplify the code in the long term.

This is where I think you are completely wrong. As I noted in my
first bullet point of my opinions, it seems like it will do this,
because the invariant becomes simpler. But as I hope I demonstrated in
the three fundamental expression recursion algorithms, it will make
things much more complicated. The reason is that we have to check if
things are instances of Basic everywhere, and handle them differently.
Non-Basics don't have Basic interface (in particular, .args, but
really anything that we might want to do). We can't even be sure that
we can do math operations on non-Basics, even if they are Python
builtin numbers, because of the int/int problem (not to mention the
floating point precision issues).

If we go with option 1, what I see happening all throughout SymPy is
code being nested under "if isinstance(expr, Basic)" blocks. Very
little useful code will actually make sense on non-Basics; most of it
will fail with AttributeError or some other exception (if you don't
believe me, remove "arg = sympify(arg)" from the top of any algorithm
and try passing 0 to it).
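Aaron's concern can be sketched concretely with toy classes (stand-ins, not SymPy's real implementation): under option 1, every recursive algorithm needs an isinstance guard, because raw strings and ints have no `.args`.

```python
# Option 1 style: Symbol('x').args == ('x',), so the raw string 'x'
# is a tree node. Every traversal must guard against non-Basics.
class Basic:
    def __init__(self, *args):
        self.args = args

class Symbol(Basic): pass
class Add(Basic): pass

def count_nodes(expr):
    if not isinstance(expr, Basic):   # the guard Aaron objects to,
        return 1                      # needed at every recursion level
    return 1 + sum(count_nodes(a) for a in expr.args)

expr = Add(Symbol('x'), Symbol('y'))
# Add, two Symbols, plus the two raw strings 'x' and 'y':
assert count_nodes(expr) == 5
```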

> It also opens various performance
> improvements. See the following timings from my laptop
>
> ```
> In [2]: timeit Add(x, y, z)
> 100000 loops, best of 3: 2.15 us per loop
>
> In [3]: timeit Add(x, y, z, evaluate=False) # no idea why this is slower
> 100000 loops, best of 3: 4.14 us per loop
>
> In [4]: timeit Basic.__new__(Add, x, y, z)
> 1000000 loops, best of 3: 616 ns per loop
>
> In [5]: timeit (Add, x, y, z)
> 10000000 loops, best of 3: 96.7 ns per loop
> ```

You mentioned this before, and I didn't get it then either. How does
this have anything to do with what the leaves of the expression tree
are?

>
> Having all information in args turns the SymPy Basic into something like an
> s-expression (a glorified tuple). Certainly we don't want to go this far
> (we like lots of things that the Basic object gives to us), but having that
> simplicity enables more advanced manipulations of SymPy expressions with
> greater trust in their accuracy. If someone wrote code that required really
> intense expression manipulation they could switch into tuple mode (or list
> mode for mutability), do lots of manipulations quickly, then switch back and
> re-evaluate. If all information is in args then they could perform these
> transformations with high fidelity. I'm not proposing that this is the
> reason we should switch but rather that this is an example of something that
> is easy with an invariant like "All identifying information is in `.args`"

I claim that this is just as easy (in fact easier, because of the
isinstance(expr, Basic) issue) with the option 3 invariant. You just
treat objects with empty args as leaves. These are the symbolic atoms
from your lisp metaphor.

At the end of the day, *something* has to be a symbolic atom. You
can't recurse forever, until you hit the ones and zeros of the machine
representation and wonder how you can split those down into smaller
entities. I don't see why it has to be a non-Basic type.
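The option 3 counterpart of the same traversal, again with toy classes (illustrative only): because leaves are Basic objects with empty args, the recursion needs no type checks at all.

```python
# Option 3 style: identifying info (the name) lives outside .args,
# and an empty .args marks a symbolic atom.
class Basic:
    def __init__(self, *args):
        self.args = args

class Symbol(Basic):
    def __init__(self, name):
        self.name = name      # head-level data, not a child node
        self.args = ()        # empty args: this is a leaf

class Add(Basic): pass

def count_nodes(expr):
    # No isinstance guard: everything in the tree honors the Basic
    # interface, and recursion bottoms out naturally at empty args.
    return 1 + sum(count_nodes(a) for a in expr.args)

expr = Add(Symbol('x'), Symbol('y'))
assert count_nodes(expr) == 3   # Add plus two Symbol atoms
```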

>
> - Personally I manipulate SymPy expressions quite a bit. I often do this
> from outside SymPy (Theano, LogPy, Strategies). Having the "all identifying
> information is in `.args`" invariant makes SymPy significantly more
> interoperable if you don't have access to the codebase (or don't want to
> fuss with the codebase).

I don't understand what you mean by "access to the codebase". How are
you doing anything with any SymPy classes if you can't access them?

> Classes like Symbol, Rational, AppliedPredicate
> all end up requiring special attention in each of these endeavors. I'm
> mostly willing to put up with it because I'm familiar with them but
> potential collaborators are often turned off. I believe that
> interoperability is more important than we realize for long term growth.

Well I'd love to hear testimonials from these collaborators. I think
it's important for us as core contributors who are very familiar with
SymPy and its idioms to see how people who are new to it view things.
That will show us what things we are doing that need to be documented
better.

I suspect these people are falling into the trap of seeing the
simplicity of the option 1 invariant over the option 3 invariant,
without noticing that it really leads to more complicated code.

Aaron Meurer

Stefan Krastanov

May 21, 2013, 12:27:31 PM
to sy...@googlegroups.com
Concerning using Symbol where a name is necessary, I just wanted to know how far I should take the idea.

For instance in the category module last year there was a lot of discussion about the basic objects, whether they should be symbols or not, etc. If we were refactoring this module right now, what should we be using for the basic objects?

Symbol, or a thin wrapper around Symbol, or something else...

Aaron Meurer

May 21, 2013, 1:55:39 PM
to sy...@googlegroups.com
What were the specific examples?

Aaron Meurer

Stefan Krastanov

May 22, 2013, 5:15:37 AM
to sy...@googlegroups.com
If I remember correctly one of the examples was just a named object. Not a number, not something that can have meaningful assumptions on it, nor something that needs any new methods. Not a complex number which is what Symbol represents usually.

Ronan Lamy

May 22, 2013, 6:11:59 AM
to sy...@googlegroups.com
2013/5/21 Aaron Meurer <asme...@gmail.com>

Thanks for responding. I think most of what I want to say is in
response to Matthew, but let me just mention that I agree with Ronan.
Integer is very different from int. Aside from the obvious difference
with division, every attribute and method of the classes it inherits
from is available on it, giving it a standard interface. This is
fundamental to this discussion. Basic subclasses are Liskov
substitutable.  I can write expr.args or expr.atoms or expr.xreplace
without worrying if expr is an Add or a Symbol or an Integer.
Division is a special case of this. This simply is not true for
strings and ints.

By the way, Ronan, which option do you prefer? Or is there another
even better option that only you can see?

Going back to the basics of the structure of expression trees, I think that we are conflating things that should be considered separate. The elements of .args are child nodes, but .name attributes and the like do not behave at all like nodes of the expression tree. We should consider the latter as attributes of the head node so that the structure of MatrixSymbol('A', m, n) conceptually looks like:

* MatrixSymbolHead('A')
|
+- Symbol('m')
+- Symbol('n')

However, I'm not sure that this MatrixSymbolHead('A') should actually exist as a Python object. If it does, then MatrixSymbol('A', m, n).func should be equal to it. Otherwise, the fundamental invariant should rather be written as 'obj == obj.rebuild(*obj.args)', and the MatrixSymbolHead thing would only exist implicitly in the implementation of MatrixSymbol.rebuild().
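Ronan's `obj == obj.rebuild(*obj.args)` variant can be sketched with toy classes (a hypothetical illustration; SymPy's real MatrixSymbol differs): the name is attached to the head node, and `rebuild` closes over it, so the "MatrixSymbolHead" exists only implicitly.

```python
# Toy sketch of the rebuild() invariant: the name is head-level data,
# only the shape expressions are child nodes.
class Symbol:
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return type(other) is type(self) and self.name == other.name
    def __hash__(self):
        return hash(self.name)

class MatrixSymbol:
    def __init__(self, name, rows, cols):
        self.name = name              # attribute of the head node
        self.args = (rows, cols)      # the shape: real child nodes
    def rebuild(self, *args):
        # The implicit 'MatrixSymbolHead': the name rides along.
        return MatrixSymbol(self.name, *args)
    def __eq__(self, other):
        return (type(other) is type(self) and self.name == other.name
                and self.args == other.args)

m, n = Symbol('m'), Symbol('n')
A = MatrixSymbol('A', m, n)
assert A.rebuild(*A.args) == A    # the fundamental invariant holds
assert A.args == (m, n)           # traversal sees only the shape
```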

Stefan Krastanov

May 22, 2013, 6:19:46 AM
to sy...@googlegroups.com
Ronan, with your suggestion, what if for some reason I want `MatrixSymbol('A', 2, 2)` and `MatrixSymbol('A', 3, 3)`? What is the way to approach this? Maybe it is just bad style to create different "named+additional info" objects with the same name, but I do not see a clear reason why not.

And back to the voting. While I like 1 (nonBasic args) for aesthetic reasons, if it gets clearer how to implement named objects I would be equally happy with 3 (empty args).

Ronan Lamy

May 22, 2013, 6:41:33 AM
to sy...@googlegroups.com
2013/5/22 Stefan Krastanov <krastano...@gmail.com>

Ronan, with your suggestion, what if for some reason I want `MatrixSymbol('A', 2, 2)` and `MatrixSymbol('A', 3, 3)`?

Then you just create the 2 objects. I don't see why it would cause any problem besides the printing confusion.
 

Matthew Rocklin

May 23, 2013, 7:00:10 AM
to sy...@googlegroups.com
Sorry,  I was traveling and away from the computer for a while.  
 
So my recommendation here goes back to using Symbol as the first
argument of MatrixSymbol. Also, the class should clearly not subclass
from Symbol in this case (I don't remember if it already does).

MatrixSymbol does not subclass Symbol (although it used to).  In general MatrixExprs stay far away from Exprs.

Symbols, however, are Exprs, and the name is not.  If we want all args to be Basic then I think the clean solution is to make a very simple String class.  It doesn't have to do much, just hold a str, be a leaf, and print correctly.  Much like Integer.  It shouldn't do lots of other things like Symbol does.
 
>
> - In general, I think that having all identifying information within args
> will simplify the code in the long term.

This is where I think you are completely wrong.  As I noted in my
first bullet point of my opinions, it seems like it will do this,
because the invariant becomes simpler. But as I hope I demonstrated in
the three fundamental expression recursion algorithms, it will make
things much more complicated. The reason is that we have to check if
things are instances of Basic everywhere, and handle them differently.
 Non-Basics don't have Basic interface (in particular, .args, but
really anything that we might want to do).  We can't even be sure that
we can do math operations on non-Basics, even if they are Python
builtin numbers, because of the int/int problem (not to mention the
floating point precision issues).

Yes, you need to guard against these, and yes, I think this is extra work in the short term.  Long term, though, I think that the number of functions that traverse the tree can be decreased dramatically, resulting in a smaller codebase.  @smichr has used bottom_up in simplify/simplify.py a number of times to good effect.

I'm not as sure of my claim though as you are of yours, I don't have nearly as much experience with the core as you do.  Probably you're right and I just don't have the experience to see it.
 
If we go with option 1, what I see happening all throughout SymPy is
code being nested under "if isinstance(expr, Basic)" blocks. Very
little useful code will actually make sense on non-Basics; most of it
will fail with AttributeError or some other exception (if you don't
believe me, remove "arg = sympify(arg)" from the top of any algorithm
and try passing 0 to it).

> It also opens various performance
> improvements.  See the following timings from my laptop
>
> ```
> In [2]: timeit Add(x, y, z)
> 100000 loops, best of 3: 2.15 us per loop
>
> In [3]: timeit Add(x, y, z, evaluate=False)  # no idea why this is slower
> 100000 loops, best of 3: 4.14 us per loop
>
> In [4]: timeit Basic.__new__(Add, x, y, z)
> 1000000 loops, best of 3: 616 ns per loop
>
> In [5]: timeit (Add, x, y, z)
> 10000000 loops, best of 3: 96.7 ns per loop
> ```

You mentioned this before, and I didn't get it then either. How does
this have anything to do with what the leaves of the expression tree
are?

My main point is that simpler and more consistent data structures enable wilder and potentially helpful transformations.  My guess is that a system that switched into tuple-mode would, right now, be very difficult.  I think we should work on this.

Maybe my perspective would be more clear with an example.  Here is some example code to teach LogPy how to interact with SymPy.
 
>
> Having all information in args turns the SymPy Basic into something like an
> s-expression (a glorified tuple).  Certainly we don't want to go this far
> (we like lots of things that the Basic object gives to us), but having that
> simplicity enables more advanced manipulations of SymPy expressions with
> greater trust in their accuracy.  If someone wrote code that required really
> intense expression manipulation they could switch into tuple mode (or list
> mode for mutability), do lots of manipulations quickly, then switch back and
> re-evaluate.  If all information is in args then they could perform these
> transformations with high fidelity.  I'm not proposing that this is the
> reason we should switch but rather that this is an example of something that
> is easy with an invariant like "All identifying information is in `.args`"

I claim that this is just as easy (in fact easier, because of the
isinstance(expr, Basic) issue) with the option 3 invariant. You just
treat objects with empty args as leaves.  These are the symbolic atoms
from your lisp metaphor.

At the end of the day, *something* has to be a symbolic atom.  You
can't recurse forever, until you hit the ones and zeros of the machine
representation and wonder how you can split those down into smaller
entities. I don't see why it has to be a non-Basic type.

OK, I can see something like this working.  I suppose I would need to change the `_as_tuple/_from_tuple` interface above to also include an `_is_leaf` method.  I believe that there is a cost to this larger interface, but maybe the cost is worth it.

Even with this change I'm not confident that SymPy follows this convention.  Things like AppliedPredicate screw things up.
 
>
> - Personally I manipulate SymPy expressions quite a bit.  I often do this
> from outside SymPy (Theano, LogPy, Strategies).  Having the "all identifying
> information is in `.args`" invariant makes SymPy significantly more
> interoperable if you don't have access to the codebase (or don't want to
> fuss with the codebase).

I don't understand what you mean by "access to the codebase". How are
you doing anything with any SymPy classes if you can't access them?

By access to the codebase I meant write access.  More specifically I probably mean familiarity with and ability to change/adapt SymPy.  Knowing that Rational has attributes p and q is very hard for non-core developers.  I think that removing these obstacles encourages use by external projects.
 
> Classes like Symbol, Rational, AppliedPredicate
> all end up requiring special attention in each of these endeavors.  I'm
> mostly willing to put up with it because I'm familiar with them but
> potential collaborators are often turned off.  I believe that
> interoperability is more important than we realize for long term growth.

Well I'd love to hear testimonials from these collaborators.  I think
it's important for us as core contributors who are very familiar with
SymPy and its idioms to see how people who are new to it view things.
That will show us what things we are doing that need to be documented
better.

What comes to mind is the theano_sympy project by Frederic and me at SciPy 2012.  We started it, and I finished the SymPy->Theano part, but the reverse direction was never completed.  This could be for many reasons, but I think it's reasonable to suppose that part of it was that the SymPy data structure is not as clean as it could be.

I suspect these people are falling into the trap of seeing the
simplicity of the option 1 invariant over the option 3 invariant,
without noticing that it really leads to more complicated code.

I am clearly one of those nutheads.  From their perspective it's a reasonable assumption to make.  You see the type, you see .args, there's clearly a pattern, then you run into Rational and get stuck.
 
In general I can see a solution with 3 working.  I think that to be consistent it will require the construction of lots of trivial classes (or maybe just a generic LeafContainer class).  I still think that 1 is a better way to go but I'm less confident than I was.
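
A generic LeafContainer along those lines could be as small as this (a sketch; the name and interface are hypothetical):

```python
# Sketch of the generic LeafContainer idea (the name and interface are
# hypothetical): wrap any non-Basic payload in an object with empty args
# so that option 3's "empty args means leaf" rule holds uniformly.

class LeafContainer:
    args = ()                           # always a leaf under option 3

    def __init__(self, payload):
        self.payload = payload

    def __eq__(self, other):
        return (type(other) is LeafContainer
                and self.payload == other.payload)

    def __hash__(self):
        return hash((LeafContainer, self.payload))

assert LeafContainer('x') == LeafContainer('x')
assert LeafContainer('x').args == ()
```

Anything non-Basic gets wrapped once, and every traversal can then treat it like any other leaf.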

Stefan Krastanov

unread,
May 23, 2013, 7:42:32 AM5/23/13
to sy...@googlegroups.com
>> I suspect these people are falling into the trap of seeing the
>> simplicity of the option 1 invariant over the option 3 invariant,
>> without noticing that it really leads to more complicated code.
>
> I am clearly one of those nutheads. From their perspective it's a reasonable assumption to make. You see the type, you see .args, there's clearly a pattern, then you run into Rational and get stuck.
>

and

> Long term though I think that the number of functions that traverse the tree can be decreased dramatically, resulting in a smaller codebase. @smichr has used bottom_up in simplify/simplify.py a number of times to good effect.

I share Matthew's opinion on these two issues ("what and why is more
complex"). The code might be marginally more complex (not so if we
actually abstract tree traversal in one single place), but the data
structure will definitely be simpler if we choose option 1 (non-Basic
args). Especially so given that option 3 (empty args) requires the
creation of Dict (instead of frozendict), Tuple (instead of tuple),
Symbol/String (instead of basestring). Just consider that for
generalization of matrices we will need nested Tuple objects (probably
slow and voluminous) instead of nested built-in tuples.
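
A rough look at the "voluminous" worry, using a minimal stand-in class (not SymPy's actual Tuple): wrapping a row costs at least one extra heap object per row compared with a built-in tuple.

```python
# A minimal Basic-style wrapper (stand-in, not SymPy's Tuple) to show
# the per-row overhead of wrapping built-in tuples in objects.
import sys

class Tuple:
    def __init__(self, *args):
        self.args = args

row = (1, 2, 3)
wrapped = Tuple(1, 2, 3)
# the wrapper object plus its inner tuple always outweigh the bare tuple
assert sys.getsizeof(wrapped) + sys.getsizeof(wrapped.args) > sys.getsizeof(row)
```

(`sys.getsizeof` only counts the top-level objects here, so this understates the real cost for nested structures.)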

I can see myself enjoying work with option 3 (in any case it is many times
better than the current lack of standard), but for the above reasons I
slightly prefer 1, __mostly because of aesthetics__.

Maybe at some point we would need a BDFL type of decision from Aaron
or Ondřej :)

Aaron Meurer

unread,
May 25, 2013, 5:38:34 PM5/25/13
to sy...@googlegroups.com
OK, sorry for not following this for a few days. I was busy moving to Austin.

A few points:

- I like Ronan's idea of putting the info in the func. This is the
whole reason that we use expr.func instead of type(expr), so that we
can potentially make non-class head objects. This idea also opens up
a lot of possibilities with
https://code.google.com/p/sympy/issues/detail?id=1688.

One comment on Ronan's reply, though: I don't see the point of
rebuild(). How is that different from func.__call__? What are the
obstacles to making the head an object (or just using metaclasses,
though that can cause issues with things like pickling)?

- Regarding Symbol being Expr, this is orthogonal to this discussion.
I agree it is an issue for using it as a name, and that Symbol really
should have a more generic Basic superclass (this comes up with using
Symbol for booleans already, see
https://code.google.com/p/sympy/issues/detail?id=1887#c26).

- I agree that we should reuse common traversal patterns as named
functions. Quite a few traversal algorithms can probably be rewritten
using bottom_up, pre/postorder_traversal, atoms, xreplace, and so on.
But even so, traversal is pretty easy, and sometimes it's simpler to
just write a custom function. Also, if you want to perform some
custom action while traversing, it can be more efficient to do the
traversal once (e.g., using atoms + xreplace traverses the tree
twice).
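
For example, on a plain nested-tuple tree (a toy stand-in for SymPy expressions, not SymPy code), a fused substitution does in one pass what a "collect atoms, then xreplace" pipeline does in two:

```python
# A toy nested-tuple tree standing in for SymPy expressions: a fused
# substitution visits each node once, where an "atoms then xreplace"
# pipeline would walk the tree twice.

def replace_leaves(tree, mapping):
    if isinstance(tree, tuple):
        head, *rest = tree
        return (head, *(replace_leaves(t, mapping) for t in rest))
    return mapping.get(tree, tree)      # single pass: substitute on sight

tree = ('Add', 'x', ('Mul', 'x', 'y'))
assert replace_leaves(tree, {'x': 'z'}) == ('Add', 'z', ('Mul', 'z', 'y'))
```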

- I think you are also bringing another issue into the mix here, which
I had not even considered to be up for debate, and that's equality
checking. Right now, objects are supposed to define == structurally,
but they are free to do magic. An example is Dummy, which uses == to
make itself distinct from other objects. I think even with option 1,
this magic is still relevant (we just make the func compare
differently). What Matthew wants is not only for non-Basic args, but
really for __eq__ to always be defined as self.func == other.func and
self.args == other.args. This also prevents us from doing things like
making Lambda(x, x**2) == Lambda(y, y**2) (also another discussion,
but it's worth keeping this ability I think).
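
As a toy model of that kind of magic (not SymPy's Lambda), here is an equality that compares bodies up to renaming of the bound variable:

```python
# A toy model (not SymPy's Lambda) of equality "magic": compare bodies
# up to renaming of the bound variable, so lambdas that differ only in
# the name of their variable are equal.

def _rename(tree, old, new):
    if isinstance(tree, tuple):
        return tuple(_rename(t, old, new) for t in tree)
    return new if tree == old else tree

class Lambda:
    def __init__(self, var, body):
        self.var, self.body = var, body

    def __eq__(self, other):
        if not isinstance(other, Lambda):
            return NotImplemented
        # normalize both bound variables to one fixed placeholder name
        return (_rename(self.body, self.var, '_0')
                == _rename(other.body, other.var, '_0'))

assert Lambda('x', ('Pow', 'x', 2)) == Lambda('y', ('Pow', 'y', 2))
assert Lambda('x', ('Pow', 'x', 2)) != Lambda('y', ('Pow', 'y', 3))
```

Under a rigid "equality is func plus args" rule, this kind of definition is ruled out.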

> Even with this change I'm not confident that SymPy follows this convention. Things like AppliedPredicate screw things up.

SymPy doesn't follow any conventions consistently. That's one of the
reasons we have this discussion, so we can decide what the convention
should even be. All the skipped tests in test_args represent a
breaking of the Basic args convention. It doesn't even test for
rebuildability: if you add that, even more tests fail.

I like option 3 because it allows us to take advantage of native
Python class abilities. In a language like lisp, you only have lists,
so it is natural to structure things as simply as possible with lists.
But in Python, you can override equality testing. You can make
classes that return arbitrary other classes in their constructors. You
can define objects that act like classes (using metaclasses or just
__call__, depending on your purposes). The coding is harder for us,
for sure, but if we provide consistent high-level interfaces like
.func and .args, I think it is OK, and it lets us solve problems in a
simpler way than if we were more restricted.

Aaron Meurer

Matthew Rocklin

unread,
May 26, 2013, 9:50:29 AM5/26/13
to sy...@googlegroups.com
On Sat, May 25, 2013 at 4:38 PM, Aaron Meurer <asme...@gmail.com> wrote:
OK, sorry for not following this for a few days. I was busy moving to Austin.

A few points:

- I like Ronan's idea of putting the info in the func.  This is the
whole reason that we use expr.func instead of type(expr), so that we
can potentially make non-class head objects.   This idea also opens up
a lot of possibilities with
https://code.google.com/p/sympy/issues/detail?id=1688.

I also like this.  The operation (head) can have identifying information.  I'm quite comfortable with "All identifying information is present in .args and .op/.head."  Really I'm comfortable with "All identifying information is present in a predefined set of attributes" and "We can reconstruct an object from a predefined set of information"

My understanding of this idea is that Symbol.name would access information in Symbol.op/head/func.  What is the structure of the op/head data structure?  Another tuple? a dict? arbitrary python object?
 
One comment on Ronan's reply, though: I don't see the point of
rebuild().  How is that different from func.__call__?  What are the
obstacles to making the head an object (or just using metaclasses,
though that can cause issues with things like pickling)?

Perhaps this is the ._from_args method that exists in some Expr classes?
 
- Regarding Symbol being Expr, this is orthogonal to this discussion.
I agree it is an issue for using it as a name, and that Symbol really
should have a more generic Basic superclass (this comes up with using
Symbol for booleans already, see
https://code.google.com/p/sympy/issues/detail?id=1887#c26).

- I agree that we should reuse common traversal patterns as named
functions.  Quite a few traversal algorithms can probably be rewritten
using bottom_up, pre/postorder_traversal, atoms, xreplace, and so on.
But even so, traversal is pretty easy, and sometimes it's simpler to
just write a custom function.  Also, if you want to perform some
custom action while traversing, it can be more efficient to do the
traversal once (e.g., using atoms + xreplace traverses the tree
twice).

The problem with custom traversal is that it interweaves math code with traversal code.  If I later decide that a traversal should be top_down instead of bottom_up (or something weirder) I shouldn't need to understand the mathematical code to make this change.  There are lots of parts of SymPy where only a few people are qualified to make changes, even if those changes have little to do with the domain.  By restricting ourselves to standard traversal functionality we engage much more of the developer population (and potentially automation) across more of the codebase.  My solution to this is to make many small functions that only operate locally and then call traversal functions, like bottom_up, on them.
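
Concretely, the pattern I have in mind looks like this (a sketch in the spirit of, but not identical to, SymPy's bottom_up helper): the traversal is written once, and the math lives in a tiny local rule:

```python
# Generic traversal written once; the mathematical rewrite is a tiny
# local function that knows nothing about how the tree is walked.

def bottom_up(tree, rule):
    if isinstance(tree, tuple):
        head, *rest = tree
        tree = (head, *(bottom_up(t, rule) for t in rest))
    return rule(tree)                   # children first, then this node

def collapse_double_neg(node):
    # purely local rule: Neg(Neg(x)) -> x
    if isinstance(node, tuple) and node[0] == 'Neg':
        inner = node[1]
        if isinstance(inner, tuple) and inner[0] == 'Neg':
            return inner[1]
    return node

expr = ('Add', ('Neg', ('Neg', 'x')), 'y')
assert bottom_up(expr, collapse_double_neg) == ('Add', 'x', 'y')
```

Switching to a top_down variant then touches no mathematical code at all.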
 
> Even with this change I'm not confident that SymPy follows this convention.  Things like AppliedPredicate screw things up.

SymPy doesn't follow any conventions consistently. That's one of the
reasons we have this discussion, so we can decide what the convention
should even be.  All the skipped tests in test_args represent a
breaking of the Basic args convention. It doesn't even test for
rebuildability: if you add that, even more tests fail.

I like option 3 because it allows us to take advantage of native
Python class abilities.  In a language like lisp, you only have lists,
so it is natural to structure things as simply as possible with lists.
 But in Python, you can override equality testing.  You can make
classes that return arbitrary other classes in their constructors. You
can define objects that act like classes (using metaclasses or just
__call__, depending on your purposes). The coding is harder for us,
for sure, but if we provide consistent high-level interfaces like
.func and .args, I think it is OK, and it lets us solve problems in a
simpler way than if we were more restricted.

I think that it's important to keep things as simple as is meaningful.  I think that custom data structures limit interoperation.  I think that interoperation should be a high priority for SymPy's broader impact.  Clearly this can be taken too far.  I think that Tuples/s-expressions unnecessarily interweave the operation and the arguments.  I.e. in (op, arg, arg, arg) op and the args are on the same footing; I think that this goes too far.  Still, I don't think we need anything that is substantially more complex than this.

Something like obj.op and obj.args makes sense to me.  

The Python language does offer us a lot of options and I think that a lot of these options, like syntax overloading, are awesome.  I think I agree (though not with high certainty) that a consistent interface is sufficient, even with a complex data structure.  This requires us to be pretty strict about exactly how objects implement that interface though; are you confident that we can achieve this?  In general I prefer simple data structures because it forces a simple interface and because I think that many of the fancier features aren't particularly helpful.  Perhaps I just haven't run into these use cases.

Aaron Meurer

unread,
Jun 18, 2013, 9:30:46 PM6/18/13
to sy...@googlegroups.com
Well this discussion got buried, but I still do care about it. We
should talk at SciPy too (if we have time between all the other stuff
we should talk about).

On Sun, May 26, 2013 at 8:50 AM, Matthew Rocklin <mroc...@gmail.com> wrote:
>
>
>
> On Sat, May 25, 2013 at 4:38 PM, Aaron Meurer <asme...@gmail.com> wrote:
>>
>> OK, sorry for not following this for a few days. I was busy moving to
>> Austin.
>>
>> A few points:
>>
>> - I like Ronan's idea of putting the info in the func. This is the
>> whole reason that we use expr.func instead of type(expr), so that we
>> can potentially make non-class head objects. This idea also opens up
>> a lot of possiblities with
>> https://code.google.com/p/sympy/issues/detail?id=1688.
>
>
> I also like this. The operation (head) can have identifying information.
> I'm quite comfortable with "All identifying information is present in .args
> and .op/.head." Really I'm comfortable with "All identifying information is
> present in a predefined set of attributes" and "We can reconstruct an object
> from a predefined set of information"

I am also liking this idea. I think we can get away with a lot of
interesting magic by overriding .func. We might even be able to solve
the issue 1941 problem with it (though that may be too optimistic).

>
> My understanding of this idea is that Symbol.name would access information
> in Symbol.op/head/func. What is the structure of the op/head data
> structure? Another tuple? a dict? arbitrary python object?

According to the invariant, any callable which can recreate the
object (remember that right now it is just type(obj)).
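
A sketch of how that could look for a Symbol-like leaf (hypothetical classes, not a concrete SymPy proposal): the head is an object carrying the name, so the rebuild invariant holds even with empty args:

```python
# Hypothetical classes illustrating "the head carries identifying info":
# the func of a leaf is a callable object holding the name, so
# obj.func(*obj.args) == obj holds even though obj.args is empty.

class SymHead:
    def __init__(self, name):
        self.name = name

    def __call__(self):                 # rebuild the leaf from no args
        return SymLeaf(self.name)

    def __eq__(self, other):
        return isinstance(other, SymHead) and self.name == other.name

class SymLeaf:
    def __init__(self, name):
        self.name = name
        self.args = ()                  # still a leaf under option 3

    @property
    def func(self):
        return SymHead(self.name)       # not type(self): a richer head

    def __eq__(self, other):
        return isinstance(other, SymLeaf) and self.name == other.name

x = SymLeaf('x')
assert x.func(*x.args) == x             # rebuildable despite empty args
assert x.func == SymLeaf('x').func      # heads compare by name
```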

>
>>
>> One comment on Ronan's reply, though: I don't see the point of
>> rebuild(). How is that different from func.__call__? What are the
>> obstacles to making the head an object (or just using metaclasses,
>> though that can cause issues with things like pickling)?
>
>
> Perhaps this is the ._from_args method that exists in some Expr classes?

You mean _from_rawargs or whatever? I think that's just a hack to
avoid going through the constructor again when you know it won't have
anything to do.

>
>>
>> - Regarding Symbol being Expr, this is orthogonal to this discussion.
>> I agree it is an issue for using it as a name, and that Symbol really
>> should have a more generic Basic superclass (this comes up with using
>> Symbol for booleans already, see
>> https://code.google.com/p/sympy/issues/detail?id=1887#c26).
>>
>> - I agree that we should reuse common traversal patterns as named
>> functions. Quite a few traversal algorithms can probably be rewritten
>> using bottom_up, pre/postorder_traversal, atoms, xreplace, and so on.
>> But even so, traversal is pretty easy, and sometimes it's simpler to
>> just write a custom function. Also, if you want to perform some
>> custom action while traversing, it can be more efficient to do the
>> traversal once (e.g., using atoms + xreplace traverses the tree
>> twice).
>
>
> The problem with custom traversal is that it interweaves math code with
> traversal code. If I later decide that a traversal should be top_down
> instead of bottom_up (or something weirder) I shouldn't need to understand
> the mathematical code to perform this change. There are lots of parts in
> SymPy where only a few people are qualified to make changes even if those
> changes have little to do with the domain. By restricting yourself to
> standard traversal functionality we engage much more of the developer
> population (and potentially automation) across more of the codebase. My
> solution to this is to make many small functions that only operate locally
> and then call traversal functions, like bottom_up, on them.

Sure, but it still ought to be technically possible to only do the
traversal once, I think.

You raise a good point, though.
There are always the oft-forgotten M-expressions.

>
> Something like obj.op and obj.args makes sense to me.
>
> The Python language does offer us a lot of options and I think that a lot of
> these options, like syntax overloading, are awesome. I think I agree
> (though not with high certainty) that a consistent interface is sufficient,
> even with a complex data structure. This requires us to be pretty strict
> about exactly how objects implement that interface though; are you confident
> that we can achieve this? In general I prefer simple data structures
> because it forces a simple interface and because I think that many of the
> fancier features aren't particularly helpful. Perhaps I just haven't run
> into these use cases.

Well, for example, with using .func instead of type(obj), it lets us
separate the two concepts. It usually isn't necessary, but, for
example, it lets us put a lot more identifying information on the
object without having to literally change the Python type.

Aaron Meurer