Re: [sympy] Recursive collect() (code included)

Chris Smith

unread,

Aug 17, 2012, 10:18:49 AM8/17/12

to sy...@googlegroups.com

There is an rcollect function that you might look at.

Juha Jeronen

unread,

Aug 28, 2012, 7:07:17 AM8/28/12

to sy...@googlegroups.com

On Friday, August 17, 2012 5:18:49 PM UTC+3, smichr wrote:

There is an rcollect function that you might look at.

Thanks, I had missed this. However, I tried it now and unless I'm doing something wrong, it's not being recursive in the way I meant:

import sympy as sy
s = sy.sympify("a*b*c*d + b*c*d + c*d + d")
sy.rcollect(s, "d", "c", "b", "a")

gives

d*(a*b*c + b*c + c + 1)

whereas I'd like to get, and my code (symutil.recursive_collect("a*b*c*d + b*c*d + c*d + d")) produces

d*(1 + c*(1 + b*(1 + a)))

(There was a mistake in my original message where I said "a * b * c * (1 + d) ", which obviously is not equivalent with the original. That was quickly written by hand out of laziness, and well, I got it wrong. The new, correct expression is from actual program output.)

Looking at the source confirms that these two ideas of recursive collection are different:

- SymPy's current rcollect() works by processing anything that matches is_Add inside the expression tree. Hence, it is recursive in the sense that a sum operation at any level in the expression tree will be collected locally. Algorithmically, rcollect() uses a depth-first approach.

- My code works by collecting at the top level *first*, and only then recurses into the args of the collected expr (doing the same top-level collection for them, and recursing, etc.). Atoms are passed through as-is. At each level, the current operation is finally rewritten using the processed args. This extracts the nested multiplications in the example above, because a successful collect() will produce a Mul. Algorithmically, recursive_collect() uses a breadth-first approach.

A further difference is that recursive_collect() comes with an (optional) automatic symbol list generation mechanism. If no "syms" are given by the user, the "syms" list is automatically generated at each level in a way such that the operation count of the subexpression is likely to be minimized when it is collected.

E.g. when given the expanded expression (please excuse the symbol naming; parts of this were automatically generated)

-A*phi_x*u_x - G*phi_y*u_y - __Ux__1__*__Ux__1___x*phi*rho*u_x - __Ux__1__*__Uy__1___x*phi*rho*u_y - __Ux__1___y*__Uy__1__*phi*rho*u_x - __Uy__1__*__Uy__1___y*phi*rho*u_y - 2*__Ux__1__*__Uy__1__*phi*rho*u_xy - phi*rho*u_xx*__Ux__1__**2 - phi*rho*u_yy*__Uy__1__**2

my code using the automatic symbol list produces

phi*rho*(__Ux__1__*(-__Ux__1___x*u_x - __Uy__1___x*u_y - 2*__Uy__1__*u_xy) + __Uy__1__*(-__Ux__1___y*u_x - __Uy__1___y*u_y) - u_xx*__Ux__1__**2 - u_yy*__Uy__1__**2) - A*phi_x*u_x - G*phi_y*u_y

which is (by an eyeball metric) at least near optimal regarding operation count (at least when collecting is the only allowed simplification).

In conclusion, rcollect() and my suggested code do different things, so I would still like to request consideration for whether the new functionality would be useful in standard SymPy.

Why I personally think this would be a useful feature, is that the user may only want to collect an expression to produce a reduction in operation count, without being interested in which symbols should make up syms/vars and in which order. Also, a global "optimal" list of syms/vars for this purpose may not even exist: different subexpressions may be more advantageous to collect using a different ordering of the symbols. The automatic syms generation takes care of this. It is not intended to replace manually provided syms/vars, but to complement it.

And finally, about rcollect():

- Is there a reason why the call syntax differs from collect()? collect() takes in a list of syms, while rcollect() uses Python's *args mechanism for the same (and calls them "vars", instead of "syms" like collect()).

- Suggestion for documentation: rcollect() could be mentioned in the documentation for collect() to make it easier to find. At least I find it a bit confusing that many of the other simplifiers come with a "deep" flag, but for collect() the recursive version is a separate function. (Of course, technically there is the distinction of descending into function arguments vs. recursing into sums, so the difference may be justified.)

Chris Smith

unread,

Aug 28, 2012, 8:07:17 AM8/28/12

to sy...@googlegroups.com

Thanks for the detailed explanation. I don't recall if you have a pull
request for this or not but I can look at it again in a day or so.
(I'm working on the combinatorics right now.)

/c

Juha Jeronen

unread,

Aug 28, 2012, 8:29:24 AM8/28/12

to sy...@googlegroups.com

On 28/08/12 15:07, Chris Smith wrote:
> Thanks for the detailed explanation. I don't recall if you have a pull
> request for this or not but I can look at it again in a day or so.

No, I haven't made a pull request. I thought to ask for opinions first :)

(It's hardly a proper patch at the moment, with the code living in a
custom add-on module defined in my FEM framework project. I can of
course clean it up and suggest a patch for SymPy's simplify module, once
I hear your thoughts on the matter.)

> (I'm working on the combinatorics right now.)

(Ok.)

Aaron Meurer

unread,

Aug 28, 2012, 1:45:56 PM8/28/12

to sy...@googlegroups.com

On Aug 28, 2012, at 5:07 AM, Juha Jeronen <juha.j...@jyu.fi> wrote:

On Friday, August 17, 2012 5:18:49 PM UTC+3, smichr wrote:
There is an rcollect function that you might look at.

Thanks, I had missed this. However, I tried it now and unless I'm doing something wrong, it's not being recursive in the way I meant:

import sympy as sy
s = sy.sympify("a*b*c*d + b*c*d + c*d + d")
sy.rcollect(s, "d", "c", "b", "a")

gives

d*(a*b*c + b*c + c + 1)

whereas I'd like to get, and my code (symutil.recursive_collect("a*b*c*d + b*c*d + c*d + d")) produces

d*(1 + c*(1 + b*(1 + a)))

(There was a mistake in my original message where I said "a * b * c * (1 + d) ", which obviously is not equivalent with the original. That was quickly written by hand out of laziness, and well, I got it wrong. The new, correct expression is from actual program output.)

Looking at the source confirms that these two ideas of recursive collection are different:

- SymPy's current rcollect() works by processing anything that matches is_Add inside the expression tree. Hence, it is recursive in the sense that a sum operation at any level in the expression tree will be collected locally. Algorithmically, rcollect() uses a depth-first approach.

- My code works by collecting at the top level *first*, and only then recurses into the args of the collected expr (doing the same top-level collection for them, and recursing, etc.). Atoms are passed through as-is. At each level, the current operation is finally rewritten using the processed args. This extracts the nested multiplications in the example above, because a successful collect() will produce a Mul. Algorithmically, recursive_collect() uses a breadth-first approach.

A further difference is that recursive_collect() comes with an (optional) automatic symbol list generation mechanism. If no "syms" are given by the user, the "syms" list is automatically generated at each level in a way such that the operation count of the subexpression is likely to be minimized when it is collected.

E.g. when given the expanded expression (please excuse the symbol naming; parts of this were automatically generated)

-A*phi_x*u_x - G*phi_y*u_y - __Ux__1__*__Ux__1___x*phi*rho*u_x - __Ux__1__*__Uy__1___x*phi*rho*u_y - __Ux__1___y*__Uy__1__*phi*rho*u_x - __Uy__1__*__Uy__1___y*phi*rho*u_y - 2*__Ux__1__*__Uy__1__*phi*rho*u_xy - phi*rho*u_xx*__Ux__1__**2 - phi*rho*u_yy*__Uy__1__**2

my code using the automatic symbol list produces

phi*rho*(__Ux__1__*(-__Ux__1___x*u_x - __Uy__1___x*u_y - 2*__Uy__1__*u_xy) + __Uy__1__*(-__Ux__1___y*u_x - __Uy__1___y*u_y) - u_xx*__Ux__1__**2 - u_yy*__Uy__1__**2) - A*phi_x*u_x - G*phi_y*u_y

which is (by an eyeball metric) at least near optimal regarding operation count (at least when collecting is the only allowed simplification).

In conclusion, rcollect() and my suggested code do different things, so I would still like to request consideration for whether the new functionality would be useful in standard SymPy.

Based on this, I'd say the functions would be useful. Any thoughts on where it should go? Maybe collect (or rcollect) with no symbols given would use this.

Why I personally think this would be a useful feature, is that the user may only want to collect an expression to produce a reduction in operation count, without being interested in which symbols should make up syms/vars and in which order. Also, a global "optimal" list of syms/vars for this purpose may not even exist: different subexpressions may be more advantageous to collect using a different ordering of the symbols. The automatic syms generation takes care of this. It is not intended to replace manually provided syms/vars, but to complement it.

And finally, about rcollect():

- Is there a reason why the call syntax differs from collect()? collect() takes in a list of syms, while rcollect() uses Python's *args mechanism for the same (and calls them "vars", instead of "syms" like collect()).

- Suggestion for documentation: rcollect() could be mentioned in the documentation for collect() to make it easier to find. At least I find it a bit confusing that many of the other simplifiers come with a "deep" flag, but for collect() the recursive version is a separate function. (Of course, technically there is the distinction of descending into function arguments vs. recursing into sums, so the difference may be justified.)

These are valid points. We try to be consistent with our APIs, but unfortunately sometimes people either don't know or forget the conventions when contributing code.

Perhaps we should start a little API style guide somewhere for those things that have already become conventions in the SymPy API, like use of the word "deep".

--
You received this message because you are subscribed to the Google Groups "sympy" group.
To view this discussion on the web visit https://groups.google.com/d/msg/sympy/-/GsXjNHijbrsJ.
To post to this group, send email to sy...@googlegroups.com.
To unsubscribe from this group, send email to sympy+un...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/sympy?hl=en.

Juha Jeronen

unread,

Aug 29, 2012, 3:36:22 AM8/29/12

to sy...@googlegroups.com

On 28/08/12 20:45, Aaron Meurer wrote:

On Aug 28, 2012, at 5:07 AM, Juha Jeronen <juha.j...@jyu.fi> wrote:

---8<---[snip]---8<---

In conclusion, rcollect() and my suggested code do different things, so I would still like to request consideration for whether the new functionality would be useful in standard SymPy.

Based on this, I'd say the functions would be useful. Any thoughts on where it should go? Maybe collect (or rcollect) with no symbols given would use this.

Thanks for the input.

I think the functionality should be split into two parts:

1) The automatic symbol list generation. This could be used as a default for collect() and rcollect(), if no symbols are given.

In the case of rcollect(), the list should probably be generated at each level of recursion separately to achieve optimal collection (consider e.g. "c*a*(b*x + b*y) + c*b*(a*x + a*y)").

I think the symbol list generator could also be exposed in the API, as there might be some unanticipated uses for it.

2) The two ways of recursive collection: current rcollect() vs. the suggested code. There may be cases where either of them is preferable, so I think both are needed.

In my opinion, the most logical place for this part of the new functionality would be rcollect(). Should we add a flag for choosing the mode? But which mode should be the default? (This is not exactly "deep" vs. "non-deep", but different recursion strategies.)

Or should we add a new API function, keeping rcollect() as-is? This solution doesn't seem very clean.

And finally, about rcollect():

- Is there a reason why the call syntax differs from collect()? collect() takes in a list of syms, while rcollect() uses Python's *args mechanism for the same (and calls them "vars", instead of "syms" like collect()).

- Suggestion for documentation: rcollect() could be mentioned in the documentation for collect() to make it easier to find. At least I find it a bit confusing that many of the other simplifiers come with a "deep" flag, but for collect() the recursive version is a separate function. (Of course, technically there is the distinction of descending into function arguments vs. recursing into sums, so the difference may be justified.)

These are valid points. We try to be consistent with our APIs, but unfortunately sometimes people either don't know or forget the conventions when contributing code.

I see. Well, that's understandable :)

Perhaps we should start a little API style guide somewhere for those things that have already become conventions in the SymPy API, like use of the word "deep".

Hmm, this sounds a good idea.

-J

Aaron Meurer

unread,

Aug 29, 2012, 7:43:24 PM8/29/12

to sy...@googlegroups.com

On Wed, Aug 29, 2012 at 1:36 AM, Juha Jeronen <juha.j...@jyu.fi> wrote:
> On 28/08/12 20:45, Aaron Meurer wrote:
>
> On Aug 28, 2012, at 5:07 AM, Juha Jeronen <juha.j...@jyu.fi> wrote:
>
> ---8<---[snip]---8<---
>
>
> In conclusion, rcollect() and my suggested code do different things, so I
> would still like to request consideration for whether the new functionality
> would be useful in standard SymPy.
>
>
> Based on this, I'd say the functions would be useful. Any thoughts on where
> it should go? Maybe collect (or rcollect) with no symbols given would use
> this.
>
>
> Thanks for the input.
>
> I think the functionality should be split into two parts:
>
>
> 1) The automatic symbol list generation. This could be used as a default for
> collect() and rcollect(), if no symbols are given.
>
> In the case of rcollect(), the list should probably be generated at each
> level of recursion separately to achieve optimal collection (consider e.g.
> "c*a*(b*x + b*y) + c*b*(a*x + a*y)").
>
> I think the symbol list generator could also be exposed in the API, as there
> might be some unanticipated uses for it.

Is this different from .free_symbols?

>
>
> 2) The two ways of recursive collection: current rcollect() vs. the
> suggested code. There may be cases where either of them is preferable, so I
> think both are needed.
>
> In my opinion, the most logical place for this part of the new functionality
> would be rcollect(). Should we add a flag for choosing the mode? But which
> mode should be the default? (This is not exactly "deep" vs. "non-deep", but
> different recursion strategies.)
>
> Or should we add a new API function, keeping rcollect() as-is? This solution
> doesn't seem very clean.

Well, I actually would like to have just one function, collect(), but
that would require changing the API. So I would try to find a good
name for each strategy, and use method="name" in rcollect(). As to
what should be the default, I don't know.

And as I said, the automatic symbols can happen if no symbols are given.

>
>
>
>
> And finally, about rcollect():
>
> - Is there a reason why the call syntax differs from collect()? collect()
> takes in a list of syms, while rcollect() uses Python's *args mechanism for
> the same (and calls them "vars", instead of "syms" like collect()).
>
> - Suggestion for documentation: rcollect() could be mentioned in the
> documentation for collect() to make it easier to find. At least I find it a
> bit confusing that many of the other simplifiers come with a "deep" flag,
> but for collect() the recursive version is a separate function. (Of course,
> technically there is the distinction of descending into function arguments
> vs. recursing into sums, so the difference may be justified.)
>
>
> These are valid points. We try to be consistent with our APIs, but
> unfortunately sometimes people either don't know or forget the conventions
> when contributing code.
>
>
> I see. Well, that's understandable :)

By the way, it's not so bad that one is called vars and the other
syms, but the thing that is bad is that one is *vars and the other is
syms. So one takes a list and the other doesn't.

Aaron Meurer

>
>
>
> Perhaps we should start a little API style guide somewhere for those things
> that have already become conventions in the SymPy API, like use of the word
> "deep".
>
>
> Hmm, this sounds a good idea.
>
>
> -J
>

> --
> You received this message because you are subscribed to the Google Groups
> "sympy" group.

Juha Jeronen

unread,

Aug 31, 2012, 2:54:33 AM8/31/12

to sy...@googlegroups.com

Hi,

On 30/08/12 02:43, Aaron Meurer wrote:
> On Wed, Aug 29, 2012 at 1:36 AM, Juha Jeronen <juha.j...@jyu.fi> wrote:
>> I think the functionality should be split into two parts:
>>
>>
>> 1) The automatic symbol list generation. This could be used as a default for
>> collect() and rcollect(), if no symbols are given.
>>
>> In the case of rcollect(), the list should probably be generated at each
>> level of recursion separately to achieve optimal collection (consider e.g.
>> "c*a*(b*x + b*y) + c*b*(a*x + a*y)").
>>
>> I think the symbol list generator could also be exposed in the API, as there
>> might be some unanticipated uses for it.
> Is this different from .free_symbols?

Yes, it produces a particular ordering, which is critical for how
collect() works.

I originally used .atoms() (should probably have used .free_symbols in
0.7.x), but then wrote this custom function to get the ordering that was
needed.

Primary sort is descending by #occurrences, secondary by how near the
top of the expression tree the topmost occurrence was seen.

The idea behind this is heuristic. Loosely speaking, on average it
should pay off better to collect first by those symbols which have more
occurrences, since there being so many of each of them, they have the
most potential for reduction.

E.g. returning to the trivial example,

sy.collect( "a*b*c*d + b*c*d + c*d + d", ["d", "c", "b", "a"] )
=> d*(a*b*c + b*c + c + 1)

but

sy.collect( "a*b*c*d + b*c*d + c*d + d", ["a", "b", "c", "d"] )
=> a*b*c*d + b*c*d + c*d + d

(i.e. does nothing) since the original expression is already collected
w.r.t. "a".

So, it's critical that in this particular example, the ordering is d, c,
b, a.

The secondary sort criterion was added to handle expressions like

Uin*(1 + u0_x)*(u0/(__uvmag__1__))

where the optimal collection is to do nothing. There is one of each
symbol here, but still, what happens depends on the order in which they
are given to collect(). If we start with u0_x, that's not good...

In this case, since the expression is not in expanded form, it is
critical to pick the collection order in such a way that symbols nearer
the top of the expression tree are chosen first (choosing a "deeper"
symbol first causes unnecessary expansion).

Of course, in many cases the input to recursive_collect() will be in
expanded form, but my opinion is that it would be nice to do something
semi-sensible also when it isn't.

(I.e. warn about it in the docstring, somewhat like the warning for
simplify(): robust behaviour is only guaranteed for expanded
expressions, but when robustness is not a requirement, the input doesn't
need to be in expanded form.)

>> 2) The two ways of recursive collection: current rcollect() vs. the
>> suggested code. There may be cases where either of them is preferable, so I
>> think both are needed.
>>
>> In my opinion, the most logical place for this part of the new functionality
>> would be rcollect(). Should we add a flag for choosing the mode? But which
>> mode should be the default? (This is not exactly "deep" vs. "non-deep", but
>> different recursion strategies.)
>>
>> Or should we add a new API function, keeping rcollect() as-is? This solution
>> doesn't seem very clean.
> Well, I actually would like to have just one function, collect(), but
> that would require changing the API. So I would try to find a good
> name for each strategy, and use method="name" in rcollect(). As to
> what should be the default, I don't know.

If we don't want to break the API, I think the current method of
rcollect() should be the default (since otherwise the behaviour of the
function would change).

Maybe

method="depth-first" vs. method="breadth-first"?

Difficult to type, especially breat...breadh...aaaaahhh! How about

method="sums" vs. method="all"? (Additionally, method="top" could be
top-level-only, i.e. the current behaviour of collect().)

Something else?

In addition to this, we could add a flag to the existing collect(),
which would make it behave as rcollect(). Default would be off to avoid
breaking the API. This way, the user could use the current rcollect()
API or the new extended collect() API.

As an added bonus, this would be consistent with other simplifiers,
which have exactly this behaviour: same function for recursive and
non-recursive, with "deep" off by default.

E.g. deep=True, and if this was enabled, the kwarg "method" would become
available (just like the suggested change for rcollect), and collect()
would basically call rcollect(), passing through the value for "method".
Then rcollect() could internally use collect() in the non-recursive form.

(Or maybe split collect() into a pure API function and an internal
worker function to avoid the circular-looking use relationship between
collect() and rcollect() in the suggested solution. But that depends on
which you think is clearer; I'm fine with either option.)

> And as I said, the automatic symbols can happen if no symbols are given.

Ok.

>> And finally, about rcollect():
>>
>> - Is there a reason why the call syntax differs from collect()? collect()
>> takes in a list of syms, while rcollect() uses Python's *args mechanism for
>> the same (and calls them "vars", instead of "syms" like collect()).
>>
>> - Suggestion for documentation: rcollect() could be mentioned in the
>> documentation for collect() to make it easier to find. At least I find it a
>> bit confusing that many of the other simplifiers come with a "deep" flag,
>> but for collect() the recursive version is a separate function. (Of course,
>> technically there is the distinction of descending into function arguments
>> vs. recursing into sums, so the difference may be justified.)
>>
>>
>> These are valid points. We try to be consistent with our APIs, but
>> unfortunately sometimes people either don't know or forget the conventions
>> when contributing code.
>>
>>
>> I see. Well, that's understandable :)
> By the way, it's not so bad that one is called vars and the other
> syms, but the thing that is bad is that one is *vars and the other is
> syms. So one takes a list and the other doesn't.

Mm, yes. I did try to pass in a list to rcollect() at first (as in "ah,
just prefix the function name with an r"), until reading the
documentation and noticing the star :)

Consistent types are of course the most important thing in this regard,
but I think consistent parameter naming would be a nice bonus. It would
allow the user-programmer to learn the API faster, if the same kind of
thing always had the same name.

If the names are different, at least I tend to become wary that there is
probably some subtle difference between what kind of symbols are
considered "vars" vs. "syms". Only after looking at the source I could
be sure they were the same thing :)

-J

Aaron Meurer

unread,

Aug 31, 2012, 10:52:04 AM8/31/12

to sy...@googlegroups.com

If it will make it cleaner, we should consider breaking the API, with
a deprecation on the old one if possible.

>
> Maybe
>
> method="depth-first" vs. method="breadth-first"?
>
> Difficult to type, especially breat...breadh...aaaaahhh! How about

bfs and dfs are common abbreviations for these.

>
> method="sums" vs. method="all"? (Additionally, method="top" could be
> top-level-only, i.e. the current behaviour of collect().)

method="all" sounds like "apply all the methods".

Top-level only is exactly what deep=False usually does.

>
> Something else?
>
>
> In addition to this, we could add a flag to the existing collect(),
> which would make it behave as rcollect(). Default would be off to avoid
> breaking the API. This way, the user could use the current rcollect()
> API or the new extended collect() API.
>
> As an added bonus, this would be consistent with other simplifiers,
> which have exactly this behaviour: same function for recursive and
> non-recursive, with "deep" off by default.

In my experience, deep is on by default. Also in my opinion in most
cases it should be, as this is what you want most of the time.

>
> E.g. deep=True, and if this was enabled, the kwarg "method" would become
> available (just like the suggested change for rcollect), and collect()
> would basically call rcollect(), passing through the value for "method".
> Then rcollect() could internally use collect() in the non-recursive form.
>
> (Or maybe split collect() into a pure API function and an internal
> worker function to avoid the circular-looking use relationship between
> collect() and rcollect() in the suggested solution. But that depends on
> which you think is clearer; I'm fine with either option.)

Maybe collect(deep=True) can call rcollect, which we can also
deprecate. Then, when the deprecation period ends, we just remove
rcollect from __init__.py.

Agreed. This can be changed throughout SymPy without an API break (in
other words, patches welcome).

Aaron Meurer

>
> If the names are different, at least I tend to become wary that there is
> probably some subtle difference between what kind of symbols are
> considered "vars" vs. "syms". Only after looking at the source I could
> be sure they were the same thing :)
>
>
> -J
>

Juha Jeronen

unread,

Sep 3, 2012, 3:37:34 AM9/3/12

to sy...@googlegroups.com

On 31/08/12 17:52, Aaron Meurer wrote:
> On Aug 31, 2012, at 12:54 AM, Juha Jeronen <juha.j...@jyu.fi> wrote:
>
>> Hi,
>>
>> On 30/08/12 02:43, Aaron Meurer wrote:
>>> On Wed, Aug 29, 2012 at 1:36 AM, Juha Jeronen <juha.j...@jyu.fi> wrote:
>>>> Or should we add a new API function, keeping rcollect() as-is? This solution
>>>> doesn't seem very clean.
>>> Well, I actually would like to have just one function, collect(), but
>>> that would require changing the API. So I would try to find a good
>>> name for each strategy, and use method="name" in rcollect(). As to
>>> what should be the default, I don't know.
>> If we don't want to break the API, I think the current method of
>> rcollect() should be the default (since otherwise the behaviour of the
>> function would change).
> If it will make it cleaner, we should consider breaking the API, with
> a deprecation on the old one if possible.

Ok.

>> Maybe
>>
>> method="depth-first" vs. method="breadth-first"?
>>
>> Difficult to type, especially breat...breadh...aaaaahhh! How about
> bfs and dfs are common abbreviations for these.

Ok. Those are fine by me. "dfs" for the current rcollect(), and "bfs"
for the suggested recursive_collect().

>> method="sums" vs. method="all"? (Additionally, method="top" could be
>> top-level-only, i.e. the current behaviour of collect().)
> method="all" sounds like "apply all the methods".

Hmm, you're right. So that was a bad choice :)

> Top-level only is exactly what deep=False usually does.

Mm, yes. So it's better to do it with that to be consistent.

>> In addition to this, we could add a flag to the existing collect(),
>> which would make it behave as rcollect(). Default would be off to avoid
>> breaking the API. This way, the user could use the current rcollect()
>> API or the new extended collect() API.
>>
>> As an added bonus, this would be consistent with other simplifiers,
>> which have exactly this behaviour: same function for recursive and
>> non-recursive, with "deep" off by default.
> In my experience, deep is on by default. Also in my opinion in most
> cases it should be, as this is what you want most of the time.

Ok. I'm not sure why I thought it is off by default - might be in some
functions in the old 0.6.7. As I'm using both 0.7.1 and 0.6.7 at the
moment (two machines, different distro), I may be a bit confused about
API details at times :)

>> E.g. deep=True, and if this was enabled, the kwarg "method" would become
>> available (just like the suggested change for rcollect), and collect()
>> would basically call rcollect(), passing through the value for "method".
>> Then rcollect() could internally use collect() in the non-recursive form.
>>
>> (Or maybe split collect() into a pure API function and an internal
>> worker function to avoid the circular-looking use relationship between
>> collect() and rcollect() in the suggested solution. But that depends on
>> which you think is clearer; I'm fine with either option.)
> Maybe collect(deep=True) can call rcollect, which we can also
> deprecate. Then, when the deprecation period ends, we just remove
> rcollect from __init__.py.

Ok.

Summarizing the discussion, the cleanest solution is probably to change
the API to:

collect(expr, syms=None, **kwargs)

kwargs:
deep=True (default; API break!) or deep=False (top level only, like
old collect())
method="dfs" (old rcollect()) or method="bfs" (recursive_collect()).

For syms, a list can be given as before. The default is the special
value None, which means "use automatic syms".

rcollect(expr, *vars) just calls collect(expr, syms, deep=True,
method="dfs").

Questions:

- Is None good as a special value? I think it's at least better than the
string "auto", because IIRC collect() isn't that particular about the
type of syms being a list if you want to collect on one symbol only (I
think this is a very nice convenience feature and should be kept). Given
this behaviour, the value None is the only one that doesn't pass control
information in the data band (as it's not a valid symbol).

(Of course, whatever the special value is, it can be the default when
syms is omitted, so the user doesn't need to worry about it.)

- Which method should be the default? Personally, I would suggest the
new bfs; at least it's what I intuitively expect myself when doing a
"recursive collect" (i.e. "extract as much stuff to as near the top
level as possible").

What is your opinion on this? And the other developers?

Once these are settled, I think I have enough information to prepare a
proper patch (except for practical details, such as the preferred patch
format).

>> Mm, yes. I did try to pass in a list to rcollect() at first (as in "ah,
>> just prefix the function name with an r"), until reading the
>> documentation and noticing the star :)
>>
>> Consistent types are of course the most important thing in this regard,
>> but I think consistent parameter naming would be a nice bonus. It would
>> allow the user-programmer to learn the API faster, if the same kind of
>> thing always had the same name.
> Agreed. This can be changed throughout SymPy without an API break (in
> other words, patches welcome).

Ok. I might look into this when I have the time, but no promises :)

-J

Aaron Meurer

unread,

Sep 3, 2012, 12:39:03 PM9/3/12

to sy...@googlegroups.com

Well it is occasionally, but as I said generally it *should* be True
by default.

By the way, what distro has the old SymPy. Is it an issue of you using
some "stable" version, or should we try to work with the maintainers
to get it upgraded?

Yes, None is a common special value for cases like this.

>
> (Of course, whatever the special value is, it can be the default when
> syms is omitted, so the user doesn't need to worry about it.)
>
>
> - Which method should be the default? Personally, I would suggest the
> new bfs; at least it's what I intuitively expect myself when doing a
> "recursive collect" (i.e. "extract as much stuff to as near the top
> level as possible").
>
> What is your opinion on this? And the other developers?

I'm not sure. Maybe you could take a look at where in the library code
collect is used to see an idea of some use cases. You could also get
an idea by replacing the default and seeing how many tests fail.

>
>
> Once these are settled, I think I have enough information to prepare a
> proper patch (except for practical details, such as the preferred patch
> format).

Use a pull request on Github. We'll help you if you're new to git.

Aaron Meurer

>
>
>>> Mm, yes. I did try to pass in a list to rcollect() at first (as in "ah,
>>> just prefix the function name with an r"), until reading the
>>> documentation and noticing the star :)
>>>
>>> Consistent types are of course the most important thing in this regard,
>>> but I think consistent parameter naming would be a nice bonus. It would
>>> allow the user-programmer to learn the API faster, if the same kind of
>>> thing always had the same name.
>> Agreed. This can be changed throughout SymPy without an API break (in
>> other words, patches welcome).
>
> Ok. I might look into this when I have the time, but no promises :)
>
>
> -J
>

Juha Jeronen

unread,

Sep 4, 2012, 3:02:35 AM9/4/12

to sy...@googlegroups.com

On 09/03/12 19:39, Aaron Meurer wrote:
> On Sep 3, 2012, at 1:37 AM, Juha Jeronen<juha.j...@jyu.fi> wrote:
>
>> On 31/08/12 17:52, Aaron Meurer wrote:
>>
>> Ok. I'm not sure why I thought it is off by default - might be in some
>> functions in the old 0.6.7. As I'm using both 0.7.1 and 0.6.7 at the
>> moment (two machines, different distro), I may be a bit confused about
>> API details at times :)
>
> Well it is occasionally, but as I said generally it *should* be True
> by default.

Ok.

> By the way, what distro has the old SymPy. Is it an issue of you using
> some "stable" version, or should we try to work with the maintainers
> to get it upgraded?

Yeah, it's Debian Stable.

Out of the distros tracked by OSWatershed, debian and arch are the only
ones still offering 0.6.7 :)

http://oswatershed.org/pkg/sympy

And to be exact, I had forgotten that Ubuntu 12.04LTS actually comes
with SymPy 0.7.1rc1, not the final 0.7.1...

>> Summarizing the discussion, the cleanest solution is probably to change
>> the API to:
>>
>> collect(expr, syms=None, **kwargs)
>>
>> kwargs:
>> deep=True (default; API break!) or deep=False (top level only, like
>> old collect())
>> method="dfs" (old rcollect()) or method="bfs" (recursive_collect()).
>>
>> For syms, a list can be given as before. The default is the special
>> value None, which means "use automatic syms".
>>
>> rcollect(expr, *vars) just calls collect(expr, syms, deep=True,
>> method="dfs").
>>
>>
>> Questions:
>>
>>
>> - Is None good as a special value? I think it's at least better than the
>> string "auto", because IIRC collect() isn't that particular about the
>> type of syms being a list if you want to collect on one symbol only (I
>> think this is a very nice convenience feature and should be kept). Given
>> this behaviour, the value None is the only one that doesn't pass control
>> information in the data band (as it's not a valid symbol).
>
> Yes, None is a common special value for cases like this.

Ok.

>> (Of course, whatever the special value is, it can be the default when
>> syms is omitted, so the user doesn't need to worry about it.)
>>
>>
>> - Which method should be the default? Personally, I would suggest the
>> new bfs; at least it's what I intuitively expect myself when doing a
>> "recursive collect" (i.e. "extract as much stuff to as near the top
>> level as possible").
>>
>> What is your opinion on this? And the other developers?
>
> I'm not sure. Maybe you could take a look at where in the library code
> collect is used to see an idea of some use cases. You could also get
> an idea by replacing the default and seeing how many tests fail.

Ok.

>> Once these are settled, I think I have enough information to prepare a
>> proper patch (except for practical details, such as the preferred patch
>> format).
>
> Use a pull request on Github. We'll help you if you're new to git.

Ok.

I have used git, but not Github. I suppose I'll need to create an
account and then do something to create the pull request...

In any case, I'll prepare the patch locally first. I'll ping on this
list when I have a first version ready for review.

-J

Aaron Meurer

unread,

Sep 4, 2012, 2:21:26 PM9/4/12

to sy...@googlegroups.com

We've got a (slightly outdated in some areas) guide at
https://github.com/sympy/sympy/wiki/Development-workflow. GitHub also
has great guides of its own. You basically need to create an account
on GitHub, fork SymPy, add your ssh keys to GitHub, push your work to
a branch in your fork, and press the pull request button. The SSH key
part tends to be the trickiest, but fortunately you only have to do
that once per computer, and GitHub has a pretty good guide on how to
do it.

Aaron Meurer

Juha Jeronen

unread,

Sep 5, 2012, 6:37:41 AM9/5/12

to sy...@googlegroups.com

Ok. Thanks for the info.

I cloned and installed the latest SymPy yesterday from git. It seems
quite a few places call collect(). Just by a quick look at the source,
it's hard to predict whether a recursive or non-recursive call would be
better at each place. I guess I'll look at each instance more closely :)

And by the way, the tests for collect() look very extensive. I'll add
some of my own for the new recursive version when I'm done.

There are two more substance details, which came up when I was thinking
about this yesterday: automatically reordering given syms and handling
number atoms.

About the first one, consider the example

a*(d*c*b + c*b + b + 1) + d*(a*b*c + b*c + c + 1)

If we collect this by ["b","c"] (specifically in that order!), we will get

a*(b*(c*(d + 1) + 1) + 1) + d*(b*c*(a + 1) + c + 1)

leaving an extra "c" in the second parenthetical expression. Collecting
by ["c","b"] similarly leaves an extra "b" in the first one.

This is an actual example of what I vaguely mentioned earlier - that
there are cases in which there does not exist a global optimal ordering
for the syms to produce optimal collection.

Hence, let's allow re-ordering. If we reorder the given syms list
independently in each subexpression, and collect by ["b","c"] (now in
automatically determined, appropriate order for each subexpression) we
obtain

a*(b*(c*(d + 1) + 1) + 1) + d*(c*(b*(a + 1) + 1) + 1)

which gets rid of the extra "c". The first parenthetical expression gets
collected by ["b","c"] and the second by ["c","b"].

Now, it seems to me that this feature would be useful... so I extended
my recursive_collect() accordingly. The new one has a "reorder" flag
which affects the operation when the syms are given. When reordering is
enabled, it basically runs the automatic syms generation, and filters
the list to include only those syms that were given by the user (while
preserving the ordering from the automatically generated list).

What would be the sane default for this option? Intuitively, at least I
would expect the non-recursive version to do *exactly as it is told*
(i.e. no reordering unless explicitly requested), whereas in the
recursive version, it would make more sense to have reordering enabled
by default in order for collect() to "do what I mean" in cases like the
above.

Is this fine?

Why I think this functionality should be included is, that by using this
it is possible to produce efficient recursive collection w.r.t. the
given syms only. Something like this is needed, if the user wants to
(somewhat optimally) collect e.g. only in variables, ignoring constants.

Moving on to the other topic, about the number atoms, is collect()
intended to ignore numbers? To me it seems it does:

import sympy as sy
sy.collect( sy.sympify("2*a + 2*b + 2*c"), sy.sympify("2") )
=> 2*a + 2*b + 2*c

Also, this test

assert collect(-x/8 + x*y, -x) == x*(y - S(1)/8)

in simplify/tests/test_simplify.py, which produces the same result as
collecting w.r.t. just "x" - and indeed, according to the assert, is
expected to do so - seems to suggest that collect() is meant to ignore
numbers.

NumberSymbols, OTOH, seem to be handled the same way as generic Symbols:

sy.collect( sy.sympify("pi*a + pi*b + 2*pi*c"), sy.sympify("pi") )
=> pi*(a + b + 2*c)

I'm asking because my automatic syms generator also currently ignores
numbers. It would be possible to handle numbers, too, but that would
require complicating the ordering a bit (symbols first, numbers last).
If it's not needed, I won't add the extra functionality (and the extra
bugs to go with it).

Finally, reading the latest source, I noticed some things about collect():

- it descends into a top-level Mul even when not recursive. Is this
behaviour desired? (I think it's ok like this. It's just a bit
surprising - maybe needs to be mentioned in the docstring.)

- expr is sympified, but only after some processing (conditional on the
"if evaluate:") is already run on it. This processing assumes that expr
is already an object tree; hence collect() errors out on string input if
evaluate=True. Doing the sympification a few lines earlier would fix
this and should have no negative side effects. If that's ok, I'll change
this.

-J

Aaron Meurer

unread,

Sep 5, 2012, 10:48:29 AM9/5/12

to sy...@googlegroups.com

It seems OK to me.

>
> Why I think this functionality should be included is, that by using this it
> is possible to produce efficient recursive collection w.r.t. the given syms
> only. Something like this is needed, if the user wants to (somewhat
> optimally) collect e.g. only in variables, ignoring constants.
>
>
> Moving on to the other topic, about the number atoms, is collect() intended
> to ignore numbers? To me it seems it does:
>
> import sympy as sy
> sy.collect( sy.sympify("2*a + 2*b + 2*c"), sy.sympify("2") )
> => 2*a + 2*b + 2*c

Probably. One issue with this is that numbers automatically
distribute, so if you want to force them out, you have to use a hack
(this is what factor() does).

>
> Also, this test
>
> assert collect(-x/8 + x*y, -x) == x*(y - S(1)/8)
>
> in simplify/tests/test_simplify.py, which produces the same result as
> collecting w.r.t. just "x" - and indeed, according to the assert, is
> expected to do so - seems to suggest that collect() is meant to ignore
> numbers.
>
> NumberSymbols, OTOH, seem to be handled the same way as generic Symbols:
>
> sy.collect( sy.sympify("pi*a + pi*b + 2*pi*c"), sy.sympify("pi") )
> => pi*(a + b + 2*c)
>
> I'm asking because my automatic syms generator also currently ignores
> numbers. It would be possible to handle numbers, too, but that would require
> complicating the ordering a bit (symbols first, numbers last). If it's not
> needed, I won't add the extra functionality (and the extra bugs to go with
> it).

There are already good functions for pulling out numbers from an
expression, like factor_terms, gcd_terms, as_content_primitive (I'm
not sure what the best one to use is at the moment). So if anything
collect() should just call one of those as a pre- or post-processor.

>
>
> Finally, reading the latest source, I noticed some things about collect():
>
> - it descends into a top-level Mul even when not recursive. Is this
> behaviour desired? (I think it's ok like this. It's just a bit surprising -
> maybe needs to be mentioned in the docstring.)

This is probably an artifact of what I was talking about. Recursive
should really be the default, because it's what most people. If
collect() was truly non-recursive by default, it wouldn't do what most
people wanted it to do.

>
> - expr is sympified, but only after some processing (conditional on the "if
> evaluate:") is already run on it. This processing assumes that expr is
> already an object tree; hence collect() errors out on string input if
> evaluate=True. Doing the sympification a few lines earlier would fix this
> and should have no negative side effects. If that's ok, I'll change this.

It's OK, though note that in general, it's better to just pass SymPy
objects rather than strings to functions.

Juha Jeronen

unread,

Sep 6, 2012, 2:07:48 AM9/6/12

to sy...@googlegroups.com

On 05.09.2012 17:48, Aaron Meurer wrote:
> On Wed, Sep 5, 2012 at 4:37 AM, Juha Jeronen <juha.j...@jyu.fi> wrote:
>> Hence, let's allow re-ordering. If we reorder the given syms list
>> independently in each subexpression, and collect by ["b","c"] (now in
>> automatically determined, appropriate order for each subexpression) we
>> obtain
>>
>> a*(b*(c*(d + 1) + 1) + 1) + d*(c*(b*(a + 1) + 1) + 1)
>>
>> which gets rid of the extra "c". The first parenthetical expression gets
>> collected by ["b","c"] and the second by ["c","b"].
>>
>>
>> Now, it seems to me that this feature would be useful... so I extended my
>> recursive_collect() accordingly. The new one has a "reorder" flag which
>> affects the operation when the syms are given. When reordering is enabled,
>> it basically runs the automatic syms generation, and filters the list to
>> include only those syms that were given by the user (while preserving the
>> ordering from the automatically generated list).
>>
>> What would be the sane default for this option? Intuitively, at least I
>> would expect the non-recursive version to do *exactly as it is told* (i.e.
>> no reordering unless explicitly requested), whereas in the recursive
>> version, it would make more sense to have reordering enabled by default in
>> order for collect() to "do what I mean" in cases like the above.
>>
>> Is this fine?
> It seems OK to me.

Ok.

So, summarizing again:

collect(expr, syms=None, **kwargs)

kwargs:
deep=True (default; API break!) or deep=False (top level only, like
old collect())

method="dfs" (old rcollect()) or method="bfs" (recursive_collect();
suggesting this as default).

reorder=bool; whether to allow reordering of given syms
per-subexpression to optimize collection. When deep=False, default is
False, and when deep=True, default is True (mnemonic: default has the
same state as "deep"). When syms=None (automatic syms), this flag has no
effect.

For syms, a list can be given as before. The default is the special
value None, which means "use automatic syms".

rcollect(expr, *vars) just calls collect(expr, syms, deep=True,
method="dfs").

>> Why I think this functionality should be included is, that by using this it
>> is possible to produce efficient recursive collection w.r.t. the given syms
>> only. Something like this is needed, if the user wants to (somewhat
>> optimally) collect e.g. only in variables, ignoring constants.
>>
>>
>> Moving on to the other topic, about the number atoms, is collect() intended
>> to ignore numbers? To me it seems it does:
>>
>> import sympy as sy
>> sy.collect( sy.sympify("2*a + 2*b + 2*c"), sy.sympify("2") )
>> => 2*a + 2*b + 2*c
> Probably. One issue with this is that numbers automatically
> distribute, so if you want to force them out, you have to use a hack
> (this is what factor() does).

Ah, I see.

>> Also, this test
>>
>> assert collect(-x/8 + x*y, -x) == x*(y - S(1)/8)
>>
>> in simplify/tests/test_simplify.py, which produces the same result as
>> collecting w.r.t. just "x" - and indeed, according to the assert, is
>> expected to do so - seems to suggest that collect() is meant to ignore
>> numbers.
>>
>> NumberSymbols, OTOH, seem to be handled the same way as generic Symbols:
>>
>> sy.collect( sy.sympify("pi*a + pi*b + 2*pi*c"), sy.sympify("pi") )
>> => pi*(a + b + 2*c)
>>
>> I'm asking because my automatic syms generator also currently ignores
>> numbers. It would be possible to handle numbers, too, but that would require
>> complicating the ordering a bit (symbols first, numbers last). If it's not
>> needed, I won't add the extra functionality (and the extra bugs to go with
>> it).
> There are already good functions for pulling out numbers from an
> expression, like factor_terms, gcd_terms, as_content_primitive (I'm
> not sure what the best one to use is at the moment). So if anything
> collect() should just call one of those as a pre- or post-processor.

Ok. Thanks for the tip. I'll see what those can do.

But I think this part can wait until later - it's probably better to
first just integrate the current version, which ignores numbers
(consistently with how collect() currently works), and then see about
improving both.

>> Finally, reading the latest source, I noticed some things about collect():
>>
>> - it descends into a top-level Mul even when not recursive. Is this
>> behaviour desired? (I think it's ok like this. It's just a bit surprising -
>> maybe needs to be mentioned in the docstring.)
> This is probably an artifact of what I was talking about. Recursive
> should really be the default, because it's what most people. If
> collect() was truly non-recursive by default, it wouldn't do what most
> people wanted it to do.

Ok.

>> - expr is sympified, but only after some processing (conditional on the "if
>> evaluate:") is already run on it. This processing assumes that expr is
>> already an object tree; hence collect() errors out on string input if
>> evaluate=True. Doing the sympification a few lines earlier would fix this
>> and should have no negative side effects. If that's ok, I'll change this.
> It's OK, though note that in general, it's better to just pass SymPy
> objects rather than strings to functions.

Ok.

In programs/scripts, I agree. I consider it mainly a convenience feature
for interactive use :)

-J

Aaron Meurer

unread,

Sep 6, 2012, 2:18:23 AM9/6/12

to sy...@googlegroups.com

For interactive use I would just use var() to create the variables, or
even better, use isympy -a.

SymPy does have features that are useful interactively but technically
bad style, but they are always abused by those who don't know any
better (or who don't care).

Reply all

Reply to author

Forward