Changing the default sorting for Add and Mul


Aaron Meurer

May 29, 2012, 12:43:53 AM
to sy...@googlegroups.com
Hi.

I've been working on fixing the Python 3.3 issues, most of which stem
from hash randomization. I was thinking, instead of trying to fix
every algorithm that recurses through .args, it would be easier to
just make .args canonical. Are there any suggestions on a key that we
could use to sort the args for Add and Mul that would be just as fast
as hash, but be the same everywhere?

I was thinking of using _hashable_content. This should be identical
to sorting by hash, in the sense that the hash is computed from the
_hashable_content, but unlike the hash, it will always compare the
same everywhere. There are a few issues with this idea:

1. The hash is not exactly computed from _hashable_content. Rather,
the args are in _hashable_content, and the name of the object is
injected manually by __hash__. Presumably this can be fixed, though.

2. This would put a restriction on _hashable_content that it must be
not only hashable, but also comparable. This is not an issue now, as
far as I can tell, because it consists only of tuples, ints, and
strings, but if we ever wanted to use, say, a frozenset, we wouldn't
be able to (we'd instead have to sort it into a tuple, which could
potentially be a performance issue). I haven't made a deep study of
_hashable_content yet, and I don't think this is currently an issue,
but I'm wondering if anyone can foresee it becoming one.

3. I made a change to use this, and from what I can tell, it isn't
slower, but I'm not 100% confident in my benchmarking. Can anyone
suggest any benchmarks that would be good to test this? So far, I've
been testing

import random
from sympy import Symbol, Add, Mul

a = [Symbol('x%d' % i) for i in xrange(10000)]
random.shuffle(a)
Mul(*a)
Add(*a)

using IPython's %timeit magic, and there seems to be no significant
difference. If anyone can suggest a better way to test it, let me
know. Or better yet, test it yourself. I'll post a PR that
implements it as soon as I change _hashable_content to include the
name (hopefully that won't be too difficult).

Aaron Meurer

Aaron Meurer

May 29, 2012, 1:10:17 AM
to sy...@googlegroups.com
On Mon, May 28, 2012 at 10:43 PM, Aaron Meurer <asme...@gmail.com> wrote:
> Hi.
>
> I've been working on fixing the Python 3.3 issues, most of which stem
> from hash randomization.  I was thinking, instead of trying to fix
> every algorithm that recurses through .args, it would be easier to
> just make .args canonical.  Are there any suggestions on a key that we
> could use to sort the args for Add and Mul that would be just as fast
> as hash, but be the same everywhere?
>
> I was thinking of using _hashable_content.  This should be identical
> to sorting by hash, in the sense that the hash is computed from the
> _hashable_content, but unlike the hash, it will always compare the
> same everywhere.  There are a few issues with this idea:
>
> 1. The hash is not exactly computed from _hashable_content.  Rather,
> the args are in _hashable_content, and the name of the object is
> injected manually by __hash__.  Presumably this can be fixed, though.
>
> 2. This would put a restriction on _hashable_content that it must be
> not only hashable, but also comparable.  This is not an issue now, as
> far as I can tell, because it consists only of tuples, ints, and
> strings, but if we ever wanted to use, say, a frozenset, we wouldn't
> be able to (we'd instead have to sort it into a tuple, which could
> potentially be a performance issue).  I haven't made a deep study of
> _hashable_content yet, and I don't think this is currently an issue,
> but I'm wondering if anyone can foresee it becoming one.

Ah, I guess this is not true. _hashable_content generally contains
other Basic objects. So what I really want is the "fully denested"
_hashable_content, where each Basic object is recursively replaced
with its _hashable_content. I've no idea how expensive that would be
to compute.
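
Something like this hypothetical helper is the idea (just a sketch; I
haven't measured it):

from sympy import Basic

# Replace every Basic object by (class name, recursively denested
# _hashable_content), and every non-Basic leaf by (type name, str(leaf)),
# so equal expressions get equal, comparable keys regardless of hash
# randomization. (Note this is itself recursive, so very deeply nested
# expressions would hit the Python stack limit.)
def denested_key(obj):
    if isinstance(obj, Basic):
        return (type(obj).__name__,
                tuple(denested_key(c) for c in obj._hashable_content()))
    return (type(obj).__name__, str(obj))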

I'm also open to other suggestions for a canonical sort key.

Aaron Meurer

Joachim Durchholz

May 29, 2012, 1:57:36 PM
to sy...@googlegroups.com
On 29.05.2012 07:10, Aaron Meurer wrote:
>_hashable_content generally contains
> other Basic objects. So what I really want is the "fully denested"
> _hashable_content, where each Basic object is recursively replaced
> with its _hashable_content. I've no idea how expensive that would be
> to compute.

The behavior will be quadratic in the number of nesting levels. This
could start to impact us on deeply nested expressions - not with
anything a human would enter, but if somebody wants to use SymPy from a
script, this might be different. (I'm talking of thousands of nesting
levels.) One way to test that might be to simplify x+0+0+...+0 (1000
terms, then retry with 100,000 ;-) ).
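
A rough sketch of that test (hypothetical code; adjust the numbers to
taste):

import timeit
from sympy import Add, Symbol

x = Symbol('x')
for n in (1000, 100000):
    args = (x,) + (0,) * n
    # time only the first construction, since SymPy caches the result
    print(n, timeit.timeit(lambda: Add(*args), number=1))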

The downside of caching is that each mutator must invalidate the cached
hash. (Unless mutators are disallowed once an object is entered into a
dict. Or the mutable objects are really copy-on-write ones. I don't know
whether any of these paths is available already, or feasible if
unavailable - we're deep into Systems Programming Land here.)

Aaron Meurer

May 29, 2012, 6:04:24 PM
to sy...@googlegroups.com
In Python, only immutable objects are hashable, and in SymPy, all
Basic subclasses are immutable, so this isn't an issue.
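
For example, this is why the built-in mutable containers can't be used
as dict keys:

hash((1, 2))    # works: tuples are immutable
hash([1, 2])    # raises TypeError: unhashable type: 'list'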

Aaron Meurer

krastano...@gmail.com

May 29, 2012, 6:11:21 PM
to sy...@googlegroups.com
One can easily cheat this rule:

class A(object):
    def __init__(self, immut1, immut2, mut):
        self.immut1 = immut1
        self.immut2 = immut2
        self.mut = mut  # mutable state, deliberately left out of the hash
    def __hash__(self):
        return hash((self.immut1, self.immut2))

For instance, this could be a Manifold, which must be "immutable" as a
mathematical object, yet still be able to learn about new coordinate
systems defined on it.

I am mentioning it because I am not really sure what this means:
> In Python, only immutable objects are hashable
I suppose I can get some clarification that way.

Joachim Durchholz

May 29, 2012, 6:32:30 PM
to sy...@googlegroups.com
On 30.05.2012 00:11, krastano...@gmail.com wrote:
> I am mentioning it because I am not really sure what does this mean:
>> In Python, only immutable objects are hashable
> I suppose I can get some clarification that way.

My understanding is that nothing in the language or libraries enforces
that rule, but that you're in trouble if you try to use an object with a
mutable hash as a dict key.

Aaron Meurer

May 29, 2012, 6:33:47 PM
to sy...@googlegroups.com
It's a rule that's not enforced. Actually, I think what you're doing
is OK. As long as the hash (and equality comparison) of the object
does not depend on the mutable state, it shouldn't be an issue.

The main thing is that if you change the hash of an object during its
lifetime, or change what is used to compute the hash, bad things will
happen. For example:

class Bad(object):
    def __init__(self, arg):
        self.arg = arg

    def __hash__(self):
        return hash(self.arg)

In [26]: Bad(1)
Out[26]: <__main__.Bad object at 0x106ca7fd0>

In [27]: hash(Bad(1))
Out[27]: 1

In [28]: a = Bad(1)

In [29]: b = {a:1}

In [30]: a.arg = 2

In [31]: hash(a)
Out[31]: 2

In [32]: b[a]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-9f1bfc533672> in <module>()
----> 1 b[a]

KeyError: <__main__.Bad object at 0x106ca7f50>

Here, we implicitly changed the hash of a by mutating the attribute of
a that is used to compute the hash. Note that Python doesn't prevent
me from doing that. I could prevent it if I wanted, by making
__setattr__ raise AttributeError, but even then I could forcibly
change it by modifying the object's __dict__. This is what is meant
when we say that Python is a "consenting adults" language.

Bad things also happen if we change the object, but keep the hash the same:

class Bad2(object):
    def __init__(self, arg):
        self.arg = arg
        self.hash_cache = None
    def __hash__(self):
        if self.hash_cache is None:
            self.hash_cache = hash(self.arg)
        return self.hash_cache
    def __eq__(self, other):
        if not isinstance(other, Bad2):
            return False
        return self.arg == other.arg

(this is the pattern used by Basic, by the way)

Here we see that finding the key in a dictionary still works:

In [47]: a = Bad2(1)

In [48]: b = Bad2(2)

In [50]: d = {b:2}

In [51]: b.arg = 1

In [52]: hash(b)
Out[52]: 2

In [53]: d[b]
Out[53]: 2

but it no longer maintains the invariant that sets (and dicts) hold
unique objects (keys):

In [68]: a = Bad2(1)

In [69]: b = Bad2(2)

In [70]: hash(a)
Out[70]: 1

In [71]: hash(b)
Out[71]: 2

In [72]: b.arg = 1

In [73]: a == b
Out[73]: True

In [74]: hash(a) == hash(b)
Out[74]: False

In [75]: set([a, b])
Out[75]: set([<__main__.Bad2 object at 0x10692b3d0>, <__main__.Bad2
object at 0x10692b590>])

and this can lead to very subtle issues. For example, look what
happens when I don't compute the hash before changing the object:

In [56]: a = Bad2(1)

In [59]: b = Bad2(2)

In [60]: b.arg = 1

In [61]: a == b
Out[61]: True

In [62]: hash(a) == hash(b)
Out[62]: True

In [63]: set([a, b])
Out[63]: set([<__main__.Bad2 object at 0x10692b910>])

This time, the set contains one element!

So, yes, Python won't enforce it, but if you don't follow the rule
that hash(object) remains the same for the lifetime of object and that
a == b implies hash(a) == hash(b), then you will run into very subtle
issues, because very basic Python operations expect these invariants.

Aaron Meurer

Aaron Meurer

May 30, 2012, 2:20:20 AM
to sy...@googlegroups.com
On Tue, May 29, 2012 at 11:57 AM, Joachim Durchholz <j...@durchholz.org> wrote:
> On 29.05.2012 07:10, Aaron Meurer wrote:
>
>> _hashable_content generally contains
>> other Basic objects. So what I really want is the "fully denested"
>> _hashable_content, where each Basic object is recursively replaced
>> with its _hashable_content.  I've no idea how expensive that would be
>> to compute.
>
>
> The behavior will be quadratic in the number of nesting levels. This could
> start to impact us on deeply nested expressions - not with anything a human
> would enter, but if somebody wants to use SymPy from a script, this might be
> different. (I'm talking of thousands of nesting levels.)
> One way to test that might be to simplify x+0+0+...+0 (1000 terms, then retry
> with 100,000 ;-) ).

I don't see how it's quadratic. Maybe I'm missing something.

The problem I do see is that it's inherently recursive, meaning we are
limited by the Python stack (unless we do it with a manual stack).
Actually, it seems that hash() itself works regardless of this fact,
which is interesting. If I create a highly nested expression, one
where I know that recursively calling _hashable_content() will overflow
the stack, I can still hash it.
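
For example (a sketch; my understanding is that this works because each
level's hash is computed and cached while the nesting is still shallow):

import sys
from sympy import Symbol, sin

# Build an expression nested far past the Python recursion limit. A
# naive recursive walk over _hashable_content would overflow the stack
# here, but hash() succeeds because each subexpression's hash was
# already computed (and cached) bottom-up during construction.
x = Symbol('x')
expr = x
for _ in range(2 * sys.getrecursionlimit()):
    expr = sin(expr)

print(hash(expr))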

Aaron Meurer

>
> The downside of caching is that each mutator must invalidate the cached
> hash. (Unless mutators are disallowed once an object is entered into a dict.
> Or the mutable objects are really copy-on-write ones. I don't know whether
> any of these paths is available already, or feasible if unavailable - we're
> deep into Systems Programming Land here.)
>

Joachim Durchholz

May 30, 2012, 3:44:32 AM
to sy...@googlegroups.com
On 30.05.2012 08:20, Aaron Meurer wrote:
> On Tue, May 29, 2012 at 11:57 AM, Joachim Durchholz<j...@durchholz.org> wrote:
>> On 29.05.2012 07:10, Aaron Meurer wrote:
>>
>>> _hashable_content generally contains
>>> other Basic objects. So what I really want is the "fully denested"
>>> _hashable_content, where each Basic object is recursively replaced
>>> with its _hashable_content. I've no idea how expensive that would be
>>> to compute.
>>
>>
>> The behavior will be quadratic in the number of nesting levels. This could
>> start to impact us on deeply nested expressions - not with anything a human
>> would enter, but if somebody wants to use SymPy from a script, this might be
>> different. (I'm talking of thousands of nesting levels.)
>> One way to test that might be to simplify x+0+0+...+0 (1000 terms, then retry
>> with 100,000 ;-) ).
>
> I don't see how it's quadratic. Maybe I'm missing something.

Maybe my assumption is wrong. I'm assuming that at each level,
subexpressions are entered into a dict.

If that's true, the hash codes for all subexpressions will be
recomputed whenever an expression is being built from subexpressions.

> The problem I do see is that it's inherently recursive, meaning we are
> limited by the Python stack (unless we do it with a manual stack).
> Actually, it seems that hash() itself works regardless of this fact,
> which is interesting. If I create a highly nested expression, one
> where I know that recursively calling _hashable_content() will overflow
> the stack, I can still hash it.

Does the hash code depend on the subexpressions?
(x+(y+z)) should have a different hash code than (x+(y+a)), for example.

Aaron Meurer

Jun 21, 2012, 10:17:54 PM
to sy...@googlegroups.com
Yes, but those aren't exactly sub-expressions. x + y + z is flattened
into Add(x, y, z). A better example would be x + y*z, which is Add(x,
Mul(y, z)).

Aaron Meurer

Joachim Durchholz

Jun 22, 2012, 4:12:09 AM
to sy...@googlegroups.com
>>> The problem I do see is that it's inherently recursive, meaning we are
>>> limited by the Python stack (unless we do it with a manual stack).
>>> Actually, it seems that hash() itself works regardless of this fact,
>>> which is interesting. If I create a highly nested expression, one
>>> where I know that recursively calling _hashable_content() will overflow
>>> the stack, I can still hash it.
>>
>> Does the hash code depend on the subexpressions?
>> (x+(y+z)) should have a different hash code than (x+(y+a)), for example.
>
> Yes, but those aren't exactly sub-expressions. x + y + z is flattened
> into Add(x, y, z).

Lol okay :-)

> A better example would be x + y*z, which is Add(x,
> Mul(y, z)).

Okay... then make sure that the hash isn't recomputed for each level of
nesting, else you'll get a quadratic slowdown with the number of nesting
levels.

Oh, and hash codes must not be changed once they have been computed;
that's what Python wants.
BTW, what's the typical pattern in Python to make sure that no
hash-relevant attributes are modified?

Aaron Meurer

Jun 22, 2012, 4:18:14 AM
to sy...@googlegroups.com
On Jun 22, 2012, at 2:12 AM, Joachim Durchholz <j...@durchholz.org> wrote:

>>>> The problem I do see is that it's inherently recursive, meaning we are
>>>> limited by the Python stack (unless we do it with a manual stack).
>>>> Actually, it seems that hash() itself works regardless of this fact,
>>>> which is interesting. If I create a highly nested expression, one
>>>> where I know that recursively calling _hashable_content() will overflow
>>>> the stack, I can still hash it.
>>>
>>> Does the hash code depend on the subexpressions?
>>> (x+(y+z)) should have a different hash code than (x+(y+a)), for example.
>>
>> Yes, but those aren't exactly sub-expressions. x + y + z is flattened
>> into Add(x, y, z).
>
> Lol okay :-)
>
>> A better example would be x + y*z, which is Add(x,
>> Mul(y, z)).
>
> Okay... then make sure that the hash isn't recomputed for each level of nesting, else you'll get a quadratic slowdown with the number of nesting levels.

If it's the same object, the hash is only computed once, and then
cached. If the object is being rebuilt, it could be an issue.

>
> Oh, and hash codes must not be changed once they have been computed; that's what Python wants.
> BTW, what's the typical pattern in Python to make sure that no hash-relevant attributes are modified?

If an attribute is private (starts with _), then it's a signal to
other code that if you use it or modify it, bad things could happen.
You can also enforce it using __setattr__ if you're really worried
about it.
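
For illustration, here's a minimal sketch of the __setattr__ approach (a
toy class, not how Basic actually does it):

class Frozen(object):
    def __init__(self, arg):
        # bypass our own guard exactly once, during construction
        object.__setattr__(self, 'arg', arg)

    def __setattr__(self, name, value):
        raise AttributeError("Frozen instances are immutable")

    def __hash__(self):
        return hash(self.arg)

    def __eq__(self, other):
        return isinstance(other, Frozen) and self.arg == other.arg

A determined user can still mutate the instance through
object.__setattr__ or its __dict__, in keeping with "consenting
adults".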

Aaron Meurer