[Python-ideas] New __reference__ hook

28 views
Skip to first unread message

ha...@interia.pl

unread,
Dec 4, 2012, 4:58:04 AM12/4/12
to python...@python.org

Hi, guys.

Python 3 is very close to become a holy grail of programming languages in the sense that almost everything could be redefined. However, there is still one thing missing: the immutable copy-on-assign numeric types.
Consider this part of code:

a = 1
b = a
a += 1
assert a == b + 1

The object "1" gets assigned to the "a" variable, then another independent copy gets assigned to the "b" variable, then the value in the "a" variable gets modified without affecting the second.
The problem is - this behaviour can not be recreated in user-defined classes:

a = MyInteger(1)
b = a
a += 1
assert a == b + 1

The "a" and "b" variables both point to the same object. This is a difference on what one might expect with numeric types.

My proposal is to define another hook that gets called when an object is referenced.

def MyInteger:
def __reference__(self, context):
return copy.copy(self)

Each time when a reference count of an object would normally get incremented, this method should be called and the returned object will be referenced. The default implementation would be of course to return self. The context argument will give the object some information of the reason how we are being referenced.

This will allow easy implementation of such concepts as singletons, copy-on-write, immutables and even simplify things like reference loops.

The most obvious use-case would be implementations of some mathematical types like vectors, polynomials and so on. I've encountered this problem when I was writing a simple vector library and I had to explicitly copy each object on assignment, which is particulary annoying. The programmer's intuition for numeric-like types is for them to be immutable, yet they should have augmented asignment operators. This is a thing that can not be implemented transparently in the current Python, so there is my proposal.

Cheers,
haael.


_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas

Steven D'Aprano

unread,
Dec 4, 2012, 6:14:44 AM12/4/12
to python...@python.org
On 04/12/12 20:58, ha...@interia.pl wrote:
>
> Hi, guys.
>
> Python 3 is very close to become a holy grail of programming languages in
> the sense that almost everything could be redefined. However, there is
>still one thing missing: the immutable copy-on-assign numeric types.
> Consider this part of code:

I dispute that "everything can be redefined" is the holy grail of
programming languages. If it were, why isn't everyone using Forth?


> a = 1
> b = a
> a += 1
> assert a == b + 1
>
> The object "1" gets assigned to the "a" variable,

Correct, for some definition of "assigned" and "variable".


> then another independent copy gets assigned to the "b" variable,

Completely, utterly wrong.


>then the value in the "a" variable gets modified

Incorrect.


> without affecting the second.
> The problem is - this behaviour can not be recreated in user-defined
> classes:

Of course it can.

py> from decimal import Decimal # A pure-Python class, prior to Python 3.3
py> a = Decimal(1)
py> b = a
py> a += 1
py> assert a == b + 1
py> print a, b
2 1

If you prefer another example, use fractions.Fraction, also pure Python
and immutable, with support for augmented assignment.


I don't mean to be rude, or dismissive, but this is pretty basic Python
stuff. Please start with the Python data and execution models:

http://docs.python.org/2/reference/datamodel.html
http://docs.python.org/2/reference/executionmodel.html


although I must admit I don't find either of them especially clear. But
in simple terms, you need to reset your thinking: your assumptions about
what Python does are incorrect.

When you say:

a = 1

you are *binding* the name "a" to the object 1. When you then follow by
saying:

b = a

you bind the name "b" to the *same* object 1. It is not a copy. It is not
a "copy on assignment" or any other clever trick. You can prove to
yourself that they are the same object:


py> a = 1
py> b = a
py> a is b
True
py> id(a), id(b)
(140087472, 140087472)


When you then call

a += 1

this does not modify anything. Int objects (and floats, Decimal, strings,
and many others) are *immutable* -- they cannot be modified. So `a += 1`
creates a new object, 2, and binds it to the name "a". But the binding
from object 1 to name "b" is not touched.

If this is still not clear, I recommend you take the discussion onto one
of the other Python mailing lists, especially tu...@python.org or
pytho...@python.org, which are more appropriate for discussing these
things.



--
Steven

Barry Warsaw

unread,
Dec 4, 2012, 2:39:50 PM12/4/12
to python...@python.org
On Dec 04, 2012, at 10:14 PM, Steven D'Aprano wrote:

>I dispute that "everything can be redefined" is the holy grail of
>programming languages. If it were, why isn't everyone using Forth?

On the readability scale, where Python is pretty close to a 10 (almost
everyone can read almost all Python), Perl is a 4 (hard to read your own code
after a week or so), Forth is a 1 (you can't even read your own code after
your fingers stop moving).

:)

a-forth-enthusiast-from-way-back-in-the-day-ly y'rs,
-Barry
signature.asc

Mike Graham

unread,
Dec 5, 2012, 9:10:14 AM12/5/12
to ha...@interia.pl, Python-Ideas
On Tue, Dec 4, 2012 at 4:58 AM, <ha...@interia.pl> wrote:

Python 3 is very close to become a holy grail of programming languages in the sense that almost everything could be redefined. However, there is still one thing missing: the immutable copy-on-assign numeric types.
Consider this part of code:

a = 1
b = a
a += 1
assert a == b + 1

The object "1" gets assigned to the "a" variable, then another independent copy gets assigned to the "b" variable, then the value in the "a" variable gets modified without affecting the second.
The problem is - this behaviour can not be recreated in user-defined classes:

a = MyInteger(1)
b = a
a += 1
assert a == b + 1

The "a" and "b" variables both point to the same object. This is a difference on what one might expect with numeric types.

You misunderstand Python's semantics. Python never implicitly copies anything. Some types, like int, are immutable so you can't distinguish meaningfully between copying and not.

All names like `a` can be rebound `a = ....`, and this never mutates the object. Some objects can be mutated, which is done by some means other than rebinding a name.

I don't know what problem you had defining MyInteger. Here is a definition (albeit comprised of very, very sloppy code) that passes your test

class MyInteger(object):
    def __init__(self, i):
        self._i = i

    def __add__(self, other):
        if isinstance(other, MyInteger):
            other = other._i
        return MyInteger(self._i + other)

    def __eq__(self, other):
        return self._i == other._i

Mike

rand...@fastmail.us

unread,
Dec 5, 2012, 11:06:28 AM12/5/12
to python...@python.org
On Wed, Dec 5, 2012, at 9:10, Mike Graham wrote:
> I don't know what problem you had defining MyInteger. Here is a definition (albeit comprised of very, very sloppy code) that passes your test

Most likely he thought he had to define __iadd__ for += to work.

Jasper St. Pierre

unread,
Dec 5, 2012, 12:05:32 PM12/5/12
to rand...@fastmail.us, python-ideas
And? What's wrong with an __iadd__ that's exactly the same as Mike's __add__?
--
  Jasper

Sturla Molden

unread,
Dec 5, 2012, 1:09:47 PM12/5/12
to Python-Ideas
On 05.12.2012 18:05, Jasper St. Pierre wrote:
> And? What's wrong with an __iadd__ that's exactly the same as Mike's
> __add__?

I think it was a Java-confusion. He thought numbers were copied on
assignment. But there is no difference between value types and object
types in Python. Ints and floats are immutable, but they are not value
types as in Java.

But apart from that, I think allowing overloading of the binding
operator "=" might be a good idea. A special method __bind__ could
return the object to be bound:

a = b

should then bind the name "a" to the return value of

b.__bind__()

if b implements __bind__.

Sure, it could be used to implement copy on assignment. But it would
also do other things like allowing lazy evaluation of an expression.

NumPy code like

z = a*x + b*y + c

could avoid creating three temporary arrays if there was a __bind__
function called on "=". This is a big thing, cf. the difference between
NumPy and numexpr:

z = numexpr.evaluate("""a*x + b*y + c""")

The reason numerical expressions must be written as strings to be
efficient in Python is because there is no __bind__ function.



Sturla

Masklinn

unread,
Dec 5, 2012, 1:51:12 PM12/5/12
to python-ideas
On 2012-12-05, at 19:09 , Sturla Molden wrote:

> On 05.12.2012 18:05, Jasper St. Pierre wrote:
>> And? What's wrong with an __iadd__ that's exactly the same as Mike's
>> __add__?
>
> I think it was a Java-confusion. He thought numbers were copied on assignment. But there is no difference between value types and object types in Python. Ints and floats are immutable, but they are not value types as in Java.
>
> But apart from that, I think allowing overloading of the binding operator "=" might be a good idea. A special method __bind__ could return the object to be bound:
>
> a = b
>
> should then bind the name "a" to the return value of
>
> b.__bind__()
>
> if b implements __bind__.

Sounds odd and full of strange edge-cases. Would bind also get called
when providing parameters to a function call? When putting an object in
a literal of some sort? When returning an object from a function/method?
If not, why not?

> Sure, it could be used to implement copy on assignment. But it would also do other things like allowing lazy evaluation of an expression.
>
> NumPy code like
>
> z = a*x + b*y + c
>
> could avoid creating three temporary arrays if there was a __bind__ function called on "=".

Why? z could just be a "lazy value" at this point, basically a manual
building of thunks, only reifying them when necessary (whenever that
is). It's not like numpy *has* to create three temporary arrays, just
that it *does*.

Bruce Leban

unread,
Dec 5, 2012, 1:54:22 PM12/5/12
to Sturla Molden, Python-Ideas
On Wed, Dec 5, 2012 at 10:09 AM, Sturla Molden <stu...@molden.no> wrote:

But apart from that, I think allowing overloading of the binding operator "=" might be a good idea. A special method __bind__ could return the object to be bound:

   a = b

should then bind the name "a" to the return value of

   b.__bind__()

if b implements __bind__.

It' seems a bit more complicated than that. Take the example below. When is __bind__ going to be called? After a is multiplied by x, b is multipled by y, etc. or before? If after, that doesn't accomplish lazy evaluation as below. If before, then somehow this has to convert to a form that calls z.__bind__(something) and what is that something? 

Sure, it could be used to implement copy on assignment. But it would also do other things like allowing lazy evaluation of an expression.

NumPy code like

   z = a*x + b*y + c

could avoid creating three temporary arrays if there was a __bind__ function called on "=". This is a big thing, cf. the difference between NumPy and numexpr:

   z = numexpr.evaluate("""a*x + b*y + c""")

The reason numerical expressions must be written as strings to be efficient in Python is because there is no __bind__ function.

There is another way to write expressions that don't get evaluated:

lambda: a*x + b*y + c

So you could write this as z.bind(lambda: rhs) or if this is important enough there could be a new bind operator:

lhs @= rhs

which is equivalent to 

lhs.__bind__(lambda: rhs)

I think overriding = so sometimes it does regular binding and sometimes this magic binding would be confusing and dangerous. It means that every assignment operates differently if the lhs is already bound. Consider the difference between

t = a + b
typo = a+b

t @= a+b
typo @= a+b

where typo was supposed to be t but was mistyped. In the first set, line 1 does __bind__ to a+b while line just adds a and b and does a normal binding. In the second set, the first does __bind__ while the second raises an exception that typo is not bound.

It's even worse in the context of something like this:

d = {}
for i in range(2):
    d['x'] = a + i

in the first pass through the loop this is a regular assignment. In the second pass it may call __bind__ depending on what the value of a + 0 is. Ick.

--- Bruce

Sturla Molden

unread,
Dec 5, 2012, 3:09:49 PM12/5/12
to Masklinn, python-ideas

Den 5. des. 2012 kl. 19:51 skrev Masklinn <mask...@masklinn.net>:

>
> Why? z could just be a "lazy value" at this point, basically a manual
> building of thunks, only reifying them when necessary (whenever that
> is). It's not like numpy *has* to create three temporary arrays, just
> that it *does*.
>

It has to, because it does not know when to flush an expression. This strangely enough, accounts for most of the speed difference between Python/NumPy and e.g. Fortran 95. A Fortran 95 compiler can compile an array expression as a single loop. NumPy cannot, because the binary operators does not tell when an expression is "finalized". That is why the numexpr JIT compiler evaluates Python expressions as strings, and needs to include a parser and whatnot. Today, most numerical code is memory bound, not compute bound, as CPUs are immensely faster than RAM. So what keeps numerical/scientific code written in Python slower than C or Fortran today is mostly creation of temporary array objects – i.e. memory access –, not the computations per se. If we could get rid of temprary arrays, Python codes could possibly achieve 80 % of Fortran 95 speed. For scientistis that would mean we don't need to write any more Fortran or C.

But perhaps it is possible to do this with AST magic? I don't know. Nor do I know if __bind__ is the best way to do this. Perhaps not. But I do know that automatically detecting when to "flush a compund expression with (NumPy?) arrays" would be the holy grail for scientific computing with Python. A binary operator x+y would just return a symbolic representation of the expression, but when the full expression needs to be flushed we can e.g. ask OpenCL or LLVM to generate the code on the fly. It would turn numerical computing into something similar to dynamic HTML. And we know how good Python is at generating structured text on the fly.

Sturla

Masklinn

unread,
Dec 5, 2012, 3:45:51 PM12/5/12
to python-ideas
On 2012-12-05, at 21:09 , Sturla Molden wrote:

>
> Den 5. des. 2012 kl. 19:51 skrev Masklinn <mask...@masklinn.net>:
>
>>
>> Why? z could just be a "lazy value" at this point, basically a manual
>> building of thunks, only reifying them when necessary (whenever that
>> is). It's not like numpy *has* to create three temporary arrays, just
>> that it *does*.
>>
>
> It has to, because it does not know when to flush an expression.

That tends to be the hard thing to decide, but it should be possible to
find out most cases e.g. evaluate the thunks when elements are requested
(similar to generators, but do the whole thunk at once), when printing,
etc…

Or use the numexpr approach and perform the reification explicitly.

> But perhaps it is possible to do this with AST magic? I don't know.

I'm not sure there's even a need for AST magic (although you could also
play with that by writing operations within lambdas I guess, I've never
done much AST analysis/rewriting), it could simply use an approach
similar to SQLAlchemy's's ClauseElement: when applying an operation to
e.g. an array, rather than perform it just return a representation of
the operation itself (effectively rebuild some sort of AST), new
operations on *that* would simply build the tree further (composing the
thunk), and an explicit evaluation call or implicit evaluation due to
e.g. accessing stuff would compile the "potential" operation and perform
the actual computations.

> A binary operator x+y would just return a symbolic representation of the
> expression, but when the full expression needs to be flushed we can e.g.
> ask OpenCL or LLVM to generate the code on the fly.

Indeed.

Joao S. O. Bueno

unread,
Dec 5, 2012, 3:48:06 PM12/5/12
to Sturla Molden, python-ideas

Today that can be achieved by crafting a class that overrides all ops
to perform literal transforms and with a "flush" or "calculate"
method. Sympy does something like that, and it would not be hard to
have a numpy module to perform like that with numpy arrays. In this
particular use case, we'd have the full benefit of "explicit is better
than implicit".

js
-><-

Yuval Greenfield

unread,
Dec 5, 2012, 3:50:41 PM12/5/12
to Sturla Molden, python-ideas
On Wed, Dec 5, 2012 at 3:09 PM, Sturla Molden <stu...@molden.no> wrote:

But perhaps it is possible to do this with AST magic? I don't know. Nor do I know if __bind__ is the best way to do this. Perhaps not. But I do know that automatically detecting when to "flush a compund expression with (NumPy?) arrays" would be the holy grail for scientific computing with Python. A binary operator x+y would just return a symbolic representation of the expression, but when the full expression needs to be flushed we can e.g. ask OpenCL or LLVM to generate the code on the fly. It would turn numerical computing into something similar to dynamic HTML. And we know how good Python is at generating structured text on the fly.

Sturla


Not all pixel fiddling can be solved using array calculus, so there will always be C involved at some point.

Still this could be a great advancement. Though I don't think bind-time is the right time to evaluate anything as it would drive optimizing programmers to "one-line" things. Using intermediate variable names to explain an algorithm is crucial for readability in my experience.

Creating intermediate objects only to be evaluated when programmer explicitly demands is the way to go. E.g. "evalf" in sympy http://scipy-lectures.github.com/advanced/sympy.html

Amaury Forgeot d'Arc

unread,
Dec 5, 2012, 3:59:23 PM12/5/12
to Joao S. O. Bueno, python-ideas
On 5 December 2012 18:09, Sturla Molden <stu...@molden.no> wrote:
>
> Den 5. des. 2012 kl. 19:51 skrev Masklinn <mask...@masklinn.net>:
>
>>
>> Why? z could just be a "lazy value" at this point, basically a manual
>> building of thunks, only reifying them when necessary (whenever that
>> is). It's not like numpy *has* to create three temporary arrays, just
>> that it *does*.
>>
>
> It has to, because it does not know when to flush an expression. This strangely enough, accounts for most of the speed difference between Python/NumPy and e.g. Fortran 95. A Fortran 95 compiler can compile an array expression as a single loop. NumPy cannot, because the binary operators does not tell when an expression is "finalized". That is why the numexpr JIT compiler evaluates Python expressions as strings, and needs to include a parser and whatnot. Today, most numerical code is memory bound, not compute bound, as CPUs are immensely faster than RAM. So what keeps numerical/scientific code written in Python slower than C or Fortran today is mostly creation of temporary array objects – i.e. memory access –, not the computations per se. If we could get rid of temprary arrays, Python codes could possibly achieve 80 % of Fortran 95 speed. For scientistis that would mean we don't need to write any more Fortran or C.
>
> But perhaps it is possible to do this with AST magic? I don't know. Nor do I know if __bind__ is the best way to do this. Perhaps not. But I do know that automatically detecting when to "flush a compund expression with (NumPy?) arrays" would be the holy grail for scientific computing with Python. A binary operator x+y would just return a symbolic representation of the expression, but when the full expression needs to be flushed we can e.g. ask OpenCL or LLVM to generate the code on the fly. It would turn numerical computing into something similar to dynamic HTML. And we know how good Python is at generating structured text on the fly.


FYI, the numpy module shipped with PyPy does exactly this: the operations are recorded in some AST structure, which is evaluated only when the first item of the array is read.
This is completely transparent to the user, or to other parts of the interpreter.

PyPy uses JIT techniques to generate machine code specialized for the particular AST, and is typically 2x to 5x faster than Numpy, probably because a lot of allocations/copies are avoided.

-- 
Amaury Forgeot d'Arc

Thomas Kluyver

unread,
Dec 5, 2012, 4:33:07 PM12/5/12
to python-ideas
On 5 December 2012 20:09, Sturla Molden <stu...@molden.no> wrote:
But perhaps it is possible to do this with AST magic?

As far as I understand it, numba [1] does this kind of AST magic (among other things).

https://github.com/numba/numba

Thomas

Terry Reedy

unread,
Dec 5, 2012, 4:59:15 PM12/5/12
to python...@python.org
On 12/5/2012 1:09 PM, Sturla Molden wrote:
> On 05.12.2012 18:05, Jasper St. Pierre wrote:
>> And? What's wrong with an __iadd__ that's exactly the same as Mike's
>> __add__?
>
> I think it was a Java-confusion. He thought numbers were copied on
> assignment. But there is no difference between value types and object
> types in Python. Ints and floats are immutable, but they are not value
> types as in Java.
>
> But apart from that, I think allowing overloading of the binding
> operator "=" might be a good idea.

An assignment statement mutates the current local namespace. The
'current local namespace' is a hidden input to the function performed by
all statements, but it need not be a python object. The key symbol '='
is not an operator and 'a = b' is not an expression.

> A special method __bind__ could
> return the object to be bound:
>
> a = b
>
> should then bind the name "a" to the return value of
>
> b.__bind__()

If one wants to perform 'a = f(b)' or 'a = b.meth()' instead of 'a = b',
then one should just explicitly say so.

> if b implements __bind__.
>
> Sure, it could be used to implement copy on assignment. But it would
> also do other things like allowing lazy evaluation of an expression.
>
> NumPy code like
>
> z = a*x + b*y + c
>
> could avoid creating three temporary arrays if there was a __bind__
> function called on "=".

No, z = (a*x + b*y * c).__bind__, which is how you defined .__bind__
working, still requires that the expression be evaluated to an object.

The definition of Python requires computation of a*x, b*y, (a*x + b*y),
and finally (a*x + b*y) + c in that order. Either '*' or '+' may have
side-effects.

> This is a big thing, cf. the difference between
> NumPy and numexpr:
>
> z = numexpr.evaluate("""a*x + b*y + c""")

> The reason numerical expressions must be written as strings to be
> efficient in Python is because there is no __bind__ function.

No, it is because the semantics of Python require inefficiency that can
only be removed by a special parser-compiler with additional knowledge
of the relevant object class, method, and instance properties. Such
knowledge allows code to be re-written without changing the effect. For
Fortran arrays, the needed information includes the number and length of
each dimension. These are either declared or parameterized and passed as
arguments.

https://code.google.com/p/numexpr/wiki/Overview
says that numexpr.evaluate(ex) first calls compile(ex), but does not say
whether it has compile compile to cpython bytecode or only to ast. In
either case, it converts arraywise operations to blockwise operations
inside loops run on a custom C-coded virtual machine. They imply that
this is not as good as elementwise operations compiled to native machine
code. In any case, it knows that numpy array operations are side-effect
free and must use the runtime dimension and size info.

--
Terry Jan Reedy
Reply all
Reply to author
Forward
0 new messages