[Python-ideas] Deterministic iterator cleanup


Nathaniel Smith

Oct 19, 2016, 12:39:34 AM
to python...@python.org, Yury Selivanov
Hi all,

I'd like to propose that Python's iterator protocol be enhanced to add
a first-class notion of completion / cleanup.

This is mostly motivated by thinking about the issues around async
generators and cleanup. Unfortunately even though PEP 525 was accepted
I found myself unable to stop pondering this, and the more I've
pondered the more convinced I've become that the GC hooks added in PEP
525 are really not enough, and that we'll regret it if we stick with
them, or at least with them alone :-/. The strategy here is pretty
different -- it's an attempt to dig down and make a fundamental
improvement to the language that fixes a number of long-standing rough
spots, including async generators.

The basic concept is relatively simple: just adding a '__iterclose__'
method that 'for' loops call upon completion, even if that's via break
or exception. But, the overall issue is fairly complicated + iterators
have a large surface area across the language, so the text below is
pretty long. Mostly I wrote it all out to convince myself that there
wasn't some weird showstopper lurking somewhere :-). For a first pass
discussion, it probably makes sense to mainly focus on whether the
basic concept makes sense? The main rationale is at the top, but the
details are there too for those who want them.

Also, for *right* now I'm hoping -- probably unreasonably -- to try to
get the async iterator parts of the proposal in ASAP, ideally for
3.6.0 or 3.6.1. (I know this is about the worst timing for a proposal
like this, which I apologize for -- though async generators are
provisional in 3.6, so at least in theory changing them is not out of
the question.) So again, it might make sense to focus especially on
the async parts, which are a pretty small and self-contained part, and
treat the rest of the proposal as a longer-term plan provided for
context. The comparison to PEP 525 GC hooks comes right after the
initial rationale.

Anyway, I'll be interested to hear what you think!

-n

------------------

Abstract
========

We propose to extend the iterator protocol with a new
``__(a)iterclose__`` slot, which is called automatically on exit from
``(async) for`` loops, regardless of how they exit. This allows for
convenient, deterministic cleanup of resources held by iterators
without reliance on the garbage collector. This is especially valuable
for asynchronous generators.


Note on timing
==============

In practical terms, the proposal here is divided into two separate
parts: the handling of async iterators, which should ideally be
implemented ASAP, and the handling of regular iterators, which is a
larger but more relaxed project that can't start until 3.7 at the
earliest. But since the changes are closely related, and we probably
don't want to end up with async iterators and regular iterators
diverging in the long run, it seems useful to look at them together.


Background and motivation
=========================

Python iterables often hold resources which require cleanup. For
example: ``file`` objects need to be closed; the `WSGI spec
<https://www.python.org/dev/peps/pep-0333/>`_ adds a ``close`` method
on top of the regular iterator protocol and demands that consumers
call it at the appropriate time (though forgetting to do so is a
`frequent source of bugs
<http://blog.dscpl.com.au/2012/10/obligations-for-calling-close-on.html>`_);
and PEP 342 (based on PEP 325) extended generator objects to add a
``close`` method to allow generators to clean up after themselves.

Generally, objects that need to clean up after themselves also define
a ``__del__`` method to ensure that this cleanup will happen
eventually, when the object is garbage collected. However, relying on
the garbage collector for cleanup like this causes serious problems in
at least two cases:

- In Python implementations that do not use reference counting (e.g.
PyPy, Jython), calls to ``__del__`` may be arbitrarily delayed -- yet
many situations require *prompt* cleanup of resources. Delayed cleanup
produces problems like crashes due to file descriptor exhaustion, or
WSGI timing middleware that collects bogus times.

- Async generators (PEP 525) can only perform cleanup under the
supervision of the appropriate coroutine runner. ``__del__`` doesn't
have access to the coroutine runner; indeed, the coroutine runner
might be garbage collected before the generator object. So relying on
the garbage collector is effectively impossible without some kind of
language extension. (PEP 525 does provide such an extension, but it
has a number of limitations that this proposal fixes; see the
"alternatives" section below for discussion.)

Fortunately, Python provides a standard tool for doing resource
cleanup in a more structured way: ``with`` blocks. For example, this
code opens a file but relies on the garbage collector to close it::

    def read_newline_separated_json(path):
        for line in open(path):
            yield json.loads(line)

    for document in read_newline_separated_json(path):
        ...

and recent versions of CPython will point this out by issuing a
``ResourceWarning``, nudging us to fix it by adding a ``with`` block::

    def read_newline_separated_json(path):
        with open(path) as file_handle:      # <-- with block
            for line in file_handle:
                yield json.loads(line)

    for document in read_newline_separated_json(path):  # <-- outer for loop
        ...

But there's a subtlety here, caused by the interaction of ``with``
blocks and generators. ``with`` blocks are Python's main tool for
managing cleanup, and they're a powerful one, because they pin the
lifetime of a resource to the lifetime of a stack frame. But this
assumes that someone will take care of cleaning up the stack frame...
and for generators, this requires that someone ``close`` them.

In this case, adding the ``with`` block *is* enough to shut up the
``ResourceWarning``, but this is misleading -- the file object cleanup
here is still dependent on the garbage collector. The ``with`` block
will only be unwound when the ``read_newline_separated_json``
generator is closed. If the outer ``for`` loop runs to completion then
the cleanup will happen immediately; but if this loop is terminated
early by a ``break`` or an exception, then the ``with`` block won't
fire until the generator object is garbage collected.

The correct solution requires that all *users* of this API wrap every
``for`` loop in its own ``with`` block::

    with closing(read_newline_separated_json(path)) as genobj:
        for document in genobj:
            ...

This gets even worse if we consider the idiom of decomposing a complex
pipeline into multiple nested generators::

    def read_users(path):
        with closing(read_newline_separated_json(path)) as gen:
            for document in gen:
                yield User.from_json(document)

    def users_in_group(path, group):
        with closing(read_users(path)) as gen:
            for user in gen:
                if user.group == group:
                    yield user

In general if you have N nested generators then you need N+1 ``with``
blocks to clean up 1 file. And good defensive programming would
suggest that any time we use a generator, we should assume the
possibility that there could be at least one ``with`` block somewhere
in its (potentially transitive) call stack, either now or in the
future, and thus always wrap it in a ``with``. But in practice,
basically nobody does this, because programmers would rather write
buggy code than tiresome repetitive code. In simple cases like this
there are some workarounds that good Python developers know (e.g. in
this simple case it would be idiomatic to pass in a file handle
instead of a path and move the resource management to the top level),
but in general we cannot avoid the use of ``with``/``finally`` inside
of generators, and thus must deal with this problem one way or another.
When beauty and correctness fight then beauty tends to win, so it's
important to make correct code beautiful.

Still, is this worth fixing? Until async generators came along I would
have argued yes, but that it was a low priority, since everyone seems
to be muddling along okay -- but async generators make it much more
urgent. Async generators cannot do cleanup *at all* without some
mechanism for deterministic cleanup that people will actually use, and
async generators are particularly likely to hold resources like file
descriptors. (After all, if they weren't doing I/O, they'd be
generators, not async generators.) So we have to do something, and it
might as well be a comprehensive fix to the underlying problem. And
it's much easier to fix this now when async generators are first
rolling out than it will be to fix it later.

The proposal itself is simple in concept: add a ``__(a)iterclose__``
method to the iterator protocol, and have (async) ``for`` loops call
it when the loop is exited, even if this occurs via ``break`` or
exception unwinding. Effectively, we're taking the current cumbersome
idiom (``with`` block + ``for`` loop) and merging them together into a
fancier ``for``. This may seem non-orthogonal, but makes sense when
you consider that the existence of generators means that ``with``
blocks actually depend on iterator cleanup to work reliably, plus
experience showing that iterator cleanup is often a desirable feature
in its own right.
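
To make this concrete, here is a rough sketch -- not part of the
proposal text itself -- of a resource-owning, class-based iterator
written against the proposed protocol (the file handling is purely
illustrative)::

    class ReadLines:
        # Sketch: an iterator that owns a resource and implements the
        # proposed __iterclose__ slot.
        def __init__(self, path):
            self._file = open(path)

        def __iter__(self):
            return self

        def __next__(self):
            line = self._file.readline()
            if not line:
                self.__iterclose__()
                raise StopIteration
            return line

        def __iterclose__(self):
            # file.close() is already idempotent, so calling this more
            # than once is harmless.
            self._file.close()

Under the proposal, ``for line in ReadLines(path): ...`` would close
the file as soon as the loop exits, whether normally, via ``break``, or
via an exception.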


Alternatives
============

PEP 525 asyncgen hooks
----------------------

PEP 525 proposes a `set of global thread-local hooks managed by new
``sys.{get/set}_asyncgen_hooks()`` functions
<https://www.python.org/dev/peps/pep-0525/#finalization>`_, which
allow event loops to integrate with the garbage collector to run
cleanup for async generators. In principle, this proposal and PEP 525
are complementary, in the same way that ``with`` blocks and
``__del__`` are complementary: this proposal takes care of ensuring
deterministic cleanup in most cases, while PEP 525's GC hooks clean up
anything that gets missed. But ``__aiterclose__`` provides a number of
advantages over GC hooks alone:

- The GC hook semantics aren't part of the abstract async iterator
protocol, but are instead restricted `specifically to the async
generator concrete type <XX find and link Yury's email saying this>`_.
If you have an async iterator implemented using a class, like::

    class MyAsyncIterator:
        async def __anext__(self):
            ...

then you can't refactor this into an async generator without
changing its semantics, and vice-versa. This seems very unpythonic.
(It also leaves open the question of what exactly class-based async
iterators are supposed to do, given that they face exactly the same
cleanup problems as async generators.) ``__aiterclose__``, on the
other hand, is defined at the protocol level, so it's duck-type
friendly and works for all iterators, not just generators.

- Code that wants to work on non-CPython implementations like PyPy
cannot in general rely on GC for cleanup. Without ``__aiterclose__``,
it's more or less guaranteed that developers who develop and test on
CPython will produce libraries that leak resources when used on PyPy.
Developers who do want to target alternative implementations will
either have to take the defensive approach of wrapping every ``for``
loop in a ``with`` block, or else carefully audit their code to figure
out which generators might possibly contain cleanup code and add
``with`` blocks around those only. With ``__aiterclose__``, writing
portable code becomes easy and natural.

- An important part of building robust software is making sure that
exceptions always propagate correctly without being lost. One of the
most exciting things about async/await compared to traditional
callback-based systems is that instead of requiring manual chaining,
the runtime can now do the heavy lifting of propagating errors, making
it *much* easier to write robust code. But, this beautiful new picture
has one major gap: if we rely on the GC for generator cleanup, then
exceptions raised during cleanup are lost. So, again, with
``__aiterclose__``, developers who care about this kind of robustness
will either have to take the defensive approach of wrapping every
``for`` loop in a ``with`` block, or else carefully audit their code
to figure out which generators might possibly contain cleanup code.
``__aiterclose__`` plugs this hole by performing cleanup in the
caller's context, so writing more robust code becomes the path of
least resistance.
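
As a small illustration (using today's Python, with a synchronous
generator and ``contextlib.closing`` standing in for the proposed
machinery): when cleanup only runs from ``__del__``, an error raised
during cleanup can at best be reported as "unraisable", but when
cleanup runs in the caller's context it propagates normally::

    from contextlib import closing

    def gen_with_failing_cleanup():
        try:
            yield 1
        finally:
            # stand-in for e.g. a failed buffer flush during cleanup
            raise RuntimeError("cleanup failed")

    try:
        with closing(gen_with_failing_cleanup()) as gen:
            next(gen)   # stop early, without exhausting the generator
    except RuntimeError as exc:
        print("cleanup error reached the caller:", exc)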

- The WSGI experience suggests that there exist important
iterator-based APIs that need prompt cleanup and cannot rely on the
GC, even in CPython. For example, consider a hypothetical WSGI-like
API based around async/await and async iterators, where a response
handler is an async generator that takes request headers + an async
iterator over the request body, and yields response headers + the
response body. (This is actually the use case that got me interested
in async generators in the first place, i.e. this isn't hypothetical.)
If we follow WSGI in requiring that child iterators must be closed
properly, then without ``__aiterclose__`` the absolute most
minimalistic middleware in our system looks something like::

    async def noop_middleware(handler, request_header, request_body):
        async with aclosing(handler(request_header, request_body)) as aiter:
            async for response_item in aiter:
                yield response_item

Arguably in regular code one can get away with skipping the ``with``
block around ``for`` loops, depending on how confident one is that one
understands the internal implementation of the generator. But here we
have to cope with arbitrary response handlers, so without
``__aiterclose__``, this ``with`` construction is a mandatory part of
every middleware.

``__aiterclose__`` allows us to eliminate the mandatory boilerplate
and an extra level of indentation from every middleware::

    async def noop_middleware(handler, request_header, request_body):
        async for response_item in handler(request_header, request_body):
            yield response_item

So the ``__aiterclose__`` approach provides substantial advantages
over GC hooks.

This leaves open the question of whether we want a combination of GC
hooks + ``__aiterclose__``, or just ``__aiterclose__`` alone. Since
the vast majority of generators are iterated over using a ``for`` loop
or equivalent, ``__aiterclose__`` handles most situations before the
GC has a chance to get involved. The case where GC hooks provide
additional value is in code that does manual iteration, e.g.::

    agen = fetch_newline_separated_json_from_url(...)
    while True:
        document = await type(agen).__anext__(agen)
        if document["id"] == needle:
            break
    # doesn't do 'await agen.aclose()'

If we go with the GC-hooks + ``__aiterclose__`` approach, this
generator will eventually be cleaned up by GC calling the generator
``__del__`` method, which then will use the hooks to call back into
the event loop to run the cleanup code.

If we go with the no-GC-hooks approach, this generator will eventually
be garbage collected, with the following effects:

- its ``__del__`` method will issue a warning that the generator was
not closed (similar to the existing "coroutine never awaited"
warning).

- The underlying resources involved will still be cleaned up, because
the generator frame will still be garbage collected, causing it to
drop references to any file handles or sockets it holds, and then
those objects' ``__del__`` methods will release the actual operating
system resources.

- But, any cleanup code inside the generator itself (e.g. logging,
buffer flushing) will not get a chance to run.

The solution here -- as the warning would indicate -- is to fix the
code so that it calls ``__aiterclose__``, e.g. by using a ``with``
block::

    async with aclosing(fetch_newline_separated_json_from_url(...)) as agen:
        while True:
            document = await type(agen).__anext__(agen)
            if document["id"] == needle:
                break

Basically in this approach, the rule would be that if you want to
manually implement the iterator protocol, then it's your
responsibility to implement all of it, and that now includes
``__(a)iterclose__``.

GC hooks add non-trivial complexity in the form of (a) new global
interpreter state, (b) a somewhat complicated control flow (e.g.,
async generator GC always involves resurrection, so the details of PEP
442 are important), and (c) a new public API in asyncio (``await
loop.shutdown_asyncgens()``) that users have to remember to call at
the appropriate time. (This last point in particular somewhat
undermines the argument that GC hooks provide a safe backup to
guarantee cleanup, since if ``shutdown_asyncgens()`` isn't called
correctly then I *think* it's possible for generators to be silently
discarded without their cleanup code being called; compare this to the
``__aiterclose__``-only approach where in the worst case we still at
least get a warning printed. This might be fixable.) All this
considered, GC hooks arguably aren't worth it, given that the only
people they help are those who want to manually call ``__anext__`` yet
don't want to manually call ``__aiterclose__``. But Yury disagrees
with me on this :-). And both options are viable.


Always inject resources, and do all cleanup at the top level
------------------------------------------------------------

It was suggested on python-dev (XX find link) that a pattern to avoid
these problems is to always pass resources in from above, e.g.
``read_newline_separated_json`` should take a file object rather than
a path, with cleanup handled at the top level::

    def read_newline_separated_json(file_handle):
        for line in file_handle:
            yield json.loads(line)

    def read_users(file_handle):
        for document in read_newline_separated_json(file_handle):
            yield User.from_json(document)

    with open(path) as file_handle:
        for user in read_users(file_handle):
            ...

This works well in simple cases; here it lets us avoid the "N+1
``with`` blocks problem". But unfortunately, it breaks down quickly
when things get more complex. Consider if instead of reading from a
file, our generator was reading from a streaming HTTP GET request --
while handling redirects and authentication via OAUTH. Then we'd
really want the sockets to be managed down inside our HTTP client
library, not at the top level. Plus there are other cases where
``finally`` blocks embedded inside generators are important in their
own right: db transaction management, emitting logging information
during cleanup (one of the major motivating use cases for WSGI
``close``), and so forth. So this is really a workaround for simple
cases, not a general solution.


More complex variants of __(a)iterclose__
-----------------------------------------

The semantics of ``__(a)iterclose__`` are somewhat inspired by
``with`` blocks, but context managers are more powerful:
``__(a)exit__`` can distinguish between a normal exit versus exception
unwinding, and in the case of an exception it can examine the
exception details and optionally suppress propagation.
``__(a)iterclose__`` as proposed here does not have these powers, but
one can imagine an alternative design where it did.

However, this seems like unwarranted complexity: experience suggests
that it's common for iterables to have ``close`` methods, and even to
have ``__exit__`` methods that call ``self.close()``, but I'm not
aware of any common cases that make use of ``__exit__``'s full power.
I also can't think of any examples where this would be useful. And it
seems unnecessarily confusing to allow iterators to affect flow
control by swallowing exceptions -- if you're in a situation where you
really want that, then you should probably use a real ``with`` block
anyway.


Specification
=============

This section describes where we want to eventually end up, though
there are some backwards compatibility issues that mean we can't jump
directly here. A later section describes the transition plan.


Guiding principles
------------------

Generally, ``__(a)iterclose__`` implementations should:

- be idempotent,
- perform any cleanup that is appropriate on the assumption that the
iterator will not be used again after ``__(a)iterclose__`` is called.
In particular, once ``__(a)iterclose__`` has been called then calling
``__(a)next__`` produces undefined behavior.

And generally, any code which starts iterating through an iterable
with the intention of exhausting it, should arrange to make sure that
``__(a)iterclose__`` is eventually called, whether or not the iterator
is actually exhausted.


Changes to iteration
--------------------

The core proposal is the change in behavior of ``for`` loops. Given
this Python code::

    for VAR in ITERABLE:
        LOOP-BODY
    else:
        ELSE-BODY

we desugar to the equivalent of::

    _iter = iter(ITERABLE)
    _iterclose = getattr(type(_iter), "__iterclose__", lambda it: None)
    try:
        traditional-for VAR in _iter:
            LOOP-BODY
        else:
            ELSE-BODY
    finally:
        _iterclose(_iter)

where the "traditional-for statement" here is meant as a shorthand for
the classic 3.5-and-earlier ``for`` loop semantics.

Besides the top-level ``for`` statement, Python also contains several
other places where iterators are consumed. For consistency, these
should call ``__iterclose__`` as well using semantics equivalent to
the above. This includes:

- ``for`` loops inside comprehensions
- ``*`` unpacking
- functions which accept and fully consume iterables, like
``list(it)``, ``tuple(it)``, ``itertools.product(it1, it2, ...)``, and
others (a sketch of the equivalent semantics is given below).
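
For example, written against today's Python (using ``getattr`` directly
rather than the proposed ``operator.iterclose``), a consuming function
like ``list`` would need semantics equivalent to::

    def list_equivalent(iterable):
        # Sketch only: consume the iterable, then call __iterclose__
        # (if defined) whether or not iteration finished cleanly.
        it = iter(iterable)
        result = []
        try:
            for item in it:      # pre-proposal 'for' semantics
                result.append(item)
        finally:
            iterclose = getattr(type(it), "__iterclose__", None)
            if iterclose is not None:
                iterclose(it)
        return result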


Changes to async iteration
--------------------------

We also make the analogous changes to async iteration constructs,
except that the new slot is called ``__aiterclose__``, and it's an
async method that gets ``await``\ed.
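
As a rough sketch of the intended semantics -- written as a helper
function in today's Python, with a ``body`` callable standing in for
the loop body -- the desugaring awaits the new slot in a ``finally``
block::

    async def async_for_equivalent(aiterable, body):
        ait = aiterable.__aiter__()
        aiterclose = getattr(type(ait), "__aiterclose__", None)
        try:
            while True:
                try:
                    item = await type(ait).__anext__(ait)
                except StopAsyncIteration:
                    break
                await body(item)
        finally:
            if aiterclose is not None:
                await aiterclose(ait)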


Modifications to basic iterator types
-------------------------------------

Generator objects (including those created by generator comprehensions):
- ``__iterclose__`` calls ``self.close()``
- ``__del__`` calls ``self.close()`` (same as now), and additionally
issues a ``ResourceWarning`` if the generator wasn't exhausted. This
warning is hidden by default, but can be enabled for those who want to
make sure they aren't inadvertently relying on CPython-specific GC
semantics.

Async generator objects (including those created by async generator
comprehensions):
- ``__aiterclose__`` calls ``self.aclose()``
- ``__del__`` issues a ``RuntimeWarning`` if ``aclose`` has not been
called, since this probably indicates a latent bug, similar to the
"coroutine never awaited" warning.

QUESTION: should file objects implement ``__iterclose__`` to close the
file? On the one hand this would make this change more disruptive; on
the other hand people really like writing ``for line in open(...):
...``, and if we get used to iterators taking care of their own
cleanup then it might become very weird if files don't.


New convenience functions
-------------------------

The ``itertools`` module gains a new iterator wrapper that can be used
to selectively disable the new ``__iterclose__`` behavior::

    # QUESTION: I feel like there might be a better name for this one?
    class preserve:
        def __init__(self, iterable):
            self._it = iter(iterable)

        def __iter__(self):
            return self

        def __next__(self):
            return next(self._it)

        def __iterclose__(self):
            # Swallow __iterclose__ without passing it on
            pass

Example usage (assuming that file objects implement ``__iterclose__``)::

    with open(...) as handle:
        # Iterate through the same file twice:
        for line in itertools.preserve(handle):
            ...
        handle.seek(0)
        for line in itertools.preserve(handle):
            ...

The ``operator`` module gains two new functions, with semantics
equivalent to the following::

    def iterclose(it):
        if hasattr(type(it), "__iterclose__"):
            type(it).__iterclose__(it)

    async def aiterclose(ait):
        if hasattr(type(ait), "__aiterclose__"):
            await type(ait).__aiterclose__(ait)

These are particularly useful when implementing the changes in the next section:


__iterclose__ implementations for iterator wrappers
---------------------------------------------------

Python ships a number of iterator types that act as wrappers around
other iterators: ``map``, ``zip``, ``itertools.accumulate``,
``csv.reader``, and others. These iterators should define a
``__iterclose__`` method which calls ``__iterclose__`` in turn on
their underlying iterators. For example, ``map`` could be implemented
as::

    class map:
        def __init__(self, fn, *iterables):
            self._fn = fn
            self._iters = [iter(iterable) for iterable in iterables]

        def __iter__(self):
            return self

        def __next__(self):
            return self._fn(*[next(it) for it in self._iters])

        def __iterclose__(self):
            for it in self._iters:
                operator.iterclose(it)

In some cases this requires some subtlety; for example,
`itertools.tee <https://docs.python.org/3/library/itertools.html#itertools.tee>`_
should not call ``__iterclose__`` on the underlying iterator until it
has been called on *all* of the clone iterators.
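
For illustration, a rough sketch of a close-aware ``tee``-style wrapper
(purely illustrative; the real ``itertools.tee`` is implemented in C
with its own buffering scheme)::

    def _iterclose(it):
        # local stand-in for the proposed operator.iterclose()
        close = getattr(type(it), "__iterclose__", None)
        if close is not None:
            close(it)

    class _TeeShared:
        def __init__(self, iterable, n):
            self.source = iter(iterable)
            self.buffers = [[] for _ in range(n)]
            self.open_clones = n

    class _TeeClone:
        def __init__(self, shared, index):
            self._shared = shared
            self._index = index
            self._closed = False

        def __iter__(self):
            return self

        def __next__(self):
            buf = self._shared.buffers[self._index]
            if not buf:
                item = next(self._shared.source)  # may raise StopIteration
                for b in self._shared.buffers:
                    b.append(item)
            return buf.pop(0)

        def __iterclose__(self):
            if not self._closed:
                self._closed = True
                self._shared.open_clones -= 1
                if self._shared.open_clones == 0:
                    # only once *every* clone is closed do we close the source
                    _iterclose(self._shared.source)

    def tee_sketch(iterable, n=2):
        shared = _TeeShared(iterable, n)
        return tuple(_TeeClone(shared, i) for i in range(n))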


Example / Rationale
-------------------

The payoff for all this is that we can now write straightforward code like::

    def read_newline_separated_json(path):
        for line in open(path):
            yield json.loads(line)

and be confident that the file will receive deterministic cleanup
*without the end-user having to take any special effort*, even in
complex cases. For example, consider this silly pipeline::

    list(map(lambda key: key.upper(),
             (doc["key"] for doc in read_newline_separated_json(path))))

If our file contains a document where ``doc["key"]`` turns out to be
an integer, then the following sequence of events will happen:

1. ``key.upper()`` raises an ``AttributeError``, which propagates out
of the ``map`` and triggers the implicit ``finally`` block inside
``list``.
2. The ``finally`` block in ``list`` calls ``__iterclose__()`` on the
map object.
3. ``map.__iterclose__()`` calls ``__iterclose__()`` on the generator
comprehension object.
4. This injects a ``GeneratorExit`` exception into the generator
comprehension body, which is currently suspended inside the
comprehension's ``for`` loop body.
5. The exception propagates out of the ``for`` loop, triggering the
``for`` loop's implicit ``finally`` block, which calls
``__iterclose__`` on the generator object representing the call to
``read_newline_separated_json``.
6. This injects an inner ``GeneratorExit`` exception into the body of
``read_newline_separated_json``, currently suspended at the ``yield``.
7. The inner ``GeneratorExit`` propagates out of the ``for`` loop,
triggering the ``for`` loop's implicit ``finally`` block, which calls
``__iterclose__()`` on the file object.
8. The file object is closed.
9. The inner ``GeneratorExit`` resumes propagating, hits the boundary
of the generator function, and causes
``read_newline_separated_json``'s ``__iterclose__()`` method to return
successfully.
10. Control returns to the generator comprehension body, and the outer
``GeneratorExit`` continues propagating, allowing the comprehension's
``__iterclose__()`` to return successfully.
11. The rest of the ``__iterclose__()`` calls unwind without incident,
back into the body of ``list``.
12. The original ``AttributeError`` resumes propagating.

(The details above assume that we implement ``file.__iterclose__``; if
not then add a ``with`` block to ``read_newline_separated_json`` and
essentially the same logic goes through.)

Of course, from the user's point of view, this can be simplified down to just:

1. ``int.upper()`` raises an ``AttributeError``.
2. The file object is closed.
3. The ``AttributeError`` propagates out of ``list``.

So we've accomplished our goal of making this "just work" without the
user having to think about it.


Transition plan
===============

While the majority of existing ``for`` loops will continue to produce
identical results, the proposed changes will produce
backwards-incompatible behavior in some cases. Example::

    def read_csv_with_header(lines_iterable):
        lines_iterator = iter(lines_iterable)
        for line in lines_iterator:
            column_names = line.strip().split("\t")
            break
        for line in lines_iterator:
            values = line.strip().split("\t")
            record = dict(zip(column_names, values))
            yield record

This code used to be correct, but after this proposal is implemented
will require an ``itertools.preserve`` call added to the first ``for``
loop.
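
With the proposal in place, the fix would look something like this
(``itertools.preserve`` being the wrapper proposed above, which does
not exist in today's ``itertools``)::

    def read_csv_with_header(lines_iterable):
        lines_iterator = iter(lines_iterable)
        # preserve() swallows __iterclose__, so the 'break' in the first
        # loop no longer closes lines_iterator before the second loop.
        for line in itertools.preserve(lines_iterator):
            column_names = line.strip().split("\t")
            break
        for line in lines_iterator:
            values = line.strip().split("\t")
            record = dict(zip(column_names, values))
            yield record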

[QUESTION: currently, if you close a generator and then try to iterate
over it, then it just raises ``Stop(Async)Iteration``, so code that
passes the same generator object to multiple ``for`` loops but forgets
to use ``itertools.preserve`` won't see an obvious error -- the second
``for`` loop will just exit immediately. Perhaps it would be better if
iterating a closed generator raised a ``RuntimeError``? Note that
files don't have this problem -- attempting to iterate a closed file
object already raises ``ValueError``.]

Specifically, the incompatibility happens when all of these factors
come together:

- The automatic calling of ``__(a)iterclose__`` is enabled
- The iterable did not previously define ``__(a)iterclose__``
- The iterable does now define ``__(a)iterclose__``
- The iterable is re-used after the ``for`` loop exits

So the problem is how to manage this transition, and those are the
levers we have to work with.

First, observe that the only async iterables where we propose to add
``__aiterclose__`` are async generators, and there is currently no
existing code using async generators (though this will start changing
very soon), so the async changes do not produce any backwards
incompatibilities. (There is existing code using async iterators, but
using the new async for loop on an old async iterator is harmless,
because old async iterators don't have ``__aiterclose__``.) In
addition, PEP 525 was accepted on a provisional basis, and async
generators are by far the biggest beneficiary of this PEP's proposed
changes. Therefore, I think we should strongly consider enabling
``__aiterclose__`` for ``async for`` loops and async generators ASAP,
ideally for 3.6.0 or 3.6.1.

For the non-async world, things are harder, but here's a potential
transition path:

In 3.7:

Our goal is that existing unsafe code will start emitting warnings,
while those who want to opt-in to the future can do that immediately:

- We immediately add all the ``__iterclose__`` methods described above.
- If ``from __future__ import iterclose`` is in effect, then ``for``
loops and ``*`` unpacking call ``__iterclose__`` as specified above.
- If the future is *not* enabled, then ``for`` loops and ``*``
unpacking do *not* call ``__iterclose__``. But they do call some other
method instead, e.g. ``__iterclose_warning__``.
- Similarly, functions like ``list`` use stack introspection (!!) to
check whether their direct caller has ``__future__.iterclose``
enabled, and use this to decide whether to call ``__iterclose__`` or
``__iterclose_warning__``.
- For all the wrapper iterators, we also add ``__iterclose_warning__``
methods that forward to the ``__iterclose_warning__`` method of the
underlying iterator or iterators.
- For generators (and files, if we decide to do that),
``__iterclose_warning__`` is defined to set an internal flag, and
other methods on the object are modified to check for this flag. If
they find the flag set, they issue a ``PendingDeprecationWarning`` to
inform the user that in the future this sequence would have led to a
use-after-close situation and the user should use ``preserve()``.

In 3.8:

- Switch from ``PendingDeprecationWarning`` to ``DeprecationWarning``

In 3.9:

- Enable the ``__future__`` unconditionally and remove all the
``__iterclose_warning__`` stuff.

I believe that this satisfies the normal requirements for this kind of
transition -- opt-in initially, with warnings targeted precisely to
the cases that will be affected, and a long deprecation cycle.

Probably the most controversial / risky part of this is the use of
stack introspection to make the iterable-consuming functions sensitive
to a ``__future__`` setting, though I haven't thought of any situation
where it would actually go wrong yet...


Acknowledgements
================

Thanks to Yury Selivanov, Armin Rigo, and Carl Friedrich Bolz for
helpful discussion on earlier versions of this idea.

--
Nathaniel J. Smith -- https://vorpus.org
_______________________________________________
Python-ideas mailing list
Python...@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Neil Girdhar

Oct 19, 2016, 3:38:35 AM
to python-ideas, python...@python.org, yseli...@gmail.com, n...@pobox.com
This is a very interesting proposal.  I just wanted to share something I found in my quick search:

http://stackoverflow.com/questions/14797930/python-custom-iterator-close-a-file-on-stopiteration

Could you explain why the accepted answer there doesn't address this issue?

class Parse(object):
    """A generator that iterates through a file"""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            yield from f

Best,

Neil

Vincent Michel

Oct 19, 2016, 7:33:15 AM
to python...@python.org
Thanks Nathaniel for this great proposal.

As I went through your mail, I realized all the comments I wanted to
make were already covered in later paragraphs. And I don't think there's
a single point I disagree with.

I don't have a strong opinion about the synchronous part of the
proposal. I actually wouldn't mind the disparity between asynchronous
and synchronous iterators if '__aiterclose__' were to be accepted and
'__iterclose__' rejected.

However, I would like very much to see the asynchronous part happening
in Python 3.6. I can add another example for reference: aioreactive
(a fresh implementation of Rx for asyncio) is planning to handle
subscriptions to a producer using a context manager:

https://github.com/dbrattli/aioreactive#subscriptions-are-async-iterables

async with listen(xs) as ys:
    async for x in ys:
        do_something(x)

Like the proposal points out, this happens in the *user* code. With
'__aiterclose__', the former example could be simplified as:

async for x in listen(xs):
    do_something(x)

Or even better:

async for x in xs:
    do_something(x)


Cheers,
/Vincent

Oscar Benjamin

Oct 19, 2016, 7:34:35 AM
to python...@python.org
On 17 October 2016 at 09:08, Nathaniel Smith <n...@pobox.com> wrote:
> Hi all,

Hi Nathaniel. I'm just reposting what I wrote on pypy-dev (as
requested) but under the assumption that you didn't substantially
alter your draft - I apologise if some of the quoted text below has
already been edited.

> Always inject resources, and do all cleanup at the top level
> ------------------------------------------------------------
>
> It was suggested on python-dev (XX find link) that a pattern to avoid
> these problems is to always pass resources in from above, e.g.
> ``read_newline_separated_json`` should take a file object rather than
> a path, with cleanup handled at the top level::

I suggested this and I still think that it is the best idea.

> def read_newline_separated_json(file_handle):
>     for line in file_handle:
>         yield json.loads(line)
>
> def read_users(file_handle):
>     for document in read_newline_separated_json(file_handle):
>         yield User.from_json(document)
>
> with open(path) as file_handle:
>     for user in read_users(file_handle):
>         ...
>
> This works well in simple cases; here it lets us avoid the "N+1
> problem". But unfortunately, it breaks down quickly when things get
> more complex. Consider if instead of reading from a file, our
> generator was processing the body returned by an HTTP GET request --
> while handling redirects and authentication via OAUTH. Then we'd
> really want the sockets to be managed down inside our HTTP client
> library, not at the top level. Plus there are other cases where
> ``finally`` blocks embedded inside generators are important in their
> own right: db transaction management, emitting logging information
> during cleanup (one of the major motivating use cases for WSGI
> ``close``), and so forth.

I haven't written the kind of code that you're describing so I can't
say exactly how I would do it. I imagine though that helpers could be
used to solve some of the problems that you're referring to though.
Here's a case I do know where the above suggestion is awkward:

def concat(filenames):
    for filename in filenames:
        with open(filename) as inputfile:
            yield from inputfile

for line in concat(filenames):
    ...

It's still possible to safely handle this use case by creating a
helper though. fileinput.input almost does what you want:

with fileinput.input(filenames) as lines:
    for line in lines:
        ...

Unfortunately if filenames is empty this will default to sys.stdin so
it's not perfect but really I think introducing useful helpers for
common cases (rather than core language changes) should be considered
as the obvious solution here. Generally it would have been better if
the discussion for PEP 525 had focused more on helping people to
debug/fix dependence on __del__ rather than trying to magically fix
broken code.

> New convenience functions
> -------------------------
>
> The ``itertools`` module gains a new iterator wrapper that can be used
> to selectively disable the new ``__iterclose__`` behavior::
>
> # XX FIXME: I feel like there might be a better name for this one?
> class protect(iterable):
>     def __init__(self, iterable):
>         self._it = iter(iterable)
>
>     def __iter__(self):
>         return self
>
>     def __next__(self):
>         return next(self._it)
>
>     def __iterclose__(self):
>         # Swallow __iterclose__ without passing it on
>         pass
>
> Example usage (assuming that file objects implements ``__iterclose__``)::
>
> with open(...) as handle:
>     # Iterate through the same file twice:
>     for line in itertools.protect(handle):
>         ...
>     handle.seek(0)
>     for line in itertools.protect(handle):
>         ...

It would be much simpler to reverse this suggestion and say let's
introduce a helper that selectively *enables* the new behaviour you're
proposing i.e.:

for line in itertools.closeafter(open(...)):
    ...
    if not line.startswith('#'):
        break # <--------------- file gets closed here

Then we can leave (async) for loops as they are and there are no
backward compatibility problems etc.

--
Oscar

Oscar Benjamin

Oct 19, 2016, 7:40:48 AM
to python...@python.org
Looking more closely at this I realise that there is no way to
implement closeafter like this without depending on closeafter.__del__
to do the closing. So actually this is not a solution to the problem
at all. Sorry for the noise there!

Todd

Oct 19, 2016, 11:08:36 AM
to python-ideas
On Wed, Oct 19, 2016 at 3:38 AM, Neil Girdhar <miste...@gmail.com> wrote:
This is a very interesting proposal.  I just wanted to share something I found in my quick search:

http://stackoverflow.com/questions/14797930/python-custom-iterator-close-a-file-on-stopiteration

Could you explain why the accepted answer there doesn't address this issue?

class Parse(object):
    """A generator that iterates through a file"""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            yield from f

Best,

Neil


I think the difference is that this new approach guarantees cleanup the exact moment the loop ends, no matter how it ends. 

If I understand correctly, your approach will do cleanup when the loop ends only if the iterator is exhausted.  But if someone zips it with a shorter iterator, uses itertools.islice or something similar, breaks the loop, returns inside the loop, or in some other way ends the loop before the iterator is exhausted, the cleanup won't happen until the iterator is garbage collected.  And for non-reference-counting Python implementations, when that happens is completely unpredictable.

Yury Selivanov

Oct 19, 2016, 11:52:35 AM
to python...@python.org
I'm -1 on the idea. Here's why:


1. Python is a very dynamic language with GC and that is one of its
fundamental properties. This proposal might make GC of iterators more
deterministic, but that is only one case.

For instance, in some places in asyncio source code we have statements
like this: "self = None". Why? When an exception occurs and we want to
save it (for instance to log it), it holds a reference to the Traceback
object. Which in turn references frame objects. Which means that a lot
of objects in those frames will be alive while the exception object is
alive. So in asyncio we go to great lengths to avoid unnecessary runs
of GC, but this is an exception! Most of Python code out there today
doesn't do these sorts of tricks.

And this is just one example of how you can have cycles that require a
run of GC. It is not possible to have deterministic GC in real life
Python applications. This proposal addresses only *one* use case,
leaving 100s of others unresolved.

IMO, while GC-related issues can be annoying to debug sometimes, it's
not worth it to change the behaviour of iteration in Python only to
slightly improve on this.


2. This proposal will make writing iterators significantly harder.
Consider 'itertools.chain'. We will have to rewrite it to add the
proposed __iterclose__ method. The Chain iterator object will have to
track all of its iterators, call __iterclose__ on them when it's
necessary (there are a few corner cases). Given that this object is
implemented in C, it's quite a bit of work. And we'll have a lot of
objects to fix.
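
For concreteness, a rough pure-Python sketch of what a close-aware
chain might need -- hypothetical, since the real itertools.chain is
implemented in C:

    class chain:
        def __init__(self, *iterables):
            self._pending = list(iterables)   # not yet started
            self._current = None

        def __iter__(self):
            return self

        def __next__(self):
            while True:
                if self._current is None:
                    if not self._pending:
                        raise StopIteration
                    self._current = iter(self._pending.pop(0))
                try:
                    return next(self._current)
                except StopIteration:
                    # corner case: arguably the exhausted sub-iterator
                    # should be closed here too
                    self._current = None

        def __iterclose__(self):
            current, self._current = self._current, None
            self._pending.clear()
            if current is not None:
                close = getattr(type(current), "__iterclose__", None)
                if close is not None:
                    close(current)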

We can probably update all iterators in standard library (in 3.7), but
what about third-party code? It will take many years until you can say
with certainty that most of Python code supports __iterclose__ /
__aiterclose__.


3. This proposal changes the behaviour of 'for' and 'async for'
statements significantly. To do partial iteration you will have to use
a special builtin function to guard the iterator from being closed.
This is completely non-obvious to any existing Python user and will be
hard to explain to newcomers.


4. This proposal only addresses iteration with 'for' and 'async for'
statements. If you iterate using a 'while' loop and 'next()' function,
this proposal wouldn't help you. Also see the point #2 about
third-party code.


5. Asynchronous generators (AG) introduced by PEP 525 are finalized in a
very similar fashion to synchronous generators. There is an API that helps
Python call into the event loop to finalize AGs. asyncio in 3.6 (and other
event loops in the near future) already uses this API to ensure that
*all AGs in a long-running program are properly finalized* while it is
being run.

There is an extra loop method (`loop.shutdown_asyncgens`) that should be
called right before stopping the loop (exiting the program) to make sure
that all AGs are finalized, but if you forget to call it the world won't
end. The process will end and the interpreter will shutdown, maybe
issuing a couple of ResourceWarnings.

No exception will pass silently in the current PEP 525 implementation.
And if some AG isn't properly finalized a warning will be issued.

The current AG finalization mechanism must stay even if this proposal
gets accepted, as it ensures that even manually iterated AGs are
properly finalized.


6. If this proposal gets accepted, I think we shouldn't introduce it in
any form in 3.6. It's too late to implement it for both sync- and
async-generators. Implementing it only for async-generators will only
add cognitive overhead. Even implementing this only for
async-generators will (and should!) delay 3.6 release significantly.


7. To conclude: I'm not convinced that this proposal fully solves the
issue of non-deterministic GC of iterators. It cripples iteration
protocols to partially solve the problem for 'for' and 'async for'
statements, leaving manual iteration unresolved. It will make it harder
to write *correct* (async-) iterators. It introduces some *implicit*
context management to 'for' and 'async for' statements -- something that
IMO should be done by user with an explicit 'with' or 'async with'.


Yury

Random832

Oct 19, 2016, 12:39:42 PM
to python...@python.org
On Wed, Oct 19, 2016, at 11:51, Yury Selivanov wrote:
> I'm -1 on the idea. Here's why:
>
>
> 1. Python is a very dynamic language with GC and that is one of its
> fundamental properties. This proposal might make GC of iterators more
> deterministic, but that is only one case.

There is a huge difference between wanting deterministic GC and wanting
cleanup code to be called deterministically. We're not talking about
memory usage here.

Yury Selivanov

Oct 19, 2016, 12:44:41 PM
to python...@python.org
On 2016-10-19 12:38 PM, Random832 wrote:
> On Wed, Oct 19, 2016, at 11:51, Yury Selivanov wrote:
>> I'm -1 on the idea. Here's why:
>>
>>
>> 1. Python is a very dynamic language with GC and that is one of its
>> fundamental properties. This proposal might make GC of iterators more
>> deterministic, but that is only one case.
> There is a huge difference between wanting deterministic GC and wanting
> cleanup code to be called deterministically. We're not talking about
> memory usage here.
>

I understand, but both topics are closely tied together. Cleanup code
can be implemented in some __del__ method of some non-iterator object.
This proposal doesn't address such cases, it focuses only on iterators.

My point is that it's not worth it to *significantly* change iteration
(protocols and statements) in Python to only *partially* address the issue.

Yury

Neil Girdhar

Oct 19, 2016, 1:08:36 PM
to python...@googlegroups.com, python-ideas

I don't see that.  The "cleanup" will happen when collection is interrupted by an exception.  This has nothing to do with garbage collection either since the cleanup happens deterministically when the block is ended.  If this is the only example, then I would say this behavior is already provided and does not need to be added.
 


Nathaniel Smith

Oct 19, 2016, 2:12:48 PM
to Neil Girdhar, python-ideas, python...@googlegroups.com
On Wed, Oct 19, 2016 at 10:08 AM, Neil Girdhar <miste...@gmail.com> wrote:
>
>
> On Wed, Oct 19, 2016 at 11:08 AM Todd <todd...@gmail.com> wrote:
>>
>> On Wed, Oct 19, 2016 at 3:38 AM, Neil Girdhar <miste...@gmail.com>
>> wrote:
>>>
>>> This is a very interesting proposal. I just wanted to share something I
>>> found in my quick search:
>>>
>>>
>>> http://stackoverflow.com/questions/14797930/python-custom-iterator-close-a-file-on-stopiteration
>>>
>>> Could you explain why the accepted answer there doesn't address this
>>> issue?
>>>
>>> class Parse(object):
>>>     """A generator that iterates through a file"""
>>>     def __init__(self, path):
>>>         self.path = path
>>>
>>>     def __iter__(self):
>>>         with open(self.path) as f:
>>>             yield from f

BTW it may make this easier to read if we notice that it's essentially
a verbose way of writing:

def parse(path):
    with open(path) as f:
        yield from f

>>
>> I think the difference is that this new approach guarantees cleanup the
>> exact moment the loop ends, no matter how it ends.
>>
>> If I understand correctly, your approach will do cleanup when the loop
>> ends only if the iterator is exhausted. But if someone zips it with a
>> shorter iterator, uses itertools.islice or something similar, breaks the
>> loop, returns inside the loop, or in some other way ends the loop before the
>> iterator is exhausted, the cleanup won't happen when the iterator is garbage
>> collected. And for non-reference-counting python implementations, when this
>> happens is completely unpredictable.
>>
>> --
>
>
> I don't see that. The "cleanup" will happen when collection is interrupted
> by an exception. This has nothing to do with garbage collection either
> since the cleanup happens deterministically when the block is ended. If
> this is the only example, then I would say this behavior is already provided
> and does not need to be added.

I think there might be a misunderstanding here. Consider code like
this, that breaks out from the middle of the for loop:

def use_that_generator():
    for line in parse(...):
        if found_the_line_we_want(line):
            break
    # -- mark --
    do_something_with_that_line(line)

With current Python, what will happen is that when we reach the marked
line, then the for loop has finished and will drop its reference to
the generator object. At this point, the garbage collector comes into
play. On CPython, with its reference counting collector, the garbage
collector will immediately collect the generator object, and then the
generator object's __del__ method will restart 'parse' by having the
last 'yield' raise a GeneratorExit, and *that* exception will trigger
the 'with' block's cleanup. But in order to get there, we're
absolutely depending on the garbage collector to inject that
GeneratorExit. And on an implementation like PyPy that doesn't use
reference counting, the generator object will become collect*ible* at
the marked line, but might not actually be collect*ed* for an
arbitrarily long time afterwards. And until it's collected, the file
will remain open. 'with' blocks guarantee that the resources they hold
will be cleaned up promptly when the enclosing stack frame gets
cleaned up, but for a 'with' block inside a generator then you still
need something to guarantee that the enclosing stack frame gets
cleaned up promptly!

This proposal is about providing that thing -- with __(a)iterclose__,
the end of the for loop immediately closes the generator object, so
the garbage collector doesn't need to get involved.

Essentially the same thing happens if we replace the 'break' with a
'raise'. Though with exceptions, things can actually get even messier,
even on CPython. Here's a similar example except that (a) it exits
early due to an exception (which then gets caught elsewhere), and (b)
the invocation of the generator function ended up being kind of long,
so I split the for loop into two lines with a temporary variable:

def use_that_generator2():
    it = parse("/a/really/really/really/really/really/really/really/long/path")
    for line in it:
        if not valid_format(line):
            raise ValueError()

def catch_the_exception():
    try:
        use_that_generator2()
    except ValueError:
        # -- mark --
        ...

Here the ValueError() is raised from use_that_generator2(), and then
caught in catch_the_exception(). At the marked line,
use_that_generator2's stack frame is still pinned in memory by the
exception's traceback. And that means that all the local variables are
also pinned in memory, including our temporary 'it'. Which means that
parse's stack frame is also pinned in memory, and the file is not
closed.

With the __(a)iterclose__ proposal, when the exception is thrown then
the 'for' loop in use_that_generator2() immediately closes the
generator object, which in turn triggers parse's 'with' block, and
that closes the file handle. And then after the file handle is closed,
the exception continues propagating. So at the marked line, it's still
the case that 'it' will be pinned in memory, but now 'it' is a closed
generator object that has already relinquished its resources.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Chris Angelico

Oct 19, 2016, 2:14:11 PM
to python-ideas
On Thu, Oct 20, 2016 at 3:38 AM, Random832 <rand...@fastmail.com> wrote:
> On Wed, Oct 19, 2016, at 11:51, Yury Selivanov wrote:
>> I'm -1 on the idea. Here's why:
>>
>>
>> 1. Python is a very dynamic language with GC and that is one of its
>> fundamental properties. This proposal might make GC of iterators more
>> deterministic, but that is only one case.
>
> There is a huge difference between wanting deterministic GC and wanting
> cleanup code to be called deterministically. We're not talking about
> memory usage here.

Currently, iterators get passed around casually - you can build on
them, derive from them, etc, etc, etc. If you change the 'for' loop to
explicitly close an iterator, will you also change 'yield from'? What
about other forms of iteration? Will the iterator be closed when it
runs out normally?

This proposal is to iterators what 'with' is to open files and other
resources. I can build on top of an open file fairly easily:

@contextlib.contextmanager
def file_with_header(fn):
    with open(fn, "w") as f:
        f.write("Header Row")
        yield f

def main():
    with file_with_header("asdf") as f:
        """do stuff"""

I create a context manager based on another context manager, and I
have a guarantee that the end of the main() 'with' block is going to
properly close the file. Now, what happens if I do something similar
with an iterator?

def every_second(it):
    try:
        next(it)
    except StopIteration:
        return
    for value in it:
        yield value
        try:
            next(it)
        except StopIteration:
            break

This will work, because it's built on a 'for' loop. What if it's built
on a 'while' loop instead?

def every_second_broken(it):
    try:
        while True:
            next(it)
            yield next(it)
    except StopIteration:
        pass

Now it *won't* correctly call the end-of-iteration function, because
there's no 'for' loop. This is going to either (a) require that EVERY
consumer of an iterator follow this new protocol, or (b) introduce a
ton of edge cases.

ChrisA

Paul Moore

Oct 19, 2016, 2:39:20 PM
to Chris Angelico, python-ideas
On 19 October 2016 at 19:13, Chris Angelico <ros...@gmail.com> wrote:
> Now it *won't* correctly call the end-of-iteration function, because
> there's no 'for' loop. This is going to either (a) require that EVERY
> consumer of an iterator follow this new protocol, or (b) introduce a
> ton of edge cases.

Also, unless I'm misunderstanding the proposal, there's a fairly major
compatibility break. At present we have:

>>> lst = [1,2,3,4]
>>> it = iter(lst)
>>> for i in it:
...     if i == 2: break

>>> for i in it:
...     print(i)
3
4
>>>

With the proposed behaviour, if I understand it, "it" would be closed
after the first loop, so resuming "it" for the second loop wouldn't
work. Am I right in that? I know there's a proposed itertools function
to bring back the old behaviour, but it's still a compatibility break.
And code like this, that partially consumes an iterator, is not
uncommon.

Paul

Ethan Furman

Oct 19, 2016, 3:11:05 PM
to python...@python.org
On 10/19/2016 11:38 AM, Paul Moore wrote:

> Also, unless I'm misunderstanding the proposal, there's a fairly major
> compatibility break. At present we have:
>
>>>> lst = [1,2,3,4]
>>>> it = iter(lst)
>>>> for i in it:
> ... if i == 2: break
>
>>>> for i in it:
> ... print(i)
> 3
> 4
>>>>
>
> With the proposed behaviour, if I understand it, "it" would be closed
> after the first loop, so resuming "it" for the second loop wouldn't
> work. Am I right in that? I know there's a proposed itertools function
> to bring back the old behaviour, but it's still a compatibility break.
> And code like this, that partially consumes an iterator, is not
> uncommon.

Agreed. I like the idea in general, but this particular break feels like a deal-breaker.

I'd be okay with not having break close the iterator, and either introducing a 'break_and_close' type of keyword or some other way of signalling that we will not be using the iterator any more, so go ahead and close it. Does that invalidate, or take away most of the value of, the proposal?

--
~Ethan~

Todd

Oct 19, 2016, 3:22:56 PM
to python-ideas
On Wed, Oct 19, 2016 at 2:38 PM, Paul Moore <p.f....@gmail.com> wrote:
On 19 October 2016 at 19:13, Chris Angelico <ros...@gmail.com> wrote:
> Now it *won't* correctly call the end-of-iteration function, because
> there's no 'for' loop. This is going to either (a) require that EVERY
> consumer of an iterator follow this new protocol, or (b) introduce a
> ton of edge cases.

Also, unless I'm misunderstanding the proposal, there's a fairly major
compatibility break. At present we have:

>>> lst = [1,2,3,4]
>>> it = iter(lst)
>>> for i in it:
...   if i == 2: break

>>> for i in it:
...   print(i)
3
4
>>>

With the proposed behaviour, if I understand it, "it" would be closed
after the first loop, so resuming "it" for the second loop wouldn't
work. Am I right in that? I know there's a proposed itertools function
to bring back the old behaviour, but it's still a compatibility break.
And code like this, that partially consumes an iterator, is not
uncommon.

Paul


I may very well be misunderstanding the purpose of the proposal, but that is not how I saw it being used.  I thought of it being used to clean up things that happened in the loop, rather than clean up the iterator itself.  This would allow the iterator to manage events that occurred in the body of the loop.  So it would be more like this scenario:

>>> lst = objiterer([obj1, obj2, obj3, obj4])
>>> it = iter(lst)
>>> for i, _ in zip(it, [1, 2]):
...   b = i.some_method()
>>> for i in it:
...   c = i.other_method()
>>>

In this case, objiterer would do some cleanup related to obj1 and obj2 in the first loop and some cleanup related to obj3 and obj4 in the second loop.  There would be no backwards-compatibility break, the method would be purely opt-in and most typical iterators wouldn't need it.

However, in this case perhaps it might be better to have some method that is called after every loop, no matter how the loop is terminated (break, continue, return).  This would allow the cleanup to be done every loop rather than just at the end.
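
To make that concrete, here is a rough, purely hypothetical sketch of what such an objiterer could look like if the proposed __iterclose__ hook existed (the cleanup() method on the wrapped objects is assumed just for illustration):

class objiterer:
    def __init__(self, objs):
        self._objs = iter(objs)
        self._handed_out = []

    def __iter__(self):
        return self

    def __next__(self):
        obj = next(self._objs)
        self._handed_out.append(obj)
        return obj

    def __iterclose__(self):
        # Clean up only the objects yielded since the last close, so
        # each loop over the same iterator tidies up after itself.
        for obj in self._handed_out:
            obj.cleanup()   # assumed method, purely illustrative
        self._handed_out.clear()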

Nathaniel Smith

unread,
Oct 19, 2016, 3:24:48 PM10/19/16
to Paul Moore, python-ideas
On Wed, Oct 19, 2016 at 11:38 AM, Paul Moore <p.f....@gmail.com> wrote:
> On 19 October 2016 at 19:13, Chris Angelico <ros...@gmail.com> wrote:
>> Now it *won't* correctly call the end-of-iteration function, because
>> there's no 'for' loop. This is going to either (a) require that EVERY
>> consumer of an iterator follow this new protocol, or (b) introduce a
>> ton of edge cases.
>
> Also, unless I'm misunderstanding the proposal, there's a fairly major
> compatibility break. At present we have:
>
>>>> lst = [1,2,3,4]
>>>> it = iter(lst)
>>>> for i in it:
> ... if i == 2: break
>
>>>> for i in it:
> ... print(i)
> 3
> 4
>>>>
>
> With the proposed behaviour, if I understand it, "it" would be closed
> after the first loop, so resuming "it" for the second loop wouldn't
> work. Am I right in that? I know there's a proposed itertools function
> to bring back the old behaviour, but it's still a compatibility break.
> And code like this, that partially consumes an iterator, is not
> uncommon.

Right -- did you reach the "transition plan" section? (I know it's
wayyy down there.) The proposal is to hide this behind a __future__ at
first + a mechanism during the transition period to catch code that
depends on the old behavior and issue deprecation warnings. But it is
a compatibility break, yes.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Nathaniel Smith

unread,
Oct 19, 2016, 3:34:51 PM10/19/16
to Paul Moore, python-ideas
I should also say, regarding your specific example, I guess it's an
open question whether we would want list_iterator.__iterclose__ to
actually do anything. It could flip the iterator to a state where it
always raises StopIteration, or RuntimeError, or it could just be a
no-op that allows iteration to continue normally afterwards.
list_iterator doesn't have a close method right now, and it certainly
can't "close" the underlying list (whatever that would even mean), so
I don't think there's a strong expectation that it should do anything
in particular. The __iterclose__ contract is that you're not supposed
to call __next__ afterwards, so there's no real rule about what
happens if you do. And there aren't strong conventions right now about
what happens when you try to iterate an explicitly closed iterator --
files raise an error, generators just act like they were exhausted. So
there's a few options that all seem more-or-less reasonable and I
don't know that it's very important which one we pick.

Brendan Barnwell

unread,
Oct 19, 2016, 3:57:19 PM10/19/16
to python...@python.org
On 2016-10-19 12:21, Nathaniel Smith wrote:
>> >Also, unless I'm misunderstanding the proposal, there's a fairly major
>> >compatibility break. At present we have:
>> >
>>>>> >>>>lst = [1,2,3,4]
>>>>> >>>>it = iter(lst)
>>>>> >>>>for i in it:
>> >... if i == 2: break
>> >
>>>>> >>>>for i in it:
>> >... print(i)
>> >3
>> >4
>>>>> >>>>
>> >
>> >With the proposed behaviour, if I understand it, "it" would be closed
>> >after the first loop, so resuming "it" for the second loop wouldn't
>> >work. Am I right in that? I know there's a proposed itertools function
>> >to bring back the old behaviour, but it's still a compatibility break.
>> >And code like this, that partially consumes an iterator, is not
>> >uncommon.
>
> Right -- did you reach the "transition plan" section? (I know it's
> wayyy down there.) The proposal is to hide this behind a __future__ at
> first + a mechanism during the transition period to catch code that
> depends on the old behavior and issue deprecation warnings. But it is
> a compatibility break, yes.

To me this makes the change too hard to swallow. Although the issues
you describe are real, it doesn't seem worth it to me to change the
entire semantics of for loops just for these cases. There are lots of
for loops that are not async and/or do not rely on resource cleanup.
This will change how all of them work, just to fix something that
sometimes is a problem for some resource-wrapping iterators.

Moreover, even when the iterator does wrap a resource, sometimes I want
to be able to stop and resume iteration. It's not uncommon, for
instance, to have code using the csv module that reads some rows, pauses
to make a decision (e.g., to parse differently depending what header
columns are present, or skip some number of rows), and then resumes.
This would increase the burden of updating code to adapt to the new
breakage (since in this case the programmer would likely have to, or at
least want to, think about what is going on rather than just blindly
wrapping everything with protect() ).
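
For what it's worth, here is a small runnable illustration of the
pattern I mean, using today's semantics (under the proposal, the first
loop would presumably need the proposed preserve()/protect() wrapper to
keep the second loop working):

import csv, io

data = io.StringIO("name,qty\napple,1\nbanana,2\ncherry,3\n")
reader = csv.reader(data)
header = next(reader)          # peek at the header row
for row in reader:
    if row[0] == "banana":     # pause to make a decision mid-file
        break
for row in reader:             # today this resumes where we left off
    print(row)                 # -> ['cherry', '3']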

--
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is no
path, and leave a trail."
--author unknown

Neil Girdhar

unread,
Oct 19, 2016, 4:14:32 PM10/19/16
to Nathaniel Smith, python...@googlegroups.com, python-ideas
Yes, I understand that.  Maybe this is clearer.  This class adds an iterclose to any iterator so that when iteration ends, iterclose is automatically called:

def my_iterclose():
    print("Closing!")


class AddIterclose:

    def __init__(self, iterable, iterclose):
        self.iterable = iterable
        self.iterclose = iterclose

    def __iter__(self):
        try:
            for x in self.iterable:
                yield x
        finally:
            self.iterclose()


try:
    for x in AddIterclose(range(10), my_iterclose):
        print(x)
        if x == 5:
            raise ValueError
except:
    pass

Chris Angelico

unread,
Oct 19, 2016, 4:17:43 PM10/19/16
to python-ideas
On Thu, Oct 20, 2016 at 7:14 AM, Neil Girdhar <miste...@gmail.com> wrote:
> class AddIterclose:
>
>     def __init__(self, iterable, iterclose):
>         self.iterable = iterable
>         self.iterclose = iterclose
>
>     def __iter__(self):
>         try:
>             for x in self.iterable:
>                 yield x
>         finally:
>             self.iterclose()

Can this be simplified down to a generator?

def AddIterclose(iterable, iterclose):
    try:
        yield from iterable
    finally:
        iterclose()

ChrisA

Neil Girdhar

unread,
Oct 19, 2016, 4:29:54 PM10/19/16
to python-ideas, n...@pobox.com, python...@python.org
Ohhh, sorry, you want __iterclose__ to happen when iteration is terminated by a break statement as well?   Okay, I understand, and that's fair.

However, I would rather that people be explicit about when they're iterating (use the iteration protocol) and when they're managing a resource (use a context manager).  Trying to figure out where the context manager should go automatically (which is what it sounds like the proposal amounts to) is too difficult to get right, and when you get it wrong you close too early, and then what's the user supposed to do?  Suppress the early close with an even more convoluted notation?

If there is a problem with people iterating over things without a generator, my suggestion is to force them to use the generator.  For example, don't make your object iterable: make the value yielded by the context manager iterable.

Best,

Neil

(On preview, Re: Chris Angelico's refactoring of my code, nice!!)

Yury Selivanov

unread,
Oct 19, 2016, 4:34:44 PM10/19/16
to python...@python.org
Making the 'for' loop behave differently for built-in containers (i.e.
make __iterclose__ a no-op for them) will only make this whole thing
even more confusing.

It has to be consistent: if you partially iterate over *anything*
without wrapping it with `preserve()`, it should always close the iterator.

Yury

Nathaniel Smith

unread,
Oct 19, 2016, 5:03:12 PM10/19/16
to Yury Selivanov, python...@python.org
Hi Yury,

Thanks for the detailed comments! Replies inline below.

On Wed, Oct 19, 2016 at 8:51 AM, Yury Selivanov <yseliv...@gmail.com> wrote:
> I'm -1 on the idea. Here's why:
>
>
> 1. Python is a very dynamic language with GC and that is one of its
> fundamental properties. This proposal might make GC of iterators more
> deterministic, but that is only one case.
>
> For instance, in some places in asyncio source code we have statements like
> this: "self = None". Why? When an exception occurs and we want to save it
> (for instance to log it), it holds a reference to the Traceback object.
> Which in turn references frame objects. Which means that a lot of objects
> in those frames will be alive while the exception object is alive. So in
> asyncio we go to great lengths to avoid unnecessary runs of GC, but this is
> an exception! Most of Python code out there today doesn't do this sorts of
> tricks.
>
> And this is just one example of how you can have cycles that require a run
> of GC. It is not possible to have deterministic GC in real life Python
> applications. This proposal addresses only *one* use case, leaving 100s of
> others unresolved.

Maybe I'm misunderstanding, but I think those 100s of other cases
where you need deterministic cleanup are why 'with' blocks were
invented, and in my experience they work great for that. Once you get
in the habit, it's very easy and idiomatic to attach a 'with' to each
file handle, socket, etc., at the point where you create it. So from
where I stand, it seems like those 100s of unresolved cases actually
are resolved?

The problem is that 'with' blocks are great, and generators are great,
but when you put them together into the same language there's this
weird interaction that emerges, where 'with' blocks inside generators
don't really work for their intended purpose unless you're very
careful and willing to write boilerplate.

Adding deterministic cleanup to generators plugs this gap. Beyond
that, I do think it's a nice bonus that other iterables can take
advantage of the feature, but this isn't just a random "hey let's
smush two constructs together to save a line of code" thing --
iteration is special because it's where generator call stacks and
regular call stacks meet.

> IMO, while GC-related issues can be annoying to debug sometimes, it's not
> worth it to change the behaviour of iteration in Python only to slightly
> improve on this.
>
> 2. This proposal will make writing iterators significantly harder. Consider
> 'itertools.chain'. We will have to rewrite it to add the proposed
> __iterclose__ method. The Chain iterator object will have to track all of
> its iterators, call __iterclose__ on them when it's necessary (there are a
> few corner cases). Given that this object is implemented in C, it's quite a
> bit of work. And we'll have a lot of objects to fix.

When you say "make writing iterators significantly harder", is it fair
to say that you're thinking mostly of what I'm calling "iterator
wrappers"? For most day-to-day iterators, it's pretty trivial to
either add a close method or not; the tricky cases are when you're
trying to manage a collection of sub-iterators.

itertools.chain is a great challenge / test case here, because I think
it's about as hard as this gets :-). It took me a bit to wrap my head
around, but I think I've got it, and that it's not so bad actually.

Right now, chain's semantics are:

# copied directly from the docs
def chain(*iterables):
    for it in iterables:
        for element in it:
            yield element

In a post-__iterclose__ world, the inner for loop there will already
handle closing each iterators as its finished being consumed, and if
the generator is closed early then the inner for loop will also close
the current iterator. What we need to add is that if the generator is
closed early, we should also close all the unprocessed iterators.

The first change is to replace the outer for loop with a while/pop
loop, so that if an exception occurs we'll know which iterables remain
to be processed:

def chain(*iterables):
    try:
        while iterables:
            for element in iterables.pop(0):
                yield element
    ...

Now, what do we do if an exception does occur? We need to call
iterclose on all of the remaining iterables, but the tricky bit is
that this might itself raise new exceptions. If this happens, we don't
want to abort early; instead, we want to continue until we've closed
all the iterables, and then raise a chained exception. Basically what
we want is:

def chain(*iterables):
    try:
        while iterables:
            for element in iterables.pop(0):
                yield element
    finally:
        try:
            operator.iterclose(iter(iterables[0]))
        finally:
            try:
                operator.iterclose(iter(iterables[1]))
            finally:
                try:
                    operator.iterclose(iter(iterables[2]))
                finally:
                    ...

but of course that's not valid syntax. Fortunately, it's not too hard
to rewrite that into real Python -- but it's a little dense:

def chain(*iterables):
    try:
        while iterables:
            for element in iterables.pop(0):
                yield element
    # This is equivalent to the nested-finally chain above:
    except BaseException as last_exc:
        for iterable in iterables:
            try:
                operator.iterclose(iter(iterable))
            except BaseException as new_exc:
                if new_exc.__context__ is None:
                    new_exc.__context__ = last_exc
                last_exc = new_exc
        raise last_exc

It's probably worth wrapping that bottom part into an iterclose_all()
helper, since the pattern probably occurs in other cases as well.
(Actually, now that I think about it, the map() example in the text
should be doing this instead of what it's currently doing... I'll fix
that.)
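
For concreteness, such a helper might look something like this (just a
sketch -- the name and exact signature are up for grabs, and it assumes
the proposed __iterclose__ slot):

def iterclose_all(iterables, last_exc=None):
    # Close every iterable, remembering the most recent exception and
    # chaining earlier ones onto it, then re-raise at the end.
    for iterable in iterables:
        try:
            it = iter(iterable)
            close = getattr(type(it), "__iterclose__", None)
            if close is not None:
                close(it)
        except BaseException as new_exc:
            if new_exc.__context__ is None:
                new_exc.__context__ = last_exc
            last_exc = new_exc
    if last_exc is not None:
        raise last_exc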

This doesn't strike me as fundamentally complicated, really -- the
exception chaining logic makes it look scary, but basically it's just
the current chain() plus a cleanup loop. I believe that this handles
all the corner cases correctly. Am I missing something? And again,
this strikes me as one of the worst cases -- the vast majority of
iterators out there are not doing anything nearly this complicated
with subiterators.

> We can probably update all iterators in standard library (in 3.7), but what
> about third-party code? It will take many years until you can say with
> certainty that most of Python code supports __iterclose__ / __aiterclose__.

Adding support to itertools, toolz.itertoolz, and generators (which
are the most common way to implement iterator wrappers) will probably
take care of 95% of uses, but yeah, there's definitely a long tail
that will take time to shake out. The (extremely tentative) transition
plan has __iterclose__ as opt-in until 3.9, so that's about 3.5 years
from now.

__aiterclose__ is a different matter of course, since there are very
very few async iterator wrappers in the wild, and in general I think
most people writing async iterators are watching async/await-related
language developments very closely.

> 3. This proposal changes the behaviour of 'for' and 'async for' statements
> significantly. To do partial iteration you will have to use a special
> builtin function to guard the iterator from being closed. This is
> completely non-obvious to any existing Python user and will be hard to
> explain to newcomers.

It's true that it's non-obvious to existing users, but that's true of
literally every change that we could ever make :-). That's why we have
release notes, deprecation warnings, enthusiastic blog posts, etc.

For newcomers... well, it's always difficult for those of us with more
experience to put ourselves back in the mindset, but I don't see why
this would be particularly difficult to explain? for loops consume
their iterator; if you don't want that then here's how you avoid it.
That's no more difficult to explain than what an iterator is in the
first place, I don't think, and for me at least it's a lot easier to
wrap my head around than the semantics of else blocks on for loops
:-). (I always forget how those work.)

> 4. This proposal only addresses iteration with 'for' and 'async for'
> statements. If you iterate using a 'while' loop and 'next()' function, this
> proposal wouldn't help you. Also see the point #2 about third-party code.

True. If you're doing manual iteration, then you are still responsible
for manual cleanup (if that's what you want), just like today. This
seems fine to me -- I'm not sure why it's an objection to this
proposal :-).

> 5. Asynchronous generators (AG) introduced by PEP 525 are finalized in a
> very similar fashion to synchronous generators. There is an API to help
> Python to call event loop to finalize AGs. asyncio in 3.6 (and other event
> loops in the near future) already uses this API to ensure that *all AGs in a
> long-running program are properly finalized* while it is being run.
>
> There is an extra loop method (`loop.shutdown_asyncgens`) that should be
> called right before stopping the loop (exiting the program) to make sure
> that all AGs are finalized, but if you forget to call it the world won't
> end. The process will end and the interpreter will shutdown, maybe issuing
> a couple of ResourceWarnings.

There is no law that says that the interpreter always shuts down after
the event loop exits. We're talking about a fundamental language
feature here, it shouldn't be dependent on the details of libraries
and application shutdown tendencies :-(.

> No exception will pass silently in the current PEP 525 implementation.

Exceptions that occur inside a garbage-collected iterator will be
printed to the console, or possibly logged according to whatever the
event loop does with unhandled exceptions. And sure, that's better
than nothing, if someone remembers to look at the console/logs. But
they *won't* be propagated out to the containing frame, they can't be
caught, etc. That's a really big difference.

> And if some AG isn't properly finalized a warning will be issued.

This actually isn't true of the code currently in asyncio master -- if
the loop is already closed (either manually by the user or by its
__del__ being called) when the AG finalizer executes, then the AG is
silently discarded:
https://github.com/python/asyncio/blob/e3fed68754002000be665ad1a379a747ad9247b6/asyncio/base_events.py#L352

This isn't really an argument against the mechanism though, just a bug
you should probably fix :-).

I guess it does point to my main dissatisfaction with the whole GC
hook machinery, though. At this point I have spent many, many hours
tracing through the details of this catching edge cases -- first
during the initial PEP process, where there were a few rounds of
revision, then again the last few days when I first thought I found a
bunch of bugs that turned out to be spurious because I'd missed one
line in the PEP, plus one real bug that you already know about (the
finalizer-called-from-wrong-thread issue), and then I spent another
hour carefully reading through the code again with PEP 442 open
alongside once I realized how subtle the resurrection and cyclic
reference issues are here, and now here's another minor bug for you.

At this point I'm about 85% confident that it does actually function
as described, or that we'll at least be able to shake out any
remaining weird edge cases over the next 6-12 months as people use it.
But -- and I realize this is an aesthetic reaction as much as anything
else -- this all feels *really* unpythonic to me. Looking at the Zen,
the phrases that come to mind are "complicated", and "If the
implementation is hard to explain, ...".

The __(a)iterclose__ proposal definitely has its complexity as well,
but it's a very different kind. The core is incredibly
straightforward: "there is this method, for loops always call it".
That's it. When you look at a for loop, you can be extremely confident
about what's going to happen and when. Of course then there's the
question of defining this method on all the diverse iterators that we
have floating around -- I'm not saying it's trivial. But you can take
them one at a time, and each individual case is pretty
straightforward.

> The current AG finalization mechanism must stay even if this proposal gets
> accepted, as it ensures that even manually iterated AGs are properly
> finalized.

Like I said in the text, I don't find this very persuasive, since if
you're manually iterating then you can just as well take manual
responsibility for cleaning things up. But I could live with both
mechanisms co-existing.

> 6. If this proposal gets accepted, I think we shouldn't introduce it in any
> form in 3.6. It's too late to implement it for both sync- and
> async-generators. Implementing it only for async-generators will only add
> cognitive overhead. Even implementing this only for async-generators will
> (and should!) delay 3.6 release significantly.

I certainly don't want to delay 3.6. I'm not as convinced as you that
the async-generator code alone is so complicated that it would force a
delay, but if it is then 3.6.1 is also an option worth considering.

> 7. To conclude: I'm not convinced that this proposal fully solves the issue
> of non-deterministic GC of iterators. It cripples iteration protocols to
> partially solve the problem for 'for' and 'async for' statements, leaving
> manual iteration unresolved. It will make it harder to write *correct*
> (async-) iterators. It introduces some *implicit* context management to
> 'for' and 'async for' statements -- something that IMO should be done by
> user with an explicit 'with' or 'async with'.

The goal isn't to "fully solve the problem of non-deterministic GC of
iterators". That would require magic :-). The goal is to provide tools
so that when users run into this problem, they have viable options to
solve it. Right now, we don't have those tools, as evidenced by the
fact that I've basically never seen code that does this "correctly".
We can tell people that they should be using explicit 'with' on every
generator that might contain cleanup code, but they don't and they
won't, and as a result their code quality is suffering on several axes
(portability across Python implementations, 'with' blocks inside
generators that don't actually do anything except spuriously hide
ResourceWarnings, etc.).

Adding __(a)iterclose__ to (async) for loops makes it easy and
convenient to do the right thing in common cases; and in the
less-usual case where you want to do manual iteration, then you can
and should use a manual 'with' block too. The proposal is not trying
to replace 'with' blocks :-).

As for implicitness, eh. If 'for' is defined to mean 'iterate and then
close', then that's what 'for' means. If we make the change then there
won't be anything more implicit about 'for' calling __iterclose__ than
there is about 'for' calling __iter__ or __next__. Definitely this
will take some adjustment for those who are used to the old system,
but sometimes that's the price of progress ;-).

-n

--
Nathaniel J. Smith -- https://vorpus.org

Nathaniel Smith

unread,
Oct 19, 2016, 5:09:13 PM10/19/16
to Yury Selivanov, python...@python.org
You're probably right. My gut is leaning the same way, I'm just
hesitant to commit because I haven't thought about it for long. But I
do stand by the claim that this is probably not *that* important
either way :-).

-n

--
Nathaniel J. Smith -- https://vorpus.org

Yury Selivanov

unread,
Oct 19, 2016, 5:53:29 PM10/19/16
to Nathaniel Smith, python...@python.org
Nathaniel,

On 2016-10-19 5:02 PM, Nathaniel Smith wrote:

> Hi Yury,
>
> Thanks for the detailed comments! Replies inline below.

NP!

>
> On Wed, Oct 19, 2016 at 8:51 AM, Yury Selivanov <yseliv...@gmail.com> wrote:
>> I'm -1 on the idea. Here's why:
>>
>>
>> 1. Python is a very dynamic language with GC and that is one of its
>> fundamental properties. This proposal might make GC of iterators more
>> deterministic, but that is only one case.
>>
>> For instance, in some places in asyncio source code we have statements like
>> this: "self = None". Why? When an exception occurs and we want to save it
>> (for instance to log it), it holds a reference to the Traceback object.
>> Which in turn references frame objects. Which means that a lot of objects
>> in those frames will be alive while the exception object is alive. So in
>> asyncio we go to great lengths to avoid unnecessary runs of GC, but this is
>> an exception! Most of Python code out there today doesn't do this sorts of
>> tricks.
>>
>> And this is just one example of how you can have cycles that require a run
>> of GC. It is not possible to have deterministic GC in real life Python
>> applications. This proposal addresses only *one* use case, leaving 100s of
>> others unresolved.
> Maybe I'm misunderstanding, but I think those 100s of other cases
> where you need deterministic cleanup are why 'with' blocks were
> invented, and in my experience they work great for that. Once you get
> in the habit, it's very easy and idiomatic to attach a 'with' to each
> file handle, socket, etc., at the point where you create it. So from
> where I stand, it seems like those 100s of unresolved cases actually
> are resolved?

Not all code can be written with 'with' statements; see my example with
'self = None' in asyncio. Python code can be quite complex, involving
classes with __del__ that do some cleanups etc. Fundamentally, you
cannot make GC of such objects deterministic.

IOW I'm not convinced that if we implement your proposal we'll fix 90%
(or even 30%) of cases where non-deterministic and postponed cleanup is
harmful.
> The problem is that 'with' blocks are great, and generators are great,
> but when you put them together into the same language there's this
> weird interaction that emerges, where 'with' blocks inside generators
> don't really work for their intended purpose unless you're very
> careful and willing to write boilerplate.
>
> Adding deterministic cleanup to generators plugs this gap. Beyond
> that, I do think it's a nice bonus that other iterables can take
> advantage of the feature, but this isn't just a random "hey let's
> smush two constructs together to save a line of code" thing --
> iteration is special because it's where generator call stacks and
> regular call stacks meet.

Yes, I understand that your proposal really improves some things. OTOH
it undeniably complicates the iteration protocol and requires a long
period of deprecations, teaching users and library authors new
semantics, etc.

We are only now beginning to see Python 3 gain traction. I don't want us to
harm that by introducing another set of things to Python 3 that are
significantly different from Python 2. DeprecationWarnings/future
imports don't excite users either.

>> IMO, while GC-related issues can be annoying to debug sometimes, it's not
>> worth it to change the behaviour of iteration in Python only to slightly
>> improve on this.
>>
>> 2. This proposal will make writing iterators significantly harder. Consider
>> 'itertools.chain'. We will have to rewrite it to add the proposed
>> __iterclose__ method. The Chain iterator object will have to track all of
>> its iterators, call __iterclose__ on them when it's necessary (there are a
>> few corner cases). Given that this object is implemented in C, it's quite a
>> bit of work. And we'll have a lot of objects to fix.
> When you say "make writing iterators significantly harder", is it fair
> to say that you're thinking mostly of what I'm calling "iterator
> wrappers"? For most day-to-day iterators, it's pretty trivial to
> either add a close method or not; the tricky cases are when you're
> trying to manage a collection of sub-iterators.

Yes, mainly iterator wrappers. You'll also need to educate users
to refactor (more on that below) their __del__ methods to
__(a)iterclose__ in 3.6.
>
> itertools.chain is a great challenge / test case here, because I think
> it's about as hard as this gets :-). It took me a bit to wrap my head
> around, but I think I've got it, and that it's not so bad actually.
Now imagine that being applied throughout the stdlib, plus some of it
will have to be implemented in C. I'm not saying it's impossible, I'm
saying that it will require additional effort for CPython and the ecosystem.

[..]
>
>> 3. This proposal changes the behaviour of 'for' and 'async for' statements
>> significantly. To do partial iteration you will have to use a special
>> builtin function to guard the iterator from being closed. This is
>> completely non-obvious to any existing Python user and will be hard to
>> explain to newcomers.
> It's true that it's non-obvious to existing users, but that's true of
> literally every change that we could ever make :-). That's why we have
> release notes, deprecation warnings, enthusiastic blog posts, etc.

We don't often change the behavior of basic statements like 'for', if ever.

>
> For newcomers... well, it's always difficult for those of us with more
> experience to put ourselves back in the mindset, but I don't see why
> this would be particularly difficult to explain? for loops consume
> their iterator; if you don't want that then here's how you avoid it.
> That's no more difficult to explain than what an iterator is in the
> first place, I don't think, and for me at least it's a lot easier to
> wrap my head around than the semantics of else blocks on for loops
> :-). (I always forget how those work.)

A lot of code that you find on stackoverflow etc will be broken. Porting
code from Python2/<3.6 will be challenging. People are still struggling
to understand 'dict.keys()'-like views in Python 3.

>
>> 4. This proposal only addresses iteration with 'for' and 'async for'
>> statements. If you iterate using a 'while' loop and 'next()' function, this
>> proposal wouldn't help you. Also see the point #2 about third-party code.
> True. If you're doing manual iteration, then you are still responsible
> for manual cleanup (if that's what you want), just like today. This
> seems fine to me -- I'm not sure why it's an objection to this
> proposal :-).

Right now we can implement the __del__ method to cleanup iterators. And
it works for both partial iteration and cases where people forgot to
close the iterator explicitly.

With your proposal, to achieve the same (and make the code compatible
with new for-loop semantics), users will have to implement both
__iterclose__ and __del__.
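
Something like this is what I mean -- just a sketch, with the wrapped
resource and its next_item()/close() API assumed purely for illustration:

class ResourceIterator:
    def __init__(self, resource):
        self._resource = resource   # assumed to have next_item()/close()
        self._closed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._closed:
            raise StopIteration
        return self._resource.next_item()

    def close(self):
        if not self._closed:
            self._closed = True
            self._resource.close()

    # Proposed hook: prompt cleanup when a for-loop finishes with us.
    __iterclose__ = close

    # GC fallback, still needed for manual / partial iteration.
    def __del__(self):
        self.close()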

>
>> 5. Asynchronous generators (AG) introduced by PEP 525 are finalized in a
>> very similar fashion to synchronous generators. There is an API to help
>> Python to call event loop to finalize AGs. asyncio in 3.6 (and other event
>> loops in the near future) already uses this API to ensure that *all AGs in a
>> long-running program are properly finalized* while it is being run.
>>
>> There is an extra loop method (`loop.shutdown_asyncgens`) that should be
>> called right before stopping the loop (exiting the program) to make sure
>> that all AGs are finalized, but if you forget to call it the world won't
>> end. The process will end and the interpreter will shutdown, maybe issuing
>> a couple of ResourceWarnings.
> There is no law that says that the interpreter always shuts down after
> the event loop exits. We're talking about a fundamental language
> feature here, it shouldn't be dependent on the details of libraries
> and application shutdown tendencies :-(.

It's not about shutting down the interpreter or exiting the process.
The majority of async applications just run the loop until they exit.
The point of PEP 525 and how the finalization is handled in asyncio is
that AGs will be properly cleaned up for the absolute majority of time
(while the loop is running).

[..]
>> And if some AG isn't properly finalized a warning will be issued.
> This actually isn't true of the code currently in asyncio master -- if
> the loop is already closed (either manually by the user or by its
> __del__ being called) when the AG finalizer executes, then the AG is
> silently discarded:
> https://github.com/python/asyncio/blob/e3fed68754002000be665ad1a379a747ad9247b6/asyncio/base_events.py#L352
>
> This isn't really an argument against the mechanism though, just a bug
> you should probably fix :-).

I don't think it's a bug. When the loop is closed, the hook will do
nothing, so the asynchronous generator will be cleaned up by the
interpreter. If it has an 'await' expression in its 'finally'
statement, the interpreter will issue a warning.

I'll add a comment explaining this.

>
> I guess it does point to my main dissatisfaction with the whole GC
> hook machinery, though. At this point I have spent many, many hours
> tracing through the details of this catching edge cases -- first
> during the initial PEP process, where there were a few rounds of
> revision, then again the last few days when I first thought I found a
> bunch of bugs that turned out to be spurious because I'd missed one
> line in the PEP, plus one real bug that you already know about (the
> finalizer-called-from-wrong-thread issue), and then I spent another
> hour carefully reading through the code again with PEP 442 open
> alongside once I realized how subtle the resurrection and cyclic
> reference issues are here, and now here's another minor bug for you.

Yes, I agree it's not an easy thing to digest. Good thing is that
asyncio has a reference implementation of PEP 525 support, so people can
learn from it. I'll definitely add more comments to make the code
easier to read.

>
> At this point I'm about 85% confident that it does actually function
> as described, or that we'll at least be able to shake out any
> remaining weird edge cases over the next 6-12 months as people use it.
> But -- and I realize this is an aesthetic reaction as much as anything
> else -- this all feels *really* unpythonic to me. Looking at the Zen,
> the phrases that come to mind are "complicated", and "If the
> implementation is hard to explain, ...".
>
> The __(a)iterclose__ proposal definitely has its complexity as well,
> but it's a very different kind. The core is incredibly
> straightforward: "there is this method, for loops always call it".
> That's it. When you look at a for loop, you can be extremely confident
> about what's going to happen and when. Of course then there's the
> question of defining this method on all the diverse iterators that we
> have floating around -- I'm not saying it's trivial. But you can take
> them one at a time, and each individual case is pretty
> straightforward.

The __(a)iterclose__ semantics is clear. What's not clear is how much
harm changing the semantics of for-loops will do (and how to quantify
the amount of good :))

[..]

>> 7. To conclude: I'm not convinced that this proposal fully solves the issue
>> of non-deterministic GC of iterators. It cripples iteration protocols to
>> partially solve the problem for 'for' and 'async for' statements, leaving
>> manual iteration unresolved. It will make it harder to write *correct*
>> (async-) iterators. It introduces some *implicit* context management to
>> 'for' and 'async for' statements -- something that IMO should be done by
>> user with an explicit 'with' or 'async with'.
> The goal isn't to "fully solve the problem of non-deterministic GC of
> iterators". That would require magic :-). The goal is to provide tools
> so that when users run into this problem, they have viable options to
> solve it. Right now, we don't have those tools, as evidenced by the
> fact that I've basically never seen code that does this "correctly".
> We can tell people that they should be using explicit 'with' on every
> generator that might contain cleanup code, but they don't and they
> won't, and as a result their code quality is suffering on several axes
> (portability across Python implementations, 'with' blocks inside
> generators that don't actually do anything except spuriously hide
> ResourceWarnings, etc.).

Perhaps we should focus on teaching people that using 'with' statements
inside (async-) generators is a bad idea. What you should do instead is
to have a 'with' statement wrapping the code that uses the generator.

Yury

Nathaniel Smith

unread,
Oct 19, 2016, 6:02:23 PM10/19/16
to Chris Angelico, python-ideas
On Wed, Oct 19, 2016 at 11:13 AM, Chris Angelico <ros...@gmail.com> wrote:
> On Thu, Oct 20, 2016 at 3:38 AM, Random832 <rand...@fastmail.com> wrote:
>> On Wed, Oct 19, 2016, at 11:51, Yury Selivanov wrote:
>>> I'm -1 on the idea. Here's why:
>>>
>>>
>>> 1. Python is a very dynamic language with GC and that is one of its
>>> fundamental properties. This proposal might make GC of iterators more
>>> deterministic, but that is only one case.
>>
>> There is a huge difference between wanting deterministic GC and wanting
>> cleanup code to be called deterministically. We're not talking about
>> memory usage here.
>
> Currently, iterators get passed around casually - you can build on
> them, derive from them, etc, etc, etc. If you change the 'for' loop to
> explicitly close an iterator, will you also change 'yield from'?

Oh good point -- 'yield from' definitely needs a mention. Fortunately,
I think it's pretty easy: the only way the child generator in a 'yield
from' can be aborted early is if the parent generator is aborted
early, so the semantics you'd want are that iff the parent generator
is closed, then the child generator is also closed. 'yield from'
already implements those semantics :-). So the only remaining issue is
what to do if the child iterator completes normally, and in this case
I guess 'yield from' probably should call '__iterclose__' at that
point, like the equivalent for loop would.

> What
> about other forms of iteration? Will the iterator be closed when it
> runs out normally?

The iterator is closed if someone explicitly closes it, either by
calling the method by hand, or by passing it to a construct that calls
that method -- a 'for' loop without preserve(...), etc. Obviously any
given iterator's __next__ method could decide to do whatever it wants
when it's exhausted normally, including executing its 'close' logic,
but there's no magic that causes __iterclose__ to be called here.

The distinction between exhausted and exhausted+closed is useful:
consider some sort of file-wrapping iterator that implements
__iterclose__ as closing the file. Then this exhausts the iterator and
then closes the file:

for line in file_wrapping_iter:
    ...

and this also exhausts the iterator, but since __iterclose__ is not
called, it doesn't close the file, allowing it to be re-used:

for line in preserve(file_wrapping_iter):
    ...
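
(Concretely, the file_wrapping_iter above would be an instance of
something like the following -- a sketch only, since nothing calls
__iterclose__ automatically today:)

class FileWrappingIter:
    def __init__(self, path):
        self._file = open(path)

    def __iter__(self):
        return self

    def __next__(self):
        line = self._file.readline()
        if not line:
            raise StopIteration    # exhausted, but the file stays open
        return line

    def __iterclose__(self):
        self._file.close()         # closed only when explicitly asked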

OTOH there is one important limitation to this, which is that if
you're implementing your iterator by using a generator, then
generators in particular don't provide any way to distinguish between
exhausted and exhausted+closed (this is just how generators already
work, nothing to do with this proposal). Once a generator has been
exhausted, its close() method becomes a no-op.
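
(You can see that last point today:)

def gen():
    try:
        yield 1
    finally:
        print("cleanup")

g = gen()
print(list(g))   # exhausts the generator; "cleanup" prints here, then [1]
g.close()        # no-op: the generator is already exhausted, prints nothing
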
BTW, it's probably easier to read this way :-):

def every_second(it):
    for i, value in enumerate(it):
        if i % 2 == 1:
            yield value

> This will work, because it's built on a 'for' loop. What if it's built
> on a 'while' loop instead?
>
> def every_second_broken(it):
>     try:
>         while True:
>             next(it)
>             yield next(it)
>     except StopIteration:
>         pass
>
> Now it *won't* correctly call the end-of-iteration function, because
> there's no 'for' loop. This is going to either (a) require that EVERY
> consumer of an iterator follow this new protocol, or (b) introduce a
> ton of edge cases.

Right. If the proposal is accepted then a lot (I suspect the vast
majority) of iterator consumers will automatically DTRT because
they're already using 'for' loops or whatever; for those that don't,
they'll do whatever they're written to do, and that might or might not
match what users have come to expect. Hence the transition period,
ResourceWarnings and DeprecationWarnings, etc. I think the benefits
are worth it, but there certainly is a transition cost.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Paul Moore

unread,
Oct 19, 2016, 6:08:47 PM10/19/16
to Nathaniel Smith, python-ideas
I missed that you propose phasing this in, but it doesn't really alter
much; I think the current behaviour is valuable and common, and I'm -1
on breaking it. It's just too much of a fundamental change to how
loops and iterators interact for me to be comfortable with it -
particularly as it's only needed for a very specific use case (none of
my programs ever use async - why should I have to rewrite my loops
with a clumsy extra call just to cater for a problem that only occurs
in async code?)

IMO, and I'm sorry if this is controversial, there's a *lot* of new
language complexity that's been introduced for the async use case, and
it's only the fact that it can be pretty much ignored by people who
don't need or use async features that makes it acceptable (the "you
don't pay for what you don't use" principle). The problem with this
proposal is that it doesn't conform to that principle - it has a
direct, negative impact on users who have no interest in async.

Paul

Robert Collins

unread,
Oct 19, 2016, 6:42:22 PM10/19/16
to Nathaniel Smith, Yury Selivanov, python...@python.org
Hey Nathaniel - I like the intent here, but I think perhaps it would
be better if the problem is approached differently.

Seems to me that making *generators* have a special 'you are done now'
interface is special casing, which usually makes things harder to
learn and predict; and that more the net effect is that all loop
constructs will need to learn about that special case, whether looping
over a list, a generator, or whatever.

Generators already have a well-defined lifecycle - but as you say it's
not defined consistently across Python VMs. The language has no
guarantees about when finalisation will occur :(. The PEP 525 aclose
is a bit awkward itself in this way - but unlike regular generators it
does have a reason, which is that the language doesn't define an event
loop context as a built in thing - so finalisation can't reliably
summon one up.

So rather than adding a special case to finalise objects used in one
particular iteration - which will play havoc with break statements,
can we instead look at making escape analysis a required part of the
compiler: the borrow checker in rust is getting pretty good at
managing a very similar problem :).

I haven't fleshed out exactly what would be entailed, so consider this
a 'what if' and YMMV :).

-Rob


On 19 October 2016 at 17:38, Nathaniel Smith <n...@pobox.com> wrote:
> Hi all,
>
>     for line in file_handle:
>         yield json.loads(line)
>
> Always inject resources, and do all cleanup at the top level
> ------------------------------------------------------------
>
> It was suggested on python-dev (XX find link) that a pattern to avoid
> these problems is to always pass resources in from above, e.g.
> ``read_newline_separated_json`` should take a file object rather than
> a path, with cleanup handled at the top level::
>
> def read_newline_separated_json(file_handle):
>     for line in file_handle:
>         yield json.loads(line)
>
> def read_users(file_handle):
>     for document in read_newline_separated_json(file_handle):
>         yield User.from_json(document)
>
> with open(path) as file_handle:
>     for user in read_users(file_handle):
>         ...
>
> This works well in simple cases; here it lets us avoid the "N+1
> ``with`` blocks problem". But unfortunately, it breaks down quickly
> when things get more complex. Consider if instead of reading from a
> file, our generator was reading from a streaming HTTP GET request --
> while handling redirects and authentication via OAUTH. Then we'd
> really want the sockets to be managed down inside our HTTP client
> library, not at the top level. Plus there are other cases where
> ``finally`` blocks embedded inside generators are important in their
> own right: db transaction management, emitting logging information
> during cleanup (one of the major motivating use cases for WSGI
> New convenience functions
> -------------------------
>
> The ``itertools`` module gains a new iterator wrapper that can be used
> to selectively disable the new ``__iterclose__`` behavior::
>
> # QUESTION: I feel like there might be a better name for this one?
> class preserve:
>     def __init__(self, iterable):
>         self._it = iter(iterable)
>
>     def __iter__(self):
>         return self
>
>     def __next__(self):
>         return next(self._it)
>
>     def __iterclose__(self):
>         # Swallow __iterclose__ without passing it on
>         pass
>
> Example usage (assuming that file objects implement ``__iterclose__``)::
>
> with open(...) as handle:
>     # Iterate through the same file twice:
>     for line in itertools.preserve(handle):
>         ...
>     handle.seek(0)
>     for line in itertools.preserve(handle):
>         ...
>
> The ``operator`` module gains two new functions, with semantics
> equivalent to the following::
>
> def iterclose(it):
>     if hasattr(type(it), "__iterclose__"):
>         type(it).__iterclose__(it)
>
> async def aiterclose(ait):
>     if hasattr(type(ait), "__aiterclose__"):
>         await type(ait).__aiterclose__(ait)
>
> These are particularly useful when implementing the changes in the next section:
>
>
> __iterclose__ implementations for iterator wrappers
> ---------------------------------------------------
>
> Python ships a number of iterator types that act as wrappers around
> other iterators: ``map``, ``zip``, ``itertools.accumulate``,
> ``csv.reader``, and others. These iterators should define a
> ``__iterclose__`` method which calls ``__iterclose__`` in turn on
> their underlying iterators. For example, ``map`` could be implemented
> as::
>
> class map:
>     def __init__(self, fn, *iterables):
>         self._fn = fn
>         self._iters = [iter(iterable) for iterable in iterables]
>
>     def __iter__(self):
>         return self
>
>     def __next__(self):
> --
> Nathaniel J. Smith -- https://vorpus.org

Yury Selivanov

unread,
Oct 19, 2016, 6:58:43 PM10/19/16
to python...@python.org


On 2016-10-19 6:07 PM, Paul Moore wrote:
> I missed that you propose phasing this in, but it doesn't really alter
> much, I think the current behaviour is valuable and common, and I'm -1
> on breaking it. It's just too much of a fundamental change to how
> loops and iterators interact for me to be comfortable with it -
> particularly as it's only needed for a very specific use case (none of
> my programs ever use async - why should I have to rewrite my loops
> with a clumsy extra call just to cater for a problem that only occurs
> in async code?)

If I understand Nathaniel's proposal, fixing 'async for' isn't the only
motivation. Moreover, async generators aren't that different from sync
generators in terms of finalization.

Yury

Terry Reedy

unread,
Oct 19, 2016, 10:08:26 PM10/19/16
to python...@python.org
On 10/19/2016 12:38 AM, Nathaniel Smith wrote:

> I'd like to propose that Python's iterator protocol be enhanced to add
> a first-class notion of completion / cleanup.

With respect the the standard iterator protocol, a very solid -1 from
me. (I leave commenting specifically on __aiterclose__ to Yury.)

1. I consider the introduction of iterables and the new iterator
protocol in 2.2 and their gradual replacement of lists in many
situations to be the greatest enhancement to Python since 1.3 (my first
version). They are, to me, one of Python's greatest features, and the
minimal nature of the protocol is an essential part of what makes them
great.

2. I think you greatly underestimate the negative impact, just as we did
with changing str is bytes to str is unicode. The change itself,
embodied in for loops, will break most non-trivial programs. You
yourself note that there will have to be pervasive changes in the stdlib
just to begin fixing the breakage.

3. Though perhaps common for what you do, the need for the change is
extremely rare in the overall Python world. Iterators depending on an
external resource are rare (< 1%, I would think). Incomplete iteration
is also rare (also < 1%, I think). And resources do not always need to
be released immediately.

4. Previous proposals to officially augment the iterator protocol, even
with optional methods, have been rejected, and I think this one should
be too.

a. Add .__len__ as an option. We added __length_hint__, which an
iterator may implement, but which is not part of the iterator protocol.
It is also ignored by bool().

b., c. Add __bool__ and/or peek(). I posted a LookAhead wrapper class
that implements both for most any iterable. I suspect that it is
rarely used.


> def read_newline_separated_json(path):
>     with open(path) as file_handle: # <-- with block
>         for line in file_handle:
>             yield json.loads(line)

One problem with passing paths around is that it makes the receiving
function hard to test. I think functions should at least optionally
take an iterable of lines, and make the open part optional. But then
closing should also be conditional.

If the combination of 'with', 'for', and 'yield' does not work together,
then do something else, rather than changing the meaning of 'for'.
Moving responsibility for closing the file from 'with' to 'for', makes
'with' pretty useless, while overloading 'for' with something that is
rarely needed. This does not strike me as the right solution to the
problem.

> for document in read_newline_separated_json(path): # <-- outer for loop
>     ...

If the outer loop determines when the file should be closed, then why
not open it there? What fails with

try:
    lines = open(path)
    gen = read_newline_separated_json(lines)
    for doc in gen: do_something(doc)
finally:
    lines.close()
    # and/or gen.throw(...) to stop the generator.

--
Terry Jan Reedy

Nathaniel Smith

unread,
Oct 21, 2016, 2:04:13 AM10/21/16
to Paul Moore, python-ideas
Oh, goodness, no -- like Yury said, the use cases here are not
specific to async at all. I mean, none of the examples are async even
:-).

The motivation here is that prompt (non-GC-dependent) cleanup is a
good thing for a variety of reasons: determinism, portability across
Python implementations, proper exception propagation, etc. async does
add yet another entry to this list, but I don't think the basic principle is
controversial. 'with' blocks are a whole chunk of extra syntax that
were added to the language just for this use case. In fact 'with'
blocks weren't even needed for the functionality -- we already had
'try/finally', they just weren't ergonomic enough. This use case is so
important that it's had multiple rounds of syntax directed at it
before async/await was even a glimmer in C#'s eye :-).

BUT, currently, 'with' and 'try/finally' have a gap: if you use them
inside a generator (async or not, doesn't matter), then they often
fail at accomplishing their core purpose. Sure, they'll execute their
cleanup code whenever the generator is cleaned up, but there's no
ergonomic way to clean up the generator. Oops. I mean, you *could*
respond by saying "you should never use 'with' or 'try/finally' inside
a generator" and maybe add that as a rule to your style manual and
linter -- and some people in this thread have suggested more-or-less
that -- but that seems like a step backwards. This proposal instead
tries to solve the problem of making 'with'/'try/finally' work and be
ergonomic in general, and it should be evaluated on that basis, not on
the async/await stuff.

The reason I'm emphasizing async generators is that they effect the
timeline, not the motivation:

- PEP 525 actually does add async-only complexity to the language (the
new GC hooks). It doesn't affect non-async users, but it is still
complexity. And it's possible that if we have iterclose, then we don't
need the new GC hooks (though this is still an open discussion :-)).
If this is true, then now is the time to act, while reverting the GC
hooks change is still a possibility; otherwise, we risk the situation
where we add iterclose later, decide that the GC hooks no longer
provide enough additional value to justify their complexity... but
we're stuck with them anyway.

- For synchronous iteration, the need for a transition period means
that the iterclose proposal will take a few years to provide benefits.
For asynchronous iteration, it could potentially start providing
benefits much sooner -- but there's a very narrow window for that,
before people start using async generators and backwards compatibility
constraints kick in. If we delay a few months then we'll probably have
to delay a few years.

...that said, I guess there is one way that async/await directly
affected my motivation here, though it's not what you think :-).
async/await have gotten me experimenting with writing network servers,
and let me tell you, there is nothing that focuses the mind on
correctness and simplicity like trying to write a public-facing
asynchronous network server. You might think "oh well if you're trying
to do some fancy rocket science and this is a feature for rocket
scientists then that's irrelevant to me", but that's actually not what
I mean at all. The rocket science part is like, trying to run through
all possible execution orders of the different callbacks in your head,
or to mentally simulate what happens if a client shows up that writes
at 1 byte/second. When I'm trying to do that, then the last thing I
want is to be distracted by also trying to figure out boring mechanical
stuff like whether or not the language is actually going to execute my
'finally' block -- yet right now that's a question that actually
cannot be answered without auditing my whole source code! And that
boring mechanical stuff is still boring mechanical stuff when writing
less terrifying code -- it's just that I'm so used to wasting a
trickle of cognitive energy on this kind of thing that normally I
don't notice it so much.

And, also, regarding the "clumsy extra call": the preserve() call
isn't just arbitrary clumsiness -- it's a signal that hey, you're
turning off a safety feature. Now the language won't take care of this
cleanup for you, so it's your responsibility. Maybe you should think
about how you want to handle that. Of course your decision could be
"whatever, this is a one-off script, the GC is good enough". But it's
probably worth the ~0.5 seconds of thought to make that an active,
conscious decision, because they aren't all one-off scripts.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Nathaniel Smith

unread,
Oct 21, 2016, 2:38:18 AM10/21/16
to Terry Reedy, python...@python.org
On Wed, Oct 19, 2016 at 7:07 PM, Terry Reedy <tjr...@udel.edu> wrote:
> On 10/19/2016 12:38 AM, Nathaniel Smith wrote:
>
>> I'd like to propose that Python's iterator protocol be enhanced to add
>> a first-class notion of completion / cleanup.
>
>
> With respect the the standard iterator protocol, a very solid -1 from me.
> (I leave commenting specifically on __aiterclose__ to Yury.)
>
> 1. I consider the introduction of iterables and the new iterator protocol in
> 2.2 and their gradual replacement of lists in many situations to be the
> greatest enhancement to Python since 1.3 (my first version). They are, to
> me, they one of Python's greatest features and the minimal nature of the
> protocol an essential part of what makes them great.

Minimalism for its own sake isn't really a core Python value, and in
any case the minimalism ship has kinda sailed -- we effectively
already have send/throw/close as optional parts of the protocol
(they're most strongly associated with generators, but you're free to
add them to your own iterators and e.g. yield from will happily work
with that). This proposal is basically "we formalize and start
automatically calling the 'close' methods that are already there".

> 2. I think you greatly underestimate the negative impact, just as we did
> with changing str is bytes to str is unicode. The change itself, embodied
> in for loops, will break most non-trivial programs. You yourself note that
> there will have to be pervasive changes in the stdlib just to begin fixing
> the breakage.

The long-ish list of stdlib changes is about enabling the feature
everywhere, not about fixing backwards incompatibilities.

It's an important question though what programs will break and how
badly. To try and get a better handle on it I've been playing a bit
with an instrumented version of CPython that logs whenever the same
iterator is passed to multiple 'for' loops. I'll write up the results
in more detail, but the summary so far is that there seem to be ~8
places in the stdlib that would need preserve() calls added, and ~3 in
django. Maybe 2-3 hours and 1 hour of work respectively to fix?

It's not a perfect measure, and the cost certainly isn't zero, but
it's at a completely different order of magnitude than the str
changes. Among other things, this is a transition that allows for
gradual opt-in via a __future__, and fine-grained warnings pointing
you at what you need to fix, neither of which were possible for
str->unicode.

> 3. Though perhaps common for what you do, the need for the change is
> extremely rare in the overall Python world. Iterators depending on an
> external resource are rare (< 1%, I would think). Incomplete iteration is
> also rare (also < 1%, I think). And resources do not always need to be
> released immediately.

This could equally well be an argument that the change is fine -- e.g.
if you're always doing complete iteration, or just iterating over
lists and stuff, then it literally doesn't affect you at all either
way...

> 4. Previous proposals to officially augment the iterator protocol, even with
> optional methods, have been rejected, and I think this one should be too.
>
> a. Add .__len__ as an option. We added __length_hint__, which an iterator
> may implement, but which is not part of the iterator protocol. It is also
> ignored by bool().
>
> b., c. Add __bool__ and/or peek(). I posted a LookAhead wrapper class that
> implements both for most any iterable. I suspect that it is rarely used.
>
>
>> def read_newline_separated_json(path):
>>     with open(path) as file_handle: # <-- with block
>>         for line in file_handle:
>>             yield json.loads(line)
>
>
> One problem with passing paths around is that it makes the receiving
> function hard to test. I think functions should at least optionally take an
> iterable of lines, and make the open part optional. But then closing should
> also be conditional.

Sure, that's all true, but this is the problem with tiny documentation
examples :-). The point here was to explain the surprising interaction
between generators and with blocks in the simplest way, not to
demonstrate the ideal solution to the problem of reading
newline-separated JSON. Everything you want is still doable in a
post-__iterclose__ world -- in particular, if you do

for doc in read_newline_separated_json(lines_generator()):
    ...

then both iterators will be closed when the for loop exits. But if you
want to re-use the lines_generator, just write:

it = lines_generator()
for doc in read_newline_separated_json(preserve(it)):
    ...
for more_lines in it:
    ...

> If the combination of 'with', 'for', and 'yield' do not work together, then
> do something else, rather than changing the meaning of 'for'. Moving
> responsibility for closing the file from 'with' to 'for', makes 'with'
> pretty useless, while overloading 'for' with something that is rarely
> needed. This does not strike me as the right solution to the problem.
>
>> for document in read_newline_separated_json(path): # <-- outer for loop
>>     ...
>
>
> If the outer loop determines when the file should be closed, then why not
> open it there? What fails with
>
> try:
>     lines = open(path)
>     gen = read_newline_separated_json(lines)
>     for doc in gen: do_something(doc)
> finally:
>     lines.close()
>     # and/or gen.throw(...) to stop the generator.

Sure, that works in this trivial case, but they aren't all trivial
:-). See the example from my first email about a WSGI-like interface
where response handlers are generators: in that use case, your
suggestion that we avoid all resource management inside generators
would translate to: "webapps can't open files". (Or database
connections, proxy requests, ... or at least, can't hold them open
while streaming out response data.)

Or sticking to concrete examples, here's a toy-but-plausible generator
where the put-the-with-block-outside strategy seems rather difficult
to implement:

# Yields all lines in all files in 'directory' that contain the
# substring 'needle'
def recursive_grep(directory, needle):
    for dirpath, _, filenames in os.walk(directory):
        for filename in filenames:
            with open(os.path.join(dirpath, filename)) as file_handle:
                for line in file_handle:
                    if needle in line:
                        yield line
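
Under the proposal, even an early exit by the caller still promptly
closes whichever file happens to be open at that moment -- the 'break'
triggers __iterclose__ on the generator, which unwinds its 'with' block.
Roughly (directory and search string are just placeholders):

    # Stop at the first match; under the proposal the generator -- and
    # therefore the currently open file -- is closed as the loop exits.
    for line in recursive_grep("/some/directory", "needle"):
        print(line)
        break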

-n

--
Nathaniel J. Smith -- https://vorpus.org

Steven D'Aprano

unread,
Oct 21, 2016, 3:40:02 AM10/21/16
to python...@python.org
On Thu, Oct 20, 2016 at 11:03:11PM -0700, Nathaniel Smith wrote:

> The motivation here is that prompt (non-GC-dependent) cleanup is a
> good thing for a variety of reasons: determinism, portability across
> Python implementations, proper exception propagation, etc. async does
> add yet another entry to this list, but I don't think the basic principle is
> controversial.

Perhaps it should be.

The very first thing you say is "determinism". Hmmm. As we (or at least,
some of us) move towards more async code, more threads or multi-
processing, even another attempt to remove the GIL from CPython which
will allow people to use threads with less cost, how much should we
really value determinism? That's not a rhetorical question -- I don't
know the answer.

Portability across Pythons... if all Pythons performed exactly the same,
why would we need multiple implementations? The way I see it,
non-deterministic cleanup is the cost you pay for a non-reference
counting implementation, for those who care about the garbage collection
implementation. (And yes, ref counting is garbage collection.)


[...]
> 'with' blocks are a whole chunk of extra syntax that
> were added to the language just for this use case. In fact 'with'
> blocks weren't even needed for the functionality -- we already had
> 'try/finally', they just weren't ergonomic enough. This use case is so
> important that it's had multiple rounds of syntax directed at it
> before async/await was even a glimmer in C#'s eye :-).
>
> BUT, currently, 'with' and 'try/finally' have a gap: if you use them
> inside a generator (async or not, doesn't matter), then they often
> fail at accomplishing their core purpose. Sure, they'll execute their
> cleanup code whenever the generator is cleaned up, but there's no
> ergonomic way to clean up the generator. Oops.

How often is this *actually* a problem in practice?

On my system, I can open 1000+ files as a regular user. I can't even
comprehend opening a tenth of that as an ordinary application, although
I can imagine that if I were writing a server application things would
be different. But then I don't expect to write server applications in
quite the same way as I do quick scripts or regular user applications.

So it seems to me that a leaked file handler or two normally shouldn't
be a problem in practice. They'll be freed when the script or
application closes, and in the meantime, you have hundreds more
available. 90% of the time, using `with file` does exactly what we want,
and the times it doesn't (because we're writing a generator that isn't
closed promptly) 90% of those times it doesn't matter. So (it seems to
me) that you're talking about changing the behaviour of for-loops to
suit only a small proportion of cases: maybe 10% of 10%.

It is not uncommon to pass an iterator (such as a generator) through a
series of filters, each processing only part of the iterator:

it = generator()
header = collect_header(it)
body = collect_body(it)
tail = collect_tail(it)
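
If I understand the proposal correctly, keeping this working would mean
wrapping each intermediate pass in preserve() (spelled as in the draft;
the collect_* functions are just placeholders):

    it = generator()
    header = collect_header(preserve(it))   # don't let this pass close it
    body = collect_body(preserve(it))
    tail = collect_tail(it)                 # last consumer may close it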

Is it worth disrupting this standard idiom? I don't think so.



--
Steve

Neil Girdhar

unread,
Oct 21, 2016, 4:13:28 AM10/21/16
to python...@googlegroups.com
What's wrong with something like:

with make_cm() as cm:
    for i in cm:
        do_something(i)

You've justified that we need deterministic cleanup and you're right that that's what context managers are for.  It sounds like you want to have an implicit context manager around every iteration because people might forget to use one.  Is that right?

What's wrong with making it so that the return value of make_cm() (the context manager) is *not* iterable?  That way if someone tries:

for i in make_cm():
     # etc.

it won't work.  They are forced to make a context manager, and then they are guaranteed to clean everything up properly.  Is it just a question of compactness in the end?

Steven D'Aprano

unread,
Oct 21, 2016, 5:54:38 AM10/21/16
to python...@python.org
You know, I'm actually starting to lean towards this proposal and away
from my earlier objections...

On Wed, Oct 19, 2016 at 12:33:57PM -0700, Nathaniel Smith wrote:

> I should also say, regarding your specific example, I guess it's an
> open question whether we would want list_iterator.__iterclose__ to
> actually do anything. It could flip the iterator to a state where it
> always raises StopIteration,

That seems like the most obvious.

[...]
> The __iterclose__ contract is that you're not supposed
> to call __next__ afterwards, so there's no real rule about what
> happens if you do.

If I recall correctly, in your proposal you use language like "behaviour
is undefined". I don't like that language, because it sounds like
undefined behaviour in C, which is something to be avoided like the
plague. I hope I don't need to explain why, but for those who may not
understand the dangers of "undefined behaviour" as per the C standard,
you can start here:

https://randomascii.wordpress.com/2014/05/19/undefined-behavior-can-format-your-drive/

So let's make it clear that what we actually mean is not C-ish undefined
behaviour, where the compiler is free to open a portal to the Dungeon
Dimensions or use Guido's time machine to erase code that executes
before the undefined code:

https://blogs.msdn.microsoft.com/oldnewthing/20140627-00/?p=633/

but rather ordinary, standard "implementation-dependent behaviour". If
you call next() on a closed iterator, you'll get whatever the iterator
happens to do when it is closed. That will be *recommended* to raise
whatever error is appropriate to the iterator, but not enforced.

That makes it just like the part of the iterator protocol that says that
once an iterator raises StopIteration, it should always raise
StopIteration. Those that don't are officially called "broken", but they
are allowed and you can write one if you want to.

Shorter version:

- calling next() on a closed iterator is expected to be an error of
some sort, often a RuntimeError, but the iterator is free to use a
different error if that makes sense (e.g. closed files);

- if your own iterator classes break that convention, they will be
called "broken", but nobody will stop you from writing such "broken"
iterators.
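
As a purely illustrative sketch of the recommended convention (the class
and names are invented here, not taken from the proposal):

    class LineSource:
        # Raises once closed, as the convention recommends.
        def __init__(self, lines):
            self._lines = iter(lines)
            self._closed = False
        def __iter__(self):
            return self
        def __next__(self):
            if self._closed:
                raise RuntimeError("cannot iterate a closed LineSource")
            return next(self._lines)
        def __iterclose__(self):
            self._closed = True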



--
Steve

Paul Moore

unread,
Oct 21, 2016, 6:04:41 AM10/21/16
to Nathaniel Smith, python-ideas
On 21 October 2016 at 07:03, Nathaniel Smith <n...@pobox.com> wrote:
> Oh, goodness, no -- like Yury said, the use cases here are not
> specific to async at all. I mean, none of the examples are async even
> :-).
[...]

Ah I follow now. Sorry for the misunderstanding, I'd skimmed a bit
more than I realised I had.

However, it still feels to me that the code I currently write doesn't
need this feature, and I'm therefore unclear as to why it's
sufficiently important to warrant a backward compatibility break.

It's quite possible that I've never analysed my code well enough to
*notice* that there's a problem. Or that I rely on CPython's GC
behaviour without realising it. Also, it's honestly very rare that I
need deterministic cleanup, as opposed to guaranteed cleanup - running
out of file handles, for example, isn't really a problem I encounter.

But it's also possible that it's a code design difference. You use the
example (from memory, sorry if this is slightly different to what you
wrote):

def filegen(filename):
    with open(filename) as f:
        for line in f:
            yield line

# caller
for line in filegen(name):
    ...

I wouldn't normally write a function like that - I'd factor it
differently, with the generator taking an open file (or file-like
object) and the caller opening the file:

def filegen(fd):
    for line in fd:
        yield line

# caller
with open(filename) as fd:
    for line in filegen(fd):
        ...

With that pattern, there's no issue. And the filegen function is more
generic, as it can be used with *any* file-like object (a StringIO,
for testing, for example).

> And, also, regarding the "clumsy extra call": the preserve() call
> isn't just arbitrary clumsiness -- it's a signal that hey, you're
> turning off a safety feature. Now the language won't take care of this
> cleanup for you, so it's your responsibility. Maybe you should think
> about how you want to handle that. Of course your decision could be
> "whatever, this is a one-off script, the GC is good enough". But it's
> probably worth the ~0.5 seconds of thought to make that an active,
> conscious decision, because they aren't all one-off scripts.

Well, if preserve() did mean just that, then that would be OK. I'd
never use it, as I don't care about deterministic cleanup, so it makes
no difference to me if it's on or off.

But that's not the case - in fact, preserve() means "give me the old
Python 3.5 behaviour", and (because deterministic cleanup isn't
important to me) that's a vague and unclear distinction. So I don't
know whether my code is affected by the behaviour change and I have to
guess at whether I need preserve().

What I think is needed here is a clear explanation of how this
proposal affects existing code that *doesn't* need or care about
cleanup. The example that's been mentioned is

with open(filename) as f:
    for line in f:
        if is_end_of_header(line): break
        process_header(line)

    for line in f:
        process_body(line)

and similar code that relies on being able to part-process an iterator
in a for loop, and then have a later loop pick up where the first left
off.

Most users of iterators and generators probably have little
understanding of GeneratorExit, closing generators, etc. And that's a
good thing - it's why iterators in Python are so useful. So the
proposal needs to explain how it impacts that sort of user, in terms
that they understand. It's a real pity that the explanation isn't "you
can ignore all of this, as you aren't affected by the problem it's
trying to solve" - that's what I was getting at.

At the moment, the take home message for such users feels like it's
"you might need to scatter preserve() around your code, to avoid the
behaviour change described above, which you glazed over because it
talked about all that coroutiney stuff you don't understand" :-)

Paul

Paul

Paul Moore

unread,
Oct 21, 2016, 6:08:36 AM10/21/16
to Steven D'Aprano, Python-Ideas
On 21 October 2016 at 10:53, Steven D'Aprano <st...@pearwood.info> wrote:
> On Wed, Oct 19, 2016 at 12:33:57PM -0700, Nathaniel Smith wrote:
>
>> I should also say, regarding your specific example, I guess it's an
>> open question whether we would want list_iterator.__iterclose__ to
>> actually do anything. It could flip the iterator to a state where it
>> always raises StopIteration,
>
> That seems like the most obvious.

So - does this mean "unless you understand what preserve() does,
you're OK to not use it and your code will continue to work as
before"? If so, then I'd be happy with this.

But I genuinely don't know (without going rummaging through docs) what
that statement means in any practical sense.
Paul

Steven D'Aprano

unread,
Oct 21, 2016, 6:29:54 AM10/21/16
to python...@python.org
On Wed, Oct 19, 2016 at 05:52:34PM -0400, Yury Selivanov wrote:

> IOW I'm not convinced that if we implement your proposal we'll fix 90%
> (or even 30%) of cases where non-deterministic and postponed cleanup is
> harmful.

Just because something doesn't solve ALL problems doesn't mean it isn't
worth doing. Reference counting doesn't solve the problem of cycles, but
Python worked really well for many years even though cycles weren't
automatically broken. Then a second GC was added, but it didn't solve
the problem of cycles with __del__ finalizers. And recently (a year or
two ago) there was an improvement that made the GC better able to deal
with such cases -- but I expect that there are still edge cases where
objects aren't collected.

Had people said "garbage collection doesn't solve all the edge cases,
therefore its not worth doing" where would we be?

I don't know how big a problem the current lack of deterministic GC
of resources opened in generators actually is. I guess that users of
CPython will have *no idea*, because most of the time the ref counter
will cleanup quite early. But not all Pythons are CPython, and despite
my earlier post, I now think I've changed my mind and support this
proposal.

One reason for this is that I thought hard about my own code where I use
the double-for-loop idiom:

for x in iterator:
    if cond: break
    ...

# later
for y in iterator: # same iterator
    ...


and I realised:

(1) I don't do this *that* often;
(2) when I do, it really wouldn't be that big a problem for me to
guard against auto-closing:

for x in protect(iterator):
    if cond: break
    ...

(3) if I need to write hybrid code that runs over multiple versions,
that's easy too:

try:
    from itertools import protect
except ImportError:
    def protect(it):
        return it



> Yes, mainly iterator wrappers. You'll also need to educate users
> to refactor (more on that below) their __del__ methods to
> __(a)iterclose__ in 3.6.

Couldn't __(a)iterclose__ automatically call __del__ if it exists? Seems
like a reasonable thing to inherit from object.


> A lot of code that you find on stackoverflow etc will be broken.

"A lot"? Or a little? Are you guessing, or did you actually count it?

If we are worried about code like this:


it = iter([1, 2, 3])
a = list(it)
# currently b will be [], with this proposal it will raise RuntimeError
b = list(it)


we can soften the proposal's recommendation that iterators raise
RuntimeError on calling next() when they are closed. I've suggested that
"whatever exception makes sense" should be the rule. Iterators with no
resources to close can simply raise StopIteration instead. That will
preserve the current behaviour.


> Porting
> code from Python2/<3.6 will be challenging. People are still struggling
> to understand 'dict.keys()'-like views in Python 3.

I spend a lot of time on the tutor and python-list mailing lists, and a
little bit of time on Reddit /python, and I don't think I've ever seen
anyone struggle with those. I'm sure it happens, but I don't think it
happens often. After all, for the most common use-case, there's no real
difference between Python 2 and 3:

for key, value in mydict.items():
    ...


[...]
> With your proposal, to achieve the same (and make the code compatible
> with new for-loop semantics), users will have to implement both
> __iterclose__ and __del__.

As I ask above, couldn't we just inherit a default __(a)iterclose__ from
object that looks like this?

def __iterclose__(self):
    finalizer = getattr(type(self), '__del__', None)
    if finalizer:
        finalizer(self)


I know it looks a bit funny for non-iterables to have an iterclose
method, but they'll never actually be called.


[...]
> The __(a)iterclose__ semantics is clear. What's not clear is how much
> harm changing the semantics of for-loops will do (and how to quantify
> the amount of good :))


The "easy" way to find out (easy for those who aren't volunteering to do
the work) is to fork Python, make the change, and see what breaks. I
suspect not much, and most of the breakage will be easy to fix.

As for the amount of good, this proposal originally came from PyPy. I
expect that CPython users won't appreciate it as much as PyPy users, and
Jython/IronPython users when they eventually support Python 3.x.



--
Steve

Steven D'Aprano

unread,
Oct 21, 2016, 7:24:45 AM10/21/16
to python...@python.org
On Fri, Oct 21, 2016 at 11:03:51AM +0100, Paul Moore wrote:

> At the moment, the take home message for such users feels like it's
> "you might need to scatter preserve() around your code, to avoid the
> behaviour change described above, which you glazed over because it
> talked about all that coroutiney stuff you don't understand" :-)

I now believe that's not necessarily the case. I think that the message
should be:

- If your iterator class has a __del__ or close method, then you need
to read up on __(a)iterclose__.

- If you iterate over open files twice, then all you need to remember is
that the file will be closed when you exit the first loop. To avoid
that auto-closing behaviour, use itertools.preserve().

- Iterating over lists, strings, tuples, dicts, etc. won't change, since
they don't have __del__ or close() methods.


I think that covers all the cases the average Python code will care
about.



--
Steve

Steven D'Aprano

unread,
Oct 21, 2016, 7:49:56 AM10/21/16
to python...@python.org
On Fri, Oct 21, 2016 at 11:07:46AM +0100, Paul Moore wrote:
> On 21 October 2016 at 10:53, Steven D'Aprano <st...@pearwood.info> wrote:
> > On Wed, Oct 19, 2016 at 12:33:57PM -0700, Nathaniel Smith wrote:
> >
> >> I should also say, regarding your specific example, I guess it's an
> >> open question whether we would want list_iterator.__iterclose__ to
> >> actually do anything. It could flip the iterator to a state where it
> >> always raises StopIteration,
> >
> > That seems like the most obvious.

I've changed my mind -- I think maybe it should do nothing, and preserve
the current behaviour of lists.

I'm now more concerned with keeping current behaviour as much as
possible than creating some sort of consistent error condition for all
iterators. Consistency is over-rated, and we already have inconsistency
here: file iterators behave differently from list iterators, because
they can be closed:


py> f = open('/proc/mdstat', 'r')
py> a = list(f)
py> b = list(f)
py> len(a), len(b)
(20, 0)
py> f.close()
py> c = list(f)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.

We don't need to add a close() to list iterators just so they are
consistent with files. Just let __iterclose__ be a no-op.


> So - does this mean "unless you understand what preserve() does,
> you're OK to not use it and your code will continue to work as
> before"? If so, then I'd be happy with this.

Almost.

Code like this will behave exactly the same as it currently does:

for x in it:
    process(x)

y = list(it)

If it is a file object, the second call to list() will raise ValueError;
if it is a list_iterator, or generator, etc., y will be an empty list.
That part (I think) shouldn't change.


What *will* change is code that partially processes the iterator in two
different places. A simple example:

py> it = iter([1, 2, 3, 4, 5, 6])
py> for x in it:
...     if x == 4: break
...
py> for x in it:
...     print(x)
...
5
6


This *may* change. With this proposal, the first loop will "close" the
iterator when you exit from the loop. For a list, there's no finaliser,
no __del__ to call, so we can keep the current behaviour and nobody will
notice any difference.

But if `it` is a file iterator instead of a list iterator, the file will
be closed when you exit the first for-loop, and the second loop will
raise ValueError. That will be different.

The fix here is simple: protect the first call from closing:

for x in itertools.preserve(it): # preserve, protect, whatever
    ...


Or, if `it` is your own class, give it a __iterclose__ method that does
nothing.


This is a backwards-incompatible change, so I think we would need to do
this:

(1) In Python 3.7, we introduce a __future__ directive:

from __future__ import iterclose

to enable the new behaviour. (Remember, future directives apply on a
module-by-module basis.)

(2) Without the directive, we keep the old behaviour, except that
warnings are raised if something will change.

(3) Then in 3.8 iterclose becomes the default, the warnings go away, and
the new behaviour just happens.


If that's too fast for people, we could slow it down:

(1) Add the future directive to Python 3.7;

(2) but no warnings by default (you have to opt-in to the
warnings with an environment variable, or command-line switch).

(3) Then in 3.8 the warnings are on by default;

(4) And the iterclose behaviour doesn't become standard until 3.9.


That means if this change worries you, you can ignore it until you
migrate to 3.8 (which won't be production-ready until about 2020 or so),
and don't have to migrate your code until 3.9, which will be a year or
two later. But early adopters can start targetting the new functionality
from 3.7 if they like.

I don't think there's any need for a __future__ directive for
aiterclose, since there's not enough backwards-incompatibility to care
about. (I think, but don't mind if people disagree.) That can happen
starting in 3.7, and when people complain that their synchronous
generators don't have deterministic garbage collection like their
asynchronous ones do, we can point them at the future directive.

Bottom line is: at first I thought this was a scary change that would
break too much code. But now I think it won't break much, and we can
ease into it really slowly over two or three releases. So I think that
the cost is probably low. I'm still not sure on how great the benefit
will be, but I'm leaning towards a +1 on this.



--
Steve

Paul Moore

unread,
Oct 21, 2016, 9:36:10 AM10/21/16
to Steven D'Aprano, Python-Ideas
On 21 October 2016 at 12:23, Steven D'Aprano <st...@pearwood.info> wrote:
> On Fri, Oct 21, 2016 at 11:03:51AM +0100, Paul Moore wrote:
>
>> At the moment, the take home message for such users feels like it's
>> "you might need to scatter preserve() around your code, to avoid the
>> behaviour change described above, which you glazed over because it
>> talked about all that coroutiney stuff you don't understand" :-)
>
> I now believe that's not necessarily the case. I think that the message
> should be:
>
> - If your iterator class has a __del__ or close method, then you need
> to read up on __(a)iterclose__.
>
> - If you iterate over open files twice, then all you need to remember is
> that the file will be closed when you exit the first loop. To avoid
> that auto-closing behaviour, use itertools.preserve().
>
> - Iterating over lists, strings, tuples, dicts, etc. won't change, since
> they don't have __del__ or close() methods.
>
>
> I think that covers all the cases the average Python code will care
> about.

OK, that's certainly a lot less scary.

Some thoughts, remain, though:

1. You mention files. Presumably (otherwise what would be the point of
the change?) there will be other iterables that change similarly.
There's no easy way to know in advance.
2. Cleanup protocols for iterators are pretty messy now - __del__,
close, __iterclose__, __aiterclose__. What's the chance 3rd party
implementers get something wrong?
3. What about generators? If you write your own generator, you don't
control the cleanup code. The example:

def mygen(name):
    with open(name) as f:
        for line in f:
            yield line

is a good example - don't users of this generator need to use
preserve() in order to be able to do partial iteration? And yet how
would the writer of the generator know to document this? And if it
isn't documented, how does the user of the generator know preserve is
needed?

My feeling is that this proposal is a relatively significant amount of
language churn, to solve a relatively niche problem, and furthermore
one that is actually only a problem to non-CPython implementations[1].
My instincts are that we need to back off on the level of such change,
to give users a chance to catch their breath. We're not at the level
of where we need something like the language change moratorium (PEP
3003) but I don't think it would do any harm to give users a chance to
catch their breath after the wave of recent big changes (async,
typing, path protocol, f-strings, funky unpacking, Windows build and
installer changes, ...).

To put this change in perspective - we've lived without it for many
years now, can we not wait a little while longer?

From another message:
> Bottom line is: at first I thought this was a scary change that would
> break too much code. But now I think it won't break much, and we can
> ease into it really slowly over two or three releases. So I think that
> the cost is probably low. I'm still not sure on how great the benefit
> will be, but I'm leaning towards a +1 on this.

And yet, it still seems to me that it's going to force me to change
(maybe not much, but some of) my existing code, for absolutely zero
direct benefit, as I don't personally use or support PyPy or any other
non-CPython implementations. Don't forget that PyPy still doesn't even
implement Python 3.5 - so no-one benefits from this change until PyPy
supports Python 3.8, or whatever version this becomes the default in.
It's very easy to misuse an argument like this to block *any* sort of
change, and that's not my intention here - but I am trying to
understand what the real-world issue is here, and how (and when!) this
proposal would allow people to write code to fix that problem. At the
moment, it feels like:

* The problem is file handle leaks in code running under PyPy
* The ability to fix this will come in around 4 years (random guess
as to when PyPy implements Python 3.8, plus an assumption that the
code needing to be fixed can immediately abandon support for all
earlier versions of PyPy).

Any other cases seem to me to be theoretical at the moment. Am I being
unfair in this assessment? (It feels like I might be, but I can't be
sure how).

Paul

[1] As I understand it. CPython's refcounting GC makes this a
non-issue, correct?

Yury Selivanov

unread,
Oct 21, 2016, 11:09:52 AM10/21/16
to python...@python.org


On 2016-10-21 6:29 AM, Steven D'Aprano wrote:
> On Wed, Oct 19, 2016 at 05:52:34PM -0400, Yury Selivanov wrote:
[..]
>> With your proposal, to achieve the same (and make the code compatible
>> with new for-loop semantics), users will have to implement both
>> __iterclose__ and __del__.
> As I ask above, couldn't we just inherit a default __(a)iterclose__ from
> object that looks like this?
>
> def __iterclose__(self):
> finalizer = getattr(type(self), '__del__', None)
> if finalizer:
> finalizer(self)
>
>
> I know it looks a bit funny for non-iterables to have an iterclose
> method, but they'll never actually be called.

No, we can't call __del__ from __iterclose__. Otherwise we'd
break even more code than this proposal already breaks:


for i in iter:
    ...
iter.something() # <- this would be called after iter.__del__()

[..]
> As for the amount of good, this proposal originally came from PyPy. I
> expect that CPython users won't appreciate it as much as PyPy users, and
> Jython/IronPython users when they eventually support Python 3.x.

AFAIK the proposal came "for" PyPy, not "from". And the
issues Nathaniel tries to solve do also exist in CPython. It's
only a question of whether changing the 'for' statement and iteration
protocol is worth the trouble.

Yury

Yury Selivanov

unread,
Oct 21, 2016, 11:16:16 AM10/21/16
to python...@python.org


On 2016-10-21 7:13 AM, Steven D'Aprano wrote:
> Consistency is over-rated, and we already have inconsistency
> here: file iterators behave differently from list iterators, because
> they can be closed:

This is **very** arguable :)

Yury

Gustavo Carneiro

unread,
Oct 21, 2016, 11:20:22 AM10/21/16
to Python-Ideas
Personally, I hadn't realised we had this problem in asyncio until now.

Does this problem happen in asyncio at all?  Or does asyncio somehow work around it by making sure to always explicitly destroy the frames of all coroutine objects, as long as someone waits on each task?
--
Gustavo J. A. M. Carneiro
Gambit Research
"The universe is always one step beyond logic." -- Frank Herbert

Ronan Lamy

unread,
Oct 21, 2016, 4:29:03 PM10/21/16
to python...@python.org
Le 21/10/16 à 14:35, Paul Moore a écrit :
>
> [1] As I understand it. CPython's refcounting GC makes this a
> non-issue, correct?

Wrong. Any guarantee that you think the CPython GC provides goes out of
the window as soon as you have a reference cycle. Refcounting does not
actually make GC deterministic, it merely hides the problem away from view.

For instance, on CPython 3.5, running this code:

#%%%%%%%%%

class some_resource:
    def __enter__(self):
        print("Open resource")
        return 42

    def __exit__(self, *args):
        print("Close resource")

def some_iterator():
    with some_resource() as s:
        yield s

def main():
    it = some_iterator()
    for i in it:
        if i == 42:
            print("The answer is", i)
            break
    print("End loop")

    # later ...
    try:
        1/0
    except ZeroDivisionError as e:
        exc = e

main()
print("Exit")

#%%%%%%%%%%

produces:

Open resource
The answer is 42
End loop
Exit
Close resource

What happens is that 'exc' holds a cyclic reference back to the main()
frame, which prevents it from being destroyed when the function exits,
and that frame, in turn, holds a reference to the iterator, via the
local variable 'it'. And so, the iterator remains alive, and the
resource unclosed, until the next garbage collection.
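
To see when that cleanup actually happens, one can force a collection
after main() returns -- just a demonstration of the timing, not a
suggested fix:

    import gc

    main()
    print("Exit")
    gc.collect()       # the frame/exception cycle is collected only here,
                       # so "Close resource" is printed at this point
    print("Collected")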

Yury Selivanov

unread,
Oct 21, 2016, 4:53:32 PM10/21/16
to python...@python.org


On 2016-10-21 11:19 AM, Gustavo Carneiro wrote:
> Personally, I hadn't realised we had this problem in asyncio until now.
>
> Does this problem happen in asyncio at all? Or does asyncio somehow work
> around it by making sure to always explicitly destroy the frames of all
> coroutine objects, as long as someone waits on each task?

No, I think asyncio code is free of the problem this proposal
is trying to address.

We might have some "problem" in 3.6 when people start using
async generators more often. But I think it's important for us
to teach people to manage the associated resources from the
outside of the generator (i.e. don't put 'async with' or 'with'
inside the generator's body; instead, wrap the code that uses
the generator with 'async with' or 'with').

Yury

Chris Barker

unread,
Oct 21, 2016, 5:01:07 PM10/21/16
to Steven D'Aprano, Python-Ideas
On Fri, Oct 21, 2016 at 12:12 AM, Steven D'Aprano <st...@pearwood.info> wrote:
Portability across Pythons... if all Pythons performed exactly the same,
why would we need multiple implementations? The way I see it,
non-deterministic cleanup is the cost you pay for a non-reference
counting implementation, for those who care about the garbage collection
implementation. (And yes, ref counting is garbage collection.)

Hmm -- and yet "with" was added, and I can't imagine that its largest use-case is with ( ;-) ) open:

with open(filename, mode) as my_file:
    ....
    ....

And yet for years I happily counted on reference counting to close my files, and was particularly happy with:

data = open(filename, mode).read()

I really liked that that file got opened, read, and closed and cleaned up right off the bat.

And then context managers were introduced. And it seems to me there is a consensus in the Python community that we all should be using them when working on files, and I myself have finally started routinely using them, and teaching newbies to use them -- which is kind of a pain, 'cause I want to have them do basic file reading stuff before I explain what a "context manager" is.

Anyway, my point is that the broader Python community really has been pretty consistent about making it easy to write code that will work the same way (maybe not with the same performance) across Python implementations. And specifically with deterministic resource management.

On my system, I can open 1000+ files as a regular user. I can't even
comprehend opening a tenth of that as an ordinary application, although
I can imagine that if I were writing a server application things would
be different.

Well, what you can imagine isn't really the point -- I've bumped into that darn open file limit in my work, which was not a server application (though it was some pretty serious number crunching...). And I'm sure I'm not alone. OK, to be fair that was a poorly designed library, not an issue with determinism of resource management (though designing the lib well WOULD depend on that)

But then I don't expect to write server applications in
quite the same way as I do quick scripts or regular user applications.

Though data analysts DO write "quick scripts" that might need to do things like access 100s of files...
 
So it seems to me that a leaked file handler or two normally shouldn't
be a problem in practice. They'll be freed when the script or
application closes, and in the meantime, you have hundreds more
available. 90% of the time, using `with file` does exactly what we want,
and the times it doesn't (because we're writing a generator that isn't
closed promptly) 90% of those times it doesn't matter.

that was the case with "with file" from the beginning -- particularly on cPython. And yet we all thought it was a great idea.
 
So (it seems to
me) that you're talking about changing the behaviour of for-loops to
suit only a small proportion of cases: maybe 10% of 10%.

I don't see what the big overhead is here. for loops would get a new feature, but it would only be used by the objects that chose to implement it. So no huge change.

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Paul Moore

unread,
Oct 21, 2016, 6:54:49 PM10/21/16
to Chris Barker, Python-Ideas
On 21 October 2016 at 21:59, Chris Barker <chris....@noaa.gov> wrote:
>> So (it seems to
>> me) that you're talking about changing the behaviour of for-loops to
>> suit only a small proportion of cases: maybe 10% of 10%.
>
>
> I don't see what the big overhead is here. for loops would get a new
> feature, but it would only be used by the objects that chose to implement
> it. So no huge change.

But the point is that the feature *would* affect people who don't need
it. That's what I'm struggling to understand. I keep hearing "most
code won't be affected", but then discussions about how we ensure that
people are warned of where they need to add preserve() to their
existing code to get the behaviour they already have. (And, of course,
they need to add an "if we're on older pythons, define a no-op version
of preserve() backward compatibility wrapper if they want their code
to work cross version). I genuinely expect preserve() to pretty much
instantly appear on people's lists of "python warts", and that bothers
me.

But I'm reaching the point where I'm just saying the same things over
and over, so I'll bow out of this discussion now. I remain confused,
but I'm going to have to trust that the people who have got a handle
on the issue have understood the point I'm making, and have it
covered.

Paul

Amit Green

unread,
Oct 21, 2016, 10:34:17 PM10/21/16
to Nathaniel Smith, Yury Selivanov, python...@python.org
NOTE: This is my first post to this mailing list, I'm not really sure
      how to post a message, so I'm attempting a reply-all.

I like Nathaniel's idea for __iterclose__.

I suggest the following changes to deal with a few of the complex issues
he discussed.

1.  Missing __iterclose__, or a value of None, works as before,
    no changes.

2.  An iterator can be used in one of three ways:

    A. 'for' loop, which will call __iterclose__ when it exits

    B.  User controlled, in which case the user is responsible to use the
        iterator inside a with statement.

    C.  Old style.  The user is responsible for calling __iterclose__

3.  An iterator keeps track of __iter__ calls; this allows it to know
    when to clean up.


The two key additions, above, are:

    #2B. User can use an iterator with __enter__ & __exit__ cleanly.

    #3.  By tracking __iter__ calls, it makes complex user cases easier
         to handle.

Specification
=============

An iterator may implement the following method: __iterclose__.  A missing
method, or a value of None is allowed.

When the user wants to control the iterator, the user is expected to
use the iterator with a with clause.


The core proposal is the change in behavior of ``for`` loops. Given this
Python code:

  for VAR in ITERABLE:
      LOOP-BODY
  else:
      ELSE-BODY

we desugar to the equivalent of:

  _iter = iter(ITERABLE)
  _iterclose = getattr(_iter, '__iterclose__', None)

  if _iterclose is None:

      traditional-for VAR in _iter:
         LOOP-BODY
      else:
         ELSE-BODY
  else:
     _stop_exception_seen = False
     try:

         traditional-for VAR in _iter:
             LOOP-BODY
         else:
             _stop_exception_seen = True
             ELSE-BODY
     finally:
        if not _stop_exception_seen:
            _iterclose(_iter)

The test for None allows us to skip the setup of a try/finally clause.

Also we don't bother to call __iterclose__ if the iterator threw
StopIteration at us.


Modifications to basic iterator types
=====================================

An iterator will implement something like the following:

  _cleanup       - Private function, does the following:

                        _enter_count = _iter_count = -1

                        Do any necessary cleanup, release resources, etc.

                   NOTE: Is also called internally by the iterator,
                   before throwing StopIteration

  _iter_count    - Private value, starts at 0.

  _enter_count   - Private value, starts at 0.

  __iter__       - if _iter_count >= 0:
                       _iter_count += 1

                   return self

  __iterclose__  - if _iter_count is 0:
                       if _enter_count is 0:
                           _cleanup()
                   elif _iter_count > 0:
                       _iter_count -= 1

  __enter__      - if _enter_count >= 0:
                       _enter_count += 1

                   Return itself.

  __exit__       - if _enter_count is > 0
                       _enter_count -= 1

                       if _enter_count is _iter_count is 0:
                            _cleanup()

The suggestions on _iter_count & _enter_count are just an example; internal
details can differ (and would need better error handling).


Examples:
=========

NOTE: Examples are given using xrange() or [1, 2, 3, 4, 5, 6, 7] for
      simplicity.  For real use, the iterator would have resources such
      as open files it needs to close on cleanup.


1.  Simple example:

        for v in xrange(7):
            print v

    Creates an iterator with an _iter_count of 0.  The iterator exits
    normally (by throwing StopIteration), so we don't bother to call
    __iterclose__.


2.  Break example:

        for v in [1, 2, 3, 4, 5, 6, 7]:
            print v

            if v == 3:
                break

    Creates an iterator with an _iter_count of 0.

    The iterator exits after generating 3 numbers; we then call
    __iterclose__ & the iterator does any necessary cleanup.

3.  Convert example #2 to print the next value:

        with iter([1, 2, 3, 4, 5, 6, 7]) as seven:
            for v in seven:
                print v

                if v == 3:
                    break

            print 'Next value is: ', seven.next()

    This will print:

            1
            2
            3
            Next value is: 4

    How this works:

        1.  We create an iterator named seven (by calling list.__iter__).

        2.  We call seven.__enter__

        3.  The for loop calls: seven.next() 3 times, and then calls:
            seven.__iterclose__

            Since the _enter_count is 1, the iterator does not do
            cleanup yet.

        4.  We call seven.next()

        5.  We call seven.__exit__.  The iterator does its cleanup now.

4.  More complicated example:

        with iter([1, 2, 3, 4, 5, 6, 7]) as seven:
            for v in seven:
                print v

                if v == 1:
                    for v in seven:
                        print 'stolen: ', v

                        if v == 3:
                            break

                if v == 5:
                    break

            for v in seven:
                print v * v

    This will print:

        1
        stolen: 2
        stolen: 3
        4
        5
        36
        49

    How this works:

        1.  Same as #3 above, cleanup is done by the __exit__

5.  Alternate way of doing #4.

        seven = iter([1, 2, 3, 4, 5, 6, 7])

        for v in seven:
            print v

            if v == 1:
                for v in seven:
                    print 'stolen: ', v

                    if v == 3:
                        break

            if v == 5:
                break

        for v in seven:
            print v * v
            break           #   Different from #4

        seven.__iterclose__()

    This will print:

        1
        stolen: 2
        stolen: 3
        4
        5
        36

    How this works:

        1.  We create an iterator named seven.

        2.  The for loops all call seven.__iter__, causing _iter_count
            to increment.

        3.  The for loops all call seven.__iterclose__ on exit, decrement
            _iter_count.

        4.  The user calls the final __iterclose__, which closes the
            iterator.

    NOTE:
        Method #5 is NOT recommended, the 'with' syntax is better.

        However, something like itertools.zip could call __iterclose__
        during cleanup


Change to iterators
===================

All Python iterators would need to add __iterclose__ (possibly with a
value of None), __enter__, & __exit__.

Third-party iterators that do not implement __iterclose__ cannot be
used in a with clause.  A new function could be added to itertools,
something like:

    with itertools.with_wrapper(third_party_iterator) as x:
        ...

The 'with_wrapper' would attempt to call __iterclose__ when its __exit__
function is called.
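
A minimal sketch of what such a wrapper could look like (assuming the
rest of this proposal; 'with_wrapper' is just the placeholder name used
above):

    from contextlib import contextmanager

    @contextmanager
    def with_wrapper(it):
        # Hand the iterator through unchanged, and call __iterclose__
        # (if present) when the 'with' block exits.
        try:
            yield it
        finally:
            iterclose = getattr(it, '__iterclose__', None)
            if iterclose is not None:
                iterclose()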


Ethan Furman

unread,
Oct 21, 2016, 11:21:17 PM10/21/16
to python...@python.org
On 10/21/2016 03:48 PM, Amit Green wrote:

> NOTE: This is my first post to this mailing list, I'm not really sure
> how to post a message, so I'm attempting a reply-all.

Seems to have worked! :)

> I like Nathaniel's idea for __iterclose__.
>
> I suggest the following changes to deal with a few of the complex issues
> he discussed.

Your examples are interesting, but they don't seem to address the issue of closing down for loops that are using generators when those loops exit early:

-----------------------------
def some_work():
    with some_resource() as resource:
        for widget in resource:
            yield widget


for pane in some_work():
    break

# what happens here?
-----------------------------

How does your solution deal with that situation? Or are you saying that this would be closed with your modifications, and if I didn't want the generator to be closed I would have to do:

-----------------------------
with some_work() as temp_gen:
    for pane in temp_gen:
        break

    for another_pane in temp_gen:
        ...  # temp_gen is still alive here
-----------------------------

In other words, instead of using the preserve() function, we would use a with statement?

--
~Ethan~

Nathaniel Smith

unread,
Oct 21, 2016, 11:46:36 PM10/21/16
to Amit Green, Yury Selivanov, python...@python.org
On Fri, Oct 21, 2016 at 3:48 PM, Amit Green <amit....@gmail.com> wrote:
> NOTE: This is my first post to this mailing list, I'm not really sure
> how to post a message, so I'm attempting a reply-all.
>
> I like Nathaniel's idea for __iterclose__.
>
> I suggest the following changes to deal with a few of the complex issues
> he discussed.
>
> 1. Missing __iterclose__, or a value of none, works as before,
> no changes.
>
> 2. An iterator can be used in one of three ways:
>
> A. 'for' loop, which will call __iterclose__ when it exits
>
> B. User controlled, in which case the user is responsible to use the
> iterator inside a with statement.
>
> C. Old style. The user is responsible for calling __iterclose__
>
> 3. An iterator keeps track of __iter__ calls, this allows it to know
> when to cleanup.
>
>
> The two key additions, above, are:
>
> #2B. User can use iterator with __enter__ & __exit cleanly.
>
> #3. By tracking __iter__ calls, it makes complex user cases easier
> to handle.

These are interesting ideas! A few general comments:

- I don't think we want the "don't bother to call __iterclose__ on
exhaustion" functionality --it's actually useful to be able to
distinguish between

# closes file_handle
for line in file_handle:
    ...

and

# leaves file_handle open
for line in preserve(file_handle):
    ...

To be able to distinguish these cases, it's important that the 'for'
loop always call __iterclose__ (which preserve() might then cancel
out).

- I think it'd be practically difficult and maybe too much magic to
add __enter__/__exit__/nesting-depth counts to every iterator
implementation. But, the idea of using a context manager for repeated
partial iteration is a great idea :-). How's this for a simplified
version that still covers the main use cases?

@contextmanager
def reuse_then_close(it):  # TODO: come up with a better name
    it = iter(it)
    try:
        yield preserve(it)
    finally:
        iterclose(it)

with itertools.reuse_then_close(some_generator(...)) as it:
    for obj in it:
        ...
    # still open here, because our reference to the iterator is
    # wrapped in preserve(...)
    for obj in it:
        ...
# but then closed here, by the 'with' block

-n

Nathaniel Smith

unread,
Oct 22, 2016, 12:26:21 AM10/22/16
to Steven D'Aprano, python...@python.org
On Fri, Oct 21, 2016 at 3:29 AM, Steven D'Aprano <st...@pearwood.info> wrote:
> As for the amount of good, this proposal originally came from PyPy.

Just to be clear, I'm not a PyPy dev, and the PyPy devs' contribution
here was mostly to look over a draft I circulated and to agree that it
seemed like something that'd be useful to them.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Nick Coghlan

unread,
Oct 22, 2016, 12:08:41 PM10/22/16
to Nathaniel Smith, python...@python.org
On 20 October 2016 at 07:02, Nathaniel Smith <n...@pobox.com> wrote:
> The first change is to replace the outer for loop with a while/pop
> loop, so that if an exception occurs we'll know which iterables remain
> to be processed:
>
> def chain(*iterables):
>     try:
>         while iterables:
>             for element in iterables.pop(0):
>                 yield element
>     ...
>
> Now, what do we do if an exception does occur? We need to call
> iterclose on all of the remaining iterables, but the tricky bit is
> that this might itself raise new exceptions. If this happens, we don't
> want to abort early; instead, we want to continue until we've closed
> all the iterables, and then raise a chained exception. Basically what
> we want is:
>
> def chain(*iterables):
>     try:
>         while iterables:
>             for element in iterables.pop(0):
>                 yield element
>     finally:
>         try:
>             operators.iterclose(iter(iterables[0]))
>         finally:
>             try:
>                 operators.iterclose(iter(iterables[1]))
>             finally:
>                 try:
>                     operators.iterclose(iter(iterables[2]))
>                 finally:
>                     ...
>
> but of course that's not valid syntax. Fortunately, it's not too hard
> to rewrite that into real Python -- but it's a little dense:
>
> def chain(*iterables):
>     try:
>         while iterables:
>             for element in iterables.pop(0):
>                 yield element
>     # This is equivalent to the nested-finally chain above:
>     except BaseException as last_exc:
>         for iterable in iterables:
>             try:
>                 operators.iterclose(iter(iterable))
>             except BaseException as new_exc:
>                 if new_exc.__context__ is None:
>                     new_exc.__context__ = last_exc
>                 last_exc = new_exc
>         raise last_exc
>
> It's probably worth wrapping that bottom part into an iterclose_all()
> helper, since the pattern probably occurs in other cases as well.
> (Actually, now that I think about it, the map() example in the text
> should be doing this instead of what it's currently doing... I'll fix
> that.)

At this point your code is starting to look a whole lot like the code
in contextlib.ExitStack.__exit__ :)

Accordingly, I'm going to suggest that while I agree the problem you
describe is one that genuinely emerges in large production
applications and other complex systems, this particular solution is
simply far too intrusive to be accepted as a language change for
Python - you're talking about a fundamental change to the meaning of
iteration for the sake of the relatively small portion of the
community that either work on such complex services, or insist on
writing their code as if it might become part of such a service, even
when it currently isn't. Given that simple applications vastly
outnumber complex ones, and always will, I think making such a change
would be a bad trade-off that didn't come close to justifying the
costs imposed on the rest of the ecosystem to adjust to it.

A potentially more fruitful direction of research to pursue for 3.7
would be the notion of "frame local resources", where each Python
level execution frame implicitly provided a lazily instantiated
ExitStack instance (or an equivalent) for resource management.
Assuming that it offered an "enter_frame_context" function that mapped
to "contextlib.ExitStack.enter_context", such a system would let us do
things like:

from frame_resources import enter_frame_context

def readlines_1(fname):
    return enter_frame_context(open(fname)).readlines()

def readlines_2(fname):
    return [*enter_frame_context(open(fname))]

def readlines_3(fname):
    return [line for line in enter_frame_context(open(fname))]

def iterlines_1(fname):
    yield from enter_frame_context(open(fname))

def iterlines_2(fname):
    for line in enter_frame_context(open(fname)):
        yield line

def iterlines_3(fname):
    f = enter_frame_context(open(fname))
    while True:
        try:
            yield next(f)
        except StopIteration:
            break

to indicate "clean up this file handle when this frame terminates,
regardless of the GC implementation used by the interpreter". Such a
feature already gets you a long way towards the determinism you want,
as frames are already likely to be cleaned up deterministically even
in Python implementations that don't use automatic reference counting
- the bit that's non-deterministic is cleaning up the local variables
referenced *from* those frames.
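
A very rough approximation of the non-generator cases is already
possible today with contextlib.ExitStack, if you're willing to pass the
stack around explicitly (the decorator and extra parameter here are
purely illustrative -- the point of the proposed machinery is precisely
that the frame would supply them implicitly):

    import functools
    from contextlib import ExitStack

    def frame_resources(func):
        # Give each call its own ExitStack and close it when the call returns.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with ExitStack() as stack:
                return func(stack, *args, **kwargs)
        return wrapper

    @frame_resources
    def readlines(stack, fname):
        return stack.enter_context(open(fname)).readlines()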

And then further down the track, once such a system had proven its
utility, *then* we could talk about expanding the iteration protocol
to allow for implicit registration of iterable cleanup functions as
frame local resources. With the cleanup functions not firing until the
*frame* exits, then the backwards compatibility break would be
substantially reduced (for __main__ module code there'd essentially be
no compatibility break at all, and similarly for CPython local
variables), and the level of impact on language implementations would
also be much lower (reduced to supporting the registration of cleanup
functions with frame objects, and executing those cleanup functions
when the frame terminates)

Regards,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Nick Coghlan

unread,
Oct 22, 2016, 12:18:09 PM10/22/16
to Chris Barker, Python-Ideas
On 22 October 2016 at 06:59, Chris Barker <chris....@noaa.gov> wrote:
> And then context managers were introduced. And it seems to be there is a
> consensus in the Python community that we all should be using them when
> working on files, and I myself have finally started routinely using them,
> and teaching newbies to use them -- which is kind of a pain, 'cause I want
> to have them do basic file reading stuff before I explain what a "context
> manager" is.

This is actually a case where style guidelines would ideally differ
between scripting use cases (let the GC handle it whenever,
since your process will be terminating soon anyway) and
library(/framework/application) development use cases (promptly clean
up after yourself, since you don't necessarily know your context of
use).

However, that script/library distinction isn't well-defined in
computing instruction in general, and most published style guides are
written by library/framework/application developers, so students and
folks doing ad hoc scripting tend to be the recipients of a lot of
well-meaning advice that isn't actually appropriate for them :(

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Nick Coghlan

unread,
Oct 22, 2016, 11:23:50 PM10/22/16
to Chris Barker, Python-Ideas
On 23 October 2016 at 02:17, Nick Coghlan <ncog...@gmail.com> wrote:
> On 22 October 2016 at 06:59, Chris Barker <chris....@noaa.gov> wrote:
>> And then context managers were introduced. And it seems to be there is a
>> consensus in the Python community that we all should be using them when
>> working on files, and I myself have finally started routinely using them,
>> and teaching newbies to use them -- which is kind of a pain, 'cause I want
>> to have them do basic file reading stuff before I explain what a "context
>> manager" is.
>
> This is actually a case where style guidelines would ideally differ
> between scripting use cases (let the GC handle it whenever,
> since your process will be terminating soon anyway) and
> library(/framework/application) development use cases (promptly clean
> up after yourself, since you don't necessarily know your context of
> use).
>
> However, that script/library distinction isn't well-defined in
> computing instruction in general, and most published style guides are
> written by library/framework/application developers, so students and
> folks doing ad hoc scripting tend to be the recipients of a lot of
> well-meaning advice that isn't actually appropriate for them :(

Pondering this overnight, I realised there's a case where folks using
Python primarily as a scripting language can still run into many of
the resource management problems that arise in larger applications:
IPython notebooks, where the persistent kernel can keep resources
alive for a surprisingly long time in the absence of a reference
counting GC. Yes, they have the option of just restarting the kernel
(which many applications don't have), but it's still a nicer user
experience if we can help them avoid having those problems arise in
the first place.

This is likely mitigated in practice *today* by IPython users mostly
being on CPython for access to the Scientific Python stack, but we can
easily foresee a future where the PyPy community have worked out
enough of their NumPy compatibility and runtime redistribution
challenges that it becomes significantly more common to be using
notebooks against Python kernels that don't use automatic reference
counting.

I'm significantly more amenable to that as a rationale for pursuing
non-syntactic approaches to local resource management than I am the
notion of pursuing it for the sake of high performance application
development code.

Chris, would you be open to trying a thought experiment with some of
your students looking at ways to introduce function-scoped
deterministic resource management *before* introducing with
statements? Specifically, I'm thinking of a progression along the
following lines:

    # Cleaned up whenever the interpreter gets around to cleaning up
    # the function locals
    def readlines_with_default_resource_management(fname):
        return open(fname).readlines()

    # Cleaned up on function exit, even if the locals are still
    # referenced from an exception traceback, or the interpreter
    # implementation doesn't use a reference counting GC
    from local_resources import function_resource

    def readlines_with_declarative_cleanup(fname):
        return function_resource(open(fname)).readlines()

    # Cleaned up at the end of the with statement
    def readlines_with_imperative_cleanup(fname):
        with open(fname) as f:
            return f.readlines()

The idea here is to change the requirement for new developers from
"telling the interpreter what to *do*" (which is the situation we have
for context managers) to "telling the interpreter what we *want*"
(which is for it to link a managed resource with the lifecycle of the
currently running function call, regardless of interpreter
implementation details).

Under that model, Inada-san's recent buffer snapshotting proposal
would effectively be an optimised version of the one liner:

def snapshot(data, limit, offset=0):
    return bytes(function_resource(memoryview(data))[offset:limit])

The big refactoring benefit that this feature would offer over with
statements is that it doesn't require a structural change to the code
- it's just wrapping an existing expression in a new function call
that says "clean this up promptly when the function terminates, even
if it's still part of a reference cycle, or we're not using a
reference counting GC".

Chris Barker

unread,
Oct 24, 2016, 1:17:38 PM10/24/16
to Nick Coghlan, Python-Ideas
On Sat, Oct 22, 2016 at 9:17 AM, Nick Coghlan <ncog...@gmail.com> wrote:
 
This is actually a case where style guidelines would ideally differ
between scripting use cases ... and
library(/framework/application) development use cases 

Hmm -- interesting idea -- and I recall Guido bringing something like this up on one of these lists not too long ago -- "scripting" use cases really are different than "systems programming".

However, that script/library distinction isn't well-defined in
computing instruction in general,

no it's not -- except in the case of "scripting languages" vs. "systems languages" -- you can go back to the classic Ousterhout paper:

https://www.tcl.tk/doc/scripting.html

But Python really is suitable for both use cases, so tricky to know how to teach.

And my classes, at least, have folks with a broad range of use-cases in mind, so I can't choose one way or another. And, indeed, there is no small amount of code (and coder) that starts out as a quick script, but ends up embedded in a larger system down the road.

And (another and?) one of the great things ABOUT Python is that it IS suitable for such a broad range of use-cases.

-CHB


Chris Barker

unread,
Oct 24, 2016, 1:39:06 PM10/24/16
to Nick Coghlan, Python-Ideas
On Sat, Oct 22, 2016 at 8:22 PM, Nick Coghlan <ncog...@gmail.com> wrote:
 
Pondering this overnight, I realised there's a case where folks using
Python primarily as a scripting language can still run into many of
the resource management problems that arise in larger applications:
IPython notebooks
 
This is likely mitigated in practice *today* by IPython users mostly

being on CPython for access to the Scientific Python stack,

sure -- though there is no reason that Jupyter notebooks aren't really useful for all sorts of non-data-crunching tasks. It's just that that's the community they were born in.

I can imagine they would be great for database exploration/management, for instance.

Chris, would you be open to trying a thought experiment with some of
your students looking at ways to introduce function-scoped
deterministic resource management *before* introducing with
statements?

At first thought, talking about this seems like it would just confuse newbies even MORE. Most of my students really want simple examples they can copy and then change for their specific use case.

But I do have some pretty experienced developers (new to Python, but not programming) in my classes, too, that I might be able to bring this up with.

    # Cleaned up whenever the interpreter gets around to cleaning up
    # the function locals
    def readlines_with_default_resource_management(fname):
        return open(fname).readlines()

    # Cleaned up on function exit, even if the locals are still
    # referenced from an exception traceback, or the interpreter
    # implementation doesn't use a reference counting GC
    from local_resources import function_resource

    def readlines_with_declarative_cleanup(fname):
        return function_resource(open(fname)).readlines()

    # Cleaned up at the end of the with statement
    def readlines_with_imperative_cleanup(fname):
        with open(fname) as f:
            return f.readlines()

The idea here is to change the requirement for new developers from
"telling the interpreter what to *do*" (which is the situation we have
for context managers) to "telling the interpreter what we *want*"
(which is for it to link a managed resource with the lifecycle of the
currently running function call, regardless of interpreter
implementation details)

I can see that, but I'm not sure newbies will -- in either case, you have to think about what you want -- which is the complexity I'm trying to avoid at this stage. Until much later, when I get into weak references, I can pretty much tell people that python will take care of itself with regards to resource management.

That's what context managers are for, in fact. YOU can use:

with open(...) as infile:
    .....

Without needing to know what actually has to be "cleaned up" about a file. In the case of files, it's a close() call, simple enough (in the absence of Exceptions...), but with a database connection or something, it could be a lot more complex, and it's nice to know that it will simply be taken care of for you by the context manager.

The big refactoring benefit that this feature would offer over with
statements is that it doesn't require a structural change to the code
- it's just wrapping an existing expression in a new function call
that says "clean this up promptly when the function terminates, even
if it's still part of a reference cycle, or we're not using a
reference counting GC".

hmm -- that would be simpler in one sense, but wouldn't it require a new function to be defined for everything you might want to do this with, rather than the same "with" syntax for everything?

-CHB


Stephen J. Turnbull

unread,
Oct 24, 2016, 10:00:47 PM10/24/16
to Chris Barker, Python-Ideas
Chris Barker wrote:
> Nick Coghlan wrote:

>> Chris, would you be open to trying a thought experiment with some of
>> your students looking at ways to introduce function-scoped
>> deterministic resource management *before* introducing with
>> statements?

I'm with Chris, I think: this seems inappropriate to me. A student
has to be rather sophisticated to understand resource management at
all in Python. Eg, generators and closures can hang on to resources
between calls, yet there's no syntactic marker at the call site.

>> The idea here is to change the requirement for new developers from
>> "telling the interpreter what to *do*" (which is the situation we have
>> for context managers) to "telling the interpreter what we *want*"
>> (which is for it to link a managed resource with the lifecycle of the
>> currently running function call, regardless of interpreter
>> implementation details)

I think this attempt at a distinction is spurious. On the syntactic
side,

with open("file") as f:
results = read_and_process_lines(f)

the with statement effectively links management of the file resource
to the lifecycle of read_and_process_lines. (Yes, I know what you
mean by "link" -- will "new developers"?) On the semantic side,
constructs like closures and generators (which they may be cargo-
culting!) mean that it's harder to link resource management to
(syntactic) function calls than a new developer might think. (Isn't
that Nathaniel's motivation for the OP?) And then there's the problem
of a loop that may not fully consume an iterator: that must be
explicitly decided -- the question for language designers is which of
"close generators on loop exit" or "leave generators open on loop
exit" should be marked with explicit syntax -- and what if you've got
two generators involved, and want different decisions for both?

Chris:

> I can see that, but I'm not sure newbies will -- in either case,
> you have to think about what you want -- which is the complexity
> I'm trying to avoid at this stage.

Indeed.

> Until much later, when I get into weak references, I can pretty
> much tell people that python will take care of itself with regards
> to resource management.

I hope you phrase that very carefully. Python takes care of itself,
but does not take care of the use case. That's the programmer's
responsibility. In a very large number of use cases, including the
novice developer's role in a large project, that is a distinction that
makes no difference. But the "close generators on loop exit" (or
maybe not!) use case makes it clear that in general the developer must
explicitly manage resources.

> That's what context managers are for, in fact. YOU can use:
>
> with open(...) as infile:
> .....
>
> Without needing to know what actually has to be "cleaned up" about
> a file. In the case of files, it's a close() call, simple enough
> (in the absence of Exceptions...), but with a database connection
> or something, it could be a lot more complex, and it's nice to know
> that it will simply be taken care of for you by the context
> manager.

But somebody has to write that context manager. I suppose in the
organizational context imagined here, it was written for the project
by the resource management wonk in the group, and the new developer
just cargo-cults it at first.

> > The big refactoring benefit that this feature would offer over
> > with statements is that it doesn't require a structural change to
> > the code - it's just wrapping an existing expression in a new
> > function call that says "clean this up promptly when the function
> > terminates, even if it's still part of a reference cycle, or
> > we're not using a reference counting GC".
>
> hmm -- that would be simpler in one sense, but wouldn't it require
> a new function to be defined for everything you might want to do
> this with? rather than the same "with" syntax for everything?

Even if it can be done with a single "ensure_cleanup" function, Python
isn't Haskell. I think context management deserves syntax to mark it.

After all, from the "open and read one file" scripting standpoint,
there's really not a difference between

    f = open("file")
    process(f)

and

    with open("file") as f:
        process(f)

(see "taking care of Python ~= taking care of use case" above). But
the with statement and indentation clearly mark the call to process as
receiving special treatment. As Chris says, the developer doesn't
need to know anything but that the object returned by the with
expression participates "appropriately" in the context manager
protocol (which she may think of as the "with protocol"! -- i.e., *magic*)
and gets the "special treatment" it needs.

So (for me) this is full circle: "with" context management is what we
need, but it interacts poorly with stateful "function" calls -- and
that's what Nathaniel proposes to deal with.

Neil Girdhar

unread,
Oct 24, 2016, 10:04:23 PM10/24/16
to python...@googlegroups.com
I still don't understand why the stateful function calls don't just return context managers whose enter block returns the iterable.  Am I missing something?  I think Python already works for this.
 


Nick Coghlan

unread,
Oct 25, 2016, 3:54:34 AM10/25/16
to Chris Barker, Python-Ideas
On 25 October 2016 at 03:16, Chris Barker <chris....@noaa.gov> wrote:
> On Sat, Oct 22, 2016 at 9:17 AM, Nick Coghlan <ncog...@gmail.com> wrote:
>
>>
>> This is actually a case where style guidelines would ideally differ
>> between scripting use cases ... and
>> library(/framework/application) development use cases
>
>
> Hmm -- interesting idea -- and I recall Guido bringing something like this
> up on one of these lists not too long ago -- "scripting" use cases really
> are different than "systems programming"
>
>> However, that script/library distinction isn't well-defined in
>> computing instruction in general,
>
> no it's not -- except in the case of "scripting languages" vs. "systems
> languages" -- you can go back to the classic Ousterhout paper:
>
> https://www.tcl.tk/doc/scripting.html
>
> But Python really is suitable for both use cases, so tricky to know how to
> teach.

Steven Lott was pondering the same question a few years back
(regarding his preference for teaching procedural programming before
any other paradigms), so I had a go at articulating the general idea:
http://www.curiousefficiency.org/posts/2011/08/scripting-languages-and-suitable.html

The main paragraph is still pretty unhelpful though, since I handwave
away the core of the problem as "the art of software design":

"""A key part of the art of software design is learning how to choose
an appropriate level of complexity for the problem at hand - when a
problem calls for a simple script, throwing an entire custom
application at it would be overkill. On the other hand, trying to
write complex applications using only scripts and no higher level
constructs will typically lead to an unmaintainable mess."""

Cheers,
Nick.

P.S. I'm going to stop now since we're getting somewhat off-topic, but
I wanted to highlight this excellent recent article on the challenges
of determining the level of "suitable complexity" for any given
software engineering problem:
https://hackernoon.com/how-to-accept-over-engineering-for-what-it-really-is-6fca9a919263#.k4nqzjl52

Nick Coghlan

unread,
Oct 25, 2016, 4:17:23 AM10/25/16
to Chris Barker, Python-Ideas
Nope, hence the references to contextlib.ExitStack:
https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack

That's a tool for dynamic manipulation of context managers, so even
today you can already write code like this:

>>> @with_resource_manager
... def example(rm, *, msg=None, exc=None):
...     rm.enter_context(cm())
...     rm.callback(print, "Deferred callback")
...     if msg is not None: print(msg)
...     if exc is not None: raise exc
...
>>> example(msg="Normal return")
Enter CM
Normal return
Deferred callback
Exit CM
>>> example(exc=RuntimeError("Exception thrown"))
Enter CM
Deferred callback
Exit CM
Traceback (most recent call last):
...
RuntimeError: Exception thrown

The setup code to support it is just a few lines of code:

>>> import functools
>>> from contextlib import ExitStack
>>> def with_resource_manager(f):
...     @functools.wraps(f)
...     def wrapper(*args, **kwds):
...         with ExitStack() as rm:
...             return f(rm, *args, **kwds)
...     return wrapper
...

Plus the example context manager definition:

>>> from contextlib import contextmanager
>>> @contextmanager
... def cm():
... print("Enter CM")
... try:
... yield
... finally:
... print("Exit CM")
...

So the gist of my proposal (from an implementation perspective) is
that if we give frame objects an ExitStack instance (or an operational
equivalent) that can be created on demand and will be cleaned up when
the frame exits (regardless of how that happens), then we can define
an API for adding "at frame termination" callbacks (including making
it easy to dynamically add context managers to that stack) without
needing to define your own scaffolding for that feature - it would
just be a natural part of the way frame objects work.

Nick Coghlan

unread,
Oct 25, 2016, 4:34:24 AM10/25/16
to Stephen J. Turnbull, Python-Ideas
On 25 October 2016 at 11:59, Stephen J. Turnbull
<turnbull....@u.tsukuba.ac.jp> wrote:
> On the semantic side,
> constructs like closures and generators (which they may be cargo-
> culting!) mean that it's harder to link resource management to
> (syntactic) function calls than a new developer might think. (Isn't
> that Nathaniel's motivation for the OP?)

This is my read of Nathaniel's motivation as well, and hence my
proposal: rather than trying to auto-magically guess when a developer
intended for their resource management to be linked to the current
executing frame (which requires fundamentally changing how iteration
works in a way that breaks the world, and still doesn't solve the
problem in general), I'm starting to think that we instead need a way
to let them easily say "This resource, the one I just created or have
otherwise gained access to? Link its management to the lifecycle of
the currently running function or frame, so it gets cleaned up when it
finishes running".

Precisely *how* a particular implementation did that resource
management would be up to the particular Python implementation, but
one relatively straightforward way would be to use
contextlib.ExitStack under the covers, and then when the frame
finishes execution have a check that goes:

- did the lazily instantiated ExitStack instance get created during
frame execution?
- if yes, close it immediately, thus reclaiming all the registered resources

The spelling of the *surface* API though is something I'd need help
from educators in designing - my problem is that I already know all
the moving parts and how they fit together (hence my confidence that
something like this would be relatively easy to implement, at least in
CPython, if we decided we wanted to do it), but I *don't* know what
kinds of terms could be used in the API if we wanted to make it
approachable to relative beginners. My initial thought would be to
offer:

from local_resources import function_resource

and:

from local_resources import frame_resource

Where the only difference between the two is that the first one would
complain if you tried to use it outside a normal function body, while
the second would be usable anywhere (function, class, module,
generator, coroutine).

Both would accept and automatically enter context managers as input,
as if you'd wrapped the rest of the frame body in a with statement.
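To make the intended difference between the two concrete -- neither
helper exists today, and the function and file names below are just
made up for illustration:

    from local_resources import function_resource, frame_resource  # hypothetical

    def copy_file(src, dst):
        # Inside a normal function body either helper would work; both
        # files get cleaned up when copy_file() returns or raises.
        fin = function_resource(open(src, 'rb'))
        fout = function_resource(open(dst, 'wb'))
        fout.write(fin.read())

    # At module level there is no ordinary function body, so
    # function_resource() would complain, while frame_resource() would
    # tie cleanup to the module frame finishing execution.
    SETTINGS = frame_resource(open('settings.ini')).read()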

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Yury Selivanov

unread,
Oct 25, 2016, 12:00:28 PM10/25/16
to python...@python.org


On 2016-10-25 4:33 AM, Nick Coghlan wrote:
> I'm starting to think that we instead need a way
> to let them easily say "This resource, the one I just created or have
> otherwise gained access to? Link its management to the lifecycle of
> the currently running function or frame, so it gets cleaned up when it
> finishes running".


But how would it help with a partial iteration over generators
with a "with" statement inside?

def it():
    with open(file) as f:
        for line in f:
            yield line

Nathaniel proposal addresses this by fixing "for" statements,
so that the outer loop that iterates over "it" would close
the generator once the iteration is stopped.

With your proposal you want to attach the opened file to the
frame, but you'd need to attach it to the frame of *caller* of
"it", right?

Yury

Nathaniel Smith

unread,
Oct 25, 2016, 6:26:17 PM10/25/16
to Nick Coghlan, python...@python.org
One of the versions I tried but didn't include in my email used
ExitStack :-). It turns out not to work here: the problem is that we
effectively need to enter *all* the contexts before unwinding, even if
trying to enter one of them fails. ExitStack is nested like (try (try
(try ... finally) finally) finally), and we need (try finally (try
finally (try finally ...))). But this is just a small side-point
anyway, since most code is not implementing complicated
meta-iterators; I'll address your real proposal below.

> Accordingly, I'm going to suggest that while I agree the problem you
> describe is one that genuinely emerges in large production
> applications and other complex systems, this particular solution is
> simply far too intrusive to be accepted as a language change for
> Python - you're talking a fundamental change to the meaning of
> iteration for the sake of the relatively small portion of the
> community that either work on such complex services, or insist on
> writing their code as if it might become part of such a service, even
> when it currently isn't. Given that simple applications vastly
> outnumber complex ones, and always will, I think making such a change
> would be a bad trade-off that didn't come close to justifying the
> costs imposed on the rest of the ecosystem to adjust to it.
>
> A potentially more fruitful direction of research to pursue for 3.7
> would be the notion of "frame local resources", where each Python
> level execution frame implicitly provided a lazily instantiated
> ExitStack instance (or an equivalent) for resource management.
> Assuming that it offered an "enter_frame_context" function that mapped
> to "contextlib.ExitStack.enter_context", such a system would let us do
> things like:

So basically a 'with expression', that gives up the block syntax --
taking its scope from the current function instead -- in return for
being usable in expression context? That's a really interesting idea, and I
see the intuition that it might be less disruptive if our implicit
iterclose calls are scoped to the function rather than the 'for' loop.

But having thought about it and investigated some... I don't think
function-scoping addresses my problem, and I don't see evidence that
it's meaningfully less disruptive to existing code.

First, "my problem":

Obviously, Python's a language that should be usable for folks doing
one-off scripts, and for paranoid folks trying to write robust complex
systems, and for everyone in between -- these are all really important
constituencies. And unfortunately, there is a trade-off here, where
the changes we're discussing effect these constituencies differently.
But it's not just a matter of shifting around a fixed amount of pain;
the *quality* of the pain really changes under the different
proposals.

In the status quo:
- for one-off scripts: you can just let the GC worry about generator
and file handle cleanup, re-use iterators, whatever, it's cool
- for robust systems: because it's the *caller's* responsibility to
ensure that iterators are cleaned up, you... kinda can't really use
generators without -- pick one -- (a) draconian style guides (like
forbidding 'with' inside generators or forbidding bare 'for' loops
entirely), (b) lots of auditing (every time you write a 'for' loop, go
read the source to the generator you're iterating over -- no
modularity for you and let's hope the answer doesn't change!), or (c)
introducing really subtle bugs. Or all of the above. It's true that a
lot of the time you can ignore this problem and get away with it one
way or another, but if you're trying to write robust code then this
doesn't really help -- it's like saying the footgun only has 1 bullet
in the chamber. Not as reassuring as you'd think. It's like if every
time you called a function, you had to explicitly say whether you
wanted exception handling to be enabled inside that function, and if
you forgot then the interpreter might just skip the 'finally' blocks
while unwinding. There just *isn't* a good solution available.

In my proposal (for-scoped-iterclose):
- for robust systems: life is great -- you're still stopping to think
a little about cleanup every time you use an iterator (because that's
what it means to write robust code!), but since the iterators now know
when they need cleanup and regular 'for' loops know how to invoke it,
then 99% of the time (i.e., whenever you don't intend to re-use an
iterator) you can be confident that just writing 'for' will do exactly
the right thing, and the other 1% of the time (when you do want to
re-use an iterator), you already *know* you're doing something clever.
So the cognitive overhead on each for-loop is really low.
- for one-off scripts: ~99% of the time (actual measurement, see
below) everything just works, except maybe a little bit better. 1% of
the time, you deploy the clever trick of re-using an iterator with
multiple for loops, and it breaks, so this is some pain. Here's what
you see:

gen_obj = ...
for first_line in gen_obj:
    break
for lines in gen_obj:
    ...

Traceback (most recent call last):
  File "/tmp/foo.py", line 5, in <module>
    for lines in gen_obj:
AlreadyClosedIteratorError: this iterator was already closed,
possibly by a previous 'for' loop. (Maybe you want
itertools.preserve?)

(We could even have a PYTHONDEBUG flag that when enabled makes that
error message include the file:line of the previous 'for' loop that
called __iterclose__.)
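(For anyone skimming: itertools.preserve in that error message is the
opt-out wrapper from the proposal, not an existing itertools function.
A rough sketch of the intended behaviour, just to make the message
concrete:

    class preserve:
        # Sketch of the proposed itertools.preserve (not an existing
        # API): forward iteration to the wrapped iterator, but turn the
        # for-loop's implicit __iterclose__ call into a no-op, so the
        # underlying iterator stays open for a later loop.
        def __init__(self, iterable):
            self._it = iter(iterable)

        def __iter__(self):
            return self

        def __next__(self):
            return next(self._it)

        def __iterclose__(self):
            # Deliberately don't close self._it; the caller keeps
            # responsibility for closing it, if it needs closing at all.
            pass
)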

So this is pain! But the pain is (a) rare, not pervasive, (b)
immediately obvious (an exception, the code doesn't work at all), not
subtle and delayed, (c) easily googleable, (d) easy to fix and the fix
is reliable. It's a totally different type of pain than the pain that
we currently impose on folks who want to write robust code.

Now compare to the new proposal (function-scoped-iterclose):

- For those who want robust cleanup: Usually, I only need an iterator
for as long as I'm iterating over it; that may or may not correspond
to the end of the function (often won't). When these don't coincide,
it can cause problems. E.g., consider the original example from my
proposal:

def read_newline_separated_json(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)

but now suppose that I'm a Data Scientist (tm) so instead of having 1
file full of newline-separated JSON, I have 100 gigabytes' worth of
the stuff stored in lots of files in a directory tree. Well, that's no
problem, I'll just wrap that generator:

def read_newline_separated_json_tree(tree):
    for root, _, paths in os.walk(tree):
        for path in paths:
            for document in read_newline_separated_json(join(root, path)):
                yield document

And then I'll run it on PyPy, because that's what you do when you have
100 GB of string processing, and... it'll crash, because the call to
read_newline_separated_json_tree ends up doing thousands of calls to
read_newline_separated_json, but never cleans any of them up until
the function exits, so eventually we run out of file descriptors.

A similar situation arises in the main loop of something like an HTTP server:

while True:
    request = read_request(sock)
    for response_chunk in application_handler(request):
        send_response_chunk(sock)

Here we'll accumulate arbitrary numbers of un-closed
application_handler generators attached to the stack frame, which is
no good at all. And this has the interesting failure mode that you'll
probably miss it in testing, because most clients will only re-use a
connection a small number of times.

So what this means is that every time I write a for loop, I can't just
do a quick "am I going to break out of the for-loop and then re-use
this iterator?" check -- I have to stop and think about whether this
for-loop is nested inside some other loop, etc. And, again, if I get
it wrong, then it's a subtle bug that will bite me later. It's true
that with the status quo, we need to wrap X% of for-loops with 'with'
blocks, and with this proposal that number would drop to, I don't
know, (X/5)% or something. But that's not the most important cost: the
most important cost is the cognitive overhead of figuring out which
for-loops need the special treatment, and in this proposal that
checking is actually *more* complicated than the status quo.

- For those who just want to write a quick script and not think about
it: here's a script that does repeated partial for-loops over a
generator object:

https://github.com/python/cpython/blob/553a84c4c9d6476518e2319acda6ba29b8588cb4/Tools/scripts/gprof2html.py#L40-L79

(and note that the generator object even has an ineffective 'with
open(...)' block inside it!)

With the function-scoped-iterclose, this script would continue to work
as it does now. Excellent.

But, suppose that I decide that that main() function is really
complicated and that it would be better to refactor some of those
loops out into helper functions. (Probably actually true in this
example.) So I do that and... suddenly the code breaks. And in a
rather confusing way, because it has to do with this complicated
long-distance interaction between two different 'for' loops *and*
where they're placed with respect to the original function versus the
helper function.

If I were an intermediate-level Python student (and I'm pretty sure
anyone who is starting to get clever with re-using iterators counts as
"intermediate level"), then I'm pretty sure I'd actually prefer the
immediate obvious feedback from the for-scoped-iterclose. This would
actually be a good time to teach folks about this aspect of resource
handling -- it's certainly an important thing to learn
eventually on your way to Python mastery, even if it isn't needed for
every script.

In the pypy-dev thread about this proposal, there's some very
distressed emails from someone who's been writing Python for a long
time but only just realized that generator cleanup relies on the
garbage collector:

https://mail.python.org/pipermail/pypy-dev/2016-October/014709.html
https://mail.python.org/pipermail/pypy-dev/2016-October/014720.html

It's unpleasant to have the rug pulled out from under you like this
and suddenly realize that you might have to go re-evaluate all the
code you've ever written, and making for loops safe-by-default and
fail-fast-when-unsafe avoids that.

Anyway, in summary: function-scoped-iterclose doesn't seem to
accomplish my goal of getting rid of the *type* of pain involved when
you have to run a background thread in your brain that's doing
constant paranoid checking every time you write a for loop. Instead it
arguably takes that type of pain and spreads it around both the
experts and the novices :-/.

-------------

Now, let's look at some evidence about how disruptive the two
proposals are for real code:

As mentioned else-thread, I wrote a stupid little CPython hack [1] to
report when the same iterator object gets passed to multiple 'for'
loops, and ran the CPython and Django testsuites with it [2]. Looking
just at generator objects [3], across these two large codebases there
are exactly 4 places where this happens. (Rough idea of prevalence:
these 4 places together account for a total of 8 'for' loops; this is
out of a total of 11,503 'for' loops total, of which 665 involve
generator objects.) The 4 places are:

1) CPython's Lib/test/test_collections.py:1135, Lib/_collections_abc.py:378

This appears to be a bug in the CPython test suite -- the little MySet
class does 'def __init__(self, itr): self.contents = itr', which
assumes that itr is a container that can be repeatedly iterated. But a
bunch of the methods on collections.abc.Set like to pass in a
generator object here instead, which breaks everything. If repeated
'for' loops on generators raised an error then this bug would have
been caught much sooner.

2) CPython's Tools/scripts/gprof2html.py lines 45, 54, 59, 75

Discussed above -- as written, for-scoped-iterclose would break this
script, but function-scoped-iterclose would not, so here
function-scoped-iterclose wins.

3) Django django/utils/regex_helper.py:236

This code is very similar to the previous example in its general
outline, except that the 'for' loops *have* been factored out into
utility functions. So in this case for-scoped-iterclose and
function-scoped-iterclose are equally disruptive.

4) CPython's Lib/test/test_generators.py:723

I have to admit I cannot figure out what this code is doing, besides
showing off :-). But the different 'for' loops are in different stack
frames, so I'm pretty sure that for-scoped-iterclose and
function-scoped-iterclose would be equally disruptive.

Obviously there's a bias here in that these are still relatively
"serious" libraries; I don't have a big corpus of one-off scripts that
are just a big __main__, though gprof2html.py isn't far from that. (If
anyone knows where to find such a thing let me know...) But still, the
tally here is that out of 4 examples, we have 1 subtle bug that
iterclose might have caught, 2 cases where for-scoped-iterclose and
function-scoped-iterclose are equally disruptive, and only 1 where
function-scoped-iterclose is less disruptive -- and in that case it's
arguably just avoiding an obvious error now in favor of a more
confusing error later.

If this reduced the backwards-incompatible cases by a factor of, like,
10x or 100x, then that would be a pretty strong argument in its favor.
But it seems to be more like... 1.5x.

-n

[1] https://github.com/njsmith/cpython/commit/2b9d60e1c1b89f0f1ac30cbf0a5dceee835142c2
[2] CPython: revision b0a272709b from the github mirror; Django:
revision 90c3b11e87
[3] I also looked at "all iterators" and "all iterators with .close
methods", but this email is long enough... basically the pattern is
the same: there are another 13 'for' loops that involve repeated
iteration over non-generator objects, and they're roughly equally
split between spurious effects due to bugs in the CPython test-suite
or my instrumentation, cases where for-scoped-iterclose and
function-scoped-iterclose both cause the same problems, and cases
where function-scoped-iterclose is less disruptive.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Nathaniel Smith

unread,
Oct 25, 2016, 6:49:49 PM10/25/16
to Nick Coghlan, python...@python.org
...Doh. I spent all that time evaluating the function-scoped-cleanup
proposal from the high-level design perspective, and then immediately
after hitting send, I suddenly realized that I'd missed a much more
straightforward technical problem.

One thing that 'with' blocks / for-scoped-iterclose do is that they
put an upper bound on the lifetime of generator objects. That's
important if you're using a non-refcounting-GC, or if there might be
reference cycles. But it's not all they do: they also arrange to make
sure that any cleanup code is executed in the context of the code
that's using the generator. This is *also* really important: if you
have an exception in your cleanup code, and the GC runs your cleanup
code, then that exception will just disappear into nothingness (well,
it'll get printed to the console, but that's hardly better). So you
don't want to let the GC run your cleanup code. If you have an async
generator, you want to run the cleanup code under supervision of the
calling function's coroutine runner, and ideally block the running
coroutine while you do it; doing this from the GC is
difficult-to-impossible (depending on how picky you are -- PEP 525
does part of it, but not all). Again, letting the GC get involved is
bad.

So for the function-scoped-iterclose proposal: does this implicit
ExitStack-like object take a strong reference to iterators, or just a
weak one?

If it takes a strong reference, then suddenly we're pinning all
iterators in memory until the end of the enclosing function, which
will often look like a memory leak. I think this would break a *lot*
more existing code than the for-scoped-iterclose proposal does, and in
more obscure ways that are harder to detect and warn about ahead of
time. So that's out.

If it takes a weak reference, ... then there's a good chance that
iterators will get garbage collected before the ExitStack has a chance
to clean them up properly. So we still have no guarantee that the
cleanup will happen in the right context, that exceptions will not be
lost, and so forth. In fact, it becomes literally non-deterministic:
you might see an exception propagate properly on one run, and not on
the next, depending on exactly when the garbage collector happened to
run.

IMHO that's *way* too spooky to be allowed, but I can't see any way to
fix it within the function-scoping framework :-(

-n

Nick Coghlan

unread,
Oct 26, 2016, 11:22:00 AM10/26/16
to Yury Selivanov, python...@python.org
On 26 October 2016 at 01:59, Yury Selivanov <yseliv...@gmail.com> wrote:
> But how would it help with a partial iteration over generators
> with a "with" statement inside?
>
> def it():
>     with open(file) as f:
>         for line in f:
>             yield line
>
> Nathaniel proposal addresses this by fixing "for" statements,
> so that the outer loop that iterates over "it" would close
> the generator once the iteration is stopped.
>
> With your proposal you want to attach the opened file to the
> frame, but you'd need to attach it to the frame of *caller* of
> "it", right?

Every frame in the stack would still need to opt in to deterministic
cleanup of its resources, but the difference is that it becomes an
inline operation within the expression creating the iterator, rather
than a complete restructuring of the function:

def iter_consumer(fname):
    for line in function_resource(open(fname)):
        ...

It doesn't matter *where* the iterator is being used (or even if you
received it as a parameter), you get an easy way to say "When this
function exits, however that happens, clean this up".

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Nick Coghlan

unread,
Oct 26, 2016, 12:55:21 PM10/26/16
to Nathaniel Smith, python...@python.org
On 26 October 2016 at 08:25, Nathaniel Smith <n...@pobox.com> wrote:
> On Sat, Oct 22, 2016 at 9:02 AM, Nick Coghlan <ncog...@gmail.com> wrote:
>> At this point your code is starting to look a whole lot like the code
>> in contextlib.ExitStack.__exit__ :)
>
> One of the versions I tried but didn't include in my email used
> ExitStack :-). It turns out not to work here: the problem is that we
> effectively need to enter *all* the contexts before unwinding, even if
> trying to enter one of them fails. ExitStack is nested like (try (try
> (try ... finally) finally) finally), and we need (try finally (try
> finally (try finally ...)))

Regardless of any other outcome from this thread, it may be useful to
have a "contextlib.ResourceSet" as an abstraction for collective
management of resources. As you
say, the main difference is that the invocation of the cleanup
functions wouldn't be nested at all and could be called in an
arbitrary order (if that's not sufficient for a particular use case,
then you'd need to define an ExitStack for the items where the order
of cleanup matters, and then register *that* with the ResourceSet).
(Note: I've changed my preferred API name from "function_resource" +
"frame_resource" to the general purpose "scoped_resource" - while it's
somewhat jargony, which I consider unfortunate, the goal is to make
the runtime scope of the resource match the lexical scope of the
reference as closely as is feasible, and if folks are going to
understand how Python manages references and resources, they're going
to need to learn the basics of Python's scope management at some
point)

Given your points below, the defensive coding recommendation here would be to

- always wrap your iterators in scoped_resource() to tell Python to
clean them up when the function is done
- explicitly call close_resources() after the affected for loops to
clean the resources up early

You'd still be vulnerable to resource leaks in libraries you didn't
write, but would have decent control over your own code without having
to make overly draconian changes to your style guide - you'd only need
one new rule, which is "Whenever you're iterating over something, pass
it through scoped_resource first".

To simplify this from a forwards compatibility perspective (i.e. so it
can implicitly adjust when an existing type gains a cleanup method),
we'd make scoped_resource() quite permissive, accepting arbitrary
objects with the following behaviours:

- if it's a context manager, enter it, and register the exit callback
- if it's not a context manager, but has a close() method, register
the close method
- otherwise, pass it straight through without taking any other action

This would allow folks to always declare something as a scoped
resource without impeding their ability to handle objects that aren't
resources at all.
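
A minimal pure-Python sketch of that permissive behaviour (using a
module-level ExitStack as a stand-in for the per-frame machinery, which
doesn't exist today) could look like:

    from contextlib import ExitStack

    # Stand-in for the per-frame registry the real proposal would provide.
    _resources = ExitStack()

    def scoped_resource(obj):
        # Permissive behaviour described above: enter context managers and
        # register their exit; register bare close() methods; pass anything
        # else straight through untouched.
        if hasattr(obj, '__enter__') and hasattr(obj, '__exit__'):
            return _resources.enter_context(obj)
        close = getattr(obj, 'close', None)
        if callable(close):
            _resources.callback(close)
            return obj
        return obj

    def close_resources():
        # Unwind everything registered so far (the stack stays reusable).
        _resources.close()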

The long term question would then become whether it made sense to have
certain language constructs implicitly mark their targets as scoped
resources *by default*, and clean them up selectively after the loop
rather than using the blunt instrument of cleaning up all previously
registered resources. If we did start seriously considering such a
change, then there would be potential utility in an "unmanaged_iter()"
wrapper which forwarded *only* the iterator protocol methods, thus
hiding any __exit__() or close() methods from scoped_resource().

However, the time to consider such a change in default behaviour would
be *after* we had some experience with explicit declarations and
management of scoped resources - plenty of folks are writing plenty of
software today in garbage collected languages (including Python), and
coping with external resource management problems as they arise, so we
don't need to do anything hasty here. I personally think an explicit
solution is likely to be sufficient (given the caveat of adding a
"gc.collect()" counterpart), with an API like `scoped_resource` being
adopted over time in libraries, frameworks and applications based on
actual defects found in running production systems as well as the
defensive coding style, and your example below makes me even more
firmly convinced that that's a better way to go.

> In my proposal (for-scoped-iterclose):
> - for robust systems: life is great -- you're still stopping to think
> a little about cleanup every time you use an iterator (because that's
> what it means to write robust code!), but since the iterators now know
> when they need cleanup and regular 'for' loops know how to invoke it,
> then 99% of the time (i.e., whenever you don't intend to re-use an
> iterator) you can be confident that just writing 'for' will do exactly
> the right thing, and the other 1% of the time (when you do want to
> re-use an iterator), you already *know* you're doing something clever.
> So the cognitive overhead on each for-loop is really low.

In mine, if your style guide says "Use scoped_resource() and an
explicit close_resources() call when iterating", you'd add it (or your
automated linter would complain that it was missing). So the cognitive
overhead is higher, but it would remain where it belongs (i.e. on
professional developers being paid to write robust code).

> - for one-off scripts: ~99% of the time (actual measurement, see
> below) everything just works, except maybe a little bit better. 1% of
> the time, you deploy the clever trick of re-using an iterator with
> multiple for loops, and it breaks, so this is some pain. Here's what
> you see:
>
> gen_obj = ...
> for first_line in gen_obj:
>     break
> for lines in gen_obj:
>     ...
>
> Traceback (most recent call last):
>   File "/tmp/foo.py", line 5, in <module>
>     for lines in gen_obj:
> AlreadyClosedIteratorError: this iterator was already closed,
> possibly by a previous 'for' loop. (Maybe you want
> itertools.preserve?)
>
> (We could even have a PYTHONDEBUG flag that when enabled makes that
> error message include the file:line of the previous 'for' loop that
> called __iterclose__.)
>
> So this is pain! But the pain is (a) rare, not pervasive, (b)
> immediately obvious (an exception, the code doesn't work at all), not
> subtle and delayed, (c) easily googleable, (d) easy to fix and the fix
> is reliable. It's a totally different type of pain than the pain that
> we currently impose on folks who want to write robust code.

And it's completely unnecessary - with explicit scoped_resource() calls
absolutely nothing changes for the scripting use case, and even with
implicit ones, re-use *within the same scope* would still be fine
(you'd only get into trouble if the resource escaped the scope where
it was first marked as a scoped resource).

> Now compare to the new proposal (function-scoped-iterclose):
>
> - For those who want robust cleanup: Usually, I only need an iterator
> for as long as I'm iterating over it; that may or may not correspond
> to the end of the function (often won't). When these don't coincide,
> it can cause problems. E.g., consider the original example from my
> proposal:
>
> def read_newline_separated_json(path):
>     with open(path) as f:
>         for line in f:
>             yield json.loads(line)
>
> but now suppose that I'm a Data Scientist (tm) so instead of having 1
> file full of newline-separated JSON, I have a 100 gigabytes worth of
> the stuff stored in lots of files in a directory tree. Well, that's no
> problem, I'll just wrap that generator:
>
> def read_newline_separated_json_tree(tree):
>     for root, _, paths in os.walk(tree):
>         for path in paths:
>             for document in read_newline_separated_json(join(root, path)):
>                 yield document

If you're being paid to write robust code and are using Python 3.7+,
then you'd add scoped_resource() around the
read_newline_separated_json() call and then add a close_resources()
call after that loop. That'd be part of your job, and just another
point in the long list of reasons why developing software as a
profession isn't the same thing as doing it as a hobby. We'd design
scoped_resource() in such a way that it could be harmlessly wrapped
around "paths" as well, even though we know that's technically not
necessary (since it's just a list of strings).

As noted above, I'm also open to the notion of some day making all for
loops implicitly declare the iterators they operate on as scoped
resources, but I don't think we should do that without gaining some
experience with the explicit form first (where we can be confident
that any unexpected negative consequences will be encountered by folks
already well equipped to deal with them).

> And then I'll run it on PyPy, because that's what you do when you have
> 100 GB of string processing, and... it'll crash, because the call to
> read_newline_separated_json_tree ends up doing thousands of calls to
> read_newline_separated_json, but never cleans any of them up until
> the function exits, so eventually we run out of file descriptors.

And we'll go "Oops", and refactor our code to better control the scope
of our resources, either by adding a with statement around the
innermost loop or using the new scoped resources API (if such a thing
gets added). The *whole point* of iterative development is to solve
the problems you know you have, not the problems you or someone else
might potentially have at some point in the indeterminate future.

> A similar situation arises in the main loop of something like an HTTP server:
>
> while True:
>     request = read_request(sock)
>     for response_chunk in application_handler(request):
>         send_response_chunk(sock)
>
> Here we'll accumulate arbitrary numbers of un-closed
> application_handler generators attached to the stack frame, which is
> no good at all. And this has the interesting failure mode that you'll
> probably miss it in testing, because most clients will only re-use a
> connection a small number of times.

And the fixed code (given the revised API proposal above) looks like this:

while True:
    request = read_request(sock)
    for response_chunk in scoped_resource(application_handler(request)):
        send_response_chunk(sock)
    close_resources()

This pattern has the advantage of also working if the resources you
want to manage aren't precisely what you're iterating over, or if you're
iterating over them in a while loop rather than a for loop.
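
For instance, still assuming the hypothetical scoped_resource() and
close_resources() helpers sketched above, and a connected socket like
the one in the example:

    # The resource being managed isn't the thing being iterated over,
    # and the iteration is a while loop rather than a for loop.
    log = scoped_resource(open("transfers.log", "a"))
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        log.write("received %d bytes\n" % len(chunk))
    close_resources()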

> So what this means is that every time I write a for loop, I can't just
> do a quick "am I going to break out of the for-loop and then re-use
> this iterator?" check -- I have to stop and think about whether this
> for-loop is nested inside some other loop, etc.

Or you unconditionally add the scoped_resource/close_resources calls
to force non-reference-counted implementations to behave a bit more
like CPython and don't worry about it further.

> - For those who just want to write a quick script and not think about
> it: here's a script that does repeated partial for-loops over a
> generator object:
>
> https://github.com/python/cpython/blob/553a84c4c9d6476518e2319acda6ba29b8588cb4/Tools/scripts/gprof2html.py#L40-L79
>
> (and note that the generator object even has an ineffective 'with
> open(...)' block inside it!)
>
> With the function-scoped-iterclose, this script would continue to work
> as it does now. Excellent.

As it would with the explicit scoped_resource/close_resources API.

> But, suppose that I decide that that main() function is really
> complicated and that it would be better to refactor some of those
> loops out into helper functions. (Probably actually true in this
> example.) So I do that and... suddenly the code breaks. And in a
> rather confusing way, because it has to do with this complicated
> long-distance interaction between two different 'for' loops *and*
> where they're placed with respect to the original function versus the
> helper function.

I do agree the fact that it would break common code refactoring
patterns is a good counter-argument against the idea of ever calling
scoped_resource() implicitly.

> Anyway, in summary: function-scoped-iterclose doesn't seem to
> accomplish my goal of getting rid of the *type* of pain involved when
> you have to run a background thread in your brain that's doing
> constant paranoid checking every time you write a for loop. Instead it
> arguably takes that type of pain and spreads it around both the
> experts and the novices :-/.

Does the addition of the explicit close_resources() API mitigate your concern?

> Now, let's look at some evidence about how disruptive the two
> proposals are for real code:
>
> As mentioned else-thread, I wrote a stupid little CPython hack [1] to
> report when the same iterator object gets passed to multiple 'for'
> loops, and ran the CPython and Django testsuites with it [2]. Looking
> just at generator objects [3], across these two large codebases there
> are exactly 4 places where this happens.

The standard library and a web framework are in no way typical of
Python application and scripting code.

> 3) Django django/utils/regex_helper.py:236
>
> This code is very similar to the previous example in its general
> outline, except that the 'for' loops *have* been factored out into
> utility functions. So in this case for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive.

But explicitly scoped resource management leaves it alone.

> 4) CPython's Lib/test/test_generators.py:723
>
> I have to admit I cannot figure out what this code is doing, besides
> showing off :-). But the different 'for' loops are in different stack
> frames, so I'm pretty sure that for-scoped-iterclose and
> function-scoped-iterclose would be equally disruptive.

And explicitly scoped resource management again leaves it alone.

> Obviously there's a bias here in that these are still relatively
> "serious" libraries; I don't have a big corpus of one-off scripts that
> are just a big __main__, though gprof2html.py isn't far from that. (If
> anyone knows where to find such a thing let me know...) But still, the
> tally here is that out of 4 examples, we have 1 subtle bug that
> iterclose might have caught, 2 cases where for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive, and only 1 where
> function-scoped-iterclose is less disruptive -- and in that case it's
> arguably just avoiding an obvious error now in favor of a more
> confusing error later.
>
> If this reduced the backwards-incompatible cases by a factor of, like,
> 10x or 100x, then that would be a pretty strong argument in its favor.
> But it seems to be more like... 1.5x.

The explicit-API-only aspect of the proposal eliminates 100% of the
backwards incompatibilities :)

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Nick Coghlan

unread,
Oct 26, 2016, 1:03:15 PM10/26/16
to Nathaniel Smith, python...@python.org
On 26 October 2016 at 08:48, Nathaniel Smith <n...@pobox.com> wrote:
> If it takes a strong reference, then suddenly we're pinning all
> iterators in memory until the end of the enclosing function, which
> will often look like a memory leak. I think this would break a *lot*
> more existing code than the for-scoped-iterclose proposal does, and in
> more obscure ways that are harder to detect and warn about ahead of
> time.

It would take a strong reference, which is another reason why
close_resources() would be an essential part of the explicit API
(since it would drop the references in addition to calling the
__exit__() and close() methods of the declared resources), and also
yet another reason why you've convinced me that the only implicit API
that would ever make sense is one that was scoped specifically to the
iteration process.

However, I still think the explicit-API-only suggestion is a much
better path to pursue than any implicit proposal - it will give folks
that see it for the first time something to Google, and it's a general
purpose technique rather than being restricted specifically to the
cases where the resource to be managed and the iterator being iterated
over are one and the same object.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Neil Girdhar

unread,
Oct 28, 2016, 3:11:10 AM10/28/16
to python-ideas, ncog...@gmail.com, python...@python.org, n...@pobox.com
I still don't understand why you can't write it like this:

def read_newline_separated_json_tree(tree):
    for root, _, paths in os.walk(tree):
        for path in paths:
            with read_newline_separated_json(join(root, path)) as iterable:
                yield from iterable

Zero extra lines.  Works today.  Does everything you want.
 

A similar situation arises in the main loop of something like an HTTP server:

  while True:
      request = read_request(sock)
      for response_chunk in application_handler(request):
          send_response_chunk(sock)

Same thing:


while True:
    request = read_request(sock)
    with application_handler(request) as iterable:
        for response_chunk in iterable:
            send_response_chunk(sock)


I'll stop posting about this, but I don't see the motivation behind this proposal except replacing one explicit context management line with a hidden "line" of cognitive overhead.  I think the solution is to stop returning an iterable when you have state needing cleanup.  Instead, return a context manager and force the caller to open it to get at the iterable.
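
For example, the read_newline_separated_json generator from upthread
could be written as a context manager factory along these lines (just a
sketch of the pattern I mean, reusing the names from above):

    from contextlib import contextmanager
    import json

    @contextmanager
    def read_newline_separated_json(path):
        # The with block owns the file; __enter__ hands back the iterable,
        # so the caller decides exactly when cleanup happens.
        with open(path) as f:
            yield (json.loads(line) for line in f)

    # and the caller does:
    #
    #     with read_newline_separated_json(path) as documents:
    #         for document in documents:
    #             ...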

Best,

Neil