[Python-Dev] PEP 554 v3 (new interpreters module)


Eric Snow

Sep 13, 2017, 9:47:00 PM
to Python-Dev
I've updated PEP 554 in response to feedback. (thanks all!) There
are a few unresolved points (some of them added to the Open Questions
section), but the current PEP has changed enough that I wanted to get
it out there first.

Notably changed:

* the API relative to object passing has changed somewhat drastically
(hopefully simpler and easier to understand), replacing "FIFO" with
"channel"
* added an examples section
* added an open questions section
* added a rejected ideas section
* added more items to the deferred functionality section
* the rationale section has moved down below the examples

Please let me know what you think. I'm especially interested in
feedback about the channels. Thanks!

-eric


++++++++++++++++++++++++++++++++++++++++++++++++

PEP: 554
Title: Multiple Interpreters in the Stdlib
Author: Eric Snow <ericsnow...@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2017-09-05
Python-Version: 3.7
Post-History:


Abstract
========

CPython has supported subinterpreters, with increasing levels of
support, since version 1.5. The feature has been available via the
C-API. [c-api]_ Subinterpreters operate in
`relative isolation from one another <Interpreter Isolation_>`_, which
provides the basis for an
`alternative concurrency model <Concurrency_>`_.

This proposal introduces the stdlib ``interpreters`` module. The module
will be `provisional <Provisional Status_>`_. It exposes the basic
functionality of subinterpreters already provided by the C-API.


Proposal
========

The ``interpreters`` module will be added to the stdlib. It will
provide a high-level interface to subinterpreters and wrap the low-level
``_interpreters`` module. The proposed API is inspired by the
``threading`` module. See the `Examples`_ section for concrete usage
and use cases.

API for interpreters
--------------------

The module provides the following functions:

``list_all()``::

Return a list of all existing interpreters.

``get_current()``::

Return the currently running interpreter.

``create()``::

Initialize a new Python interpreter and return it. The
interpreter will be created in the current thread and will remain
idle until something is run in it. The interpreter may be used
in any thread and will run in whichever thread calls
``interp.run()``.
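
Taken together, basic usage might look like the following sketch
(illustrative only, since the proposed module does not exist yet)::

    import interpreters

    main = interpreters.get_current()
    interp = interpreters.create()
    print(interpreters.list_all())    # includes both interpreters
    interp.run('print("spam")')       # runs in the current thread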


The module also provides the following class:

``Interpreter(id)``::

id:

The interpreter's ID (read-only).

is_running():

Return whether or not the interpreter is currently executing code.
Calling this on the current interpreter will always return True.

destroy():

Finalize and destroy the interpreter.

This may not be called on an already running interpreter. Doing
so results in a RuntimeError.

run(source_str, /, **shared):

Run the provided Python source code in the interpreter. Any
keyword arguments are added to the interpreter's execution
namespace. If any of the values are not supported for sharing
between interpreters then RuntimeError gets raised. Currently
only channels (see "create_channel()" below) are supported.

This may not be called on an already running interpreter. Doing
so results in a RuntimeError.

A "run()" call is quite similar to any other function call. Once
it completes, the code that called "run()" continues executing
(in the original interpreter). Likewise, if there is any uncaught
exception, it propagates into the code where "run()" was called.

The big difference is that "run()" executes the code in an
entirely different interpreter, with entirely separate state.
The state of the current interpreter in the current OS thread
is swapped out with the state of the target interpreter (the one
that will execute the code). When the target finishes executing,
the original interpreter gets swapped back in and its execution
resumes.

So calling "run()" will effectively cause the current Python
thread to pause. Sometimes you won't want that pause, in which
case you should make the "run()" call in another thread. To do
so, add a function that calls "run()" and then run that function
in a normal "threading.Thread".

Note that the interpreter's state is never reset, neither before
"run()" executes the code nor after. Thus the interpreter
state is preserved between calls to "run()". This includes
"sys.modules", the "builtins" module, and the internal state
of C extension modules.

Also note that "run()" executes in the namespace of the "__main__"
module, just like scripts, the REPL, "-m", and "-c". Just as
the interpreter's state is not ever reset, the "__main__" module
is never reset. You can imagine concatenating the code from each
"run()" call into one long script. This is the same as how the
REPL operates.

Supported code: source text.
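
For example, here is a sketch (against the proposed API) of how that
persistence plays out across calls::

    interp = interpreters.create()
    interp.run('answer = 42')      # binds a name in __main__
    interp.run('print(answer)')    # prints 42; the state persisted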

API for sharing data
--------------------

The mechanism for passing objects between interpreters is through
channels. A channel is a simplex FIFO similar to a pipe. The main
difference is that channels can be associated with zero or more
interpreters on either end. Unlike queues, which are also many-to-many,
channels have no buffer.

``create_channel()``::

Create a new channel and return (recv, send), the RecvChannel and
SendChannel corresponding to the ends of the channel. The channel
is not closed and destroyed (i.e. garbage-collected) until the number
of associated interpreters returns to 0.

An interpreter gets associated with a channel by calling its "send()"
or "recv()" method. That association gets dropped by calling
"close()" on the channel.

Both ends of the channel are supported "shared" objects (i.e. may be
safely shared by different interpreters). Thus they may be passed as
keyword arguments to "Interpreter.run()".

``list_all_channels()``::

Return a list of all open (RecvChannel, SendChannel) pairs.


``RecvChannel(id)``::

The receiving end of a channel. An interpreter may use this to
receive objects from another interpreter. At first only bytes will
be supported.

id:

The channel's unique ID.

interpreters:

The list of associated interpreters (those that have called
the "recv()" method).

__next__():

Return the next object from the channel. If none have been sent
then wait until the next send.

recv():

Return the next object from the channel. If none have been sent
then wait until the next send. If the channel has been closed
then EOFError is raised.

recv_nowait(default=None):

Return the next object from the channel. If none have been sent
then return the default. If the channel has been closed
then EOFError is raised.

close():

No longer associate the current interpreter with the channel (on
the receiving end). This is a noop if the interpreter isn't
already associated. Once an interpreter is no longer associated
with the channel, subsequent (or current) send() and recv() calls
from that interpreter will raise EOFError.

Once the number of associated interpreters on both ends drops to 0,
the channel is actually marked as closed. The Python runtime
will garbage collect all closed channels. Note that "close()" is
called automatically when the channel is no longer referenced in
the current interpreter.

This operation is idempotent. Return True if the current
interpreter was still associated with the receiving end of the
channel and False otherwise.


``SendChannel(id)``::

The sending end of a channel. An interpreter may use this to send
objects to another interpreter. At first only bytes will be
supported.

id:

The channel's unique ID.

interpreters:

The list of associated interpreters (those that have called
the "send()" method).

send(obj):

Send the object to the receiving end of the channel. Wait until
the object is received. If the channel does not support the
object then TypeError is raised. Currently only bytes are
supported. If the channel has been closed then EOFError is
raised.

send_nowait(obj):

Send the object to the receiving end of the channel. If the
object is received then return True. Otherwise return False.
If the channel does not support the object then TypeError is
raised. If the channel has been closed then EOFError is raised.

close():

No longer associate the current interpreter with the channel (on
the sending end). This is a noop if the interpreter isn't already
associated. Once an interpreter is no longer associated with the
channel, subsequent (or current) send() and recv() calls from that
interpreter will raise EOFError.

Once the number of associated interpreters on both ends drops to 0,
the channel is actually marked as closed. The Python runtime
will garbage collect all closed channels. Note that "close()" is
called automatically when the channel is no longer referenced in
the current interpreter.

This operation is idempotent. Return True if the current
interpreter was still associated with the sending end of the
channel and False otherwise.
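
Here is a sketch of the association life-cycle described above,
assuming ``recv_nowait()`` associates an interpreter the same way
``recv()`` does::

    r, s = interpreters.create_channel()
    data = r.recv_nowait(default=None)  # associates; nothing sent yet
    assert data is None
    r.close()   # drops the association; returns True
    r.close()   # idempotent no-op now; returns False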


Examples
========

Run isolated code
-----------------

::

    interp = interpreters.create()
    print('before')
    interp.run('print("during")')
    print('after')

Run in a thread
---------------

::

    interp = interpreters.create()
    def run():
        interp.run('print("during")')
    t = threading.Thread(target=run)
    print('before')
    t.start()
    print('after')

Pre-populate an interpreter
---------------------------

::

    interp = interpreters.create()
    interp.run("""if True:
        import some_lib
        import an_expensive_module
        some_lib.set_up()
        """)
    wait_for_request()
    interp.run("""if True:
        some_lib.handle_request()
        """)

Handling an exception
---------------------

::

    interp = interpreters.create()
    try:
        interp.run("""if True:
            raise KeyError
            """)
    except KeyError:
        print("got the error from the subinterpreter")

Synchronize using a channel
---------------------------

::

    interp = interpreters.create()
    r, s = interpreters.create_channel()
    def run():
        interp.run("""if True:
            reader.recv()
            print("during")
            reader.close()
            """,
            reader=r)
    t = threading.Thread(target=run)
    print('before')
    t.start()
    print('after')
    s.send(b'')
    s.close()

Sharing a file descriptor
-------------------------

::

    interp = interpreters.create()
    r1, s1 = interpreters.create_channel()
    r2, s2 = interpreters.create_channel()
    def run():
        interp.run("""if True:
            import os
            fd = int.from_bytes(
                    reader.recv(), 'big')
            for line in os.fdopen(fd):
                print(line)
            writer.send(b'')
            """,
            reader=r1, writer=s2)
    t = threading.Thread(target=run)
    t.start()
    with open('spamspamspam') as infile:
        fd = infile.fileno().to_bytes(1, 'big')
        s1.send(fd)
        r2.recv()

Passing objects via pickle
--------------------------

::

    interp = interpreters.create()
    r, s = interpreters.create_channel()
    interp.run("""if True:
        import pickle
        """,
        reader=r)
    def run():
        interp.run("""if True:
            data = reader.recv()
            while data:
                obj = pickle.loads(data)
                do_something(obj)
                data = reader.recv()
            reader.close()
            """,
            reader=r)
    t = threading.Thread(target=run)
    t.start()
    for obj in input:
        data = pickle.dumps(obj)
        s.send(data)
    s.send(b'')


Rationale
=========

Running code in multiple interpreters provides a useful level of
isolation within the same process. This can be leveraged in a number
of ways. Furthermore, subinterpreters provide a well-defined framework
in which such isolation may be extended.

CPython has supported subinterpreters, with increasing levels of
support, since version 1.5. While the feature has the potential
to be a powerful tool, subinterpreters have suffered from neglect
because they are not available directly from Python. Exposing the
existing functionality in the stdlib will help reverse the situation.

This proposal is focused on enabling the fundamental capability of
multiple isolated interpreters in the same Python process. This is a
new area for Python so there is relative uncertainty about the best
tools to provide as companions to subinterpreters. Thus we minimize
the functionality we add in the proposal as much as possible.

Concerns
--------

* "subinterpreters are not worth the trouble"

Some have argued that subinterpreters do not add sufficient benefit
to justify making them an official part of Python. Adding features
to the language (or stdlib) has a cost in increasing the size of
the language. So it must pay for itself. In this case, subinterpreters
provide a novel concurrency model focused on isolated threads of
execution. Furthermore, they present an opportunity for changes in
CPython that will allow simultaneous use of multiple CPU cores (currently
prevented by the GIL).

Alternatives to subinterpreters include threading, async, and
multiprocessing. Threading is limited by the GIL and async isn't
the right solution for every problem (nor for every person).
Multiprocessing is likewise valuable in some but not all situations.
Direct IPC (rather than via the multiprocessing module) provides
similar benefits but with the same caveat.

Notably, subinterpreters are not intended as a replacement for any of
the above. Certainly they overlap in some areas, but the benefits of
subinterpreters include isolation and (potentially) performance. In
particular, subinterpreters provide a direct route to an alternate
concurrency model (e.g. CSP) which has found success elsewhere and
will appeal to some Python users. That is the core value that the
``interpreters`` module will provide.

* "stdlib support for subinterpreters adds extra burden
on C extension authors"

In the `Interpreter Isolation`_ section below we identify ways in
which isolation in CPython's subinterpreters is incomplete. Most
notable is extension modules that use C globals to store internal
state. PEP 3121 and PEP 489 provide a solution for most of the
problem, but one still remains. [petr-c-ext]_ Until that is resolved,
C extension authors will face extra difficulty supporting
subinterpreters.

Consequently, projects that publish extension modules may face an
increased maintenance burden as their users start using subinterpreters,
where their modules may break. This situation is limited to modules
that use C globals (or use libraries that use C globals) to store
internal state.

Ultimately this comes down to a question of how often it will be a
problem in practice: how many projects would be affected, how often
their users will be affected, what the additional maintenance burden
will be for projects, and what the overall benefit of subinterpreters
is to offset those costs. The position of this PEP is that the actual
extra maintenance burden will be small and well below the threshold at
which subinterpreters are worth it.


About Subinterpreters
=====================

Shared data
-----------

Subinterpreters are inherently isolated (with caveats explained below),
in contrast to threads. This enables `a different concurrency model
<Concurrency_>`_ than is currently readily available in Python.
`Communicating Sequential Processes`_ (CSP) is the prime example.

A key component of this approach to concurrency is message passing. So
providing a message/object passing mechanism alongside ``Interpreter``
is a fundamental requirement. This proposal includes a basic mechanism
upon which more complex machinery may be built. That basic mechanism
draws inspiration from pipes, queues, and CSP's channels. [fifo]_

The key challenge here is that sharing objects between interpreters
faces complexity due in part to CPython's current memory model.
Furthermore, in this class of concurrency, the ideal is that objects
only exist in one interpreter at a time. However, this is not practical
for Python so we initially constrain supported objects to ``bytes``.
There are a number of strategies we may pursue in the future to expand
supported objects and object sharing strategies.

Note that the complexity of object sharing increases as subinterpreters
become more isolated, e.g. after GIL removal. So the mechanism for
message passing needs to be carefully considered. Keeping the API
minimal and initially restricting the supported types helps us avoid
further exposing any underlying complexity to Python users.

To make this work, the mutable shared state will be managed by the
Python runtime, not by any of the interpreters. Initially we will
support only one type of object for shared state: the channels provided
by ``create_channel()``. Channels, in turn, will carefully manage
passing objects between interpreters.

Interpreter Isolation
---------------------

CPython's interpreters are intended to be strictly isolated from each
other. Each interpreter has its own copy of all modules, classes,
functions, and variables. The same applies to state in C, including in
extension modules. The CPython C-API docs explain more. [caveats]_

However, there are ways in which interpreters share some state. First
of all, some process-global state remains shared:

* file descriptors
* builtin types (e.g. dict, bytes)
* singletons (e.g. None)
* underlying static module data (e.g. functions) for
builtin/extension/frozen modules

There are no plans to change this.

Second, some isolation is faulty due to bugs or implementations that did
not take subinterpreters into account. This includes things like
extension modules that rely on C globals. [cryptography]_ In these
cases bugs should be opened (some are already):

* readline module hook functions (http://bugs.python.org/issue4202)
* memory leaks on re-init (http://bugs.python.org/issue21387)

Finally, some potential isolation is missing due to the current design
of CPython. Improvements are underway to address gaps in this
area:

* interpreters share the GIL
* interpreters share memory management (e.g. allocators, gc)
* GC is not run per-interpreter [global-gc]_
* at-exit handlers are not run per-interpreter [global-atexit]_
* extensions using the ``PyGILState_*`` API are incompatible [gilstate]_

Concurrency
-----------

Concurrency is a challenging area of software development. Decades of
research and practice have led to a wide variety of concurrency models,
each with different goals. Most center on correctness and usability.

One class of concurrency models focuses on isolated threads of
execution that interoperate through some message passing scheme. A
notable example is `Communicating Sequential Processes`_ (CSP), upon
which Go's concurrency is based. The isolation inherent to
subinterpreters makes them well-suited to this approach.


Existing Usage
--------------

Subinterpreters are not a widely used feature. In fact, the only
documented case of widespread usage is
`mod_wsgi <https://github.com/GrahamDumpleton/mod_wsgi>`_. On the one
hand, this case provides confidence that existing subinterpreter support
is relatively stable. On the other hand, there isn't much of a sample
size from which to judge the utility of the feature.


Provisional Status
==================

The new ``interpreters`` module will be added with "provisional" status
(see PEP 411). This allows Python users to experiment with the feature
and provide feedback while still allowing us to adjust to that feedback.
The module will be provisional in Python 3.7 and we will make a decision
before the 3.8 release whether to keep it provisional, graduate it, or
remove it.


Alternate Python Implementations
================================

TBD


Open Questions
==============

Leaking exceptions across interpreters
--------------------------------------

As currently proposed, uncaught exceptions from ``run()`` propagate
to the frame that called it. However, this means that exception
objects are leaking across the inter-interpreter boundary. Likewise,
the frames in the traceback potentially leak.

While that might not be a problem currently, it would be a problem once
interpreters get better isolation relative to memory management (which
is necessary to stop sharing the GIL between interpreters). So the
semantics of how the exceptions propagate needs to be resolved.

Initial support for buffers in channels
---------------------------------------

An alternative to support for bytes in channels is support for
read-only buffers (the PEP 3119 kind). Then ``recv()`` would return
a memoryview to expose the buffer in a zero-copy way. This is similar
to what ``multiprocessing.Connection`` supports. [mp-conn]_

Switching to such an approach would help resolve questions of how
passing bytes through channels will work once we isolate memory
management in interpreters.
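
Under that alternative, the receiving side might look something like
this sketch (where ``r`` is a ``RecvChannel`` and ``handle()`` stands
in for arbitrary application code)::

    view = r.recv()          # a read-only memoryview, zero-copy
    assert isinstance(view, memoryview)
    handle(bytes(view))      # copy only if/when actually needed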


Deferred Functionality
======================

In the interest of keeping this proposal minimal, the following
functionality has been left out for future consideration. Note that
this is not a judgment against any of these capabilities, but rather
a deferral. That said, each is arguably valid.

Interpreter.call()
------------------

It would be convenient to run existing functions in subinterpreters
directly. ``Interpreter.run()`` could be adjusted to support this or
a ``call()`` method could be added::

Interpreter.call(f, *args, **kwargs)

This suffers from the same problem as sharing objects between
interpreters via queues. The minimal solution (running a source string)
is sufficient for us to get the feature out where it can be explored.
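
In the meantime, a rough approximation of ``call()`` can be built on
``run()``, a channel, and pickle (a sketch only, assuming the proposed
semantics; ``f`` and its arguments must be picklable and resolvable in
the subinterpreter)::

    import pickle
    import threading

    def call(interp, f, *args, **kwargs):
        r, s = interpreters.create_channel()
        def task():
            interp.run("""if True:
                import pickle
                f, args, kwargs = pickle.loads(ch.recv())
                f(*args, **kwargs)
                """,
                ch=r)
        t = threading.Thread(target=task)
        t.start()
        # send() blocks until the subinterpreter receives the payload.
        s.send(pickle.dumps((f, args, kwargs)))
        t.join()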

timeout arg to pop() and push()
-------------------------------

Typically functions that have a ``block`` argument also have a
``timeout`` argument. We can add it later if needed.

get_main()
----------

CPython has a concept of a "main" interpreter. This is the initial
interpreter created during CPython's runtime initialization. It may
be useful to identify the main interpreter. For instance, the main
interpreter should not be destroyed. However, for the basic
functionality of a high-level API a ``get_main()`` function is not
necessary. Furthermore, there is no requirement that a Python
implementation have a concept of a main interpreter. So until there's
a clear need we'll leave ``get_main()`` out.

Interpreter.run_in_thread()
---------------------------

This method would make a ``run()`` call for you in a thread. Doing this
using only ``threading.Thread`` and ``run()`` is relatively trivial so
we've left it out.

Synchronization Primitives
--------------------------

The ``threading`` module provides a number of synchronization primitives
for coordinating concurrent operations. This is especially necessary
due to the shared-state nature of threading. In contrast,
subinterpreters do not share state. Data sharing is restricted to
channels, which do away with the need for explicit synchronization. If
any sort of opt-in shared state support is added to subinterpreters in
the future, that same effort can introduce synchronization primitives
to meet that need.

CSP Library
-----------

A ``csp`` module would not be a large step away from the functionality
provided by this PEP. However, adding such a module is outside the
minimalist goals of this proposal.

Syntactic Support
-----------------

The ``Go`` language provides a concurrency model based on CSP, so
it's similar to the concurrency model that subinterpreters support.
``Go`` provides syntactic support, as well as several builtin concurrency
primitives, to make concurrency a first-class feature. Conceivably,
similar syntactic (and builtin) support could be added to Python using
subinterpreters. However, that is *way* outside the scope of this PEP!

Multiprocessing
---------------

The ``multiprocessing`` module could support subinterpreters in the same
way it supports threads and processes. In fact, the module's
maintainer, Davin Potts, has indicated this is a reasonable feature
request. However, it is outside the narrow scope of this PEP.

C-extension opt-in/opt-out
--------------------------

By using the ``PyModuleDef_Slot`` introduced by PEP 489, we could easily
add a mechanism by which C-extension modules could opt out of support
for subinterpreters. Then the import machinery, when operating in
a subinterpreter, would need to check the module for support. It would
raise an ImportError if unsupported.

Alternately we could support opting in to subinterpreter support.
However, that would probably exclude many more modules (unnecessarily)
than the opt-out approach.

The scope of adding the ModuleDef slot and fixing up the import
machinery is non-trivial, but could be worth it. It all depends on
how many extension modules break under subinterpreters. Given the
relatively few cases we know of through mod_wsgi, we can leave this
for later.

Poisoning channels
------------------

CSP has the concept of poisoning a channel. Once a channel has been
poisoned, any ``send()`` or ``recv()`` call on it will raise a special
exception, effectively ending execution in the interpreter that tried
to use the poisoned channel.

This could be accomplished by adding a ``poison()`` method to both ends
of the channel. The ``close()`` method could work if it had a ``force``
option to force the channel closed. Regardless, these semantics are
relatively specialized and can wait.

Sending channels over channels
------------------------------

Some advanced usage of subinterpreters could take advantage of the
ability to send channels over channels, in addition to bytes. Given
that channels will already be multi-interpreter safe, supporting them
in ``RecvChannel.recv()`` wouldn't be a big change. However, this can
wait until the basic functionality has been ironed out.

Resetting __main__
------------------

As proposed, every call to ``Interpreter.run()`` will execute in the
namespace of the interpreter's existing ``__main__`` module. This means
that data persists there between ``run()`` calls. Sometimes this isn't
desirable and you want to execute in a fresh ``__main__``. Also,
you don't necessarily want to leak objects there that you aren't using
any more.

Solutions include:

* a ``create()`` arg to indicate resetting ``__main__`` after each
``run`` call
* an ``Interpreter.reset_main`` flag to support opting in or out
after the fact
* an ``Interpreter.reset_main()`` method to opt in when desired

This isn't a critical feature initially. It can wait until later
if desirable.

Support passing ints in channels
--------------------------------

Passing ints around should be fine and ultimately is probably
desirable. However, we can get by with serializing them as bytes
for now. The goal is a minimal API for the sake of basic
functionality at first.
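
For example (with ``r``/``s`` a channel pair from ``create_channel()``,
and the two sides in different threads or interpreters, since channels
are unbuffered)::

    s.send((42).to_bytes(8, 'big'))           # sending side

    value = int.from_bytes(r.recv(), 'big')   # receiving side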

File descriptors and sockets in channels
----------------------------------------

Given that file descriptors and sockets are process-global resources,
support for passing them through channels is a reasonable idea. They
would be a good candidate for the first effort at expanding the types
that channels support. They aren't strictly necessary for the initial
API.


Rejected Ideas
==============

Explicit channel association
----------------------------

Interpreters are implicitly associated with channels upon ``recv()`` and
``send()`` calls. They are de-associated with ``close()`` calls. The
alternative would be explicit methods. It would be either
``add_channel()`` and ``remove_channel()`` methods on ``Interpreter``
objects or something similar on channel objects.

In practice, this level of management shouldn't be necessary for users.
So adding more explicit support would only add clutter to the API.

Use pipes instead of channels
-----------------------------

A pipe would be a simplex FIFO between exactly two interpreters. For
most use cases this would be sufficient. It could potentially simplify
the implementation as well. However, it isn't a big step to supporting
a many-to-many simplex FIFO via channels. Also, with pipes the API
ends up being slightly more complicated, requiring naming the pipes.

Use queues instead of channels
------------------------------

The main difference between queues and channels is that queues support
buffering. This would complicate the blocking semantics of ``recv()``
and ``send()``. Also, queues can be built on top of channels.
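
For instance, a buffered, non-blocking ``put()`` could be layered on
top of an unbuffered channel with a helper thread (a hypothetical
sketch, not part of this proposal)::

    import queue
    import threading

    class BufferedSender:

        def __init__(self, send_channel):
            self._chan = send_channel
            self._pending = queue.Queue()
            threading.Thread(target=self._drain, daemon=True).start()

        def put(self, data):
            # Returns immediately; the helper thread performs the
            # blocking send() once a receiver is ready.
            self._pending.put(data)

        def _drain(self):
            while True:
                self._chan.send(self._pending.get())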

"enumerate"
-----------

The ``list_all()`` function provides the list of all interpreters.
In the threading module, which partly inspired the proposed API, the
function is called ``enumerate()``. The name is different here to
avoid confusing Python users that are not already familiar with the
threading API. For them "enumerate" is rather unclear, whereas
"list_all" is clear.


References
==========

.. [c-api]
https://docs.python.org/3/c-api/init.html#sub-interpreter-support

.. _Communicating Sequential Processes:

.. [CSP]
https://en.wikipedia.org/wiki/Communicating_sequential_processes
https://github.com/futurecore/python-csp

.. [fifo]
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Pipe
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
https://docs.python.org/3/library/queue.html#module-queue
http://stackless.readthedocs.io/en/2.7-slp/library/stackless/channels.html
https://golang.org/doc/effective_go.html#sharing
http://www.jtolds.com/writing/2016/03/go-channels-are-bad-and-you-should-feel-bad/

.. [caveats]
https://docs.python.org/3/c-api/init.html#bugs-and-caveats

.. [petr-c-ext]
https://mail.python.org/pipermail/import-sig/2016-June/001062.html
https://mail.python.org/pipermail/python-ideas/2016-April/039748.html

.. [cryptography]
https://github.com/pyca/cryptography/issues/2299

.. [global-gc]
http://bugs.python.org/issue24554

.. [gilstate]
https://bugs.python.org/issue10915
http://bugs.python.org/issue15751

.. [global-atexit]
https://bugs.python.org/issue6531

.. [mp-conn]
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Connection


Copyright
=========

This document has been placed in the public domain.

Nick Coghlan

Sep 14, 2017, 12:01:08 AM
to Eric Snow, Python-Dev
On 14 September 2017 at 11:44, Eric Snow <ericsnow...@gmail.com> wrote:
> I've updated PEP 554 in response to feedback. (thanks all!) There
> are a few unresolved points (some of them added to the Open Questions
> section), but the current PEP has changed enough that I wanted to get
> it out there first.
>
> Notably changed:
>
> * the API relative to object passing has changed somewhat drastically
> (hopefully simpler and easier to understand), replacing "FIFO" with
> "channel"
> * added an examples section
> * added an open questions section
> * added a rejected ideas section
> * added more items to the deferred functionality section
> * the rationale section has moved down below the examples
>
> Please let me know what you think. I'm especially interested in
> feedback about the channels. Thanks!

I like the new pipe-like channels API more than the previous named
FIFO approach :)

> send(obj):
>
> Send the object to the receiving end of the channel. Wait until
> the object is received. If the channel does not support the
> object then TypeError is raised. Currently only bytes are
> supported. If the channel has been closed then EOFError is
> raised.

I still expect any form of object sharing to hinder your
per-interpreter GIL efforts, so restricting the initial implementation
to memoryview-only seems more future-proof to me.


> Pre-populate an interpreter
> ---------------------------
>
> ::
>
>     interp = interpreters.create()
>     interp.run("""if True:
>         import some_lib
>         import an_expensive_module
>         some_lib.set_up()
>         """)
>     wait_for_request()
>     interp.run("""if True:
>         some_lib.handle_request()
>         """)

I find the "if True:"'s sprinkled through the examples distracting, so
I'd prefer either:

1. Using textwrap.dedent; or
2. Assigning the code to a module level attribute

::
interp = interpreters.create()
setup_code = """\
import some_lib
import an_expensive_module
some_lib.set_up()
"""
interp.run(setup_code)
wait_for_request()

handler_code = """\
some_lib.handle_request()
"""
interp.run(handler_code)
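
For option 1, that would look something like:

::

    import textwrap

    interp.run(textwrap.dedent("""
        import some_lib
        import an_expensive_module
        some_lib.set_up()
    """))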

> Handling an exception
> ---------------------
>
> ::
>
>     interp = interpreters.create()
>     try:
>         interp.run("""if True:
>             raise KeyError
>             """)
>     except KeyError:
>         print("got the error from the subinterpreter")

As with the message passing through channels, I think you'll really
want to minimise any kind of implicit object sharing that may
interfere with future efforts to make the GIL truly an *interpreter*
lock, rather than the global process lock that it is currently.

One possible way to approach that would be to make the low level run()
API a more Go-style API rather than a Python-style one, and have it
return a (result, err) 2-tuple. "err.raise()" would then translate the
foreign interpreter's exception into a local interpreter exception,
but the *traceback* for that exception would be entirely within the
current interpreter.

Interpreters themselves will also need to be shared objects, as:

- they all have access to "interpreters.list_all()"
- when we do "interpreters.create_interpreter()", the calling
interpreter gets a reference to itself via
"interpreters.get_current()"

(These shared objects are what I suspect you may end up needing a
process global read/write lock to manage, by the way - I think it
would be great if you can figure out a way to avoid that, it's just
not entirely clear to me what that might look like. I do think you're
on the right track by prohibiting the destruction of an interpreter
that's currently running, and the destruction of channels that are
currently still associated with an interpreter)

> Interpreter Isolation
> ---------------------
>

This section is a really nice addition :)

> Existing Usage
> --------------
>
> Subinterpreters are not a widely used feature. In fact, the only
> documented case of wide-spread usage is
> `mod_wsgi <https://github.com/GrahamDumpleton/mod_wsgi>`_. On the one
> hand, this case provides confidence that existing subinterpreter support
> is relatively stable. On the other hand, there isn't much of a sample
> size from which to judge the utility of the feature.

Nathaniel pointed out that JEP embeds CPython subinterpreters inside
the JVM similar to the way that mod_wsgi embeds them inside Apache
httpd: https://github.com/ninia/jep/wiki/How-Jep-Works

> Open Questions
> ==============
>
> Leaking exceptions across interpreters
> --------------------------------------
>
> As currently proposed, uncaught exceptions from ``run()`` propagate
> to the frame that called it. However, this means that exception
> objects are leaking across the inter-interpreter boundary. Likewise,
> the frames in the traceback potentially leak.
>
> While that might not be a problem currently, it would be a problem once
> interpreters get better isolation relative to memory management (which
> is necessary to stop sharing the GIL between interpreters). So the
> semantics of how the exceptions propagate needs to be resolved.

As noted above, I think you *really* want to avoid leaking exceptions
in the initial implementation. A non-exception-based error signaling
mechanism would be one way to do that, similar to how the low-level
subprocess APIs actually report the return code, which higher level
APIs then turn into an exception.

resp.raise_for_status() does something similar for HTTP responses in
the requests API.

> Initial support for buffers in channels
> ---------------------------------------
>
> An alternative to support for bytes in channels is support for
> read-only buffers (the PEP 3119 kind). Then ``recv()`` would return
> a memoryview to expose the buffer in a zero-copy way. This is similar
> to what ``multiprocessing.Connection`` supports. [mp-conn]
>
> Switching to such an approach would help resolve questions of how
> passing bytes through channels will work once we isolate memory
> management in interpreters.

Exactly :)

> Resetting __main__
> ------------------
>
> As proposed, every call to ``Interpreter.run()`` will execute in the
> namespace of the interpreter's existing ``__main__`` module. This means
> that data persists there between ``run()`` calls. Sometimes this isn't
> desirable and you want to execute in a fresh ``__main__``.  Also,
> you don't necessarily want to leak objects there that you aren't using
> any more.
>
> Solutions include:
>
> * a ``create()`` arg to indicate resetting ``__main__`` after each
> ``run`` call
> * an ``Interpreter.reset_main`` flag to support opting in or out
> after the fact
> * an ``Interpreter.reset_main()`` method to opt in when desired
>
> This isn't a critical feature initially. It can wait until later
> if desirable.

I was going to note that you can already do this:

interp.run("globals().clear()")

However, that turns out to clear *too* much, since it also clobbers
all the __dunder__ attributes that the interpreter needs in a code
execution environment.

Either way, if you added this, I think it would make more sense as an
"importlib.util.reset_globals()" operation, rather than have it be
something specific to subinterpreters.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Yury Selivanov

Sep 14, 2017, 1:09:26 AM
to Nick Coghlan, Python-Dev
On Wed, Sep 13, 2017 at 11:56 PM, Nick Coghlan <ncog...@gmail.com> wrote:
[..]
>> send(obj):
>>
>> Send the object to the receiving end of the channel. Wait until
>> the object is received. If the channel does not support the
>> object then TypeError is raised. Currently only bytes are
>> supported. If the channel has been closed then EOFError is
>> raised.
>
> I still expect any form of object sharing to hinder your
> per-interpreter GIL efforts, so restricting the initial implementation
> to memoryview-only seems more future-proof to me.

+1. Working with memoryviews is as convenient as with bytes.

Yury

Nathaniel Smith

Sep 14, 2017, 1:29:17 AM
to Nick Coghlan, Python Dev
On Sep 13, 2017 9:01 PM, "Nick Coghlan" <ncog...@gmail.com> wrote:
On 14 September 2017 at 11:44, Eric Snow <ericsnow...@gmail.com> wrote:
>    send(obj):
>
>        Send the object to the receiving end of the channel.  Wait until
>        the object is received.  If the channel does not support the
>        object then TypeError is raised.  Currently only bytes are
>        supported.  If the channel has been closed then EOFError is
>        raised.

I still expect any form of object sharing to hinder your
per-interpreter GIL efforts, so restricting the initial implementation
to memoryview-only seems more future-proof to me.

I don't get it. With bytes, you can either share objects or copy them and the user can't tell the difference, so you can change your mind later if you want. But memoryviews require some kind of cross-interpreter strong reference to keep the underlying buffer object alive. So if you want to minimize object sharing, surely bytes are more future-proof.


> Handling an exception
> ---------------------
>
> ::
>
>    interp = interpreters.create()
>    try:
>        interp.run("""if True:
>            raise KeyError
>            """)
>    except KeyError:
>        print("got the error from the subinterpreter")

As with the message passing through channels, I think you'll really
want to minimise any kind of implicit object sharing that may
interfere with future efforts to make the GIL truly an *interpreter*
lock, rather than the global process lock that it is currently.

One possible way to approach that would be to make the low level run()
API a more Go-style API rather than a Python-style one, and have it
return a (result, err) 2-tuple. "err.raise()" would then translate the
foreign interpreter's exception into a local interpreter exception,
but the *traceback* for that exception would be entirely within the
current interpreter.

It would also be reasonable to simply not return any value/exception from run() at all, or maybe just a bool for whether there was an unhandled exception. Any high level API is going to be injecting code on both sides of the interpreter boundary anyway, so it can do whatever exception and traceback translation it wants to.


> Resetting __main__
> ------------------
>
> As proposed, every call to ``Interpreter.run()`` will execute in the
> namespace of the interpreter's existing ``__main__`` module.  This means
> that data persists there between ``run()`` calls.  Sometimes this isn't
> desirable and you want to execute in a fresh ``__main__``.  Also,
> you don't necessarily want to leak objects there that you aren't using
> any more.
>
> Solutions include:
>
> * a ``create()`` arg to indicate resetting ``__main__`` after each
>   ``run`` call
> * an ``Interpreter.reset_main`` flag to support opting in or out
>   after the fact
> * an ``Interpreter.reset_main()`` method to opt in when desired
>
> This isn't a critical feature initially.  It can wait until later
> if desirable.

I was going to note that you can already do this:

    interp.run("globals().clear()")

However, that turns out to clear *too* much, since it also clobbers
all the __dunder__ attributes that the interpreter needs in a code
execution environment.

Either way, if you added this, I think it would make more sense as an
"importlib.util.reset_globals()" operation, rather than have it be
something specific to subinterpreters.

This is another point where the API could reasonably say that if you want clean namespaces then you should do that yourself (e.g. by setting up your own globals dict and using it to execute any post-bootstrap code).

-n

Nick Coghlan

Sep 14, 2017, 8:46:34 PM
to Nathaniel Smith, Python Dev
On 14 September 2017 at 15:27, Nathaniel Smith <n...@pobox.com> wrote:
> On Sep 13, 2017 9:01 PM, "Nick Coghlan" <ncog...@gmail.com> wrote:
>
> On 14 September 2017 at 11:44, Eric Snow <ericsnow...@gmail.com>
> wrote:
>> send(obj):
>>
>> Send the object to the receiving end of the channel. Wait until
>> the object is received. If the channel does not support the
>> object then TypeError is raised. Currently only bytes are
>> supported. If the channel has been closed then EOFError is
>> raised.
>
> I still expect any form of object sharing to hinder your
> per-interpreter GIL efforts, so restricting the initial implementation
> to memoryview-only seems more future-proof to me.
>
>
> I don't get it. With bytes, you can either share objects or copy them and
> the user can't tell the difference, so you can change your mind later if you
> want.
> But memoryviews require some kind of cross-interpreter strong
> reference to keep the underlying buffer object alive. So if you want to
> minimize object sharing, surely bytes are more future-proof.

Not really, because the only way to ensure object separation (i.e. no
refcounted objects accessible from multiple interpreters at once) with
a bytes-based API would be to either:

1. Always copy (eliminating most of the low overhead communications
benefits that subinterpreters may offer over multiple processes)
2. Make the bytes implementation more complicated by allowing multiple
bytes objects to share the same underlying storage while presenting as
distinct objects in different interpreters
3. Make the output on the receiving side not actually a bytes object,
but instead a view onto memory owned by another object in a different
interpreter (a "memory view", one might say)

And yes, using memory views for this does mean defining either a
subclass or a mediating object that not only keeps the originating
object alive until the receiving memoryview is closed, but also
retains a reference to the originating interpreter so that it can
switch to it when it needs to manipulate the source object's refcount
or call one of the buffer methods.

Yury and I are fine with that, since it means that either the sender
*or* the receiver can decide to copy the data (e.g. by calling
bytes(obj) before sending, or bytes(view) after receiving), and in the
meantime, the object holding the cross-interpreter view knows that it
needs to switch interpreters (and hence acquire the sending
interpreter's GIL) before doing anything with the source object.

The reason we're OK with this is that it means that only reading a new
message from a channel (i.e. creating a cross-interpreter view) or
discarding a previously read message (i.e. closing a cross-interpreter
view) will be synchronisation points where the receiving interpreter
necessarily needs to acquire the sending interpreter's GIL.

By contrast, if we allow an actual bytes object to be shared, then
either every INCREF or DECREF on that bytes object becomes a
synchronisation point, or else we end up needing some kind of
secondary per-interpreter refcount where the interpreter doesn't drop
its shared reference to the original object in its source interpreter
until the internal refcount in the borrowing interpreter drops to
zero.

>> Handling an exception
>> ---------------------
> It would also be reasonable to simply not return any value/exception from
> run() at all, or maybe just a bool for whether there was an unhandled
> exception. Any high level API is going to be injecting code on both sides of
> the interpreter boundary anyway, so it can do whatever exception and
> traceback translation it wants to.

So any more detailed response would *have* to come back as a channel message?

That sounds like a reasonable option to me, too, especially since
module level code doesn't have a return value as such - you can really
only say "it raised an exception (and this was the exception it
raised)" or "it reached the end of the code without raising an
exception".

Given that, I think subprocess.run() (with check=False) is the right
API precedent here:
https://docs.python.org/3/library/subprocess.html#subprocess.run

That always returns subprocess.CompletedProcess, and then you can call
"cp.check_returncode()" to get it to raise
subprocess.CalledProcessError for non-zero return codes.
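
i.e. something like:

::

    import subprocess

    cp = subprocess.run(['python', '-c', 'raise SystemExit(1)'])
    print(cp.returncode)    # 1
    cp.check_returncode()   # raises subprocess.CalledProcessError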

For interpreter.run(), we could keep the initial RunResult *really*
simple and only report back:

* source: the source code passed to run()
* shared: the keyword args passed to run() (name chosen to match
functools.partial)
* completed: completed execution without raising an exception? (True
if yes, False otherwise)

Whether or not to report more details for a raised exception, and
provide some mechanism to reraise it in the calling interpreter could
then be deferred until later.
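
As a sketch, that minimal version could be as simple as:

::

    from collections import namedtuple

    # Fields as listed above; purely illustrative.
    RunResult = namedtuple('RunResult', 'source shared completed')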

The subprocess.run() comparison does make me wonder whether this might
be a more future-proof signature for Interpreter.run() though:

def run(source_str, /, *, channels=None):
...

That way channels can be a namespace *specifically* for passing in
channels, and can be reported as such on RunResult. If we decide to
allow arbitrary shared objects in the future, or add flag options like
"reraise=True" to reraise exceptions from the subinterpreter in the
current interpreter, we'd have that ability, rather than having the
entire potential keyword namespace taken up for passing shared
objects.

Nathaniel Smith

Sep 14, 2017, 10:06:08 PM
to Nick Coghlan, Python Dev
Ah, that makes more sense.

I am nervous that allowing arbitrary memoryviews gives a *little* more
power than we need or want. I like that the current API can reasonably
be emulated using subprocesses -- it opens up the door for backports,
compatibility support on language implementations that don't support
subinterpreters, direct benchmark comparisons between the two
implementation strategies, etc. But if we allow arbitrary memoryviews,
then this requires that you can take (a) an arbitrary object, not
specified ahead of time, and (b) provide two read-write views on it in
separate interpreters such that modifications made in one are
immediately visible in the other. Subprocesses can do one or the other
-- they can copy arbitrary data, and if you warn them ahead of time
when you allocate the buffer, they can do real zero-copy shared
memory. But the combination is really difficult.

It'd be one thing if this were like a key feature that gave
subinterpreters an advantage over subprocesses, but it seems really
unlikely to me that a library won't know ahead of time when it's
filling in a buffer to be transferred, and if anything it seems like
we'd rather not expose read-write shared mappings in any case. It's
extremely non-trivial to do right [1].

tl;dr: let's not rule out a useful implementation strategy based on a
feature we don't actually need.

One alternative would be your option (3) -- you can put bytes in and
get memoryviews out, and since bytes objects are immutable it's OK.

[1] https://en.wikipedia.org/wiki/Memory_model_(programming)

> That way channels can be a namespace *specifically* for passing in
> channels, and can be reported as such on RunResult.

Would channels be a dict, or...?

-n

--
Nathaniel J. Smith -- https://vorpus.org

Nick Coghlan

Sep 15, 2017, 12:25:57 AM
to Nathaniel Smith, Python Dev
One constraint we'd want to impose is that the memory view in the
receiving interpreter should always be read-only - while we don't
currently expose the ability to request that at the Python layer,
memoryviews *do* support the creation of read-only views at the C API
layer (which then gets reported to Python code via the "view.readonly"
attribute).

While that change alone is enough to preserve the simplex nature of
the channel, it wouldn't be enough to prevent the *sender* from
mutating the buffer contents and having that change be visible in the
recipient.

In that regard it may make sense to maintain both restrictions
initially (as you suggested below): only accept bytes on the sending
side (to prevent mutation by the sender), and expose that as a
read-only memory view on the receiving side (to allow for zero-copy
data sharing without allowing mutation by the receiver).

> It'd be one thing if this were like a key feature that gave
> subinterpreters an advantage over subprocesses, but it seems really
> unlikely to me that a library won't know ahead of time when it's
> filling in a buffer to be transferred, and if anything it seems like
> we'd rather not expose read-write shared mappings in any case. It's
> extremely non-trivial to do right [1].
>
> tl;dr: let's not rule out a useful implementation strategy based on a
> feature we don't actually need.

Yeah, the description Eric currently has in the PEP is a summary of a
much longer suggestion Yury, Neil Schemenauer and I put together while
waiting for our flights following the core dev sprint, and the full
version had some of these additional constraints on it (most notably
the "read-only in the receiving interpreter" one).

> One alternative would be your option (3) -- you can put bytes in and
> get memoryviews out, and since bytes objects are immutable it's OK.

Indeed, I think that will be a sensible starting point. However, I
genuinely want to allow for zero-copy sharing of NumPy arrays
eventually, as that's where I think this idea gets most interesting:
the potential to allow for multiple parallel read operations on a
given NumPy array *in Python* (rather than Cython or C) without
running afoul of the GIL, and without needing to mess about with the
complexities of operating system level IPC.

>>>> Handling an exception
>> That way channels can be a namespace *specifically* for passing in
>> channels, and can be reported as such on RunResult. If we decide to
>> allow arbitrary shared objects in the future, or add flag options like
>> "reraise=True" to reraise exceptions from the subinterpreter in the
>> current interpreter, we'd have that ability, rather than having the
>> entire potential keyword namespace taken up for passing shared
>> objects.
>
> Would channels be a dict, or...?

Yeah, it would be a direct replacement for the way the current draft
is proposing to use the keywords dict - it would just be a separate
dictionary instead.

It does occur to me that if we wanted to align with the way the
`runpy` module spells that concept, we'd call the option
`init_globals`, but I'm thinking it will be better to only allow
channels to be passed through directly, and require that everything
else be sent through a channel.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Nick Coghlan

Sep 15, 2017, 1:37:27 AM
to Eric Snow, Python-Dev
On 14 September 2017 at 11:44, Eric Snow <ericsnow...@gmail.com> wrote:
> About Subinterpreters
> =====================
>
> Shared data
> -----------

[snip]

> To make this work, the mutable shared state will be managed by the
> Python runtime, not by any of the interpreters. Initially we will
> support only one type of objects for shared state: the channels provided
> by ``create_channel()``. Channels, in turn, will carefully manage
> passing objects between interpreters.

Something I think you may want to explicitly call out as *not* being
shared is the thread objects in threading.enumerate(), as the way that
works in the current implementation makes sense, but isn't
particularly obvious (what I have below comes from experimenting with
your branch at https://github.com/python/cpython/pull/1748).

Specifically, what happens is that the operating system thread
underlying the existing interpreter thread that calls interp.run()
gets borrowed as the operating system thread underlying the MainThread
object in the called interpreter. That MainThread object then gets
preserved in the interpreter's interpreter state, but the mapping to
an underlying OS thread will change freely based on who's calling into
it. From outside an interpreter, you *can't* request to run code in
subthreads directly - you'll always run your given code in the main
thread, and it will be up to that to dispatch requests to subthreads.

Beyond the thread lending that happens when you call interp.run()
(where one of your threads gets borrowed as the other interpreter's
main thread), each interpreter otherwise maintains a completely
disjoint set of thread objects that it is solely responsible for.

This also clarifies for me what it means for an interpreter to be a
"main" interpreter: it's the interpreter who's main thread actually
corresponds to the main thread of the overall operating system
process, rather than being temporarily borrowed from another
interpreter.

We're going to have to put some thought into how we want that to
interact with the signal handling logic - right now, I believe *any*
main thread will consider it its responsibility to process signals
delivered to the runtime (and embedding applications avoid the
potential problems arising from that by simply not installing the
CPython signal handlers in the first place), and we probably want to
change that condition to be "the main thread in the main interpreter".

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Antoine Pitrou

Sep 18, 2017, 6:48:31 AM
to pytho...@python.org

Hi,

First my high-level opinion about the PEP: the CSP model can probably
already be implemented using Queues. To me, the interesting promise of
subinterpreters is if they allow to remove the GIL while sharing memory
for big objects (such as Numpy arrays). This means the PEP should
probably focus on potential concurrency improvements rather than try to
faithfully follow the CSP model.

Other than that, a bunch of detailed comments follow:

On Wed, 13 Sep 2017 18:44:31 -0700
Eric Snow <ericsnow...@gmail.com> wrote:
>
> API for interpreters
> --------------------
>
> The module provides the following functions:
>
> ``list_all()``::
>
> Return a list of all existing interpreters.

See my naming proposal in the previous thread.

>
> run(source_str, /, **shared):
>
> Run the provided Python source code in the interpreter. Any
> keyword arguments are added to the interpreter's execution
> namespace.

"Execution namespace" specifically means the __main__ module in the
target interpreter, right?

> If any of the values are not supported for sharing
> between interpreters then RuntimeError gets raised. Currently
> only channels (see "create_channel()" below) are supported.
>
> This may not be called on an already running interpreter. Doing
> so results in a RuntimeError.

I would distinguish between both error cases: RuntimeError for calling
run() on an already running interpreter, ValueError for values which
are not supported for sharing.

> Likewise, if there is any uncaught
> exception, it propagates into the code where "run()" was called.

That makes it a bit harder to differentiate with errors raised by run()
itself (see above), though how much of an annoyance this is remains
unclear. The more contentious implication, though, is that it forces the
interpreter to support migration of arbitrary objects from one
interpreter to another (since a traceback keeps all local variables
alive).

> API for sharing data
> --------------------
>
> The mechanism for passing objects between interpreters is through
> channels. A channel is a simplex FIFO similar to a pipe. The main
> difference is that channels can be associated with zero or more
> interpreters on either end.

So it seems channels have become more complicated now? Is it important
to support multi-producer multi-consumer channels?

> Unlike queues, which are also many-to-many,
> channels have no buffer.

How does it work? Does send() block until someone else calls recv()?
That does not sound like a good idea to me. I don't think it's a
coincidence that the most varied kinds of I/O (from socket or file IO
to threading Queues to multiprocessing Pipes) have non-blocking send().

send() blocking until someone else calls recv() is not only bad for
performance, it also increases the likelihood of deadlocks.

> recv_nowait(default=None):
>
> Return the next object from the channel. If none have been sent
> then return the default. If the channel has been closed
> then EOFError is raised.
>
> close():
>
> No longer associate the current interpreter with the channel (on
> the receiving end). This is a noop if the interpreter isn't
> already associated. Once an interpreter is no longer associated
> with the channel, subsequent (or current) send() and recv() calls
> from that interpreter will raise EOFError.

EOFError normally means the *other* (sending) side has closed the
channel (but it becomes complicated with a multi-producer multi-consumer
setup...). When *this* side has closed the channel, we should raise
ValueError.

> The Python runtime
> will garbage collect all closed channels. Note that "close()" is
> automatically called when it is no longer used in the current
> interpreter.

"No longer used" meaning it loses all references in this interpreter?

> send(obj):
>
> Send the object to the receiving end of the channel. Wait until
> the object is received. If the channel does not support the
> object then TypeError is raised. Currently only bytes are
> supported. If the channel has been closed then EOFError is
> raised.

Similar remark as above (EOFError vs. ValueError).
More generally, send() raising EOFError sounds unheard of.

A sidenote: context manager support (__enter__ / __exit__) on channels
would sound more useful to me than iteration support.

> Initial support for buffers in channels
> ---------------------------------------
>
> An alternative to support for bytes in channels is support for
> read-only buffers (the PEP 3119 kind).

Probably you mean PEP 3118.

> Then ``recv()`` would return
> a memoryview to expose the buffer in a zero-copy way.

It will probably not do much if you can only pass buffers and not
structured objects, because unserializing (e.g. unpickling) from a
buffer will still copy memory around.

To pass a Numpy array, for example, you not only need to pass its
contents but also its metadata (its value type -- named "dtype" --, its
shape and strides). This may be serialized as simple tuples of atomic
types (str, int, bytes, other tuples), but you want to include a
memoryview of the data area somewhere in those tuples.

(and, of course, at some point, this will feel like reinventing
pickle :-)) but pickle has no mechanism to avoid memory copies, so it
can't readily be reused here -- otherwise you're just reinventing
multiprocessing...

> timeout arg to pop() and push()
> -------------------------------

pop() and push() don't exist anymore :-)

> Synchronization Primitives
> --------------------------
>
> The ``threading`` module provides a number of synchronization primitives
> for coordinating concurrent operations. This is especially necessary
> due to the shared-state nature of threading. In contrast,
> subinterpreters do not share state. Data sharing is restricted to
> channels, which do away with the need for explicit synchronization.

I think this rationale confuses Python-level data sharing with
process-level data sharing. The main point of subinterpreters
(compared to multiprocessing) is that they live in the same OS
process. So it's really not true that you can't share a low-level
synchronization primitive (say a semaphore) between subinterpreters.

(also see multiprocessing/synchronize.py, which implements all
synchronization primitives using basic low-level semaphores)

> Solutions include:
>
> * a ``create()`` arg to indicate resetting ``__main__`` after each
> ``run`` call
> * an ``Interpreter.reset_main`` flag to support opting in or out
> after the fact
> * an ``Interpreter.reset_main()`` method to opt in when desired

This would all be a false promise. Persistent state lives in other
places than __main__ (for example the loaded modules and their
respective configurations - think logging or decimal).

> Use queues instead of channels
> ------------------------------
>
> The main difference between queues and channels is that queues support
> buffering. This would complicate the blocking semantics of ``recv()``
> and ``send()``. Also, queues can be built on top of channels.

But buffering with background threads in pure Python will be orders
of magnitude slower than optimized buffering in a custom low-level
implementation. It would be a pity if a subinterpreters Queue ended
up as slow as a multiprocessing Queue.

Regards

Antoine.

Eric Snow

unread,
Sep 22, 2017, 9:10:48 PM9/22/17
to Antoine Pitrou, Python-Dev
Thanks for the feedback, Antoine. Sorry for the delay; it's been a
busy week for me. I just pushed an updated PEP to the repo. Once
I've sorted out the question of passing bytes through channels I plan
on posting the PEP to the list again for another round of discussion.
In the meantime, I've replied below in-line.

-eric

On Mon, Sep 18, 2017 at 4:46 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
> First my high-level opinion about the PEP: the CSP model can probably
> already be implemented using Queues. To me, the interesting promise of
> subinterpreters is whether they allow removing the GIL while sharing
> memory for big objects (such as Numpy arrays). This means the PEP
> should probably focus on potential concurrency improvements rather
> than try to faithfully follow the CSP model.

Please elaborate. I'm interested in understanding what you mean here.
Do you have some subinterpreter-based concurrency improvements in
mind? What aspect of CSP is the PEP following too faithfully?

>> ``list_all()``::
>>
>> Return a list of all existing interpreters.
>
> See my naming proposal in the previous thread.

Sorry, your previous comment slipped through the cracks. You suggested:

As for the naming, let's make it both unconfusing and explicit?
How about three functions: `all_interpreters()`, `running_interpreters()`
and `idle_interpreters()`, for example?

As to "all_interpreters()", I suppose it's the difference between
"interpreters.all_interpreters()" and "interpreters.list_all()". To
me the latter looks better.

As to "running_interpreters()" and "idle_interpreters()", I'm not sure
what the benefit would be. You can compose either list manually with
a simple comprehension:

[interp for interp in interpreters.list_all() if interp.is_running()]
[interp for interp in interpreters.list_all() if not interp.is_running()]

>> run(source_str, /, **shared):
>>
>> Run the provided Python source code in the interpreter. Any
>> keyword arguments are added to the interpreter's execution
>> namespace.
>
> "Execution namespace" specifically means the __main__ module in the
> target interpreter, right?

Right. It's explained in more detail a little further down and
elsewhere in the PEP. I've updated the PEP to explicitly mention
__main__ here too.

>> If any of the values are not supported for sharing
>> between interpreters then RuntimeError gets raised. Currently
>> only channels (see "create_channel()" below) are supported.
>>
>> This may not be called on an already running interpreter. Doing
>> so results in a RuntimeError.
>
> I would distinguish between both error cases: RuntimeError for calling
> run() on an already running interpreter, ValueError for values which
> are not supported for sharing.

Good point.

>> Likewise, if there is any uncaught
>> exception, it propagates into the code where "run()" was called.
>
> That makes it a bit harder to differentiate from errors raised by run()
> itself (see above), though how much of an annoyance this is remains
> unclear. The more contentious implication, though, is that it forces the
> interpreter to support migration of arbitrary objects from one
> interpreter to another (since a traceback keeps all local variables
> alive).

Yeah, the proposal to propagate exceptions out of the subinterpreter
is still rather weak. I've added some notes to the PEP about this
open issue.

>> The mechanism for passing objects between interpreters is through
>> channels. A channel is a simplex FIFO similar to a pipe. The main
>> difference is that channels can be associated with zero or more
>> interpreters on either end.
>
> So it seems channels have become more complicated now? Is it important
> to support multi-producer multi-consumer channels?

To me it made the API simpler. The change did introduce the "close()"
method, which I suppose could be confusing. However, I'm sure that in
practice it won't be. In contrast, the FIFO/pipe-based API that I had
before required passing names around, required more calls, required
managing the channel/interpreter relationship more carefully, and made
it hard to follow that relationship.

>> Unlike queues, which are also many-to-many,
>> channels have no buffer.
>
> How does it work? Does send() block until someone else calls recv()?
> That does not sound like a good idea to me.

Correct "send()" blocks until the other end receives (if ever).
Likewise "recv()" blocks until the other end sends. This specific
behavior is probably the main thing I borrowed from CSP. It is *the*
synchronization mechanism. Given the isolated nature of
subinterpreters, I consider using this concept from CSP to be a good
fit.

> I don't think it's a
> coincidence that the most varied kinds of I/O (from socket or file IO
> to threading Queues to multiprocessing Pipes) have non-blocking send().

Interestingly, you can set sockets to blocking mode, in which case
send() will block until there is room in the kernel buffer. Likewise,
queue.Queue.put() supports blocking, in addition to providing a
put_nowait() method.

Note that the PEP provides "recv_nowait()" and "send_nowait()" (names
inspired by queue.Queue), allowing for a non-blocking send. It's just
not the default. I deliberated for a little while on which one to
make the default.

In the end I went with blocking-by-default to stick to the CSP model.
However, I want to do what's most practical for users. I can imagine
folks at first not expecting blocking send by default. However, it
otherwise isn't clear yet which one is better for interpreter
channels. I'll add an "open question" about switching to
non-blocking-by-default for send().

> send() blocking until someone else calls recv() is not only bad for
> performance,

What is the performance problem?

> it also increases the likelihood of deadlocks.

How much of a problem will deadlocks be in practice? (FWIW, CSP
provides rigorous guarantees about deadlock detection (which Go
leverages), though I'm not sure how much benefit that can offer such a
dynamic language as Python.) Regardless, I'll make sure the PEP
discusses deadlocks.

> EOFError normally means the *other* (sending) side has closed the
> channel (but it becomes complicated with a multi-producer multi-consumer
> setup...). When *this* side has closed the channel, we should raise
> ValueError.

I've fixed this in the PEP.

>> The Python runtime
>> will garbage collect all closed channels. Note that "close()" is
>> automatically called when it is no longer used in the current
>> interpreter.
>
> "No longer used" meaning it loses all references in this interpreter?

Correct. I've clarified this in the PEP.

> Similar remark as above (EOFError vs. ValueError).
> More generally, send() raising EOFError sounds unheard of.

Hmm. I've fixed this in the PEP, but perhaps using EOFError here (and
even for recv()) isn't right. I was drawing inspiration from pipes,
but certainly the semantics aren't exactly the same. So it may make
sense to use something else less I/O-related, like a new exception
type in the "interpreters" module. I'll make a note in the PEP about
this.

> A sidenote: context manager support (__enter__ / __exit__) on channels
> would sound more useful to me than iteration support.

Yeah, I can see that. FWIW, I've dropped __next__() from the PEP.
I've also added a note about adding context manager support.

>> An alternative to support for bytes in channels is support for
>> read-only buffers (the PEP 3119 kind).
>
> Probably you mean PEP 3118.

Yep. :)

>> Then ``recv()`` would return
>> a memoryview to expose the buffer in a zero-copy way.
>
> It will probably not do much if you can only pass buffers and not
> structured objects, because unserializing (e.g. unpickling) from a
> buffer will still copy memory around.
>
> To pass a Numpy array, for example, you not only need to pass its
> contents but also its metadata (its value type -- named "dtype" --, its
> shape and strides). This may be serialized as simple tuples of atomic
> types (str, int, bytes, other tuples), but you want to include a
> memoryview of the data area somewhere in those tuples.
>
> (and, of course, at some point, this will feel like reinventing
> pickle :-)) but pickle has no mechanism to avoid memory copies, so it
> can't readily be reused here -- otherwise you're just reinventing
> multiprocessing...

I'm still working through all the passing-buffers-through-channels
feedback, so I'll defer on a reply for now. :)

>> timeout arg to pop() and push()
>> -------------------------------
>
> pop() and push() don't exist anymore :-)

Fixed! :)

>> Synchronization Primitives
>> --------------------------
>>
>> The ``threading`` module provides a number of synchronization primitives
>> for coordinating concurrent operations. This is especially necessary
>> due to the shared-state nature of threading. In contrast,
>> subinterpreters do not share state. Data sharing is restricted to
>> channels, which do away with the need for explicit synchronization.
>
> I think this rationale confuses Python-level data sharing with
> process-level data sharing. The main point of subinterpreters
> (compared to multiprocessing) is that they live in the same OS
> process. So it's really not true that you can't share a low-level
> synchronization primitive (say a semaphore) between subinterpreters.

I'm not sure I understand your concern here. Perhaps I used the word
"sharing" too ambiguously? By "sharing" I mean that the two actors
have read access to something that at least one of them can modify.
If they both only have read-only access then it's effectively the same
as if they are not sharing.

While I can imagine the *possibility* (some day) of an opt-in
mechanism to share objects (r/rw or rw/rw), that is definitely not a
part of this PEP. I expect that in reality we will only ever pass
immutable data between interpreters. So I'm unclear on what need
there might be for any synchronization primitives other than what is
inherent to channels.

>> * a ``create()`` arg to indicate resetting ``__main__`` after each
>> ``run`` call
>> * an ``Interpreter.reset_main`` flag to support opting in or out
>> after the fact
>> * an ``Interpreter.reset_main()`` method to opt in when desired
>
> This would all be a false promise. Persistent state lives in other
> places than __main__ (for example the loaded modules and their
> respective configurations - think logging or decimal).

I've added a bit more explanation to the PEP to clarify this point.

>> The main difference between queues and channels is that queues support
>> buffering. This would complicate the blocking semantics of ``recv()``
>> and ``send()``. Also, queues can be built on top of channels.
>
> But buffering with background threads in pure Python will be orders
> of magnitude slower than optimized buffering in a custom low-level
> implementation. It would be a pity if a subinterpreters Queue ended
> up as slow as a multiprocessing Queue.

I agree. I'm entirely open to supporting other object-passing types,
including adding low-level implementations. I've added a note to the
PEP to that effect.

However, I wanted to start off with the most basic object-passing
type, and I felt that channels provides the simplest solution. My
goal is to get a basic API landed in 3.7 and then build on it from
there for 3.8.

That said, in the interest of enabling extra utility in the near-term,
I expect that we will be able to design the PyInterpreterState changes
(few as they are) in such a way that a C-extension could implement an
efficient multi-interpreter Queue type that would run under 3.7.
Actually, would that be strictly necessary if you can interact with
channels without the GIL in the C-API? Regardless, I'll make a note
in the PEP about the relationship between C-API and implementing an
efficient multi-interpreter Queue. I suppose that means I need to add
C-API changes to the PEP (which I had wanted to avoid).

Antoine Pitrou

unread,
Sep 23, 2017, 5:49:42 AM9/23/17
to pytho...@python.org

Hi Eric,

On Fri, 22 Sep 2017 19:09:01 -0600
Eric Snow <ericsnow...@gmail.com> wrote:
>
> Please elaborate. I'm interested in understanding what you mean here.
> Do you have some subinterpreter-based concurrency improvements in
> mind? What aspect of CSP is the PEP following too faithfully?

See below the discussion of blocking send()s :-)

> As to "running_interpreters()" and "idle_interpreters()", I'm not sure
> what the benefit would be. You can compose either list manually with
> a simple comprehension:
>
> [interp for interp in interpreters.list_all() if interp.is_running()]
> [interp for interp in interpreters.list_all() if not interp.is_running()]

There is an inherent race condition in doing that, at least if
interpreters are running in multiple threads (which I assume is going
to be the overwhelmingly dominant usage model). That is why I'm proposing all
three variants.

> > I don't think it's a
> > coincidence that the most varied kinds of I/O (from socket or file IO
> > to threading Queues to multiprocessing Pipes) have non-blocking send().
>
> Interestingly, you can set sockets to blocking mode, in which case
> send() will block until there is room in the kernel buffer.

Yes, but there *is* a kernel buffer. Which is the whole point of my
comment: most alike primitives have internal buffering to prevent the
user-facing send() API from blocking in the common case.

> Likewise,
> queue.Queue.send() supports blocking, in addition to providing a
> put_nowait() method.

queue.Queue.put() never blocks in the usual case (*), which is that of
an unbounded queue. Only bounded queues (created with an explicit
non-zero maxsize parameter) can block in Queue.put().

(*) and therefore also never deadlocks :-)
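
To illustrate with the stdlib as it is today:

    import queue

    q = queue.Queue()              # unbounded: put() always returns
    q.put(b"event")                # never blocks

    b = queue.Queue(maxsize=1)     # bounded: put() can block
    b.put(b"first")                # fills the only slot
    b.put(b"second", timeout=1)    # blocks, then raises queue.Full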

> Note that the PEP provides "recv_nowait()" and "send_nowait()" (names
> inspired by queue.Queue), allowing for a non-blocking send.

True, but it's not the same thing at all. In the objects I mentioned,
send() mostly doesn't block and doesn't fail either. In your model,
send_nowait() will routinely fail with an error if a recipient isn't
immediately available to recv the data.

> > send() blocking until someone else calls recv() is not only bad for
> > performance,
>
> What is the performance problem?

Intuitively, there must be some kind of context switch (interpreter
switch?) at each send() call to let the other end receive the data,
since you don't have any internal buffering.

Also, suddenly an interpreter's ability to exploit CPU time is
dependent on another interpreter's ability to consume data in a timely
manner (what if the other interpreter is e.g. stuck on some disk I/O?).
IMHO it would be better not to have such coupling.

> > it also increases the likelihood of deadlocks.
>
> How much of a problem will deadlocks be in practice?

I expect more often than one might think, in complex systems :-) For
example, you could have a recv() loop that also from time to time
send()s some data on another queue, depending on what is received.
But if that send()'s recipient also has the same structure (a recv()
loop which send()s from time to time), then it's easy to imagine the
two getting into a deadlock.
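
A self-contained sketch of that failure mode, with a toy class
standing in for an unbuffered channel (the PEP's channels aren't
available to experiment with yet):

    import queue, threading, time

    class Rendezvous:
        # Toy unbuffered channel: send() blocks until recv() takes it.
        def __init__(self):
            self._slot = queue.Queue(maxsize=1)
            self._received = threading.Semaphore(0)
        def send(self, obj):
            self._slot.put(obj)
            self._received.acquire()    # wait for a matching recv()
        def recv(self):
            obj = self._slot.get()
            self._received.release()
            return obj

    a, b = Rendezvous(), Rendezvous()

    def left():
        a.send("ping")   # blocks: right() is stuck in b.send()
        b.recv()

    def right():
        b.send("pong")   # blocks: left() is stuck in a.send()
        a.recv()

    t1 = threading.Thread(target=left, daemon=True)
    t2 = threading.Thread(target=right, daemon=True)
    t1.start(); t2.start()
    time.sleep(0.5)
    print(t1.is_alive(), t2.is_alive())   # True True: deadlocked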

> (FWIW, CSP
> provides rigorous guarantees about deadlock detection (which Go
> leverages), though I'm not sure how much benefit that can offer such a
> dynamic language as Python.)

Hmm... deadlock detection is one thing, but when detected you must still
solve those deadlock issues, right?

> I'm not sure I understand your concern here. Perhaps I used the word
> "sharing" too ambiguously? By "sharing" I mean that the two actors
> have read access to something that at least one of them can modify.
> If they both only have read-only access then it's effectively the same
> as if they are not sharing.

Right. What I mean is that you *can* share very simple "data" in the
form of synchronization primitives. You may want to synchronize your
interpreters even if they don't share user-visible memory areas. The
point of synchronization is not only to avoid memory corruption but
also to regulate and orchestrate processing amongst multiple workers
(for example processes or interpreters). For example, a semaphore is
an easy way to implement "I want no more than N workers to do this
thing at the same time" ("this thing" can be something such as disk
I/O).
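
In concrete terms, with today's threading module:

    import threading

    io_slots = threading.Semaphore(3)   # at most N=3 concurrent workers

    def read_file(path):
        with io_slots:                  # blocks once 3 workers are inside
            with open(path, "rb") as f:
                return f.read()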

Regards

Antoine.

MRAB

unread,
Sep 23, 2017, 10:34:53 AM9/23/17
to pytho...@python.org
On 2017-09-23 10:45, Antoine Pitrou wrote:
>
> Hi Eric,
>
> On Fri, 22 Sep 2017 19:09:01 -0600
> Eric Snow <ericsnow...@gmail.com> wrote:
>>
>> Please elaborate. I'm interested in understanding what you mean here.
>> Do you have some subinterpreter-based concurrency improvements in
>> mind? What aspect of CSP is the PEP following too faithfully?
>
> See below the discussion of blocking send()s :-)
>
>> As to "running_interpreters()" and "idle_interpreters()", I'm not sure
>> what the benefit would be. You can compose either list manually with
>> a simple comprehension:
>>
>> [interp for interp in interpreters.list_all() if interp.is_running()]
>> [interp for interp in interpreters.list_all() if not interp.is_running()]
>
> There is an inherent race condition in doing that, at least if
> interpreters are running in multiple threads (which I assume is going
> to be the overwhelmingly dominant usage model). That is why I'm proposing all
> three variants.
>
An alternative to 3 variants would be:

interpreters.list_all(running=True)

interpreters.list_all(running=False)

interpreters.list_all(running=None)
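
One possible implementation, just to illustrate the signature (the
low-level call is hypothetical):

    def list_all(running=None):
        interps = _interpreters.list_all()   # hypothetical low-level call
        if running is None:
            return list(interps)
        return [i for i in interps if i.is_running() == running]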

[snip]

Nathaniel Smith

unread,
Sep 25, 2017, 8:43:51 PM9/25/17
to Antoine Pitrou, Python Dev
On Sat, Sep 23, 2017 at 2:45 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
>> As to "running_interpreters()" and "idle_interpreters()", I'm not sure
>> what the benefit would be. You can compose either list manually with
>> a simple comprehension:
>>
>> [interp for interp in interpreters.list_all() if interp.is_running()]
>> [interp for interp in interpreters.list_all() if not interp.is_running()]
>
> There is an inherent race condition in doing that, at least if
> interpreters are running in multiple threads (which I assume is going
> to be the overwhelmingly dominant usage model). That is why I'm proposing all
> three variants.

There's a race condition no matter what the API looks like -- having a
dedicated running_interpreters() lets you guarantee that the returned
list describes the set of interpreters that were running at some
moment in time, but you don't know when that moment was and by the
time you get the list, it's already out-of-date. So this doesn't seem
very useful. OTOH if we think that invariants like this are useful, we
might also want to guarantee that calling running_interpreters() and
idle_interpreters() gives two lists such that each interpreter appears
in exactly one of them, but that's impossible with this API; it'd
require a single function that returns both lists.

What problem are you trying to solve?

>> Likewise,
>> queue.Queue.put() supports blocking, in addition to providing a
>> put_nowait() method.
>
> queue.Queue.put() never blocks in the usual case (*), which is that
> of an unbounded queue. Only bounded queues (created with an explicit
> non-zero maxsize parameter) can block in Queue.put().
>
> (*) and therefore also never deadlocks :-)

Unbounded queues also introduce unbounded latency and memory usage in
realistic situations. (E.g. a producer/consumer setup where the
producer runs faster than the consumer.) There's a reason why sockets
always have bounded buffers -- it's sometimes painful, but the pain is
intrinsic to building distributed systems, and unbounded buffers just
paper over it.

>> > send() blocking until someone else calls recv() is not only bad for
>> > performance,
>>
>> What is the performance problem?
>
> Intuitively, there must be some kind of context switch (interpreter
> switch?) at each send() call to let the other end receive the data,
> since you don't have any internal buffering.

Technically you just need the other end to wake up at some time in
between any two calls to send(), and if there's no GIL then this
doesn't necessarily require a context switch.

> Also, suddenly an interpreter's ability to exploit CPU time is
> dependent on another interpreter's ability to consume data in a timely
> manner (what if the other interpreter is e.g. stuck on some disk I/O?).
> IMHO it would be better not to have such coupling.

A small buffer probably is useful in some cases, yeah -- basically
enough to smooth out scheduler jitter.

>> > it also increases the likelihood of deadlocks.
>>
>> How much of a problem will deadlocks be in practice?
>
> I expect more often than one might think, in complex systems :-) For
> example, you could have a recv() loop that also from time to time
> send()s some data on another queue, depending on what is received.
> But if that send()'s recipient also has the same structure (a recv()
> loop which send()s from time to time), then it's easy to imagine the
> two getting into a deadlock.

You kind of want to be able to create deadlocks, since the alternative
is processes that can't coordinate and end up stuck in livelocks or
with unbounded memory use etc.

>> I'm not sure I understand your concern here. Perhaps I used the word
>> "sharing" too ambiguously? By "sharing" I mean that the two actors
>> have read access to something that at least one of them can modify.
>> If they both only have read-only access then it's effectively the same
>> as if they are not sharing.
>
> Right. What I mean is that you *can* share very simple "data" in the
> form of synchronization primitives. You may want to synchronize your
> interpreters even if they don't share user-visible memory areas. The
> point of synchronization is not only to avoid memory corruption but
> also to regulate and orchestrate processing amongst multiple workers
> (for example processes or interpreters). For example, a semaphore is
> an easy way to implement "I want no more than N workers to do this
> thing at the same time" ("this thing" can be something such as disk
> I/O).

It's fairly reasonable to implement a mutex using a CSP-style
unbuffered channel (send = acquire, receive = release). And the same
trick turns a channel with a fixed-size buffer into a bounded
semaphore. It won't be as efficient as a modern specialized mutex
implementation, of course, but it's workable.
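
A rough model of that fixed-size-buffer trick, with queue.Queue(maxsize=N)
standing in for a channel with an N-slot buffer (the PEP's channels
aren't implemented yet):

    import queue

    lock_chan = queue.Queue(maxsize=1)   # N=1 gives a mutex; N>1 a
                                         # bounded semaphore

    def acquire():
        lock_chan.put(None)   # "send": blocks while the slot is occupied

    def release():
        lock_chan.get()       # "recv": frees the slot for the next holder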

Unfortunately while technically you can construct a buffered channel
out of an unbuffered channel, the construction's pretty unreasonable
(it needs two dedicated threads per channel).

-n

--
Nathaniel J. Smith -- https://vorpus.org

Antoine Pitrou

unread,
Sep 26, 2017, 3:22:01 AM9/26/17
to pytho...@python.org
On Mon, 25 Sep 2017 17:42:02 -0700
Nathaniel Smith <n...@pobox.com> wrote:
> On Sat, Sep 23, 2017 at 2:45 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
> >> As to "running_interpreters()" and "idle_interpreters()", I'm not sure
> >> what the benefit would be. You can compose either list manually with
> >> a simple comprehension:
> >>
> >> [interp for interp in interpreters.list_all() if interp.is_running()]
> >> [interp for interp in interpreters.list_all() if not interp.is_running()]
> >
> > There is an inherent race condition in doing that, at least if
> > interpreters are running in multiple threads (which I assume is going
> > to be the overwhelmingly dominant usage model). That is why I'm proposing all
> > three variants.
>
> There's a race condition no matter what the API looks like -- having a
> dedicated running_interpreters() lets you guarantee that the returned
> list describes the set of interpreters that were running at some
> moment in time, but you don't know when that moment was and by the
> time you get the list, it's already out-of-date.

Hmm, you're right of course.

> >> Likewise,
> >> queue.Queue.put() supports blocking, in addition to providing a
> >> put_nowait() method.
> >
> > queue.Queue.put() never blocks in the usual case (*), which is that
> > of an unbounded queue. Only bounded queues (created with an explicit
> > non-zero maxsize parameter) can block in Queue.put().
> >
> > (*) and therefore also never deadlocks :-)
>
> Unbounded queues also introduce unbounded latency and memory usage in
> realistic situations.

This doesn't seem to pose much of a problem in common use cases, though.
How many Python programs have you seen switch from an unbounded to a
bounded Queue to solve this problem?

Conversely, choosing a buffer size is tricky. How do you know up front
which amount you need? Is a fixed buffer size even ok or do you want
it to fluctuate based on the current conditions?

And regardless, my point was that a buffer is desirable. That send()
may block when the buffer is full doesn't change that it won't block in
the common case.

> There's a reason why sockets
> always have bounded buffers -- it's sometimes painful, but the pain is
> intrinsic to building distributed systems, and unbounded buffers just
> paper over it.

Papering over a problem is sometimes the right answer actually :-) For
example, most Python programs assume memory is unbounded...

If I'm using a queue or channel to push events to a logging system,
should I really block at every send() call? Most probably I'd rather
run ahead instead.

> > Also, suddenly an interpreter's ability to exploit CPU time is
> > dependent on another interpreter's ability to consume data in a timely
> > manner (what if the other interpreter is e.g. stuck on some disk I/O?).
> > IMHO it would be better not to have such coupling.
>
> A small buffer probably is useful in some cases, yeah -- basically
> enough to smooth out scheduler jitter.

That's not about scheduler jitter, but catering for activities which
occur at inherently different speed or rhythms. Requiring things run
in lockstep removes a lot of flexibility and makes it harder to exploit
CPU resources fully.

> > I expect more often than one might think, in complex systems :-) For
> > example, you could have a recv() loop that also from time to time
> > send()s some data on another queue, depending on what is received.
> > But if that send()'s recipient also has the same structure (a recv()
> > loop which send()s from time to time), then it's easy to imagine the
> > two getting into a deadlock.
>
> You kind of want to be able to create deadlocks, since the alternative
> is processes that can't coordinate and end up stuck in livelocks or
> with unbounded memory use etc.

I am not advocating we make it *impossible* to create deadlocks; just
saying we should not make them more *likely* than they need to.

> >> I'm not sure I understand your concern here. Perhaps I used the word
> >> "sharing" too ambiguously? By "sharing" I mean that the two actors
> >> have read access to something that at least one of them can modify.
> >> If they both only have read-only access then it's effectively the same
> >> as if they are not sharing.
> >
> > Right. What I mean is that you *can* share very simple "data" in the
> > form of synchronization primitives. You may want to synchronize your
> > interpreters even if they don't share user-visible memory areas. The
> > point of synchronization is not only to avoid memory corruption but
> > also to regulate and orchestrate processing amongst multiple workers
> > (for example processes or interpreters). For example, a semaphore is
> > an easy way to implement "I want no more than N workers to do this
> > thing at the same time" ("this thing" can be something such as disk
> > I/O).
>
> It's fairly reasonable to implement a mutex using a CSP-style
> unbuffered channel (send = acquire, receive = release). And the same
> trick turns a channel with a fixed-size buffer into a bounded
> semaphore. It won't be as efficient as a modern specialized mutex
> implementation, of course, but it's workable.

We are drifting away from the point I was trying to make here. I was
pointing out that the claim that nothing can be shared is a lie.
If it's possible to share a small datum (a synchronized counter aka
semaphore) between processes, certainly there's no technical reason
that should prevent it between interpreters.

By the way, I do think efficiency is a concern here. Otherwise
subinterpreters don't even have a point (just use multiprocessing).

> Unfortunately while technically you can construct a buffered channel
> out of an unbuffered channel, the construction's pretty unreasonable
> (it needs two dedicated threads per channel).

And the reverse is quite cumbersome as well. So we should favour the
construct that's more convenient for users, or provide both.

Regards

Antoine.

francismb

unread,
Sep 26, 2017, 8:58:14 AM9/26/17
to Eric Snow, pytho...@python.org
Hi Eric,

>> To make this work, the mutable shared state will be managed by the
>> Python runtime, not by any of the interpreters. Initially we will
>> support only one type of objects for shared state: the channels
>> provided by create_channel(). Channels, in turn, will carefully
>> manage passing objects between interpreters. [0]

Would it make sense to make the default channel type explicit,
something like ``create_channel(bytes)``?

Thanks in advance,
--francis

[0] https://www.python.org/dev/peps/pep-0554/

Walter Dörwald

unread,
Sep 26, 2017, 10:54:32 AM9/26/17
to Eric Snow, Antoine Pitrou, Python-Dev
On 23 Sep 2017, at 3:09, Eric Snow wrote:

> [...]
>>> ``list_all()``::
>>>
>>> Return a list of all existing interpreters.
>>
>> See my naming proposal in the previous thread.
>
> Sorry, your previous comment slipped through the cracks. You
> suggested:
>
> As for the naming, let's make it both unconfusing and explicit?
> How about three functions: `all_interpreters()`,
> `running_interpreters()`
> and `idle_interpreters()`, for example?
>
> As to "all_interpreters()", I suppose it's the difference between
> "interpreters.all_interpreters()" and "interpreters.list_all()". To
> me the latter looks better.

But in most cases when Python returns a container (list/dict/iterator)
of things, the name of the function/method is the name of the things,
not the name of the container, i.e. we have sys.modules, dict.keys,
dict.values etc. Or if the collection of things itself has a name, it
is that name, i.e. os.environ, sys.path etc.

It's a little bit unfortunate that the name of the module would be the
same as the name of the function, but IMHO interpreters() would be
better than list().

> As to "running_interpreters()" and "idle_interpreters()", I'm not sure
> what the benefit would be. You can compose either list manually with
> a simple comprehension:
>
> [interp for interp in interpreters.list_all() if
> interp.is_running()]
> [interp for interp in interpreters.list_all() if not
> interp.is_running()]

Servus,
Walter

Nick Coghlan

unread,
Sep 27, 2017, 1:28:49 AM9/27/17
to Antoine Pitrou, pytho...@python.org
On 26 September 2017 at 17:04, Antoine Pitrou <soli...@pitrou.net> wrote:
> On Mon, 25 Sep 2017 17:42:02 -0700 Nathaniel Smith <n...@pobox.com> wrote:
>> Unbounded queues also introduce unbounded latency and memory usage in
>> realistic situations.
>
> This doesn't seem to pose much of a problem in common use cases, though.
> How many Python programs have you seen switch from an unbounded to a
> bounded Queue to solve this problem?
>
> Conversely, choosing a buffer size is tricky. How do you know up front
> which amount you need? Is a fixed buffer size even ok or do you want
> it to fluctuate based on the current conditions?
>
> And regardless, my point was that a buffer is desirable. That send()
> may block when the buffer is full doesn't change that it won't block in
> the common case.

It's also the case that unlike Go channels, which were designed from
scratch on the basis of implementing pure CSP, Python has an
established behavioural precedent in the APIs of queue.Queue and
collections.deque: they're unbounded by default, and you have to opt
in to making them bounded.

>> There's a reason why sockets
>> always have bounded buffers -- it's sometimes painful, but the pain is
>> intrinsic to building distributed systems, and unbounded buffers just
>> paper over it.
>
> Papering over a problem is sometimes the right answer actually :-) For
> example, most Python programs assume memory is unbounded...
>
> If I'm using a queue or channel to push events to a logging system,
> should I really block at every send() call? Most probably I'd rather
> run ahead instead.

While the article title is clickbaity,
http://www.jtolds.com/writing/2016/03/go-channels-are-bad-and-you-should-feel-bad/
actually has a good discussion of this point. Search for "compose" to
find the relevant section ("Channels don’t compose well with other
concurrency primitives").

The specific problem cited is that only offering unbuffered or
bounded-buffer channels means that every send call becomes a potential
deadlock scenario, as all that needs to happen is for you to be
holding a different synchronisation primitive when the send call
blocks.

>> > Also, suddenly an interpreter's ability to exploit CPU time is
>> > dependent on another interpreter's ability to consume data in a timely
>> > manner (what if the other interpreter is e.g. stuck on some disk I/O?).
>> > IMHO it would be better not to have such coupling.
>>
>> A small buffer probably is useful in some cases, yeah -- basically
>> enough to smooth out scheduler jitter.
>
> That's not about scheduler jitter, but catering for activities which
> occur at inherently different speed or rhythms. Requiring things run
> in lockstep removes a lot of flexibility and makes it harder to exploit
> CPU resources fully.

The fact that the proposal now allows for M:N sender:receiver
relationships (just as queue.Queue does with threads) makes that
problem worse, since you may now have variability not only on the
message consumption side, but also on the message production side.

Consider this example where you have an event processing thread pool
that we're attempting to isolate from blocking IO by using channels
rather than coroutines.

Desired flow:

1. Listener thread receives external message from socket
2. Listener thread files message for processing on receive channel
3. Listener thread returns to blocking on the receive socket

4. Processing thread picks up message from receive channel
5. Processing thread processes message
6. Processing thread puts reply on the send channel

7. Sending thread picks up message from send channel
8. Sending thread makes a blocking network send call to transmit the message
9. Sending thread returns to blocking on the send channel

When queue.Queue is used to pass the messages between threads, such an
arrangement will be effectively non-blocking as long as the send rate
is greater than or equal to the receive rate. However, the GIL means
it won't exploit all available cores, even if we create multiple
processing threads: you have to switch to multiprocessing for that,
with all the extra overhead that entails.
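
Sketched with today's tools (sockets elided; plain strings stand in
for network messages):

    import queue, threading

    receive_q = queue.Queue()   # unbounded by default
    send_q = queue.Queue()

    def listener(messages):
        for msg in messages:          # steps 1-3
            receive_q.put(msg)
        receive_q.put(None)           # sentinel: input exhausted

    def processor():
        while True:                   # steps 4-6
            msg = receive_q.get()
            if msg is None:
                send_q.put(None)
                break
            send_q.put(msg.upper())   # stand-in for real processing

    def sender(replies):
        while True:                   # steps 7-9
            reply = send_q.get()
            if reply is None:
                break
            replies.append(reply)     # stand-in for a blocking send

    replies = []
    threads = [threading.Thread(target=listener, args=(["a", "b"],)),
               threading.Thread(target=processor),
               threading.Thread(target=sender, args=(replies,))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(replies)                    # ['A', 'B']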

So I see the essential premise of PEP 554 as being to ask the question
"If each of these threads was running its own *interpreter*, could we
use Sans IO style protocols with interpreter channels to separate
internally "synchronous" processing threads from separate IO threads
operating at system boundaries, without having to make the entire
application pervasively asynchronous?"

If channels are an unbuffered blocking primitive, then we don't get
that benefit: even when there are additional receive messages to be
processed, the processing thread will block until the previous send
has completed. Switching the listener and sender threads over to
asynchronous IO would help with that, but they'd also end up having to
implement their own message buffering to manage the lack of buffering
in the core channel primitive.

By contrast, if the core channels are designed to offer an unbounded
buffer by default, then you can get close-to-CSP semantics just by
setting the buffer size to 1 (it's still not exactly CSP, since that
has a buffer size of 0, but you at least get the semantics of having
to alternate sending and receiving of messages).

>> > I expect more often than one might think, in complex systems :-) For
>> > example, you could have a recv() loop that also from time to time
>> > send()s some data on another queue, depending on what is received.
>> > But if that send()'s recipient also has the same structure (a recv()
>> > loop which send()s from time to time), then it's easy to imagine the
>> > two getting into a deadlock.
>>
>> You kind of want to be able to create deadlocks, since the alternative
>> is processes that can't coordinate and end up stuck in livelocks or
>> with unbounded memory use etc.
>
> I am not advocating we make it *impossible* to create deadlocks; just
> saying we should not make them more *likely* than they need to.

Right, and I think the queue.Queue and collections.deque model works
well for that, since you can start introducing queue bounds to
propagate backpressure through a system if you're seeing undesirable
memory growth.

>> It's fairly reasonable to implement a mutex using a CSP-style
>> unbuffered channel (send = acquire, receive = release). And the same
>> trick turns a channel with a fixed-size buffer into a bounded
>> semaphore. It won't be as efficient as a modern specialized mutex
>> implementation, of course, but it's workable.
>
> We are drifting away from the point I was trying to make here. I was
> pointing out that the claim that nothing can be shared is a lie.
> If it's possible to share a small datum (a synchronized counter aka
> semaphore) between processes, certainly there's no technical reason
> that should prevent it between interpreters.
>
> By the way, I do think efficiency is a concern here. Otherwise
> subinterpreters don't even have a point (just use multiprocessing).

Agreed, and I think the interaction between the threading module and
the interpreters module is one we're going to have to explicitly call
out as being covered by the provisional status of the interpreters
module, as I think it could be incredibly valuable to be able to send
at least some threading objects through channels, and have them be an
interpreter-specific reference to a common underlying sync primitive.

>> Unfortunately while technically you can construct a buffered channel
>> out of an unbuffered channel, the construction's pretty unreasonable
>> (it needs two dedicated threads per channel).
>
> And the reverse is quite cumbersome as well. So we should favour the
> construct that's more convenient for users, or provide both.

As noted above, I think consistency with design intuitions formed
through the use of queue.Queue is also an important consideration.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Eric Snow

unread,
Oct 2, 2017, 9:33:23 PM10/2/17
to Nick Coghlan, Python Dev
On Thu, Sep 14, 2017 at 8:44 PM, Nick Coghlan <ncog...@gmail.com> wrote:
> Not really, because the only way to ensure object separation (i.e no
> refcounted objects accessible from multiple interpreters at once) with
> a bytes-based API would be to either:
>
> 1. Always copy (eliminating most of the low overhead communications
> benefits that subinterpreters may offer over multiple processes)
> 2. Make the bytes implementation more complicated by allowing multiple
> bytes objects to share the same underlying storage while presenting as
> distinct objects in different interpreters
> 3. Make the output on the receiving side not actually a bytes object,
> but instead a view onto memory owned by another object in a different
> interpreter (a "memory view", one might say)

4. Pass Bytes through directly.

The only problem of which I'm aware is that when Py_DECREF() triggers
Bytes.__del__(), it happens in the current interpreter, which may not
be the "owner" (i.e. the interpreter that allocated the object). So
the solution would be to make PyBytesType.tp_free() effectively run as
a "pending call" under the owner. This would require two things:

1. a new PyBytesObject.owner field (PyInterpreterState *), or a
separate owner table, which would be set when the object is passed
through a channel
2. a Py_AddPendingCall() that targets a specific interpreter (which I
expect would be desirable regardless)

Then, when the object has an owner, PyBytesType.tp_free() would add a
pending call on the owner to call PyObject_Del() on the Bytes object.

The catch is that currently "pending" calls (via Py_AddPendingCall)
are run only in the main thread of the main interpreter. We'd need a
similar mechanism that targets a specific interpreter.

> By contrast, if we allow an actual bytes object to be shared, then
> either every INCREF or DECREF on that bytes object becomes a
> synchronisation point, or else we end up needing some kind of
> secondary per-interpreter refcount where the interpreter doesn't drop
> its shared reference to the original object in its source interpreter
> until the internal refcount in the borrowing interpreter drops to
> zero.

There shouldn't be a need to synchronize on INCREF. If both
interpreters have at least 1 reference then either one adding a
reference shouldn't be a problem. If only one interpreter has a
reference then the other won't be adding any references. If neither
has a reference then neither is going to add any references. Perhaps
I've missed something. Under what circumstances would INCREF happen
while the refcount is 0?

On DECREF there shouldn't be a problem except possibly with a small
race between decrementing the refcount and checking for a refcount of
0. We could address that several different ways, including allowing
the pending call to get queued only once (or being a noop the second
time).

FWIW, I'm not opposed to the CIV/memoryview approach, but want to make
sure we really can't use Bytes before going down that route.

-eric

Eric Snow

unread,
Oct 2, 2017, 10:17:00 PM10/2/17
to Antoine Pitrou, Python-Dev
After having looked it over, I'm leaning toward supporting buffering,
as well as not blocking by default. Neither adds much complexity to
the implementation.

On Sat, Sep 23, 2017 at 5:45 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
> On Fri, 22 Sep 2017 19:09:01 -0600
> Eric Snow <ericsnow...@gmail.com> wrote:
>> > send() blocking until someone else calls recv() is not only bad for
>> > performance,
>>
>> What is the performance problem?
>
> Intuitively, there must be some kind of context switch (interpreter
> switch?) at each send() call to let the other end receive the data,
> since you don't have any internal buffering.

There would be an internal size-1 buffer.

>> (FWIW, CSP
>> provides rigorous guarantees about deadlock detection (which Go
>> leverages), though I'm not sure how much benefit that can offer such a
>> dynamic language as Python.)
>
> Hmm... deadlock detection is one thing, but when detected you must still
> solve those deadlock issues, right?

Yeah, I haven't given much thought to how we could leverage that
capability, but my gut feeling is that we won't have much opportunity
to do so. :)

>> I'm not sure I understand your concern here. Perhaps I used the word
>> "sharing" too ambiguously? By "sharing" I mean that the two actors
>> have read access to something that at least one of them can modify.
>> If they both only have read-only access then it's effectively the same
>> as if they are not sharing.
>
> Right. What I mean is that you *can* share very simple "data" in the
> form of synchronization primitives. You may want to synchronize your
> interpreters even if they don't share user-visible memory areas. The
> point of synchronization is not only to avoid memory corruption but
> also to regulate and orchestrate processing amongst multiple workers
> (for example processes or interpreters). For example, a semaphore is
> an easy way to implement "I want no more than N workers to do this
> thing at the same time" ("this thing" can be something such as disk
> I/O).

I'm still not convinced that sharing synchronization primitives is
important enough to be worth including it in the PEP. It can be added
later, or via an extension module in the meantime. To that end, I'll
add a mechanism to the PEP for third-party types to indicate that they
can be passed through channels. Something like
"obj.__channel_support__ = True".

-eric

Eric Snow

unread,
Oct 2, 2017, 10:20:02 PM10/2/17
to Nick Coghlan, Python Dev
On Mon, Oct 2, 2017 at 9:31 PM, Eric Snow <ericsnow...@gmail.com> wrote:
> On DECREF there shouldn't be a problem except possibly with a small
> race between decrementing the refcount and checking for a refcount of
> 0. We could address that several different ways, including allowing
> the pending call to get queued only once (or being a noop the second
> time).

Alternately, the channel could own a reference and DECREF it in the
owning interpreter once the refcount reaches 1.

Eric Snow

unread,
Oct 2, 2017, 10:22:06 PM10/2/17
to Nathaniel Smith, Antoine Pitrou, Python Dev
On Mon, Sep 25, 2017 at 8:42 PM, Nathaniel Smith <n...@pobox.com> wrote:
> It's fairly reasonable to implement a mutex using a CSP-style
> unbuffered channel (send = acquire, receive = release). And the same
> trick turns a channel with a fixed-size buffer into a bounded
> semaphore. It won't be as efficient as a modern specialized mutex
> implementation, of course, but it's workable.
>
> Unfortunately while technically you can construct a buffered channel
> out of an unbuffered channel, the construction's pretty unreasonable
> (it needs two dedicated threads per channel).

Yeah, if threading's synchronization primitives make sense between
interpreters then we'll add direct support. Using channels for that
isn't a good option.

-eric

Eric Snow

unread,
Oct 2, 2017, 10:37:06 PM10/2/17
to Nick Coghlan, Antoine Pitrou, pytho...@python.org
On Wed, Sep 27, 2017 at 1:26 AM, Nick Coghlan <ncog...@gmail.com> wrote:
> It's also the case that unlike Go channels, which were designed from
> scratch on the basis of implementing pure CSP,

FWIW, Go's channels (and goroutines) don't implement pure CSP. They
provide a variant that the Go authors felt was more in line with the
language's flavor. The channels in the PEP aim to support a more pure
implementation.

> Python has an
> established behavioural precedent in the APIs of queue.Queue and
> collections.deque: they're unbounded by default, and you have to opt
> in to making them bounded.

Right. That's part of why I'm leaning toward support for buffered channels.

> While the article title is clickbaity,
> http://www.jtolds.com/writing/2016/03/go-channels-are-bad-and-you-should-feel-bad/
> actually has a good discussion of this point. Search for "compose" to
> find the relevant section ("Channels don’t compose well with other
> concurrency primitives").
>
> The specific problem cited is that only offering unbuffered or
> bounded-buffer channels means that every send call becomes a potential
> deadlock scenario, as all that needs to happen is for you to be
> holding a different synchronisation primitive when the send call
> blocks.

Yeah, that blog post was a reference for me as I was designing the
PEP's channels.

+1

> If channels are an unbuffered blocking primitive, then we don't get
> that benefit: even when there are additional receive messages to be
> processed, the processing thread will block until the previous send
> has completed. Switching the listener and sender threads over to
> asynchronous IO would help with that, but they'd also end up having to
> implement their own message buffering to manage the lack of buffering
> in the core channel primitive.
>
> By contrast, if the core channels are designed to offer an unbounded
> buffer by default, then you can get close-to-CSP semantics just by
> setting the buffer size to 1 (it's still not exactly CSP, since that
> has a buffer size of 0, but you at least get the semantics of having
> to alternate sending and receiving of messages).

Yep, I came to the same conclusion.

>> By the way, I do think efficiency is a concern here. Otherwise
>> subinterpreters don't even have a point (just use multiprocessing).
>
> Agreed, and I think the interaction between the threading module and
> the interpreters module is one we're going to have to explicitly call
> out as being covered by the provisional status of the interpreters
> module, as I think it could be incredibly valuable to be able to send
> at least some threading objects through channels, and have them be an
> interpreter-specific reference to a common underlying sync primitive.

Agreed. I'll add a note to the PEP.

-eric

Antoine Pitrou

unread,
Oct 3, 2017, 7:02:07 AM10/3/17
to pytho...@python.org
On Mon, 2 Oct 2017 22:15:01 -0400
Eric Snow <ericsnow...@gmail.com> wrote:
>
> I'm still not convinced that sharing synchronization primitives is
> important enough to be worth including it in the PEP. It can be added
> later, or via an extension module in the meantime. To that end, I'll
> add a mechanism to the PEP for third-party types to indicate that they
> can be passed through channels. Something like
> "obj.__channel_support__ = True".

How would that work? If it's simply a matter of flipping a bit, why
don't we do it for all objects?

Regards

Antoine.

Eric Snow

unread,
Oct 3, 2017, 10:38:42 AM10/3/17
to Antoine Pitrou, Python-Dev
On Tue, Oct 3, 2017 at 5:00 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
> On Mon, 2 Oct 2017 22:15:01 -0400
> Eric Snow <ericsnow...@gmail.com> wrote:
>>
>> I'm still not convinced that sharing synchronization primitives is
>> important enough to be worth including it in the PEP. It can be added
>> later, or via an extension module in the meantime. To that end, I'll
>> add a mechanism to the PEP for third-party types to indicate that they
>> can be passed through channels. Something like
>> "obj.__channel_support__ = True".
>
> How would that work? If it's simply a matter of flipping a bit, why
> don't we do it for all objects?

The type would also have to be safe to share between interpreters. :)
Eventually I'd like to make that work for all immutable objects (and
immutable containers thereof), but until then each type must be
adapted individually. The PEP starts off with just Bytes.

-eric

Antoine Pitrou

unread,
Oct 3, 2017, 10:57:33 AM10/3/17
to pytho...@python.org
On Tue, 3 Oct 2017 08:36:55 -0600
Eric Snow <ericsnow...@gmail.com> wrote:
> On Tue, Oct 3, 2017 at 5:00 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
> > On Mon, 2 Oct 2017 22:15:01 -0400
> > Eric Snow <ericsnow...@gmail.com> wrote:
> >>
> >> I'm still not convinced that sharing synchronization primitives is
> >> important enough to be worth including it in the PEP. It can be added
> >> later, or via an extension module in the meantime. To that end, I'll
> >> add a mechanism to the PEP for third-party types to indicate that they
> >> can be passed through channels. Something like
> >> "obj.__channel_support__ = True".
> >
> > How would that work? If it's simply a matter of flipping a bit, why
> > don't we do it for all objects?
>
> The type would also have to be safe to share between interpreters. :)

But what does it mean to be safe to share, while the exact degree
and nature of the isolation between interpreters (and also their
concurrent execution) is unspecified?

I think we need a sharing protocol, not just a flag. We also need to
think carefully about that protocol, so that it does not imply
unnecessary memory copies. Therefore I think the protocol should be
something like the buffer protocol, that allows to acquire and release
a set of shared memory areas, but without imposing any semantics onto
those memory areas (each type implementing its own semantics). And
there needs to be a dedicated reference counting for object shares, so
that the original object can be notified when all its shares have
vanished.

Regards

Antoine.

Steve Dower

unread,
Oct 3, 2017, 1:03:34 PM10/3/17
to Antoine Pitrou, pytho...@python.org
On 03Oct2017 0755, Antoine Pitrou wrote:
> On Tue, 3 Oct 2017 08:36:55 -0600
> Eric Snow <ericsnow...@gmail.com> wrote:
>> On Tue, Oct 3, 2017 at 5:00 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
>>> On Mon, 2 Oct 2017 22:15:01 -0400
>>> Eric Snow <ericsnow...@gmail.com> wrote:
>>>>
>>>> I'm still not convinced that sharing synchronization primitives is
>>>> important enough to be worth including it in the PEP. It can be added
>>>> later, or via an extension module in the meantime. To that end, I'll
>>>> add a mechanism to the PEP for third-party types to indicate that they
>>>> can be passed through channels. Something like
>>>> "obj.__channel_support__ = True".
>>>
>>> How would that work? If it's simply a matter of flipping a bit, why
>>> don't we do it for all objects?
>>
>> The type would also have to be safe to share between interpreters. :)
>
> But what does it mean to be safe to share, while the exact degree
> and nature of the isolation between interpreters (and also their
> concurrent execution) is unspecified?
>
> I think we need a sharing protocol, not just a flag.

The easiest such protocol is essentially:

* an object can represent itself as bytes (e.g. generate a bytes object
representing some global token, such as a kernel handle or memory address)
* those bytes are sent over the standard channel
* the object can instantiate itself from those bytes (e.g. wrap the
existing handle, create a memoryview over the same block of memory, etc.)
* cross-interpreter refcounting is either ignored (because the kernel is
refcounting the resource) or manual (by including more shared info in
the token)
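
As a rough illustration (everything here is hypothetical: the channel's
send()/recv() come from the PEP, but FileHandleShare and its token
format are made up), such a type might look like:

    import os

    class FileHandleShare:
        # Wraps an OS-level file descriptor, which the kernel refcounts
        # independently of any single interpreter.
        def __init__(self, fd):
            self.fd = fd

        def to_token(self):
            # Represent the resource as bytes suitable for a bytes channel.
            return str(self.fd).encode('ascii')

        @classmethod
        def from_token(cls, token):
            # Duplicate the handle so each side owns its own copy.
            return cls(os.dup(int(token.decode('ascii'))))

    # sending side:    ch.send(FileHandleShare(fd).to_token())
    # receiving side:  handle = FileHandleShare.from_token(ch.recv())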

Since this is trivial to implement over the basic bytes channel, and
doesn't even require a standard protocol except for convenience, Eric
decided to avoid blocking the core functionality on this. I'm inclined
to agree - get the basic functionality supported and let people build on
it before we try to lock down something we don't fully understand yet.

About the only thing that seems to be worth doing up-front is some sort
of pending-call callback mechanism between interpreters, but even that
doesn't need to block the core functionality (you can do it trivially
with threads and another channel right now, and there's always room to
make something more efficient later).
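
For instance, a minimal sketch of that threads-plus-channel approach
(the channel object is assumed from the PEP; the handler registry is
invented for illustration):

    import threading

    def callback_pump(callback_channel, handlers):
        # Runs in a background thread; each received token names a
        # locally registered callable to invoke, approximating a
        # cross-interpreter "pending call" mechanism.
        while True:
            name = bytes(callback_channel.recv()).decode('ascii')
            if name == 'stop':
                break
            handlers[name]()

    # handlers = {'refresh': refresh_caches}
    # threading.Thread(target=callback_pump,
    #                  args=(cb_channel, handlers), daemon=True).start()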

There are plenty of smart people out there who can and will figure out
the best way to design this. By giving them the tools and the ability to
design something awesome, we're more likely to get something awesome
than by committing to a complete design now. Right now, they're all
blocked on the fact that subinterpreters are incredibly hard to start
running, let alone experiment with. Eric's PEP will fix that part and
enable others to take it from building blocks to powerful libraries.

Cheers,
Steve

Nick Coghlan

unread,
Oct 4, 2017, 1:41:02 AM10/4/17
to Eric Snow, Python Dev
On 3 October 2017 at 11:31, Eric Snow <ericsnow...@gmail.com> wrote:
> There shouldn't be a need to synchronize on INCREF. If both
> interpreters have at least 1 reference then either one adding a
> reference shouldn't be a problem. If only one interpreter has a
> reference then the other won't be adding any references. If neither
> has a reference then neither is going to add any references. Perhaps
> I've missed something. Under what circumstances would INCREF happen
> while the refcount is 0?

The problem relates to the fact that there aren't any memory barriers
around CPython's INCREF operations (they're implemented as an ordinary
C post-increment operation), so you can get the following scenario:

* thread on CPU A has the sole reference (ob_refcnt=1)
* thread on CPU B acquires a new reference, but hasn't pushed the
updated ob_refcnt value back to the shared memory cache yet
* original thread on CPU A drops its reference, *thinks* the refcnt is
now zero, and deletes the object
* bad things now happen in CPU B as the thread running there tries to
use a deleted object :)

The GIL currently protects us from this, as switching CPUs requires
switching threads, which means the original thread has to release the
GIL (flushing all of its state changes to the shared cache), and the
new thread has to acquire it (hence refreshing its local cache from
the shared one).
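
(As an aside, the same lost-update hazard is easy to picture in pure
Python, where "count += 1" compiles to a separate load, add, and store,
much like the plain C post-increment; whether updates are actually lost
in practice depends on interpreter version and timing:)

    import threading

    count = 0

    def bump(n):
        global count
        for _ in range(n):
            count += 1  # read-modify-write; another thread can interleave

    threads = [threading.Thread(target=bump, args=(100_000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(count)  # may be less than 400000: some increments were lost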

The need to switch all incref/decref operations over to using atomic
thread-safe primitives when removing the GIL is one of the main
reasons that attempting to remove the GIL *within* an interpreter is
expensive (and why Larry et al are having to explore completely
different ref count management strategies for the GILectomy).

By contrast, if you rely on a new memoryview variant to mediate all
data sharing between interpreters, then you can make sure that *it* is
using synchronisation primitives as needed to ensure the required
cache coherency across different CPUs, without any negative impacts on
regular single interpreter code (which can still rely on the cache
coherency guarantees provided by the GIL).

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Eric Snow

unread,
Oct 4, 2017, 9:53:23 AM10/4/17
to Nick Coghlan, Python Dev
On Tue, Oct 3, 2017 at 11:36 PM, Nick Coghlan <ncog...@gmail.com> wrote:
> The problem relates to the fact that there aren't any memory barriers
> around CPython's INCREF operations (they're implemented as an ordinary
> C post-increment operation), so you can get the following scenario:
>
> * thread on CPU A has the sole reference (ob_refcnt=1)
> * thread on CPU B acquires a new reference, but hasn't pushed the
> updated ob_refcnt value back to the shared memory cache yet
> * original thread on CPU A drops its reference, *thinks* the refcnt is
> now zero, and deletes the object
> * bad things now happen in CPU B as the thread running there tries to
> use a deleted object :)

I'm not clear on where we'd run into this problem with channels.
Mirroring your scenario:

* interpreter A (in thread on CPU A) INCREFs the object (the GIL is still held)
* interp A sends the object to the channel
* interp B (in thread on CPU B) receives the object from the channel
* the new reference is held until interp B DECREFs the object

From what I see, at no point do we get a refcount of 0, such that
there would be a race on the object being deleted.

The only problem I'm aware of (it dawned on me last night), is in the
case that the interpreter that created the object gets deleted before
the object does. In that case we can't pass the deletion back to the
original interpreter. (I don't think this problem is necessarily
exclusive to the solution I've proposed for Bytes.)

-eric

Antoine Pitrou

unread,
Oct 4, 2017, 11:52:36 AM10/4/17
to pytho...@python.org
On Mon, 2 Oct 2017 21:31:30 -0400
Eric Snow <ericsnow...@gmail.com> wrote:
>
> > By contrast, if we allow an actual bytes object to be shared, then
> > either every INCREF or DECREF on that bytes object becomes a
> > synchronisation point, or else we end up needing some kind of
> > secondary per-interpreter refcount where the interpreter doesn't drop
> > its shared reference to the original object in its source interpreter
> > until the internal refcount in the borrowing interpreter drops to
> > zero.
>
> There shouldn't be a need to synchronize on INCREF. If both
> interpreters have at least 1 reference then either one adding a
> reference shouldn't be a problem.

I'm not sure what Nick meant by "synchronization point", but at least
you certainly need INCREF and DECREF to be atomic, which is a departure
from today's Py_INCREF / Py_DECREF behaviour (and is significantly
slower, even on high-level benchmarks).

Regards

Antoine.

Koos Zevenhoven

unread,
Oct 4, 2017, 11:59:39 AM10/4/17
to Eric Snow, Nick Coghlan, Python Dev
On Wed, Oct 4, 2017 at 4:51 PM, Eric Snow <ericsnow...@gmail.com> wrote:
On Tue, Oct 3, 2017 at 11:36 PM, Nick Coghlan <ncog...@gmail.com> wrote:
> The problem relates to the fact that there aren't any memory barriers
> around CPython's INCREF operations (they're implemented as an ordinary
> C post-increment operation), so you can get the following scenario:
>
> * thread on CPU A has the sole reference (ob_refcnt=1)
> * thread on CPU B acquires a new reference, but hasn't pushed the
> updated ob_refcnt value back to the shared memory cache yet
> * original thread on CPU A drops its reference, *thinks* the refcnt is
> now zero, and deletes the object
> * bad things now happen in CPU B as the thread running there tries to
> use a deleted object :)

I'm not clear on where we'd run into this problem with channels.
Mirroring your scenario:

* interpreter A (in thread on CPU A) INCREFs the object (the GIL is still held)
* interp A sends the object to the channel
* interp B (in thread on CPU B) receives the object from the channel
* the new reference is held until interp B DECREFs the object

From what I see, at no point do we get a refcount of 0, such that
there would be a race on the object being deleted.

So what you're saying is that when Larry finishes the gilectomy, subinterpreters will work GIL-free too?-)

––Koos

The only problem I'm aware of (it dawned on me last night), is in the
case that the interpreter that created the object gets deleted before
the object does.  In that case we can't pass the deletion back to the
original interpreter.  (I don't think this problem is necessarily
exclusive to the solution I've proposed for Bytes.)

-eric

Antoine Pitrou

unread,
Oct 4, 2017, 12:14:34 PM10/4/17
to pytho...@python.org
On Wed, 4 Oct 2017 17:50:33 +0200
Antoine Pitrou <soli...@pitrou.net> wrote:
> On Mon, 2 Oct 2017 21:31:30 -0400
> Eric Snow <ericsnow...@gmail.com> wrote:
> >
> > > By contrast, if we allow an actual bytes object to be shared, then
> > > either every INCREF or DECREF on that bytes object becomes a
> > > synchronisation point, or else we end up needing some kind of
> > > secondary per-interpreter refcount where the interpreter doesn't drop
> > > its shared reference to the original object in its source interpreter
> > > until the internal refcount in the borrowing interpreter drops to
> > > zero.
> >
> > There shouldn't be a need to synchronize on INCREF. If both
> > interpreters have at least 1 reference then either one adding a
> > reference shouldn't be a problem.
>
> I'm not sure what Nick meant by "synchronization point", but at least
> you certainly need INCREF and DECREF to be atomic, which is a departure
> from today's Py_INCREF / Py_DECREF behaviour (and is significantly
> slower, even on high-level benchmarks).

To be clear, I'm writing this under the hypothesis of per-interpreter
GILs. I'm not really interested in the per-process GIL case :-)

Nick Coghlan

unread,
Oct 4, 2017, 9:43:15 PM10/4/17
to Eric Snow, Python Dev
On 4 October 2017 at 23:51, Eric Snow <ericsnow...@gmail.com> wrote:
> On Tue, Oct 3, 2017 at 11:36 PM, Nick Coghlan <ncog...@gmail.com> wrote:
>> The problem relates to the fact that there aren't any memory barriers
>> around CPython's INCREF operations (they're implemented as an ordinary
>> C post-increment operation), so you can get the following scenario:
>>
>> * thread on CPU A has the sole reference (ob_refcnt=1)
>> * thread on CPU B acquires a new reference, but hasn't pushed the
>> updated ob_refcnt value back to the shared memory cache yet
>> * original thread on CPU A drops its reference, *thinks* the refcnt is
>> now zero, and deletes the object
>> * bad things now happen in CPU B as the thread running there tries to
>> use a deleted object :)
>
> I'm not clear on where we'd run into this problem with channels.
> Mirroring your scenario:
>
> * interpreter A (in thread on CPU A) INCREFs the object (the GIL is still held)
> * interp A sends the object to the channel
> * interp B (in thread on CPU B) receives the object from the channel
> * the new reference is held until interp B DECREFs the object
>
> From what I see, at no point do we get a refcount of 0, such that
> there would be a race on the object being deleted.

Having the sending interpreter do the INCREF just changes the problem
to be a memory leak waiting to happen rather than an access-after-free
issue, since the problematic non-synchronised scenario then becomes:

* thread on CPU A has two references (ob_refcnt=2)
* it sends a reference to a thread on CPU B via a channel
* thread on CPU A releases its reference (ob_refcnt=1)
* updated ob_refcnt value hasn't made it back to the shared memory cache yet
* thread on CPU B releases its reference (ob_refcnt=1)
* both threads have released their reference, but the refcnt is still
1 -> object leaks!

We simply can't have INCREFs and DECREFs happening in different
threads without some way of ensuring cache coherency for *both*
operations - otherwise we risk either the refcount going to zero when
it shouldn't, or *not* going to zero when it should.

The current CPython implementation relies on the process global GIL
for that purpose, so none of these problems will show up until you
start trying to replace that with per-interpreter locks.

Free threaded reference counting relies on (expensive) atomic
increments & decrements.

The cross-interpreter view proposal aims to allow per-interpreter GILs
without introducing atomic increments & decrements by instead relying
on the view itself to ensure that it's holding the right GIL for the
object whose refcount it's manipulating, and the receiving interpreter
explicitly closing the view when it's done with it.

So while CIVs wouldn't be as easy to use as regular object references:

1. They'd be no harder to use than memoryviews in general
2. They'd structurally ensure that regular object refcounts can still
rely on "protected by the GIL" semantics
3. They'd structurally ensure zero performance degradation for regular
object refcounts
4. By virtue of being memoryview based, they'd encourage the adoption
of interfaces and practices that can be adapted to multiple processes
through the use of techniques like shared memory regions and memory
mapped files (see
http://www.boost.org/doc/libs/1_54_0/doc/html/interprocess/sharedmemorybetweenprocesses.html
for some detailed explanations of how that works, and
https://arrow.apache.org/ for an example of ways tools like Pandas can
use that to enable zero-copy data sharing)

> The only problem I'm aware of (it dawned on me last night), is in the
> case that the interpreter that created the object gets deleted before
> the object does. In that case we can't pass the deletion back to the
> original interpreter. (I don't think this problem is necessarily
> exclusive to the solution I've proposed for Bytes.)

The cross-interpreter-view idea proposes to deal with that by having
the CIV hold a strong reference not only to the sending object (which
is already part of the regular memoryview semantics), but *also* to
the sending interpreter - that way, neither the sending object nor the
sending interpreter can go away until the receiving interpreter closes
the view.

The refcount-integrity-ensuring sequence of events becomes:

1. Sending interpreter submits the object to the channel
2. Channel creates a CIV with references to the sending interpreter &
sending object, and a view on the sending object's memory
3. Receiving interpreter gets the CIV from the channel
4. Receiving interpreter closes the CIV either explicitly or via
__del__ (the latter would emit ResourceWarning)
5. CIV switches execution back to the sending interpreter and releases
both the memory buffer and the reference to the sending object
6. CIV switches execution back to the receiving interpreter, and
releases its reference to the sending interpreter
7. Execution continues in the receiving interpreter
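
In Python-level pseudocode, a CIV following that sequence might look
roughly like this (every name is hypothetical, and the interpreter
switching primitive in particular is hand-waved):

    import warnings

    class CrossInterpreterView:
        def __init__(self, obj, interp):
            # obj must support the buffer protocol
            self._obj = obj               # strong ref to the sending object
            self._interp = interp         # strong ref to the sending interpreter
            self._view = memoryview(obj)  # view on the sending object's memory

        def close(self):
            if self._interp is None:
                return
            self._view.release()
            with _running_in(self._interp):  # hypothetical switch primitive
                self._obj = None             # DECREF under the sender's GIL
            self._interp = None

        def __del__(self):
            if self._interp is not None:
                warnings.warn("unclosed cross-interpreter view",
                              ResourceWarning)
                self.close()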

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Eric Snow

unread,
Oct 5, 2017, 4:48:03 AM10/5/17
to Nick Coghlan, Antoine Pitrou, Python Dev
On Tue, Oct 3, 2017 at 8:55 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
> I think we need a sharing protocol, not just a flag. We also need to
> think carefully about that protocol, so that it does not imply
> unnecessary memory copies. Therefore I think the protocol should be
> something like the buffer protocol, which allows acquiring and releasing
> a set of shared memory areas, but without imposing any semantics onto
> those memory areas (each type implementing its own semantics). And
> there needs to be dedicated reference counting for object shares, so
> that the original object can be notified when all its shares have
> vanished.

I've come to agree. :) I actually came to the same conclusion tonight
before I'd been able to read through your message carefully. My idea
is below. Your suggestion about protecting shared memory areas is
something to discuss further, though I'm not sure it's strictly
necessary yet (before we stop sharing the GIL).

On Wed, Oct 4, 2017 at 7:41 PM, Nick Coghlan <ncog...@gmail.com> wrote:
> Having the sending interpreter do the INCREF just changes the problem
> to be a memory leak waiting to happen rather than an access-after-free
> issue, since the problematic non-synchronised scenario then becomes:
>
> * thread on CPU A has two references (ob_refcnt=2)
> * it sends a reference to a thread on CPU B via a channel
> * thread on CPU A releases its reference (ob_refcnt=1)
> * updated ob_refcnt value hasn't made it back to the shared memory cache yet
> * thread on CPU B releases its reference (ob_refcnt=1)
> * both threads have released their reference, but the refcnt is still
> 1 -> object leaks!
>
> We simply can't have INCREFs and DECREFs happening in different
> threads without some way of ensuring cache coherency for *both*
> operations - otherwise we risk either the refcount going to zero when
> it shouldn't, or *not* going to zero when it should.
>
> The current CPython implementation relies on the process global GIL
> for that purpose, so none of these problems will show up until you
> start trying to replace that with per-interpreter locks.
>
> Free threaded reference counting relies on (expensive) atomic
> increments & decrements.

Right. I'm not sure why I was missing that, but I'm clear now.

Below is a rough idea of what I think may work instead (the result of
much tossing and turning in bed*).

While we're still sharing a GIL between interpreters:

Channel.send(obj):  # in interp A
    incref(obj)
    if type(obj).tp_share == NULL:
        raise ValueError("not a shareable type")
    ch.objects.append(obj)

Channel.recv():  # in interp B
    orig = ch.objects.pop(0)
    obj = orig.tp_share()
    return obj

bytes.tp_share():
    return self

After we move to not sharing the GIL between interpreters:

Channel.send(obj):  # in interp A
    incref(obj)
    if type(obj).tp_share == NULL:
        raise ValueError("not a shareable type")
    set_owner(obj)  # obj.owner or add an obj -> interp entry to global table
    ch.objects.append(obj)

Channel.recv():  # in interp B
    orig = ch.objects.pop(0)
    obj = orig.tp_share()
    set_shared(obj, orig)  # add to a global table
    return obj

bytes.tp_share():
    obj = blank_bytes(len(self))
    obj.ob_sval = self.ob_sval  # hand-wavy memory sharing
    return obj

bytes.tp_free():  # under no-shared-GIL:
    # most of this could be pulled into a macro for re-use
    orig = lookup_shared(self)
    if orig != NULL:
        current = release_LIL()
        interp = lookup_owner(orig)
        acquire_LIL(interp)
        decref(orig)
        release_LIL(interp)
        acquire_LIL(current)
        # clear shared/owner tables
        # clear/release self.ob_sval
    free(self)

The CIV approach could be facilitated through something like a new
SharedBuffer type, or through a separate BufferViewChannel, etc.

Most notably, this approach avoids hard-coding specific type support
into channels and should work out fine under no-shared-GIL
subinterpreters. One nice thing about the tp_share slot is that it
makes it much easier (along with C-API for managing the global
owned/shared tables) to implement other types that are legal to pass
through channels. Such could be provided via extension modules.
Numpy arrays could be made to support it, if that's your thing.
Antoine could give tp_share to locks and semaphores. :) Of course,
any such types would have to ensure that they are actually safe to
share between interpreters without a GIL between them...

For PEP 554, I'd only propose the tp_share slot and its use in
Channel.send()/.recv(). The parts related to global tables and memory
sharing and tp_free() wouldn't be necessary until we stop sharing the
GIL between interpreters. However, I believe that tp_share would make
us ready for that.

-eric


* I should know by now that some ideas sound better in the middle of
the night than they do the next day, but this idea is keeping me awake
so I'll risk it! :)

Nick Coghlan

unread,
Oct 5, 2017, 6:59:00 AM10/5/17
to Eric Snow, Antoine Pitrou, Python Dev
On 5 October 2017 at 18:45, Eric Snow <ericsnow...@gmail.com> wrote:
After we move to not sharing the GIL between interpreters:

Channel.send(obj):  # in interp A
    incref(obj)
    if type(obj).tp_share == NULL:
        raise ValueError("not a shareable type")
    set_owner(obj)  # obj.owner or add an obj -> interp entry to global table
    ch.objects.append(obj)

Channel.recv():  # in interp B
    orig = ch.objects.pop(0)
    obj = orig.tp_share()
    set_shared(obj, orig)  # add to a global table
    return obj

This would be hard to get to work reliably, because "orig.tp_share()" would be running in the receiving interpreter, but all the attributes of "orig" would have been allocated by the sending interpreter. It gets more reliable if it's *Channel.send* that calls tp_share(), but moving the call to the sending side makes it clear that a tp_share protocol would still need to rely on a more primitive set of "shareable objects" that were the permitted return values from the tp_share call.

And that's the real pay-off that comes from defining this in terms of the memoryview protocol: Py_buffer structs *aren't* Python objects, so it's only a regular C struct that gets passed across the interpreter boundary (the reference to the original objects gets carried along passively as part of the CIV - it never gets *used* in the receiving interpreter).
 
bytes.tp_share():
    obj = blank_bytes(len(self))
    obj.ob_sval = self.ob_sval # hand-wavy memory sharing
    return obj

This is effectively reinventing memoryview, while trying to pretend it's an ordinary bytes object. Don't reinvent memoryview :)
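
(For comparison, plain memoryview already gives you the zero-copy part:

    data = b'some payload'
    view = memoryview(data)  # no copy; just a new reference to the buffer
    assert view[:4] == b'some'
    view.release()

the only missing pieces are the cross-interpreter reference and the
interpreter switch on release.)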
 
bytes.tp_free():  # under no-shared-GIL:
    # most of this could be pulled into a macro for re-use
    orig = lookup_shared(self)
    if orig != NULL:
        current = release_LIL()
        interp = lookup_owner(orig)
        acquire_LIL(interp)
        decref(orig)
        release_LIL(interp)
        acquire_LIL(current)
        # clear shared/owner tables
        # clear/release self.ob_sval
    free(self)

I don't think we should be touching the behaviour of core builtins solely to enable message passing to subinterpreters without a shared GIL.

The simplest possible variant of CIVs that I can think of would be able to avoid that outcome by being a memoryview subclass, since it just needs to hold the extra reference to the original interpreter, and include some logic to switch interpreters at the appropriate time.

That said, I think there's definitely a useful design question to ask in this area, not about bytes (which can be readily represented by a memoryview variant in the receiving interpreter), but about *strings*: they have a more complex internal layout than bytes objects, but as long as the receiving interpreter can make sure that the original string continues to exist, then you could usefully implement a "strview" type to avoid having to go through an encode/decode cycle just to pass a string to another subinterpreter.

That would provide a reasonably compelling argument that CIVs *shouldn't* be implemented as memoryview subclasses, but instead defined as *containing* a managed view of an object owned by a different interpreter.

That way, even if the initial implementation only supported CIVs that contained a memoryview instance, we'd have the freedom to define other kinds of views later (such as strview), while being able to reuse the same CIV machinery.

Eric Snow

unread,
Oct 5, 2017, 9:51:34 PM10/5/17
to Nick Coghlan, Antoine Pitrou, Python Dev
On Thu, Oct 5, 2017 at 4:57 AM, Nick Coghlan <ncog...@gmail.com> wrote:
> This would be hard to get to work reliably, because "orig.tp_share()" would
> be running in the receiving interpreter, but all the attributes of "orig"
> would have been allocated by the sending interpreter. It gets more reliable
> if it's *Channel.send* that calls tp_share() though, but moving the call to
> the sending side makes it clear that a tp_share protocol would still need to
> rely on a more primitive set of "shareable objects" that were the permitted
> return values from the tp_share call.

The point of running tp_share() in the receiving interpreter is to
force allocation under that interpreter, so that GC applies there. I
agree that you basically can't do anything in tp_share() that would
affect the sending interpreter, including INCREF and DECREF. Since we
INCREFed in send(), we know that the we have a safe reference, so we
don't have to worry about that part in tp_share(). We would only be
able to do low-level things (like the buffer protocol) that don't
interact with the original object's interpreter.

Given that this is a quite low-level tp slot and low-level
functionality, I'd expect that a sufficiently clear entry (i.e.
warning) in the docs would be enough for the few that dare. <wink>
From my perspective adding the tp_share slot allows for much more
experimentation with object sharing (right now, long before we get to
considering how to stop sharing the GIL) by us *and* third parties.
None of the alternatives seem to offer the same opportunity while
still working out *after* we stop sharing the GIL.

>
> And that's the real pay-off that comes from defining this in terms of the
> memoryview protocol: Py_buffer structs *aren't* Python objects, so it's only
> a regular C struct that gets passed across the interpreter boundary (the
> reference to the original objects gets carried along passively as part of
> the CIV - it never gets *used* in the receiving interpreter).

Yeah, the (PEP 3118) buffer protocol offers precedent in a number of
ways that are applicable to channels here. I'm simply reticent to
lock PEP 554 into such a specific solution as the buffer-specific CIV.
I'm trying to accommodate anticipated future needs while keeping the
PEP as simple and basic as possible. It's driving me nuts! :P Things
were *much* simpler before I added Channels to the PEP. :)

>
>>
>> bytes.tp_share():
>>     obj = blank_bytes(len(self))
>>     obj.ob_sval = self.ob_sval  # hand-wavy memory sharing
>>     return obj
>
>
> This is effectively reinventing memoryview, while trying to pretend it's an
> ordinary bytes object. Don't reinvent memoryview :)
>
>>
>> bytes.tp_free():  # under no-shared-GIL:
>>     # most of this could be pulled into a macro for re-use
>>     orig = lookup_shared(self)
>>     if orig != NULL:
>>         current = release_LIL()
>>         interp = lookup_owner(orig)
>>         acquire_LIL(interp)
>>         decref(orig)
>>         release_LIL(interp)
>>         acquire_LIL(current)
>>         # clear shared/owner tables
>>         # clear/release self.ob_sval
>>     free(self)
>
>
> I don't think we should be touching the behaviour of core builtins solely to
> enable message passing to subinterpreters without a shared GIL.

Keep in mind that I included the above as a possible solution using
tp_share() that would work *after* we stop sharing the GIL. My point
is that with tp_share() we have a solution that works now *and* will
work later. I don't care how we use tp_share to do so. :) I long to
be able to say in the PEP that you can pass bytes through the channel
and get bytes on the other side.

That said, I'm not sure how this could be made to work without
involving tp_free(). If that is really off the table (even in the
simplest possible ways) then I don't think there is a way to actually
share objects of builtin types between interpreters other than through
views like CIV. We could still support tp_share() for the sake of
third parties, which would facilitate that simplicity I was aiming for
in sending data between interpreters, as well as leaving the door open
for nearly all the same experimentation. However, I expect that most
*uses* of channels will involve builtin types, particularly as we
start off, so having to rely on view types for builtins would add
not-insignificant awkwardness to using channels.

I'd still like to avoid that if possible, so let's not rush to
completely close the door on small modifications to tp_free for
builtins. :) Regardless, I still (after a night's rest and a day of
not thinking about it) consider tp_share() to be the solution I'd been
hoping we'd find, whether or not we can apply it to builtin types.

>
> The simplest possible variant of CIVs that I can think of would be able to
> avoid that outcome by being a memoryview subclass, since they just need to
> hold the extra reference to the original interpreter, and include some logic
> to swtich interpreters at the appropriate time.
>
> That said, I think there's definitely a useful design question to ask in
> this area, not about bytes (which can be readily represented by a memoryview
> variant in the receiving interpreter), but about *strings*: they have a more
> complex internal layout than bytes objects, but as long as the receiving
> interpreter can make sure that the original string continues to exist, then
> you could usefully implement a "strview" type to avoid having to go through
> an encode/decode cycle just to pass a string to another subinterpreter.
>
> That would provide a reasonably compelling argument that CIVs *shouldn't* be
> implemented as memoryview subclasses, but instead defined as *containing* a
> managed view of an object owned by a different interpreter.
>
> That way, even if the initial implementation only supported CIVs that
> contained a memoryview instance, we'd have the freedom to define other kinds
> of views later (such as strview), while being able to reuse the same CIV
> machinery.

Hmm, so a CIV implementation that accomplishes something similar to tp_share()?

For some reason I'm seeing similarities between CIV-vs.-tp_share and
the import machinery before PEP 451. Before we added module specs,
import hook authors had to do a bunch of the busy work that the import
machinery does for you now by leveraging module specs. Back then we
worked to provide a number of helpers to reduce that extra pain of
writing an import hook. Now the helpers are irrelevant and the extra
burden is gone.

My mind is drawn to the comparison between that and the question of
CIV vs. tp_share(). CIV would be more like the post-451 import world,
where I expect the CIV would take care of the data sharing operations.
That said, the situation in PEP 554 is sufficiently different that I'm
not convinced a generic CIV protocol would be better. I'm not sure
how much CIV could do for you over helpers+tp_share.

Anyway, here are the leading approaches that I'm looking at now:

* adding a tp_share slot
  + you send() the object directly and recv() the object coming out of
    tp_share() (which will probably be the same type as the original)
  + this would eventually require small changes in tp_free for
    participating types
  + we would likely provide helpers (eventually), similar to the new
    buffer protocol, to make it easier to manage sharing data
* simulating tp_share via an external global registry (or a registry
  on the Channel type)
  + it would still be hard to make work without hooking into tp_free()
* CIVs hard-coded in Channel (or BufferViewChannel, etc.) for specific
  types (e.g. buffers)
  + you send() the object like normal, but recv() the view
* a CIV protocol on Channel by which you can add support for more types
  + you send() the object like normal but recv() the view
  + could work through subclassing or a registry
  + a lot of conceptual similarity with tp_share+tp_free
* a CIV-like proxy
  + you wrap the object, send() the proxy, and recv() a proxy
  + this is entirely compatible with tp_share()

Here are what I consider the key metrics relative to the utility of a
solution (not in any significant order):

* how hard to understand as a Python programmer?
* how much extra work (if any) for folks calling Channel.send()?
* how much extra work (if any) for folks calling Channel.recv()?
* how complex is the CPython implementation?
* how hard to understand as a type author (wanting to add support for
their type)?
* how hard to add support for a new type?
* what variety of types could be supported?
* what breadth of experimentation opens up?

The most important thing to me is keeping things simple for Python
programmers. After that is ease-of-use for type authors. However, I
also want to put us in a good position in 3.7 to experiment
extensively with subinterpreters, so that's a big consideration.

Consequently, for PEP 554 my goal is to find a solution for object
sharing that keeps things simple in Python while laying a basic
foundation we can build on at the C level, so we don't get locked in
but still maximize our opportunities to experiment. :)

-eric

Nick Coghlan

unread,
Oct 5, 2017, 11:40:23 PM10/5/17
to Eric Snow, Antoine Pitrou, Python Dev
On 6 October 2017 at 11:48, Eric Snow <ericsnow...@gmail.com> wrote:
> And that's the real pay-off that comes from defining this in terms of the
> memoryview protocol: Py_buffer structs *aren't* Python objects, so it's only
> a regular C struct that gets passed across the interpreter boundary (the
> reference to the original objects gets carried along passively as part of
> the CIV - it never gets *used* in the receiving interpreter).

Yeah, the (PEP 3118) buffer protocol offers precedent in a number of
ways that are applicable to channels here.  I'm simply reticent to
lock PEP 554 into such a specific solution as the buffer-specific CIV.
I'm trying to accommodate anticipated future needs while keeping the
PEP as simple and basic as possible.  It's driving me nuts! :P  Things
were *much* simpler before I added Channels to the PEP. :)

Starting with memory-sharing only doesn't lock us into anything, since you can still add a more flexible kind of channel based on a different protocol later if it turns out that memory sharing isn't enough.

By contrast, if you make the initial channel semantics incompatible with multiprocessing by design, you *will* prevent anyone from experimenting with replicating the shared memory based channel API for communicating between processes :)

That said, if you'd prefer to keep the "Channel" name available for the possible introduction of object channels at a later date, you could call the initial memoryview based channel a "MemChannel".
 
> I don't think we should be touching the behaviour of core builtins solely to
> enable message passing to subinterpreters without a shared GIL.

Keep in mind that I included the above as a possible solution using
tp_share() that would work *after* we stop sharing the GIL.  My point
is that with tp_share() we have a solution that works now *and* will
work later.  I don't care how we use tp_share to do so. :)  I long to
be able to say in the PEP that you can pass bytes through the channel
and get bytes on the other side.

Memory views are a builtin type as well, and they emphasise the practical benefit we're trying to get relative to typical multiprocessing arrangements: zero-copy data sharing.

So here's my proposed experimentation-enabling development strategy:

1. Start out with a MemChannel API that accepts any buffer-exporting object as input, and outputs only a cross-interpreter memoryview subclass (see the usage sketch after this list)
2. Use that as the basis for the work to get to a per-interpreter locking arrangement that allows subinterpreters to fully exploit multiple CPUs
3. Only then try to design a Channel API that allows for sharing builtin immutable objects between interpreters (bytes, strings, numbers), at a time when you can be certain you won't be inadvertently making it harder to make the GIL a truly per-interpreter lock, rather than the current process global runtime lock.
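
As a usage sketch for step 1 (the create_mem_channel() constructor and
passing the channel to run() via keyword are assumptions layered on top
of the draft API):

    import interpreters  # the proposed stdlib module

    interp = interpreters.create()
    ch = interpreters.create_mem_channel()  # hypothetical constructor

    buf = bytearray(b'hello')  # any buffer-exporting object can be sent
    ch.send(buf)

    interp.run("view = ch.recv()      # a cross-interpreter memoryview\n"
               "data = bytes(view)    # copying out is explicit\n"
               "view.release()        # switches back to the sender\n",
               ch=ch)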

The key benefit of this approach is that we *know* MemChannel can work: the buffer protocol already operates at the level of C structs and pointers, not Python objects, and there are already plenty of interesting buffer-protocol-supporting objects around, so as long as the CIV switches interpreters at the right time, there aren't any fundamentally new runtime level capabilities needed to implement it.

The lower level MemChannel API could then also be replicated for multiprocessing, while the higher level more speculative object-based Channel API would be specific to subinterpreters (and probably only ever designed and implemented if you first succeed in making subinterpreters sufficiently independent that they don't rely on a process-wide GIL any more).

So I'm not saying "Never design an object-sharing protocol specifically for use with subinterpreters". I'm saying "You don't have a demonstrated need for that yet, so don't try to define it until you do".

 
My mind is drawn to the comparison between that and the question of
CIV vs. tp_share().  CIV would be more like the post-451 import world,
where I expect the CIV would take care of the data sharing operations.
That said, the situation in PEP 554 is sufficiently different that I'm
not convinced a generic CIV protocol would be better.  I'm not sure
how much CIV could do for you over helpers+tp_share.

Anyway, here are the leading approaches that I'm looking at now:

* adding a tp_share slot
  + you send() the object directly and recv() the object coming out of
    tp_share() (which will probably be the same type as the original)
  + this would eventually require small changes in tp_free for
    participating types
  + we would likely provide helpers (eventually), similar to the new
    buffer protocol, to make it easier to manage sharing data

I'm skeptical about this approach because you'll be designing in a vacuum against future possible constraints that you can't test yet: the inherent complexity in the object sharing protocol will come from *not* having a process-wide GIL, but you'll be starting out with a process-wide GIL still in place. And that means third parties will inevitably rely on the process-wide GIL in their tp_share implementations (despite their best intentions), and you'll end up with the same issue that causes problems for the rest of the C API.

By contrast, if you delay this step until *after* the GIL has successfully been shifted to being per-interpreter, then by the time the new protocol is defined, people will also be able to test their tp_share implementations properly.

At that point, you'd also presumably have evidence of demand to justify the introduction of a new core language protocol, as:

* folks will only complain about the limitations of MemChannel if they're actually using subinterpreters
* the complaints about the limitations of MemChannel would help guide the object sharing protocol design
 
* simulating tp_share via an external global registry (or a registry
  on the Channel type)
  + it would still be hard to make work without hooking into tp_free()
* CIVs hard-coded in Channel (or BufferViewChannel, etc.) for specific
  types (e.g. buffers)
  + you send() the object like normal, but recv() the view
* a CIV protocol on Channel by which you can add support for more types
  + you send() the object like normal but recv() the view
  + could work through subclassing or a registry
  + a lot of conceptual similarity with tp_share+tp_free
* a CIV-like proxy
  + you wrap the object, send() the proxy, and recv() a proxy
  + this is entirely compatible with tp_share()

* Allow for multiple channel types, such that MemChannel is merely the *first* channel type, rather than the *only* channel type
  + Allows PEP 554 to be restricted to things we already know can be made to work
  + Doesn't block the introduction of an object-sharing based Channel in some future release
  + Allows for at least some channel types to be adapted for use with shared memory and multiprocessing
 
Here are what I consider the key metrics relative to the utility of a
solution (not in any significant order):

* how hard to understand as a Python programmer?

Not especially important yet - this is more a criterion for the final API, not the initial experimental platform.
 
* how much extra work (if any) for folks calling Channel.send()?
* how much extra work (if any) for folks calling Channel.recv()?

I don't think either are particularly important yet, although we also don't want to raise any pointless barriers to experimentation.
 
* how complex is the CPython implementation?

This is critical, since we want to minimise any potential for undesirable side effects on regular single interpreter code.
 
* how hard to understand as a type author (wanting to add support for
their type)?
* how hard to add support for a new type?
* what variety of types could be supported?
* what breadth of experimentation opens up?

You missed the big one: what risk does the initial channel design pose to the underlying objective of making the GIL a genuinely per-interpreter lock?

If we don't eventually reach the latter goal, then subinterpreters won't really offer much in the way of compelling benefits over just using a thread pool and queue.Queue.

MemChannel poses zero additional risk to that, since we wouldn't be sharing actual Python objects between interpreters, only C pointers and structs.

By contrast, introducing an object channel early poses significant new risks to that goal, since it will force you to solve hard protocol design and refcount management problems *before* making the switch, rather than being able to defer the design of the object channel protocol until *after* you've already enabled the ability to run subinterpreters in completely independent threads.
 
The most important thing to me is keeping things simple for Python
programmers.  After that is ease-of-use for type authors.  However, I
also want to put us in a good position in 3.7 to experiment
extensively with subinterpreters, so that's a big consideration.

Consequently, for PEP 554 my goal is to find a solution for object
sharing that keeps things simple in Python while laying a basic
foundation we can build on at the C level, so we don't get locked in
but still maximize our opportunities to experiment. :)

I think our priorities are quite different then, as I believe PEP 554 should be focused on defining a relatively easy to implement API that nevertheless makes it possible to write interesting programs while working on the goal of making the GIL per-interpreter, without worrying too much about whether or not the initial cross-interpreter communication channels closely resemble the final ones that will be intended for more general use.

Cheers,
Nick.

Koos Zevenhoven

unread,
Oct 6, 2017, 12:31:37 PM10/6/17
to Nick Coghlan, Antoine Pitrou, Python Dev
While I'm actually trying not to say much here so that I can avoid this discussion now, here's just a couple of ideas and thoughts from me at this point:

(A) 
Instead of sending bytes and receiving memoryviews, one could consider sending *and* receiving memoryviews for now. That could then be extended into more types of objects in the future without changing the basic concept of the channel. Probably, the memoryview would need to be copied (but not the data of course). But I'm guessing copying a memoryview would be quite fast.

This would hopefully require less API changes or additions in the future. OTOH, giving it a different name like MemChannel or making it 3rd party will buy some more time to figure out the right API. But maybe that's not needed.

(B) 
We would probably then like to pretend that the object coming out the other end of a Channel *is* the original object. As long as these channels are the only way to directly pass objects between interpreters, there are essentially only two ways to tell the difference (AFAICT):

1. Calling id(...) and sending it over to the other interpreter and checking if it's the same.

2. When the same object is sent twice to the same interpreter. Then one can compare the two with id(...) or using the `is` operator. 

There are solutions to the problems too:

1. Send the id() from the sending interpreter along with the sent object so that the receiving interpreter can somehow attach it to the object and then return it from id(...).

2. When an object is received, make a lookup in an interpreter-wide cache to see if an object by this id has already been received. If yes, take that one.

Now it should essentially look like the received object is really "the same one" as in the sending interpreter. This should also work with multiple interpreters and multiple channels, as long as the id is always preserved.
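
A minimal sketch of that receiving-side cache (names invented; a real
version would need to evict entries, e.g. with weak references, to
avoid pinning every received object forever):

    _received = {}  # sender's id -> local object; one table per interpreter

    def recv_preserving_identity(channel):
        sender_id, obj = channel.recv()  # assumes the id travels alongside
        return _received.setdefault(sender_id, obj)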

(C)
One further complication regarding memoryview in general is that .release() should probably be propagated to the sending interpreter somehow.

(D)
I think someone already mentioned this one, but would it not be better to start a new interpreter in the background in a new thread by default? I think this would make things simpler and leave more freedom regarding the implementation in the future. If you need to run an interpreter within the current thread, you could perhaps optionally do that too.


––Koos


PS. I have lots of thoughts related to this, but I can't afford to engage in them now. (Anyway, it's probably more urgent to get some stuff with PEP 555 and its spin-off thoughts out of the way).




Nick Coghlan

unread,
Oct 8, 2017, 10:29:15 PM10/8/17
to Koos Zevenhoven, Antoine Pitrou, Python Dev
On 7 October 2017 at 02:29, Koos Zevenhoven <k7h...@gmail.com> wrote:
While I'm actually trying not to say much here so that I can avoid this discussion now, here's just a couple of ideas and thoughts from me at this point:

(A) 
Instead of sending bytes and receiving memoryviews, one could consider sending *and* receiving memoryviews for now. That could then be extended into more types of objects in the future without changing the basic concept of the channel. Probably, the memoryview would need to be copied (but not the data of course). But I'm guessing copying a memoryview would be quite fast.

The proposal is to allow sending any buffer-exporting object, so sending a memoryview would be supported.
 
This would hopefully require less API changes or additions in the future. OTOH, giving it a different name like MemChannel or making it 3rd party will buy some more time to figure out the right API. But maybe that's not needed.

I think having both a memory-centric data channel and an object-centric data channel would be useful long term, so I don't see a lot of downsides to starting with the easier-to-implement MemChannel, and then looking at how to define a plain Channel later.

For example, it occurs to me that the closest current equivalent we have to an object level counterpart to the memory buffer protocol would be the weak reference protocol, wherein a multi-interpreter-aware proxy object could actually take care of switching interpreters as needed when manipulating reference counts.

While weakrefs themselves wouldn't be usable in the general case (many builtin types don't support weak references, and we'd want to support strong cross-interpreter references anyway), a wrapt-style object proxy would provide us with a way to maintain a single strong reference to the original object in its originating interpreter (implicitly switching to that interpreter as needed), while also maintaining a regular local reference count on the proxy object in the receiving interpreter.

And here's the neat thing: since subinterpreters share an address space, it would be possible to experiment with an object-proxy based channel by passing object pointers over a memoryview based channel.
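
A deliberately unsafe sketch of that experiment, using ctypes to turn
an address from the shared address space back into an object reference
(keeping the sender-side strong reference alive is hand-waved, and none
of this is part of the PEP):

    import ctypes
    import struct

    _keepalive = []  # the sender must pin the object while the peer uses it

    def send_pointer(mem_channel, obj):
        _keepalive.append(obj)
        mem_channel.send(struct.pack('P', id(obj)))

    def recv_pointer(mem_channel):
        (addr,) = struct.unpack('P', bytes(mem_channel.recv()))
        return ctypes.cast(addr, ctypes.py_object).value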
 
(B) 
We would probably then like to pretend that the object coming out the other end of a Channel *is* the original object. As long as these channels are the only way to directly pass objects between interpreters, there are essentially only two ways to tell the difference (AFAICT):

1. Calling id(...) and sending it over to the other interpreter and checking if it's the same.

2. When the same object is sent twice to the same interpreter. Then one can compare the two with id(...) or using the `is` operator. 

There are solutions to the problems too:

1. Send the id() from the sending interpreter along with the sent object so that the receiving interpreter can somehow attach it to the object and then return it from id(...).

2. When an object is received, make a lookup in an interpreter-wide cache to see if an object by this id has already been received. If yes, take that one.

Now it should essentially look like the received object is really "the same one" as in the sending interpreter. This should also work with multiple interpreters and multiple channels, as long as the id is always preserved.

I don't personally think we want to expend much (if any) effort on presenting the illusion that the objects on either end of the channel are the "same" object, but postponing the question entirely is also one of the benefits I see to starting with MemChannel, and leaving the object-centric Channel until later.
 
(C)
One further complication regarding memoryview in general is that .release() should probably be propagated to the sending interpreter somehow.

Yep, switching interpreters when releasing the buffer is the main reason you couldn't use a regular memoryview for this purpose - you need a variant that holds a strong reference to the sending interpreter, and switches back to it for the buffer release operation.
 
(D)
I think someone already mentioned this one, but would it not be better to start a new interpreter in the background in a new thread by default? I think this would make things simpler and leave more freedom regarding the implementation in the future. If you need to run an interpreter within the current thread, you could perhaps optionally do that too.

Not really, as that approach doesn't compose as well with existing thread management primitives like concurrent.futures.ThreadPoolExecutor. It also doesn't match the way the existing subinterpreter machinery works, where threads can change their active interpreter.

Nick Coghlan

unread,
Oct 8, 2017, 10:41:23 PM10/8/17
to Eric Snow, Python-Dev
On 14 September 2017 at 11:44, Eric Snow <ericsnow...@gmail.com> wrote:
Examples
========

Run isolated code
-----------------

::

   interp = interpreters.create()
   print('before')
   interp.run('print("during")')
   print('after')

A few more suggestions for examples:

Running a module:

    main_module = mod_name
    interp.run(f"import runpy; runpy.run_module({main_module!r})")

Running as script (including zip archives & directories):

    main_script = path_name
    interp.run(f"import runpy; runpy.run_path({main_script!r})")

Running in a thread pool executor:

    interps = [interpreters.create() for i in range(5)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(interps)) as pool:
        print('before')
        for interp in interps:
            pool.submit(interp.run, 'print("starting"); print("stopping")')
        print('after')

That last one is prompted by the questions about the benefits of keeping the notion of an interpreter state distinct from the notion of a main thread (it allows a single "MainThread" object to be mapped to different OS level threads at different points in time, which means it's easier to combine with existing constructs for managing OS level thread pools).