
multiprocessing module backport from 3 to 2.7 - spawn feature


Andres Riancho

Jan 28, 2015, 12:53:03 PM
to pytho...@python.org
List,

I've been searching around for a multiprocessing module backport from
3 to 2.7.x and the closest thing I've found was celery's billiard [0]
which seems to be a work in progress.

The feature I'm especially interested in is the ability to spawn
processes [1] instead of forking, which is not present in the 2.7
version of the module.

Does anyone know of a working backport of the Python 3 multiprocessing module?

[0] https://github.com/celery/billiard
[1] https://docs.python.org/3.4/library/multiprocessing.html#multiprocessing.set_start_method

Regards,
--
Andrés Riancho
Project Leader at w3af - http://w3af.org/
Web Application Attack and Audit Framework
Twitter: @w3af
GPG: 0x93C344F3

Skip Montanaro

Jan 28, 2015, 1:07:06 PM
to Andres Riancho, Python

On Wed, Jan 28, 2015 at 7:07 AM, Andres Riancho <andres....@gmail.com> wrote:
> The feature I'm especially interested in is the ability to spawn
> processes [1] instead of forking, which is not present in the 2.7
> version of the module.

Can you explain what you see as the difference between "spawn" and "fork" in this context? Are you using Windows perhaps? I don't know anything obviously different between the two terms on Unix systems.

Skip

Devin Jeanpierre

Jan 28, 2015, 3:51:48 PM
to Skip Montanaro, Python
On Unix, if you fork without exec* in a process that has other threads
running, only the forking thread exists in the child. Locks held by the
vanished threads stay locked forever, which leads to deadlocks or worse
if you try to acquire those resources in the forked child process. So in
such circumstances, multiprocessing (in 2.7) is not a viable option. But
3.x adds a feature, "spawn", that lets you fork+exec instead of just
forking.
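
For illustration, here is a minimal sketch of the failure mode described
above. It assumes a Unix system (os.fork is not available on Windows), and
the function name is hypothetical. A helper thread holds a lock while the
main thread forks; the child then tries a non-blocking acquire so the demo
itself cannot hang. Recent Python 3 releases may print a DeprecationWarning
when forking a multithreaded process.

```python
import os
import threading


def child_can_take_lock_after_fork():
    """Fork while another thread holds a lock; return True if the child can acquire it."""
    lock = threading.Lock()
    lock_is_held = threading.Event()
    may_release = threading.Event()

    def holder():
        with lock:
            lock_is_held.set()
            may_release.wait()  # keep holding the lock while the parent forks

    worker = threading.Thread(target=holder)
    worker.start()
    lock_is_held.wait()

    pid = os.fork()
    if pid == 0:
        # Child: the holder thread was not copied, but the lock's "locked" state was.
        acquired = lock.acquire(False)  # non-blocking, so the demo cannot hang
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    may_release.set()
    worker.join()
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print(child_can_take_lock_after_fork())
```

In the child nobody is left to release the lock, so a blocking acquire
there would wait forever. That is the deadlock this message is about.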

I too would be interested in such a backport. I considered writing
one, but haven't had a strong enough need yet.

-- Devin

Andres Riancho

Jan 29, 2015, 3:12:32 AM
to Skip Montanaro, Python
On Wed, Jan 28, 2015 at 3:06 PM, Skip Montanaro
<skip.mo...@gmail.com> wrote:
>
> On Wed, Jan 28, 2015 at 7:07 AM, Andres Riancho <andres....@gmail.com>
> wrote:
>>
>> The feature I'm especially interested in is the ability to spawn
>> processes [1] instead of forking, which is not present in the 2.7
>> version of the module.
>
>
> Can you explain what you see as the difference between "spawn" and "fork" in
> this context?

Well, fork is a system call [0] where a process creates a copy of
itself, usually using COW [1]. This copy receives a new process ID and
starts with a duplicate of the parent's address space; the pages are
shared copy-on-write until either process writes to them.
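
A small sketch of what fork() does to memory, assuming a Unix system. The
function name is made up for this example. The child inherits the parent's
data as a private copy, so a change made in the child is not visible in
the parent:

```python
import os


def fork_and_modify():
    """Return (value seen by the child after its write, value still seen by the parent)."""
    counter = [0]
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: works on its own copy of `counter`.
        os.close(read_fd)
        counter[0] += 1
        os.write(write_fd, str(counter[0]).encode("ascii"))
        os._exit(0)

    # Parent: read what the child reported, then reap the child.
    os.close(write_fd)
    child_value = int(os.read(read_fd, 16).decode("ascii"))
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_value, counter[0]


if __name__ == "__main__":
    print(fork_and_modify())
```

The child reports 1 while the parent still sees 0. The same copying applies
to lock state, which is why locks held at fork time stay locked in the
child.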

Spawn, and I took that from the multiprocessing 3 documentation, will
create a new process without using fork(). This means that no memory
is shared between the MainProcess and the spawn'ed sub-process created
by multiprocessing.
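
To make the difference concrete, here is a hedged sketch for Python 3.4 or
later (it does not run on 2.7, which is the whole problem in this thread).
It assumes the code is saved and run as a script, and the names STATE,
read_state and child_view are invented for the example. The parent changes
a module-level value after import, then asks a child started with each
method what it sees:

```python
import multiprocessing

STATE = {"configured": False}


def read_state(conn):
    # Runs in the child: report the value of STATE as the child sees it.
    conn.send(STATE["configured"])
    conn.close()


def child_view(start_method):
    """Start one child with the given start method and return what it reports."""
    ctx = multiprocessing.get_context(start_method)
    receiver, sender = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=read_state, args=(sender,))
    proc.start()
    sender.close()  # the parent only reads
    value = receiver.recv()
    proc.join()
    return value


if __name__ == "__main__":
    STATE["configured"] = True  # changed in the parent only, after import
    for method in ("fork", "spawn"):
        if method in multiprocessing.get_all_start_methods():
            print(method, child_view(method))
```

With "fork" the child inherits the modified dictionary. With "spawn" the
child is a fresh interpreter that re-imports the script, so it only sees
the value assigned at import time. The `if __name__ == "__main__":` guard
is required with "spawn", otherwise the re-import would try to start
children again.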

My goal is to prevent deadlocks and other issues [2][3] which come
from forking a multithreaded program (the situation I'm in right now).

[0] https://en.wikipedia.org/wiki/Fork_%28system_call%29
[1] https://en.wikipedia.org/wiki/Copy-on-write
[2] See "Note that safely forking a multithreaded process is
problematic." at
https://docs.python.org/3.4/library/multiprocessing.html#multiprocessing.set_start_method
[3] http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them

> Are you using Windows perhaps? I don't know anything obviously
> different between the two terms on Unix systems.

Nope, I'm on Linux.

> Skip

Sturla Molden

Jan 30, 2015, 4:11:58 PM
to pytho...@python.org
Skip Montanaro <skip.mo...@gmail.com> wrote:

> Can you explain what you see as the difference between "spawn" and "fork"
> in this context? Are you using Windows perhaps? I don't know anything
> obviously different between the two terms on Unix systems.

spawn is fork + exec.

Only a handful of POSIX functions are required to be "fork safe", i.e.
callable on each side of a fork without an exec.

An example of an API which is not safe to use on both sides of a fork is
Apple's GCD. The default builds of NumPy and SciPy depend on it on OS X
because it is used in Accelerate Framework. You can thus get problems if
you use numpy.dot in a process started with multiprocessing. If any BLAS
or LAPACK function was called at least once before the instance of
multiprocessing.Process was started, the call to numpy.dot in the child
never returns. This is not a bug in NumPy or in Accelerate Framework, it
is a bug in multiprocessing because it assumes that BLAS is fork safe.
The correct way of doing this is to start processes with spawn (fork +
exec), which multiprocessing supports as of Python 3.4.
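
On 2.7, where multiprocessing has no spawn start method, one possible
workaround is to do the fork + exec yourself with subprocess. The sketch
below is not a backport of multiprocessing, only an illustration under
stated assumptions: the helper name call_in_fresh_interpreter is
hypothetical, the target must be an importable module-level function, and
arguments and results must be picklable.

```python
import pickle
import subprocess
import sys

# Program run by the freshly exec'ed interpreter: read a pickled
# (module, function, args) request from stdin, call the function,
# and write the pickled result to stdout.
_CHILD_CODE = """
import importlib, pickle, sys
stdin = getattr(sys.stdin, "buffer", sys.stdin)
stdout = getattr(sys.stdout, "buffer", sys.stdout)
module_name, func_name, args = pickle.load(stdin)
func = getattr(importlib.import_module(module_name), func_name)
pickle.dump(func(*args), stdout, protocol=2)
"""


def call_in_fresh_interpreter(module_name, func_name, *args):
    """Call module_name.func_name(*args) in a new Python process and return the result."""
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    request = pickle.dumps((module_name, func_name, args), protocol=2)
    output, _ = proc.communicate(request)
    if proc.returncode != 0:
        raise RuntimeError("child exited with status %d" % proc.returncode)
    return pickle.loads(output)


if __name__ == "__main__":
    print(call_in_fresh_interpreter("math", "sqrt", 16.0))
```

Because the child starts from exec, it carries none of the parent's
threads, locks, or library state, at the cost of re-importing everything
it needs.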

Sturla

Sturla Molden

Jan 30, 2015, 4:20:55 PM
to pytho...@python.org
Andres Riancho <andres....@gmail.com> wrote:

> Spawn, and I took that from the multiprocessing 3 documentation, will
> create a new process without using fork().

> This means that no memory
> is shared between the MainProcess and the spawn'ed sub-process created
> by multiprocessing.

If you memory map a segment with MAP_SHARED it will be shared, even after a
spawn.

File descriptors are also shared.
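
A hedged sketch of the shared-mapping point, assuming a Unix system. The
helper name is hypothetical. A mapping does not itself survive exec, so in
this sketch the child, a fresh interpreter, maps the same file again by
path; both mappings are MAP_SHARED (the default for Python's mmap on Unix),
so the parent sees what the child wrote:

```python
import mmap
import subprocess
import sys
import tempfile

# Program for the child interpreter: map the file named on the command
# line and overwrite its first five bytes.
_CHILD_CODE = """
import mmap, sys
with open(sys.argv[1], "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)
    m[0:5] = b"child"
    m.close()
"""


def share_mapping_with_child():
    """Return the first five bytes of the parent's mapping after the child has written."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)
        f.flush()
        mapping = mmap.mmap(f.fileno(), 0)  # MAP_SHARED by default on Unix
        try:
            subprocess.check_call([sys.executable, "-c", _CHILD_CODE, f.name])
            return bytes(mapping[0:5])
        finally:
            mapping.close()


if __name__ == "__main__":
    print(share_mapping_with_child())
```

So "no memory is shared" holds for ordinary process memory, but not for
memory that both processes have deliberately mapped as shared.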

Marko Rauhamaa

Jan 30, 2015, 5:25:40 PM
to
Sturla Molden <sturla...@gmail.com>:

> Only a handful of POSIX functions are required to be "fork safe", i.e.
> callable on each side of a fork without an exec.

That is a pretty surprising statement. Forking without an exec is a
routine way to do multiprocessing.

I understand there are things to consider, but all system calls are
available and safe.


Marko

Sturla Molden

Jan 30, 2015, 9:15:17 PM
to pytho...@python.org
POSIX says this:


- No asynchronous input or asynchronous output operations shall be
inherited by the child process.

- A process shall be created with a single thread. If a multi-threaded
process calls fork(), the new process shall contain a replica of the
calling thread and its entire address space, possibly including the
states of mutexes and other resources. Consequently, to avoid errors,
the child process may only execute async-signal-safe operations until
such time as one of the exec functions is called.

- Fork handlers may be established by means of the pthread_atfork()
function in order to maintain application invariants across fork() calls.

- When the application calls fork() from a signal handler and any of the
fork handlers registered by pthread_atfork() calls a function that is
not async-signal-safe, the behavior is undefined.
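
Python 3.7 and later expose a similar hook at the Python level as
os.register_at_fork(). The sketch below, with hypothetical names and
assuming a Unix system, uses it to give the child a fresh lock, so a lock
held by another thread at fork time no longer blocks the child. It only
covers locks you own; it does not make third-party libraries such as BLAS
fork safe.

```python
import os
import threading

_lock = threading.Lock()


def _fresh_lock_in_child():
    # Fork handler: replace the inherited (possibly locked) lock in the child.
    global _lock
    _lock = threading.Lock()


os.register_at_fork(after_in_child=_fresh_lock_in_child)


def child_can_lock_after_fork():
    """Fork while another thread holds _lock; return True if the child can acquire it."""
    lock_is_held = threading.Event()
    may_release = threading.Event()

    def holder():
        with _lock:
            lock_is_held.set()
            may_release.wait()  # keep holding the lock while the parent forks

    worker = threading.Thread(target=holder)
    worker.start()
    lock_is_held.wait()

    pid = os.fork()
    if pid == 0:
        # Child: _lock now refers to the fresh lock installed by the handler.
        acquired = _lock.acquire(False)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    may_release.set()
    worker.join()
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print(child_can_lock_after_fork())
```

Without the registered handler the same non-blocking acquire in the child
would fail, because the child would inherit the lock in its locked state.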


Hence you must be very careful which functions you use after forking
and before you have called exec. Generally, never use an API above
POSIX, e.g. BLAS or Apple's CoreFoundation.



Apple said this when the problem with multiprocessing and Accelerate
Framework was first discovered:


---------- Forwarded message ----------
From: <dev...@apple.com>
Date: 2012/8/2
Subject: Bug ID 11036478: Segfault when calling dgemm with Accelerate
/ GCD after in a forked process
To: ******@******


Hi Olivier,

Thank you for contacting us regarding Bug ID# 11036478.

Thank you for filing this bug report.

This usage of fork() is not supported on our platform.

For API outside of POSIX, including GCD and technologies like
Accelerate, we do not support usage on both sides of a fork(). For
this reason among others, use of fork() without exec is discouraged in
general in processes that use layers above POSIX.

We recommend that you either restrict usage of blas to the parent or
the child process but not both, or that you switch to using GCD or
pthreads rather than forking to create parallelism.



Also see this:

http://bugs.python.org/issue8713

https://mail.python.org/pipermail/python-ideas/2012-November/017930.html





Sturla
