parallelize for-loop with side effects


Christian Stump

Aug 4, 2014, 5:38:17 AM
to sage-s...@googlegroups.com
Hi there,

I wonder how to parallelize the following scenario.

I have a method that initializes a (not very simple) data structure and then runs a for-loop (of, say, length 1,000-20,000) to populate that data structure with data. The computation in each iteration is not trivial, but fairly well optimized using Cython. All iteration steps done serially take a few seconds (about 2 or 3). Nevertheless, the computations are fairly independent, and I would like to do them in parallel.

If I extract the content of the for-loop into an @parallel(2)-decorated function, it still seems to use only one CPU for the computation (why?), but all the forking takes tons of time (including 80 seconds in posix.wait and 15 seconds in posix.fork). If I read the documentation right, this is because every computation is done in its own subprocess, and the data structure is also forked and passed along to that subprocess. Is that correct?
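
To make this concrete, my setup looks roughly like the sketch below (the names compute_entry, expensive_cython_step and MyDataStructure are made up, but the structure is the same):

obj = MyDataStructure()

@parallel(2)
def compute_entry(i):
    # heavy Cython-backed computation, then a side effect on obj
    obj.store(i, expensive_cython_step(i))

# consuming the iterator is what triggers the forked computations
for result in compute_entry(range(20000)):
    pass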

If I use @parallel('reference',2) instead (without knowing what that actually does), it is again as fast as the serial version, but it also uses only a single CPU.

What am I doing wrong here? Does anyone know how I should handle such a situation (which I suspect is not very uncommon)?

Thanks, Christian

William A Stein

Aug 4, 2014, 10:19:54 AM
to sage-support
On Mon, Aug 4, 2014 at 2:38 AM, Christian Stump
<christi...@gmail.com> wrote:
> Hi there,
>
> I wonder how to parallelize the following scenario.
>
> I have a method that initializes a (not very simple) data structure and then
> runs a for-loop (of, say, length 1,000-20,000) to populate that data
> structure with data. The computation in each iteration is not trivial, but
> fairly well optimized using Cython. All iteration steps done serially take a
> few seconds (about 2 or 3). Nevertheless, the computations are fairly
> independent, and I would like to do them in parallel.
>
> If I extract the content of the for-loop into an @parallel(2) decorated
> function, it still seems to be using only one cpu to do the computation
> (why?),

It absolutely will use two additional *processes*, as you might see by
watching with htop, top, or using ps.
Whether or not the operating system runs those processes in parallel
depends on many things, e.g., do you
have two processors? Is your user allowed to use both of them fully
(not by default on SageMathCloud, say)? Etc.

> but all the forking takes tons of time (including 80 seconds in
> posix.wait and 15 seconds in posix.fork).

20,000 forks taking about 80 seconds sounds about right. The fork OS
system call takes a few milliseconds -- say 4 ms each -- and 4 ms
times 20,000 is 80 seconds.

> If I read the documentation right, this
> is due to the issue that every computation is done in a subprocess itself
> and the data structure is also forked and passed to the subprocess. Is that
> correct?

Yes-ish. Just to be clear, there is a single fork for each
computation, which means that (almost) all state of the parent
process is inherited by the subprocess.

> If I use @parallel('reference',2) instead (without knowing what that
> actually does), it is again as quick as in the beginning but also uses only
> a single cpu.

That fakes @parallel -- providing the same API -- but actually runs
everything serially in the parent process. No forking or anything
else happens. It's for testing and development purposes.

> What am I doing wrong here? Does anyone know how I should handle such a (I
> suspect not very uncommon) situation?

Break up your computation into far fewer than 20,000 separate steps,
then use @parallel. For example, if your 20,000 steps are "compute
f(n)" for n in range(20000), instead do "compute f(1) through f(1000)
as the first step, then compute f(1001) through f(2000) as the next
step", etc.

For example, if you wanted to use @parallel to factor the integers
[1..20000], you would do this:

@parallel
def f(m):
    # factor one block of 1000 consecutive integers (skipping 0)
    return [factor(k) for k in range(1000*m, 1000*(m+1)) if k]

t = []
for x in f([1..20]):
    print x[0]      # x[0] holds the input ((m,), {}); x[1] is the returned list
    t.append(x)

Christian Stump

Aug 4, 2014, 11:48:39 AM
to sage-s...@googlegroups.com
Thanks, William!

> It absolutely will use two additional *processes*, as you might see by
> watching with htop, top, or using ps.

Is it right that the master process creates all the subprocesses?
I suspect I simply don't see the other processes in action because
each of them exists only for milliseconds...

> Yes-ish. Just to be clear, there is a single fork for each
> computation, which means that (almost) all state of the parent
> process is inherited by the subprocess.

Yes, but every subprocess modifies the data structure *within the
subprocess*; the object in the main process is not modified, or am I
missing something? (That's at least how I understand the
documentation, and what I see happening in my computation output.)

> That fakes @parallel -- providing the same API -- but actually running
> everything in serial in the parent process. No forking or anything
> else happens. It's for testing and development purposes.

I see.

> Break up your computation into far fewer than 20,000 separate steps, then use @parallel.

Okay, I'll do that.

But I still don't see how I should handle the side effects that are
supposed to affect objects in the main process.

Or are you suggesting that I should (also for clarity of the code)
completely remove all side effects, do the computations in parallel,
return the data I need instead of performing the side effects, and
then do the side-effect part serially? Something like:

@parallel
def f(m):
    return [ factor(k) for k in range(1000*m, 1000*(m+1)) ]

obj = MyObj()
for x in f([1..20]):
    print x[0]
    for y in x[1]:      # x[1] is the list returned by f
        obj.store_new_data(y)

If I should do it this way: is the body of the for-loop executed in
the main process *in parallel* with the subprocess computing the next
element of the iterator f([1..20])?

Thanks again for the clarification!

Christian

William A Stein

Aug 4, 2014, 12:06:20 PM
to sage-support
On Mon, Aug 4, 2014 at 8:48 AM, Christian Stump
<christi...@gmail.com> wrote:
> Thanks, William!
>
>> It absolutely will use two additional *processes*, as you might see by
>> watching with htop, top, or using ps.
>
> Is it right that the master process creates all the subprocesses?
> I suspect I simply don't see the other processes in action because
> each of them exists only for milliseconds...

Yes, exactly.

I forgot to mention that another approach, which is what
multiprocessing (a Python module) does, is to create n subprocesses
and keep feeding them data. Then they stay running. This might make
more sense in your situation. However, it has very significant
drawbacks in some situations as well.
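
Roughly, that worker-pool style looks like the sketch below (just an
illustration, using the same block-of-integers example as before; the
chunk size and names are arbitrary):

from multiprocessing import Pool

def factor_block(m):
    # factor one block of 1000 consecutive integers (skipping 0)
    return [factor(k) for k in range(1000*m, 1000*(m+1)) if k]

# two worker processes that stay alive and get fed one block after another
pool = Pool(processes=2)
results = pool.map(factor_block, range(20))
pool.close(); pool.join()

# collect everything back into a data structure in the parent
data = {}
for m, block in enumerate(results):
    data[m] = block

The workers are forked once and reused, so you pay the fork cost only a
handful of times instead of once per input.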

>
>> Yes-ish. Just to be clear, there is a single fork for each
>> computation, which means that (almost) all state of the parent
>> process is inherited by the subprocess.
>
> Yes, but every subprocess modifies the data structure *within the
> subprocess*; the object in the main process is not modified, or am I
> missing something? (That's at least how I understand the
> documentation, and what I see happening in my computation output.)

That's exactly right and how it is documented to behave, and must behave
(unless one uses shared memory, which @parallel doesn't use).
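
A quick way to convince yourself of this is something like the
following sketch (run in a Sage session; the names are made up):

results = []

@parallel(2)
def g(n):
    results.append(n)   # mutates the *child's* copy of the list
    return n^2

list(g([1, 2, 3, 4]))   # run the forked computations
print results           # still [] in the parent: the children's changes are lost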

>
>> That fakes @parallel -- providing the same API -- but actually running
>> everything in serial in the parent process. No forking or anything
>> else happens. It's for testing and development purposes.
>
> I see.
>
>> Break up your computation into far fewer than 20,000 separate steps, then use @parallel.
>
> Okay, I'll do that.
>
> But I still don't see how I should handle the side effects that are
> supposed to affect objects in the main process.

You didn't ask about that explicitly before. It's impossible to do
that *implicitly* with arbitrary Python data structures: the global
interpreter lock (GIL) rules out thread-based parallelism, so the
work runs in separate processes with separate memory. To solve this
problem using @parallel (or multiprocessing) you have to work harder
and pass data back from your function, then insert that data into the
data structure. E.g., in my example below, with factor, that's what I do.

With multiprocessing they do have some limited support for shared
memory, but only for certain specific data types, e.g., a bunch of C
ints.
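
For example, a shared array of C ints can look like this minimal
sketch (this is multiprocessing's facility, not something @parallel
offers):

from multiprocessing import Process, Array

def fill(shared, start, stop):
    for i in range(start, stop):
        shared[i] = i * i       # writes here are visible to the parent

shared = Array('i', 10)         # 10 shared C ints, initialized to 0
p = Process(target=fill, args=(shared, 0, 10))
p.start(); p.join()
print list(shared)              # [0, 1, 4, 9, ..., 81]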

>
> Or are you suggesting that I should (also for clarity of the code)
> completely remove all side effects, do the computations in parallel,
> return the data I need instead of performing the side effects, and
> then do the side-effect part serially? Something like:

YES. It's the only solid way to do this sort of thing in Python. And
it can result in clearer and easier-to-debug code. Functions
without side effects are often easier to reason about.

>
> @parallel
> def f(m):
>     return [ factor(k) for k in range(1000*m, 1000*(m+1)) ]
>
> obj = MyObj()
> for x in f([1..20]):
>     print x[0]
>     for y in x[1]:      # x[1] is the list returned by f
>         obj.store_new_data(y)
>
> If I should do it this way: is the body of the for-loop executed in
> the main process *in parallel* with the subprocess computing the next
> element of the iterator f([1..20])?

Mostly. More precisely, when you execute

for x in f([1..20]):
    ...

it forks off a batch of subprocesses (up to the number you allow),
which start running. When one completes, its result is saved to a
pickle, the calling parent process notices that the subprocess has
terminated, reads the pickle (and deletes the file), adds the result
to the iterator, and starts another subprocess going. So the x's in
the example above can come back in any random order. Also, I think
(not tested) that if you did:

@parallel
def f(m):
    return [ factor(k) for k in range(1000*m, 1000*(m+1)) ]

obj = MyObj()
for x in f([1..20]):
    while True: pass

then it would compute k (= number of cores) values of f, then hang and
not even start computing more. In other words, the code that forks off
subprocesses has to run at some point, and there's no threading involved
here -- the forking off of subprocesses is *caused* by iterating over
f([1..20]). It's a possibly annoying/surprising trick, but it makes the
implementation of @parallel really simple, by avoiding any use of
threads or asynchronous execution.

I encourage you to read the source code of this @parallel stuff --
it's only about 2 pages of actual code, which I wrote as my project
at some Sage Days back in maybe 2008.

https://github.com/sagemath/sage/blob/master/src/sage/parallel/use_fork.py

It's critical to your question to understand how the Python yield
keyword works (on line 183 in the code).
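
As a tiny, self-contained illustration of that laziness: a generator's
body runs only when the caller asks for the next value, which is
exactly why the forking in @parallel is driven by iterating over
f([1..20]).

def lazy():
    for n in range(3):
        print "about to compute", n   # runs only when the caller asks
        yield n

it = lazy()      # nothing printed yet
next(it)         # prints "about to compute 0" and yields 0
next(it)         # prints "about to compute 1" and yields 1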



>
> Thanks again for the clarification!
>
> Christian



--
William Stein
Professor of Mathematics
University of Washington
http://wstein.org
wst...@uw.edu

Christian Stump

Aug 4, 2014, 12:29:05 PM
to sage-s...@googlegroups.com
> I encourage you to read the source code of this @parallel stuff --
> it's only about 2 pages of actual code,
> which I wrote at some Sage days as my project back in maybe 2008.

Will do, thanks again!